Contents
This tutorial will quickly get you started with our relational database. You will learn more if you run each of these queries yourself and experiment with your own variations. Once you have done this you will be ready to solve the practical exercises.
The database used in this course contains information about movies and people, taken from the Internet Movie Database (IMDb).
We use SQL as query language. SQL is a vast language, and this tutorial teaches you a small subset through examples. However, we do try to cover the most commonly used constructs in the query language.
The database management system is HyperSQL Database (HSQLDB), which has the advantage of being quite easy to set up. The things you will learn in these exercises apply equally to more common databases you might have heard of, such as MySQL or PostgreSQL.
If you run into any problems, please post questions (or answers!) on the course's Moodle page.
You can use your own computer, or a college or university computer with Java installed. The examples should work equally well on Linux, Windows or Mac OS.
Follow the instructions at HyperSQL Database (HSQLDB) in order to dowload the system. Once you have unzipped it, you should have a directory hsqldb-2.5.0 containing a subdirectory hsqldb. Within hsqldb is a subdirectory lib where the *.jar files reside.
Download the data from movies-relational.zip and unzip this to create the directory movies-relational (put it anywhere you want).
Next, download the configuration file db1a.rc. Follow the instructions in this file to add the full path to your movies-relational data. This file sets you up as the administrator for the db1a database with an empty password.
Now you are ready to run the HyperSQL shell with the following command line:
java -jar <full path to directory with jar files>sqltool.jar --rcfile=db1a.rc db1a
This should start a read-eval-print loop in which you can enter SQL expressions (terminated with a ;) and see the results. Type ? for help and q to quit.
Our movie database contains the following tables
In terms of the Entity-Relationshp model described in lectures, the first three tables contain entities while the last three implement relationships.
Let's start by using variations on the select ... from construct to explore the contents of the movies table.
The following query will list 10 records from this table:
The * in the query specifies that every column from the table will be printed. We are using limit 10 since there are thousands of records in this table! Note that most systems today are not case sensitive but for some (strange) reason it has become a tradition to write SQL's keywords using CAPITALS, so you many people would write this query as
We will write all queries using lower case. Both queries will produce the same output:
MOVIE_ID TITLE YEAR TYPE MINUTES RATING VOTES --------- ------------------------------------------------- ---- ----- ------- ------ ------- tt0118694 In the Mood for Love 2000 movie 98 8.1E0 107729 tt0120630 Chicken Run 2000 movie 84 7.0E0 165659 tt0120679 Frida 2002 movie 123 7.4E0 75287 tt0120737 The Lord of the Rings: The Fellowship of the Ring 2001 movie 178 8.8E0 1530145 tt0120903 X-Men 2000 movie 104 7.4E0 542044 tt0120917 The Emperor's New Groove 2000 movie 78 7.3E0 163952 tt0121164 Corpse Bride 2005 movie 77 7.3E0 225638 tt0121766 Star Wars: Episode III - Revenge of the Sith 2005 movie 140 7.5E0 647896 tt0126029 Shrek 2001 movie 90 7.8E0 569925 tt0133240 Treasure Planet 2002 movie 95 7.2E0 92041
The movies table has the following columns:
Rather than using * we can be explicit about the which columns we want to see and in what order. For example:
This query produces the following output:
YEAR TITLE RATING ---- ------------------------------------------------- ------ 2000 In the Mood for Love 8.1E0 2000 Chicken Run 7.0E0 2002 Frida 7.4E0 2001 The Lord of the Rings: The Fellowship of the Ring 8.8E0 2000 X-Men 7.4E0 2000 The Emperor's New Groove 7.3E0 2005 Corpse Bride 7.3E0 2005 Star Wars: Episode III - Revenge of the Sith 7.5E0 2001 Shrek 7.8E0 2002 Treasure Planet 7.2E0
The where construct allows us to select only those records that satisfy a given condition. Here we use it to select only movies released in 2017:
The output:
TITLE RATING ------------------------ ------ Wonder Woman 7.4E0 It 7.4E0 The Greatest Showman 7.6E0 The Foreigner 7.0E0 A Dog's Purpose 7.1E0 Blade Runner 2049 8.0E0 Spider-Man: Homecoming 7.5E0 Coco 8.4E0 Star Wars: The Last Jedi 7.1E0 Wonder 8.0E0
SQL has many reserved words. Our column name year happens to be an SQL reserved word that gets highlighted by the automation we are using. Sigh. Happily, most SQL systems are very liberal in allowing many reserved words to be used as column or table names.
The condition of the where clause can contain a complex boolean-valued constraint. For example, if we want to list movies from either 2016 or 2017 with titles containing the word Star we can write
The output:
TITLE YEAR RATING --------------------------------------------------------- ---- ------ Star Wars: The Last Jedi 2017 7.1E0 Star Trek Beyond 2016 7.1E0 Rogue One: A Star Wars Story 2016 7.8E0 Bright Lights: Starring Carrie Fisher and Debbie Reynolds 2016 8.0E0
Here the like operator matches strings against some pattern. The percent sign % in the pattern means that "any sub-string can appear here". Thus, title like '%Star%' matches only titles that contain the word "Star" as a sub-string.
How many movies are in our database? The following query will tell us:
The output:
1236
We can use the group by construct to see how many records are associated with each year:
The output:
YEAR TOTAL ---- ----- 2000 51 2002 54 2001 52 2005 58 2003 63 2008 67 2006 70 2004 72 2007 88 2009 74 2013 71 2015 60 2010 68 2011 71 2019 20 2012 63 2017 58 2014 65 2016 70 2018 41
The group by year clause tells the database system to group together all of the records having the same value for the year column and to then count the number of records in each group. Note that we have given the name total to the count(*) column using the as construct.
Your system probably did not list the years in order. We can fix that using the order by construct!
The output:
YEAR TOTAL ---- ----- 2000 51 2001 52 2002 54 2003 63 2004 72 2005 58 2006 70 2007 88 2008 67 2009 74 2010 68 2011 71 2012 63 2013 71 2014 65 2015 60 2016 70 2017 58 2018 41 2019 20
If we want to list the output in reverse order we use the desc keyword:
The output:
YEAR TOTAL ---- ----- 2019 20 2018 41 2017 58 2016 70 2015 60 2014 65 2013 71 2012 63 2011 71 2010 68 2009 74 2008 67 2007 88 2006 70 2005 58 2004 72 2003 63 2002 54 2001 52 2000 51
SQL contains many aggregate functions like count. Here is a query that uses the most common ones:
select type,
count(*) as movie_count,
min(votes) as min_votes,
max(votes) as max_votes,
sum(votes) as sum_of_votes,
avg(votes) as averge_votes
from movies
group by type;
The output:
TYPE MOVIE_COUNT MIN_VOTES MAX_VOTES SUM_OF_VOTES AVERGE_VOTES ------- ----------- --------- --------- ------------ ------------ movie 947 50210 2093718 242717815 256301 tvMovie 289 1011 31496 1131873 3916
We can also use arithmetic in expressions such as
select sum(minutes)/1440 as days, sum(votes)/count(*) as average, 17 * (max(votes) - min(votes)) as nonsense
from movies
where type = 'movie';
The output:
DAYS AVERAGE NONSENSE ---- ------- -------- 78 256301 34739636
Before we go on to learn more SQL let's first use our current knowledge to explore the other tables in our database.
How many records are in the people table?
The output:
6284
Let's look at 10 records:
The output:
PERSON_ID NAME BIRTHYEAR DEATHYEAR --------- ------------------- --------- --------- nm0000002 Lauren Bacall 1924 2014 nm0000005 Ingmar Bergman 1918 2007 nm0000032 Charlton Heston 1923 2008 nm0000056 Paul Newman 1925 2008 nm0000090 Armin Mueller-Stahl 1930 nm0000092 John Cleese 1939 nm0000093 Brad Pitt 1963 nm0000095 Woody Allen 1935 nm0000096 Gillian Anderson 1968 nm0000098 Jennifer Aniston 1969
The people table has the following columns:
The blank columns represent null values (more on those below), so deathyear will be null if the person is still alive or if their deathyear has not been entered into the database.
Let's take a look at the contents of the genres table with this query.
The output:
GENRE_ID GENRE -------- ----------- 1 Animation 2 Drama 3 Fantasy 4 Action 5 Romance 6 Sci-Fi 7 Comedy 8 Thriller 9 War 10 Mystery 11 Musical 12 Adventure 13 Music 14 Family 15 Biography 16 Crime 17 Horror 18 Documentary 19 History 20 Sport 21 Western 22 News
This table has the following columns:
The has_genre table implements a relationship between movies and genres.
The output:
MOVIE_ID GENRE_ID --------- -------- tt0118694 2 tt0118694 5 tt0120630 1 tt0120630 7 tt0120630 12 tt0120679 2 tt0120679 5 tt0120679 15 tt0120737 2 tt0120737 3
This table has the following columns:
Both columns together form the key. Why is that? Because it implements a many-to-many relationship (see lectures).
Let's look at 10 records in the has_position table:
The output:
PERSON_ID MOVIE_ID POSITION --------- --------- -------- nm0000002 tt0276919 actor nm0000005 tt0299478 director nm0000032 tt0310793 self nm0000056 tt0317219 actor nm0000090 tt0765443 actor nm0000092 tt0481499 writer nm0000093 tt0208092 actor nm0000093 tt0240772 actor nm0000093 tt0266987 actor nm0000093 tt0332452 actor
The has_position table has the following columns:
Since this table implements are relationship between movies and people, the key formed of the pair movie_id and person_id. (Note: The key should really be all three columns since people can have multiple positions in the same movie. For some reason the IMDb data source has been constructed in a way that only one position per movie/person is listed. I hope they fix this in the future ....)
Let's see what values the position column contains:
The output:
POSITION TOTAL -------- ----- actor 4675 producer 2149 writer 2114 director 1324 self 229
Let's look at 10 records in the plays_role table:
which produces the output:
PERSON_ID MOVIE_ID ROLE --------- --------- --------------- nm0000002 tt0276919 Ma Ginger nm0000032 tt0310793 Himself nm0000056 tt0317219 Doc Hudson nm0000090 tt0765443 Semyon nm0000093 tt0208092 Mickey O'Neil nm0000093 tt0240772 Rusty Ryan nm0000093 tt0266987 Tom Bishop nm0000093 tt0332452 Achilles nm0000093 tt0361748 Lt. Aldo Raine nm0000093 tt0421715 Benjamin Button
The plays_role table has the following columns:
Here the key is a triple: all columns together form the key. Why is that?
Let's do our first join! This query will "join together" records in the movies and has_genre tables in order to list movie titles from 2017 together with their genres:
select title, genre
from movies
join has_genre on has_genre.movie_id = movies.movie_id
join genres on genres.genre_id = has_genre.genre_id
where year = 2017
limit 20;
The output:
TITLE GENRE ---------------------- --------- Wonder Woman Fantasy Wonder Woman Action Wonder Woman Adventure It Horror The Greatest Showman Drama The Greatest Showman Musical The Greatest Showman Biography The Foreigner Action The Foreigner Thriller A Dog's Purpose Drama A Dog's Purpose Comedy A Dog's Purpose Adventure Blade Runner 2049 Drama Blade Runner 2049 Action Blade Runner 2049 Mystery Spider-Man: Homecoming Action Spider-Man: Homecoming Sci-Fi Spider-Man: Homecoming Adventure Coco Animation Coco Adventure
As we can see, some movies are associated with multiple genres. This answers the question of why movie_id and genre together form the key for the has_genre table! (The same reasoning goes for the plays_role table --- an actor can play more than one role in a movie.)
We have used the join .. on construct. Our query looks for movie records with year = 2017 and then joins together each such record with every record in the genres table that has the same movie_id value. Note the condition has_genre.movie_id = movies.movie_id disambiguates which instance of the column movie_id we are referring to since both the movies and has_genre tables share a column with the same name.
We can write the same query using the as construct that gives a local abbreviation for a table's name:
select title, genre
from movies as m
join has_genre as hg on hg.movie_id = m.movie_id
join genres as g on g.genre_id = hg.genre_id
where year = 2017
limit 20;
This kind of abbreviation will be essential when our joins contain more than one instance of the same table. For example, suppose we want to list the popular (votes > 100000) romantic comedies (RomComs)--- those movies that are associated with the genres Romance and Comedy:
select title, year, rating, votes
from movies as m
join has_genre as hg1 on hg1.movie_id = m.movie_id
join has_genre as hg2 on hg2.movie_id = m.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where m.votes > 100000 and g1.genre = 'Romance' and g2.genre = 'Comedy'
order by votes desc;
The output:
TITLE YEAR RATING VOTES ------------------------------ ---- ------ ------ Amélie 2001 8.3E0 659080 Silver Linings Playbook 2012 7.7E0 625643 500 Days of Summer 2009 7.7E0 445253 Crazy, Stupid, Love. 2011 7.4E0 440746 Love Actually 2003 7.6E0 388881 The Terminal 2004 7.3E0 370467 The 40-Year-Old Virgin 2005 7.1E0 368827 Midnight in Paris 2011 7.7E0 360153 Easy A 2010 7.1E0 334211 Up in the Air 2009 7.4E0 303024 50/50 2011 7.6E0 302970 Meet the Parents 2000 7.0E0 295100 Moonrise Kingdom 2012 7.8E0 294643 Pitch Perfect 2012 7.2E0 259275 Forgetting Sarah Marshall 2008 7.1E0 249843 Vicky Cristina Barcelona 2008 7.1E0 229401 The Artist 2011 7.9E0 218524 Garden State 2004 7.4E0 201946 P.S. I Love You 2007 7.0E0 192577 The Lobster 2015 7.1E0 187144 I Love You, Man 2009 7.0E0 186634 Sideways 2004 7.5E0 170885 About a Boy 2002 7.0E0 163016 Definitely, Maybe 2008 7.1E0 147430 The Spectacular Now 2013 7.1E0 135417 Punch-Drunk Love 2002 7.3E0 134438 Lars and the Real Girl 2007 7.3E0 132212 Good Bye Lenin! 2003 7.7E0 130105 It's Kind of a Funny Story 2010 7.1E0 126506 The Kids Are All Right 2010 7.0E0 121248 Me and Earl and the Dying Girl 2015 7.7E0 114081 Safety Not Guaranteed 2012 7.0E0 112692 Love, Rosie 2014 7.2E0 107473 The Big Sick 2017 7.6E0 106480
Now suppose we want to see what other genres RomComs are associated with. One way to do this is with a nested query:
select m1.title, m1.year, g.genre, m1.rating, m1.votes
from movies as m1
join has_genre as hg on hg.movie_id = m1.movie_id
join genres as g on g.genre_id = hg.genre_id
where m1.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
and m1.movie_id in
(select m2.movie_id
from movies as m2
join has_genre as hg1 on hg1.movie_id = m2.movie_id
join has_genre as hg2 on hg2.movie_id = m2.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where g1.genre = 'Romance' and g2.genre = 'Comedy')
order by m1.votes desc
limit 10;
The output:
TITLE YEAR GENRE RATING VOTES ----------------------- ---- ------- ------ ------ Silver Linings Playbook 2012 Drama 7.7E0 625643 500 Days of Summer 2009 Drama 7.7E0 445253 Crazy, Stupid, Love. 2011 Drama 7.4E0 440746 Love Actually 2003 Drama 7.6E0 388881 The Terminal 2004 Drama 7.3E0 370467 Midnight in Paris 2011 Fantasy 7.7E0 360153 Easy A 2010 Drama 7.1E0 334211 Up in the Air 2009 Drama 7.4E0 303024 50/50 2011 Drama 7.6E0 302970 Moonrise Kingdom 2012 Drama 7.8E0 294643
Here (select m2.movie_id ...) is called a nested query. We check if our movie is a RomCom, and then check if there is another genre that is not equal to either Romance or Comedy.
We can introduce a name that represents a query using views. For example, the following SQL defines a view romcom_ids to contain all those movie_id values for RomComs. Executing the statement create view romcom_ids ... will place this view definition in the database. The name romcom_ids can then be used as if it were the name of a data base table.
drop view if exists romcom_ids;
create view romcom_ids as
select m.movie_id as movie_id
from movies as m
join has_genre as hg1 on hg1.movie_id = m.movie_id
join has_genre as hg2 on hg2.movie_id = m.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where g1.genre = 'Romance' and g2.genre = 'Comedy';
select m.title, m.year, g.genre, m.rating, m.votes
from romcom_ids as r
join has_genre as hg on hg.movie_id = r.movie_id
join genres as g on g.genre_id = hg.genre_id
join movies as m on m.movie_id = r.movie_id
where m.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
order by m.votes desc
limit 10;
The query above will return the same result as the previous version:
TITLE YEAR GENRE RATING VOTES ----------------------- ---- ------- ------ ------ Silver Linings Playbook 2012 Drama 7.7E0 625643 500 Days of Summer 2009 Drama 7.7E0 445253 Crazy, Stupid, Love. 2011 Drama 7.4E0 440746 Love Actually 2003 Drama 7.6E0 388881 The Terminal 2004 Drama 7.3E0 370467 Midnight in Paris 2011 Fantasy 7.7E0 360153 Easy A 2010 Drama 7.1E0 334211 Up in the Air 2009 Drama 7.4E0 303024 50/50 2011 Drama 7.6E0 302970 Moonrise Kingdom 2012 Drama 7.8E0 294643
In general, when you create a table or view into a database it is best to be careful and make sure you first delete any older versions. That is the reason for the statement drop view if exists romcom_ids that appears just before the view definition. (Please don't use views in your solutions to the relational tick!)
SQL has a special value null that is used when a value is not known or is somehow not relevant. We saw null values for the column deathYear in the people table:
The output:
PERSON_ID NAME BIRTHYEAR DEATHYEAR --------- ------------------- --------- --------- nm0000002 Lauren Bacall 1924 2014 nm0000005 Ingmar Bergman 1918 2007 nm0000032 Charlton Heston 1923 2008 nm0000056 Paul Newman 1925 2008 nm0000090 Armin Mueller-Stahl 1930 nm0000092 John Cleese 1939 nm0000093 Brad Pitt 1963 nm0000095 Woody Allen 1935 nm0000096 Gillian Anderson 1968 nm0000098 Jennifer Aniston 1969
The HyperSQL system will usually print null as a blank (other systems may behave differently).
Let's try to count the number of people records with a null in the deathYear column:
The output:
0
Hmmm. What is going on here? As will be explained further in lectures, with the null we enter the world of 3-valued logic. For two expressions e1 and e2 the expression e1 = e2 can evaluate to true or false or null (which can be read as we don't know the answer). Equality returns null when either e1 or e2 evaluates to null. For example, column = null always returns null. Furthermore, the SELECT ... WHERE construct only returns records for those cases where the where clause returns true.
To get around this problem SQL introduced the is null boolean predicate:
The output:
5919
Sometimes we want to generate null values when we are joing tables together. Let's look at an example.
First, here are the the people credited with work on "Silver Linings Playbook":
select name, position
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ----------------- -------- Robert De Niro actor Bruce Cohen producer Bradley Cooper actor Donna Gigliotti producer Jonathan Gordon producer David O. Russell director Jacki Weaver actor Jennifer Lawrence actor Matthew Quick writer
Let's modify that query in order to see what roles the actors played in that movie:
select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ROLE ----------------- -------- ------- Robert De Niro actor Pat Sr. Bradley Cooper actor Pat Jacki Weaver actor Dolores Jennifer Lawrence actor Tiffany
Well, that's nice but we lost the records associated with non-actors since they do not join with records in the plays_role table! How can we retain them?
We solve this problem by using the left join construct:
select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
left join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ROLE ----------------- -------- ------- Robert De Niro actor Pat Sr. Bruce Cohen producer [null] Bradley Cooper actor Pat Donna Gigliotti producer [null] Jonathan Gordon producer [null] David O. Russell director [null] Jacki Weaver actor Dolores Jennifer Lawrence actor Tiffany Matthew Quick writer [null]
Note that in this case our user-interface displays [null] instead of whitespace. Why? That's because the type of this column is string, and this display wants us to be able to distinguish between a null value and the empty string!
The left join construct.
When there is no matchng record in plays_role, the results (role in this case) will be given the null value.
Let's try to write a query that returns the movies in which Jennifer Lawrence played the same role:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
and r1.role = r2.role
order by r1.role, m1.title, m1.year;
The output:
ROLE TITLE YEAR ----------------- ------------------------------- ---- Aurora Lane Passengers 2016 Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games: Catching Fire 2013 Katniss Everdeen The Hunger Games: Catching Fire 2013 Mystique X-Men: First Class 2011 Raven X-Men: First Class 2011 Ree Winter's Bone 2010 Rosalyn Rosenfeld American Hustle 2013 Tiffany Silver Linings Playbook 2012
Hmmm. This didn't work -- we have every role Jennifer Lawrence played in any movie, with duplicate rows as well. Why? Suppose our query finds two movies, say A and B. If they are the same movie (A=B), then we we return the role played in that movie. Otherwise, there are two orders in which the query could process these movies -- first A then B, and first B then A. In each case we will return the same result, and this explains the duplicates rows.
Let's avoid the first problem by a modification to the where clause by adding m1.movie_id <> m2.movie_id:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by r1.role, m1.title, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------- ---- Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games: Catching Fire 2013
Wow! That seems to have solved both our problems! But did it really? As a sanity check let's try another actor:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------------- ---- Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009
Ooops! We still have duplicate rows! What's going on? (Bofore reading on, see if you can figure out the answer.)
In this case there are three movies involved, say say A, B, and C. So, how many ways can I find a pair of distinct movies with A first? Well, just A, B and A, C. So the query returns two identical rows for movie A. The same goes for B and C. In general, if there are k movies with the same role, then we will get k-1 copies of each! This explains what happened with Jennifer_Lawrence. In that case k = 2 and k-1 = 1, so no duplicate rows! We just got lucky!
All of this is due to the fact that SQL is not based on sets of records, but rather on multisets of records (sets containing duplicates, also called bags). If we want a set rather than a multiset, we can use the select distinct construct:
select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------------- ---- Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009
Yippee! This seems correct. Just to be sure let's sanity check again and try another actor:
select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Scarlett Johansson'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ----------------------------------- ---- Black Widow Captain America: Civil War 2016 Natasha Romanoff Captain America: Civil War 2016 Black Widow Captain America: The Winter Soldier 2014 Natasha Romanoff Captain America: The Winter Soldier 2014 Black Widow The Avengers 2012 Natasha Romanoff The Avengers 2012
Did this result surprised you? Is it correct? It turns out that Scarlett Johansson played two distinct roles in each of these movies. So it does seem correct. But it also forces us to admit that the original English specification, returns the movies in which so-and-so played the same role, was rather vague!
Now you should know enough to attempt the practical questions, which appear on a separate page.