Contents
This tutorial will quickly get you started with our relational database. You will learn more if you run each of these queries yourself and experiment with your own variations. Once you have done this you will be ready to solve the practical exercises.
The database used in this course contains information about movies and people, taken from the Internet Movie Database (IMDb).
We use SQL as query language. SQL is a vast language, and this tutorial teaches you a small subset through examples. However, we do try to cover the most commonly used constructs in the query language.
The database management system is HyperSQL Database (HSQLDB), which has the advantage of being quite easy to set up. The things you will learn in these exercises apply equally to more common databases you might have heard of, such as MySQL or PostgreSQL.
If you run into any problems, please post questions (or answers!) on the course's Moodle page.
You can use your own computer, or a college or university computer with Java installed. The examples should work equally well on Linux, Windows or Mac OS.
Follow the instructions at HyperSQL Database (HSQLDB) in order to download the system. Once you have unzipped it, you should have a directory hsqldb-2.5.1 containing a subdirectory hsqldb. Within hsqldb is a subdirectory lib where the *.jar files reside.
Download the data from movies-relational.zip and unzip this to create the directory movies-relational (put it anywhere you want).
Next, download the configuration file db1a.rc. Follow the instructions in this file to add the full path to your movies-relational data. This file sets you up as the administrator for the db1a database with an empty password.
Now you are ready to run the HyperSQL shell with the following command line:
java -jar <full path to directory with jar files>sqltool.jar --rcfile=db1a.rc db1a
This should start a read-eval-print loop in which you can enter SQL expressions (terminated with a ;) and see the results. Type \? for help and \q to quit.
Our movie database contains the following tables
In terms of the Entity-Relationship model described in lectures, the first five tables contain entity sets while the last four implement relationships.
Let's start by using variations on the select ... from construct to explore the contents of the movies table.
The following query will list 10 records from this table:
The * in the query specifies that every column from the table will be printed. We are using limit 10 since there are thousands of records in this table! Note that most systems today are not case sensitive but for some (strange) reason it has become a tradition to write SQL's keywords using CAPITALS, so you many people would write this query as
We will write all queries using lower case. Both queries will produce the same output:
MOVIE_ID TITLE YEAR TYPE MINUTES RATING VOTES --------- ------------------------------------------------- ---- ----- ------- ------ ------- tt0118694 In the Mood for Love 2000 movie 98 8.1E0 119514 tt0120630 Chicken Run 2000 movie 84 7.0E0 174216 tt0120679 Frida 2002 movie 123 7.4E0 79818 tt0120737 The Lord of the Rings: The Fellowship of the Ring 2001 movie 178 8.8E0 1622297 tt0120903 X-Men 2000 movie 104 7.4E0 562639 tt0120917 The Emperor's New Groove 2000 movie 78 7.3E0 175564 tt0121164 Corpse Bride 2005 movie 77 7.3E0 237744 tt0121766 Star Wars: Episode III - Revenge of the Sith 2005 movie 140 7.5E0 695155 tt0126029 Shrek 2001 movie 90 7.8E0 600991 tt0133240 Treasure Planet 2002 movie 95 7.2E0 99708
The movies table has the following columns:
Rather than using * we can be explicit about the which columns we want to see and in what order. For example:
This query produces the following output:
YEAR TITLE RATING ---- ------------------------------------------------- ------ 2000 In the Mood for Love 8.1E0 2000 Chicken Run 7.0E0 2002 Frida 7.4E0 2001 The Lord of the Rings: The Fellowship of the Ring 8.8E0 2000 X-Men 7.4E0 2000 The Emperor's New Groove 7.3E0 2005 Corpse Bride 7.3E0 2005 Star Wars: Episode III - Revenge of the Sith 7.5E0 2001 Shrek 7.8E0 2002 Treasure Planet 7.2E0
The where construct allows us to select only those records that satisfy a given condition. Here we use it to select only movies released in 2017:
The output:
TITLE RATING --------------------------------------- ------ Wonder Woman 7.4E0 It 7.3E0 The Greatest Showman 7.6E0 The Foreigner 7.0E0 A Dog's Purpose 7.2E0 Blade Runner 2049 8.0E0 Spider-Man: Homecoming 7.4E0 Coco 8.4E0 Star Wars: Episode VIII - The Last Jedi 7.0E0 Wonder 8.0E0
SQL has many reserved words. Our column name year happens to be an SQL reserved word that gets highlighted by the automation we are using. Sigh. Happily, most SQL systems are very liberal in allowing many reserved words to be used as column or table names.
The condition of the where clause can contain a complex boolean-valued constraint. For example, if we want to list movies from either 2016 or 2017 with titles containing the word Star we can write
The output:
TITLE YEAR RATING --------------------------------------- ---- ------ Star Wars: Episode VIII - The Last Jedi 2017 7.0E0 Star Trek Beyond 2016 7.1E0 Rogue One: A Star Wars Story 2016 7.8E0
Here the like operator matches strings against some pattern. The percent sign % in the pattern means that "any sub-string can appear here". Thus, title like '%Star%' matches only titles that contain the word "Star" as a sub-string.
How many movies are in our database? The following query will tell us:
The output:
1324
We can use the group by construct to see how many records are associated with each year:
The output:
YEAR TOTAL ---- ----- 2000 53 2002 54 2001 53 2005 61 2003 66 2008 67 2006 70 2004 73 2007 89 2009 76 2013 74 2015 65 2010 70 2011 71 2019 48 2012 63 2017 63 2014 69 2016 78 2020 9 2018 52
The group by year clause tells the database system to group together all of the records having the same value for the year column and to then count the number of records in each group. Note that we have given the name total to the count(*) column using the as construct.
Your system probably did not list the years in order. We can fix that using the order by construct!
The output:
YEAR TOTAL ---- ----- 2000 53 2001 53 2002 54 2003 66 2004 73 2005 61 2006 70 2007 89 2008 67 2009 76 2010 70 2011 71 2012 63 2013 74 2014 69 2015 65 2016 78 2017 63 2018 52 2019 48 2020 9
If we want to list the output in reverse order we use the desc keyword:
The output:
YEAR TOTAL ---- ----- 2020 9 2019 48 2018 52 2017 63 2016 78 2015 65 2014 69 2013 74 2012 63 2011 71 2010 70 2009 76 2008 67 2007 89 2006 70 2005 61 2004 73 2003 66 2002 54 2001 53 2000 53
SQL contains many aggregate functions like count. Here is a query that uses the most common ones:
select type,
count(*) as movie_count,
min(votes) as min_votes,
max(votes) as max_votes,
sum(votes) as sum_of_votes,
avg(votes) as averge_votes
from movies
group by type;
The output:
TYPE MOVIE_COUNT MIN_VOTES MAX_VOTES SUM_OF_VOTES AVERGE_VOTES ------- ----------- --------- --------- ------------ ------------ movie 1002 50096 2246851 265213662 264684 tvMovie 322 1001 109891 1389654 4315
We can also use arithmetic in expressions such as
select sum(minutes)/1440 as days, sum(votes)/count(*) as average, 17 * (max(votes) - min(votes)) as nonsense
from movies
where type = 'movie';
The output:
DAYS AVERAGE NONSENSE ---- ------- -------- 82 264684 37344835
Before we go on to learn more SQL let's first use our current knowledge to explore the other tables in our database.
How many records are in the people table?
The output:
6745
Let's look at 10 records:
The output:
PERSON_ID NAME BIRTHYEAR DEATHYEAR --------- ------------------- --------- --------- nm0000002 Lauren Bacall 1924 2014 nm0000005 Ingmar Bergman 1918 2007 nm0000032 Charlton Heston 1923 2008 nm0000056 Paul Newman 1925 2008 nm0000090 Armin Mueller-Stahl 1930 nm0000092 John Cleese 1939 nm0000093 Brad Pitt 1963 nm0000095 Woody Allen 1935 nm0000096 Gillian Anderson 1968 nm0000098 Jennifer Aniston 1969
The people table has the following columns:
The blank columns represent null values (more on those below), so deathyear will be null if the person is still alive or if their deathyear has not been entered into the database.
Let's take a look at the contents of the genres table with this query.
The output:
GENRE_ID GENRE -------- ----------- 1 Adventure 2 Family 3 Fantasy 4 Drama 5 Romance 6 Comedy 7 History 8 War 9 Action 10 Mystery 11 Thriller 12 Biography 13 Documentary 14 Crime 15 Horror 16 Sci-Fi 17 Animation 18 Western 19 Music 20 Sport 21 Musical 22 News
This table has the following columns:
The has_genre table implements a relationship between movies and genres.
The output:
MOVIE_ID GENRE_ID --------- -------- tt0118694 4 tt0118694 5 tt0120630 1 tt0120630 6 tt0120630 17 tt0120679 4 tt0120679 5 tt0120679 12 tt0120737 1 tt0120737 4
This table has the following columns:
Both columns together form the key. Why is that? Because it implements a many-to-many relationship (see lectures).
Let's look at 10 records in the has_position table:
The output:
PERSON_ID MOVIE_ID POSITION --------- --------- -------- nm0000002 tt0276919 actor nm0000005 tt0299478 director nm0000032 tt0310793 self nm0000056 tt0317219 actor nm0000090 tt0765443 actor nm0000092 tt0481499 writer nm0000093 tt0208092 actor nm0000093 tt0240772 actor nm0000093 tt0266987 actor nm0000093 tt0332452 actor
The has_position table has the following columns:
Since this table implements are relationship between movies and people, the key formed of the pair movie_id and person_id. (Note: The key should really be all three columns since people can have multiple positions in the same movie. For some reason the IMDb data source has been constructed in a way that only one position per movie/person is listed. I hope they fix this in the future ....)
Let's see what values the position column contains:
The output:
POSITION TOTAL -------- ----- actor 4950 producer 2300 writer 2215 director 1422 self 293
Let's look at 10 records in the plays_role table:
which produces the output:
PERSON_ID MOVIE_ID ROLE --------- --------- --------------- nm0000002 tt0276919 Ma Ginger nm0000032 tt0310793 Self nm0000056 tt0317219 Doc Hudson nm0000090 tt0765443 Semyon nm0000093 tt0208092 Mickey O'Neil nm0000093 tt0240772 Rusty Ryan nm0000093 tt0266987 Tom Bishop nm0000093 tt0332452 Achilles nm0000093 tt0361748 Lt. Aldo Raine nm0000093 tt0421715 Benjamin Button
The plays_role table has the following columns:
Here the key is a triple: all columns together form the key. Why is that?
Let's do our first join! This query will "join together" records in the movies and has_genre tables in order to list movie titles from 2017 together with their genres:
select title, genre
from movies
join has_genre on has_genre.movie_id = movies.movie_id
join genres on genres.genre_id = has_genre.genre_id
where year = 2017
limit 20;
The output:
TITLE GENRE ---------------------- --------- Wonder Woman Adventure Wonder Woman Fantasy Wonder Woman Action It Horror The Greatest Showman Drama The Greatest Showman Biography The Greatest Showman Musical The Foreigner Action The Foreigner Thriller A Dog's Purpose Adventure A Dog's Purpose Drama A Dog's Purpose Comedy Blade Runner 2049 Drama Blade Runner 2049 Action Blade Runner 2049 Mystery Spider-Man: Homecoming Adventure Spider-Man: Homecoming Action Spider-Man: Homecoming Sci-Fi Coco Adventure Coco Family
As we can see, some movies are associated with multiple genres. This answers the question of why movie_id and genre together form the key for the has_genre table! (The same reasoning goes for the plays_role table --- an actor can play more than one role in a movie.)
We have used the join .. on construct. Our query looks for movie records with year = 2017 and then joins together each such record with every record in the genres table that has the same movie_id value. Note the condition has_genre.movie_id = movies.movie_id disambiguates which instance of the column movie_id we are referring to since both the movies and has_genre tables share a column with the same name.
We can write the same query using the as construct that gives a local abbreviation for a table's name:
select title, genre
from movies as m
join has_genre as hg on hg.movie_id = m.movie_id
join genres as g on g.genre_id = hg.genre_id
where year = 2017
limit 20;
This kind of abbreviation will be essential when our joins contain more than one instance of the same table. For example, suppose we want to list the popular (votes > 100000) romantic comedies (RomComs)--- those movies that are associated with the genres Romance and Comedy:
select title, year, rating, votes
from movies as m
join has_genre as hg1 on hg1.movie_id = m.movie_id
join has_genre as hg2 on hg2.movie_id = m.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where m.votes > 100000 and g1.genre = 'Romance' and g2.genre = 'Comedy'
order by votes desc;
The output:
TITLE YEAR RATING VOTES -------------------------- ---- ------ ------ Amélie 2001 8.3E0 691357 Silver Linings Playbook 2012 7.7E0 651782 500 Days of Summer 2009 7.7E0 464977 Crazy, Stupid, Love 2011 7.4E0 463379 Love Actually 2003 7.6E0 416123 The Terminal 2004 7.4E0 397297 The 40-Year-Old Virgin 2005 7.1E0 388460 Midnight in Paris 2011 7.7E0 380697 Easy A 2010 7.0E0 350826 Up in the Air 2009 7.4E0 314549 50/50 2011 7.6E0 312229 Moonrise Kingdom 2012 7.8E0 312121 Meet the Parents 2000 7.0E0 306783 Pitch Perfect 2012 7.2E0 272149 Forgetting Sarah Marshall 2008 7.1E0 259576 Vicky Cristina Barcelona 2008 7.1E0 239132 Marriage Story 2019 8.0E0 228649 The Artist 2011 7.9E0 227567 The Lobster 2015 7.2E0 210723 Garden State 2004 7.4E0 205392 P.S. I Love You 2007 7.0E0 198138 I Love You, Man 2009 7.0E0 191881 Sideways 2004 7.5E0 177222 About a Boy 2002 7.0E0 169491 Definitely, Maybe 2008 7.1E0 152907 Punch-Drunk Love 2002 7.3E0 144518 The Spectacular Now 2013 7.1E0 142220 Lars and the Real Girl 2007 7.3E0 136627 Good Bye Lenin! 2003 7.7E0 135849 It's Kind of a Funny Story 2010 7.1E0 132703 The Kids Are All Right 2010 7.0E0 124457 The Big Sick 2017 7.5E0 117784 Love, Rosie 2014 7.2E0 117667 Safety Not Guaranteed 2012 7.0E0 117250 Little Children 2006 7.5E0 104217 Dil Bechara 2020 8.8E0 102695
The nested query above checked that only a single value was in a relation. Multiple value can be checked as the following example illustrates (note the use of parentheses around the two values pr.person_id and pr.movie_id):
select count(*) as total
from plays_role as pr
where (pr.person_id, pr.movie_id)
not in (select hp.person_id, hp.movie_id
from has_position as hp
where hp.position = 'actor' or hp.position = 'self');
The output:
0
Interesting! Does this tell us that there is some (unwanted) redundancy in our IMDb data?
Now suppose we want to see what other genres RomComs are associated with. One way to do this is with a nested query:
select m1.title, m1.year, g.genre, m1.rating, m1.votes
from movies as m1
join has_genre as hg on hg.movie_id = m1.movie_id
join genres as g on g.genre_id = hg.genre_id
where m1.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
and m1.movie_id in
(select m2.movie_id
from movies as m2
join has_genre as hg1 on hg1.movie_id = m2.movie_id
join has_genre as hg2 on hg2.movie_id = m2.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where g1.genre = 'Romance' and g2.genre = 'Comedy')
order by m1.votes desc
limit 10;
The output:
TITLE YEAR GENRE RATING VOTES ----------------------- ---- ------- ------ ------ Silver Linings Playbook 2012 Drama 7.7E0 651782 500 Days of Summer 2009 Drama 7.7E0 464977 Crazy, Stupid, Love 2011 Drama 7.4E0 463379 Love Actually 2003 Drama 7.6E0 416123 The Terminal 2004 Drama 7.4E0 397297 Midnight in Paris 2011 Fantasy 7.7E0 380697 Easy A 2010 Drama 7.0E0 350826 Up in the Air 2009 Drama 7.4E0 314549 50/50 2011 Drama 7.6E0 312229 Moonrise Kingdom 2012 Drama 7.8E0 312121
Here (select m2.movie_id ...) is called a nested query. We check if our movie is a RomCom, and then check if there is another genre that is not equal to either Romance or Comedy.
We can introduce a name that represents a query using views. For example, the following SQL defines a view romcom_ids to contain all those movie_id values for RomComs. This view was placed in your database by the command create view romcom_ids ...:
create view romcom_ids as
select m.movie_id as movie_id
from movies as m
join has_genre as hg1 on hg1.movie_id = m.movie_id
join has_genre as hg2 on hg2.movie_id = m.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where g1.genre = 'Romance' and g2.genre = 'Comedy';
The name romcom_ids can now be used as if it were the name of a data base table. For example:
select m.title, m.year, g.genre, m.rating, m.votes
from romcom_ids as r
join has_genre as hg on hg.movie_id = r.movie_id
join genres as g on g.genre_id = hg.genre_id
join movies as m on m.movie_id = r.movie_id
where m.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
order by m.votes desc
limit 10;
This query will will return the same result as the nested RomCom query from above:
TITLE YEAR GENRE RATING VOTES ----------------------- ---- ------- ------ ------ Silver Linings Playbook 2012 Drama 7.7E0 651782 500 Days of Summer 2009 Drama 7.7E0 464977 Crazy, Stupid, Love 2011 Drama 7.4E0 463379 Love Actually 2003 Drama 7.6E0 416123 The Terminal 2004 Drama 7.4E0 397297 Midnight in Paris 2011 Fantasy 7.7E0 380697 Easy A 2010 Drama 7.0E0 350826 Up in the Air 2009 Drama 7.4E0 314549 50/50 2011 Drama 7.6E0 312229 Moonrise Kingdom 2012 Drama 7.8E0 312121
As mentioned above, the has_position table implements multiple many-to-many relationships. Your database contains these five views that can simplify some of your queries over has_position:
create view acts_in as
select person_id, movie_id
from has_position
where position = 'actor';
create view produced as
select person_id, movie_id
from has_position
where position = 'producer';
create view wrote as
select person_id, movie_id
from has_position
where position = 'writer';
create view directed as
select person_id, movie_id
from has_position
where position = 'director';
create view is_self as
select person_id, movie_id
from has_position
where position = 'self';
To see a list of all views defined, type \dv and to see all tables type \dt.
SQL has a special value null that is used when a value is not known or is somehow not relevant. We saw null values for the column deathYear in the people table:
The output:
PERSON_ID NAME BIRTHYEAR DEATHYEAR --------- ------------------- --------- --------- nm0000002 Lauren Bacall 1924 2014 nm0000005 Ingmar Bergman 1918 2007 nm0000032 Charlton Heston 1923 2008 nm0000056 Paul Newman 1925 2008 nm0000090 Armin Mueller-Stahl 1930 nm0000092 John Cleese 1939 nm0000093 Brad Pitt 1963 nm0000095 Woody Allen 1935 nm0000096 Gillian Anderson 1968 nm0000098 Jennifer Aniston 1969
The HyperSQL system will usually print null as a blank (other systems may behave differently).
Let's try to count the number of people records with a null in the deathYear column:
The output:
0
Hmmm. What is going on here? As will be explained further in lectures, with the null we enter the world of 3-valued logic. For two expressions e1 and e2 the expression e1 = e2 can evaluate to true or false or null (which can be read as we don't know the answer). Equality returns null when either e1 or e2 evaluates to null. For example, column = null always returns null. Furthermore, the SELECT ... WHERE construct only returns records for those cases where the where clause returns true.
To get around this problem SQL introduced the is null boolean predicate:
The output:
6344
Sometimes we want to generate null values when we are joining tables together. Let's look at an example.
First, here are the the people credited with work on "Silver Linings Playbook":
select name, position
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ----------------- -------- Robert De Niro actor Bruce Cohen producer Bradley Cooper actor Donna Gigliotti producer Jonathan Gordon producer David O. Russell director Jacki Weaver actor Jennifer Lawrence actor Matthew Quick writer
Let's modify that query in order to see what roles the actors played in that movie:
select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ROLE ----------------- -------- ---------------- Robert De Niro actor Pat Sr. Bradley Cooper actor Pat Jacki Weaver actor Dolores Solitano Jennifer Lawrence actor Tiffany
Well, that's nice but we lost the records associated with non-actors since they do not join with records in the plays_role table! How can we retain them?
We solve this problem by using the left join construct:
select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
left join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ROLE ----------------- -------- ---------------- Robert De Niro actor Pat Sr. Bruce Cohen producer [null] Bradley Cooper actor Pat Donna Gigliotti producer [null] Jonathan Gordon producer [null] David O. Russell director [null] Jacki Weaver actor Dolores Solitano Jennifer Lawrence actor Tiffany Matthew Quick writer [null]
Note that in this case our user-interface displays [null] instead of whitespace. Why? That's because the type of this column is string, and this display wants us to be able to distinguish between a null value and the empty string!
The left join construct.
When there is no matching record in plays_role, the results (role in this case) will be given the null value.
Let's try to write a query that returns the movies in which Jennifer Lawrence played the same role:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
and r1.role = r2.role
order by r1.role, m1.title, m1.year;
The output:
ROLE TITLE YEAR ----------------- ------------------------------- ---- Aurora Lane Passengers 2016 Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games: Catching Fire 2013 Katniss Everdeen The Hunger Games: Catching Fire 2013 Mystique X-Men: First Class 2011 Raven X-Men: First Class 2011 Ree Winter's Bone 2010 Rosalyn Rosenfeld American Hustle 2013 Tiffany Silver Linings Playbook 2012
Hmmm. This didn't work -- we have every role Jennifer Lawrence played in any movie, with duplicate rows as well. Why? Suppose our query finds two movies, say A and B. If they are the same movie (A=B), then we we return the role played in that movie. Otherwise, there are two orders in which the query could process these movies -- first A then B, and first B then A. In each case we will return the same result, and this explains the duplicates rows.
Let's avoid the first problem by a modification to the where clause by adding m1.movie_id <> m2.movie_id:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by r1.role, m1.title, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------- ---- Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games: Catching Fire 2013
Wow! That seems to have solved both our problems! But did it really? As a sanity check let's try another actor:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------------- ---- Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009
Ooops! We still have duplicate rows! What's going on? (Before reading on, see if you can figure out the answer.)
In this case there are three movies involved, say say A, B, and C. So, how many ways can I find a pair of distinct movies with A first? Well, just A, B and A, C. So the query returns two identical rows for movie A. The same goes for B and C. In general, if there are k movies with the same role, then we will get k-1 copies of each! This explains what happened with Jennifer_Lawrence. In that case k = 2 and k-1 = 1, so no duplicate rows! We just got lucky!
All of this is due to the fact that SQL is not based on sets of records, but rather on multisets of records (sets containing duplicates, also called bags). If we want a set rather than a multiset, we can use the select distinct construct:
select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------------- ---- Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009
Yippee! This seems correct. Just to be sure let's sanity check again and try another actor:
select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Scarlett Johansson'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ----------------------------------- ---- Black Widow Captain America: Civil War 2016 Natasha Romanoff Captain America: Civil War 2016 Black Widow Captain America: The Winter Soldier 2014 Natasha Romanoff Captain America: The Winter Soldier 2014 Black Widow The Avengers 2012 Natasha Romanoff The Avengers 2012
Did this result surprised you? Is it correct? It turns out that Scarlett Johansson played two distinct roles in each of these movies. So it does seem correct. But it also forces us to admit that the original English specification, returns the movies in which so-and-so played the same role, was rather vague!
Note: These tables have been added for the Michaelmas term of 2020. Just for fun!
It turns out that many movies have alternative titles in different parts of the world and in different languages. These are recorded in the has_alternative table. This table really represents both a one-to-many relationship together with a (weak) entity set that have been implemented as a single table.
Let's take a look at 20 rows:
The output:
MOVIE_ID ALT_ID TITLE COUNTRY_CODE LANGUAGE_CODE IS_ORIGINAL --------- ------ ------------------------------------------ ------------ ------------- ----------- tt0118694 0 Szerelemre hangolva HU [null] FALSE tt0118694 1 Con ánimo de amar PE [null] FALSE tt0118694 2 Fa yeung nin wah AU [null] FALSE tt0118694 3 Fa Yeung Nin Wah CA en FALSE tt0118694 4 Raspoloženi za ljubav HR [null] FALSE tt0118694 5 Disponível Para Amar PT [null] FALSE tt0118694 6 Les silences du désir CA fr FALSE tt0118694 7 Aşk Zamanı TR tr FALSE tt0118694 8 In the Mood for Love IN hi FALSE tt0118694 9 Dar hal va havaye eshgh IR fa FALSE tt0118694 10 In the Mood for Love - Der Klang der Liebe DE [null] FALSE tt0118694 11 Spragnieni miłości PL [null] FALSE tt0118694 12 In the Mood for Love HK en FALSE tt0118694 13 Desitjant estimar ES ca FALSE tt0118694 14 In the Mood for Love DK [null] FALSE tt0118694 15 Любовное настроение RU [null] FALSE tt0118694 16 Meilės laukimas LT [null] FALSE tt0118694 17 Stvořeni pro lásku CZ [null] FALSE tt0118694 18 Kayô nenka JP [null] FALSE tt0118694 19 In the Mood for Love XWW en FALSE
The has_alternative table has the following columns:
The IMDb data does not include a mapping for language and country codes, but we have supplied these in the languages and countries tables.
Let's take a look at 10 rows of the languages table:
The output:
CODE NAME ---- ---------------------------- ar Arabic bg Bulgarian bn Bengali; Bangla bs Bosnian ca Catalan; Valencian cmn Mandarin Chinese cs Czech de German el Greek, Modern (1453-); Greek en English
Let's take a look at 10 rows of the countries table:
The output:
CODE NAME ---- ---------------------- AL Albania AM Armenia AR Argentina AT Austria AU Australia AW Aruba AZ Azerbaijan BA Bosnia and Herzegovina BD Bangladesh BE Belgium
Now let's use joins to look at the alternative titles of the very popular movie In the Mood for Love (movie_id is tt0118694):
select m.title as title,
a.title as alt_title,
a.is_original as orig,
c.name as country,
l.name as language
from movies as m
join has_alternative as a on a.movie_id = m.movie_id
left join countries as c on a.country_code = c.code
left join languages as l on a.language_code = l.code
where m.movie_id = 'tt0118694';
The output:
TITLE ALT_TITLE ORIG COUNTRY LANGUAGE -------------------- ------------------------------------------ ----- ---------------------------------------------------- ------------------ In the Mood for Love Szerelemre hangolva FALSE Hungary [null] In the Mood for Love Con ánimo de amar FALSE Peru [null] In the Mood for Love Fa yeung nin wah FALSE Australia [null] In the Mood for Love Fa Yeung Nin Wah FALSE Canada English In the Mood for Love Raspoloženi za ljubav FALSE Croatia [null] In the Mood for Love Disponível Para Amar FALSE Portugal [null] In the Mood for Love Les silences du désir FALSE Canada French In the Mood for Love Aşk Zamanı FALSE Turkey Turkish In the Mood for Love In the Mood for Love FALSE India Hindi In the Mood for Love Dar hal va havaye eshgh FALSE Iran, Islamic Republic of Persian; Farsi In the Mood for Love In the Mood for Love - Der Klang der Liebe FALSE Germany [null] In the Mood for Love Spragnieni miłości FALSE Poland [null] In the Mood for Love In the Mood for Love FALSE Hong Kong English In the Mood for Love Desitjant estimar FALSE Spain Catalan; Valencian In the Mood for Love In the Mood for Love FALSE Denmark [null] In the Mood for Love Любовное настроение FALSE Russian Federation [null] In the Mood for Love Meilės laukimas FALSE Lithuania [null] In the Mood for Love Stvořeni pro lásku FALSE Czechia [null] In the Mood for Love Kayô nenka FALSE Japan [null] In the Mood for Love In the Mood for Love FALSE Not an ISO-3166 country? English In the Mood for Love In the Mood for Love FALSE Australia [null] In the Mood for Love Любовний настрій FALSE Ukraine [null] In the Mood for Love Flower Like Years FALSE Hong Kong English In the Mood for Love Fa yeung nin wah TRUE [null] [null] In the Mood for Love Ερωτική επιθυμία FALSE Greece [null] In the Mood for Love In the Mood for Love FALSE Italy [null] In the Mood for Love In the Mood for Love FALSE France [null] In the Mood for Love Deseando amar FALSE Spain [null] In the Mood for Love Hua yang nian hua FALSE Hong Kong Mandarin Chinese In the Mood for Love Расположени за љубав FALSE Serbia [null] In the Mood for Love Deseando amar FALSE Mexico [null] In the Mood for Love In the Mood for Love FALSE United States of America [null] In the Mood for Love Amor à Flor da Pele FALSE Brazil [null] In the Mood for Love Con ánimo de amar FALSE Argentina [null] In the Mood for Love 花様年華 FALSE Japan Japanese In the Mood for Love Con ánimo de amar FALSE Uruguay [null] In the Mood for Love 花樣年華 FALSE Taiwan, Province of China [null] In the Mood for Love In the Mood for Love FALSE Spain [null] In the Mood for Love Fi mezaj lelhob FALSE Egypt Arabic In the Mood for Love Tâm Trạng Khi Yêu FALSE Viet Nam [null] In the Mood for Love O iubire imposibila FALSE Romania [null] In the Mood for Love Fa yeung nin wah FALSE Hong Kong Yue Chinese In the Mood for Love In the Mood for Love FALSE Finland [null] In the Mood for Love Three Stories About Food FALSE Not an ISO-3166 country? English In the Mood for Love In the Mood for Love FALSE United Kingdom of Great Britain and Northern Ireland [null] In the Mood for Love Rakkaus käy taloksi FALSE Finland [null] In the Mood for Love Beijing Summer FALSE Hong Kong English In the Mood for Love In the Mood for Love FALSE Sweden [null] In the Mood for Love In the Mood for Love FALSE Norway [null]
We can see that the data is a little messy (like many real-world tables!). There are quite a few missing languages as well as some countries with codes that don't seem to be in the ISO-3666 standard.
Let's use this movie to explore the meaning of the is_original column:
select m.title as title,
a.title as alt_title,
a.is_original as orig,
c.name as country,
l.name as language
from movies as m
join has_alternative as a on a.movie_id = m.movie_id
left join countries as c on a.country_code = c.code
left join languages as l on a.language_code = l.code
where m.movie_id = 'tt0118694'
and (a.is_original = TRUE or c.name= 'Hong Kong')
order by a.title;
The output:
TITLE ALT_TITLE ORIG COUNTRY LANGUAGE -------------------- -------------------- ----- --------- ---------------- In the Mood for Love Beijing Summer FALSE Hong Kong English In the Mood for Love Fa yeung nin wah TRUE [null] [null] In the Mood for Love Fa yeung nin wah FALSE Hong Kong Yue Chinese In the Mood for Love Flower Like Years FALSE Hong Kong English In the Mood for Love Hua yang nian hua FALSE Hong Kong Mandarin Chinese In the Mood for Love In the Mood for Love FALSE Hong Kong English
This is a bit hard to interpret! In the IBMd data, whenever a record has TRUE for is_original we find that the country_code and the language_code are NULL. And here we have another record with the original title (Fa yeung nin wah), but it has an is_original value of FALSE. We can at least conclude that IMDb is doing something that is not well documented!
Now you should know enough about this database and SQL to attempt the practical questions, which appear on a separate page.