University of Cambridge, Computer Laboratory.
Contents
This tutorial will quickly get you started with our relational database. You will learn more if you run each of these queries yourself and experiment with your own variations.
The database used in this course contains information about movies and people, taken from the Internet Movie Database (IMDb).
We use SQL as query language. SQL is a vast language, and this tutorial teaches you a small subset through examples. However, we do try to cover the most commonly used constructs in the query language.
The database management system is HyperSQL Database (HSQLDB), which has the advantage of being quite easy to set up. The things you will learn in these exercises apply equally to more common databases you might have heard of, such as MySQL or PostgreSQL.
If you run into any problems, please post questions (or answers!) on the course's Moodle page.
You can use your own computer, or a college or university computer with Java installed. The examples should work equally well on Linux, Windows or Mac OS.
Follow the instructions at HyperSQL Database (HSQLDB) in order to download the system. Once you have unzipped it, you should have a directory hsqldb-2.6.0 containing a subdirectory hsqldb. Within hsqldb is a subdirectory lib where the *.jar files reside.
Download the data from movies-relational.zip and unzip this to create the directory movies-relational (put it anywhere you want).
Next, download the configuration file db1a.rc. Follow the instructions in this file to add the full path to your movies-relational data. This file sets you up as the administrator for the db1a database with an empty password.
Now you are ready to run the HyperSQL shell with the following command line:
java -jar <full path to directory with jar files>sqltool.jar --rcfile=db1a.rc db1a
This should start a read-eval-print loop in which you can enter SQL expressions (terminated with a ;) and see the results. Type \? for help and \q to quit.
Our movie database contains the following tables
In terms of the Entity-Relationship model described in lectures, the first five tables contain entity sets while the last four implement relationships.
Let's start by using variations on the select ... from construct to explore the contents of the movies table.
The following query will list 10 records from this table:
The * in the query specifies that every column from the table will be printed. We are using limit 10 since there are thousands of records in this table! Note that most systems today are not case sensitive but for some (strange) reason it has become a tradition to write SQL's keywords using CAPITALS, so you many people would write this query as
We will write all queries using lower case. Both queries will produce the same output:
MOVIE_ID TITLE YEAR TYPE MINUTES RATING VOTES --------- ------------------------------------------------- ---- ----- ------- ------ ------- tt0118694 In the Mood for Love 2000 movie 98 8.1E0 136535 tt0120630 Chicken Run 2000 movie 84 7.0E0 183142 tt0120679 Frida 2002 movie 123 7.4E0 85132 tt0120737 The Lord of the Rings: The Fellowship of the Ring 2001 movie 178 8.8E0 1730797 tt0120903 X-Men 2000 movie 104 7.4E0 587971 tt0120917 The Emperor's New Groove 2000 movie 78 7.4E0 189721 tt0121164 Corpse Bride 2005 movie 77 7.3E0 254444 tt0121766 Star Wars: Episode III - Revenge of the Sith 2005 movie 140 7.5E0 737660 tt0126029 Shrek 2001 movie 90 7.9E0 638466 tt0133240 Treasure Planet 2002 movie 95 7.2E0 109490
The movies table has the following columns:
Rather than using * we can be explicit about the which columns we want to see and in what order. For example:
This query produces the following output:
YEAR TITLE RATING ---- ------------------------------------------------- ------ 2000 In the Mood for Love 8.1E0 2000 Chicken Run 7.0E0 2002 Frida 7.4E0 2001 The Lord of the Rings: The Fellowship of the Ring 8.8E0 2000 X-Men 7.4E0 2000 The Emperor's New Groove 7.4E0 2005 Corpse Bride 7.3E0 2005 Star Wars: Episode III - Revenge of the Sith 7.5E0 2001 Shrek 7.9E0 2002 Treasure Planet 7.2E0
The where construct allows us to select only those records that satisfy a given condition. Here we use it to select only movies released in 2017:
The output:
TITLE RATING --------------------------------------- ------ Wonder Woman 7.4E0 It 7.3E0 The Greatest Showman 7.6E0 The Foreigner 7.0E0 A Dog's Purpose 7.2E0 Blade Runner 2049 8.0E0 The Upside 7.0E0 Spider-Man: Homecoming 7.4E0 Coco 8.4E0 Star Wars: Episode VIII - The Last Jedi 7.0E0
SQL has many reserved words. Our column name year happens to be an SQL reserved word that gets highlighted by the automation we are using. Sigh. Happily, most SQL systems are very liberal in allowing many reserved words to be used as column or table names.
The condition of the where clause can contain a complex boolean-valued constraint. For example, if we want to list movies from either 2016 or 2017 with titles containing the word Star we can write
The output:
TITLE YEAR RATING --------------------------------------------------------- ---- ------ Star Wars: Episode VIII - The Last Jedi 2017 7.0E0 Star Trek Beyond 2016 7.0E0 Rogue One: A Star Wars Story 2016 7.8E0 Bright Lights: Starring Carrie Fisher and Debbie Reynolds 2016 8.0E0
Here the like operator matches strings against some pattern. The percent sign % in the pattern means that "any sub-string can appear here". Thus, title like '%Star%' matches only titles that contain the word "Star" as a sub-string.
How many movies are in our database? The following query will tell us:
The output:
1480
We can use the group by construct to see how many records are associated with each year:
The output:
YEAR TOTAL ---- ----- 2000 56 2002 55 2001 56 2005 63 2003 69 2008 72 2006 74 2004 79 2007 95 2009 81 2013 77 2015 75 2010 71 2011 79 2019 68 2012 68 2017 67 2014 79 2016 80 2020 36 2021 18 2018 62
The group by year clause tells the database system to group together all of the records having the same value for the year column and to then count the number of records in each group. Note that we have given the name total to the count(*) column using the as construct.
Your system probably did not list the years in order. We can fix that using the order by construct!
The output:
YEAR TOTAL ---- ----- 2000 56 2001 56 2002 55 2003 69 2004 79 2005 63 2006 74 2007 95 2008 72 2009 81 2010 71 2011 79 2012 68 2013 77 2014 79 2015 75 2016 80 2017 67 2018 62 2019 68 2020 36 2021 18
If we want to list the output in reverse order we use the desc keyword:
The output:
YEAR TOTAL ---- ----- 2021 18 2020 36 2019 68 2018 62 2017 67 2016 80 2015 75 2014 79 2013 77 2012 68 2011 79 2010 71 2009 81 2008 72 2007 95 2006 74 2005 63 2004 79 2003 69 2002 55 2001 56 2000 56
SQL contains many aggregate functions like count. Here is a query that uses the most common ones:
select type,
count(*) as movie_count,
min(votes) as min_votes,
max(votes) as max_votes,
sum(votes) as sum_of_votes,
avg(votes) as averge_votes
from movies
group by type;
The output:
TYPE MOVIE_COUNT MIN_VOTES MAX_VOTES SUM_OF_VOTES AVERGE_VOTES ------- ----------- --------- --------- ------------ ------------ movie 1094 50082 2418636 292732180 267579 tvMovie 386 1000 34902 1523907 3947
We can also use arithmetic in expressions such as
select sum(minutes)/1440 as days, sum(votes)/count(*) as average, 17 * (max(votes) - min(votes)) as nonsense
from movies
where type = 'movie';
The output:
DAYS AVERAGE NONSENSE ---- ------- -------- 90 267579 40265418
Before we go on to learn more SQL let's first use our current knowledge to explore the other tables in our database.
How many records are in the people table?
The output:
7583
Let's look at 10 records:
The output:
PERSON_ID NAME BIRTHYEAR DEATHYEAR --------- ------------------- --------- --------- nm0000002 Lauren Bacall 1924 2014 nm0000005 Ingmar Bergman 1918 2007 nm0000032 Charlton Heston 1923 2008 nm0000056 Paul Newman 1925 2008 nm0000084 Gong Li 1965 nm0000090 Armin Mueller-Stahl 1930 nm0000092 John Cleese 1939 nm0000093 Brad Pitt 1963 nm0000095 Woody Allen 1935 nm0000096 Gillian Anderson 1968
The people table has the following columns:
The blank columns represent null values (more on those below), so deathyear will be null if the person is still alive or if their deathyear has not been entered into the database.
Let's take a look at the contents of the genres table with this query.
The output:
GENRE_ID GENRE -------- ----------- 1 Comedy 2 Crime 3 Drama 4 Action 5 Adventure 6 Animation 7 Sci-Fi 8 Musical 9 Thriller 10 Music 11 Horror 12 Mystery 13 Biography 14 Documentary 15 Sport 16 History 17 Romance 18 Family 19 Fantasy 20 War 21 Western 22 News
This table has the following columns:
The has_genre table implements a relationship between movies and genres.
The output:
MOVIE_ID GENRE_ID --------- -------- tt0118694 3 tt0118694 17 tt0120630 1 tt0120630 5 tt0120630 6 tt0120679 3 tt0120679 13 tt0120679 17 tt0120737 3 tt0120737 4
This table has the following columns:
Both columns together form the key. Why is that? Because it implements a many-to-many relationship (see lectures).
Let's look at 10 records in the has_position table:
The output:
PERSON_ID MOVIE_ID POSITION --------- --------- -------- nm0000002 tt0276919 actor nm0000005 tt0299478 director nm0000032 tt0310793 self nm0000056 tt0317219 actor nm0000084 tt0212712 actor nm0000090 tt0765443 actor nm0000092 tt0481499 writer nm0000093 tt0208092 actor nm0000093 tt0240772 actor nm0000093 tt0266987 actor
The has_position table has the following columns:
Since this table implements are relationship between movies and people, the key formed of the pair movie_id and person_id. (Note: The key should really be all three columns since people can have multiple positions in the same movie. For some reason the IMDb data source has been constructed in a way that only one position per movie/person is listed. I hope they fix this in the future ....)
Let's see what values the position column contains:
The output:
POSITION TOTAL -------- ----- actor 5514 producer 2552 writer 2499 director 1600 self 339
Let's look at 10 records in the plays_role table:
which produces the output:
PERSON_ID MOVIE_ID ROLE --------- --------- -------------- nm0000002 tt0276919 Ma Ginger nm0000032 tt0310793 Self nm0000056 tt0317219 Doc Hudson nm0000084 tt0212712 Su Li-zhen nm0000090 tt0765443 Semyon nm0000093 tt0208092 Mickey O'Neil nm0000093 tt0240772 Rusty Ryan nm0000093 tt0266987 Tom Bishop nm0000093 tt0332452 Achilles nm0000093 tt0361748 Lt. Aldo Raine
The plays_role table has the following columns:
Here the key is a triple: all columns together form the key. Why is that?
Let's do our first join! This query will "join together" records in the movies and has_genre tables in order to list movie titles from 2017 together with their genres:
select title, genre
from movies
join has_genre on has_genre.movie_id = movies.movie_id
join genres on genres.genre_id = has_genre.genre_id
where year = 2017
limit 20;
The output:
TITLE GENRE ---------------------- --------- Wonder Woman Action Wonder Woman Adventure Wonder Woman Fantasy It Horror The Greatest Showman Drama The Greatest Showman Musical The Greatest Showman Biography The Foreigner Action The Foreigner Thriller A Dog's Purpose Comedy A Dog's Purpose Drama A Dog's Purpose Adventure Blade Runner 2049 Drama Blade Runner 2049 Action Blade Runner 2049 Mystery The Upside Comedy The Upside Drama Spider-Man: Homecoming Action Spider-Man: Homecoming Adventure Spider-Man: Homecoming Sci-Fi
As we can see, some movies are associated with multiple genres. This answers the question of why movie_id and genre together form the key for the has_genre table! (The same reasoning goes for the plays_role table --- an actor can play more than one role in a movie.)
We have used the join .. on construct. Our query looks for movie records with year = 2017 and then joins together each such record with every record in the genres table that has the same movie_id value. Note the condition has_genre.movie_id = movies.movie_id disambiguates which instance of the column movie_id we are referring to since both the movies and has_genre tables share a column with the same name.
We can write the same query using the as construct that gives a local abbreviation for a table's name:
select title, genre
from movies as m
join has_genre as hg on hg.movie_id = m.movie_id
join genres as g on g.genre_id = hg.genre_id
where year = 2017
limit 20;
This kind of abbreviation will be essential when our joins contain more than one instance of the same table. For example, suppose we want to list the popular (votes > 100000) romantic comedies (RomComs)--- those movies that are associated with the genres Romance and Comedy:
select title, year, rating, votes
from movies as m
join has_genre as hg1 on hg1.movie_id = m.movie_id
join has_genre as hg2 on hg2.movie_id = m.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where m.votes > 100000 and g1.genre = 'Romance' and g2.genre = 'Comedy'
order by votes desc;
The output:
TITLE YEAR RATING VOTES -------------------------- ---- ------ ------ Amélie 2001 8.3E0 724870 Silver Linings Playbook 2012 7.7E0 680554 Crazy, Stupid, Love. 2011 7.4E0 491840 500 Days of Summer 2009 7.7E0 488294 Love Actually 2003 7.6E0 447788 The Terminal 2004 7.4E0 427126 The 40-Year-Old Virgin 2005 7.1E0 413043 Midnight in Paris 2011 7.7E0 402816 Easy A 2010 7.0E0 372297 Moonrise Kingdom 2012 7.8E0 331243 Up in the Air 2009 7.4E0 326286 50/50 2011 7.6E0 322151 Meet the Parents 2000 7.0E0 320664 Pitch Perfect 2012 7.1E0 289345 Forgetting Sarah Marshall 2008 7.1E0 273480 Marriage Story 2019 7.9E0 272541 Vicky Cristina Barcelona 2008 7.1E0 248490 The Artist 2011 7.9E0 235863 The Lobster 2015 7.2E0 234450 Garden State 2004 7.4E0 211459 P.S. I Love You 2007 7.0E0 204346 I Love You, Man 2009 7.0E0 198908 Sideways 2004 7.5E0 185086 About a Boy 2002 7.1E0 178772 Definitely, Maybe 2008 7.1E0 160408 Punch-Drunk Love 2002 7.3E0 155100 The Spectacular Now 2013 7.1E0 148293 Lars and the Real Girl 2007 7.3E0 141754 Good Bye Lenin! 2003 7.7E0 141680 It's Kind of a Funny Story 2010 7.1E0 139813 Love, Rosie 2014 7.2E0 129885 The Kids Are All Right 2010 7.0E0 128165 The Big Sick 2017 7.5E0 127743 Dil Bechara 2020 8.1E0 124240 Safety Not Guaranteed 2012 7.0E0 123673 Love, Simon 2018 7.6E0 105950
The nested query above checked that only a single value was in a relation. Multiple value can be checked as the following example illustrates (note the use of parentheses around the two values pr.person_id and pr.movie_id):
select count(*) as total
from plays_role as pr
where (pr.person_id, pr.movie_id)
not in (select hp.person_id, hp.movie_id
from has_position as hp
where hp.position = 'actor' or hp.position = 'self');
The output:
0
Interesting! Does this tell us that there is some (unwanted) redundancy in our IMDb data?
Now suppose we want to see what other genres RomComs are associated with. One way to do this is with a nested query:
select m1.title, m1.year, g.genre, m1.rating, m1.votes
from movies as m1
join has_genre as hg on hg.movie_id = m1.movie_id
join genres as g on g.genre_id = hg.genre_id
where m1.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
and m1.movie_id in
(select m2.movie_id
from movies as m2
join has_genre as hg1 on hg1.movie_id = m2.movie_id
join has_genre as hg2 on hg2.movie_id = m2.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where g1.genre = 'Romance' and g2.genre = 'Comedy')
order by m1.votes desc
limit 10;
The output:
TITLE YEAR GENRE RATING VOTES ----------------------- ---- ------- ------ ------ Silver Linings Playbook 2012 Drama 7.7E0 680554 Crazy, Stupid, Love. 2011 Drama 7.4E0 491840 500 Days of Summer 2009 Drama 7.7E0 488294 Love Actually 2003 Drama 7.6E0 447788 The Terminal 2004 Drama 7.4E0 427126 Midnight in Paris 2011 Fantasy 7.7E0 402816 Easy A 2010 Drama 7.0E0 372297 Moonrise Kingdom 2012 Drama 7.8E0 331243 Up in the Air 2009 Drama 7.4E0 326286 50/50 2011 Drama 7.6E0 322151
Here (select m2.movie_id ...) is called a nested query. We check if our movie is a RomCom, and then check if there is another genre that is not equal to either Romance or Comedy.
We can introduce a name that represents a query using views. For example, the following SQL defines a view romcom_ids to contain all those movie_id values for RomComs. This view was placed in your database by the command create view romcom_ids ...:
create view romcom_ids as
select m.movie_id as movie_id
from movies as m
join has_genre as hg1 on hg1.movie_id = m.movie_id
join has_genre as hg2 on hg2.movie_id = m.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where g1.genre = 'Romance' and g2.genre = 'Comedy';
The name romcom_ids can now be used as if it were the name of a data base table. For example:
select m.title, m.year, g.genre, m.rating, m.votes
from romcom_ids as r
join has_genre as hg on hg.movie_id = r.movie_id
join genres as g on g.genre_id = hg.genre_id
join movies as m on m.movie_id = r.movie_id
where m.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
order by m.votes desc
limit 10;
This query will will return the same result as the nested RomCom query from above:
TITLE YEAR GENRE RATING VOTES ----------------------- ---- ------- ------ ------ Silver Linings Playbook 2012 Drama 7.7E0 680554 Crazy, Stupid, Love. 2011 Drama 7.4E0 491840 500 Days of Summer 2009 Drama 7.7E0 488294 Love Actually 2003 Drama 7.6E0 447788 The Terminal 2004 Drama 7.4E0 427126 Midnight in Paris 2011 Fantasy 7.7E0 402816 Easy A 2010 Drama 7.0E0 372297 Moonrise Kingdom 2012 Drama 7.8E0 331243 Up in the Air 2009 Drama 7.4E0 326286 50/50 2011 Drama 7.6E0 322151
As mentioned above, the has_position table implements multiple many-to-many relationships. Your database contains these five views that can simplify some of your queries over has_position:
create view acts_in as
select person_id, movie_id
from has_position
where position = 'actor';
create view produced as
select person_id, movie_id
from has_position
where position = 'producer';
create view wrote as
select person_id, movie_id
from has_position
where position = 'writer';
create view directed as
select person_id, movie_id
from has_position
where position = 'director';
create view is_self as
select person_id, movie_id
from has_position
where position = 'self';
To see a list of all views defined, type \dv and to see all tables type \dt.
SQL has a special value null that is used when a value is not known or is somehow not relevant. We saw null values for the column deathYear in the people table:
The output:
PERSON_ID NAME BIRTHYEAR DEATHYEAR --------- ------------------- --------- --------- nm0000002 Lauren Bacall 1924 2014 nm0000005 Ingmar Bergman 1918 2007 nm0000032 Charlton Heston 1923 2008 nm0000056 Paul Newman 1925 2008 nm0000084 Gong Li 1965 nm0000090 Armin Mueller-Stahl 1930 nm0000092 John Cleese 1939 nm0000093 Brad Pitt 1963 nm0000095 Woody Allen 1935 nm0000096 Gillian Anderson 1968
The HyperSQL system will usually print null as a blank (other systems may behave differently).
Let's try to count the number of people records with a null in the deathYear column:
The output:
0
Hmmm. What is going on here? As will be explained further in lectures, with the null we enter the world of 3-valued logic. For two expressions e1 and e2 the expression e1 = e2 can evaluate to true or false or null (which can be read as we don't know the answer). Equality returns null when either e1 or e2 evaluates to null. For example, column = null always returns null. Furthermore, the SELECT ... WHERE construct only returns records for those cases where the where clause returns true.
To get around this problem SQL introduced the is null boolean predicate:
The output:
7128
Sometimes we want to generate null values when we are joining tables together. Let's look at an example.
First, here are the the people credited with work on "Silver Linings Playbook":
select name, position
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ----------------- -------- Robert De Niro actor Bruce Cohen producer Bradley Cooper actor Donna Gigliotti producer Jonathan Gordon producer David O. Russell director Jacki Weaver actor Jennifer Lawrence actor Matthew Quick writer
Let's modify that query in order to see what roles the actors played in that movie:
select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ROLE ----------------- -------- ---------------- Robert De Niro actor Pat Sr. Bradley Cooper actor Pat Jacki Weaver actor Dolores Solitano Jennifer Lawrence actor Tiffany
Well, that's nice but we lost the records associated with non-actors since they do not join with records in the plays_role table! How can we retain them?
We solve this problem by using the left join construct:
select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
left join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';
The output:
NAME POSITION ROLE ----------------- -------- ---------------- Robert De Niro actor Pat Sr. Bruce Cohen producer [null] Bradley Cooper actor Pat Donna Gigliotti producer [null] Jonathan Gordon producer [null] David O. Russell director [null] Jacki Weaver actor Dolores Solitano Jennifer Lawrence actor Tiffany Matthew Quick writer [null]
Note that in this case our user-interface displays [null] instead of whitespace. Why? That's because the type of this column is string, and this display wants us to be able to distinguish between a null value and the empty string!
The left join construct.
When there is no matching record in plays_role, the results (role in this case) will be given the null value.
Let's try to write a query that returns the movies in which Jennifer Lawrence played the same role:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
and r1.role = r2.role
order by r1.role, m1.title, m1.year;
The output:
ROLE TITLE YEAR ----------------- ------------------------------- ---- Aurora Lane Passengers 2016 Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games: Catching Fire 2013 Katniss Everdeen The Hunger Games: Catching Fire 2013 Mystique X-Men: First Class 2011 Raven X-Men: First Class 2011 Ree Winter's Bone 2010 Rosalyn Rosenfeld American Hustle 2013 Tiffany Silver Linings Playbook 2012
Hmmm. This didn't work -- we have every role Jennifer Lawrence played in any movie, with duplicate rows as well. Why? Suppose our query finds two movies, say A and B. If they are the same movie (A=B), then we we return the role played in that movie. Otherwise, there are two orders in which the query could process these movies -- first A then B, and first B then A. In each case we will return the same result, and this explains the duplicates rows.
Let's avoid the first problem by a modification to the where clause by adding m1.movie_id <> m2.movie_id:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by r1.role, m1.title, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------- ---- Katniss Everdeen The Hunger Games 2012 Katniss Everdeen The Hunger Games: Catching Fire 2013
Wow! That seems to have solved both our problems! But did it really? As a sanity check let's try another actor:
select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------------- ---- Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009
Ooops! We still have duplicate rows! What's going on? (Before reading on, see if you can figure out the answer.)
In this case there are three movies involved, say say A, B, and C. So, how many ways can I find a pair of distinct movies with A first? Well, just A, B and A, C. So the query returns two identical rows for movie A. The same goes for B and C. In general, if there are k movies with the same role, then we will get k-1 copies of each! This explains what happened with Jennifer_Lawrence. In that case k = 2 and k-1 = 1, so no duplicate rows! We just got lucky!
All of this is due to the fact that SQL is not based on sets of records, but rather on multisets of records (sets containing duplicates, also called bags). If we want a set rather than a multiset, we can use the select distinct construct:
select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ------------------------------------- ---- Lisbeth Salander The Girl Who Kicked the Hornet's Nest 2009 Lisbeth Salander The Girl Who Played with Fire 2009 Lisbeth Salander The Girl with the Dragon Tattoo 2009
Yippee! This seems correct. Just to be sure let's sanity check again and try another actor:
select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Scarlett Johansson'
and r1.role = r2.role
and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;
The output:
ROLE TITLE YEAR ---------------- ----------------------------------- ---- Black Widow Captain America: Civil War 2016 Natasha Romanoff Captain America: Civil War 2016 Black Widow Captain America: The Winter Soldier 2014 Natasha Romanoff Captain America: The Winter Soldier 2014 Black Widow The Avengers 2012 Natasha Romanoff The Avengers 2012
Did this result surprised you? Is it correct? It turns out that Scarlett Johansson played two distinct roles in each of these movies. So it does seem correct. But it also forces us to admit that the original English specification, returns the movies in which so-and-so played the same role, was rather vague!
Note: These tables have been added for the Michaelmas term of 2020. Just for fun!
It turns out that many movies have alternative titles in different parts of the world and in different languages. These are recorded in the has_alternative table. This table really represents both a one-to-many relationship together with a (weak) entity set that have been implemented as a single table.
Let's take a look at 20 rows:
The output:
MOVIE_ID ALT_ID TITLE COUNTRY_CODE LANGUAGE_CODE IS_ORIGINAL --------- ------ ------------------------------------------ ------------ ------------- ----------- tt0118694 0 In the Mood for Love SE [null] FALSE tt0118694 1 Faajoeng ninwaa HK yue FALSE tt0118694 2 Fa yeung nin wah ID [null] FALSE tt0118694 3 In the Mood for Love NO [null] FALSE tt0118694 4 Szerelemre hangolva HU [null] FALSE tt0118694 5 Con ánimo de amar PE [null] FALSE tt0118694 6 Fa Yeung Nin Wah IN [null] FALSE tt0118694 7 Raspoloženi za ljubav HR [null] FALSE tt0118694 8 Disponível Para Amar PT [null] FALSE tt0118694 9 Huayang nianhua HK cmn FALSE tt0118694 10 In the Mood for Love - Der Klang der Liebe DE [null] FALSE tt0118694 11 Les silences du désir CA fr FALSE tt0118694 12 Aşk Zamanı TR tr FALSE tt0118694 13 In the Mood for Love IN hi FALSE tt0118694 14 Dar hal va havaye eshgh IR fa FALSE tt0118694 15 Spragnieni miłości PL [null] FALSE tt0118694 16 In the Mood for Love HK en FALSE tt0118694 17 Desitjant estimar ES ca FALSE tt0118694 18 花樣年華 HK yue FALSE tt0118694 19 In the Mood for Love DK [null] FALSE
The has_alternative table has the following columns:
The IMDb data does not include a mapping for language and country codes, but we have supplied these in the languages and countries tables.
Let's take a look at 10 rows of the languages table:
The output:
CODE NAME ---- ---------------------------- af Afrikaans ar Arabic bg Bulgarian bn Bengali; Bangla bs Bosnian ca Catalan; Valencian cmn Mandarin Chinese cs Czech de German el Greek, Modern (1453-); Greek
Let's take a look at 10 rows of the countries table:
The output:
CODE NAME ---- ---------------------- AE United Arab Emirates AL Albania AM Armenia AR Argentina AT Austria AU Australia AW Aruba AZ Azerbaijan BA Bosnia and Herzegovina BD Bangladesh
Now let's use joins to look at the alternative titles of the very popular movie In the Mood for Love (movie_id is tt0118694):
select m.title as title,
a.title as alt_title,
a.is_original as orig,
c.name as country,
l.name as language
from movies as m
join has_alternative as a on a.movie_id = m.movie_id
left join countries as c on a.country_code = c.code
left join languages as l on a.language_code = l.code
where m.movie_id = 'tt0118694';
The output:
TITLE ALT_TITLE ORIG COUNTRY LANGUAGE -------------------- ------------------------------------------ ----- ---------------------------------------------------- ------------------ In the Mood for Love In the Mood for Love FALSE Sweden [null] In the Mood for Love Faajoeng ninwaa FALSE Hong Kong Yue Chinese In the Mood for Love Fa yeung nin wah FALSE Indonesia [null] In the Mood for Love In the Mood for Love FALSE Norway [null] In the Mood for Love Szerelemre hangolva FALSE Hungary [null] In the Mood for Love Con ánimo de amar FALSE Peru [null] In the Mood for Love Fa Yeung Nin Wah FALSE India [null] In the Mood for Love Raspoloženi za ljubav FALSE Croatia [null] In the Mood for Love Disponível Para Amar FALSE Portugal [null] In the Mood for Love Huayang nianhua FALSE Hong Kong Mandarin Chinese In the Mood for Love In the Mood for Love - Der Klang der Liebe FALSE Germany [null] In the Mood for Love Les silences du désir FALSE Canada French In the Mood for Love Aşk Zamanı FALSE Turkey Turkish In the Mood for Love In the Mood for Love FALSE India Hindi In the Mood for Love Dar hal va havaye eshgh FALSE Iran, Islamic Republic of Persian; Farsi In the Mood for Love Spragnieni miłości FALSE Poland [null] In the Mood for Love In the Mood for Love FALSE Hong Kong English In the Mood for Love Desitjant estimar FALSE Spain Catalan; Valencian In the Mood for Love 花樣年華 FALSE Hong Kong Yue Chinese In the Mood for Love In the Mood for Love FALSE Denmark [null] In the Mood for Love Любовное настроение FALSE Russian Federation [null] In the Mood for Love Любовний настрій FALSE Ukraine [null] In the Mood for Love In the Mood for Love FALSE Canada English In the Mood for Love Meilės laukimas FALSE Lithuania [null] In the Mood for Love Stvořeni pro lásku FALSE Czechia [null] In the Mood for Love In the Mood for Love FALSE Not an ISO-3166 country? English In the Mood for Love In the Mood for Love FALSE Australia [null] In the Mood for Love Flower Like Years FALSE Hong Kong English In the Mood for Love Fa yeung nin wah TRUE [null] [null] In the Mood for Love Ερωτική επιθυμία FALSE Greece [null] In the Mood for Love In the Mood for Love FALSE Italy [null] In the Mood for Love In the Mood for Love FALSE France [null] In the Mood for Love Amor à Flor da Pele FALSE Brazil [null] In the Mood for Love Deseando amar FALSE Spain [null] In the Mood for Love O iubire imposibilă FALSE Romania [null] In the Mood for Love Расположени за љубав FALSE Serbia [null] In the Mood for Love Kayô nenka FALSE Japan Japanese In the Mood for Love Deseando amar FALSE Mexico [null] In the Mood for Love In the Mood for Love FALSE United States of America [null] In the Mood for Love Con ánimo de amar FALSE Argentina [null] In the Mood for Love 花様年華 FALSE Japan Japanese In the Mood for Love Tâm trạng khi yêu FALSE Viet Nam [null] In the Mood for Love Con ánimo de amar FALSE Uruguay [null] In the Mood for Love Three Stories About Food FALSE Not an ISO-3166 country? English In the Mood for Love 花樣年華 FALSE Taiwan, Province of China [null] In the Mood for Love Fayeung ninwah FALSE Hong Kong Yue Chinese In the Mood for Love In the Mood for Love FALSE Spain [null] In the Mood for Love Fi mezaj lelhob FALSE Egypt Arabic In the Mood for Love Valmis armastuseks FALSE Estonia [null] In the Mood for Love In the Mood for Love FALSE Finland [null] In the Mood for Love 花样年华 FALSE China Mandarin Chinese In the Mood for Love In the Mood for Love FALSE United Kingdom of Great Britain and Northern Ireland [null] In the Mood for Love Rakkaus käy taloksi FALSE Finland [null] In the Mood for Love Fa Yeung Nin Wah FALSE Canada [null] In the Mood for Love Beijing Summer FALSE Hong Kong English
We can see that the data is a little messy (like many real-world tables!). There are quite a few missing languages as well as some countries with codes that don't seem to be in the ISO-3666 standard.
Let's use this movie to explore the meaning of the is_original column:
select m.title as title,
a.title as alt_title,
a.is_original as orig,
c.name as country,
l.name as language
from movies as m
join has_alternative as a on a.movie_id = m.movie_id
left join countries as c on a.country_code = c.code
left join languages as l on a.language_code = l.code
where m.movie_id = 'tt0118694'
and (a.is_original = TRUE or c.name= 'Hong Kong')
order by a.title;
The output:
TITLE ALT_TITLE ORIG COUNTRY LANGUAGE -------------------- -------------------- ----- --------- ---------------- In the Mood for Love Beijing Summer FALSE Hong Kong English In the Mood for Love Fa yeung nin wah TRUE [null] [null] In the Mood for Love Faajoeng ninwaa FALSE Hong Kong Yue Chinese In the Mood for Love Fayeung ninwah FALSE Hong Kong Yue Chinese In the Mood for Love Flower Like Years FALSE Hong Kong English In the Mood for Love Huayang nianhua FALSE Hong Kong Mandarin Chinese In the Mood for Love In the Mood for Love FALSE Hong Kong English In the Mood for Love 花樣年華 FALSE Hong Kong Yue Chinese
This is a bit hard to interpret! In the IBMd data, whenever a record has TRUE for is_original we find that the country_code and the language_code are NULL. And here we have another record with the original title (Fa yeung nin wah), but it has an is_original value of FALSE. We can at least conclude that IMDb is doing something that is not well documented!
END. (C) TGG 2021, DJG 2022.