Relational Database Tutorial 2022/23

University of Cambridge, Computer Laboratory.

With HyperSQL

Contents

Introduction

This tutorial will quickly get you started with our relational database. You will learn more if you run each of these queries yourself and experiment with your own variations.

The database used in this course contains information about movies and people, taken from the Internet Movie Database (IMDb).

We use SQL as query language. SQL is a vast language, and this tutorial teaches you a small subset through examples. However, we do try to cover the most commonly used constructs in the query language.

Setting up HyperSQL

The database management system is HyperSQL Database (HSQLDB), which has the advantage of being quite easy to set up. The things you will learn in these exercises apply equally to more common databases you might have heard of, such as MySQL or PostgreSQL.

If you run into any problems, please post questions (or answers!) on the course's Moodle page.

You can use your own computer, or a college or university computer with Java installed. The examples should work equally well on Linux, Windows or Mac OS.

Download HyperSQL

Follow the instructions at HyperSQL Database (HSQLDB) in order to download the system. Once you have unzipped it, you should have a directory hsqldb-2.6.0 containing a subdirectory hsqldb. Within hsqldb is a subdirectory lib where the *.jar files reside.

Download the data and configuration file

Download the data from movies-relational.zip and unzip this to create the directory movies-relational (put it anywhere you want).

Next, download the configuration file db1a.rc. Follow the instructions in this file to add the full path to your movies-relational data. This file sets you up as the administrator for the db1a database with an empty password.

HyperSQL Shell

Now you are ready to run the HyperSQL shell with the following command line:

java -jar <full path to directory with jar files>sqltool.jar --rcfile=db1a.rc db1a

This should start a read-eval-print loop in which you can enter SQL expressions (terminated with a ;) and see the results. Type \? for help and \q to quit.

SQL by example

Exploring the movies table with select ... from

Our movie database contains the following tables

movies
people
genres
countries
languages
has_genre
has_position
plays_role
has_alternative

In terms of the Entity-Relationship model described in lectures, the first five tables contain entity sets while the last four implement relationships.

Let's start by using variations on the select ... from construct to explore the contents of the movies table.

The following query will list 10 records from this table:

select * from movies limit 10;

The * in the query specifies that every column from the table will be printed. We are using limit 10 since there are thousands of records in this table! Note that most systems today are not case sensitive but for some (strange) reason it has become a tradition to write SQL's keywords using CAPITALS, so you many people would write this query as

SELECT * FROM movies LIMIT 10;

We will write all queries using lower case. Both queries will produce the same output:

MOVIE_ID   TITLE                                              YEAR  TYPE   MINUTES  RATING    VOTES
---------  -------------------------------------------------  ----  -----  -------  ------  -------
tt0118694  In the Mood for Love                               2000  movie       98   8.1E0   136535
tt0120630  Chicken Run                                        2000  movie       84   7.0E0   183142
tt0120679  Frida                                              2002  movie      123   7.4E0    85132
tt0120737  The Lord of the Rings: The Fellowship of the Ring  2001  movie      178   8.8E0  1730797
tt0120903  X-Men                                              2000  movie      104   7.4E0   587971
tt0120917  The Emperor's New Groove                           2000  movie       78   7.4E0   189721
tt0121164  Corpse Bride                                       2005  movie       77   7.3E0   254444
tt0121766  Star Wars: Episode III - Revenge of the Sith       2005  movie      140   7.5E0   737660
tt0126029  Shrek                                              2001  movie       90   7.9E0   638466
tt0133240  Treasure Planet                                    2002  movie       95   7.2E0   109490

The movies table has the following columns:

movie_id: A unique identifier (key) for each movie.
title: The movie's title.
year: The release year of the movie.
type: Either movie or tvMovie.
minutes: Runtime of movie in minutes.
rating: A weighted average of all the individual user ratings.
votes: The number of votes received (by IMDb) for this movie.

Rather than using * we can be explicit about the which columns we want to see and in what order. For example:

select year, title, rating from movies limit 10;

This query produces the following output:

YEAR  TITLE                                              RATING
----  -------------------------------------------------  ------
2000  In the Mood for Love                                8.1E0
2000  Chicken Run                                         7.0E0
2002  Frida                                               7.4E0
2001  The Lord of the Rings: The Fellowship of the Ring   8.8E0
2000  X-Men                                               7.4E0
2000  The Emperor's New Groove                            7.4E0
2005  Corpse Bride                                        7.3E0
2005  Star Wars: Episode III - Revenge of the Sith        7.5E0
2001  Shrek                                               7.9E0
2002  Treasure Planet                                     7.2E0

The where construct allows us to select only those records that satisfy a given condition. Here we use it to select only movies released in 2017:

select title, rating from movies
where year = 2017
limit 10;

The output:

TITLE                                    RATING
---------------------------------------  ------
Wonder Woman                              7.4E0
It                                        7.3E0
The Greatest Showman                      7.6E0
The Foreigner                             7.0E0
A Dog's Purpose                           7.2E0
Blade Runner 2049                         8.0E0
The Upside                                7.0E0
Spider-Man: Homecoming                    7.4E0
Coco                                      8.4E0
Star Wars: Episode VIII - The Last Jedi   7.0E0

SQL has many reserved words. Our column name year happens to be an SQL reserved word that gets highlighted by the automation we are using. Sigh. Happily, most SQL systems are very liberal in allowing many reserved words to be used as column or table names.

The condition of the where clause can contain a complex boolean-valued constraint. For example, if we want to list movies from either 2016 or 2017 with titles containing the word Star we can write

select title, year, rating
from movies
where (year = 2016 or year = 2017) and title like '%Star%';

The output:

TITLE                                                      YEAR  RATING
---------------------------------------------------------  ----  ------
Star Wars: Episode VIII - The Last Jedi                    2017   7.0E0
Star Trek Beyond                                           2016   7.0E0
Rogue One: A Star Wars Story                               2016   7.8E0
Bright Lights: Starring Carrie Fisher and Debbie Reynolds  2016   8.0E0

Here the like operator matches strings against some pattern. The percent sign % in the pattern means that "any sub-string can appear here". Thus, title like '%Star%' matches only titles that contain the word "Star" as a sub-string.

How many movies are in our database? The following query will tell us:

select count(*) from movies;

The output:

We can use the group by construct to see how many records are associated with each year:

select year, count(*) as total
from movies
group by year;

The output:

YEAR  TOTAL
----  -----
2000     56
2002     55
2001     56
2005     63
2003     69
2008     72
2006     74
2004     79
2007     95
2009     81
2013     77
2015     75
2010     71
2011     79
2019     68
2012     68
2017     67
2014     79
2016     80
2020     36
2021     18
2018     62

The group by year clause tells the database system to group together all of the records having the same value for the year column and to then count the number of records in each group. Note that we have given the name total to the count(*) column using the as construct.

Your system probably did not list the years in order. We can fix that using the order by construct!

select year, count(*) as total
from movies
group by year
order by year;

The output:

YEAR  TOTAL
----  -----
2000     56
2001     56
2002     55
2003     69
2004     79
2005     63
2006     74
2007     95
2008     72
2009     81
2010     71
2011     79
2012     68
2013     77
2014     79
2015     75
2016     80
2017     67
2018     62
2019     68
2020     36
2021     18

If we want to list the output in reverse order we use the desc keyword:

select year, count(*) as total
from movies
group by year
order by year desc;

The output:

YEAR  TOTAL
----  -----
2021     18
2020     36
2019     68
2018     62
2017     67
2016     80
2015     75
2014     79
2013     77
2012     68
2011     79
2010     71
2009     81
2008     72
2007     95
2006     74
2005     63
2004     79
2003     69
2002     55
2001     56
2000     56

SQL contains many aggregate functions like count. Here is a query that uses the most common ones:

select type,
       count(*) as movie_count,
       min(votes) as min_votes,
       max(votes) as max_votes,
       sum(votes) as sum_of_votes,
       avg(votes) as averge_votes
from movies
group by type;

The output:

TYPE     MOVIE_COUNT  MIN_VOTES  MAX_VOTES  SUM_OF_VOTES  AVERGE_VOTES
-------  -----------  ---------  ---------  ------------  ------------
movie           1094      50082    2418636     292732180        267579
tvMovie          386       1000      34902       1523907          3947

We can also use arithmetic in expressions such as

select sum(minutes)/1440 as days, sum(votes)/count(*) as average, 17 * (max(votes) - min(votes)) as nonsense
from movies
where type = 'movie';

The output:

DAYS  AVERAGE  NONSENSE
----  -------  --------
  90   267579  40265418

Exploring the other tables

Before we go on to learn more SQL let's first use our current knowledge to explore the other tables in our database.

The people table

How many records are in the people table?

select count(*) from people;

The output:

Let's look at 10 records:

select * from people limit 10;

The output:

PERSON_ID  NAME                 BIRTHYEAR  DEATHYEAR
---------  -------------------  ---------  ---------
nm0000002  Lauren Bacall             1924       2014
nm0000005  Ingmar Bergman            1918       2007
nm0000032  Charlton Heston           1923       2008
nm0000056  Paul Newman               1925       2008
nm0000084  Gong Li                   1965
nm0000090  Armin Mueller-Stahl       1930
nm0000092  John Cleese               1939
nm0000093  Brad Pitt                 1963
nm0000095  Woody Allen               1935
nm0000096  Gillian Anderson          1968

The people table has the following columns:

person_id: A unique identifier (key) for each record.
name: An associated name.
birthyear: Birth year if the record represents a person.
deathyear: Death year if the record represents a person.

The blank columns represent null values (more on those below), so deathyear will be null if the person is still alive or if their deathyear has not been entered into the database.

The genres table

Let's take a look at the contents of the genres table with this query.

select * from genres;

The output:

GENRE_ID  GENRE
--------  -----------
       1  Comedy
       2  Crime
       3  Drama
       4  Action
       5  Adventure
       6  Animation
       7  Sci-Fi
       8  Musical
       9  Thriller
      10  Music
      11  Horror
      12  Mystery
      13  Biography
      14  Documentary
      15  Sport
      16  History
      17  Romance
      18  Family
      19  Fantasy
      20  War
      21  Western
      22  News

This table has the following columns:

genre_id: A unique key for each genre.
genre: The genre.

The has_genre table

The has_genre table implements a relationship between movies and genres.

select * from has_genre
limit 10;

The output:

MOVIE_ID   GENRE_ID
---------  --------
tt0118694         3
tt0118694        17
tt0120630         1
tt0120630         5
tt0120630         6
tt0120679         3
tt0120679        13
tt0120679        17
tt0120737         3
tt0120737         4

This table has the following columns:

movie_id: A foreign key pointing into the movies table.
genre_id: A foreign key pointing into the genres table.

Both columns together form the key. Why is that? Because it implements a many-to-many relationship (see lectures).

The has_position table

Let's look at 10 records in the has_position table:

select * from has_position limit 10;

The output:

PERSON_ID  MOVIE_ID   POSITION
---------  ---------  --------
nm0000002  tt0276919  actor
nm0000005  tt0299478  director
nm0000032  tt0310793  self
nm0000056  tt0317219  actor
nm0000084  tt0212712  actor
nm0000090  tt0765443  actor
nm0000092  tt0481499  writer
nm0000093  tt0208092  actor
nm0000093  tt0240772  actor
nm0000093  tt0266987  actor

The has_position table has the following columns:

movie_id: A foreign key pointing into the movies table.
person_id: A foreign key pointing into the people table.
position: The primary job of the associated person in the associated movie.

Since this table implements are relationship between movies and people, the key formed of the pair movie_id and person_id. (Note: The key should really be all three columns since people can have multiple positions in the same movie. For some reason the IMDb data source has been constructed in a way that only one position per movie/person is listed. I hope they fix this in the future ....)

Let's see what values the position column contains:

select position, count(*) as total
from has_position
group by position
order by total desc;

The output:

POSITION  TOTAL
--------  -----
actor      5514
producer   2552
writer     2499
director   1600
self        339

The plays_role table

Let's look at 10 records in the plays_role table:

select * from plays_role limit 10;

which produces the output:

PERSON_ID  MOVIE_ID   ROLE
---------  ---------  --------------
nm0000002  tt0276919  Ma Ginger
nm0000032  tt0310793  Self
nm0000056  tt0317219  Doc Hudson
nm0000084  tt0212712  Su Li-zhen
nm0000090  tt0765443  Semyon
nm0000093  tt0208092  Mickey O'Neil
nm0000093  tt0240772  Rusty Ryan
nm0000093  tt0266987  Tom Bishop
nm0000093  tt0332452  Achilles
nm0000093  tt0361748  Lt. Aldo Raine

The plays_role table has the following columns:

movie_id: A foreign key pointing into the movies table.
person_id: A foreign key pointing into the people table.
role: A role associated with the person in the associated movie.

Here the key is a triple: all columns together form the key. Why is that?

The real power of SQL: joining tables together

Let's do our first join! This query will "join together" records in the movies and has_genre tables in order to list movie titles from 2017 together with their genres:

select title, genre
from movies
join has_genre on has_genre.movie_id = movies.movie_id
join genres on genres.genre_id = has_genre.genre_id
where year = 2017
limit 20;

The output:

TITLE                   GENRE
----------------------  ---------
Wonder Woman            Action
Wonder Woman            Adventure
Wonder Woman            Fantasy
It                      Horror
The Greatest Showman    Drama
The Greatest Showman    Musical
The Greatest Showman    Biography
The Foreigner           Action
The Foreigner           Thriller
A Dog's Purpose         Comedy
A Dog's Purpose         Drama
A Dog's Purpose         Adventure
Blade Runner 2049       Drama
Blade Runner 2049       Action
Blade Runner 2049       Mystery
The Upside              Comedy
The Upside              Drama
Spider-Man: Homecoming  Action
Spider-Man: Homecoming  Adventure
Spider-Man: Homecoming  Sci-Fi

As we can see, some movies are associated with multiple genres. This answers the question of why movie_id and genre together form the key for the has_genre table! (The same reasoning goes for the plays_role table --- an actor can play more than one role in a movie.)

We have used the join .. on construct. Our query looks for movie records with year = 2017 and then joins together each such record with every record in the genres table that has the same movie_id value. Note the condition has_genre.movie_id = movies.movie_id disambiguates which instance of the column movie_id we are referring to since both the movies and has_genre tables share a column with the same name.

We can write the same query using the as construct that gives a local abbreviation for a table's name:

select title, genre
from movies as m
join has_genre as hg on hg.movie_id = m.movie_id
join genres as g on g.genre_id = hg.genre_id
where year = 2017
limit 20;

This kind of abbreviation will be essential when our joins contain more than one instance of the same table. For example, suppose we want to list the popular (votes > 100000) romantic comedies (RomComs)--- those movies that are associated with the genres Romance and Comedy:

select title, year, rating, votes
from movies as m
join has_genre as hg1 on hg1.movie_id = m.movie_id
join has_genre as hg2 on hg2.movie_id = m.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where m.votes > 100000 and g1.genre = 'Romance' and g2.genre = 'Comedy'
order by votes desc;

The output:

TITLE                       YEAR  RATING   VOTES
--------------------------  ----  ------  ------
Amélie                      2001   8.3E0  724870
Silver Linings Playbook     2012   7.7E0  680554
Crazy, Stupid, Love.        2011   7.4E0  491840
500 Days of Summer          2009   7.7E0  488294
Love Actually               2003   7.6E0  447788
The Terminal                2004   7.4E0  427126
The 40-Year-Old Virgin      2005   7.1E0  413043
Midnight in Paris           2011   7.7E0  402816
Easy A                      2010   7.0E0  372297
Moonrise Kingdom            2012   7.8E0  331243
Up in the Air               2009   7.4E0  326286
50/50                       2011   7.6E0  322151
Meet the Parents            2000   7.0E0  320664
Pitch Perfect               2012   7.1E0  289345
Forgetting Sarah Marshall   2008   7.1E0  273480
Marriage Story              2019   7.9E0  272541
Vicky Cristina Barcelona    2008   7.1E0  248490
The Artist                  2011   7.9E0  235863
The Lobster                 2015   7.2E0  234450
Garden State                2004   7.4E0  211459
P.S. I Love You             2007   7.0E0  204346
I Love You, Man             2009   7.0E0  198908
Sideways                    2004   7.5E0  185086
About a Boy                 2002   7.1E0  178772
Definitely, Maybe           2008   7.1E0  160408
Punch-Drunk Love            2002   7.3E0  155100
The Spectacular Now         2013   7.1E0  148293
Lars and the Real Girl      2007   7.3E0  141754
Good Bye Lenin!             2003   7.7E0  141680
It's Kind of a Funny Story  2010   7.1E0  139813
Love, Rosie                 2014   7.2E0  129885
The Kids Are All Right      2010   7.0E0  128165
The Big Sick                2017   7.5E0  127743
Dil Bechara                 2020   8.1E0  124240
Safety Not Guaranteed       2012   7.0E0  123673
Love, Simon                 2018   7.6E0  105950

The nested query above checked that only a single value was in a relation. Multiple value can be checked as the following example illustrates (note the use of parentheses around the two values pr.person_id and pr.movie_id):

select count(*) as total
from plays_role as pr
where (pr.person_id, pr.movie_id)
not in (select hp.person_id, hp.movie_id
        from has_position as hp
        where hp.position = 'actor' or hp.position = 'self');

The output:

Interesting! Does this tell us that there is some (unwanted) redundancy in our IMDb data?

Nested queries

Now suppose we want to see what other genres RomComs are associated with. One way to do this is with a nested query:

select m1.title, m1.year, g.genre, m1.rating, m1.votes
from movies as m1
join has_genre as hg on hg.movie_id = m1.movie_id
join genres as g on g.genre_id = hg.genre_id
where m1.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
      and m1.movie_id in
         (select m2.movie_id
         from movies as m2
         join has_genre as hg1 on hg1.movie_id = m2.movie_id
         join has_genre as hg2 on hg2.movie_id = m2.movie_id
         join genres as g1 on g1.genre_id = hg1.genre_id
         join genres as g2 on g2.genre_id = hg2.genre_id
         where g1.genre = 'Romance' and g2.genre = 'Comedy')
order by m1.votes desc
limit 10;

The output:

TITLE                    YEAR  GENRE    RATING   VOTES
-----------------------  ----  -------  ------  ------
Silver Linings Playbook  2012  Drama     7.7E0  680554
Crazy, Stupid, Love.     2011  Drama     7.4E0  491840
500 Days of Summer       2009  Drama     7.7E0  488294
Love Actually            2003  Drama     7.6E0  447788
The Terminal             2004  Drama     7.4E0  427126
Midnight in Paris        2011  Fantasy   7.7E0  402816
Easy A                   2010  Drama     7.0E0  372297
Moonrise Kingdom         2012  Drama     7.8E0  331243
Up in the Air            2009  Drama     7.4E0  326286
50/50                    2011  Drama     7.6E0  322151

Here (select m2.movie_id ...) is called a nested query. We check if our movie is a RomCom, and then check if there is another genre that is not equal to either Romance or Comedy.

Views

We can introduce a name that represents a query using views. For example, the following SQL defines a view romcom_ids to contain all those movie_id values for RomComs. This view was placed in your database by the command create view romcom_ids ...:

create view romcom_ids as
    select m.movie_id as movie_id
    from movies as m
    join has_genre as hg1 on hg1.movie_id = m.movie_id
    join has_genre as hg2 on hg2.movie_id = m.movie_id
    join genres as g1 on g1.genre_id = hg1.genre_id
    join genres as g2 on g2.genre_id = hg2.genre_id
    where g1.genre = 'Romance' and g2.genre = 'Comedy';

The name romcom_ids can now be used as if it were the name of a data base table. For example:

select m.title, m.year, g.genre, m.rating, m.votes
from romcom_ids as r
join has_genre as hg on hg.movie_id = r.movie_id
join genres as g on g.genre_id = hg.genre_id
join movies as m on m.movie_id = r.movie_id
where m.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
order by m.votes desc
limit 10;

This query will will return the same result as the nested RomCom query from above:

TITLE                    YEAR  GENRE    RATING   VOTES
-----------------------  ----  -------  ------  ------
Silver Linings Playbook  2012  Drama     7.7E0  680554
Crazy, Stupid, Love.     2011  Drama     7.4E0  491840
500 Days of Summer       2009  Drama     7.7E0  488294
Love Actually            2003  Drama     7.6E0  447788
The Terminal             2004  Drama     7.4E0  427126
Midnight in Paris        2011  Fantasy   7.7E0  402816
Easy A                   2010  Drama     7.0E0  372297
Moonrise Kingdom         2012  Drama     7.8E0  331243
Up in the Air            2009  Drama     7.4E0  326286
50/50                    2011  Drama     7.6E0  322151

As mentioned above, the has_position table implements multiple many-to-many relationships. Your database contains these five views that can simplify some of your queries over has_position:

create view acts_in as
   select person_id, movie_id
   from has_position
   where position = 'actor';

create view produced as
   select person_id, movie_id
   from has_position
   where position = 'producer';

create view wrote as
   select person_id, movie_id
   from has_position
   where position = 'writer';

create view directed as
   select person_id, movie_id
   from has_position
   where position = 'director';

create view is_self as
   select person_id, movie_id
   from has_position
   where position = 'self';

To see a list of all views defined, type \dv and to see all tables type \dt.

We need to talk about null

SQL has a special value null that is used when a value is not known or is somehow not relevant. We saw null values for the column deathYear in the people table:

select * from people limit 10;

The output:

PERSON_ID  NAME                 BIRTHYEAR  DEATHYEAR
---------  -------------------  ---------  ---------
nm0000002  Lauren Bacall             1924       2014
nm0000005  Ingmar Bergman            1918       2007
nm0000032  Charlton Heston           1923       2008
nm0000056  Paul Newman               1925       2008
nm0000084  Gong Li                   1965
nm0000090  Armin Mueller-Stahl       1930
nm0000092  John Cleese               1939
nm0000093  Brad Pitt                 1963
nm0000095  Woody Allen               1935
nm0000096  Gillian Anderson          1968

The HyperSQL system will usually print null as a blank (other systems may behave differently).

Let's try to count the number of people records with a null in the deathYear column:

select count(*)
from people
where deathYear = null;

The output:

Hmmm. What is going on here? As will be explained further in lectures, with the null we enter the world of 3-valued logic. For two expressions e1 and e2 the expression e1 = e2 can evaluate to true or false or null (which can be read as we don't know the answer). Equality returns null when either e1 or e2 evaluates to null. For example, column = null always returns null. Furthermore, the SELECT ... WHERE construct only returns records for those cases where the where clause returns true.

To get around this problem SQL introduced the is null boolean predicate:

select count(*)
from people
where deathYear is null;

The output:

Generating null values with left join

Sometimes we want to generate null values when we are joining tables together. Let's look at an example.

First, here are the the people credited with work on "Silver Linings Playbook":

select name, position
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
where title = 'Silver Linings Playbook';

The output:

NAME               POSITION
-----------------  --------
Robert De Niro     actor
Bruce Cohen        producer
Bradley Cooper     actor
Donna Gigliotti    producer
Jonathan Gordon    producer
David O. Russell   director
Jacki Weaver       actor
Jennifer Lawrence  actor
Matthew Quick      writer

Let's modify that query in order to see what roles the actors played in that movie:

select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';

The output:

NAME               POSITION  ROLE
-----------------  --------  ----------------
Robert De Niro     actor     Pat Sr.
Bradley Cooper     actor     Pat
Jacki Weaver       actor     Dolores Solitano
Jennifer Lawrence  actor     Tiffany

Well, that's nice but we lost the records associated with non-actors since they do not join with records in the plays_role table! How can we retain them?

We solve this problem by using the left join construct:

select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
left join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';

The output:

NAME               POSITION  ROLE
-----------------  --------  ----------------
Robert De Niro     actor     Pat Sr.
Bruce Cohen        producer  [null]
Bradley Cooper     actor     Pat
Donna Gigliotti    producer  [null]
Jonathan Gordon    producer  [null]
David O. Russell   director  [null]
Jacki Weaver       actor     Dolores Solitano
Jennifer Lawrence  actor     Tiffany
Matthew Quick      writer    [null]

Note that in this case our user-interface displays [null] instead of whitespace. Why? That's because the type of this column is string, and this display wants us to be able to distinguish between a null value and the empty string!

The left join construct.

left join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id

When there is no matching record in plays_role, the results (role in this case) will be given the null value.

select distinct

Let's try to write a query that returns the movies in which Jennifer Lawrence played the same role:

select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
      and r1.role = r2.role
order by r1.role, m1.title, m1.year;

The output:

ROLE               TITLE                            YEAR
-----------------  -------------------------------  ----
Aurora Lane        Passengers                       2016
Katniss Everdeen   The Hunger Games                 2012
Katniss Everdeen   The Hunger Games                 2012
Katniss Everdeen   The Hunger Games: Catching Fire  2013
Katniss Everdeen   The Hunger Games: Catching Fire  2013
Mystique           X-Men: First Class               2011
Raven              X-Men: First Class               2011
Ree                Winter's Bone                    2010
Rosalyn Rosenfeld  American Hustle                  2013
Tiffany            Silver Linings Playbook          2012

Hmmm. This didn't work -- we have every role Jennifer Lawrence played in any movie, with duplicate rows as well. Why? Suppose our query finds two movies, say A and B. If they are the same movie (A=B), then we we return the role played in that movie. Otherwise, there are two orders in which the query could process these movies -- first A then B, and first B then A. In each case we will return the same result, and this explains the duplicates rows.

Let's avoid the first problem by a modification to the where clause by adding m1.movie_id <> m2.movie_id:

select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
      and r1.role = r2.role
      and m1.movie_id <> m2.movie_id
order by r1.role, m1.title, m1.year;

The output:

ROLE              TITLE                            YEAR
----------------  -------------------------------  ----
Katniss Everdeen  The Hunger Games                 2012
Katniss Everdeen  The Hunger Games: Catching Fire  2013

Wow! That seems to have solved both our problems! But did it really? As a sanity check let's try another actor:

select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
      and r1.role = r2.role
      and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;

The output:

ROLE              TITLE                                  YEAR
----------------  -------------------------------------  ----
Lisbeth Salander  The Girl Who Kicked the Hornet's Nest  2009
Lisbeth Salander  The Girl Who Kicked the Hornet's Nest  2009
Lisbeth Salander  The Girl Who Played with Fire          2009
Lisbeth Salander  The Girl Who Played with Fire          2009
Lisbeth Salander  The Girl with the Dragon Tattoo        2009
Lisbeth Salander  The Girl with the Dragon Tattoo        2009

Ooops! We still have duplicate rows! What's going on? (Before reading on, see if you can figure out the answer.)

In this case there are three movies involved, say say A, B, and C. So, how many ways can I find a pair of distinct movies with A first? Well, just A, B and A, C. So the query returns two identical rows for movie A. The same goes for B and C. In general, if there are k movies with the same role, then we will get k-1 copies of each! This explains what happened with Jennifer_Lawrence. In that case k = 2 and k-1 = 1, so no duplicate rows! We just got lucky!

All of this is due to the fact that SQL is not based on sets of records, but rather on multisets of records (sets containing duplicates, also called bags). If we want a set rather than a multiset, we can use the select distinct construct:

select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
      and r1.role = r2.role
      and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;

The output:

ROLE              TITLE                                  YEAR
----------------  -------------------------------------  ----
Lisbeth Salander  The Girl Who Kicked the Hornet's Nest  2009
Lisbeth Salander  The Girl Who Played with Fire          2009
Lisbeth Salander  The Girl with the Dragon Tattoo        2009

Yippee! This seems correct. Just to be sure let's sanity check again and try another actor:

select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Scarlett Johansson'
      and r1.role = r2.role
      and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;

The output:

ROLE              TITLE                                YEAR
----------------  -----------------------------------  ----
Black Widow       Captain America: Civil War           2016
Natasha Romanoff  Captain America: Civil War           2016
Black Widow       Captain America: The Winter Soldier  2014
Natasha Romanoff  Captain America: The Winter Soldier  2014
Black Widow       The Avengers                         2012
Natasha Romanoff  The Avengers                         2012

Did this result surprised you? Is it correct? It turns out that Scarlett Johansson played two distinct roles in each of these movies. So it does seem correct. But it also forces us to admit that the original English specification, returns the movies in which so-and-so played the same role, was rather vague!

The countries, languages, and has_alternative tables

Note: These tables have been added for the Michaelmas term of 2020. Just for fun!

It turns out that many movies have alternative titles in different parts of the world and in different languages. These are recorded in the has_alternative table. This table really represents both a one-to-many relationship together with a (weak) entity set that have been implemented as a single table.

Let's take a look at 20 rows:

select * from has_alternative limit 20;

The output:

MOVIE_ID   ALT_ID  TITLE                                       COUNTRY_CODE  LANGUAGE_CODE  IS_ORIGINAL
---------  ------  ------------------------------------------  ------------  -------------  -----------
tt0118694       0  In the Mood for Love                        SE            [null]         FALSE
tt0118694       1  Faajoeng ninwaa                             HK            yue            FALSE
tt0118694       2  Fa yeung nin wah                            ID            [null]         FALSE
tt0118694       3  In the Mood for Love                        NO            [null]         FALSE
tt0118694       4  Szerelemre hangolva                         HU            [null]         FALSE
tt0118694       5  Con ánimo de amar                           PE            [null]         FALSE
tt0118694       6  Fa Yeung Nin Wah                            IN            [null]         FALSE
tt0118694       7  Raspoloženi za ljubav                       HR            [null]         FALSE
tt0118694       8  Disponível Para Amar                        PT            [null]         FALSE
tt0118694       9  Huayang nianhua                             HK            cmn            FALSE
tt0118694      10  In the Mood for Love - Der Klang der Liebe  DE            [null]         FALSE
tt0118694      11  Les silences du désir                       CA            fr             FALSE
tt0118694      12  Aşk Zamanı                                  TR            tr             FALSE
tt0118694      13  In the Mood for Love                        IN            hi             FALSE
tt0118694      14  Dar hal va havaye eshgh                     IR            fa             FALSE
tt0118694      15  Spragnieni miłości                          PL            [null]         FALSE
tt0118694      16  In the Mood for Love                        HK            en             FALSE
tt0118694      17  Desitjant estimar                           ES            ca             FALSE
tt0118694      18  花樣年華                                        HK            yue            FALSE
tt0118694      19  In the Mood for Love                        DK            [null]         FALSE

The has_alternative table has the following columns:

movie_id: A unique identifier (key) for each movie.
alt_id: A discriminator for this week entity.
title: An alternative title for the movie.
country_code: An ISO 3166 code for a country.
language_code: An ISO 639 code for a language.
is_original: A boolean value that is documented by IMDb as TRUE iff original title. We will see below that there must be something missing from this documentation.

The IMDb data does not include a mapping for language and country codes, but we have supplied these in the languages and countries tables.

Let's take a look at 10 rows of the languages table:

select * from languages limit 10;

The output:

CODE  NAME
----  ----------------------------
af    Afrikaans
ar    Arabic
bg    Bulgarian
bn    Bengali; Bangla
bs    Bosnian
ca    Catalan; Valencian
cmn   Mandarin Chinese
cs    Czech
de    German
el    Greek, Modern (1453-); Greek

Let's take a look at 10 rows of the countries table:

select * from countries limit 10;

The output:

CODE  NAME
----  ----------------------
AE    United Arab Emirates
AL    Albania
AM    Armenia
AR    Argentina
AT    Austria
AU    Australia
AW    Aruba
AZ    Azerbaijan
BA    Bosnia and Herzegovina
BD    Bangladesh

Now let's use joins to look at the alternative titles of the very popular movie In the Mood for Love (movie_id is tt0118694):

select m.title as title,
       a.title as alt_title,
       a.is_original as orig,
       c.name as country,
       l.name as language
from movies as m
join has_alternative as a on a.movie_id = m.movie_id
left join countries as c on a.country_code = c.code
left join languages as l on a.language_code = l.code
where m.movie_id = 'tt0118694';

The output:

TITLE                 ALT_TITLE                                   ORIG   COUNTRY                                               LANGUAGE
--------------------  ------------------------------------------  -----  ----------------------------------------------------  ------------------
In the Mood for Love  In the Mood for Love                        FALSE  Sweden                                                [null]
In the Mood for Love  Faajoeng ninwaa                             FALSE  Hong Kong                                             Yue Chinese
In the Mood for Love  Fa yeung nin wah                            FALSE  Indonesia                                             [null]
In the Mood for Love  In the Mood for Love                        FALSE  Norway                                                [null]
In the Mood for Love  Szerelemre hangolva                         FALSE  Hungary                                               [null]
In the Mood for Love  Con ánimo de amar                           FALSE  Peru                                                  [null]
In the Mood for Love  Fa Yeung Nin Wah                            FALSE  India                                                 [null]
In the Mood for Love  Raspoloženi za ljubav                       FALSE  Croatia                                               [null]
In the Mood for Love  Disponível Para Amar                        FALSE  Portugal                                              [null]
In the Mood for Love  Huayang nianhua                             FALSE  Hong Kong                                             Mandarin Chinese
In the Mood for Love  In the Mood for Love - Der Klang der Liebe  FALSE  Germany                                               [null]
In the Mood for Love  Les silences du désir                       FALSE  Canada                                                French
In the Mood for Love  Aşk Zamanı                                  FALSE  Turkey                                                Turkish
In the Mood for Love  In the Mood for Love                        FALSE  India                                                 Hindi
In the Mood for Love  Dar hal va havaye eshgh                     FALSE  Iran, Islamic Republic of                             Persian; Farsi
In the Mood for Love  Spragnieni miłości                          FALSE  Poland                                                [null]
In the Mood for Love  In the Mood for Love                        FALSE  Hong Kong                                             English
In the Mood for Love  Desitjant estimar                           FALSE  Spain                                                 Catalan; Valencian
In the Mood for Love  花樣年華                                        FALSE  Hong Kong                                             Yue Chinese
In the Mood for Love  In the Mood for Love                        FALSE  Denmark                                               [null]
In the Mood for Love  Любовное настроение                         FALSE  Russian Federation                                    [null]
In the Mood for Love  Любовний настрій                            FALSE  Ukraine                                               [null]
In the Mood for Love  In the Mood for Love                        FALSE  Canada                                                English
In the Mood for Love  Meilės laukimas                             FALSE  Lithuania                                             [null]
In the Mood for Love  Stvořeni pro lásku                          FALSE  Czechia                                               [null]
In the Mood for Love  In the Mood for Love                        FALSE  Not an ISO-3166 country?                              English
In the Mood for Love  In the Mood for Love                        FALSE  Australia                                             [null]
In the Mood for Love  Flower Like Years                           FALSE  Hong Kong                                             English
In the Mood for Love  Fa yeung nin wah                            TRUE   [null]                                                [null]
In the Mood for Love  Ερωτική επιθυμία                            FALSE  Greece                                                [null]
In the Mood for Love  In the Mood for Love                        FALSE  Italy                                                 [null]
In the Mood for Love  In the Mood for Love                        FALSE  France                                                [null]
In the Mood for Love  Amor à Flor da Pele                         FALSE  Brazil                                                [null]
In the Mood for Love  Deseando amar                               FALSE  Spain                                                 [null]
In the Mood for Love  O iubire imposibilă                         FALSE  Romania                                               [null]
In the Mood for Love  Расположени за љубав                        FALSE  Serbia                                                [null]
In the Mood for Love  Kayô nenka                                  FALSE  Japan                                                 Japanese
In the Mood for Love  Deseando amar                               FALSE  Mexico                                                [null]
In the Mood for Love  In the Mood for Love                        FALSE  United States of America                              [null]
In the Mood for Love  Con ánimo de amar                           FALSE  Argentina                                             [null]
In the Mood for Love  花様年華                                        FALSE  Japan                                                 Japanese
In the Mood for Love  Tâm trạng khi yêu                           FALSE  Viet Nam                                              [null]
In the Mood for Love  Con ánimo de amar                           FALSE  Uruguay                                               [null]
In the Mood for Love  Three Stories About Food                    FALSE  Not an ISO-3166 country?                              English
In the Mood for Love  花樣年華                                        FALSE  Taiwan, Province of China                             [null]
In the Mood for Love  Fayeung ninwah                              FALSE  Hong Kong                                             Yue Chinese
In the Mood for Love  In the Mood for Love                        FALSE  Spain                                                 [null]
In the Mood for Love  Fi mezaj lelhob                             FALSE  Egypt                                                 Arabic
In the Mood for Love  Valmis armastuseks                          FALSE  Estonia                                               [null]
In the Mood for Love  In the Mood for Love                        FALSE  Finland                                               [null]
In the Mood for Love  花样年华                                        FALSE  China                                                 Mandarin Chinese
In the Mood for Love  In the Mood for Love                        FALSE  United Kingdom of Great Britain and Northern Ireland  [null]
In the Mood for Love  Rakkaus käy taloksi                         FALSE  Finland                                               [null]
In the Mood for Love  Fa Yeung Nin Wah                            FALSE  Canada                                                [null]
In the Mood for Love  Beijing Summer                              FALSE  Hong Kong                                             English

We can see that the data is a little messy (like many real-world tables!). There are quite a few missing languages as well as some countries with codes that don't seem to be in the ISO-3666 standard.

Let's use this movie to explore the meaning of the is_original column:

select m.title as title,
       a.title as alt_title,
       a.is_original as orig,
       c.name as country,
       l.name as language
from movies as m
join has_alternative as a on a.movie_id = m.movie_id
left join countries as c on a.country_code = c.code
left join languages as l on a.language_code = l.code
where m.movie_id = 'tt0118694'
  and (a.is_original = TRUE or c.name= 'Hong Kong')
order by a.title;

The output:

TITLE                 ALT_TITLE             ORIG   COUNTRY    LANGUAGE
--------------------  --------------------  -----  ---------  ----------------
In the Mood for Love  Beijing Summer        FALSE  Hong Kong  English
In the Mood for Love  Fa yeung nin wah      TRUE   [null]     [null]
In the Mood for Love  Faajoeng ninwaa       FALSE  Hong Kong  Yue Chinese
In the Mood for Love  Fayeung ninwah        FALSE  Hong Kong  Yue Chinese
In the Mood for Love  Flower Like Years     FALSE  Hong Kong  English
In the Mood for Love  Huayang nianhua       FALSE  Hong Kong  Mandarin Chinese
In the Mood for Love  In the Mood for Love  FALSE  Hong Kong  English
In the Mood for Love  花樣年華                  FALSE  Hong Kong  Yue Chinese

This is a bit hard to interpret! In the IBMd data, whenever a record has TRUE for is_original we find that the country_code and the language_code are NULL. And here we have another record with the original title (Fa yeung nin wah), but it has an is_original value of FALSE. We can at least conclude that IMDb is doing something that is not well documented!

END. (C) TGG 2021, DJG 2022.