Relational Database Tutorial

With HyperSQL

Contents

Introduction

This tutorial will quickly get you started with our relational database. You will learn more if you run each of these queries yourself and experiment with your own variations. Once you have done this you will be ready to solve the practical exercises.

The database used in this course contains information about movies and people, taken from the Internet Movie Database (IMDb).

We use SQL as query language. SQL is a vast language, and this tutorial teaches you a small subset through examples. However, we do try to cover the most commonly used constructs in the query language.

Setting up HyperSQL

The database management system is HyperSQL Database (HSQLDB), which has the advantage of being quite easy to set up. The things you will learn in these exercises apply equally to more common databases you might have heard of, such as MySQL or PostgreSQL.

If you run into any problems, please post questions (or answers!) on the course's Moodle page.

You can use your own computer, or a college or university computer with Java installed. The examples should work equally well on Linux, Windows or Mac OS.

Download HyperSQL

Follow the instructions at HyperSQL Database (HSQLDB) in order to dowload the system. Once you have unzipped it, you should have a directory hsqldb-2.5.0 containing a subdirectory hsqldb. Within hsqldb is a subdirectory lib where the *.jar files reside.

Download the data and configuration file

Download the data from movies-relational.zip and unzip this to create the directory movies-relational (put it anywhere you want).

Next, download the configuration file db1a.rc. Follow the instructions in this file to add the full path to your movies-relational data. This file sets you up as the administrator for the db1a database with an empty password.

HyperSQL Shell

Now you are ready to run the HyperSQL shell with the following command line:

java -jar <full path to directory with jar files>sqltool.jar --rcfile=db1a.rc db1a

This should start a read-eval-print loop in which you can enter SQL expressions (terminated with a ;) and see the results. Type ? for help and q to quit.

SQL by example

Exploring the movies table with select ... from

Our movie database contains the following tables

  • movies
  • people
  • genres
  • has_genre
  • has_position
  • plays_role

In terms of the Entity-Relationshp model described in lectures, the first three tables contain entities while the last three implement relationships.

Let's start by using variations on the select ... from construct to explore the contents of the movies table.

The following query will list 10 records from this table:

select * from movies limit 10;

The * in the query specifies that every column from the table will be printed. We are using limit 10 since there are thousands of records in this table! Note that most systems today are not case sensitive but for some (strange) reason it has become a tradition to write SQL's keywords using CAPITALS, so you many people would write this query as

SELECT * FROM movies LIMIT 10;

We will write all queries using lower case. Both queries will produce the same output:

MOVIE_ID   TITLE                                              YEAR  TYPE   MINUTES  RATING    VOTES
---------  -------------------------------------------------  ----  -----  -------  ------  -------
tt0118694  In the Mood for Love                               2000  movie       98   8.1E0   107729
tt0120630  Chicken Run                                        2000  movie       84   7.0E0   165659
tt0120679  Frida                                              2002  movie      123   7.4E0    75287
tt0120737  The Lord of the Rings: The Fellowship of the Ring  2001  movie      178   8.8E0  1530145
tt0120903  X-Men                                              2000  movie      104   7.4E0   542044
tt0120917  The Emperor's New Groove                           2000  movie       78   7.3E0   163952
tt0121164  Corpse Bride                                       2005  movie       77   7.3E0   225638
tt0121766  Star Wars: Episode III - Revenge of the Sith       2005  movie      140   7.5E0   647896
tt0126029  Shrek                                              2001  movie       90   7.8E0   569925
tt0133240  Treasure Planet                                    2002  movie       95   7.2E0    92041

The movies table has the following columns:

movie_id
A unique identifier (key) for each movie.
title
The movie's title.
year
The release year of the movie.
type
Either movie or tvMovie.
minutes
Runtime of movie in minutes.
rating
A weighted average of all the individual user ratings.
votes
The number of votes received (by IMDb) for this movie.

Rather than using * we can be explicit about the which columns we want to see and in what order. For example:

select year, title, rating from movies limit 10;

This query produces the following output:

YEAR  TITLE                                              RATING
----  -------------------------------------------------  ------
2000  In the Mood for Love                                8.1E0
2000  Chicken Run                                         7.0E0
2002  Frida                                               7.4E0
2001  The Lord of the Rings: The Fellowship of the Ring   8.8E0
2000  X-Men                                               7.4E0
2000  The Emperor's New Groove                            7.3E0
2005  Corpse Bride                                        7.3E0
2005  Star Wars: Episode III - Revenge of the Sith        7.5E0
2001  Shrek                                               7.8E0
2002  Treasure Planet                                     7.2E0

The where construct allows us to select only those records that satisfy a given condition. Here we use it to select only movies released in 2017:

select title, rating from movies
where year = 2017
limit 10;

The output:

TITLE                     RATING
------------------------  ------
Wonder Woman               7.4E0
It                         7.4E0
The Greatest Showman       7.6E0
The Foreigner              7.0E0
A Dog's Purpose            7.1E0
Blade Runner 2049          8.0E0
Spider-Man: Homecoming     7.5E0
Coco                       8.4E0
Star Wars: The Last Jedi   7.1E0
Wonder                     8.0E0

SQL has many reserved words. Our column name year happens to be an SQL reserved word that gets highlighted by the automation we are using. Sigh. Happily, most SQL systems are very liberal in allowing many reserved words to be used as column or table names.

The condition of the where clause can contain a complex boolean-valued constraint. For example, if we want to list movies from either 2016 or 2017 with titles containing the word Star we can write

select title, year, rating
from movies
where (year = 2016 or year = 2017) and title like '%Star%';

The output:

TITLE                                                      YEAR  RATING
---------------------------------------------------------  ----  ------
Star Wars: The Last Jedi                                   2017   7.1E0
Star Trek Beyond                                           2016   7.1E0
Rogue One: A Star Wars Story                               2016   7.8E0
Bright Lights: Starring Carrie Fisher and Debbie Reynolds  2016   8.0E0

Here the like operator matches strings against some pattern. The percent sign % in the pattern means that "any sub-string can appear here". Thus, title like '%Star%' matches only titles that contain the word "Star" as a sub-string.

How many movies are in our database? The following query will tell us:

select count(*) from movies;

The output:

1236

We can use the group by construct to see how many records are associated with each year:

select year, count(*) as total
from movies
group by year;

The output:

YEAR  TOTAL
----  -----
2000     51
2002     54
2001     52
2005     58
2003     63
2008     67
2006     70
2004     72
2007     88
2009     74
2013     71
2015     60
2010     68
2011     71
2019     20
2012     63
2017     58
2014     65
2016     70
2018     41

The group by year clause tells the database system to group together all of the records having the same value for the year column and to then count the number of records in each group. Note that we have given the name total to the count(*) column using the as construct.

Your system probably did not list the years in order. We can fix that using the order by construct!

select year, count(*) as total
from movies
group by year
order by year;

The output:

YEAR  TOTAL
----  -----
2000     51
2001     52
2002     54
2003     63
2004     72
2005     58
2006     70
2007     88
2008     67
2009     74
2010     68
2011     71
2012     63
2013     71
2014     65
2015     60
2016     70
2017     58
2018     41
2019     20

If we want to list the output in reverse order we use the desc keyword:

select year, count(*) as total
from movies
group by year
order by year desc;

The output:

YEAR  TOTAL
----  -----
2019     20
2018     41
2017     58
2016     70
2015     60
2014     65
2013     71
2012     63
2011     71
2010     68
2009     74
2008     67
2007     88
2006     70
2005     58
2004     72
2003     63
2002     54
2001     52
2000     51

SQL contains many aggregate functions like count. Here is a query that uses the most common ones:

select type,
       count(*) as movie_count,
       min(votes) as min_votes,
       max(votes) as max_votes,
       sum(votes) as sum_of_votes,
       avg(votes) as averge_votes
from movies
group by type;

The output:

TYPE     MOVIE_COUNT  MIN_VOTES  MAX_VOTES  SUM_OF_VOTES  AVERGE_VOTES
-------  -----------  ---------  ---------  ------------  ------------
movie            947      50210    2093718     242717815        256301
tvMovie          289       1011      31496       1131873          3916

We can also use arithmetic in expressions such as

select sum(minutes)/1440 as days, sum(votes)/count(*) as average, 17 * (max(votes) - min(votes)) as nonsense
from movies
where type = 'movie';

The output:

DAYS  AVERAGE  NONSENSE
----  -------  --------
  78   256301  34739636

Exploring the other tables

Before we go on to learn more SQL let's first use our current knowledge to explore the other tables in our database.

The people table

How many records are in the people table?

select count(*) from people;

The output:

6284

Let's look at 10 records:

select * from people limit 10;

The output:

PERSON_ID  NAME                 BIRTHYEAR  DEATHYEAR
---------  -------------------  ---------  ---------
nm0000002  Lauren Bacall             1924       2014
nm0000005  Ingmar Bergman            1918       2007
nm0000032  Charlton Heston           1923       2008
nm0000056  Paul Newman               1925       2008
nm0000090  Armin Mueller-Stahl       1930
nm0000092  John Cleese               1939
nm0000093  Brad Pitt                 1963
nm0000095  Woody Allen               1935
nm0000096  Gillian Anderson          1968
nm0000098  Jennifer Aniston          1969

The people table has the following columns:

person_id
A unique identifier (key) for each record.
name
An associated name.
birthyear
Birth year if the record represents a person.
deathyear
Death year if the record represents a person.

The blank columns represent null values (more on those below), so deathyear will be null if the person is still alive or if their deathyear has not been entered into the database.

The genres table

Let's take a look at the contents of the genres table with this query.

select * from genres;

The output:

GENRE_ID  GENRE
--------  -----------
       1  Animation
       2  Drama
       3  Fantasy
       4  Action
       5  Romance
       6  Sci-Fi
       7  Comedy
       8  Thriller
       9  War
      10  Mystery
      11  Musical
      12  Adventure
      13  Music
      14  Family
      15  Biography
      16  Crime
      17  Horror
      18  Documentary
      19  History
      20  Sport
      21  Western
      22  News

This table has the following columns:

genre_id
A unique key for each genre.
genre
The genre.
The has_genre table

The has_genre table implements a relationship between movies and genres.

select * from has_genre
limit 10;

The output:

MOVIE_ID   GENRE_ID
---------  --------
tt0118694         2
tt0118694         5
tt0120630         1
tt0120630         7
tt0120630        12
tt0120679         2
tt0120679         5
tt0120679        15
tt0120737         2
tt0120737         3

This table has the following columns:

movie_id
A foreign key pointing into the movies table.
genre_id
A foreign key pointing into the genres table.

Both columns together form the key. Why is that? Because it implements a many-to-many relationship (see lectures).

The has_position table

Let's look at 10 records in the has_position table:

select * from has_position limit 10;

The output:

PERSON_ID  MOVIE_ID   POSITION
---------  ---------  --------
nm0000002  tt0276919  actor
nm0000005  tt0299478  director
nm0000032  tt0310793  self
nm0000056  tt0317219  actor
nm0000090  tt0765443  actor
nm0000092  tt0481499  writer
nm0000093  tt0208092  actor
nm0000093  tt0240772  actor
nm0000093  tt0266987  actor
nm0000093  tt0332452  actor

The has_position table has the following columns:

movie_id
A foreign key pointing into the movies table.
person_id
A foreign key pointing into the people table.
position
The primary job of the associated person in the associated movie.

Since this table implements are relationship between movies and people, the key formed of the pair movie_id and person_id. (Note: The key should really be all three columns since people can have multiple positions in the same movie. For some reason the IMDb data source has been constructed in a way that only one position per movie/person is listed. I hope they fix this in the future ....)

Let's see what values the position column contains:

select position, count(*) as total
from has_position
group by position
order by total desc;

The output:

POSITION  TOTAL
--------  -----
actor      4675
producer   2149
writer     2114
director   1324
self        229
The plays_role table

Let's look at 10 records in the plays_role table:

select * from plays_role limit 10;

which produces the output:

PERSON_ID  MOVIE_ID   ROLE
---------  ---------  ---------------
nm0000002  tt0276919  Ma Ginger
nm0000032  tt0310793  Himself
nm0000056  tt0317219  Doc Hudson
nm0000090  tt0765443  Semyon
nm0000093  tt0208092  Mickey O'Neil
nm0000093  tt0240772  Rusty Ryan
nm0000093  tt0266987  Tom Bishop
nm0000093  tt0332452  Achilles
nm0000093  tt0361748  Lt. Aldo Raine
nm0000093  tt0421715  Benjamin Button

The plays_role table has the following columns:

movie_id
A foreign key pointing into the movies table.
person_id
A foreign key pointing into the people table.
role
A role associated with the person in the associated movie.

Here the key is a triple: all columns together form the key. Why is that?

The real power of SQL: joining tables together

Let's do our first join! This query will "join together" records in the movies and has_genre tables in order to list movie titles from 2017 together with their genres:

select title, genre
from movies
join has_genre on has_genre.movie_id = movies.movie_id
join genres on genres.genre_id = has_genre.genre_id
where year = 2017
limit 20;

The output:

TITLE                   GENRE
----------------------  ---------
Wonder Woman            Fantasy
Wonder Woman            Action
Wonder Woman            Adventure
It                      Horror
The Greatest Showman    Drama
The Greatest Showman    Musical
The Greatest Showman    Biography
The Foreigner           Action
The Foreigner           Thriller
A Dog's Purpose         Drama
A Dog's Purpose         Comedy
A Dog's Purpose         Adventure
Blade Runner 2049       Drama
Blade Runner 2049       Action
Blade Runner 2049       Mystery
Spider-Man: Homecoming  Action
Spider-Man: Homecoming  Sci-Fi
Spider-Man: Homecoming  Adventure
Coco                    Animation
Coco                    Adventure

As we can see, some movies are associated with multiple genres. This answers the question of why movie_id and genre together form the key for the has_genre table! (The same reasoning goes for the plays_role table --- an actor can play more than one role in a movie.)

We have used the join .. on construct. Our query looks for movie records with year = 2017 and then joins together each such record with every record in the genres table that has the same movie_id value. Note the condition has_genre.movie_id = movies.movie_id disambiguates which instance of the column movie_id we are referring to since both the movies and has_genre tables share a column with the same name.

We can write the same query using the as construct that gives a local abbreviation for a table's name:

select title, genre
from movies as m
join has_genre as hg on hg.movie_id = m.movie_id
join genres as g on g.genre_id = hg.genre_id
where year = 2017
limit 20;

This kind of abbreviation will be essential when our joins contain more than one instance of the same table. For example, suppose we want to list the popular (votes > 100000) romantic comedies (RomComs)--- those movies that are associated with the genres Romance and Comedy:

select title, year, rating, votes
from movies as m
join has_genre as hg1 on hg1.movie_id = m.movie_id
join has_genre as hg2 on hg2.movie_id = m.movie_id
join genres as g1 on g1.genre_id = hg1.genre_id
join genres as g2 on g2.genre_id = hg2.genre_id
where m.votes > 100000 and g1.genre = 'Romance' and g2.genre = 'Comedy'
order by votes desc;

The output:

TITLE                           YEAR  RATING   VOTES
------------------------------  ----  ------  ------
Amélie                          2001   8.3E0  659080
Silver Linings Playbook         2012   7.7E0  625643
500 Days of Summer              2009   7.7E0  445253
Crazy, Stupid, Love.            2011   7.4E0  440746
Love Actually                   2003   7.6E0  388881
The Terminal                    2004   7.3E0  370467
The 40-Year-Old Virgin          2005   7.1E0  368827
Midnight in Paris               2011   7.7E0  360153
Easy A                          2010   7.1E0  334211
Up in the Air                   2009   7.4E0  303024
50/50                           2011   7.6E0  302970
Meet the Parents                2000   7.0E0  295100
Moonrise Kingdom                2012   7.8E0  294643
Pitch Perfect                   2012   7.2E0  259275
Forgetting Sarah Marshall       2008   7.1E0  249843
Vicky Cristina Barcelona        2008   7.1E0  229401
The Artist                      2011   7.9E0  218524
Garden State                    2004   7.4E0  201946
P.S. I Love You                 2007   7.0E0  192577
The Lobster                     2015   7.1E0  187144
I Love You, Man                 2009   7.0E0  186634
Sideways                        2004   7.5E0  170885
About a Boy                     2002   7.0E0  163016
Definitely, Maybe               2008   7.1E0  147430
The Spectacular Now             2013   7.1E0  135417
Punch-Drunk Love                2002   7.3E0  134438
Lars and the Real Girl          2007   7.3E0  132212
Good Bye Lenin!                 2003   7.7E0  130105
It's Kind of a Funny Story      2010   7.1E0  126506
The Kids Are All Right          2010   7.0E0  121248
Me and Earl and the Dying Girl  2015   7.7E0  114081
Safety Not Guaranteed           2012   7.0E0  112692
Love, Rosie                     2014   7.2E0  107473
The Big Sick                    2017   7.6E0  106480

Nested queries

Now suppose we want to see what other genres RomComs are associated with. One way to do this is with a nested query:

select m1.title, m1.year, g.genre, m1.rating, m1.votes
from movies as m1
join has_genre as hg on hg.movie_id = m1.movie_id
join genres as g on g.genre_id = hg.genre_id
where m1.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
      and m1.movie_id in
         (select m2.movie_id
         from movies as m2
         join has_genre as hg1 on hg1.movie_id = m2.movie_id
         join has_genre as hg2 on hg2.movie_id = m2.movie_id
         join genres as g1 on g1.genre_id = hg1.genre_id
         join genres as g2 on g2.genre_id = hg2.genre_id
         where g1.genre = 'Romance' and g2.genre = 'Comedy')
order by m1.votes desc
limit 10;

The output:

TITLE                    YEAR  GENRE    RATING   VOTES
-----------------------  ----  -------  ------  ------
Silver Linings Playbook  2012  Drama     7.7E0  625643
500 Days of Summer       2009  Drama     7.7E0  445253
Crazy, Stupid, Love.     2011  Drama     7.4E0  440746
Love Actually            2003  Drama     7.6E0  388881
The Terminal             2004  Drama     7.3E0  370467
Midnight in Paris        2011  Fantasy   7.7E0  360153
Easy A                   2010  Drama     7.1E0  334211
Up in the Air            2009  Drama     7.4E0  303024
50/50                    2011  Drama     7.6E0  302970
Moonrise Kingdom         2012  Drama     7.8E0  294643

Here (select m2.movie_id ...) is called a nested query. We check if our movie is a RomCom, and then check if there is another genre that is not equal to either Romance or Comedy.

Views

We can introduce a name that represents a query using views. For example, the following SQL defines a view romcom_ids to contain all those movie_id values for RomComs. Executing the statement create view romcom_ids ... will place this view definition in the database. The name romcom_ids can then be used as if it were the name of a data base table.

drop view if exists romcom_ids;

create view romcom_ids as
    select m.movie_id as movie_id
    from movies as m
    join has_genre as hg1 on hg1.movie_id = m.movie_id
    join has_genre as hg2 on hg2.movie_id = m.movie_id
    join genres as g1 on g1.genre_id = hg1.genre_id
    join genres as g2 on g2.genre_id = hg2.genre_id
    where g1.genre = 'Romance' and g2.genre = 'Comedy';

select m.title, m.year, g.genre, m.rating, m.votes
from romcom_ids as r
join has_genre as hg on hg.movie_id = r.movie_id
join genres as g on g.genre_id = hg.genre_id
join movies as m on m.movie_id = r.movie_id
where m.votes > 100000 and (not (g.genre = 'Romance' or g.genre = 'Comedy'))
order by m.votes desc
limit 10;

The query above will return the same result as the previous version:

TITLE                    YEAR  GENRE    RATING   VOTES
-----------------------  ----  -------  ------  ------
Silver Linings Playbook  2012  Drama     7.7E0  625643
500 Days of Summer       2009  Drama     7.7E0  445253
Crazy, Stupid, Love.     2011  Drama     7.4E0  440746
Love Actually            2003  Drama     7.6E0  388881
The Terminal             2004  Drama     7.3E0  370467
Midnight in Paris        2011  Fantasy   7.7E0  360153
Easy A                   2010  Drama     7.1E0  334211
Up in the Air            2009  Drama     7.4E0  303024
50/50                    2011  Drama     7.6E0  302970
Moonrise Kingdom         2012  Drama     7.8E0  294643

In general, when you create a table or view into a database it is best to be careful and make sure you first delete any older versions. That is the reason for the statement drop view if exists romcom_ids that appears just before the view definition. (Please don't use views in your solutions to the relational tick!)

We need to talk about null

SQL has a special value null that is used when a value is not known or is somehow not relevant. We saw null values for the column deathYear in the people table:

select * from people limit 10;

The output:

PERSON_ID  NAME                 BIRTHYEAR  DEATHYEAR
---------  -------------------  ---------  ---------
nm0000002  Lauren Bacall             1924       2014
nm0000005  Ingmar Bergman            1918       2007
nm0000032  Charlton Heston           1923       2008
nm0000056  Paul Newman               1925       2008
nm0000090  Armin Mueller-Stahl       1930
nm0000092  John Cleese               1939
nm0000093  Brad Pitt                 1963
nm0000095  Woody Allen               1935
nm0000096  Gillian Anderson          1968
nm0000098  Jennifer Aniston          1969

The HyperSQL system will usually print null as a blank (other systems may behave differently).

Let's try to count the number of people records with a null in the deathYear column:

select count(*)
from people
where deathYear = null;

The output:

0

Hmmm. What is going on here? As will be explained further in lectures, with the null we enter the world of 3-valued logic. For two expressions e1 and e2 the expression e1 = e2 can evaluate to true or false or null (which can be read as we don't know the answer). Equality returns null when either e1 or e2 evaluates to null. For example, column = null always returns null. Furthermore, the SELECT ... WHERE construct only returns records for those cases where the where clause returns true.

To get around this problem SQL introduced the is null boolean predicate:

select count(*)
from people
where deathYear is null;

The output:

5919

Generating null values with left join

Sometimes we want to generate null values when we are joing tables together. Let's look at an example.

First, here are the the people credited with work on "Silver Linings Playbook":

select name, position
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
where title = 'Silver Linings Playbook';

The output:

NAME               POSITION
-----------------  --------
Robert De Niro     actor
Bruce Cohen        producer
Bradley Cooper     actor
Donna Gigliotti    producer
Jonathan Gordon    producer
David O. Russell   director
Jacki Weaver       actor
Jennifer Lawrence  actor
Matthew Quick      writer

Let's modify that query in order to see what roles the actors played in that movie:

select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';

The output:

NAME               POSITION  ROLE
-----------------  --------  -------
Robert De Niro     actor     Pat Sr.
Bradley Cooper     actor     Pat
Jacki Weaver       actor     Dolores
Jennifer Lawrence  actor     Tiffany

Well, that's nice but we lost the records associated with non-actors since they do not join with records in the plays_role table! How can we retain them?

We solve this problem by using the left join construct:

select name, position, role
from people as p
join has_position as c on p.person_id = c.person_id
join movies as m on c.movie_id = m.movie_id
left join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id
where title = 'Silver Linings Playbook';

The output:

NAME               POSITION  ROLE
-----------------  --------  -------
Robert De Niro     actor     Pat Sr.
Bruce Cohen        producer  [null]
Bradley Cooper     actor     Pat
Donna Gigliotti    producer  [null]
Jonathan Gordon    producer  [null]
David O. Russell   director  [null]
Jacki Weaver       actor     Dolores
Jennifer Lawrence  actor     Tiffany
Matthew Quick      writer    [null]

Note that in this case our user-interface displays [null] instead of whitespace. Why? That's because the type of this column is string, and this display wants us to be able to distinguish between a null value and the empty string!

The left join construct.

  • left join plays_role as r on r.movie_id = m.movie_id and r.person_id = c.person_id

When there is no matchng record in plays_role, the results (role in this case) will be given the null value.

select distinct

Let's try to write a query that returns the movies in which Jennifer Lawrence played the same role:

select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
      and r1.role = r2.role
order by r1.role, m1.title, m1.year;

The output:

ROLE               TITLE                            YEAR
-----------------  -------------------------------  ----
Aurora Lane        Passengers                       2016
Katniss Everdeen   The Hunger Games                 2012
Katniss Everdeen   The Hunger Games                 2012
Katniss Everdeen   The Hunger Games: Catching Fire  2013
Katniss Everdeen   The Hunger Games: Catching Fire  2013
Mystique           X-Men: First Class               2011
Raven              X-Men: First Class               2011
Ree                Winter's Bone                    2010
Rosalyn Rosenfeld  American Hustle                  2013
Tiffany            Silver Linings Playbook          2012

Hmmm. This didn't work -- we have every role Jennifer Lawrence played in any movie, with duplicate rows as well. Why? Suppose our query finds two movies, say A and B. If they are the same movie (A=B), then we we return the role played in that movie. Otherwise, there are two orders in which the query could process these movies -- first A then B, and first B then A. In each case we will return the same result, and this explains the duplicates rows.

Let's avoid the first problem by a modification to the where clause by adding m1.movie_id <> m2.movie_id:

select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Jennifer Lawrence'
      and r1.role = r2.role
      and m1.movie_id <> m2.movie_id
order by r1.role, m1.title, m1.year;

The output:

ROLE              TITLE                            YEAR
----------------  -------------------------------  ----
Katniss Everdeen  The Hunger Games                 2012
Katniss Everdeen  The Hunger Games: Catching Fire  2013

Wow! That seems to have solved both our problems! But did it really? As a sanity check let's try another actor:

select r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
      and r1.role = r2.role
      and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;

The output:

ROLE              TITLE                                  YEAR
----------------  -------------------------------------  ----
Lisbeth Salander  The Girl Who Kicked the Hornet's Nest  2009
Lisbeth Salander  The Girl Who Kicked the Hornet's Nest  2009
Lisbeth Salander  The Girl Who Played with Fire          2009
Lisbeth Salander  The Girl Who Played with Fire          2009
Lisbeth Salander  The Girl with the Dragon Tattoo        2009
Lisbeth Salander  The Girl with the Dragon Tattoo        2009

Ooops! We still have duplicate rows! What's going on? (Bofore reading on, see if you can figure out the answer.)

In this case there are three movies involved, say say A, B, and C. So, how many ways can I find a pair of distinct movies with A first? Well, just A, B and A, C. So the query returns two identical rows for movie A. The same goes for B and C. In general, if there are k movies with the same role, then we will get k-1 copies of each! This explains what happened with Jennifer_Lawrence. In that case k = 2 and k-1 = 1, so no duplicate rows! We just got lucky!

All of this is due to the fact that SQL is not based on sets of records, but rather on multisets of records (sets containing duplicates, also called bags). If we want a set rather than a multiset, we can use the select distinct construct:

select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Noomi Rapace'
      and r1.role = r2.role
      and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;

The output:

ROLE              TITLE                                  YEAR
----------------  -------------------------------------  ----
Lisbeth Salander  The Girl Who Kicked the Hornet's Nest  2009
Lisbeth Salander  The Girl Who Played with Fire          2009
Lisbeth Salander  The Girl with the Dragon Tattoo        2009

Yippee! This seems correct. Just to be sure let's sanity check again and try another actor:

select distinct r1.role as role, m1.title as title, m1.year as year
from plays_role as r1
join plays_role as r2 on r2.person_id = r1.person_id
join movies as m1 on m1.movie_id = r1.movie_id
join movies as m2 on m2.movie_id = r2.movie_id
join people as p on p.person_id = r1.person_id
where p.name = 'Scarlett Johansson'
      and r1.role = r2.role
      and m1.movie_id <> m2.movie_id
order by m1.title, r1.role, m1.year;

The output:

ROLE              TITLE                                YEAR
----------------  -----------------------------------  ----
Black Widow       Captain America: Civil War           2016
Natasha Romanoff  Captain America: Civil War           2016
Black Widow       Captain America: The Winter Soldier  2014
Natasha Romanoff  Captain America: The Winter Soldier  2014
Black Widow       The Avengers                         2012
Natasha Romanoff  The Avengers                         2012

Did this result surprised you? Is it correct? It turns out that Scarlett Johansson played two distinct roles in each of these movies. So it does seem correct. But it also forces us to admit that the original English specification, returns the movies in which so-and-so played the same role, was rather vague!

Next Steps

Now you should know enough to attempt the practical questions, which appear on a separate page.