Relational databases: Schema documentation

We have provided a sample database with information about movies and actors, taken from the Internet Movie Database (IMDb). We have structured the data in a relational schema, and this page describes the form and meaning of those structures.

The sample database contains 11 tables. The two main tables are movies, which lists every movie in the database, and people, which lists all the people involved in one or more movies. The credits table indicates who was involved with which movie (as actor, director, producer, etc). The remaining 8 tables contain various additional information about movies, such as the locations where it was shot, and the date it was released.

There are two different versions of the sample database: a large version and a small version. The table structure of those databases is the same; the biggest difference is that the small database contains only 100 movies, whereas the large one contains almost a million movies. As the large database covers a much broader variety of movies, it also contains more strange quirks and oddities!

Table definitions

Here are the SQL schema definitions of those tables:

   id        integer      PRIMARY KEY,
   title     varchar(255) NOT NULL UNIQUE,
   year      integer)

   id        integer      PRIMARY KEY,
   name      varchar(255) NOT NULL UNIQUE,
   gender    varchar(10))

   person_id integer      NOT NULL REFERENCES people (id),
   movie_id  integer      NOT NULL REFERENCES movies (id),
   type      varchar(20)  NOT NULL,
   note      varchar(255),
   character varchar(255),
   position  integer,
   line_order     integer,
   group_order    integer,
   subgroup_order integer,
   UNIQUE (person_id, movie_id, type))

CREATE TABLE certificates(
   movie_id    integer      NOT NULL REFERENCES movies (id),
   country     varchar(20)  NOT NULL,
   certificate varchar(20)  NOT NULL,
   note        varchar(255))

CREATE TABLE color_info(
   movie_id  integer      NOT NULL REFERENCES movies (id),
   value     varchar(20)  NOT NULL,
   note      varchar(255))

   movie_id  integer      NOT NULL REFERENCES movies (id),
   genre     varchar(25)  NOT NULL)

CREATE TABLE keywords(
   movie_id  integer      NOT NULL REFERENCES movies (id),
   keyword   varchar(127) NOT NULL)

CREATE TABLE languages(
   movie_id  integer      NOT NULL REFERENCES movies (id),
   language  varchar(35)  NOT NULL,
   note      varchar(255))

CREATE TABLE locations(
   movie_id  integer      NOT NULL REFERENCES movies (id),
   location  varchar(255) NOT NULL,
   note      varchar(511))

CREATE TABLE release_dates(
   movie_id     integer      NOT NULL REFERENCES movies (id),
   country      varchar(40)  NOT NULL,
   release_date varchar(10)  NOT NULL,
   note         varchar(255))

CREATE TABLE running_times(
   movie_id     integer      NOT NULL REFERENCES movies (id),
   running_time varchar(40)  NOT NULL,
   note         varchar(255))

Table descriptions

Here is some more detail on how the data in each table should be interpreted, with links to the IMDb policies on how data should be structured.

ID, title, and year of first public screening of each movie.
ID, name, and gender of each person. If there are several people with the same name in the database, the name is followed by a roman numeral in parentheses to resolve the ambiguity (the number is arbitrarily assigned). Names in languages that use non-latin characters are romanised into the latin alphabet. The gender is only set for actors and actresses, which are separated into two separate lists in IMDb; for other people (e.g. directors, producers) the gender is null. For transgender or non-binary gender people, either "male" or "female" has been chosen by whoever entered the data into IMDb.

Indicates which people were involved with which movie (for example as actor, director, or producer). The type column indicates the type of involvement:

actor:The person was a member of the cast of this movie. The character column gives the name of the character they played, and the position column indicates the order in which the actors appear in the credits. For example, the actor who is listed first in the credits has a position of 1, the actor who is listed second has 2, and so on. Uncredited actors have a null position.
 The person was a cinematographer or director of photography on this movie.
composer:The person composed the main background score of this movie.
 The person is credited as costume designer on this movie.
director:The person directed this movie.
editor:The person is credited as picture editor of this movie.
producer:The person is credited as producer, executive producer, line producer or similar of this movie.
 The person is credited as production designer on this movie.
writer:The person was screenplay or story writer of this movie. Writers have numbers in the line_order, group_order and subgroup_order columns, indicating the collaborations and level of contribution according to quite specific and complex rules.
Any certificates (such as age rating) that were assigned to a movie. The certificate codes vary by country, and they are documented on the IMDb website.
Colour information about a movie — in particular, whether it is in colour or in black and white.
The genres that characterise this movie.
The keywords that describe any notable object, concept, style or action that takes place during this movie.
The languages spoken in this movie. A note may state in which context the language is used.
The places in which this movie was filmed. A note may state which parts of the movie are shot in this location.
The release dates by country for this movie. There may be multiple dates for the same country due to a premiere or film festival being listed separately from the main release date, or differences between regions. The date is given in the form "YYYY-MM-DD", unless the exact date is unknown, in which case it is in the form "YYYY-MM" or "YYYY".
The running time of this movie. Usually there is only one running time per movie, but there may be several entries if there are different versions of the movie that have different running times. The time is given as either just the number of minutes, or the number of minutes prefixed with a country, such as "Brazil:58" if different countries have different versions of the movie.