Document database: Getting started

This tutorial will help you get started with our document-oriented database, DoctorWho. We query DoctorWho by writing Java code. The Java interface to DoctorWho is documented here: Java documentation.

Setting up DoctorWho

As in the previous practicals on relational databases and graph databases, we have prepared a dataset of movies and people (actors, directors, etc.) based on IMDb. Like last time, there is a large and a small version of the dataset. We recommend that you start with the small database, but you can experiment with the large one if you like.

The link for the small document database is here: document-small.zip. Unpack the zip file, which should give you a subdirectory called document-small.

The link for the large document database is here: document-large.zip. Unpack the zip file, which should give you a subdirectory called document-large.

To download the database system itself, go to document-db.jar. Put this jar file in the same directory that contains the directory document-small (and document-large if you are using it).

Now you are ready to explore the data.

Looking up a movie by ID

The document database contains JSON objects. Each object is identified by a unique number, its ID. There are two types of objects, representing movies and people. Let's write a program that looks up the details of a movie in one of our databases and prints them out. The following code takes two arguments: the directory of the database and a movie's ID:

import uk.ac.cam.cl.databases.moviedb.MovieDB;
import uk.ac.cam.cl.databases.moviedb.model.*;

public class GetMovieById {
    public static void main(String[] args) {
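        // try-with-resources closes the database when the block exits
        try (MovieDB database = MovieDB.open(args[0])) {
            // args[0] is the database directory, args[1] is the movie's ID
            int id = Integer.parseInt(args[1]);
            Movie movie = database.getMovieById(id);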
            System.out.println(movie);
        }
    }
}
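
The program above assumes that both arguments are present and that the second one is a valid integer; otherwise it stops with an ArrayIndexOutOfBoundsException or a NumberFormatException. One possible way to report such mistakes more gently is sketched below. The class ArgCheck and its method parseId are names made up for this illustration; they are not part of the DoctorWho library, and the sketch does not open a database.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not part of the DoctorWho library.
class ArgCheck {
    // Returns the parsed ID, or an empty result if the text is not an integer.
    static OptionalInt parseId(String text) {
        try {
            return OptionalInt.of(Integer.parseInt(text.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: java ArgCheck <database-directory> <id>");
            return;
        }
        OptionalInt id = parseId(args[1]);
        if (!id.isPresent()) {
            System.err.println("Not a valid ID: " + args[1]);
            return;
        }
        // A real program would open the database and perform the lookup here.
        System.out.println("Would look up ID " + id.getAsInt() + " in " + args[0]);
    }
}
```

The same two checks could be placed at the top of main in GetMovieById before the database is opened.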

The ID of Star Wars: Episode VII - The Force Awakens is 3470465. How do we look it up? Put the above Java code in a file called GetMovieById.java in the same directory as the jar file you downloaded. You can then compile and run the example as follows (if you are running Microsoft Windows):

# On Windows
javac -classpath document-db.jar GetMovieById.java
java -classpath .;document-db.jar GetMovieById document-small 3470465

If you are running Linux or Mac OS, the commands are very similar -- you just need to replace the semicolon with a colon in the second command:

# On Linux or Mac OS
javac -classpath document-db.jar GetMovieById.java
java -classpath .:document-db.jar GetMovieById document-small 3470465

The command-line argument document-small is the name of the directory containing the database. If you have downloaded the large database as well, you could query it by giving document-large as argument instead.
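
The only difference between the Windows and the Linux/Mac OS commands is the classpath separator. Java exposes the separator of the current platform as java.io.File.pathSeparator (a semicolon on Windows, a colon on Linux and Mac OS). The small program below prints the run command suitable for the machine it is executed on; the class name ClasspathHint is made up for this illustration.

```java
import java.io.File;

// Hypothetical helper for illustration only.
class ClasspathHint {
    // Joins classpath entries with the separator of the current platform.
    static String classpath(String... entries) {
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        String cp = classpath(".", "document-db.jar");
        System.out.println("java -classpath " + cp + " GetMovieById document-small 3470465");
    }
}
```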

When you run the program, the output should look something like this:

{
  "id": 3470465,
  "title": "Star Wars: Episode VII - The Force Awakens (2015)",
  "year": 2015,
  "actors": [
    {
      "character": "Han Solo",
      "position": 1,
      "person_id": 681817,
      "name": "Ford, Harrison (I)"
    },
    {
      "character": "Luke Skywalker",
      "position": 2,
      "person_id": 850963,
      "name": "Hamill, Mark (I)"
    },
    {
      "character": "Princess Leia",
    ...

Looking up a person by ID

In a similar way we can look up a person by ID:

import uk.ac.cam.cl.databases.moviedb.MovieDB;
import uk.ac.cam.cl.databases.moviedb.model.*;

public class GetPersonById {
    public static void main(String[] args) {
        try (MovieDB database = MovieDB.open(args[0])) {
            int id = Integer.parseInt(args[1]);
            Person person = database.getPersonById(id);
            System.out.println(person);
        }
    }
}

Put this code in a file called GetPersonById.java in the same directory as the jar file you downloaded. You can then compile and run the example as follows (if you are running Microsoft Windows):

# On Windows
javac -classpath document-db.jar GetPersonById.java
java -classpath .;document-db.jar GetPersonById document-small 107303

If you are running Linux or Mac OS, the commands are very similar -- you just need to replace the semicolon with a colon in the second command:

# On Linux or Mac OS
javac -classpath document-db.jar GetPersonById.java
java -classpath .:document-db.jar GetPersonById document-small 107303

When you run the program, the output should look like this:

{
  "id": 107303,
  "name": "Bacon, Kevin (I)",
  "gender": "male",
  "actor_in": [
    {
      "character": "Jack Brennan",
      "position": 4,
      "movie_id": 2914879,
      "title": "Frost/Nixon (2008)"
    }
  ]
}

In the large database the output for Kevin Bacon is much longer, because he is involved in many more movies than just Frost/Nixon. However, the small database only contains appearances in 100 selected movies, of which only one features Kevin Bacon.

Scanning all movies and people

You can also search the database by the name of a person who is credited on a movie (whether as actor, director, producer or some other role). The name is stored in the form surname, firstname. For example, you can look up Daniel Radcliffe like this:

database.getByNamePrefix("Radcliffe, Daniel")

The result of getByNamePrefix is a sequence of Person objects. Below we will look at methods for accessing the contents of such objects. But first, we will use getByNamePrefix with an empty prefix, which matches every name, to scan over all people:

database.getByNamePrefix("")

The following code simply counts the number of people in the selected database:

import uk.ac.cam.cl.databases.moviedb.MovieDB;
import uk.ac.cam.cl.databases.moviedb.model.*;

public class CountPeople {
    public static void main(String[] args) {
        int count = 0;
        try (MovieDB database = MovieDB.open(args[0])) {
            for (Person person : database.getByNamePrefix("")) {
                count++;
            }
            System.out.println("Number of people: " + count);
        }
    }
}

On the small database (document-small), this should print:

Number of people: 7711

In a similar way we can scan over all movies by using the getByTitlePrefix method. The following code counts all movies in the selected database:

import uk.ac.cam.cl.databases.moviedb.MovieDB;
import uk.ac.cam.cl.databases.moviedb.model.*;

public class CountMovies {
    public static void main(String[] args) {
        int count = 0;
        try (MovieDB database = MovieDB.open(args[0])) {
            for (Movie movie : database.getByTitlePrefix("")) {
                count++;
            }
            System.out.println("Number of movies: " + count);
        }
    }
}

On the small database (document-small), this should print:

Number of movies: 100
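
CountPeople and CountMovies differ only in the method they call. Since both results are traversed with a for-each loop, the counting itself can be written once for any Iterable. The sketch below is independent of DoctorWho and uses a plain list as stand-in data; the class and method names are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
class CountItems {
    // Counts the elements of any Iterable by walking over it once.
    static <T> int count(Iterable<T> items) {
        int n = 0;
        for (T ignored : items) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("a", "b", "c");
        System.out.println("Number of items: " + count(sample)); // prints "Number of items: 3"
    }
}
```

With a database open, count(database.getByNamePrefix("")) could then replace the explicit loop, assuming the returned sequence implements Iterable rather than being an array; check the Java documentation before relying on this.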

Searching by title

How do you find a movie if you don't already know its ID? Let's search by title instead. You can use database.getByTitlePrefix, and loop over the matching movies. Note that the search only looks at the beginning of the title (not words in the middle of the title), and it is case-sensitive. For example, the following program looks up the ID and title of all movies whose title starts with Harry Potter and the:

import uk.ac.cam.cl.databases.moviedb.MovieDB;
import uk.ac.cam.cl.databases.moviedb.model.*;

public class MoviesByTitle {
    public static void main(String[] args) {
        try (MovieDB database = MovieDB.open(args[0])) {
            System.out.println("List of Harry Potter movies:");

            // Search for movies whose title starts with some string
            for (Movie movie : database.getByTitlePrefix("Harry Potter and the ")) {
                System.out.println("    " + movie.getId() + ": " + movie.getTitle());
            }
        }
    }
}

Note that this code uses the methods movie.getId() and movie.getTitle(). These and many other methods are described here: Java documentation.

The code above will produce this output on document-small:

List of Harry Potter movies:
    2961496: Harry Potter and the Deathly Hallows: Part 2 (2011)

Change the query to search for your favourite movie, and print the results. Note that the small database only contains 100 movies from the last ten years, so your favourite one might not be included. In the large database, almost every movie that appears on IMDb is included.
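
If your search returns nothing, remember that the matching rule is prefix-only and case-sensitive. The sketch below reproduces that rule on an in-memory list with String.startsWith, purely to show which queries match and which do not. It says nothing about how DoctorWho implements the search, and the class name PrefixMatch is made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration only; not the database's implementation.
class PrefixMatch {
    // Returns the titles that start with the given prefix (case-sensitive).
    static List<String> byPrefix(List<String> titles, String prefix) {
        List<String> result = new ArrayList<>();
        for (String title : titles) {
            if (title.startsWith(prefix)) {
                result.add(title);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "Frost/Nixon (2008)",
            "Harry Potter and the Deathly Hallows: Part 2 (2011)",
            "The Martian (2015)");
        System.out.println(byPrefix(titles, "Harry Potter").size());    // 1: prefix matches
        System.out.println(byPrefix(titles, "harry potter").size());    // 0: case differs
        System.out.println(byPrefix(titles, "Deathly Hallows").size()); // 0: not at the start
        System.out.println(byPrefix(titles, "").size());                // 3: empty prefix matches all
    }
}
```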

Searching for an actor's roles

List the titles of all the movies in which Matt Damon played, and the name of the character he played in each:

import uk.ac.cam.cl.databases.moviedb.MovieDB;
import uk.ac.cam.cl.databases.moviedb.model.*;

public class MoviesByActor {
    public static void main(String[] args) {
        try (MovieDB database = MovieDB.open(args[0])) {
            for (Person person : database.getByNamePrefix("Damon, Matt")) {
                System.out.println(person.getName() + " played:");
                for (Role role : person.getActorIn()) {
                    System.out.println("    " + role.getCharacter() + " in " + role.getTitle());
                }
            }
        }
    }
}

The output should look like this:

Damon, Matt played:
    Colin Sullivan in The Departed (2006)
    Jason Bourne in The Bourne Ultimatum (2007)
    Narrator in Inside Job (2010)
    LaBoeuf in True Grit (2010)
    Mark Watney in The Martian (2015)
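
Names are printed in their stored form, surname first, and some carry a suffix such as (I), as in the earlier outputs. The sketch below converts between the stored form and a more readable one. It assumes that the suffix, when present, is always a Roman numeral in parentheses at the end of the name; the examples in this tutorial suggest this but do not guarantee it. The class NameFormat and its methods are made up for this illustration and are not part of the DoctorWho library.

```java
// Hypothetical helper for illustration only; not part of the DoctorWho library.
class NameFormat {
    // Turns "Ford, Harrison (I)" into "Harrison Ford".
    static String toDisplayName(String stored) {
        // Drop a trailing suffix such as " (I)" (assumed format, see above)
        String name = stored.replaceAll("\\s*\\([IVXLCDM]+\\)$", "");
        int comma = name.indexOf(", ");
        if (comma < 0) {
            return name; // no comma: leave the name unchanged
        }
        return name.substring(comma + 2) + " " + name.substring(0, comma);
    }

    // Builds a search prefix in the stored form "surname, firstname".
    static String toSearchPrefix(String firstName, String surname) {
        return surname + ", " + firstName;
    }

    public static void main(String[] args) {
        System.out.println(toDisplayName("Ford, Harrison (I)"));   // Harrison Ford
        System.out.println(toDisplayName("Damon, Matt"));          // Matt Damon
        System.out.println(toSearchPrefix("Daniel", "Radcliffe")); // Radcliffe, Daniel
    }
}
```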

Genres of movies in which an actor has played

You may have noticed in the documentation that a Movie object has a getGenres() method, which returns a list of genre names that characterise the movie. Let's say you want to show a summary of the kinds of movies in which a particular actor tends to play.

We now write a program that looks up an actor by name, finds all the movies in which they have played, and then counts the number of movies by genre. Since a movie can have more than one genre, each movie may be counted several times, which is fine.

Notes: A person object has a list of Role objects representing the movies in which that person has played. The role object has the title and ID of the movie, but no other details about the movie; in particular, it does not have the genre. Thus, the genres need to be looked up separately by querying the database by movie ID. This is an example of a join performed in Java code rather than by the database!

Here is the code:

import java.util.Map;
import java.util.TreeMap;
import uk.ac.cam.cl.databases.moviedb.MovieDB;
import uk.ac.cam.cl.databases.moviedb.model.*;

public class GenresByActor {
    public static void main(String[] args) {
        try (MovieDB database = MovieDB.open(args[0])) {
            for (Person person : database.getByNamePrefix("Damon, Matt")) {
                TreeMap<String, Integer> genreCount = new TreeMap<>();

                for (Role role : person.getActorIn()) {
                    // Perform a join by looking up the ID of the movie
                    Movie movie = database.getMovieById(role.getMovieId());

                    // Iterate over the genres of that movie, and add them to the map
                    if (movie.getGenres() != null) {
                        for (String genre : movie.getGenres()) {
                            if (genreCount.containsKey(genre)) {
                                genreCount.put(genre, genreCount.get(genre) + 1);
                            } else {
                                genreCount.put(genre, 1);
                            }
                        }
                    }
                }

                // Print out the aggregate counts
                System.out.println(person.getName() + " appears in movies of the following genres:");
                for (Map.Entry<String, Integer> count : genreCount.entrySet()) {
                    System.out.println("    " + count.getKey() + ": " + count.getValue() + " movies");
                }
            }
        }
    }
}

The output should look like this:

Damon, Matt appears in movies of the following genres:
    Action: 1 movies
    Adventure: 2 movies
    Crime: 2 movies
    Documentary: 1 movies
    Drama: 3 movies
    Sci-Fi: 1 movies
    Thriller: 2 movies
    Western: 1 movies
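
The join and aggregation pattern used in GenresByActor can also be tried without the database. The sketch below replaces the database lookup with an in-memory map and uses Map.merge as a shorter alternative to the containsKey/put branches. The class JoinSketch, the method countGenres and the sample data are all made up for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for illustration only; the map plays the role of the database.
class JoinSketch {
    // Counts genres over the given movie IDs, looking each ID up in genresById.
    static Map<String, Integer> countGenres(List<Integer> movieIds,
                                            Map<Integer, List<String>> genresById) {
        Map<String, Integer> counts = new TreeMap<>(); // keeps genres in alphabetical order
        for (int movieId : movieIds) {
            List<String> genres = genresById.get(movieId); // the "join" step
            if (genres == null) {
                continue; // unknown movie, or no genres recorded
            }
            for (String genre : genres) {
                counts.merge(genre, 1, Integer::sum); // insert 1, or add 1 to the existing count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data: movie ID -> genres
        Map<Integer, List<String>> genresById = new HashMap<>();
        genresById.put(1, Arrays.asList("Crime", "Drama"));
        genresById.put(2, Arrays.asList("Drama", "Sci-Fi"));

        // Movie IDs as they might appear in one person's roles; ID 3 has no entry
        List<Integer> movieIds = Arrays.asList(1, 2, 3);
        System.out.println(countGenres(movieIds, genresById)); // {Crime=1, Drama=2, Sci-Fi=1}
    }
}
```

Each lookup by movie ID corresponds to one database query in GenresByActor, so an actor with many roles causes many queries.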