DOCtorWho Database Tutorial

DOCtorWho really isn't a database in the traditional sense. We are just going to use python dictionaries to encode JSON objects. We will write queries using python. Even if you have never seen python before, you should be able to answer Tick 3 after reading this tutorial.

Contact Tim Griffin, tgg22@cam.ac.uk, with any problems.

Install Python (version 3.5 if possible)

Use a search engine to find instructions on how to install python on your machine.

Read this python dictionary tutorial

We will be using python dictionaries to encode our data, so please read this tutorial : https://realpython.com/python-dicts/.

Download the data

Download the file data.zip and unzip it. This will produce a directory data containing two files, movies.pickled and people.pickled.

Example files

This tutorial will be looking at these files.

The file template.py will be a template for all of our queries. Given an argument for the path to the data directory, this file will extract python dictionaries movies and people from the pickled data. We don't really have to understand what this code is doing, but the comments give you some clues.

The file get_movie_by_id.py represents our first query. It finds a movie eject given a movie_id. Suppose we execute this command in the same directory where we unzipped the data file. Then the command python3.5 get_movie_by_id.py data tt1454468 will print the result:


{'actors': [{'name': 'Ed Harris',
             'person_id': 'nm0000438',
             'roles': ['Mission Control']},
            {'name': 'Sandra Bullock',
             'person_id': 'nm0000113',
             'roles': ['Ryan Stone']},
            {'name': 'Orto Ignatiussen',
             'person_id': 'nm1241511',
             'roles': ['Aningaaq']},
            {'name': 'George Clooney',
             'person_id': 'nm0000123',
             'roles': ['Matt Kowalski']}],
 'directors': [{'name': 'Alfonso Cuarón', 'person_id': 'nm0190859'}],
 'genres': ['Drama', 'Sci-Fi', 'Thriller'],
 'minutes': '91',
 'movie_id': 'tt1454468',
 'producers': [{'name': 'David Heyman', 'person_id': 'nm0382268'}],
 'rating': '7.7',
 'title': 'Gravity',
 'type': 'movie',
 'votes': '718349',
 'writers': [{'name': 'Jonás Cuarón', 'person_id': 'nm0190861'}],
 'year': '2013'}
  

Note that the JSON movie object contains fields 'actors', 'directors', 'producers', and 'writers'. Each of these these fields is associated with a list abbreviated people objects containing a 'name' and a 'person_id'. However, these objects are augmented with 'roles' in the case of 'actors'. The 'genres' field contains a list of genres.

The python code for this query is


def get_movie_by_id (str): 
    # initialise output movie 
    the_movie = {} 
    # Check that str is a movies key 
    if str in movies.keys():
        # get the movie with that key 
        the_movie = movies[str]
    return the_movie 

id    = sys.argv[2] # get the second argument 

pprint.pprint (get_movie_by_id(id))
  

Here we have defined a procedure get_movie_by_id that first checks if its argument is a key of the movies dictionary, and then extracts the object with that key from the dictionary.

There are several things to note about python. Whitespace matters in this language. That is, unlike many other programming languages, indentation is significant in python. Languages like C and Java require a lot of curly braces to indicate the beginning and ending of blocks. Python does this with indentation. That is why the boolean condition of an if statement always ends with : to indicate that the next line starts the indented "then" clause.

The file get_movie_by_title.py finds a movie by searching for its title. For example, the command python3.5 get_movie_by_id.py data Gravity should produce the same JSON as shown above. However, the associated code is a little more complicated since we have to loop over the movie dictionary looking for a title match:


def get_movie_by_title (str): 
    # initialise output 
    the_movie = {} 
    # iterate through all the keys of the movie dictionary 
    # looking for one with the right title 
    for movie_id in movies.keys():
        if movies[movie_id]['title'] == str:
            the_movie = movies[movie_id]
            break
    return the_movie 

title    = sys.argv[2] 

pprint.pprint (get_movie_by_title(title))
  

Here we are using a for construct to loop over all keys of the movies dictionary. We break out of the loop when a match is found.

Look at the the files get_person_by_id.py and get_person_by_name.py. They implement similar queries for the people dictionary. For example, executing the command python3.5 get_person_by_name.py data "Sandra Bullock" will return the following result.


{'acted_in': [{'movie_id': 'tt0375679',
               'roles': ['Jean'],
               'title': 'Crash',
               'year': '2004'},
              {'movie_id': 'tt1454468',
               'roles': ['Ryan Stone'],
               'title': 'Gravity',
               'year': '2013'},
              {'movie_id': 'tt0878804',
               'roles': ['Leigh Anne Tuohy'],
               'title': 'The Blind Side',
               'year': '2009'}],
 'birthYear': '1964',
 'name': 'Sandra Bullock',
 'person_id': 'nm0000113'}
  

The four queries just discussed are the primary ones our "database" is intended to support. There are many JSON-based databases, such as MongoDB that are fully featured (and not so easy to install and set up). MongoDB has many interfaces for various programming languages allowing users to write queries much as we are doing here with python. There have been attempts to develop query languages for JSON databases (such as JsonPath), but we won't be looking into that topic.

What if we need more complex queries? Well, we just write code! For example, consider file example1.py. Here the query is attempting to achieve results similar to exercise2b from tick 2. In Cypher we probably wrote a query something like this:


  match (m:Movie{title : 'The Matrix Reloaded'}) 
         <-[r1:ACTED_IN]- (p) -[r2:ACTED_IN]->
        (n:Movie {title : 'John Wick'}) 
  return p.name as name, r1.roles as roles1, r2.roles as roles2 
  order by name, roles1, roles2;

That query lists actors that played roles in both 'The Matrix Reloaded' and 'John Wick'. The query example1.py generalises this by taking the two movie titles as command-line arguments.

For example, the command python3.5 example1.py data 'The Matrix Reloaded' 'John Wick' will return this result


Actor Keanu Reeves plays Neo in The Matrix Reloaded and John Wick in John Wick    

while the command python3.5 example1.py data 'Harry Potter and the Chamber of Secrets' 'Harry Potter and the Prisoner of Azkaban' returns
Actor Daniel Radcliffe plays Harry Potter in Harry Potter and the Chamber of Secrets and Harry Potter in Harry Potter and the Prisoner of Azkaban
Actor Rupert Grint plays Ron Weasley in Harry Potter and the Chamber of Secrets and Ron Weasley in Harry Potter and the Prisoner of Azkaban
Actor Emma Watson plays Hermione Granger in Harry Potter and the Chamber of Secrets and Hermione Granger in Harry Potter and the Prisoner of Azkaban

Whereas Cypher allows us to write such high-level queries with DOCtorWho we have to explicitly manipulate JSON objects:


def get_movie_by_title (str): 
    # initialise output 
    the_movie = {} 
    # iterate through all the keys of the movie dictionary 
    # looking for one with the right title 
    for movie_id in movies.keys():
        if movies[movie_id]['title'] == str:
            the_movie = movies[movie_id]
            break
    return the_movie 

title1    = sys.argv[2] 
title2    = sys.argv[3]

movie1 = get_movie_by_title(title1)
movie2 = get_movie_by_title(title2)

for actor1 in movie1['actors']:
    for actor2 in movie2['actors']:  
        if actor1['person_id'] == actor2['person_id']:
            roles1 = actor1['roles']
            roles2 = actor2['roles']
            for r1 in roles1:
                for r2 in roles2:
                    print ("Actor %s plays %s in %s and %s in %s" % (actor1['name'], r1, title1, r2, title2))