DOCtorWho really isn't a database in the traditional sense. We are just going to use python dictionaries to encode JSON objects. We will write queries using python. Even if you have never seen python before, you should be able to answer Tick 3 after reading this tutorial.
Contact Tim Griffin, tgg22@cam.ac.uk, with any problems.
Use a search engine to find instructions on how to install python on your machine.
We will be using python dictionaries to encode our data, so please read this tutorial : https://realpython.com/python-dicts/.
Download the file data.zip and unzip it. This will produce a directory data containing two files, movies.pickled and people.pickled.
This tutorial will be looking at these files.
The file template.py will be a template for all of our queries. Given an argument for the path to the data directory, this file will extract python dictionaries movies and people from the pickled data. We don't really have to understand what this code is doing, but the comments give you some clues.
The file get_movie_by_id.py represents our first query. It finds a movie eject given a movie_id. Suppose we execute this command in the same directory where we unzipped the data file. Then the command python3.5 get_movie_by_id.py data tt1454468 will print the result:
{'actors': [{'name': 'Ed Harris', 'person_id': 'nm0000438', 'roles': ['Mission Control']}, {'name': 'Sandra Bullock', 'person_id': 'nm0000113', 'roles': ['Ryan Stone']}, {'name': 'Orto Ignatiussen', 'person_id': 'nm1241511', 'roles': ['Aningaaq']}, {'name': 'George Clooney', 'person_id': 'nm0000123', 'roles': ['Matt Kowalski']}], 'directors': [{'name': 'Alfonso Cuarón', 'person_id': 'nm0190859'}], 'genres': ['Drama', 'Sci-Fi', 'Thriller'], 'minutes': '91', 'movie_id': 'tt1454468', 'producers': [{'name': 'David Heyman', 'person_id': 'nm0382268'}], 'rating': '7.7', 'title': 'Gravity', 'type': 'movie', 'votes': '718349', 'writers': [{'name': 'Jonás Cuarón', 'person_id': 'nm0190861'}], 'year': '2013'}
The python code for this query is
def get_movie_by_id (str): # initialise output movie the_movie = {} # Check that str is a movies key if str in movies.keys(): # get the movie with that key the_movie = movies[str] return the_movie id = sys.argv[2] # get the second argument pprint.pprint (get_movie_by_id(id))
There are several things to note about python. Whitespace matters in this language. That is, unlike many other programming languages, indentation is significant in python. Languages like C and Java require a lot of curly braces to indicate the beginning and ending of blocks. Python does this with indentation. That is why the boolean condition of an if statement always ends with : to indicate that the next line starts the indented "then" clause.
The file get_movie_by_title.py finds a movie by searching for its title. For example, the command python3.5 get_movie_by_id.py data Gravity should produce the same JSON as shown above. However, the associated code is a little more complicated since we have to loop over the movie dictionary looking for a title match:
def get_movie_by_title (str): # initialise output the_movie = {} # iterate through all the keys of the movie dictionary # looking for one with the right title for movie_id in movies.keys(): if movies[movie_id]['title'] == str: the_movie = movies[movie_id] break return the_movie title = sys.argv[2] pprint.pprint (get_movie_by_title(title))
Look at the the files get_person_by_id.py and get_person_by_name.py. They implement similar queries for the people dictionary. For example, executing the command python3.5 get_person_by_name.py data "Sandra Bullock" will return the following result.
{'acted_in': [{'movie_id': 'tt0375679', 'roles': ['Jean'], 'title': 'Crash', 'year': '2004'}, {'movie_id': 'tt1454468', 'roles': ['Ryan Stone'], 'title': 'Gravity', 'year': '2013'}, {'movie_id': 'tt0878804', 'roles': ['Leigh Anne Tuohy'], 'title': 'The Blind Side', 'year': '2009'}], 'birthYear': '1964', 'name': 'Sandra Bullock', 'person_id': 'nm0000113'}
What if we need more complex queries? Well, we just write code! For example, consider file example1.py. Here the query is attempting to achieve results similar to exercise2b from tick 2. In Cypher we probably wrote a query something like this:
match (m:Movie{title : 'The Matrix Reloaded'}) <-[r1:ACTED_IN]- (p) -[r2:ACTED_IN]-> (n:Movie {title : 'John Wick'}) return p.name as name, r1.roles as roles1, r2.roles as roles2 order by name, roles1, roles2;
For example, the command python3.5 example1.py data 'The Matrix Reloaded' 'John Wick' will return this result
Actor Keanu Reeves plays Neo in The Matrix Reloaded and John Wick in John Wick
Actor Daniel Radcliffe plays Harry Potter in Harry Potter and the Chamber of Secrets and Harry Potter in Harry Potter and the Prisoner of Azkaban Actor Rupert Grint plays Ron Weasley in Harry Potter and the Chamber of Secrets and Ron Weasley in Harry Potter and the Prisoner of Azkaban Actor Emma Watson plays Hermione Granger in Harry Potter and the Chamber of Secrets and Hermione Granger in Harry Potter and the Prisoner of Azkaban
Whereas Cypher allows us to write such high-level queries with DOCtorWho we have to explicitly manipulate JSON objects:
def get_movie_by_title (str): # initialise output the_movie = {} # iterate through all the keys of the movie dictionary # looking for one with the right title for movie_id in movies.keys(): if movies[movie_id]['title'] == str: the_movie = movies[movie_id] break return the_movie title1 = sys.argv[2] title2 = sys.argv[3] movie1 = get_movie_by_title(title1) movie2 = get_movie_by_title(title2) for actor1 in movie1['actors']: for actor2 in movie2['actors']: if actor1['person_id'] == actor2['person_id']: roles1 = actor1['roles'] roles2 = actor2['roles'] for r1 in roles1: for r2 in roles2: print ("Actor %s plays %s in %s and %s in %s" % (actor1['name'], r1, title1, r2, title2))