This tutorial will help you get started with our document-oriented database, DoctorWho. Many databases management systems (DBMSs) have language bindings allowing them to be queried from your favourite high-level language, such as FSharp or Python. But here, the DBMS itself is also written in Python. What we provide is very rudimentary, so writing a few query utilities of your own is recommended.
As in the previous practical on relational databases we have prepared a dataset of movies and people (actors, directors, etc.) based on IMDb. Because we are interested in unstructured text, this version also includes trivia entries.
The prepared document database is a python pickled dictionary available in zipped form from the course web site. Download it and unzip it in a folder of your choosing:
-rw-rw-r--. 1 djg11 djg11 15802411 Nov 15 10:25 imdb_doctorwho_database_v3.pickled
Make sure you have the v3 copy, since the older ones were broken.
There is just one 'table'. A document database is roughly equivalent to a key/value store. Both the keys and the values (content stored under a key) are opaque to a document database. There might be no rules of consistency and data cannot be guaranteed to be organised uniformly.
Of course, such an approach is almost useless! Instead, we must have some idea what's in our database.
Here's a hint: all of the DoctorWho keys are either PERSON_ID or MOVIE_ID values, as used in our tidy relational database.
What to do next? We need to know the schema! But being typeless, that's not documented. So instead, we'll have to take a look at some example data and infer it. Sadly, a large amount of so-called Computing Science is done that way today. You ask someone for a definition and they give you an example and don't even realise they've not answered your question.
Well, we'll jus have to look up a person and look up a film and see what is stored. We shall expect to see all the relevant information for a person or a film in a denormal form, with some of the information repeated between enties --- or indeed replicated manyfold.
We query DoctorWho by writing python code. If we know the IMDB for a film, eg. tt1109624, we can just quote it
# This will be the common prelude to all of our queries. import sys # talk to the operating system import os.path # manipulate paths to files, directories import pickle # read/write pickled python dictionaries import pprint # pretty print JSON data_dir = sys.argv[1] # first command-line argument -- the directory of data # use os.path.join so that path works on both Windows and Unix doctorwho_path = os.path.join(data_dir, 'imdb_doctorwho_database_v3.pickled') # open data dictionary file and un-pickle it doctorwhoFile = open(doctorwho_path, mode= "rb") doctorwho = pickle.load(doctorwhoFile) ##################################### # write your query code here ... movie_key = "tt1109624" # Paddington - 2014 pprint.pprint (doctorwho[movie_key])
When you run the program, the output should look like this:
{'actors': [{'name': 'Hugh Bonneville', 'person_id': 'nm0095017', 'roles': ['Henry Brown']}, {'name': 'Sally Hawkins', 'person_id': 'nm1020089', 'roles': ['Mary Brown']}, {'name': 'Julie Walters', 'person_id': 'nm0910278', 'roles': ['Mrs. Bird']}, {'name': 'Jim Broadbent', 'person_id': 'nm0000980', 'roles': ['Mr Gruber']}], 'directors': [{'name': 'Paul King', 'person_id': 'nm1653753'}], 'genres': ['Adventure', 'Animation', 'Comedy'], 'minutes': '95', 'movie_id': 'tt1109624', 'producers': [{'name': 'David Heyman', 'person_id': 'nm0382268'}], 'rating': '7.2', 'title': 'Paddington', 'trivia': {'title2': ['Paddington (2014) - IMDb'], 'trivia_entries': ['\n' " Karen Jankel, daughter of Paddington's " 'creator, Michael Bond, was almost moved to ' 'tears after the first screening she saw. "For ' 'me, it was bringing to life the bear that was ' 'so real to me. And I think they got it ' 'absolutely right." ', '\n' ... ... ... 'and the security guard. ']}, 'type': 'movie', 'votes': '105159', 'writers': [{'name': 'Michael Bond', 'person_id': 'nm1286491'}, {'name': 'Hamish McColl', 'person_id': 'nm0566100'}], 'year': '2014'}
You can see the document database contains JSON objects.
There is no inverted index stored yet (you might make one in your own work and compare the run time), so a linear scan is used to find a film:
# This will be the common prelude to all of our queries. import sys # talk to the operating system import os.path # manipulate paths to files, directories import pickle # read/write pickled python dictionaries import pprint # pretty print JSON data_dir = sys.argv[1] # first command-line argument -- the directory of data # use os.path.join so that path works on both Windows and Unix doctorwho_path = os.path.join(data_dir, 'imdb_doctorwho_database_v3.pickled') # open data dictionary file and un-pickle it doctorwhoFile = open(doctorwho_path, mode= "rb") doctorwho = pickle.load(doctorwhoFile) ##################################### def get_movie_by_title(str): # initialise output the_movie = None # iterate through all the keys of the database looking for one with the right title for key in doctorwho: entry = doctorwho[key] if 'title' in entry: if (entry['title'] == str): the_movie = entry break if (the_movie != None): break return the_movie # write your query code here: search_title = sys.argv[2] # "Harry Potter and the Order of the Phoenix" pprint.pprint (get_movie_by_title(search_title))
When you run the program, the output should look like this:
{'actors': [{'name': 'Rupert Grint', 'person_id': 'nm0342488', 'roles': ['Ron Weasley']}, {'name': 'Daniel Radcliffe', 'person_id': 'nm0705356', 'roles': ['Harry Potter']}, {'name': 'Emma Watson', 'person_id': 'nm0914612', 'roles': ['Hermione Granger']}, {'name': 'Brendan Gleeson', 'person_id': 'nm0322407', 'roles': ["Alastor 'Mad-Eye' Moody"]}], 'directors': [{'name': 'David Yates', 'person_id': 'nm0946734'}], 'genres': ['Action', 'Adventure', 'Family'], 'minutes': '138', ...
And the same applies to find a person:
The following code will take two arguments: the directory of the database and a person's name:
# This will be the common prelude to all of our queries. import sys # talk to the operating system import os.path # manipulate paths to files, directories import pickle # read/write pickled python dictionaries import pprint # pretty print JSON data_dir = sys.argv[1] # first command-line argument -- the directory of data search_item = sys.argv[2] # for example "Keeley Hawes" # use os.path.join so that path works on both Windows and Unix doctorwho_path = os.path.join(data_dir, 'imdb_doctorwho_database_v3.pickled') # open data dictionary file and un-pickle it doctorwhoFile = open(doctorwho_path, mode= "rb") doctorwho = pickle.load(doctorwhoFile) # Iterate through all the keys of the database looking for a person with the right name def get_person_by_name(name): the_person = None for key in doctorwho: entry = doctorwho[key] if 'name' in entry: if (entry['name'] == name): the_person = entry break if (the_person != None): break return the_person the_person = get_person_by_name(search_item) pprint.pprint (the_person)
When you run the program, the output should look like this:
{'acted_in': [{'movie_id': 'tt0795368', 'roles': ['Jane'], 'title': 'Death at a Funeral', 'year': '2007'}], 'birthYear': '1976', 'name': 'Keeley Hawes', 'person_id': 'nm0369954'}
Here we demonstrate sorting the output, limiting the number of returned records and searching for a string in the title and trivia.
# This will be the common prelude to all of our queries. import sys # talk to the operating system import os.path # manipulate paths to files, directories import pickle # read/write pickled python dictionaries import pprint # pretty print JSON data_dir = sys.argv[1] # first command-line argument -- the directory of data # use os.path.join so that path works on both Windows and Unix doctorwho_path = os.path.join(data_dir, 'imdb_doctorwho_database_v3.pickled') # open data dictionary file and un-pickle it doctorwhoFile = open(doctorwho_path, mode= "rb") doctorwho = pickle.load(doctorwhoFile) ##################################### # write your query code here ... def get_movie_by_subject(str): # initialise output results = [] for key in doctorwho: if ("tt" in key): # Check it's a film entry = doctorwho[key] title = entry['title'] selected = False if (str in title): selected = True trivia = entry['trivia'] if (trivia != None): for triv in trivia['trivia_entries']: if (selected): break; if (str in triv): selected = True if (selected): results.append(title) return results # Note: append in Python is nothing like the @ in ML: it is mutable and adds only a single item. def display_ans(line_limit, ans): ans = sorted(ans) # Sort into ascending order ans = ans[0:line_limit] # Limit number of entries for item in ans: print(f'{item}') print(f'Returned {len(ans)} records') ans = get_movie_by_subject("bear") display_ans(30, ans)
When you run the program, the output should look like this:
10 Cloverfield Lane 28 Days Later... 28 Weeks Later 3:10 to Yuma A History of Violence A Scanner Darkly A Serious Man A Tale of Two Sisters A.I. Artificial Intelligence About Schmidt Adaptation. American Sniper Anchorman: The Legend of Ron Burgundy Apocalypto Avatar Avengers: Age of Ultron Batman Begins Beasts of No Nation Big Hero 6 Black Panther Blade Runner 2049 Blue Valentine Booksmart Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan Brave Brothers Captain America: The Winter Soldier Cars Chocolat Coraline Returned 30 records
Of course, much more sophisticated natural language processing will be considered in subsequent courses. Here we've not ignored capital letters in our match (which would be likely in a title), let alone considering whether a pony is a horse!
END. (C) TGG 2021, DJG 2022.