Document Database Tutorial Sheet

This tutorial will help you get started with our document-oriented database, DoctorWho. Many databases management systems (DBMSs) have language bindings allowing them to be queried from your favourite high-level language, such as FSharp or Python. But here, the DBMS itself is also written in Python. What we provide is very rudimentary, so writing a few query utilities of your own is recommended.

Setting up DoctorWho

As in the previous practical on relational databases we have prepared a dataset of movies and people (actors, directors, etc.) based on IMDb. Because we are interested in unstructured text, this version also includes trivia entries.

The prepared document database is a python pickled dictionary available in zipped form from the course web site. Download it and unzip it in a folder of your choosing:

-rw-rw-r--. 1 djg11 djg11 15802411 Nov 15 10:25 imdb_doctorwho_database_v3.pickled

Make sure you have the v3 copy, since the older ones were broken.

How many tables are there?

There is just one 'table'. A document database is roughly equivalent to a key/value store. Both the keys and the values (content stored under a key) are opaque to a document database. There might be no rules of consistency and data cannot be guaranteed to be organised uniformly.

Of course, such an approach is almost useless! Instead, we must have some idea what's in our database.

Here's a hint: all of the DoctorWho keys are either PERSON_ID or MOVIE_ID values, as used in our tidy relational database.

What to do next? We need to know the schema! But being typeless, that's not documented. So instead, we'll have to take a look at some example data and infer it. Sadly, a large amount of so-called Computing Science is done that way today. You ask someone for a definition and they give you an example and don't even realise they've not answered your question.

Well, we'll jus have to look up a person and look up a film and see what is stored. We shall expect to see all the relevant information for a person or a film in a denormal form, with some of the information repeated between enties --- or indeed replicated manyfold.

Looking up a movie by ID

We query DoctorWho by writing python code. If we know the IMDB for a film, eg. tt1109624, we can just quote it

# This will be the common prelude to all of our queries. 
import sys     # talk to the operating system 
import os.path # manipulate paths to files, directories 
import pickle  # read/write pickled python dictionaries 
import pprint  # pretty print JSON

data_dir = sys.argv[1] # first command-line argument -- the directory of data 

# use os.path.join so that path works on both Windows and Unix 
doctorwho_path = os.path.join(data_dir, 'imdb_doctorwho_database_v3.pickled')
# open data dictionary file and un-pickle it
doctorwhoFile = open(doctorwho_path, mode= "rb")
doctorwho     = pickle.load(doctorwhoFile)
#####################################
# write your query code here ...

movie_key = "tt1109624" # Paddington - 2014
pprint.pprint (doctorwho[movie_key])

When you run the program, the output should look like this:

{'actors': [{'name': 'Hugh Bonneville',
             'person_id': 'nm0095017',
             'roles': ['Henry Brown']},
            {'name': 'Sally Hawkins',
             'person_id': 'nm1020089',
             'roles': ['Mary Brown']},
            {'name': 'Julie Walters',
             'person_id': 'nm0910278',
             'roles': ['Mrs. Bird']},
            {'name': 'Jim Broadbent',
             'person_id': 'nm0000980',
             'roles': ['Mr Gruber']}],
 'directors': [{'name': 'Paul King', 'person_id': 'nm1653753'}],
 'genres': ['Adventure', 'Animation', 'Comedy'],
 'minutes': '95',
 'movie_id': 'tt1109624',
 'producers': [{'name': 'David Heyman', 'person_id': 'nm0382268'}],
 'rating': '7.2',
 'title': 'Paddington',
 'trivia': {'title2': ['Paddington (2014) - IMDb'],
            'trivia_entries': ['\n'
                               "    Karen Jankel, daughter of Paddington's "
                               'creator, Michael Bond, was almost moved to '
                               'tears after the first screening she saw. "For '
                               'me, it was bringing to life the bear that was '
                               'so real to me. And I think they got it '
                               'absolutely right."          ',
                               '\n'
... ... ...
                               'and the security guard.          ']},
 'type': 'movie',
 'votes': '105159',
 'writers': [{'name': 'Michael Bond', 'person_id': 'nm1286491'},
             {'name': 'Hamish McColl', 'person_id': 'nm0566100'}],
 'year': '2014'}

You can see the document database contains JSON objects.

Looking up a movie by film title

There is no inverted index stored yet (you might make one in your own work and compare the run time), so a linear scan is used to find a film:

# This will be the common prelude to all of our queries. 
import sys     # talk to the operating system 
import os.path # manipulate paths to files, directories 
import pickle  # read/write pickled python dictionaries 
import pprint  # pretty print JSON

data_dir = sys.argv[1] # first command-line argument -- the directory of data 

# use os.path.join so that path works on both Windows and Unix 
doctorwho_path = os.path.join(data_dir, 'imdb_doctorwho_database_v3.pickled')
# open data dictionary file and un-pickle it
doctorwhoFile = open(doctorwho_path, mode= "rb")
doctorwho     = pickle.load(doctorwhoFile)
#####################################

def get_movie_by_title(str):
    # initialise output 
    the_movie = None
    # iterate through all the keys of the database looking for one with the right title 
    for key in doctorwho:
        entry = doctorwho[key]
        if 'title' in entry:
            if (entry['title'] == str):
                the_movie = entry
                break
        if (the_movie != None): break
    return the_movie

# write your query code here:
search_title    = sys.argv[2] #  "Harry Potter and the Order of the Phoenix"
pprint.pprint (get_movie_by_title(search_title))

When you run the program, the output should look like this:

{'actors': [{'name': 'Rupert Grint',
             'person_id': 'nm0342488',
             'roles': ['Ron Weasley']},
            {'name': 'Daniel Radcliffe',
             'person_id': 'nm0705356',
             'roles': ['Harry Potter']},
            {'name': 'Emma Watson',
             'person_id': 'nm0914612',
             'roles': ['Hermione Granger']},
            {'name': 'Brendan Gleeson',
             'person_id': 'nm0322407',
             'roles': ["Alastor 'Mad-Eye' Moody"]}],
 'directors': [{'name': 'David Yates', 'person_id': 'nm0946734'}],
 'genres': ['Action', 'Adventure', 'Family'],
 'minutes': '138',
...

Finding a person using their name

And the same applies to find a person:

The following code will take two arguments: the directory of the database and a person's name:

# This will be the common prelude to all of our queries. 
import sys     # talk to the operating system 
import os.path # manipulate paths to files, directories 
import pickle  # read/write pickled python dictionaries 
import pprint  # pretty print JSON

data_dir = sys.argv[1] # first command-line argument -- the directory of data 
search_item = sys.argv[2] # for example "Keeley Hawes"
# use os.path.join so that path works on both Windows and Unix 
doctorwho_path = os.path.join(data_dir, 'imdb_doctorwho_database_v3.pickled')
# open data dictionary file and un-pickle it
doctorwhoFile = open(doctorwho_path, mode= "rb")
doctorwho     = pickle.load(doctorwhoFile)



# Iterate through all the keys of the database looking for a person with the right name
def get_person_by_name(name):
    the_person = None
    for key in doctorwho:
        entry = doctorwho[key]
        if 'name' in entry:
            if (entry['name'] == name):
                the_person = entry
                break
        if (the_person != None): break
    return the_person

the_person = get_person_by_name(search_item)
pprint.pprint (the_person)

When you run the program, the output should look like this:

{'acted_in': [{'movie_id': 'tt0795368',
               'roles': ['Jane'],
               'title': 'Death at a Funeral',
               'year': '2007'}],
 'birthYear': '1976',
 'name': 'Keeley Hawes',
 'person_id': 'nm0369954'}

Sorted list of film titles that have something to do with bears

Here we demonstrate sorting the output, limiting the number of returned records and searching for a string in the title and trivia.

# This will be the common prelude to all of our queries. 
import sys     # talk to the operating system 
import os.path # manipulate paths to files, directories 
import pickle  # read/write pickled python dictionaries 
import pprint  # pretty print JSON

data_dir = sys.argv[1] # first command-line argument -- the directory of data 

# use os.path.join so that path works on both Windows and Unix 
doctorwho_path = os.path.join(data_dir, 'imdb_doctorwho_database_v3.pickled')
# open data dictionary file and un-pickle it
doctorwhoFile = open(doctorwho_path, mode= "rb")
doctorwho     = pickle.load(doctorwhoFile)
#####################################
# write your query code here ...


def get_movie_by_subject(str):
    # initialise output 
    results = []
    for key in doctorwho:
        if ("tt" in key): # Check it's a film
            entry = doctorwho[key]
            title = entry['title']
            selected = False
            if (str in title): selected = True
            trivia = entry['trivia']
            if (trivia != None):
                for triv in trivia['trivia_entries']:
                    if (selected): break;
                    if (str in triv): selected = True
            if (selected): results.append(title)
    return results

# Note: append in Python is nothing like the @ in ML: it is mutable and adds only a single item.

def display_ans(line_limit, ans):
    ans = sorted(ans) # Sort into ascending order
    ans = ans[0:line_limit]  # Limit number of entries
    for item in ans:
        print(f'{item}')
    print(f'Returned {len(ans)} records')

ans = get_movie_by_subject("bear")
display_ans(30, ans)

When you run the program, the output should look like this:

10 Cloverfield Lane
28 Days Later...
28 Weeks Later
3:10 to Yuma
A History of Violence
A Scanner Darkly
A Serious Man
A Tale of Two Sisters
A.I. Artificial Intelligence
About Schmidt
Adaptation.
American Sniper
Anchorman: The Legend of Ron Burgundy
Apocalypto
Avatar
Avengers: Age of Ultron
Batman Begins
Beasts of No Nation
Big Hero 6
Black Panther
Blade Runner 2049
Blue Valentine
Booksmart
Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan
Brave
Brothers
Captain America: The Winter Soldier
Cars
Chocolat
Coraline
Returned 30 records

Of course, much more sophisticated natural language processing will be considered in subsequent courses. Here we've not ignored capital letters in our match (which would be likely in a title), let alone considering whether a pony is a horse!

END. (C) TGG 2021, DJG 2022.