Supervision 2 Solution Notes

Below are model answers to help supervisors. The solutions here are not the only possible solutions, but merely suggestions.

Discuss the practical exercises for graph databases. Do you understand the queries and what they are doing?

Supervisor notes: Model answers for the practical exercises are given on a separate page.
In SQL, the creator of a database needs to define the schema up front: that is, what columns each table is going to have, and the datatype of each column. This means that all rows in the table have the same set of columns (although some of them may be set to NULL if the value is unknown).

On the other hand, Cypher does not have an explicit schema: a node can have any set of properties, and you can always add a property with a new name to an existing node.

Discuss the pros and cons of these two approaches.

Supervisor notes:

A schema constrains the data that can be written to a database — it is like a static type checker that you find in a language like ML or Java. The database only allows you to write data that conforms to the schema. This can mean fewer surprises when the data is read: the schema tells you exactly what columns you are going to get back. The schema also acts as a form of documentation (although you generally still need documentation to tell you how to interpret values).

If you ever need to change the structure of the data (for example, if you want to start recording the names of people or the titles of movies in their original script, such as Chinese or Arabic), you would need to change the schema. Databases allow you to do this: for example, you can issue a command to add a new column to an existing table. In some database management systems, such schema changes are expensive.

In Cypher's dynamic approach, there is no need for explicit schema changes, because anybody could add a new property to a node. This can make it easier to manage, but also increases the risk that someone accidentally writes incorrectly formed data to the database.

Cypher's approach is more flexible if there are lots of different entities and relationships with different properties (which in SQL would require either lots of tables, or columns with lots of NULL values). For example, our credits table in the relational schema jammed nine different types of relationship into a single table, and required lots of nulls. The Cypher representation of these relationships and their properties is arguably more elegant.

In the graph databases tutorial we showed a Cypher query for find the number of co-actors for Jennifer Lawrence. How would you write the same query in SQL?

Supervisor notes:

SELECT count(*)
FROM people p1
JOIN credits c1 ON c1.type = 'actor' AND c1.person_id = p1.id
JOIN credits c2 ON c2.type = 'actor' AND c2.movie_id = c1.movie_id
WHERE p1.name = 'Lawrence, Jennifer (III)' AND c2.person_id <> p1.id;

On the small database this query also returns None, like the Cypher query.

We discussed several examples of Cypher queries that have close equivalents in SQL, and some examples that are hard to express in SQL. Discuss: would it make sense to move existing SQL-based applications to Cypher? In what circumstances would you choose one query language over the other?

Supervisor notes:

SQL has the advantage of being a lot more popular and mature. That is not an inherent advantage in the language design, but makes a big practical difference: there are many database systems, tools, textbooks, courses etc. that use SQL, and many people are familiar with it.

Aside from the popularity aspect, Cypher can express many common SQL queries equally well, and some kinds of query are much more concise and readable in Cypher (especially those involving many joins and path traversals). It is possible to find some queries that can be written in SQL but not in Cypher, but they are fairly obscure. So it seems that Cypher is generally more expressive than SQL.

An advantage of the graph data model is that it easily allows any node to have a relationship to any other. For example, in a social network like Facebook, a user can have a like relationship to almost anything: a photo, a comment, a rock band, a shop, an event, a political cause, etc. In a relational model, allowing such relationships from anything to anything would cause an explosion in the number of tables (if there are n different tables representing entity types, there might be O(n²) different tables representing relationships between them).
In the example graph database, there are 22 nodes of label Genre, but in the relational database, the genres table has 273 rows. How do you explain this discrepancy? What are the consequences of this difference in data model?

Supervisor notes:

This question is closely related to question 2 from supervision 1. The graph database represents each distinct genre (e.g. "Drama", "Action") as a single node, and movies are associated with genres by a HAS_GENRE relationship from the movie node to the genre node. This makes our graph database more like the alternative two-table schema from the last supervision:
```
has_genre(movie_id, genre_id)
genres(genre_id, genre, ...)
```
Note that there are 273 HAS_GENRE relationships in the graph (as we discovered in the Getting Started document for Cypher), which correspond to the rows in the genres table in the current relational database.

In Cypher, making a genre a distinct entity (node) is more natural because the joins are easier to write, and because the graph model does not have a concept of weak entities: every node can stand alone and does not depend on the existence of any other node. Multi-valued attributes exist to a limited extent (Neo4j supports an array datatype for node properties), but they are less convenient if you want to search for all movies of a particular genre.
What do you think the database software is doing internally when you ask it to find the shortest path between two nodes? Describe it in words (no code required).

Supervisor notes:

It performs a breadth-first search on a graph, which the students will learn about in the Algorithms course in Lent term. Intuitively, it works as follows:

First, find the node from which to start the search (for example, the node representing Jennifer Lawrence). Follow all outgoing ACTS_IN relationships to find movies she has acted in, and from there all incoming ACTS_IN relationships to find her co-actors.

If the set of co-actors includes the end node of the search (for example, the node representing Matt Damon), we're finished. If not, repeat the previous step of following ACTS_IN relationships to find all second-degree co-actors. This process keeps repeating until the end node is found.

In the process, the algorithm must keep track of any nodes it has visited before, otherwise it would keep going round in circles and never finish.