Supervision 1 Solution Notes

Below are model answers to help supervisors. The solutions here are not the only possible solutions, but merely suggestions.

Discuss the practical exercises for relational databases. Do you understand the queries and what they are doing?

Supervisor notes: Model answers for the practical exercises are given on a separate page.
Currently our relational database uses a single table:
```
genres(movie_id, genre)
```
to record the genres associated with a movie. We could instead implement this with two tables as:
```
has_genre(movie_id, genre_id)
genres(genre_id, genre, ...)
```
where the first table represents the relationship and the second table contains genres. Similar changes could be made to the tables languages, certificates, locations, etc. Discuss the pros and cons of the two approaches.

Supervisor notes: The single table approach has the advantage of only requiring one join to retrieve genres, whereas two tables would require two joins, which is slower. However, with one table there is no guarantee that a given genre is represented consistently (for example, Horror and Horror Film might both be used). In addition, there may be additional information associated with a genre, such as a description, and such information would have to be duplicated in the first schema. Both of these problems are solved in the second schema. However, the second approach requires that we generate unique keys for each genre entry (most database systems provide a means of doing this easily).
Our relational database has another example of representing multiple notions in a single table — the credits table. Discuss the pros and cons of at least one alternative schema. Is this just another instance of the issue discussed in question 1, or is it conceptually different?

Supervisor notes: The credits table is actually implementing nine many-to-many relationships between movies and people and using the type column to indicate which relationship. The main problem here is that the different relationships are associated with different attributes, thus requiring the use of null values where the attributes do not apply. For example, the character attribute is only used for actors, and is null for directors, producers, composers, etc.

An advantage of using a single credits table rather than nine different tables is that it is easy to list all the people involved on a single movie (regardless of whether they were actor, director, etc.) — an operation that would otherwise require the union of nine different joins. Likewise, it is easy to list all the movies that a particular person has been involved with.
The current database encodes a person's name in the form surname, firstname (number), for example Hamill, Mark (I). This form allows convenient sorting by surname. How would you change the schema to allow names to be displayed in the form firstname surname, e.g. Mark Hamill, while still allowing sorting by surname? What issues may arise if you make such a change?

Supervisor notes: Currently there is a single attribute name, which could be split into three attributes firstname, surname and disambiguator (the roman numeral has the purpose of distinguishing several people with the same name).

Such a split would be quite easy for people with standard English names, but it quickly becomes difficult with names from other cultures. In English, the given name (first name) usually comes first, but in some cultures, the family name is always put before the given name. In some cultures, people only have a given name but not a family name; in others, people have multiple family names. Surname prefixes, as in Vincent van Gogh and Robert De Niro, confuse the system. For a name like Pope Francis, splitting into firstname and surname doesn't really make sense. And so on.

One option might be to have a sorting name attribute that is separate from the display name, but without trying to split the name into firstname and surname. However, in that case, there remains the question of how to deal with names that are normally written in a non-Latin script, such as Chinese or Arabic. Should they all be transcribed into the Latin alphabet? (This is what IMDb does.) Should the name in its original script be preserved in its non-transcribed form in a separate attribute? How are letters with accents or diacritics sorted? Different languages have different sorting rules: for example, in German the letter ä (letter a with umlaut) is sorted as ae, whereas in Swedish the same letter is sorted after z in the alphabet.

Basically, names are complicated. The W3C has some fascinating guidance on names around the world, and the Unicode Consortium has a standard for sorting strings. In general, there is probably no one solution that works well globally. Most likely, the simple approach of just a single name attribute, without further structure, is still the best.

By the way, this question is not frivolous. It demonstrates that in order to create a good data model in any database, you need to understand the domain in depth. And even the domain of a person's name — which on the surface just seems to be a string — turns out to be surprisingly complex.