University of Cambridge, Computer Laboratory

Databases Ia Learners' Guide

The course official syllabus is on the web page Databases Ia. The syllabus page for all Computer Laboratory courses always contains a definitive summary of what is examinable.

Updates for 2023

The syllabus for 2023 had its order changed from previous years to reflect what was lectured the previous year under the previous lecturer.

Apart from a slight reorganisation of order, the only change in syllabus in 22/23 from the previous year is the (re-)introduction of the concept of normalised data schemas, where the content of a record should be dependent on the key it is stored under. There are various esoteric definitions of this concept, such as BCNF and 3NF and so on, whose names (not precise definitions) I may have given in passing, but their subtle differences are irrelevant and certainly not examinable.

FAQ

Q. With the new lecturer, will there be a change in 2023 and future years to the type of exam questions asked?
A. It is clear that Databases Ia questions in the last four years' (2019-22) have concentrated almost entirely on E/R modelling, the relational model and SQL. Although these topics can be expected to feature strongly in future questions, this limited coverage of the syllabus should never be interpreted as providing an exhaustive indication of scope. As always, the whole syllabus is examinable. 2021-03-01 asked about redundancy: the one change of syllabus for 2023 (mentioned in this document above) means that the effect of normal-form data on redundancy becomes examinable. In general, students should expect a change in 2023 onwards, with other aspects of the syllabus being examined more frequently.

Q. During your databases Ia lectures you briefly mentioned 'inverted indices'. How much depth do we need to know about this topic as I am struggling to find much about what precisely an inverted index is - I am aware of what an index is and how it can speed up record lookup but I am not sure what inverted means in this context.
A. The index to a data store is what is consulted when looking something up. Generally a logarithmic lookup time is achieved which is better than a linear scan through all keys. Given that a schema will define the primary key for a data store, the primary index is the index of those keys. An inverted index provides lookup via a different attribute or set of attributes. For instance, if a store of people has a primary index which is the artificial key 'pid' (a person identifier number), an inverted index might provide rapid lookup of people by name, returning for instance, a list of people called David Greaves. Again, the inverted index will aim to give a performance improvement with respect to a linear scan, but it will likely return more than one answer since only the primary key has the uniqueness property.

Q. How much of the Neo4j/graphical databases is precisely examinable?
The level of detail presented in lectures was fairly low, and this is the level expected for answering questions. Unlike with SQL, hands-on practical experience is not expected. You should know enough to describe the principle differences w.r.t. an rDBMS and answer questions on what might change in terms of redundancy or expressibility if data is stored in a graphical database instead of an rDBMS. The same remarks apply to the document model(s) lectured.

Q. How much of the hardware/new additions to the course is examinable (and where would you say these new additions are)?
A. The descriptions of hardware that I presented contain nothing that is not covered in more detail in the Operating Systems course. Because this is the first course in the academic year, and some students have very little idea of how operating systems and computer hardware work at the time they arrive, I felt the descriptions helped set the context. However, there is nothing examinable about them that you should not also know as part of your general knowledge or from the Operating Systems course. Computer Science is essentially a marriage of Engineering and Maths, being aware of how design decisions are likely to affect performance is examinable --- that is the Engineering aspect.
The ethos of the Databases course is logical organisation of data (not physical data organisation) and the logical concepts should be applicable over all sorts of computing device (optical, quantum, organic wet jelly etc). So although no questions directly relating to hardware or operating systems are likely to appear under the Databases monica at Tripos, complete ignorance of the efficiency aspect is never acceptable.

Q. If you were to provide a set of topics to focus on which would you say are most important?
Study the syllabus page. www.cl.cam.ac.uk/teaching/2223/Databases.

Q. I couldn't find anything on "shredding" online. As I understood it (from my project's use of ElasticSearch) "sharding" is about splitting the whole dataset between nodes, not splitting documents themselves.
A. Yes, sharding is the process of spreading out the document on to multiple servers to increase aggregate read bandwidth. If the same shard is stored on more than one server, which it often is, then this also provides redundancy (in the availibility sense, not logical/schema redundancy). The term 'shredding' is in the text book I think. I'll check on that. Shredding is where the document is split up instead of being stored monolithically. The shredding,at least as I lectured it, must be reversible, which means that the index needs to preserve/record shred order as well as provide lookup.

Q. JOINs confuse some of the students (and some insist on using the ancient implicit join syntax "... FROM a, b, c WHERE ..."). They usually struggle with the differences between an INNER JOIN and an OUTER JOIN, and I am not sure this is in the tutorial, but I am sure it ought to be.
A. The difference between these joins mostly (entirely?) depends on the existence of NULL. In the lecture slides relating to null there is a footnote mentioning the names of the joins. They are told they should seek to understand the motivation for these different forms via private study. One the other hand, knowing the precise names and semantic of each form is beyond the examinable content of the course. Overall, the SQL knowledge required for this course is not fully enumerated anywhere, but it is clarified in the next question.

Q. What precisely constitutes the 'core subset' of SQL that we should be fluent with and what does fluency mean in this context? Previous years' had a fairly heavy focus on SQL and writing SQL in exams but there is a move away from that it seems - is there still a chance of writing SQL code in exams?
A. The SQL subset you should know and be able to use in answers to exam questions is all of the constructs appearing on lecture slides and mentioned in footnotes on lecture slides (unless specifically marked as unexaminable, including recursive SQL in 2023) and anything else that was necessary to complete the practical exercises. This subset consists of all the core constructs of the query part of the language that can easily be expressed using set theory or are related to the relational algebra. There is no change of examinable subset for 2023 compared with previous years. Hence schema definitions and update and maintenance SQL statements are not examinable. When SQL answers are needed, no marks will be lost for small syntax errors (as is the case in all Tripos exams, unless accurate syntax is specifically requested). Please also see the question from a supervisor above regarding inner and outer joins.


END. Updated May 2023.