Databases
NB: Despite having a new lecturer this year, the changes from last year are all superficial: the changes in syllabus and lecture titles just reflect what was taught last year and are not really a change.
Practical Work & Assessed Exercises (ticks)
There are two Exercises for this course with associated Help-and-Tick sessions. Please go to the following web page and then add your own crsid to the end of the URL: Exercise Worksheets. The first Exercise uses SQL and HSQLDB. The second Exercise uses the DoctorWho document database.
For the first Exercise, you are expected to learn some basic SQL from the tutorial below or from elsewhere. For the second Exercise you need to know enough Python to understand the constructs used in the second tutorial.
The first Help-and-Tick session will be on the afternoon of 16th November and the second is one week later. Details are on the
Exercise Worksheets link.
[The practical classes mentioned in one of the on-line course diaries (and elsewhere) for week 3 and 4 do not exist and are replaced with the two Help-and-Tick sessions.]
Assessed Exercise submission deadlines are given in the Head Of Department's Notice www.cst.cam.ac.uk/teaching/exams/hod-notice/part-ia. It is important to have given in a substantially correct effort by the formal deadline, even if there are some small bits missing. You need not to have had the work marked and ticked off by the submission deadline.
The second Exercise worksheet is available through the same mechanism as the first.
Please be aware that work submitted for Assessed Exercises must be entirely your own work and that publishing or otherwise sharing answers to exercises risks you being penalised for "academic misconduct" if someone plagiarises your work.
Primary Lecture Materials
Note: as well as reading all the material on the Syllabus, as lectured, you are expected to learn a basic subset of SQL in private study. Please see the relational tutorial below. SQL today is a very big language with the original, tiny, core language still alive inside it. You are expected to become fluent with that core subset, not extending beyond anything lectured.
- Lectures 1 to 4: slides PDF (84 pages).
- Lectures 5 to 8: slides PDF (88 pages).
Erratum: A slide in Lecture 8 (fixed now in rev C) previously said that deconstructing the parts of a field into sub-fields violated entity integrity. But that was the wrong term, sorry. I meant to say that splitting a field into sub-fields violated value integrity and that generally a better schema is preferable where the sub-fields are properly documented by being normal fields. It's also a somewhat academic point since regular expression searches of field contents and using other predicates on fields (eg SQL's in test) are very commonly used: a predicate applied to a field is not very different from extracting a sub-field as data.
- Learners' Guide and FAQ: Learners Guide
- Supervision Suggestions:
- Supervision 1: Kick-off Supervision Materials LINK. Main sheet Supervision Sheet 1.
- Supervision 2A: Supervision Sheet 2A.
- Supervision 2B: Supervision Sheet 2B.
- In Supervision 2 or 3, do also discuss your written answers to the Doctor Who Assessed Exercise.
- Supervision 3: Supervision Sheet 3.
HSQLDB Practical Materials
We shall use an rDBMS called HSQLDB for some of the supervision exercises and Assessed Exercises. Note there are plenty of other SQL implementations online, including interactive forms-driven study aids. Feel free to use these for private experiments while exploring the language.
For data, we shall use a subset of the IMBD website data contained in the following snapshot: https://www.cl.cam.ac.uk/teaching/2122/Databases/movies-relational.zip.
Helpful JDK8/Linux Setup Note.
Helpful Windows setup note: please see the start of the Relational Tutorial.
Relational Tutorial LINK (html).
Document/Aggregate Database Materials
Document/aggregate database tutorial LINK (html).
Document/aggregate database data file imdb_doctorwho_database_v3.zip.
Example query template code (for the second Assessed Exercise) template.zip.
Secondary Materials
Peter Chen's E/R Modelling Paper (PDF).
Are Nulls Evil - A Discussion PDF.
A few extra slides that illustrate outer joins and also generate the "Different key nestings of (semi-)structured data" slide in Lecture 6 LINK (pdf).
Web article illustrating equi, inner and outer joins Natural vs Equijoin. See Lemahieu 7.3.1.5 too.
Example Xpath code, embedded in Python, for extracting trivia fields from IMDB scapings xpath-example.zip.
Have a quick browse through the Neo4j Cypher language reference manual
Please also browse last year's secondary material.
Finally, as a hardware person, I do like to refer to hardware and efficiency quite a lot. For those of you with no computer hardware knowledge, I recommend all of Video 1 and the middle third of Video 4 on this LINK. This mirrors content on the pre-arrival course.
- See the Neo4j tutorial from a previous year: Graph Tutorial 2020.
A book on graph algorithms using Neo4j: Graph Algorithms (full text, PDF) note: most of the content relates to the Ia Algorithms course, rather than this course.
A few "NoSQL" pointers
- NoSQL Movement : http://en.wikipedia.org/wiki/NoSQL_(concept)
- A list of NoSQL database systems : http://nosql-database.org/
- Berkeley DB : http://en.wikipedia.org/wiki/Berkeley_DB
- Graph Databases : http://en.wikipedia.org/wiki/Graph_database
- Dremel: Interactive Analysis of Web-Scale Datasets
- F1: A Distributed SQL Database That Scales
Further reading
- Tarkski's 1941 paper "On the Calculus of Relations".
- A short biography of Alfred Tarski http://en.wikipedia.org/wiki/Alfred_Tarski.
- Codd's original 1970 paper describing the relational model (reprinted here in 1983).
- A short biography of Edgar Codd http://en.wikipedia.org/wiki/Edgar_F._Codd.
- Chen's original 1976 paper on Enitity-Relationship models. SEE ABOVE LINK.
- A short biography of Peter Chen http://en.wikipedia.org/wiki/Peter_Chen.
- Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
END