NEW: I have added answers to questions asked by other students.
See pp.15–24 and 35 of the slides for background. This page will be updated during the course of the exercise, so please check back periodically.
In this exercise, you will be provided with data collected from a simulated threshold mix, and you are asked to write a program to analyze it. The final part of the exercise is to perform a statistical disclosure attack, to de-anonymize user(s) of the system. Example code will be provided in C, but the solution may be written in any language. You are encouraged to work in teams.
For a bit of fun, I'll also be offering a prize for the best team name. You can set yours by changing answers/teamname.txt. Names are limited to 30 characters and HTML is now removed; thanks to team <code>1nj3kt0r_lulz</code> for drawing this to my attention :-)
The listing of current teams' scores is updated every few minutes.
Please register your team by sending an email to Steven.Murdoch at cl.cam.ac.uk, listing the CRSIDs of your team members, and the contents of answers/token.txt.
An IRC channel is available for students to discuss the exercise. The server is irc.srcf.ucam.org and the channel name is #taex.
What kind of success rate should we be aiming for? I'm guessing around 50% is an upper bound.
If you implement the code as I intended, you should get about 37.4% for Alice. There are some arbitrary choices to be made, so you might do a bit better or worse for this particular data set.
If you implement that attack for all users, you should get about 29% overall. The total score takes precedence over the score on attempted users, so this will earn a higher position in the table.
If you implement more sophisticated techniques, such as the two-sided statistical disclosure attack (which exploits the fact that users are likely to reply to messages) or perfect matching (which exploits the fact that messages are not duplicated), you should be able to do a bit better. Even so, 50% seems quite optimistic.
It helps to bear in mind what the data is. All you are getting are the timings of encrypted messages, with the to/from addresses stripped out. The batch size is fairly large, and the threshold mix was considered a fairly secure design when the field of anonymous communications was young. The data is certainly much more heavily anonymized than is common for commercial and government releases of personal data. Given this, it is quite surprising that it is possible to do as well as 10%, let alone 37%.
One of the goals of the exercise is to illustrate how hard it is to anonymize data. Given large quantities of very poor quality data, it is often possible to recover a surprisingly large amount of the original information.
Please send questions to Steven.Murdoch at cl.cam.ac.uk
Last modified 2011-01-03 19:55:21 +0000