This is an exercise relating to the family of statistical disclosure attacks, which are used to de-anonymize traces of anonymized communications. A list of recommended papers explaining those techniques can be found at the end of this document.
A network trace is provided containing the observations of an eavesdropper who records the operation of a small mix network. The ultimate objective of the exercise is to extract as much information as possible about the correspondence between sent and received messages, which the mix network tries to conceal. A series of questions about this dataset has to be answered, starting from simple descriptions of the data and working up to de-anonymizing the messages of target users with simple techniques. A free-form exercise concludes the session, in which we try to improve the effectiveness of the de-anonymization techniques as much as possible.
Two datasets are provided. The "Example" dataset contains the full traces of communications within an example network, as the attacker and the mix would see them. It can be used to test any technique and evaluate its effectiveness against a known ground truth. The "Exercise" dataset contains only the traces that the adversary would passively collect, and is used to evaluate and compare the techniques produced for question (9).
There are two bundles available:
The message traces are based on real social networks sampled from Facebook. Each user has a small number of friends and/or colleagues, with whom they have a close or casual relationship. Depending on the type and strength of each relationship, users send each other mail and reply to each other over the week. Relationships are fully symmetric.
All messages are relayed through a simple anonymization system consisting of a single threshold mix. Network communication is assumed to be near-instantaneous, so the only latency introduced in the system is due to mixing.
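To make the behaviour of a threshold mix concrete, here is a minimal sketch in Python. It is not the code that generated the datasets; the class name `ThresholdMix`, the `receive` method, and the threshold value are illustrative assumptions. The mix buffers incoming messages until a fixed number have arrived, then flushes them all at once in random order, which is why mixing is the only source of latency.

```python
import random


class ThresholdMix:
    """Toy threshold mix (hypothetical helper, not part of the exercise bundles)."""

    def __init__(self, threshold, rng=None):
        self.threshold = threshold          # batch size: messages needed to trigger a flush
        self.rng = rng or random.Random()   # source of randomness for the shuffle
        self.pool = []                      # messages waiting for the next flush

    def receive(self, message):
        """Buffer a message; return the shuffled batch if the threshold is reached, else []."""
        self.pool.append(message)
        if len(self.pool) < self.threshold:
            return []
        batch, self.pool = self.pool, []
        self.rng.shuffle(batch)             # output order reveals nothing about input order
        return batch
```

An eavesdropper watching such a mix sees one group of inputs followed by one group of outputs per round, and that is the only structure the later questions rely on.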
0) Have a look inside the "Example" files and understand their structure.
1) How many messages are sent through the mix in total? How does the rate of message transmission vary according to the time and day?
2) What are the characteristics of the mix network? (Batch size?)
3) What is the average, minimum and maximum latency of messages relayed through the mix?
4) What is the probability each recipient receives a message?
5) What is the probability that each recipient receives a message during a round in which Alice sends a message? In the Example dataset, Alice has ID 0809ef07, and in the Exercise dataset, Alice has ID e29b9154.
6) What is the probability Alice sends to each recipient? (Use the results from (4) and (5) to perform a simple SDA.)
7) For each message that Alice sends, which output message most likely corresponds to it? (Produce a .res file and compare it with the example dataset.)
8) How could one improve the performance of the de-anonymization, to get more correct correspondences between input and output messages?
9) Implement a better de-anonymization technique; produce a .res file for your technique against the exercise dataset, and send it to the instructors along with a description of your techniques and your code. Try de-anonymizing messages of other users, e.g. 03295680 and c5d98515 in the Example dataset; 38546347 and 0d351e16 in the Exercise dataset.
10) Can you infer the strength and type of relationship for each pair of users?
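As a hint for questions (4) to (6), the following sketch shows one common form of the simple statistical disclosure attack on toy data. It makes several assumptions that are not part of the exercise material: the in-memory `rounds` structure (a list of sender/recipient pairs per mix round), the user names, and the function names are all hypothetical, and the parsing of the provided files is left out. It also assumes that the batch size is known and that Alice sends one message in each round in which she participates.

```python
from collections import Counter


def recipient_distribution(rounds):
    """Fraction of received messages going to each recipient (question 4 style)."""
    counts = Counter()
    for _senders, recipients in rounds:
        counts.update(recipients)
    total = sum(counts.values())
    return {r: c / total for r, c in counts.items()}


def simple_sda(rounds, target, batch_size):
    """Estimate the target's sending profile from round-level observations."""
    background = recipient_distribution(rounds)                  # all rounds
    target_rounds = [rd for rd in rounds if target in rd[0]]
    observed = recipient_distribution(target_rounds)             # rounds with the target
    scores = {}
    for r in background:
        # observed ~ (1/B) * profile + ((B-1)/B) * background, solved for profile
        est = batch_size * observed.get(r, 0.0) - (batch_size - 1) * background[r]
        scores[r] = max(est, 0.0)                                # clip negative noise
    total = sum(scores.values())
    return {r: s / total for r, s in scores.items()} if total else scores


# Toy trace with batch size 3: (senders in the round, recipients in the round).
rounds = [
    (["alice", "x", "y"], ["bob", "p", "q"]),
    (["x", "y", "z"], ["p", "q", "p"]),
    (["alice", "y", "z"], ["bob", "q", "p"]),
    (["x", "z", "y"], ["q", "p", "q"]),
]
estimate = simple_sda(rounds, "alice", batch_size=3)
print(max(estimate, key=estimate.get))
```

Using the overall distribution from (4) as the background is an approximation, since it also includes Alice's own traffic; estimating the background only from rounds in which Alice is silent is one possible refinement to consider for question (8).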
Last modified 2009-08-02 16:47:51 +0100