Traffic Analysis Exercise

George Danezis and Steven J. Murdoch

This is an exercise relating to the family of statistical disclosure attacks, which are used to de-anonymize traces of anonymized communications. A list of recommended papers explaining those techniques can be found at the end of this document.

A network trace is provided containing the observations of an eavesdropper who records the operation of a small mix network. The ultimate objective of the exercise is to extract as much information as possible about the correspondence between sent and received messages, which the mix network tries to conceal. A series of questions must be answered about this dataset, starting with simple descriptions of the data and working up to de-anonymizing messages from target users with simple techniques. A free-form exercise concludes the session, in which we try to improve the effectiveness of the de-anonymization techniques as much as possible.

Datasets

Two datasets are provided. The "Example" dataset contains the full traces from communications within an example network, as the attacker and the mix would see them. This can be used to test any techniques and evaluate their effectiveness against a known ground truth. The "Exercise" dataset contains only traces that the adversary would passively collect, and is meant to evaluate and compare the techniques in Exercise (9).

There are two bundles available:

version 0.4, released 2009-08-02

Instructor (30 MB): Includes full information for both the Example and Exercise datasets, and full source code.
Student (20 MB): Includes full information for the Example dataset, but for the Exercise dataset only the information which would be visible to an attacker is provided. Also, only the script for scoring traffic analysis results is included.

Data formats

.net files: describe the friendship relations between users. Each file is plain text, where each line describes a relationship between two users as a comma-separated tuple. E.g. "00e07565,3cbc2291\n"
.soc files: describe the type of relationship between users. Users can be work friends ('w') or social friends ('f') or both, and have close relations ('c'). Each line is a comma-separated tuple of users followed by "|"-separated flags. E.g. "00fe1613,39cc8b09,c|f\n"
.dot / .png: contains the structure of the social network in DOT format and a corresponding PNG image of the graph. You can compile the DOT file by using GraphViz and the command "neato -Tpng -O Example.dot".
.msg: Lists all the messages sent between users of the system, before anonymization. Each line is a comma-separated list representing a single message. The format of each line is: "ID1,ID2,CONVID,SENDER,RECEIVER,REPLYSEQ,DAYIN,TIMEIN,DAYOUT,TIMEOUT\n". ID1 and ID2 are the IDs of the message as it enters and leaves the anonymity network respectively. CONVID and REPLYSEQ link the message to a conversation: CONVID identifies the conversation and REPLYSEQ is the serial number of the message within it. SENDER and RECEIVER are the sender and receiver of the message, and DAYIN,TIMEIN,DAYOUT,TIMEOUT are the days and times at which the message was injected into the anonymity network and delivered to the final recipient. E.g. "3e0114ac458f7b870268,654170970690e38610c9,2444,07fcd9ab,d79a70af,1,0,0.0551969561121,0,9.17713597036\n"
.trc: Lists the traces as an adversary would observe them. Each line represents either a message going from a user to the MIX or a message going from the MIX to a user upon flushing. The format of each line is: ID,SENDER,RECEIVER,ROUND,DAY,TIME, where all fields have the same meaning as in .msg files, and ROUND is the round number of the MIX (included for convenience). E.g. "b7336bda5de37294339e,c6af162f,MIX,0,0,9.17713597036\n" "aacc0e819fa35ff57331,MIX,948dca88,0,0,9.17713597036\n"
.res: results files representing the outcome of the traffic analysis. Each line links two message IDs, one that entered the MIX and one that left the MIX. The program TestResults.py can be used to evaluate the success of the attack given the results file, using the command "python TestResults.py RESULTS.res MSGFILE.msg". E.g. "01c128b6eb488cbeab0a,c046b5f659bdc0a43569\n"
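
The line formats above can be read with a few lines of Python. The following is a minimal sketch rather than part of the distributed scripts: the names Message, Trace, parse_msg_line, parse_trc_line and load are illustrative, and the numeric types chosen for REPLYSEQ, ROUND, DAY and TIME are assumptions based on the example lines.

```python
from collections import namedtuple

# Field order follows the .msg and .trc formats described above.
# The tuple and function names are illustrative, not part of the bundle.
Message = namedtuple(
    "Message",
    "id_in id_out conv_id sender receiver reply_seq day_in time_in day_out time_out",
)
Trace = namedtuple("Trace", "msg_id sender receiver round day time")


def parse_msg_line(line):
    """Parse one line of a .msg file (assumes integer days, float times)."""
    f = line.strip().split(",")
    return Message(f[0], f[1], f[2], f[3], f[4], int(f[5]),
                   int(f[6]), float(f[7]), int(f[8]), float(f[9]))


def parse_trc_line(line):
    """Parse one line of a .trc file; sender or receiver is the literal 'MIX'."""
    f = line.strip().split(",")
    return Trace(f[0], f[1], f[2], int(f[3]), int(f[4]), float(f[5]))


def load(path, parser):
    """Read a whole file, skipping blank lines."""
    with open(path) as fh:
        return [parser(line) for line in fh if line.strip()]
```

For instance, load("Example.trc", parse_trc_line) would return a list of Trace tuples, assuming the trace file is named by analogy with Example.dot.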

Details

The message traces are based on real social networks sampled from Facebook. Each user has a small number of friends and/or colleagues, with whom they have close or casual relationships. Based on the type and strength of each relationship, users send each other mail and reply to each other over the week. Relationships are fully symmetric.

All messages are relayed through a simple anonymization system consisting of a single threshold mix. Network communication is assumed to be near instantaneous, so the only latency introduced in the system is due to mixing.
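
To make the behaviour of a threshold mix concrete, here is a small illustrative simulation. It is a sketch under assumptions, not the generator used to produce the datasets: the class name ThresholdMix, its receive method and the threshold parameter are hypothetical. The mix stores incoming messages until the threshold is reached, then flushes the whole batch in random order at the arrival time of the last message.

```python
import random


class ThresholdMix:
    """Toy threshold mix: buffer messages, flush when `threshold` have arrived."""

    def __init__(self, threshold, rng=None):
        self.threshold = threshold
        self.rng = rng or random.Random()
        self.pool = []   # messages waiting for the next flush
        self.round = 0   # number of flushes performed so far

    def receive(self, sender, receiver, time):
        """Accept one message; return the flushed batch, or [] if still waiting."""
        self.pool.append((sender, receiver, time))
        if len(self.pool) < self.threshold:
            return []
        batch, self.pool = self.pool, []
        self.rng.shuffle(batch)  # output order reveals nothing about input order
        # Every message in the batch leaves at the time of the triggering message.
        flushed = [(self.round, recv, time) for _sender, recv, _t_in in batch]
        self.round += 1
        return flushed
```

In this model a message's latency is simply the wait until its batch fills up, which matches the statement that mixing is the only source of delay.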

Questions

Descriptive questions

0) Have a look inside the "Example" files and understand their structure.

1) How many messages are sent through the mix in total? How does the rate of message transmission vary according to the time and day?

2) What are the characteristics of the mix network? (Batch size?)

3) What is the average, minimum and maximum latency of messages relayed through the mix?
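
As a starting point for questions (1) to (3), the following sketch computes per-round batch sizes, message counts per time bucket, and latency statistics. The function names are illustrative. Traces are taken as (ID, SENDER, RECEIVER, ROUND, DAY, TIME) tuples and timings as (DAYIN, TIMEIN, DAYOUT, TIMEOUT) tuples. The sketch assumes that TIME is measured in hours within a day, hence the default of 24 units per day; this should be checked against the data.

```python
from collections import Counter


def round_sizes(traces):
    """Number of messages leaving the mix in each round (hints at the batch size)."""
    sizes = Counter()
    for _msg_id, sender, _receiver, rnd, _day, _time in traces:
        if sender == "MIX":
            sizes[rnd] += 1
    return sizes


def hourly_counts(traces):
    """Messages entering the mix, bucketed by (day, whole part of TIME)."""
    counts = Counter()
    for _msg_id, _sender, receiver, _rnd, day, time in traces:
        if receiver == "MIX":
            counts[(day, int(time))] += 1
    return counts


def latency_stats(timings, units_per_day=24.0):
    """Return (min, mean, max) latency; units_per_day=24 assumes TIME is in hours."""
    lat = [(d_out - d_in) * units_per_day + (t_out - t_in)
           for d_in, t_in, d_out, t_out in timings]
    return min(lat), sum(lat) / len(lat), max(lat)
```

If every round reported by round_sizes has the same size, that value is a natural candidate for the threshold of the mix.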

Traffic analysis questions

4) What is the probability each recipient receives a message?

5) What is the probability that each recipient receives a message during a round in which Alice sends a message? In the Example dataset, Alice has ID 0809ef07; in the Exercise dataset, Alice has ID e29b9154.

6) What is the probability Alice sends to each recipient? (Use the results from (4) and (5) to perform a simple SDA.)

7) For each message Alice sends, which output message most likely corresponds to it? (Produce a .res file and score it against the Example dataset.)
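
One possible way to approach questions (4) to (7) is sketched below. It is a simplified statistical disclosure attack under stated assumptions, not a reference solution: the function name sda is illustrative, the batch size is assumed to be known and constant, and Alice is assumed to send about one message per round in which she participates. With background distribution u (question 4), observed distribution o over Alice's rounds (question 5) and batch size b, the estimate of Alice's sending profile is b*o - (b-1)*u. Traces are (ID, SENDER, RECEIVER, ROUND, DAY, TIME) tuples.

```python
from collections import Counter, defaultdict


def sda(traces, alice, batch_size):
    """Simple SDA sketch; returns (u, o, profile, guesses)."""
    outputs = defaultdict(list)   # round -> [(output msg id, receiver)]
    alice_in = defaultdict(list)  # round -> [ids of messages Alice sent]
    for msg_id, sender, receiver, rnd, _day, _time in traces:
        if sender == "MIX":
            outputs[rnd].append((msg_id, receiver))
        elif receiver == "MIX" and sender == alice:
            alice_in[rnd].append(msg_id)

    # (4) background: how often each recipient receives, over all rounds
    background = Counter(r for out in outputs.values() for _id, r in out)
    # (5) the same count, restricted to rounds in which Alice sends
    observed = Counter(r for rnd in alice_in
                       for _id, r in outputs.get(rnd, []))
    n_bg, n_obs = sum(background.values()), sum(observed.values())
    u = {r: c / n_bg for r, c in background.items()}
    o = {r: c / n_obs for r, c in observed.items()}

    # (6) SDA estimate; scores can be negative and are not true probabilities
    profile = {r: batch_size * o.get(r, 0.0) - (batch_size - 1) * u[r]
               for r in u}

    # (7) link each of Alice's messages to the output of the same round
    #     whose receiver has the highest score
    guesses = []
    for rnd, ids in alice_in.items():
        if not outputs.get(rnd):
            continue  # round never flushed in the trace
        best_id, _receiver = max(outputs[rnd], key=lambda p: profile[p[1]])
        guesses.extend((msg_id, best_id) for msg_id in ids)
    return u, o, profile, guesses
```

Each pair in guesses can be written to a .res file as one line of the form "ID_IN,ID_OUT" and scored with TestResults.py on the Example dataset.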

Free form exercise

8) How could one improve the performance of the de-anonymization, to get more correct correspondences between input and output messages?

9) Implement a better de-anonymization technique; produce a .res file for your technique against the Exercise dataset, and send it to the instructors along with a description of your techniques and your code. Try de-anonymizing messages of other users, e.g. 03295680 and c5d98515 in the Example dataset; 38546347 and 0d351e16 in the Exercise dataset.

10) Can you infer the strength and type of relationship for each pair of users?