                           Traffic Analysis Exercise

   George Danezis and Steven J. Murdoch

   This is an exercise relating to the family of statistical disclosure
   attacks, which are used to de-anonymize traces of anonymized
   communications. A list of recommended papers explaining those techniques
   can be found at the end of this document.

   A network trace is provided containing the observations of an eavesdropper
   who records the operation of a small mix network. The ultimate objective
   of the exercise is to extract as much information as possible about the
   correspondence between sent and received messages, which the mix network
   tries to conceal. A series of questions about this dataset has to be
   answered, starting with simple descriptions of the data and working up to
   de-anonymizing messages from target users using simple techniques. A
   free-form exercise concludes the session, in which we try to improve the
   effectiveness of the de-anonymization techniques as much as possible.

Datasets

   Two datasets are provided. The "Example" dataset contains the full traces
   from communications within an example network, as the attacker and the mix
   would see them. This can be used to test any techniques and evaluate their
   effectiveness against a known ground truth. The "Exercise" dataset
   contains only traces that the adversary would passively collect, and is
   meant to evaluate and compare the techniques in Exercise (9).

   There are two bundles available:

   version 0.4, released 2009-08-02

   Instructor (30 MB)
           Includes full information for both the Example and Exercise
           datasets, and full source code.

   Student (20 MB)
           Includes full information for the Example dataset, but for the
           Exercise dataset only the information which would be visible to an
           attacker is provided. Also, only the script for scoring traffic
           analysis results is included.

  Data formats

   .net files
           describe the friendship relations between users. Each is a plain
           text file in which each line describes a relationship between two
           users, represented as a comma-separated tuple. E.g.
           "00e07565,3cbc2291\n"

   .soc files
           describe the type of relationship between users. Users can be
           work friends ('w') or social friends ('f') or both, and have close
           relations ('c'). Each line is a comma-separated tuple of users
           followed by "|"-separated flags. E.g. "00fe1613,39cc8b09,c|f\n"

   .dot / .png
           contain the structure of the social network in DOT format and a
           corresponding PNG image of the graph. You can compile the DOT file
           using GraphViz and the command "neato -Tpng -O Example.dot".

   .msg
           Lists all the messages sent between users of the system, before
           anonymization. Each line is a comma separated list representing a
           single message. The format of each line is:
           "ID1,ID2,CONVID,SENDER,RECEIVER,REPLYSEQ,DAYIN,TIMEIN,DAYOUT,TIMEOUT\n".
           ID1 and ID2 are the IDs of the message as it enters and leaves
           the anonymity network respectively. CONVID links the message to a
           conversation, and REPLYSEQ is the serial number of the message
           within that conversation. SENDER and RECEIVER are the sender and
           receiver of the message, and DAYIN,TIMEIN,DAYOUT,TIMEOUT are the
           days and times at which the message was injected into the
           anonymity network and delivered to the final recipient. E.g.
           "3e0114ac458f7b870268,654170970690e38610c9,2444,07fcd9ab,d79a70af,1,0,0.0551969561121,0,9.17713597036\n"

   .trc
           Lists the traces as an adversary would observe them. Different
           lines represent messages going from users to the MIX and from the
           MIX to users upon flushing. The format of each line is:
           ID,SENDER,RECEIVER,ROUND,DAY,TIME, where all fields have the same
           significance as for .msg files, and ROUND is the round number of
           the MIX (for convenience). E.g.
           "b7336bda5de37294339e,c6af162f,MIX,0,0,9.17713597036\n"
           "aacc0e819fa35ff57331,MIX,948dca88,0,0,9.17713597036\n"

   .res
           are results files representing the outcome of the traffic
           analysis. Each line links two message IDs, one that entered the
           MIX and one that left the MIX. The program TestResults.py can be
           used to evaluate the success of the attack given the results file,
           using the command "python TestResults.py RESULTS.res MSGFILE.msg".
           E.g. "01c128b6eb488cbeab0a,c046b5f659bdc0a43569\n"
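
   The line formats above can be read with a few lines of Python. The sketch
   below is only an illustration and is not part of the distributed bundles:
   the names Message, TraceEvent, parse_soc_line, parse_msg_line,
   parse_trc_line and format_res are hypothetical, and it assumes that
   REPLYSEQ, ROUND and the DAY fields are integers and the TIME fields are
   decimal numbers, as in the sample lines.

```python
from collections import namedtuple

# Field names are illustrative; they mirror the column names given above.
Message = namedtuple(
    "Message",
    "id_in id_out conv_id sender receiver reply_seq day_in time_in day_out time_out",
)
TraceEvent = namedtuple("TraceEvent", "msg_id sender receiver round day time")


def parse_soc_line(line):
    """Return (user_a, user_b, flags) for one .soc line; flags is a set."""
    user_a, user_b, flags = line.strip().split(",")
    return user_a, user_b, {flag for flag in flags.split("|") if flag}


def parse_msg_line(line):
    """Parse one .msg line; message IDs and CONVID are kept as strings."""
    f = line.strip().split(",")
    return Message(f[0], f[1], f[2], f[3], f[4], int(f[5]),
                   int(f[6]), float(f[7]), int(f[8]), float(f[9]))


def parse_trc_line(line):
    """Parse one .trc line; either sender or receiver is the literal 'MIX'."""
    msg_id, sender, receiver, rnd, day, time = line.strip().split(",")
    return TraceEvent(msg_id, sender, receiver, int(rnd), int(day), float(time))


def format_res(pairs):
    """Render (input_id, output_id) guesses as the text of a .res file."""
    return "".join(f"{id_in},{id_out}\n" for id_in, id_out in pairs)


# The sample lines quoted in the format descriptions above.
soc = parse_soc_line("00fe1613,39cc8b09,c|f\n")
msg = parse_msg_line(
    "3e0114ac458f7b870268,654170970690e38610c9,2444,07fcd9ab,d79a70af,"
    "1,0,0.0551969561121,0,9.17713597036\n"
)
evt = parse_trc_line("b7336bda5de37294339e,c6af162f,MIX,0,0,9.17713597036\n")
res_text = format_res([(msg.id_in, msg.id_out)])
```

   Writing res_text to a file gives something that can be passed to
   TestResults.py as the RESULTS.res argument.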

  Details

   The message traces are based on real social networks sampled from
   Facebook. Each user has a small number of friends and/or colleagues, with
   whom they have a close or casual relationship. Based on the type and
   strength of each relationship, users send each other mail and reply to
   each other over the week. Relationships are fully symmetric.

   All messages are relayed through a simple anonymization system consisting
   of a single threshold mix. Network communication is assumed to be
   near-instantaneous, and the only latency introduced in the system is due
   to mixing.
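
   For readers unfamiliar with the term, a threshold mix collects messages
   until a fixed number have arrived and then flushes them all at once in a
   random order. The sketch below is a minimal model of that behaviour, not
   the generator used to produce the datasets; the class name ThresholdMix
   and the threshold of 3 are hypothetical, and the real batch size is left
   for Question (2).

```python
import random


class ThresholdMix:
    """Toy threshold mix: buffer messages, flush when the threshold is hit."""

    def __init__(self, threshold, rng=None):
        self.threshold = threshold
        self.pool = []
        self.rng = rng or random.Random()

    def receive(self, message):
        """Accept one message; return the flushed batch, or [] if not full."""
        self.pool.append(message)
        if len(self.pool) < self.threshold:
            return []
        batch, self.pool = self.pool, []
        self.rng.shuffle(batch)  # output order hides the arrival order
        return batch


mix = ThresholdMix(threshold=3, rng=random.Random(0))
first = mix.receive("a")    # buffered, nothing leaves yet
second = mix.receive("b")   # buffered, nothing leaves yet
flushed = mix.receive("c")  # third message triggers the flush
```

   In this model a message waits until enough later messages arrive, so
   latency depends entirely on the traffic rate.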

Questions

  Descriptive questions

   0) Have a look inside the "Example" files and understand their structure.

   1) How many messages are sent through the mix in total? How does the rate
   of message transmission vary according to the time and day?

   2) What are the characteristics of the mix network? (Batch size?)

   3) What is the average, minimum and maximum latency of messages relayed
   through the mix?
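
   As a possible starting point for the descriptive questions, the sketch
   below tallies how many messages enter the mix in each round and
   summarises message latency. The function names batch_size_histogram and
   latency_summary are hypothetical, the sample lines are made up for
   illustration, and the latency calculation assumes that TIME is measured
   in hours within a day (units_per_day=24.0), which should be checked
   against the data.

```python
from collections import Counter


def batch_size_histogram(trc_lines):
    """Count messages entering the mix per round, then tally those counts."""
    per_round = Counter()
    for line in trc_lines:
        fields = line.strip().split(",")
        if len(fields) == 6 and fields[2] == "MIX":  # user -> MIX event
            per_round[int(fields[3])] += 1
    return Counter(per_round.values())


def latency_summary(msg_lines, units_per_day=24.0):
    """Return (min, mean, max) latency over all well-formed .msg lines."""
    latencies = []
    for line in msg_lines:
        f = line.strip().split(",")
        if len(f) != 10:
            continue  # skip blank or malformed lines
        day_in, time_in = int(f[6]), float(f[7])
        day_out, time_out = int(f[8]), float(f[9])
        latencies.append((day_out - day_in) * units_per_day + time_out - time_in)
    return min(latencies), sum(latencies) / len(latencies), max(latencies)


# Made-up sample data in the .trc and .msg formats (not from the datasets).
sample_trc = [
    "a,u1,MIX,0,0,1.0\n", "b,u2,MIX,0,0,2.0\n",
    "c,MIX,u3,0,0,2.0\n", "d,MIX,u4,0,0,2.0\n",
    "e,u1,MIX,1,0,3.0\n", "f,u3,MIX,1,0,4.0\n",
]
sample_msg = [
    "i1,o1,1,u1,u3,0,0,1.0,0,2.0\n",
    "i2,o2,1,u2,u4,0,0,2.0,1,1.0\n",
]
histogram = batch_size_histogram(sample_trc)
summary = latency_summary(sample_msg)
```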

  Traffic analysis questions

   4) What is the probability each recipient receives a message?

   5) What is the probability each recipient receives a message, during a
   round in which Alice sends a message? For the Example data set, Alice has
   ID 0809ef07, and in the Exercise dataset, Alice has ID e29b9154.

   6) What is the probability Alice sends to each recipient? (Use the results
   from (4) and (5) to perform a simple SDA.)

   7) For each message Alice sends, which output message is the most likely
   match? (Produce a .res file and compare it with the Example dataset.)
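
   Questions (4) to (6) can be combined along the lines of the statistical
   disclosure attack described in the recommended reading. The sketch below
   is one simple variant under stated assumptions, not a reference solution:
   the function name sda_profile and the toy rounds are hypothetical, the
   background distribution is estimated over all rounds (including Alice's),
   and Alice is assumed to send one message in each round in which she
   appears. With batch size B, observed distribution O and background U, it
   uses the approximation O = (1/B) v + ((B-1)/B) U and solves for Alice's
   profile v.

```python
from collections import Counter


def sda_profile(rounds, alice, batch_size):
    """Estimate Alice's sending profile from per-round observations.

    rounds: iterable of (senders, receivers) lists, one pair per mix round.
    """
    background, target = Counter(), Counter()
    for senders, receivers in rounds:
        background.update(receivers)      # question (4): all rounds
        if alice in senders:
            target.update(receivers)      # question (5): Alice's rounds
    n_bg, n_target = sum(background.values()), sum(target.values())
    if n_target == 0:
        raise ValueError("Alice never sends in the observed rounds")
    scores = {}
    for user in background:
        u = background[user] / n_bg
        o = target[user] / n_target
        # v ~ B*O - (B-1)*U, with negative estimates clipped to zero
        scores[user] = max(0.0, batch_size * o - (batch_size - 1) * u)
    total = sum(scores.values())
    return {user: score / total for user, score in scores.items()}


# Toy rounds with batch size 2; "alice" always writes to "bob".
toy_rounds = [
    (["alice", "x"], ["bob", "carol"]),
    (["alice", "y"], ["bob", "dave"]),
    (["x", "y"], ["carol", "dave"]),
    (["y", "x"], ["dave", "carol"]),
]
profile = sda_profile(toy_rounds, "alice", batch_size=2)
best_guess = max(profile, key=profile.get)
```

   For Question (7), one option is to pick, in each round where Alice sends,
   the output message whose recipient has the highest estimated probability.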

  Free form exercise

   8) How could one improve the performance of the de-anonymization, to get
   more correct correspondences between input and output messages?

   9) Implement a better de-anonymization technique; produce a .res file for
   your technique against the exercise dataset, and send it to the
   instructors along with a description of your techniques and your code. Try
   de-anonymizing messages of other users, e.g. 03295680 and c5d98515 in the
   Example dataset; 38546347 and 0d351e16 in the Exercise dataset.

   10) Can you infer the strength and type of relationship for each pair of
   users?

Recommended Reading

     * George Danezis: Statistical Disclosure Attacks. SEC 2003: 421-426
     * Nick Mathewson, Roger Dingledine: Practical Traffic Analysis:
       Extending and Resisting Statistical Disclosure. Privacy Enhancing
       Technologies 2004: 17-34
     * George Danezis, Andrei Serjantov: Statistical Disclosure or
       Intersection Attacks on Anonymity Systems. Information Hiding 2004:
       293-308
     * George Danezis, Claudia Diaz, Carmela Troncoso: Two-Sided Statistical
       Disclosure Attack. Privacy Enhancing Technologies 2007: 30-44
     * Carmela Troncoso, Benedikt Gierlichs, Bart Preneel, Ingrid
       Verbauwhede: Perfect Matching Disclosure Attacks. Privacy Enhancing
       Technologies 2008: 2-23
     * Claudia Diaz, Carmela Troncoso, Andrei Serjantov: On the Impact of
       Social Network Profiling on Anonymity. Privacy Enhancing Technologies
       2008: 44-62
