Ground Truth Verification System (GTVS)

Design, implementation and installation documentation

Contents

Introduction
Installation
1. Requirements
  1. Ubuntu packages
  2. CentOS packages
2. Instructions
User Manual

Introduction

The following flow chart gives an overview of GTVS (a number in brackets refers to the chapter that covers that step):

                        +--------------------------+
                        |       Trace files        |
                        +--------------------------+
                                     |  Flow aggregation and
                                     |  L7 pattern matching
                                     V  with Click(2)
                        +--------------------------+
                        |   Flow records with L7   |<..... results from other mechanisms
                        |   marks in MySQL DB(4)   |       e.g. QosMOS, Snort
                        +--------------------------+
                                     |
                                     |  Manual verification
                                     |  with MVI(3) <..... Other source of information
                                     V                     e.g. IANA port list, DNS resolving,
                        +--------------------------+            inspecting packet payload
                        |Hand-verified flow records|
                        +--------------------------+

There are 3 major components:

TNT Click - aggregates packets into flows and matches flow payloads against regular expressions. We took the L7 signature set and removed the protocols with high false-positive rates (i.e. signatures that match far more traffic than the popularity of the protocol would explain). This module exports the flow records to a file or to a MySQL database.
MySQL database - stores the flow records exported from Click and keeps them (virtually) forever. All subsequent work on the flows is done in the database.
Manual Verification Interface (MVI) - lets the user conveniently identify the applications in the traffic. As a frontend, it automates many steps of the hand-classification process. For each trace, the user can look at views aggregated by server IP, by (server IP, server port), and by (client IP, server IP, server port), or at each individual flow and its payload content. The frontend presents a comprehensive set of information about a flow: the L7 mark, results from other mechanisms (QosMOS, Snort - still work in progress), host names, the default service on the port number, and the packet payload. Based on all this information, the user can verify which application the flow(s) belongs to. This is the same information a user could at best gather by hand without the frontend.

Apart from the 3 major components, the system includes a number of scripts that connect these components.

Flow Aggregation and Signature Matching (FLAGSIM) Module

The FLAGSIM module is implemented on the Click modular router infrastructure. It contains a port of the userspace version of L7-filter.

First, the FLAGSIM module aggregates the packets from pcap/erf format trace files, or from live traffic, into full-duplex flows. A flow is defined as a series of packets with an identical IP five-tuple {server ip, server port, client ip, client port, ip proto}; it ends either when a FIN packet is seen or when no further packet arrives before a timeout (our defaults are 180s for TCP and 60s for UDP). The module then carries out signature matching on the payload of each flow and sets the L7 mark for every flow it recognises. Finally, the flow records are stored in the database.
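
The flow definition above can be illustrated with a minimal Python sketch. It is not the FLAGSIM implementation (which is a Click element); the names Flow, flow_key and aggregate, and the packet tuple layout, are assumptions made for illustration. Only the 180s/60s idle timeouts come from the text. As a simplification, a flow is closed at the first FIN, whereas a real implementation would normally keep it open briefly to absorb the closing handshake.

```python
from dataclasses import dataclass

# Idle timeouts in seconds, taken from the defaults quoted in the text.
TIMEOUTS = {"tcp": 180.0, "udp": 60.0}


@dataclass
class Flow:
    key: tuple
    first_ts: float
    last_ts: float
    packets: int = 0
    done: bool = False  # set once a TCP FIN has been seen


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return a direction-independent key so that both halves of a
    conversation land in the same full-duplex flow."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)


def aggregate(packets):
    """Group (ts, src_ip, src_port, dst_ip, dst_port, proto, fin) tuples,
    sorted by timestamp, into a list of closed and still-open flows."""
    active, finished = {}, []
    for ts, sip, sport, dip, dport, proto, fin in packets:
        key = flow_key(sip, sport, dip, dport, proto)
        flow = active.get(key)
        # Close the flow if it ended with FIN or has been idle too long.
        if flow and (flow.done or ts - flow.last_ts > TIMEOUTS[proto]):
            finished.append(active.pop(key))
            flow = None
        if flow is None:
            flow = active[key] = Flow(key, ts, ts)
        flow.last_ts = ts
        flow.packets += 1
        flow.done = flow.done or (proto == "tcp" and fin)
    return finished + list(active.values())
```

Packets travelling in either direction map to the same key, which is what makes the resulting flows full-duplex.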

Manual Verification Interface (MVI)

The MVI follows the steps a user would typically take when manually verifying traffic flows, and aims to automate the information gathering, decision making and data manipulation involved. The MVI communicates directly with the MySQL database: it queries the database for information related to the flows or IPs, and updates the verification tags (protocol, service) in the database. The current MVI functionality includes:

Allows the user to look at the flows at various scales, including:
- traffic flows aggregated by the server IP,
- traffic flows aggregated by (server IP, server port) pair,
- traffic flows aggregated by (client IP, server IP, server port) triplet,
- individual flows,
- the packets' payloads in a flow
Displays detailed information about the traffic, including:
- the name server record of the destination IP;
- default service on the destination port number;
- port number and found signatures breakdown for each IP;
- pattern matching results using modified L7 signatures (minimised false positives);
- results from other mechanisms (e.g. QosMOS, Snort) - we are still working on this;
- packet payload (e.g. tcpdump -X, tcpdump -A)
Lets the user verify traffic at different levels: all the traffic of a server IP at once, the traffic on a (server IP, server port) pair, the traffic on a (client IP, server IP, server port) triplet, or each individual flow. The user can establish two different kinds of ground truth:
- protocol (the protocol the flow uses)
- service (the actual service the flow is for)
Automatically logs the verification history: who verified which {IP,port,flows} as what.

The interface is designed around a typical workflow. When first looking at a trace, the user lists all the destination IPs in it. Some IPs are simple cases: the host may be a known server running a known service, or the port and signature breakdown may show that the traffic is straightforward. The user may then verify all the traffic belonging to that server without looking into each flow. Otherwise, if the traffic behaves unusually or the host has many different port numbers and signatures associated with it, the user may want to look in further detail and decide flow by flow.

In an earlier careful study of a few real traces we found that the {server IP, server port} pair is very consistent with the protocol and the service. This is the principal reason for verifying the ground truth from per-IP level, through per-{IP,port} pair, down to per-flow level. The consistency holds because (1) a port number cannot be used by two applications at the same time and (2) in most cases an application will not accept two different base protocols or two different kinds of service, especially on one port number. There are exceptions, such as SOCKS proxies, VPNs and ssh tunneling, but these can be recognised too, and once found they can be tagged as proxy, VPN or tunneling.
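
One way to check this consistency on flows that already carry a label is sketched below in Python. The function name pair_consistency and the (server IP, server port, label) input tuples are hypothetical and not part of GTVS; the sketch only illustrates the idea of measuring how dominant one label is for each pair.

```python
from collections import Counter, defaultdict


def pair_consistency(flows):
    """Map each (server IP, server port) pair to its dominant label and
    the fraction of the pair's flows that carry that label."""
    labels = defaultdict(Counter)
    for server_ip, server_port, label in flows:
        labels[(server_ip, server_port)][label] += 1
    result = {}
    for pair, counts in labels.items():
        label, hits = counts.most_common(1)[0]
        result[pair] = (label, hits / sum(counts.values()))
    return result
```

A pair whose fraction is close to 1.0 is a good candidate for verification at the pair level; a low fraction suggests looking at the individual flows.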

The user decides which type of application and which type of protocol the flow(s) belongs to. If the application does not yet exist in the database, MVI asks the user to confirm. MVI also uses some heuristics to suggest what it thinks the flow(s) is, to save the user time and labour.

A flow can have one of three statuses: unverified, verified and questioned. Once the user has verified a flow, its status changes from "unverified" to "verified". The user can also set the status to "questioned" for flows of special interest. In all flow-listing views, the user can choose to view all, unverified, verified or questioned flows.

Database Design and Table Structure

All the tables are kept in one database called GTVS. The following shows the list of tables when there is only one trace "abcd1" in the database. The database schema will be introduced below, using this example database.

+---------------------------+
| Tables_in_GTVS            |
+---------------------------+
| GtApps                    |
| GtProtos                  |
| Traces                    |
| VerificationLog           |
| VerificationLogV          |
| abcd1_Flows               |
| abcd1_FlowsAggByDstIp     |
| abcd1_FlowsAggByDstIpPort |
| abcd1_HostNames           |
+---------------------------+

Traces

This table keeps the necessary information of each available trace. It is composed of:

Name - the trace name (e.g., over_11108_00)
Description - a short description
StartDate - the start date
EndDate - the end date
DataPath - the full path to the directory that contains the trace files

The DataPath field points to the physical location of the original trace files, which the MVI reads when the user asks to look into the packet payload.

GtApps

This table maintains the enumeration of types of applications. The list can be expanded by the user.

This is the initial list:

botnet
bulk
chat
database - Mysql, Mssql, Oracle, postgres
email - pop2, pop3, imap3, imap4, smtp
filesharing
ftp
gaming - Doom, Blizzard Battle Net, WoW, WarCraft, Diablo II, UO etc. except MS gaming zone
grid
hidden
im - MSN, AIM, yahoo messenger, ICQ, QQ, Apple iChat
irc
malicious
malware - Spyware, viruses and worms, scan?
ms gaming zone - directplay
news
ntp
os services
p2p filesharing - BitTorrent, Edonkey, Emule, Gnutella, Kazaa, SoulSeek, WinMX, Winny
p2p streaming
proxy - http, socks
remote access - ssh, telnet
remote control - vnc, remote desktop, PCAnywhere
services
streaming
svn
sw updates - Software updates: windows, adobe, etc, etc
tor
tunneling - Tunneling X11 forwarding
unknown
videoconf - MS NetMeeting
voip - Skype, SkypeOut, other
vpn
web audio - mediaplayer,realmedia
web mail - Gmail, Hotmail, yahoo, institutional webmail..
web proxy
web services - Other than webbrowsing, e.g. RSS reader, desktop widgets
web video - mediaplayer,realmedia,QT?
web - Web browsing
workspace - MS Exchange, Lotus Workspace

GtProtos

This table maintains the enumeration of types of protocols. The list can be expanded by the user.

The initial list derives from the L7 signature names:

aim
aimwebcontent
aleph
applejuice
applet
arcp
ares
av
bittorrent
custom
cvsup
dcc
dhcp
directconnect
dns
edonkey
fasttrack
flash
flash streaming
ftp
globus
gnutella
groove
hamachi
http
httpaudio
http-itunes
http-rtsp
https
httpvideo
ident
imap
imaps
imesh
irc
isakmp
jabber
joltid
ldap
linkproof
msn-filetransfe
msnmessenger
ms-rpc
mssql
ms-streaming
mysql
napster
nav
nntp
ntp
openbase
openft
pando
pop3
pop3s
pyzor
quicktime
razor2
rsync
rtmp
rtp
rtsp
scan
secondlife
sip
skype
smtp
snmp
ssh
ssl
ssmtp
stun
telnet
telnets
theprayer
traceroute
unknown
vnc
yahoo

VerificationLog

This table contains a complete log of all verification activities (who verified what, and when).

It comprises the following fields:

Ts - when (the timestamp of an action)
User - who
Trace - trace name
SrcIp, SrcPort, DstIp, DstPort, IpProto - what
Event - one of verifies or questions
GtProto - the verified protocol
GtApp - the verified application

The view VerificationLogV should be used instead of VerificationLog as the view provides human readable IP addresses.
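
The need for a view suggests that the raw table stores addresses in a compact numeric form. Assuming they are stored as 32-bit unsigned integers (an assumption, the text does not state the column type), the conversion the view performs can be sketched in Python as follows; the function names are hypothetical.

```python
import ipaddress


def dotted_quad(ip_as_int):
    """Convert a 32-bit unsigned integer into dotted-quad notation."""
    return str(ipaddress.IPv4Address(ip_as_int))


def ip_to_int(dotted):
    """Inverse conversion, e.g. for building a query on the raw table."""
    return int(ipaddress.IPv4Address(dotted))
```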

{tracename}_Flows

The main table for each trace. It holds the flow records, including the flow tuple, per-flow characteristics, intermediate results (e.g. the signature matching result), the protocol and service (ground truth) verified by the user, and the verification status (verified, unverified, questioned).

{tracename}_FlowsAggByDstIp

Shadow table for {tracename}_Flows to accelerate the retrieval of the information about the flows aggregated by Server IP. For performance tuning only.

{tracename}_FlowsAggByDstIpPort

Shadow table for {tracename}_Flows to accelerate the retrieval of the information about the flows aggregated by {Server IP and Server Port}. For performance tuning only.
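
What the two shadow tables precompute can be sketched in Python. The real tables are generated in SQL by the create-agg script; the field names dst_ip, dst_port, packets and bytes, and the function aggregate_flows, are hypothetical stand-ins for the actual schema.

```python
from collections import defaultdict


def aggregate_flows(flows, key_fields):
    """Sum flow, packet and byte counts for every distinct value of
    key_fields, e.g. ("dst_ip",) or ("dst_ip", "dst_port")."""
    totals = defaultdict(lambda: {"flows": 0, "packets": 0, "bytes": 0})
    for flow in flows:
        row = totals[tuple(flow[name] for name in key_fields)]
        row["flows"] += 1
        row["packets"] += flow["packets"]
        row["bytes"] += flow["bytes"]
    return dict(totals)
```

Calling it with ("dst_ip",) corresponds to the per-server-IP table, and with ("dst_ip", "dst_port") to the per-pair table.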

{tracename}_HostNames

Keeps the name server records for the IPs that appear in the trace {tracename}. Ideally these records should be collected at the same time as the trace is captured, which is why each trace has its own HostNames table.

Installation

Requirements

In order to run GTVS you will need to set up your machine to fulfill these requirements:

Apache HTTP server or lighttpd
PHP 5
Ruby 1.8
MySQL 5
Graphviz
libpcre
libdb
java
tcpdump

In addition, your machine needs to satisfy Click's requirements, which include:

GNU C++ compiler
flex, bison
libpcap
git (for using the latest click sources)

You might also want to install phpmyadmin to manage the database directly.

Ubuntu packages

If you have Ubuntu use this command to install the required packages:

apt-get install apache2 php5 php5-mysql php-pear ruby graphviz mysql-client mysql-server flex bison libpcap0.8-dev libdb4.3++-dev libpcre3-dev libmysqlclient15-dev sun-java5-jre tcpdump

CentOS packages

If you have CentOS use this command to install the required packages:

yum install httpd php php-mysql php-mbstring php-pear ruby mysql mysql-server libpcap-devel mysql-devel flex bison pcre-devel db4-devel tcpdump

To install graphviz, download the graphviz-rhel.repo file and save it (as root) in /etc/yum.repos.d/

Then run

yum install graphviz

Instructions

mon2, Click, tnt

First proceed to build the tnt Click package and the mon2 tools.

The quick way to have a working setup of mon2, Click and tnt simply requires you to run the following:

cd gtvs-release
./bootstrap.sh

The following describes the slow way. If you have done the above, skip this part.

cd gtvs-release

If you already have Click or want to choose the version and build type, modify the file named click.sh. Change CLICK_VER as you need. CLICK_VER=git will use the latest Click source code pulled from the git repository.

Building Click: for the debug build run

./click.sh debug

Otherwise, for the release build (will run faster) run

./click.sh release

Building mon2:

cd mon2/tools
make

GTVS

Assuming your LAMP stack is running, follow these instructions to complete the GTVS installation.

Find out the user and group names used by the Apache HTTP server by looking in the main config file for the User and Group directives. On Ubuntu both are www-data, while on CentOS both are apache.

export WWW_USER=www-data
export WWW_GROUP=www-data

Note that some of these commands require super-user privileges.

On Ubuntu:

cd gtvs-release/gtvs
BASEDIR=$(pwd)
chown $WWW_USER.$WWW_GROUP app/vendors/gtapphint.php
chown -R $WWW_USER.$WWW_GROUP app/tmp
cp htaccess.sample .htaccess
# create a link in apache htdocs to gtvs root directory
ln -s $BASEDIR /var/www
cp apache.conf /etc/apache2/conf.d/gtvs
# edit /etc/apache2/conf.d/gtvs to make the directory point to "/var/www/gtvs/"
/etc/init.d/apache2 restart
# configure the database
/etc/init.d/mysql start
# create the GTVS database
mysql -u root -p < gtvs-empty.sql
# edit app/config/database.php and insert the MySQL user credentials
# install cron.d/gtvs in the system's cron jobs directory (/etc/cron.d)

On CentOS:

cd gtvs-release/gtvs
BASEDIR=$(pwd)
chown $WWW_USER.$WWW_GROUP app/vendors/gtapphint.php
chown -R $WWW_USER.$WWW_GROUP app/tmp
cp htaccess.sample .htaccess
# create a link in apache htdocs to gtvs root directory
ln -s $BASEDIR /var/www/html
cp apache.conf /etc/httpd/conf.d/gtvs.conf
# edit /etc/httpd/conf.d/gtvs.conf to make the directory point to "/var/www/html/gtvs/"
setsebool -P httpd_disable_trans 1 # keep SELinux from interfering
/etc/init.d/httpd restart
# configure the database
service mysqld start
# create the GTVS database
mysql -u root -p < gtvs-empty.sql
# edit app/config/database.php and insert the MySQL user credentials
# install cron.d/gtvs in the system's cron jobs directory (/etc/cron.d)

The installation is complete! You can now access GTVS by pointing your browser to the installation machine's web server at the path /gtvs.

Enabling Authentication

The access restriction configuration file is called .htaccess and is located in the GTVS root directory.

To enable basic authentication you need to create a password database (unless you already have one) with a new user:

htpasswd -c .htpasswd <username>

Once you have set up the password database, you can edit the file named .htaccess and uncomment the authentication section so that it looks like this:

AuthType Basic
AuthName "GTVS"
AuthUserFile /path/to/gtvs-release/gtvs/.htpasswd
Satisfy All
#Require user <user1> <user2> ...

Require user gtvs-cron

Use the Require directive to specify the list of user names that are allowed to use GTVS. At this point you need to create a user named "gtvs-cron" which will be used by the cron script. Alternatively, you can disable the authentication for requests coming from 127.0.0.1 by adding this line to .htaccess:

SSLRequire %{REMOTE_ADDR} != "127.0.0.1"

Notes

The file .htaccess must contain these lines (don't delete them):

<IfModule mod_rewrite.c>
   RewriteEngine on
   RewriteRule    ^$ app/webroot/    [L]

   RewriteRule    (.*) app/webroot/$1 [L]
</IfModule>

Troubleshooting

On CentOS, if you see the message "warning: group apache does not exist - using root", try:

yum remove httpd
groupadd -r apache
yum install httpd

If the output of

id apache

does not show gid=XX(apache), try:

groupdel apache
groupadd -o -g XX -r apache

User Manual

Trace processing

The Click configuration is specified in gtvs.click. This file includes the declaration of the packet processing pipeline made of Click elements and contains the elements' arguments (e.g., the number of packets and bytes in each flow that are considered for pattern matching). It also includes the list of signatures that are used for pattern matching.

In the file, AppMarks specifies the possible application marks and the associations between each L7 signature and an application class.

The FileExport element is responsible for exporting the flow records to an output file.

The SQLExport element (enabled when @SQLEXPORT@ is defined) is responsible for exporting the flow records to a MySQL database accessed with the information provided by these variables: @DB_NAME@, @DB_USER@, @DB_PASS@, @DB_HOST@, and @DB_TABLE@.

A number of variables control FlowCache, which is responsible for aggregating packets into flows. Firstly, @REAP@ sets after how many seconds (of packet time) the garbage collection process is triggered; the process is then repeated periodically. Two variables, @TCP_TIMEOUT@ and @UDP_TIMEOUT@, set the timeout of TCP and UDP flows respectively. Further variables tweak the resources allocated for TCP flows: @RESOURCE_TCP_MAX_FLOWS@ (maximum number of flows), @RESOURCE_TCP_MAX_BUCKETS@ (maximum size of the hashmap bucket array), @RESOURCE_TCP_MAX_BUCKET_LENGTH@ (maximum length of a bucket chain; if this limit is reached the hashmap is rebuilt), and @RESOURCE_TCP_INITIAL_BUCKETS@ (initial size of the hashmap bucket array). Analogous variables are available for UDP flows.

FlowsStats collects the number of payload bytes in both directions and TCP maximum segment size (MSS).

For collecting packet inter-arrival times and packet payload sizes, two elements are used: InterArrivalStats, and PacketSizeStats. The argument SERIES_LENGTH (in this case 15) defines how many samples will be collected.

The signature matching is done by L7. This element needs to be configured with the list of active patterns and the maximum number of packets and bytes to be used for the signature matching. The variables @MAX_PKTS@ and @MAX_BYTES@ control these last two arguments. Finally, @PATTERNS_DIR@ specifies where the pattern definitions are stored on disk. Finding a proper set of string signatures is the key to obtaining accurate and meaningful results and to saving as much human labour as possible in ground truth classification. Currently we use a trimmed L7 signature set: we removed or strengthened many of the signatures. This is because many of the default L7 signatures match far more traffic than the popularity of the corresponding applications would explain, and would otherwise produce many more false positives than true positives.

A convenient wrapper, run-click, can be used to process a given trace.

If SQLEXPORT is defined, Click will automatically export the flow records into the database. However, for large traces it is faster to save the output inside a file and load it into the database with a single instruction as shown later.

Please note that the table for a trace named Site is expected to be called <Site>_Flows in the database (we refer to it as the Flows table).

It is also possible to obtain a flat file (flowsinfo) in which each flow record is exported as a single line of space-separated fields, preceded by a heading row.
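
Such a flat file is easy to read back for ad hoc analysis. The following Python sketch assumes only what is stated above (one heading row, space-separated fields, one record per line); the function name read_flowsinfo is hypothetical, and it assumes that no field value contains a space.

```python
def read_flowsinfo(stream):
    """Yield one dict per flow record, keyed by the names in the heading row."""
    header = stream.readline().split()
    for line in stream:
        fields = line.split()
        if fields:  # skip blank lines
            yield dict(zip(header, fields))
```

It can be used as `for record in read_flowsinfo(open(path)):` where path points to the flowsinfo file. All values are returned as strings, so numeric fields need an explicit conversion.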

After the Flows table has been populated, the script tools/create-agg should be used to generate the aggregate tables FlowsAggByDstIp and FlowsAggByDstIpPort.

create-agg <Site> | mysql GTVS

Further, the HostNames table must be created to map each IP address into one or more corresponding host names. You can use tools/create-hn for this task.

create-hn <Site> | mysql GTVS

All these trace processing steps can conveniently be put into a script. Here is an example that you can use as a template. The script assumes the dataset is located on disk at /traces/abcd/1.

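#!/bin/sh

PATH=/path/to/mon2/bin:/path/to/gtvs/tools:$PATH

# Location of the trace repository and of the dataset within it
export TRACES_REPOS=/traces
DATASET=abcd/1

# Directory where the results are written
export RESULTS=/results

# Database settings used by the SQLExport element
DB_NAME=GTVS
DB_USER=root
DB_PASS=<MySQL password>
DB_HOST=localhost
DB_TRACE=abcd1
DB_TABLE=${DB_TRACE}_Flows
export DB_NAME DB_USER DB_PASS DB_HOST DB_TABLE

# FlowCache garbage collection interval and flow timeouts (seconds)
REAP=60
TCP_TIMEOUT=600
TCP_DONE_TIMEOUT=15
UDP_TIMEOUT=600
export REAP TCP_TIMEOUT TCP_DONE_TIMEOUT UDP_TIMEOUT

# L7 pattern directory and limits for signature matching
PATTERNS_DIR=/path/to/mon2/etc/l7-protocols
MAX_PKTS=10
MAX_BYTES=4096
export PATTERNS_DIR MAX_PKTS MAX_BYTES

MYSQL_FLAGS="-h $DB_HOST -u $DB_USER -p$DB_PASS $DB_NAME"

# Process the trace with Click
TAG=mytag
run-click $DATASET gtvs $TAG

# Load the exported flow records into the database
mysql $MYSQL_FLAGS < $RESULTS/$DATASET/click/gtvs/$TAG/flowsinfo.sql

# Create the aggregate and HostNames tables and register the trace
create-agg $DB_TRACE | mysql $MYSQL_FLAGS
create-hn $DB_TRACE | mysql $MYSQL_FLAGS
add-trace $DB_TRACE $TRACES_REPOS/$DATASET | mysql $MYSQL_FLAGS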

Note that the Flows table now contains all flows in the trace. Initially, it might be convenient to only consider complete TCP traffic. After this, you can go on to verify the trace from the MVI.

Web interface (MVI)

The web interface is the main way of interacting with GTVS. Through this interface the user can review the network data down to per-flow granularity and tag flows according to the type of protocol each flow appears to carry. The system offers several views of the data to help the user carry out the verification process easily.

All the traces of a GTVS installation are listed on the trace page. A user can reach this page by clicking the 'Would you like to verify one of the traces?' link on the GTVS welcome page. The page lists the traces that have been loaded into GTVS with their start and end times. To go to the first perspective of data presentation for a trace, the user needs to click the 'Verify by Dst IP' link.

In the first perspective, the user views flows aggregated by their destination IP. For all the flows destined to a specific IP, the table gives the number of flows, the number of packets, the total number of bytes including the packet headers of every protocol, the total number of bytes carried by the transport protocol, the percentage of flows that are verified or questioned, the L7-filter signatures that have been matched, and the ports that the flows used. For input, the user can use the last two text boxes on each line to give the exact type of protocol that the flows carry and the general class of the application. To make this easier, the interface offers a popup dropdown menu listing the values already used by the system that match the first letters entered. Finally, at the end of each line the user finds the links used to verify data: Verified, which verifies that all flows are of the type given in the text boxes; Questioned, which marks the flows to be reviewed later; and Verify only unverified, which applies the values in the text boxes only to the flows that have not been verified yet.

The idea behind this perspective is that TCP sockets usually persist for many days, especially on servers. So if we manage to identify some of the flows carried by a specific IP/port set, we can infer that all the data carried by that set is of the same type. To characterise an IP/port set, GTVS provides three sources of information.

The first is the IANA list of port assignments, which lists TCP/UDP ports and the protocol that a connection to each port should carry.

The second is the set of L7-filter signature matches. L7-filter runs within Click and performs deep packet inspection, trying to match signatures against the payload of the flow. The interface shows a summary of the most common signature that L7-filter has identified; clicking the breakdown link shows the list of signatures that have been matched and how often each one matched. Two special tags denote the case in which no signature matched: NC-10 means that the flow had fewer than 10 packets, and NC+10 means that it had more than 10 packets. In both cases it was not possible to identify any protocol through pattern matching. The value 10 is the number of packets set in the Click configuration file as the minimum for L7-filter to characterise a flow.

The third is the plain payload data of the flow, which the user can inspect to verify the protocol that the flow carries. The data of the first flow for each port is provided; to view more flows for the same port, the user needs to go to a different view. To view the raw dump of the payload, the user clicks the arrow next to the port number and a new window pops up with the relevant information.
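
The NC-10 and NC+10 tags can be expressed as a small Python function. This is only a sketch of the labelling rule, not the code used by GTVS; the function name l7_label is hypothetical. The text does not say how a flow with exactly 10 packets is tagged, so the sketch assumes it falls under NC+10.

```python
def l7_label(matched_signature, packet_count, threshold=10):
    """Return the matched signature name, or NC-10 / NC+10 when nothing matched."""
    if matched_signature:
        return matched_signature
    # Assumption: a flow with exactly `threshold` packets counts as NC+10.
    return "NC-10" if packet_count < threshold else "NC+10"
```

An NC-10 flow may simply have been too short to match, whereas an NC+10 flow had enough packets and still matched nothing, which makes it a more interesting candidate for manual inspection.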

Clicking the destination IP on a line of the first perspective takes the user to another view, in which the flows to that IP are grouped by destination port. This perspective provides the same information as the previous one, plus one additional function: the user can see a diagram of the clients that are connected to this destination IP. This can be used to discover the overlay networks that P2P systems usually form, and so to characterise the activity of other hosts. P2P traffic increasingly uses encryption, which makes it rather difficult to identify. With this function, though, once we identify one P2P server we can discover others by reasoning about their interactions. To view the diagram, the user clicks the graph image next to the port number.
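
Since Graphviz is one of the requirements, a diagram of this kind can be described in the DOT language. The Python sketch below builds such a description from (client IP, server IP) pairs; the function name clients_to_dot and the input format are assumptions for illustration and do not reflect how the MVI generates its diagram.

```python
def clients_to_dot(server_ip, flows):
    """Return Graphviz DOT text with one edge per distinct client
    that opened a flow to server_ip."""
    clients = sorted({src for src, dst in flows if dst == server_ip})
    lines = ["digraph clients {"]
    lines += [f'    "{client}" -> "{server_ip}";' for client in clients]
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be saved to a file and rendered with the Graphviz dot tool, for example `dot -Tpng clients.dot -o clients.png`.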

Clicking the destination port in this view takes the user to a new perspective listing the source IPs of the flows to that destination IP and port. The information provided is the same as in the previous perspective.

Finally, clicking a source IP shows all the flows that were initiated for the triplet of destination IP, destination port and source IP. For each flow the system gives the start time, the total duration, the source port, the total number of packets, the total number of bytes including the headers of all protocols, the total number of payload bytes of the transport protocol, and the L7-filter mark. As in the initial perspective, the user can set the type of protocol and the class of application, and either verify or question the flow using the relevant links. Clicking the link on the start time of a flow shows the full payload of that flow.

Extra scripts

To make data tagging easier, GTVS provides a bundle of command line scripts that can be used either to add to the information provided by the system or to perform batch tagging based on observed heuristics. These scripts can be found in the subfolder 'tools'. They are:

add-trace

This script adds a new trace to the Traces table, so that it appears in the trace list of the web interface. The script requires two parameters: first the name of the trace to add, and second the path to the folder that holds the capture files. It outputs an SQL script that can be run in mysql to add the new trace. An example of its usage is:

add-trace <trace_name> </path/to/flow/data> | mysql -u <db_username> -p<db_password> GTVS

create-agg

After the Flows table has been populated, the script create-agg should be used to generate the aggregate tables FlowsAggByDstIp and FlowsAggByDstIpPort. This script accepts the name of the trace as a parameter and outputs an SQL script that generates the relevant tables. The output can then be loaded into the MySQL database. An example of its usage is:

create-agg <trace_name> | mysql -u <db_username> -p<db_password> GTVS

create-hn

The create-hn script can be used to create the HostNames table. This table is necessary to map each IP address into one or more corresponding host names. The script accepts as a parameter the name of the trace and outputs an SQL script that creates the relevant table. An example of its usage can be:

create-hn <trace_name> | mysql -u <db_username> -p<db_password> GTVS

verify with heuristics

This script supports heuristic rules that can be used to quickly tag flows based on general observations about the data. The rules we have developed fall into three classes. The first class contains rules that sanitize the L7-filter results, also using knowledge from the IANA port assignments. For example, if for a given server and port most of the traffic is identified as one protocol and only a few flows as something different, we can infer that the port uses that protocol. The rules that we use are the following:

The second class of rules is based on host names. It is used for well-known services that run on specific domains, such as msnmessenger, facebook etc. The rules that we use are:

The last class of rules is based on specific application behaviour characteristics. The rules that we have developed so far are:

create-http-trans

This script outputs an SQL script that creates a table to store data about HTTP transactions. The table stores specific fields of each HTTP transaction, where a transaction is a single query/response interaction between a server and a client. The script accepts the name of the trace as input and writes the query to stdout. An example of its usage is:

create-httptrans <trace_name> | mysql

resolve-hn-http

This script can be used to extract the hostname of the servers of the trace by using the HTTP traffic. It takes advantage of the typical feature of HTTP clients to include a header with the host name in the HTTP request. Based on this observation this script will fill in the HostNames table with the names that have been found in the HTTP traffic. An example of its usage is:

./resolve-hn-http <trace_name>
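
The observation that resolve-hn-http relies on can be illustrated with a short Python sketch that pulls the Host header out of the first bytes of a request payload. It is not the script's actual code; the function name http_host is hypothetical, and the sketch assumes the request headers are available as one contiguous byte string.

```python
def http_host(payload):
    """Return the Host header value of an HTTP request, or None."""
    head = payload.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            return value.strip().decode("ascii", errors="replace")
    return None
```

Header names are compared case-insensitively, and a port suffix such as example.com:8080 is kept as part of the returned value.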

gtvs-resolve

This Java program retrieves the DNS names of each IP address found in the trace (using reverse DNS queries). The program performs several concurrent queries on DNS servers and imports the data of the responses back into the database. Because the queries are performed in batches, the process takes a while and might seem to be stuck at some points while waiting for the responses. An example of its usage is:

gtvs-resolve -u <db_username> -p <db_pass> <trace_name_HostNames>
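
The idea of issuing several reverse DNS queries concurrently can be sketched in Python with the standard library. This is an illustration of the technique only, not a port of the Java program; the function names, the worker count and the injectable lookup parameter are assumptions.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def reverse_lookup(ip):
    """Return the PTR name for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def resolve_all(ips, lookup=reverse_lookup, workers=16):
    """Resolve many addresses concurrently; returns {ip: name or None}."""
    ips = list(ips)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ips, pool.map(lookup, ips)))
```

Because each lookup blocks until the resolver answers or gives up, running them in a thread pool shortens the total time, although the whole batch still waits for its slowest query.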



© 2009 University of Cambridge Computer Laboratory Please send any comments to andrew.moore (at) cl.cam.ac.uk Page last updated on 12-Jun-2009 at 12:42 by Marco Canini