Proposals for Part II / ACS Projects 2017-2018

These projects are all linked, one way or another, to the Cambridge Cybercrime Centre.

These proposals are not of uniform complexity. If some seem less challenging than others, then they're probably more appropriate for Part II candidates than for those seeking an MPhil -- though of course they're all open-ended and you ought to be able to extend them all into first-class pieces of work.

Analysing blog comment spam

Automating URL shortener following

Email spam categorisation

419 scams

Open directory analysis

String reflecting honeypots

DDoS victim categorisation

Analysing Mirai malware

What is on the 'Dark Web'?

Quantifying users of multiple underground forums

Trading of social network accounts in underground forums

Analysing blog comment spam

Our group blog www.lightbluetouchpaper.org receives 1000 or so comments per day. Almost every single one is advertising of one sort or another. We'd like to understand what is being posted, by whom, and what they are advertising. For this project NLP skills are as relevant as the ability to parse URLs.

Automating URL shortener following

URL shorteners such as bit.ly redirect you to destination websites in a number of different ways. This project is to create a universal (and easy to extend) URL shortener following system.
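
As a rough illustration of what "easy to extend" might mean here, the sketch below treats each redirection mechanism as a small function that inspects one fetched response and returns the next URL if it recognises its own mechanism. The `Response` tuple, the handler names and the choice of three mechanisms (HTTP `Location` header, HTML meta refresh, a simple JavaScript location assignment) are all assumptions made for illustration; fetching the pages is deliberately left out.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import urljoin


class Response(NamedTuple):
    """Hypothetical container for one fetched page (the fetch itself is not shown)."""
    url: str
    status: int
    headers: dict
    body: str


def via_location_header(resp: Response) -> Optional[str]:
    # Classic HTTP redirect: a 3xx status plus a Location header.
    # Assumes header names have already been normalised to this spelling.
    if 300 <= resp.status < 400:
        return resp.headers.get("Location")
    return None


META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)


def via_meta_refresh(resp: Response) -> Optional[str]:
    # HTML-level redirect: <meta http-equiv="refresh" content="0; url=...">
    m = META_REFRESH.search(resp.body)
    return m.group(1) if m else None


JS_LOCATION = re.compile(
    r'(?:window\.|document\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']'
)


def via_javascript(resp: Response) -> Optional[str]:
    # Very naive: only spots a literal string assigned to location.
    m = JS_LOCATION.search(resp.body)
    return m.group(1) if m else None


# Extending the system means appending another handler to this list.
HANDLERS = [via_location_header, via_meta_refresh, via_javascript]


def next_hop(resp: Response) -> Optional[str]:
    """Return the absolute URL this response redirects to, or None."""
    for handler in HANDLERS:
        target = handler(resp)
        if target:
            return urljoin(resp.url, target)  # resolve relative targets
    return None
```

A real system would wrap `next_hop` in a loop with a hop limit and loop detection, and would need something much more capable than a regular expression for obfuscated JavaScript.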

Email spam categorisation

We have various feeds of email spam which we need to understand better and so we're interested in projects that seek to do this. Approaches could utilise NLP, Machine Learning, both or neither.

419 scams

We have many tens of thousands of 'Advance Fee Fraud' scams. We'd like to know the extent to which these scams evolve over time (are Nigerian Princes getting rarer?) and whether the amount of money involved is changing. Also, are all the orphans girls and all the barristers men?

Open directory analysis

We regularly visit web pages on compromised websites -- often because of an insecure (unpatched) content management system. Where the website has "open" (world-viewable) directories we would like to automatically determine what files have been added by the attacker and to profile the nature of their attack.

String reflecting honeypots

Could one construct honeypots (to fool robots) merely by declaring which string to send in response to any particular input? That is, make almost no attempt to emulate a shell. The project would be to create a user-friendly (and easy to update) scripting language for this task.
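
To make the idea concrete, here is a minimal sketch of the "declare a string per input" approach: a table of (pattern, response) rules where the first matching rule wins. The rule syntax, the example commands and the canned responses are all invented for illustration; designing a friendlier language than raw regular expressions is the actual project.

```python
import re

# Hypothetical rule table: (regular expression, response template).
# A template may refer to captured groups as \1, \2, ...
RULES = [
    (r"^uname(\s+-a)?$", "Linux router 3.10.14 mips GNU/Linux"),
    (r"^echo\s+(.+)$", r"\1"),
]

# Sent when no rule matches; also an invented example string.
DEFAULT = "sh: command not found"


def respond(line, rules=RULES, default=DEFAULT):
    """Return the canned response for one line of attacker input."""
    text = line.strip()
    for pattern, template in rules:
        m = re.match(pattern, text)
        if m:
            # Substitute any captured groups into the response template.
            return m.expand(template)
    return default
```

For example, `respond("echo hello world")` reflects the string `hello world` back, which is often all a robot checks for before proceeding.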

DDoS victim categorisation

We have details of over 3 trillion packets from the 10,000+ reflected amplified UDP attacks that occur each day. These tell us which IP was victimised. We often say it's one US High School kid attacking another to gain an advantage in an online game, but can we be more rigorous in explaining what is going on?

Analysing Mirai malware

We have over 9000 Mirai (and similar) malware samples -- infections are used to create large 'Internet of Things' botnets. Many are undoubtedly very similar to each other. Can we disassemble them rapidly with a view to understanding what small differences might be present?

What is on the 'Dark Web'?

Tor hidden services (accessed by .onion addresses) have been regularly surveyed but there is little consensus as to what types of site are present. Can we classify sites and better understand the scams run by people who clone sites for their personal enrichment?

Quantifying users of multiple underground forums

We have a dataset of more than 25 million posts, 500k members and 3 million threads scraped from several underground forums. Some handles (user names) appear on more than one forum and it may be the same person. Some users have more than one account on a single forum. Can we apply Social Network Analysis and Natural Language Processing techniques to link handles together?

Trading of social network accounts in underground forums

Underground forums are widely used for trading accounts from sites like Twitter or Instagram. Can we identify this trading within our substantial dataset of forum data and automatically infer what (undoubtedly undesirable thing) the buyers do with these accounts?

For historical interest only: old proposals from 2006 or earlier!

Su Doku assistant
CAPTCHA solving
Email header reading
419 Scam "Bingo!"
Tracking changes to websites
Cryptanalysis of M-209
Signed faxes
Locating web servers
Bulk viewing of images
Measurement functions for classical cryptanalysis
Cryptanalysis of Enigma
Phone directory
Advanced Fee Fraud email analysis
Microsoft Word document cleaner

Su Doku assistant

Solving Su Doku puzzles by computer (by brute force or otherwise) isn't especially interesting. Setting them using a computer is more of a challenge, since they need to be soluble and meet various ethical criteria for the nature of the solution. This project is not unrelated to this last issue -- what this project proposes is an assistant that will offer suggestions to a human solver so that they can progress when they get stuck on a puzzle. These suggestions need to be couched in terms that the solver will understand, such as `have you considered where a 3 can be placed in the bottom left corner?'. Ideally the assistant would analyse the solution the human has so far, deduce which tactics they know, and hence make suggestions for progress couched in terms of their current abilities, teaching new and more complex logical deductions only when necessary. The assistant could then be used to assess the difficulty of puzzles: is 'difficult' in The Guardian at a lower level than 'fiendish' in The Times, or not?
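
By way of illustration, the simplest tactic an assistant might know about is "this digit can only go in one cell of this box". The sketch below looks for such cases and phrases them as a question rather than giving the answer away. The grid representation (a 9x9 list of lists with 0 for an empty cell), the function names and the box names are assumptions made for this example only.

```python
BOX_NAMES = [
    ["top left", "top middle", "top right"],
    ["middle left", "centre", "middle right"],
    ["bottom left", "bottom middle", "bottom right"],
]


def box_cells(box_row, box_col):
    """All (row, column) positions inside one 3x3 box."""
    return [(r, c)
            for r in range(3 * box_row, 3 * box_row + 3)
            for c in range(3 * box_col, 3 * box_col + 3)]


def legal_cells(grid, digit, box_row, box_col):
    """Empty cells of the box whose row and column do not yet hold digit."""
    cells = []
    for r, c in box_cells(box_row, box_col):
        if grid[r][c] != 0:
            continue
        in_row = digit in grid[r]
        in_col = any(grid[i][c] == digit for i in range(9))
        if not in_row and not in_col:
            cells.append((r, c))
    return cells


def hints(grid):
    """One hint per (digit, box) pair where the digit has a single legal cell."""
    found = []
    for br in range(3):
        for bc in range(3):
            present = {grid[r][c] for r, c in box_cells(br, bc)}
            for digit in range(1, 10):
                if digit in present:
                    continue
                if len(legal_cells(grid, digit, br, bc)) == 1:
                    found.append(
                        f"Have you considered where a {digit} can be placed "
                        f"in the {BOX_NAMES[br][bc]} box?"
                    )
    return found
```

Working out which tactics the solver already knows, and ordering hints accordingly, is where the real work lies; this only shows the shape of a single tactic.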

CAPTCHA solving

Captchas (or HIPs) are supposed to distinguish computers and humans (see http://www.captcha.net). Unfortunately many of them are very poorly designed and will fall to very simple image processing procedures. Recent work by Chellapilla et al. has shown that computers are in fact better than humans (!) at picking out distorted characters, and the real challenge is to separate the glyphs. However, there's a whole heap of CAPTCHAs, notably in spam challenge-response systems, which are often just text against a hatched background. See here for what currently seems to be possible. What would be interesting would be to extend this work, or perhaps to develop a `proof of concept' program that parses incoming email and automatically deals with the challenge-response within it! [you can tell that like John Levine I consider challenge-response part of the problem and not part of the solution to spam.]

Determining where email has come from

The path email takes through the Internet is recorded in "Received" headers. By working methodically through these, and taking an informed view about what can be trusted, it is possible to determine the true source of an email (or at least where to make further enquiries about that source).

This is of interest in many situations, for example when reporting "spam", when contacting people whose machines are infected by viruses, or when tracking down the senders of scam emails such as those involved in advanced fee fraud (419/Nigerian).

In simple cases, parsing headers is easy, but it can become complicated. Some of the issues are outlined in this section of the training course I've taught to police SPOCs (Single Points of Contact) on Traceability. Though the notes suggest that automating the process is hard, it does seem to be a viable project to propose. For example, by clever use of MX records and other DNS access it might be possible to have a serious stab at a system that at the very least can indicate a small number of alternative sources.
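
The easy case can be sketched in a few lines. Each relay prepends its own "Received" header, so the top-most headers are the most recent, and only those written by machines you trust can be believed. The sketch below pulls the bracketed IPv4 address out of each header and reports the address recorded by the last trusted hop. The function names, the `trusted_hops` parameter and the decision to look only at bracketed IPv4 literals are assumptions for illustration; real headers are far messier than this.

```python
import re
from email import message_from_string

# Many (not all) MTAs record the connecting address as [a.b.c.d].
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_hops(raw_message):
    """Return the Received headers, newest first, with any bracketed IP."""
    msg = message_from_string(raw_message)
    hops = []
    for header in msg.get_all("Received", []):
        flat = " ".join(header.split())  # unfold continuation lines
        m = IPV4_IN_BRACKETS.search(flat)
        hops.append({"ip": m.group(1) if m else None, "raw": flat})
    return hops


def candidate_source(raw_message, trusted_hops=1):
    """IP recorded by the last header we trust (None if nothing usable).

    trusted_hops is the number of top-most headers assumed to have been
    written by our own, honest, mail servers.
    """
    hops = received_hops(raw_message)
    if not hops:
        return None
    last_trusted = min(trusted_hops, len(hops)) - 1
    return hops[last_trusted]["ip"]
```

Everything below the last trusted header may have been forged by the sender, which is exactly where the informed judgement, and the DNS cross-checking suggested above, comes in.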

419 email Bingo!

This is a simple project on the 419 scam theme that might suit a Diploma student.

In order to ensure that people are not taken in by these emails some effective end-user education is required. The idea is to develop a Bingo style game to be played with these emails. A web application is needed that will process one of these emails and fill in squares depending on details such as the country, name of the famous dictator, number of millions of dollars etc. One might imagine colleagues playing against each other to see which one can shout House! (or perhaps Fraud!) first. An attractive game would be of interest to all those creating 419 information sites because of its properties as a meme to spread information about these scams.

Tracking changes to websites

The Regulation of Investigatory Powers Act 2000 includes provisions (not yet in force) which can require people to hand over their encryption keys (see s49). Under some special circumstances outlined in s53 you can be required to keep secret that your keys have been compromised in this way. It is generally understood that nothing in RIP prevents one from revoking one's key; however, if asked, one would commit an offence by revealing why. This has led a number of people to provide carefully worded statements about their revocation policies on their websites.

It would be interesting to build a web-based system that would enable people to register their statements. A robot would then check these statements on a daily basis to determine if they had changed in any significant way. If they had changed then the owner would be contacted to determine if the change was intentional -- and in the meantime the community would be alerted. The project will require the development of parsing tools rich enough to spot changes, along with a database and suitable data entry tools so that people can register details of their statements. Attention needs to be paid to privacy and security issues throughout.
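
A minimal sketch of the "has it changed in any significant way?" check follows. It assumes that whitespace and line-wrapping differences are insignificant and that everything else matters; that definition of "significant", and the function names, are assumptions for illustration. Only the Python standard library (`hashlib`, `difflib`, `re`) is used.

```python
import difflib
import hashlib
import re


def normalise(text):
    """Collapse runs of whitespace so that re-wrapping is not a 'change'."""
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(text):
    """Stable digest of the normalised statement, cheap to store and compare."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def changed_words(old, new):
    """Words removed ('- word') or added ('+ word') between two versions."""
    a = normalise(old).split()
    b = normalise(new).split()
    return [d for d in difflib.ndiff(a, b) if d[:2] in ("- ", "+ ")]
```

The robot would store `fingerprint(statement)` at registration, compare it daily, and use `changed_words` to show the owner (and the community) exactly what differs. Spotting a statement that has been subtly reworded while still looking innocuous is the harder parsing problem.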

Cryptanalysis of the M-209

The same technique that worked on Enigma (see below) may also work on the M-209. Dennis Ritchie describes something similar on his web page, but omits (as the page explains) all the details.

See: Dennis Ritchie, Dabbling in the Cryptographic World -- A Story

This project has been attempted in the past -- and it didn't go too well. However, it did turn up a hand-driven cryptanalytic method, which turns out to have some interesting twists when one attempts to implement it on a machine. There may be some mileage in looking more deeply into this.

Signed faxes

Although email is the medium of choice for much communication, sometimes recipients feel happier with a fax because they get a copy of the sender's signature. A well-known example of this strange approach to authentication is the University whose Ordinances often prescribe that requests must be made in writing. They currently interpret this to include signed faxes but not emails. It would be convenient to allow busy academics to use email to submit documents such as calls for discussions or Regent House votes, whilst meeting the requirements of the Old Schools.

The project is to develop an email to fax system that will incorporate a previously stored signature into the document that is sent. The easy way to obtain signatures is by processing scanned pictures, or indeed a signature sent from a real fax machine.

For obvious reasons, security is of prime importance to this project and it will be necessary to develop a security policy and show how the resulting system adheres to this policy.

This project was attempted in 2002-2003, but the Diploma student who attempted it did not quite manage to develop a complete system, though a number of necessary components were designed and built. It would be suitable for both Diploma and Part II students to attempt again.

Locating web servers

The Internet Watch Foundation (http://www.iwf.org.uk/) operates a "hot line" through which members of the public can report the presence of potentially illegal material on the Internet. Their first priority is child pornography.

When a report has been made that refers to a web site, trained IWF staff view the site and determine whether the material is indeed potentially illegal. If so, then a report is made to the hosting ISP and to the UK police. If the site is abroad then the report is sent via the National Criminal Intelligence Service (NCIS) who will pass it to the appropriate police force in another country.

The report needs to include information on the IP address of the web server and should correctly identify who "owns" this address (and who the "upstream" ISP may be). It should also identify, as far as possible, which country the site is in - few sites of the form "xxx.to" are located in Tonga! All this information can be determined by a rather tedious mixture of techniques such as "whois", "nslookup" and "traceroute".

Sometimes web-sites are "forwarded", viz: the URL that apparently points to the site will retrieve a page that then redirects the browser to another web server. This redirection can take place either by HTTP commands or sometimes via Java or JavaScript. Also, images may not be present on the same server as the rest of the site content.

This forwarding leads to some complexity in producing reports as to exactly which web servers are involved in providing illegal content and in ensuring that all parts of a chain are detected and reported.

At present the IWF performs the task of unpicking the forwarding and then identifying the ISP (and country of origin) by hand. They would appreciate having an "easy to use" tool, running under Windows, that they could use to automate this part of their process.

The IWF is based in Oakington (a few miles west of Cambridge) and some liaison would be required with them, if the project is to produce software that addresses their needs.

Bulk viewing of images

The Internet Watch Foundation (http://www.iwf.org.uk/) exists to restrict the availability of criminal content and help consumers prevent access to potentially harmful content on the Internet in the United Kingdom.

The IWF operates a hot-line through which the public can report the presence of potentially illegal material. Their first priority is child pornography.

Reports of child pornography on Usenet have shown that it tends to cluster in particular newsgroups, and the IWF's trained staff now proactively scan some newsgroups to check if further potentially illegal material has appeared. They are currently using a standard newsreader for this task, which makes it rather labour intensive.

They believe that it would be considerably easier if they had a special purpose tool that automatically selected all the new images from a newsgroup and displayed them in a "thumbnail" size. They could then examine full-size versions of selected pictures as required - and then automatically generate standardised reports of any potentially illegal content that was present.

People considering this project should note that it is as much about usability as it is about displaying graphic formats. It is unlikely that a perfect statement of program functionality will be achievable at the start of the project - so the aim should be to create a prototype at an early stage and then iterate the design, in consultation with the IWF, towards a truly useful tool. As such, this project should only be considered by people who already have some experience in developing GUI programs in a Windows environment.

The IWF is based in Oakington (a few miles west of Cambridge) and a fair amount of liaison would be required with them, to ensure that the project produced software that addressed their needs.

Measurement functions for classical cryptanalysis

Guvf cebwrpg fubhyq vagrerfg nalbar jub unf znantrq gb ernq guvf qrfpevcgvba bs vg!

Gurer ner n ahzore bs urhevfgvp nccebnpurf gb fbyivat pynffvpny pelcgbtencuvp flfgrzf fhpu nf zbabnycunorgvp be cbylnycunorgvp fhofgvghgvba. [Gur "pnrfne" pvcure "EBG13" gung lbh'ir whfg hacvpxrq vf n fvzcyr rknzcyr bs n zbabnycunorgvp fpurzr].

Gur urhevfgvp nytbevguzf eryl ba pbafgehpgvat bar fbyhgvba naq gura pbafgehpgvat shegure fbyhgvbaf gung vzcebir n "tbbqarff bs svg" shapgvba. Gurfr zrnfherzrag shapgvbaf hfhnyyl pbzcner yrggre serdhrapl naq qvtenz serdhrapl jvgu rkcrpgrq inyhrf sbe abezny grkg. Gevtenzf naq ybatre pna nyfb or vaibxrq va n ubcr bs trggvat n orggre zrnfher bs jurgure n urhevfgvp punatr vf trggvat pybfre gb gur gehr fbyhgvba be shegure njnl.

Gur cebwrpg vf gb vairfgvtngr nccebcevngr shapgvbaf - qb gevtenzf ernyyl uryc lbh svaq n fbyhgvba dhvpxre ? Ybt(serdhrapl) hfhnyyl jbexf orggre guna n fvzcyr serdhrapl zrnfher... vf guvf gehr sbe nyy urhevfgvp nytbevguzf ? be ner fbzr zber frafvgvir guna bguref ?

N pbechf bs fbzr 100 zvyyvba Ratyvfu jbeqf vf ninvynoyr gb cebivqr gur enj qngn gung vf arrqrq ba serdhrapvrf, gubhtu bapr guvf unf orra cebprffrq gurer fubhyq or ab arrq sbe nal hahfhny nzbhagf bs qvfx fgbentr be cebprffvat cbjre.

Guvf cebwrpg zvtug cnegvphyneyl fhvg qvcybzn fghqragf urfvgnag nobhg gurve cebtenzzvat novyvgvrf fvapr vg jvyy vaibyir perngvat n ahzore bs fznyy cebtenzf engure guna bar ynetr zbabyvgu. Ubjrire, vg jvyy cebivqr n fvtavsvpnag punyyratr gb nal raguhfvnfg fvapr gur cebwrpg vf nf bcra raqrq nf lbh jvfu gb znxr vg.

Cryptanalysis of Enigma

James J Gillogly has shown that it is possible to cryptanalyse Enigma traffic using a ciphertext only attack. This was not the approach taken during WWII, when known-plaintext attacks were used. As machines get ever faster (and more are available to use in parallel), it should be possible to extend Gillogly's attack to recover ring settings at the same time as message key settings and wheel selection.

Refn: James J Gillogly, Ciphertext-only Cryptanalysis of Enigma. Cryptologia, October 1995, XIX(4).
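
A ciphertext-only attack needs some statistic that says whether a trial decryption looks more like natural language than the previous one did, without knowing the plaintext. One standard statistic of this kind is the index of coincidence, sketched below; it is offered as an example of the sort of measure involved and is not claimed to be exactly what the paper uses at every stage. The function name is an assumption.

```python
from collections import Counter


def index_of_coincidence(text):
    """Probability that two letters drawn at random from text are equal.

    Natural-language text scores noticeably higher than uniformly random
    letters, so a higher value suggests a trial setting is closer to right.
    Only the letters A-Z are counted; everything else is ignored.
    """
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))
```

A search would decrypt the traffic under each candidate wheel order and message setting, keep the settings with the highest score, and then refine the remaining settings in the same hill-climbing fashion.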

Phone directory

The killer application for a small to medium sized organisation considering an intranet is putting the internal phone directory online so that it can be accessed via a browser. Now that digital cameras are universally available, the usual details can be supplemented with pictures of who you are talking to. Obvious extensions are links to organisation charts. Something like this...

Despite this being "obvious" there are very few easy-to-use, freeware examples of this type of software. What's needed is a tool with an interface that's at least as easy to use as "powerpoint" and which hooks into standard systems like Access. There's nothing wrong with Linux in this context -- except that the target audience won't be using it!

Advanced Fee Fraud email analysis

Advanced Fee Fraud is sometimes called "419 scams" or "Nigerian letters", though these days much comes from countries other than Nigeria and the range of scenarios is ever growing, with lottery 'wins' being popular at present. You can read more about it here.

NCIS are considering creating a website for individuals to report these scams. Clearly, because of the way in which these letters are sent out, the same email is likely to be reported many times, and substantively the same email may be sent with minor variations over time.

The project is to investigate the practicalities of processing large numbers of reported emails and summarising this into a simple report that indicates numbers, sources, and how an email is changing over time. It may be possible to apply textual analysis systems to attempt to identify commonality of authorship.
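
One plausible starting point for spotting "substantively the same email with minor variations" is to compare sets of overlapping word sequences (shingles) using Jaccard similarity, and to put each report into the first group whose representative it resembles. The sketch below does this greedily; the shingle length, the 0.3 threshold and the function names are arbitrary assumptions for illustration, and a real system would need something that scales better than comparing against every group.

```python
import re


def shingles(text, k=3):
    """Set of all runs of k consecutive words (lower-cased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_reports(emails, threshold=0.3):
    """Greedily group emails; returns lists of indices into emails."""
    groups = []  # each entry: (shingles of first member, [member indices])
    for i, text in enumerate(emails):
        s = shingles(text)
        for representative, members in groups:
            if jaccard(representative, s) >= threshold:
                members.append(i)
                break
        else:
            groups.append((s, [i]))
    return [members for _, members in groups]
```

Counting the members of each group gives the numbers for the report, and diffing successive members of a group shows how a particular email is changing over time.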

Microsoft Word document cleaner

Microsoft Word records all sorts of information within documents as well as what you see on the screen. It's quite common to read reports of people who didn't realise that previous text can be preserved within the document and considerable embarrassment can result when last minute changes come to light. Besides this, documents can also "leak" who wrote them, when they started and how long they've been worked on. What's needed is a tool to process Word documents and put them into a canonical (and non-embarrassing) form. Further enhancements would be to flag the presence of macros (the recipient may not be happy to risk executing your code).

Return to Richard Clayton's Home Page

last modified 4 OCT 2015 -- http://www.cl.cam.ac.uk/~rnc1/projects.html