Algorithms2 challenge 2

Algorithms2 Challenge 2: Finding Order

In this tick, your aim is to find a good order for a set of items, given similarity scores between them. You are given a list of pairs of items and their similarity scores (this list doesn’t include all pairs). Here is an example:

ticksim_train.csv

We saw an illustration in the video for section 6.6. We were given a list of students, and also the similarity scores between their submitted code for Tick 1. We used Kruskal’s algorithm to find an ordering for the students, such that two students with a high similarity score appeared close to each other in the order.

Your aim to produce a good ordering of items. To be precise, let \(s_{u v}\in(0,1)\) be the similarity score between items \(u\) and \(v\). Your score will be \[ \text{score}=100\times\frac{x-m}{-m} \quad\text{where}\quad x = \frac{1}{M N}\sum_{\text{pairs }(u,v)} \bigl|z_u-z_v\bigr|\log(1-s_{u v}). \] Here \(z_u\) is the index of item \(u\) in your ordering, \(M\) is the number of pairs, and \(N\) is the number of items. The normalization is so that the score is always \(\leq 100\) (since \(x\leq 0\)); and the constant \(m\) is the expected score from a random ordering, \[ m = \frac{1}{3M}\sum_{\text{pairs }(u,v)} \log(1-s_{u v}), \] thus any sensible answer will give score \(> 0\).

Hint. Could we improve on the Kruskal clustering method, by choosing which branch of the classification tree is left and which is right? Could we use ideas from toposort? Could we somehow use max-flow to order vertices?

Submission (Python)

Submission filename: orderer.py

Please submit a source file orderer.py, which implements this function:

order(user1, user2, similarity)

Here all three arguments are lists of the same length, and similarity[i] is the similarity score between items user1[i] and user2[i]. Your function should return a list containing all the items that appear in either user1 or user2. A silly answer would be

# silly answer
def order(user1, user2, similarity):
    return list(set(user1) | set(user2))

You can use numpy and pandas.

The Moodle tester runs your code on the training dataset linked to above, and shows you your score. For the final evaluation, the lecturer will run your code on a hold-out dataset.