Algorithms challenge: rank-sim

Order items by similarity

In this tick, your aim is to find a good order for a set of items, given similarity scores between them. You are given a list of pairs of items and their similarity scores (this list doesn’t include all pairs). Here is an example:

ticksim_train.csv

We saw an illustration in the video for section 6.6. We were given a list of students, and also the similarity scores between their submitted code for an Algorithms tick. We used Kruskal’s algorithm to find an ordering for the students, such that two students with a high similarity score appeared close to each other in the order.

Your aim is to produce a good ordering of items. To be precise, let \(s_{u v}\in(0,1)\) be the similarity score between items \(u\) and \(v\). Your score will be \[ \text{score}=100\times\frac{x-m}{-m} \quad\text{where}\quad x = \frac{1}{M N}\sum_{\text{pairs }(u,v)} \bigl|z_u-z_v\bigr|\log(1-s_{u v}). \] Here \(z_u\) is the index of item \(u\) in your ordering, \(M\) is the number of pairs, and \(N\) is the number of items. The normalization is so that the score is always \(\leq 100\) (since \(x\leq 0\)); and the constant \(m\) is the expected score from a random ordering, \[ m = \frac{1}{3M}\sum_{\text{pairs }(u,v)} \log(1-s_{u v}), \] thus any sensible answer will give score \(> 0\).

Ideas: Could we improve on the Kruskal clustering method, by choosing which branch of the classification tree is left and which is right? Can ideas from toposort or max-flow help with ordering? Is there a completely different approach that would work better?

Submission

Please submit a source file orderer.py to Moodle. It should implement this function:

order(user1, user2, similarity)

Here all three arguments are lists of the same length, and similarity[i] is the similarity score between items user1[i] and user2[i]. Your function should return a list (i.e. an ordering) containing all the items that appear in either user1 or user2. A silly answer would be

# silly answer
def order(user1, user2, similarity):
    return list(set(user1) | set(user2))

You may use numpy and pandas.

The Moodle tester runs your code on the training dataset linked to above, and shows you your score. For the final evaluation, the lecturer will run your code on a hold-out dataset.