Computer Laboratory

Course pages 2013–14

Information Retrieval

Principal lecturer: Dr Simone Teufel
Taken by: Part II

No. of lectures: 8
Suggested hours of supervisions: 2
Prerequisite courses: Mathematical Methods for CS (Part IB)

Aims

The course aims to characterise information retrieval in terms of the data, problems and concepts involved. It follows the textbook “Introduction to Information Retrieval” (see Recommended reading). The main formal retrieval models and evaluation methods are described. Web search is also covered. The course then turns to problems and standard solutions in two related areas, clustering and text classification.

Lectures

  • Introduction and Boolean Retrieval. (Chapters 1; 2.3) Key problems and concepts. Information need. Boolean Operators and Implementation.

  • Indexing. (Chapters 2.2; 2.4; 3) Term manipulations; stemming; spelling correction.

  • Index Construction and Compression. (Chapters 4.2-4.4; 5). BSBI, SPIMI, Distributed indexing. Dictionary compression. Byte- and bit-level codes.

  • The Vector Space Model. (Chapter 6). VSM and Term weighting.

  • Evaluation. (Chapter 8, pp. 139-148). Test Collections. Relevance. Precision, Recall, MAP, 11pt interpolated average precision.

  • Clustering. (Chapters 16.1-16.4; 17.1-17.2). Proximity metrics, hierarchical vs. partitional clustering. Clustering algorithms. Evaluation metrics.

  • Text Classification. (Chapter 13.1-13.4). Naive Bayes. The Bernoulli Model.

  • Link Analysis. (Chapter 21, excluding 21.2.3). PageRank; Hubs and Authorities.

Objectives

At the end of this course, students should be able to

  • define the tasks of information retrieval, web search, clustering and text classification, and the differences between them;

  • understand the main concepts, challenges and strategies used in IR, in particular the retrieval models currently used;

  • develop strategies suited for specific retrieval, clustering and classification situations, and recognise the limits of these strategies;

  • understand (the reasons for) the evaluation strategies developed for these three areas.

Recommended reading

* Manning, C.D., Raghavan, P. & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/.