Department of Computer Science and Technology

Technical reports

Computer Laboratory technical-reports archiving system

We are occasionally asked what software we use to manage our Technical Reports web archive. This writeup tries to answer that question by providing a quick overview and links to the source code, and by discussing a few other issues that influenced the design. The setup is currently used only at our department, for this series, but it was in principle meant to be configurable enough that it could also be used by other departments, or even at other organizations. (A couple of minor local dependencies that have crept in over the years might have to be separated out first, but that wouldn't be a big project, and is probably a good idea in any case.)

Architecture

Our TR archive is based entirely on static files, meaning there are no dynamically generated HTML pages or SQL database servers involved. This was the preference of our webmasters when Markus Kuhn offered in late 2001 to take over an archive that had previously been maintained entirely by hand. The approach has worked very well for us and is very easy to maintain. It also allows us to track the state of the entire archive in a file-based version-control system (we currently use Subversion).

The system was implemented mainly by Markus Kuhn, and was designed for an operator/maintainer who is reasonably comfortable with Unix tools, in particular Perl and Make. We currently use the system on Ubuntu Linux desktop computers that have NFS-mounted the folder that houses the archive, which is also the folder exported by our Apache HTTP server.

The only CGI script currently involved provides an OAI-PMH query interface (still unfinished).

Content

The content of the archive consists mainly of a collection of report PDFs, along with metadata in two plaintext “database” files, which humans edit directly, usually with Emacs:

  • tr-database.txt

    This is the main metadata file, which contains three types of UTF-8 plain-text lines:

    • series parameters, of the form “key=value”, which affect all reports
    • a vertical-bar separated record line for each technical report, of the form “number|date|title|author, author, ...|pages|notes”, where “notes” is itself a vertical-bar separated set of optional “key=value” parameters, and each author entry consists of their name and optionally a unique identifier, in the form “Firstname Lastname <id>”
    • comment lines
  • tr-abstracts.txt

    We keep the abstracts in a separate file, simply because they are longer multi-line text that we want to be able to reflow easily, e.g. using Emacs’ M-q fill-paragraph command. For this reason, abstracts are separated by separator lines about 70 characters long, which contain the report number of the following abstract and are not broken when the text is reflowed.

A more detailed description of the syntax of these files can be found as comments at the start of each file.
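
As an illustration of the record format described above, here is a minimal Python sketch of a reader for tr-database.txt lines. It is not the actual TechReports.pm code (which is Perl). The names Author, Report, parse_author, parse_record and parse_database are made up for this example, and several details are assumptions: that comment lines start with “#”, that record lines can be recognised by a leading number, that author names contain no commas, and that dates and page counts can be kept as plain strings.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Author:
    name: str
    ident: Optional[str] = None  # unique identifier from "<id>", if present


@dataclass
class Report:
    number: int
    date: str    # kept as text; the real date syntax is not assumed here
    title: str
    authors: List[Author]
    pages: str
    notes: Dict[str, str] = field(default_factory=dict)


# "Firstname Lastname <id>", where the "<id>" part is optional
AUTHOR_RE = re.compile(r"^(?P<name>[^<]+?)\s*(?:<(?P<id>[^>]+)>)?$")


def parse_author(text: str) -> Author:
    match = AUTHOR_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse author entry: {text!r}")
    return Author(match.group("name"), match.group("id"))


def parse_record(line: str) -> Report:
    """Parse "number|date|title|author, author, ...|pages|notes"."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 5:
        raise ValueError(f"too few fields in record: {line!r}")
    number, date, title, authors, pages = fields[:5]
    notes = {}
    for note in fields[5:]:          # optional "key=value" parameters
        if note:
            key, _, value = note.partition("=")
            notes[key.strip()] = value.strip()
    author_list = [parse_author(a) for a in authors.split(",")]
    return Report(int(number), date, title, author_list, pages, notes)


def parse_database(lines: Iterable[str]) -> Tuple[Dict[str, str], List[Report]]:
    params: Dict[str, str] = {}
    reports: List[Report] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):   # ASSUMPTION: "#" marks a comment
            continue
        if line.split("|", 1)[0].strip().isdigit():
            reports.append(parse_record(line))  # record line
        elif "=" in line:
            key, value = line.split("=", 1)     # series parameter
            params[key.strip()] = value.strip()
        else:
            raise ValueError(f"unrecognised line: {line!r}")
    return params, reports


# Hypothetical sample data, not taken from the real archive
sample = [
    "# example only",
    "series=Example Technical Report",
    "123|2001-12|An example report|Alice Example <ae101>, Bob Sample|42|licence=CC-BY",
]
params, reports = parse_database(sample)
print(params)
print(reports[0].title, [a.ident for a in reports[0].authors])
```

The comments at the start of the real files remain the authoritative description of the syntax; the sketch only shows that the three line types can be told apart and split with a few lines of code.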

Software

Most of the software supporting the archive is contained in a couple of Perl 5 files, using an object-oriented programming style:

  • tr-tool

    This is the main tool for maintaining the archive. It reads tr-database.txt and tr-abstracts.txt metadata into RAM using the functions provided by TechReports.pm and then offers numerous command-line functions to output the metadata in many different formats. It also calls LaTeX to produce standard PDF title pages for each report, and can call one of several PDF tools to prefix these to a report. It can deal with reports submitted in PDF, PostScript or DVI.

  • tr-doi

    This is an auxiliary tool just for registering a technical report with DataCite, our provider of Digital Object Identifiers (DOIs), via their JSON-over-HTTPS API. It also loads its data using TechReports.pm. It keeps its authentication password stored elsewhere, via our PasswordVault.pm mechanism.

  • TechReports.pm

    This Perl module contains code for reading all the data from tr-database.txt and tr-abstracts.txt into Perl objects. It defines Perl classes TechReports (one object representing the entire series), TechReport (each object representing one published document), and TRAuthor (each object representing one author of one published document), along with a collection of useful methods operating on these.

    In addition, TechReports.pm can also save the in-memory representation back into tr-database.txt and tr-abstracts.txt files. We don't use this routinely, but these save functions enable us to perform algorithmic updates of these files, e.g. in case there are schema changes or systematic updates needed.

    And it can also maintain a tr-database-times.txt file with secure hash values of the metadata of each report, along with a timestamp of when it was last modified. This information can be used to automatically detect and report changes, e.g. to update other files (like abstract web pages) only in case there were any changes, or to satisfy OAI-PMH requests for recently changed records.

  • UniConv.pm

    This Perl module contains some string-processing functions to convert UTF-8 Unicode characters, as found in the titles, author names and abstracts of our technical reports, into both LaTeX and ASCII-fallback notations. These functions used to be a part of tr-tool, but they are now factored out as they became quite comprehensive, and might become useful on their own in some other projects.
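
To give an idea of the kind of conversion that UniConv.pm performs, here is a small Python sketch that maps a handful of accented characters to LaTeX accent commands and to an ASCII fallback. It is only an illustration under stated assumptions: the function names to_latex and to_ascii and the tiny mapping tables are invented for this example, the real module is written in Perl and covers far more characters, and the use of Unicode canonical decomposition (NFD) is a technique chosen here, not necessarily the one used in UniConv.pm.

```python
import unicodedata

# Combining accent -> LaTeX accent command (deliberately tiny table)
LATEX_ACCENTS = {
    "\u0301": r"\'",   # acute
    "\u0300": r"\`",   # grave
    "\u0308": r'\"',   # diaeresis
    "\u0302": r"\^",   # circumflex
    "\u0303": r"\~",   # tilde
    "\u0327": r"\c",   # cedilla
}
# Characters that need an explicit replacement
LATEX_SPECIAL = {"ß": r"{\ss}", "ø": r"{\o}", "–": "--", "&": r"\&", "%": r"\%"}
ASCII_SPECIAL = {"ß": "ss", "ø": "o", "–": "-"}


def to_latex(text: str) -> str:
    """Replace known non-ASCII characters by LaTeX notation."""
    out = []
    for ch in text:
        if ch in LATEX_SPECIAL:
            out.append(LATEX_SPECIAL[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in LATEX_ACCENTS:
            # base letter plus one known accent, e.g. "é" -> \'{e}
            out.append("%s{%s}" % (LATEX_ACCENTS[decomposed[1]], decomposed[0]))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


def to_ascii(text: str) -> str:
    """Produce an ASCII-only fallback; unknown characters become '?'."""
    out = []
    for ch in text:
        if ch in ASCII_SPECIAL:
            out.append(ASCII_SPECIAL[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # drop combining marks, keep the base character
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out).encode("ascii", "replace").decode("ascii")


print(to_latex("Gödel – café"))
print(to_ascii("Gödel – café"))
```

A table-driven design like this keeps the two output notations independent, so a character can be added to one table without touching the other.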

Initially, tr-tool was implemented in a non-OO programming style, using separate hash tables for each attribute of a report. This eventually became a bit unwieldy and difficult to extend with new parameters. A major refactoring in 2016 introduced the object-oriented representation now provided in TechReports.pm, which made the code much easier and more pleasant to maintain.
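
The object-oriented representation, and the hash-based change tracking that TechReports.pm builds on it, can be sketched as follows. This is a Python illustration only, not a translation of the Perl code: the class names mirror those mentioned above, but the attribute names, the methods metadata_digest and update_times, the choice of SHA-256 as the secure hash, and the in-memory dictionary standing in for tr-database-times.txt are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TRAuthor:
    """One author of one published document."""
    name: str
    ident: Optional[str] = None


@dataclass
class TechReport:
    """One published document, with all its attributes in one place."""
    number: int
    date: str
    title: str
    authors: List[TRAuthor]
    pages: str = ""
    notes: Dict[str, str] = field(default_factory=dict)
    abstract: str = ""

    def metadata_digest(self) -> str:
        """Hash over a canonical serialisation of the report's metadata."""
        parts = [
            str(self.number), self.date, self.title,
            ", ".join(f"{a.name} <{a.ident or ''}>" for a in self.authors),
            self.pages,
            "|".join(f"{k}={v}" for k, v in sorted(self.notes.items())),
            self.abstract,
        ]
        # ASSUMPTION: SHA-256; the source only says "secure hash values"
        return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


@dataclass
class TechReports:
    """The entire series."""
    params: Dict[str, str] = field(default_factory=dict)
    reports: Dict[int, TechReport] = field(default_factory=dict)

    def add(self, report: TechReport) -> None:
        self.reports[report.number] = report

    def update_times(self, old: Dict[int, Tuple[str, int]],
                     now: int) -> Dict[int, Tuple[str, int]]:
        """Return {number: (digest, last_modified)}, keeping old timestamps
        for reports whose metadata digest has not changed."""
        new = {}
        for number, report in sorted(self.reports.items()):
            digest = report.metadata_digest()
            if number in old and old[number][0] == digest:
                new[number] = old[number]      # unchanged
            else:
                new[number] = (digest, now)    # new or modified
        return new


series = TechReports()
series.add(TechReport(123, "2001-12", "An example report",
                      [TRAuthor("Alice Example", "ae101")]))
times = series.update_times({}, now=1000)
series.reports[123].title = "An example report (revised)"
times = series.update_times(times, now=2000)
print(times[123][1])
```

Compared with one hash table per attribute, adding a new parameter here means adding one field to TechReport (and to the digest) instead of another parallel table.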

Workflow

The workflow for adding a new technical report is described in README.txt. It mainly consists of the following steps:

  • assign a new TR number nnn to the submission by adding a new line in tr-abstracts.txt
  • transfer (and if necessary edit) the metadata received from the corresponding author (currently via email) into tr-database.txt and tr-abstracts.txt
  • move the PDF file received from the author into orig/nnn.pdf
  • run “make UCAM-CL-TR-nnn.pdf” to make the report, which in turn causes the Makefile to invoke “./tr-tool prefix nnn” to prefix the submitted PDF with a standard two-sided title sheet auto-generated from the metadata
  • manually inspect the resulting PDF for problems
  • run “./tr-doi publish nnn” to obtain a DOI from DataCite
  • run “make” to publish the report by generating an abstracts web page and updating all the published catalogue web pages and indices (which is done through a series of calls to “./tr-tool ...”)
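
The command-line part of the steps above can be summarised in a short Python sketch that, for a given report number, lists the file to put in place and the commands to run in order. The function name workflow_plan is made up for this example, and two details are assumptions: that the report number is passed to tr-doi as a separate argument, and that the number is used exactly as given (no zero padding). README.txt remains the authoritative description.

```python
from typing import Dict, List, Union


def workflow_plan(number: str) -> Dict[str, Union[str, List[List[str]]]]:
    """Return the submitted-PDF location and the commands for report `number`."""
    tr_id = f"UCAM-CL-TR-{number}"
    return {
        # where the PDF received from the author goes
        "submitted_pdf": f"orig/{number}.pdf",
        "commands": [
            # builds the report; the Makefile calls tr-tool to prefix the title sheet
            ["make", f"{tr_id}.pdf"],
            # (manually inspect the resulting PDF before continuing)
            # ASSUMPTION: the number is a separate argument to "publish"
            ["./tr-doi", "publish", number],
            # regenerates abstract page, catalogue pages and indices
            ["make"],
        ],
    }


plan = workflow_plan("123")
print(plan["submitted_pdf"])
for argv in plan["commands"]:
    print(" ".join(argv))
```

The sketch only prints the commands. Running them automatically would skip the manual inspection step, which is a deliberate part of the workflow.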

While there is normally no need to edit the Makefile for each report, it is possible to set a few additional processing parameters for individual reports in the Makefile. The main reason for doing this is that we have several methods and tools for prefixing the title page to the submitted PDF (pdftk, mupdf, ghostscript), and in the past there have been a few cases where different methods worked better for different submitted PDFs. The Makefile allows us to record neatly what command-line tweaks were used in such special cases.

Housestyle

The tr-tool software itself doesn't apply any web housestyle; it calls an external tool (Ucampas) to do that.

The design of the standard cover page is mainly determined by the tr-title.sty LaTeX package, which typesets title-page documents such as tr-title-demo.tex (PDF) generated by tr-tool.

Identifiers

TR number

ISO 5966:1982 (withdrawn 2000) said “The report shall be given a unique alphanumeric designation that identifies the responsible organization, the report series and the individual report.” The U.S. government has had, since about 1974, a formal Standard Technical Report Number (STRN) identifier scheme for issuing alphanumeric tech-report numbers that uniquely identify such documents (ANSI/NISO Z39.23-1983). At the time Markus took over the series in 2001, there were plans for an ISO 10444 standard to extend such a scheme world-wide into an International Standard Report Number (ISRN), but that was never implemented, and ISO 10444 was withdrawn in 2007.

In the absence of a formal UK-national or international registration scheme for alphanumeric tech-report identifiers, Markus made up an identifier of the form UCAM-CL-TR-nnn (for University of Cambridge – Computer Laboratory – Technical Report, using only characters safe for use in filenames), inspired by the examples given in ISO 5966, to have a unique string associated with each TR that is easy to find with search engines. It served that purpose well, but the “CL” part may eventually look slightly dated following a renaming of our department in October 2017.

Other technical report series use either one continuous integer number sequence, or a year plus an annual sequence number. We had already settled on the former from the start, and there was no reason for changing that.

ISO 5966 also suggested that if “a single report is too large to be handled conveniently, it should be issued in two or more parts under the same title”. tr-tool originally supported nnn.p part numbers, but we removed this after a few years as it was never used (PDFs practically don’t have page-count limits).

We don't currently have any version-number suffix as part of our TR number. Minor revisions – fixing only a few words or sentences, but mostly preserving the page breaks – are indicated in a note on page 2; major revisions are published as new reports.

ISSN

ISO 5966 suggested that a proper tech-report series might have an ISSN, so we requested one in December 2001 and were assigned ISSN 1476-2986 by the ISSN National Centre for the UK at the British Library. But there has been one little niggle. They assign different ISSNs for print and electronic publications, and that ISSN type can't be changed later, as they use distinct number ranges. Back in 2001, we still sent out printed copies of our technical reports, so we applied for a print-publication ISSN (1476-2986). Soon afterwards (January 2002), we decided to stop printing technical reports and to maintain only an online archive. We should perhaps have asked at that point for another, electronic-publication ISSN, but that never happened. So now all our online TRs display only a print ISSN, whereas almost none of the previously produced print TRs actually carried our print ISSN.

DOI

Some funding bodies have started to require that research outputs be uniquely identified with a Digital Object Identifier (DOI). Therefore we asked the University Library’s Office of Scholarly Communication in 2021 to set up a DOI prefix (10.48456) and a DataCite Fabrica account for us. They were able to do this within their own DataCite membership, and therefore we didn't have to pay anything for our DOIs as a department. We can now create DOIs and submit associated metadata using the DataCite REST API (see the tr-doi tool). Our DOI syntax is “10.48456/tr-nnn”. The inclusion of the “tr-” prefix makes it visually obvious that this is a technical report, and it would allow us in future to use the same numeric prefix for other kinds of departmental publications.
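
To make the DOI syntax and the kind of metadata involved more concrete, here is a Python sketch that builds a DOI string and a JSON request body in the JSON:API style used by the DataCite REST API. It does not contact DataCite and is not the tr-doi code (which is Perl). The function names, the selection of attributes, the publisher string and the example.org landing-page URL are assumptions made for this illustration; the DataCite documentation defines the fields that are actually required.

```python
import json
from typing import Dict, List

DOI_PREFIX = "10.48456"


def tr_doi(number: int) -> str:
    """DOI of technical report `number`, in the form 10.48456/tr-nnn."""
    return f"{DOI_PREFIX}/tr-{number}"


def datacite_payload(number: int, title: str, authors: List[str],
                     year: int, url: str) -> Dict:
    """Build a request body for creating and publishing a DOI."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "doi": tr_doi(number),
                "event": "publish",
                "creators": [{"name": name} for name in authors],
                "titles": [{"title": title}],
                # ASSUMPTION: publisher wording chosen for this example
                "publisher": "University of Cambridge, Computer Laboratory",
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Report"},
                "url": url,  # landing page the DOI resolves to
            },
        }
    }


# Hypothetical report and landing-page URL
payload = datacite_payload(
    123, "An example report", ["Example, Alice"], 2021,
    "https://example.org/techreports/UCAM-CL-TR-123.html",
)
print(json.dumps(payload, indent=2))
```

Keeping the payload construction separate from the HTTPS request makes it easy to inspect or test the metadata before anything is registered.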

ISBN

Very occasionally an author has a need to obtain an ISBN for a technical report, for example due to a requirement by a funding body. Markus discovered in 2023 that the department had received in the early 1970s a block of 100 ISBNs with prefix “(978-)0-901224-”, but only the first four of these (00–03) appear to have been used at the time. So we reclaimed that registration, left a safety-gap (04–09), and we are now able to assign ISBNs in the range (10–99) from that block. We do not routinely assign ISBNs to tech-reports, because that would cost money and would serve no useful purpose as we don't sell our tech-reports through book shops. But they are available if especially requested.
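
Each number in such a block still needs a check digit before it is a complete ISBN. The following Python sketch shows the standard check-digit arithmetic for both the 13-digit and the older 10-digit form, applied to the prefix mentioned above. The function names are invented for this example, and the sketch says nothing about which numbers have actually been assigned.

```python
def isbn13_check_digit(first12: str) -> str:
    """Check digit for the first 12 digits of an ISBN-13 (weights 1,3,1,3,...)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_check_digit(first9: str) -> str:
    """Check digit for the first 9 digits of an ISBN-10 (weights 10 down to 2)."""
    total = sum(int(d) * w for d, w in zip(first9, range(10, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def department_isbn13(item: int) -> str:
    """Full ISBN-13 for item 10..99 of the 0-901224 block."""
    if not 10 <= item <= 99:
        raise ValueError("only items 10-99 are available for assignment")
    body = f"9780901224{item:02d}"          # 12 digits without check digit
    return f"978-0-901224-{item:02d}-{isbn13_check_digit(body)}"


def department_isbn10(item: int) -> str:
    """Corresponding ISBN-10 form (without the 978 prefix)."""
    if not 10 <= item <= 99:
        raise ValueError("only items 10-99 are available for assignment")
    body = f"0901224{item:02d}"             # 9 digits without check digit
    return f"0-901224-{item:02d}-{isbn10_check_digit(body)}"


print(department_isbn13(10))
print(department_isbn10(10))
```

The two forms carry different check digits for the same item number, because ISBN-13 uses a modulo-10 scheme over all twelve preceding digits while ISBN-10 uses a modulo-11 scheme.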

Licensing

If you publish a tech-report series, you occasionally get sent surveys by researchers who study such things, and one of these asked us in May 2008 for the URLs of both our submission policy and usage licence. In response, Markus wrote an informal default usage licence that allowed normal academic non-commercial uses of unmodified versions, but referred to contacting the authors for anything more.

Since then, there have occasionally been requirements from funding bodies to release research outputs under a Creative Commons licence, and we added a metadata attribute and standard title-page blurbs to support such options as well. However, the most commonly required CC BY licence allows arbitrary adaptation and reuse of the material for any purpose, which is a very generous and wide-ranging licence that may conflict with other requirements. For example, one has to be careful that the inclusion of proprietary materials, such as protected brand logos or commercial fonts, in a PDF released under CC BY does not cause conflicts. Since our standard title page currently still contains both, its licence text offers a CC BY licence only for “the following report”, a phrasing aimed to exclude the title page itself.

Likewise, authors may not own the copyright to all the materials included in a report, and may therefore not be able to grant a CC BY licence for the entire resulting PDF. Some of our reports also include formal specifications over which the authors or their sponsor want to keep change control, or for which the copyright had to be assigned to a collaborator or sponsor that does not want a CC BY licence. For such reasons, we have so far resisted suggestions that CC BY should be the default, and we only apply it if explicitly requested by the authors.

Wishlist

No project is ever finished, and there are still a few things to do in our setup:

  • The title-page design could do with a refresh, not just to reflect the changed name of our department, but also to remove the use of proprietary elements (University Identifier, commercial font) that might cause headaches in connection with Creative Commons licences.
  • The series parameters in tr-database.txt should perhaps be time-variable, such that changes to e.g. the name of the department can be applied from a certain report number onwards. This will need a bit of refactoring of the relevant data structure in TechReports.pm.
  • Finish the OAI-PMH CGI script.
  • Add a CGI script for a report-submission form perhaps? (Although: the current semi-prescribed email actually works quite well; it is very flexible and allows/encourages submitting authors to include more context information, e.g. about details they are unsure of.)
  • Some archiving standards (e.g., Plan S) require versioning, i.e. that different versions of a document all remain available and remain unambiguously identified by a version number in the DOI. (We currently publish only the latest version under the same DOI in case of minor updates.)
  • ORCIDs for authors
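
One possible shape for the time-variable series parameters mentioned in the list above is a sorted list of (first report number, value) pairs with a binary-search lookup. The Python sketch below illustrates that idea only. The class name TimeVariableParam is made up, no such structure exists in TechReports.pm yet, and the report number 900 used as the change point is purely hypothetical.

```python
import bisect
from typing import List, Tuple


class TimeVariableParam:
    """A series parameter whose value can change from a given report number on."""

    def __init__(self, changes: List[Tuple[int, str]]):
        # each entry: (first report number the value applies to, value)
        self._changes = sorted(changes)
        self._starts = [start for start, _ in self._changes]

    def value_for(self, number: int) -> str:
        # find the last change point that is <= number
        i = bisect.bisect_right(self._starts, number) - 1
        if i < 0:
            raise KeyError(f"no value defined for report {number}")
        return self._changes[i][1]


# HYPOTHETICAL change point: report 900 stands in for the first report
# published under the new department name
department = TimeVariableParam([
    (1, "Computer Laboratory"),
    (900, "Department of Computer Science and Technology"),
])
print(department.value_for(899))
print(department.value_for(900))
```

With such a structure, a series parameter that never changes is simply the special case of a single entry starting at report 1.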

Some work is needed to factor out a couple of local dependencies and make the code more easily usable elsewhere:

  • "tr-tool nag" should become a separate tool "tr-email-reminders", to factor out a dependency on our local email setup.
  • four functions used from our local site-wide convenience library CLWeb.pl could be copied over
  • calls to Ucampas could probably be moved into the Makefile
  • calls to CLWeb::homepage_url() (to find a local homepage URL for an author's unique ID) should be turned into a cached generic off-line mechanism, e.g. tr-tool could produce a file of IDs for which a homepage should be found, and a cron job could then regularly update a second file that lists all URLs found

See also