
Department of Computer Science and Technology

Archived personal web sites


Archiving process for personal web sites

A former user’s original departmental personal web space (~crsid/public_html/ = https://www.cl.cam.ac.uk/~crsid/) vanishes from our web server when we move their filer home directory into the departmental archive. Therefore, if we want to preserve public access to their site, we create here an archival copy of its publicly visible content, which browsers can then find via an HTTP 301 redirect from its original location.
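
To check that such a redirect is in place, you can inspect the HTTP response headers of the old address, for example with curl (here “crsid” is merely a placeholder for the identifier of an archived user); the reply should report a 301 status and a Location header pointing at the archival copy:

$ curl -sI https://www.cl.cam.ac.uk/~crsid/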

We collect the archived personal web pages preserved here using a web crawler (see /anfs/www/html/archive/Makefile for the technical details of this process). This way, we publish here only files that the deceased had already linked publicly, and had therefore clearly indicated that they wanted published. We do this, rather than copying over all files in their ~/public_html/ folder, because the latter typically also contains much other material that was clearly never intended for publication, or that is not even owned by the deceased, such as other people’s draft documents placed there only temporarily for personal collaboration.

We may also add additional public files that were linked from elsewhere, and therefore may have been missed by our crawler.

We may modify the pages that we copy here to add a note about their status and the death of their author. We may also occasionally fix minor and obvious technical problems, or add/update references to the author’s work that became available posthumously.

Staff members interested in a preview of which part of their web site we might archive here after their death can simulate our default file-harvesting process using this Linux command:

$ wget -r -nv -np -nH -l inf -P test-archive \
       --trust-server-names --regex-type pcre \
       --reject-regex '\?C=[NMSD];O=[AD]$' \
       --cut-dirs=1 http://www.cl.cam.ac.uk/~$USER/

The above command is of course only a starting point, and we often tailor the crawling process to the peculiarities of the individual web site.
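
For example, one possible form of tailoring (the excluded directory and the extra seed URL below are purely illustrative) is to skip a folder that should not be harvested and to add an extra starting URL for a public file that is linked only from elsewhere:

$ wget -r -nv -np -nH -l inf -P test-archive \
       --trust-server-names --regex-type pcre \
       --reject-regex '\?C=[NMSD];O=[AD]$' \
       --exclude-directories="/~$USER/old" \
       --cut-dirs=1 http://www.cl.cam.ac.uk/~$USER/ \
       http://www.cl.cam.ac.uk/~$USER/papers/report.pdf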

You can help make future archiving of your web space smoother by following basic site-maintenance practices:

  • Syntax checking: Use an HTML validator to find and fix syntax errors in your pages, especially errors near “href=” and “src=” attributes, which can affect whether different web crawlers find your links. Suggested tool: tidy (/bin/tidy on ely); see the example invocations after this list.
  • Link checking: Use a link checker regularly, to ensure that at least all internal hyperlinks (i.e., URLs pointing to other locations within your web space) remain valid and point to an accessible file. Suggested tool: LinkChecker; see the example invocations after this list.
  • URL portability: Prefer relative URLs (like “file.txt”, “../”, “.” or “dir/file.txt”) for links within your own web pages, and avoid redundant absolute URL prefixes there, such as “https://www.cl.cam.ac.uk/~crsid/”. Relative paths make it easier to move your pages to a new location without causing redirects. A simple shell command to list your absolute links is
    $ find ~/public_html -name '*.html' -print0 | \
       xargs -r0 grep --color -P '"((https?:)?//www\.cl\.cam\.ac\.uk)?/(~|users/)'$USER
    
  • Copyright: Avoid hosting files for which you are not a copyright holder, or at least make sure you identify their creator/owner and clarify how you obtained their permission to host a copy of their file in your space.
  • Clear structure: Keep any temporary, backup, private, family-related, third-party-owned, etc. files in separate folders whose names indicate their status (e.g. tmp/, private/, family/, volatile/, old/, delete-me/).
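
The following commands illustrate the first two points above; the file name and URL arguments are only examples, and the exact options accepted may differ between the installed versions of tidy and LinkChecker:

$ tidy -q -e ~/public_html/index.html           # report only HTML errors and warnings
$ find ~/public_html -name '*.html' -print0 | \
      xargs -r0 -n1 tidy -q -e                  # validate every HTML file in turn
$ linkchecker https://www.cl.cam.ac.uk/~$USER/  # check the links reachable from your start page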

If you have any questions or suggestions about this archive and the process, contact Markus Kuhn.