
Uniform Resource Locators: URLs

The topic of location is a thorny bush of concepts: To find a resource in the WWW we need a handle of some kind.

A name distinguishes one object in a distributed system from another.
An address tells us where it is.
A route is how to get from here to there.

So a URL tells us where a resource is located. WWW clients can use this to reach the correct web server, and the correct page on that server, simply by passing the relevant parts of the URL to the relevant processes and protocols, which then use their routing tables to reach the right place. The authors of URLs call them a Unifying Syntax for the Expression of Names and Addresses of Objects on the Network.

Links in Web pages have largely been URLs so far. Ideally, they would be "pure" names, since then we could have replication of WWW entries across multiple servers (or mobile information) without having to change the references.

In fact, a URL is slightly lower level still, and the generic name is a Uniform Resource Identifier, which sits half way between the idea of a name, and the idea of a locator. An identifier is a unique handle, but doesn't tell you "what a thing is" or "where it is".

A name allows a user, with the help of a "client" program, to retrieve or operate on objects via a "server" program. What is passed as the name depends on the protocol, for example:

    
    File Transfer Protocol (Postel 1985)
      Host name or IP address
      [TCP port]
      [user name, password]
      Filename

    W.A.I.S. (Kahle 1990)
      Host name or IP address
      [TCP port]
      local document id

    Gopher (Alberti 1991)
      Host name or IP address
      [TCP port]
      database name
      selector string

    HTTP (Berners-Lee 1991)
      Host name or IP address
      [TCP port]
      local object id

    NNTP (Kantor 1986) group
      Group name

    NNTP article
      Host name
      unique message identifier

    Prospero links (Neuman 1992)
      Host name or IP address
      [UDP port]
      Host specific object name
      [version]
      [identifier]*

    x.500 distinguished name
      Country
      Organisation
      Organisational unit
      Person
      Local object identifier

HTTP
The HTTP protocol specifies that the path is handled transparently by those who handle URLs, except for the servers which dereference them. The path is passed by the client to the server with any request, but is not otherwise understood by the client. The fragmentid part is not sent with the request. The search part, if present, is sent. Spaces in URLs should be escaped for transmission in HTTP.
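
The HTTP rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name http_request_target and the example.org URL are hypothetical, and the set of characters left unescaped is a simplification. It shows the path and search part being sent, the fragment being dropped, and spaces being escaped.

```python
from urllib.parse import urlsplit, quote

def http_request_target(url):
    """Return the part of an HTTP URL that a client would send to the server."""
    parts = urlsplit(url)
    # The client treats the path as opaque: it only escapes spaces and
    # other unsafe characters, leaving existing %XX escapes alone.
    path = quote(parts.path or "/", safe="/%")
    # The search part is sent with the request; the fragment id is not.
    if parts.query:
        return path + "?" + quote(parts.query, safe="=&+%")
    return path

# Hypothetical URL, used only for illustration.
print(http_request_target(
    "http://www.example.org/some dir/page.html?q=url#section2"))
# prints /some%20dir/page.html?q=url
```
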
FTP
The ftp: prefix indicates a file which is to be picked up from the file system of the given host. The FTP protocol is used. The port number, if given, is the port of the FTP server when this is not the FTP default. (A client may in practice retrieve objects through more efficient means such as local file open or NFS mounting, where these are available and equivalent.)
The syntax allows for the inclusion of a user name and even a password for those systems which do not use the anonymous FTP convention. The default, however, if no user or password is supplied, will be to use that convention, viz. that the user name is "anonymous" and the password the user's mail address.
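
As a rough sketch of the login convention just described, the following Python function pulls the login details out of an ftp: URL and falls back to the anonymous convention. The function name ftp_login, the host names and the mail address are all hypothetical; port 21 is the standard FTP control port.

```python
from urllib.parse import urlsplit, unquote

def ftp_login(url, mail_address="user@example.org"):
    """Return (host, port, user, password, path) for an ftp: URL."""
    parts = urlsplit(url)
    port = parts.port or 21  # fall back to the FTP default port
    if parts.username is None:
        # No user given: use the anonymous FTP convention.
        user, password = "anonymous", mail_address
    else:
        user = unquote(parts.username)
        password = unquote(parts.password or "")
    return parts.hostname, port, user, password, unquote(parts.path)

# Hypothetical hosts and credentials, used only for illustration.
print(ftp_login("ftp://ftp.example.org/pub/README"))
print(ftp_login("ftp://jon:secret@ftp.example.org:2121/docs/notes.txt"))
```
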
The adoption of a unix-style syntax involves the conversion into non-unix local forms by either the client or server. Some non-unix servers do this, but clients wishing to access sites which do not have unix-style naming will need certain algorithms to enable other file systems to be identified and treated. Client software may also have to be flexible in terms of the sequence of FTP commands used with different varieties of server. In view of a tendency for file systems to look increasingly similar, it was felt that the URL convention should not be weighed down by extra mechanisms for identifying these cases.
The data format of a file can only, in the general FTP case, be deduced from the name, normally the suffix of the name. This is not standardised. An alternative is for it to be transferred in information outside the URL. The transfer mode (binary or text) must in turn be deduced from the data format. It is recommended that conventions for suffixes of public archives be established, but it is outside the scope of this appendix.
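
A client might deduce the transfer mode from the suffix along the following lines. Since no suffix convention is standardised, the table here is purely hypothetical, as is the choice to fall back to binary when the suffix is unknown.

```python
import posixpath

# Hypothetical suffix table: no such convention is actually standardised.
TEXT_SUFFIXES = {".txt", ".html", ".ps", ".tex"}

def transfer_mode(url_path):
    """Guess the FTP transfer mode ("text" or "binary") from a URL path."""
    suffix = posixpath.splitext(url_path)[1].lower()
    # Assumption: an unknown or missing suffix is fetched in binary mode,
    # since a binary transfer does not alter the bytes of the file.
    return "text" if suffix in TEXT_SUFFIXES else "binary"

print(transfer_mode("/pub/docs/README.TXT"))
print(transfer_mode("/pub/src/www.tar.gz"))
```
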
NEWS
The news locators refer to either news group names or article message identifiers which must conform to the rules of RFC 850. A message identifier may be distinguished from a news group name by the presence of the commercial at "@" character. These rules imply that within an article, a reference to a news group or to another article will be a valid URL (in the partial form).
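
The "@" test is simple enough to sketch directly. In this illustration the function name classify_news_locator and the sample group and message identifier are hypothetical; the optional "news:" prefix is handled so that the partial form is accepted too.

```python
def classify_news_locator(locator):
    """Return ("article" or "group", identifier) for a news locator."""
    prefix = "news:"
    # Accept both the full form and the partial form without the prefix.
    if locator.lower().startswith(prefix):
        body = locator[len(prefix):]
    else:
        body = locator
    # A message identifier contains "@"; a news group name does not.
    kind = "article" if "@" in body else "group"
    return kind, body

# Hypothetical group name and message identifier.
print(classify_news_locator("news:comp.infosystems.www"))
print(classify_news_locator("1234@news.example.org"))
```
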
Note 1: Among URLs the news: URLs are anomalous in that they are location-independent. They are unsuitable as URN candidates because the NNTP architecture relies on the expiry of articles, and therefore on only a small number of articles being available at any time. When a news: URL is quoted, the assumption is that the reader will fetch the article or group from his or her local news host. News host names are NOT part of news URLs.
Note 2: An outstanding problem is that the message identifier is insufficient to allow the retrieval of an expired article, as no algorithm exists for deriving an archive site and file name. The addition of the date and news group set to the article's URL would allow this if a directory existed of archive sites by news group. Suggested subject of study in conjunction with the NNTP WG. A further possible extension may be to allow the naming of subject threads as addressable objects.
WAIS
The current public domain WAIS implementation requires that a client know the "type" and length of an object prior to retrieval. These values are returned along with the internal object identifier in the search response. They have been encoded into the path part of the URL in order to make the URL sufficient for the retrieval of the object. If changes to WAIS specifications make the internal id something which is sufficient for later retrieval then this will not be necessary. Within the WAIS world, names do not of course need to be prefixed by "wais:" (by the partial form rules).
The length, not now being strictly necessary, is kept for historical reasons.
PROSPERO
The Prospero (Neuman, 1991) directory service is used to resolve the URL yielding an access method for the object (which can then itself be represented as a URL if translated). The host part contains a host name or internet address. The port part is optional. The path part contains a host specific object name, an optional version number, and an optional list of attributes. If these latter fields are present they are separated from the host specific object name and from each other by the characters "%00" (percent, zero, zero), this being an escaped string terminator (null). If the optional list of attributes is provided, the version number must be present, but may be the empty string (i.e. the first attribute would be separated from the host specific name by "%00%00"). External Prospero links are represented directly as URLs of the underlying access method and are not represented as Prospero URLs.
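
The "%00" separator rule can be illustrated with a short Python sketch. The function name split_prospero_path, the sample object names and the attribute AUTHOR are hypothetical, and the path is assumed to be passed in without its leading slash. Splitting happens before unescaping, so that the escaped null is seen as a separator and not decoded into the data.

```python
from urllib.parse import unquote

def split_prospero_path(path):
    """Split a Prospero URL path into (object name, version, attributes)."""
    # Split on the escaped null first, then unescape each field.
    fields = path.split("%00")
    name = unquote(fields[0])
    # The version is optional; it may also be present but empty.
    version = unquote(fields[1]) if len(fields) > 1 else None
    attributes = [unquote(field) for field in fields[2:]]
    return name, version, attributes

# Hypothetical object names and attribute, used only for illustration.
print(split_prospero_path("pfs/docs/report"))
print(split_prospero_path("pfs/docs/report%003"))
print(split_prospero_path("pfs/docs/report%00%00AUTHOR"))
```
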
GOPHER
The first character of the URL path part (after the initial single slash) is a single-character "type" field which is that used by the Gopher protocol. The rest of the path is the "selector string", with disallowed characters encoded. Note that some selector strings begin with a copy of the gopher type character, in which case that character will occur twice consecutively in the URL. If the type character and selector are omitted, the type defaults to "1". Gopher links which refer to non-Gopher protocols are represented directly as URLs of the underlying access method and are not represented as Gopher URLs.
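
A minimal sketch of this rule in Python follows. The function name gopher_type_and_selector and the host gopher.example.org are hypothetical. The second example shows a selector that itself begins with the type character, so that the character appears twice in the URL.

```python
from urllib.parse import urlsplit, unquote

def gopher_type_and_selector(url):
    """Return (type character, selector string) for a gopher: URL."""
    path = urlsplit(url).path
    # Drop the initial single slash of the path part.
    rest = path[1:] if path.startswith("/") else path
    if not rest:
        # Type and selector omitted: the type defaults to "1".
        return "1", ""
    # First character is the Gopher type; the rest is the encoded selector.
    return rest[0], unquote(rest[1:])

# Hypothetical host and selectors, used only for illustration.
print(gopher_type_and_selector("gopher://gopher.example.org/"))
print(gopher_type_and_selector("gopher://gopher.example.org/00/docs/read%20me"))
```
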
TELNET, RLOGIN, TN3270
The use of URLs to represent interactive sessions is a convenient extension to their uses for objects. This allows access to information systems which only provide an interactive service, and no information server. As information within the service cannot be addressed individually or, in general, automatically retrieved, this is a less desirable, though currently common, solution.
X500
The mapping of x500 names onto URLs is not defined here. A decision is required as to whether "distinguished names" or "user friendly names" (ufn), or both, should be allowed. If any punctuation conversions are needed from the adopted x500 representation (such as the use of slashes between parts of a ufn) they must be defined. This is a subject for study.
WHOIS
This prefix describes the access using the "whois++" scheme in the process of definition. The hostname part is the same as for other IP based schemes. The path part can be either a whois handle for a whois object, or it can be a valid whois query string. This is a subject for further study.
NETWORK MANAGEMENT DATABASE
This is a subject for study.

REGISTRATION OF NAMING SCHEMES

A new naming scheme may be introduced by defining a mapping onto a conforming URL syntax, using a new scheme identifier. Experimental scheme identifiers may be used by mutual agreement between parties, and must start with the characters "x-". The scheme name "urn:" is reserved for the work in progress on a scheme for more persistent names. Therefore URNs (Names) and URLs (Locators) will be distinguishable. An object which is either a URL or a URN is known as a URI (Identifier).
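
Because the scheme identifier comes first, a client can tell these cases apart by inspecting the prefix alone. The sketch below is an illustration only: the function name scheme_kind and the sample identifiers are hypothetical, and treating every other scheme as a locator is a simplifying assumption.

```python
def scheme_kind(uri):
    """Classify a URI by its scheme identifier."""
    scheme, colon, _ = uri.partition(":")
    if not colon:
        raise ValueError("no scheme identifier in %r" % uri)
    scheme = scheme.lower()
    if scheme == "urn":
        return "URN"           # reserved for more persistent names
    if scheme.startswith("x-"):
        return "experimental"  # used by mutual agreement between parties
    return "URL"               # assumption: all other schemes are locators

# Hypothetical identifiers, used only for illustration.
for uri in ("http://www.example.org/", "urn:example", "x-local:thing"):
    print(uri, "->", scheme_kind(uri))
```
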

It is proposed that the Internet Assigned Numbers Authority (IANA) perform the function of registration of new schemes. Any submission of a new URI scheme must include a definition of an algorithm for the retrieval of any object within that scheme. The algorithm must take the URI and produce either a set of URL(s) which will lead to the desired object, or the object itself, in a well-defined or determinable format.

It is recommended that those proposing a new scheme demonstrate its utility and operability by the provision of a gateway which will provide images of objects in the new scheme for clients using an existing protocol. If the new scheme is not a locator scheme, then the properties of names in the new space should be clearly defined. It is likewise recommended that, where a protocol allows for retrieval by URI, the client software have provision for being configured to use specific gateway locators for indirect access through new naming schemes.



Jon Crowcroft
Wed May 10 11:46:29 BST 1995