
Replicas, Caches and Consistency

Another problem in the World Wide Web is, again, a result of its good design and a cause of its success: each piece of information is held at the server run by the information provider, and is not usually to be found anywhere else. All accesses, from any client anywhere, therefore arrive at the same server, so popular servers become hot spots in the network: they cause a great deal of network traffic, and are potentially overloaded, in terms of processing power, by requests. Since the information users and providers in the Internet are quite often totally disjoint from the network providers, there is no particular match between the resources each provides, and popular servers may arise much faster than even the most assiduous network provider can re-dimension its communications links.

There are two main solutions for information providers, and a third, more political, one involving the network providers:

  1. Provide Caches

    This is already being done, as described in chapter 5. Smart servers can act as clients to other servers. Instead of client programs accessing each and every server around the world, clients always (or usually) access their nearest managed WWW server, which follows links on their behalf and keeps copies of the data retrieved for future access by the same or other clients. Such caches must be capable of being invalidated (or out-dated) in some way, if the originator of the information decides to withdraw it. This requires servers to keep track of accesses by other servers that are known to cache things, or else some "honesty" mechanism on the part of caching servers, to check whether their cached copy needs updating. (Remember that it takes far less time to check the date on an original piece of data than to retrieve it all again, so this could indeed be done on every access; however, it needs protocol support, as sketched after this list.) Measurements in 1994 suggest that a single level of caching applied consistently throughout the Internet could reduce traffic by as much as 70%, and appreciably improve response times, both for WWW access and for everything else.

  2. Replicas

    Rather than providing "on demand" caches, this approach entails manually tagging data as likely to be popular, and replicating it across multiple servers when it is installed. Provided a sensible naming system is in place (URNs rather than URLs), clients will again go to an appropriately near server, rather than to the originator every time. This is already in use in the worldwide archive servers, and has been very effective in reducing network load for FTP traffic. It is, however, more manual than the previous approach. URNs indicate what a resource is, and would be mapped to URLs, which indicate where it is.

  3. Migration

    If the network providers are prepared to monitor network access patterns and release the statistics to WWW servers, and to allow the WWW servers access to topological information (link map and bandwidth) about the network, then it would be possible to migrate information around the network as the centre of the access patterns from clients became clear. Again, the naming system must make this feasible transparently to the users.
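
The protocol support asked for under caching (item 1) exists in HTTP as the conditional GET: the caching server presents the date of its stored copy in an If-Modified-Since header, and the origin server replies either "304 Not Modified" (a few bytes of headers) or the full, fresh document. The following is a minimal sketch in Python; the host, path and date are placeholder values.

    import http.client

    def revalidate(host, path, cached_date, cached_body):
        # Conditional GET: ask the origin server for the document only if
        # it has changed since cached_date, the HTTP date string saved
        # when the copy was fetched, e.g. "Sun, 07 May 1995 12:00:00 GMT".
        conn = http.client.HTTPConnection(host)
        conn.request("GET", path,
                     headers={"If-Modified-Since": cached_date})
        resp = conn.getresponse()
        body = resp.read()
        conn.close()
        if resp.status == 304:
            # Not Modified: no body was sent; the cached copy is still good.
            return cached_body
        # Otherwise the server sent the full, updated document.
        return body

A caching server would run such a check on a cache hit, trading one small request-response exchange for the assurance that it never serves withdrawn or stale data.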

All three of the mechanisms above require the URN/URI/URL mappings to be in place if they are to work in a completely transparent way, and require some greater degree of intelligence about managing information in the WWW. However, none requires centralised management of the Internet or the WWW, so all remain largely in the spirit of the existing, successful system.
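
To make the naming requirement concrete, here is a minimal sketch, in Python, of the resolution step all three mechanisms assume: a location-independent URN is mapped to the set of URLs holding replicas, and the nearest one is chosen. The URN, the replica table and the distance measure are all hypothetical stand-ins for whatever directory service would actually provide them.

    # Hypothetical URN -> replica-URL table; in practice this mapping
    # would come from a directory service, not a hard-wired dictionary.
    REPLICAS = {
        "urn:example:report-1995": [
            "http://archive.ac.uk/pub/report-1995.html",
            "http://archive.mit.edu/pub/report-1995.html",
        ],
    }

    def network_distance(url):
        # Stand-in for a real topological metric (hop count, delay, ...).
        # Here we simply pretend .ac.uk hosts are nearest, for illustration.
        return 0 if ".ac.uk/" in url else 1

    def resolve(urn):
        # Map what a resource is (URN) to where it is (the nearest URL).
        urls = REPLICAS.get(urn)
        if not urls:
            raise KeyError("no replicas registered for " + urn)
        return min(urls, key=network_distance)

    print(resolve("urn:example:report-1995"))

The client names what it wants; the resolution step, not the user, decides where it comes from, which is what makes caching, replication and migration transparent.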

Two interesting facets of replicas and caches to muse upon are charging for resources and security. If we cache for someone, we are saving them disk resources; if we do not provide the same assurances about secure access, we are possibly causing them loss of revenue. Both these problems are complex to solve, although, as we saw in chapter 5, the hooks are there to help provide solutions to all of the problems described here, albeit with some extra knowledge on the part of the information managers.






Jon Crowcroft
Wed May 10 11:46:29 BST 1995