The World Wide Web is very good for browsing. However, by its very web-like nature, it is hard to index, and therefore hard to search. It is true that many Web server sites offer searchable data, but such searching is purely local. Often a CGI backend is used (as mentioned earlier), such as Waisgate, or even some completely local script or database query system.
What is needed is an inter-site indexing and searching mechanism to complement these site- and subject-specific intra-site schemes. The problem is that the Web grows successfully precisely because it permits heterogeneity: each site can choose how to structure its information (and links).
One solution is to provide clients that allow the user to specify searches, and then launch those searches across the network. The problem here is that each search will cause a great many accesses (possibly unnecessarily), load up caches all around the WWW, and take a long time. This is, in almost every respect, anti-social.
The solution is to provide intelligent tools that automatically build indexes of the web continuously and incrementally as it grows. These are known as spiders, or robots, or wanderers. There are a variety of them, varying in complexity, but the key idea is simple:
Starting from a known site or list of sites, simply follow all the links, building a map of all URLs (and titles, and possibly even fetching pages and creating content indexes through database techniques such as Key Word In Context). Subsequently, this index can be stored, and possibly even replicated across other sites (or one can simply rely on caching at other servers when remote clients repeatedly access an index).
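The link-following step above can be sketched as a simple breadth-first walk. The sketch below is in Python and uses an invented in-memory site map in place of real HTTP fetches and HTML parsing; all the URLs and titles are illustrative, not taken from any real site.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to (title, outgoing links).
# A real spider would fetch each page over HTTP and parse its <title>
# and <a href=...> tags instead.
PAGES = {
    "http://a.example/": ("Site A", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Site B", ["http://a.example/"]),
    "http://c.example/": ("Site C", []),
}

def spider(start_urls):
    """Breadth-first walk from the start list, building a URL -> title index."""
    index = {}
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in index or url not in PAGES:
            continue              # already visited, or not reachable
        title, links = PAGES[url]
        index[url] = title        # record the page in the index
        queue.extend(links)       # follow every outgoing link
    return index

print(spider(["http://a.example/"]))
```

The `index` dictionary doubles as the visited-set, which is what keeps the walk from looping forever on a cyclic link structure such as the one above.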
To prevent network and server overload, there are a number of rules (of etiquette) that such spiders are subject to. The main requirement is that they keep the frequency with which they visit any one site low enough not to exceed the load a typical user would impose, but high enough to keep the index reasonably up to date.
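One way to enforce such a visit-frequency rule is to record the last visit time per host and refuse to fetch again until a minimum interval has passed. A minimal sketch, assuming an illustrative 60-second interval (the names and the threshold are invented for the example):

```python
import time

MIN_INTERVAL = 60.0  # seconds between visits to any one host (illustrative)

last_visit = {}      # host -> time of our last request to it

def may_fetch(host, now=None):
    """Record and allow a visit only if the host was not hit too recently."""
    now = time.monotonic() if now is None else now
    if now - last_visit.get(host, float("-inf")) < MIN_INTERVAL:
        return False             # too soon: skip, to avoid overloading the host
    last_visit[host] = now
    return True
```

URLs that fail the check would be re-queued rather than dropped, so the spider makes progress on other hosts while a busy host "cools down".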
Future enhancements to Spiders may include ways to partition the problem across the network, so that load can be localised, and ways to pre-load caches to speed up the index accesses.
There are many different tools, and more are being added monthly, but two of the most popular are:
To exclude a robot or spider from a server, it has been proposed that each server keep a top-level file, accessible by HTTP as /robots.txt, which lists the robot type and the areas of the server that it may not access.
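In the convention that emerged from this proposal, the file pairs a User-agent line (naming the robot type, with "*" matching any robot) with Disallow lines naming forbidden areas. A sketch of checking such a file with Python's standard-library parser, using an invented robots.txt and invented URLs:

```python
from urllib.robotparser import RobotFileParser

# An illustrative /robots.txt: "*" matches any robot type; each Disallow
# line names an area of the server the robot may not access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MySpider", "http://example.org/index.html"))      # True
print(rp.can_fetch("MySpider", "http://example.org/private/x.html"))  # False
```

A well-behaved spider would fetch and consult this file once per server before following any links into it.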
Sites that maintain spiders and robots should make sure that the program logs and timestamps its actions, so that problems can be diagnosed, and so that development of these useful facilities is not endangered by their being accused (unjustifiably) of causing overload.