ANNEX A: ABOUT DR REID

Dr Alex Reid, the external proposer of the topic, is a Cambridge alumnus and resident who, in retirement, operates a small business, Extonet Ltd, which publishes websites. His career has spanned architecture, telecommunications, and computing.

After obtaining a degree in Architecture at Trinity College, Cambridge, he obtained a PhD at University College London. His research involved experiments to assess the relative effectiveness of telecommunications (including video conferencing) and face-to-face communication. This grew into a research group, known as the Communications Studies Group, with funding from UK and US government departments.

The main part of Dr Reid's career was at British Telecom. His posts included Head of Long Range Studies, Director of Prestel, and Chief Executive of Value Added Services. He left British Telecom to set up a venture capital company investing in start-up companies in the information technology sector. During this time he was a non-executive Director, and then Chief Executive, of Acorn Computer Group, Cambridge. He returned to architecture as Director General of the Royal Institute of British Architects, from which post he retired in 2000.

Extonet Ltd currently publishes three main websites, each of which is funded by Google Adsense advertisements. Abacus Construction Index USA (at www.construction-index.com) is a directory of suppliers of building products to the US market. Archinet UK (at www.archinet.co.uk) includes a similar directory for the UK market, together with other architectural content. Beesker (at www.beesker.com) is a directory of the world's best website on each of several hundred topics.

ANNEX B: PROJECT 1

Project 1, undertaken in February 2012, was an attempt to simulate Google search ranking for five search phrases related to the construction industry: Asphalt Shingles, Bifold Doors, Closet Doors, Drop Ceilings, and Shower Stalls. The construction sector was chosen because it is the main sector in which Extonet Ltd publishes, but the relevance of the findings is not confined to that particular sector.

Identifying our sample of web pages

We searched Google.com for each of these five phrases, without the double quotes which would restrict the search to the exact phrase. Although based in the UK, we undertook these searches as if from the USA. This was done by opening, for example in the case of closet doors, the page: google.com/search?q=closet+doors&gl=us. For each of the five search terms we noted the web pages shown as numbers 1 to 10 in the Google search returns (ie those that did best) and the web pages shown as numbers 101 to 110 (ie those that performed less well).

Measuring the characteristics of the web pages

We then measured, for each of these 100 web pages, the following nine characteristics, using tools such as the Google Toolbar and Alexa:
a. Google Page Rank of the page.
b. Google Page Rank of the front page of the website.
c. Number of pages in the website.
d. Number of occurrences of the search phrase on the page.
e. Density of the search phrase on the page (ie occurrences per 100 words).
f. Number of words on the page.
g. Average number of pages viewed by each visitor to the website.
h. Bounce rate (ie percentage of visitors to the website who view only one page).
i. Average time spent on the website by each user.
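As an illustration of this sampling step, the short Python sketch below shows how the US-localised search address described above can be formed, and the shape of a record holding the nine measured characteristics. The helper name, field names, and use of Python are our own illustrative assumptions; the project itself gathered these figures by hand using the tools mentioned above. The same field names are used again in the scoring sketch that follows the results table below.

    # Illustrative sketch only: the project collected these data by hand.
    # The helper and field names below are our own, not Extonet's.

    SEARCH_PHRASES = ["asphalt shingles", "bifold doors", "closet doors",
                      "drop ceilings", "shower stalls"]

    def us_search_url(phrase):
        # Search Google.com as if from the USA by appending the gl=us parameter.
        return "https://www.google.com/search?q=" + phrase.replace(" ", "+") + "&gl=us"

    # For each phrase, the pages at positions 1 to 10 and 101 to 110 were noted,
    # together with the nine characteristics (a to i above). One record might be:
    example_record = {
        "search_phrase": "closet doors",
        "google_position": 1,        # 1-10 = high performers, 101-110 = low performers
        "page_rank": None,           # a. Google Page Rank of the page
        "front_page_rank": None,     # b. Page Rank of the front page of the website
        "pages_in_site": None,       # c. number of pages in the website
        "occurrences": None,         # d. occurrences of the search phrase on the page
        "density": None,             # e. occurrences per 100 words
        "word_count": None,          # f. number of words on the page
        "pages_per_visit": None,     # g. average pages viewed per visitor
        "bounce_rate": None,         # h. percentage of visitors viewing only one page
        "time_on_site": None,        # i. average time spent on the website
    }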
Developing an algorithm

We then developed, by trial and error, a simple algorithm which, when fed with these data, would predict as accurately as possible whether a page would fall into the group of high performers (ranking 1 to 10 in the Google search returns) or into the group of low performers (ranking 101 to 110 in the Google search returns). We found that the last three measures (g to i above) did not appear to correlate with performance, and accordingly built into our algorithm only the first six measures (a to f above). We achieved our best fit with the following algorithm:

    Merit score = (a squared) + (b times 2) - (number of digits in c)/2 - e - (Occurrence Penalty) - (Words Penalty)

The Occurrence Penalty that produced the best fit, by trial and error, was:

    Number of occurrences    Penalty
    0                        4
    1                        4
    2                        3
    3                        2
    4                        1
    5                        1
    6                        1
    More than 20             1
    More than 30             2

The Words Penalty that produced the best fit, by trial and error, was:

    Number of words on page     Penalty
    Less than 600               2
    Between 600 and 1000        0
    More than 1000              2

Testing our algorithm

We then tested our algorithm by feeding the data for each page into it, producing a merit score for each page. This enabled us to sort the pages, for each search term, into the ten that performed best in terms of our algorithm (ie those in the top half of our ranking) and the ten that performed less well (ie those in the bottom half of our ranking). We then compared our simulation with the actual Google search results, to see in what percentage of cases our algorithm had correctly predicted the pages that would appear in the top ten Google search returns. The results of this comparison were:

    Search phrase       Percentage of pages sorted correctly
    Asphalt shingles    90%
    Bifold doors        100%
    Closet doors        90%
    Drop ceilings       100%
    Shower stalls       80%
    Average             92%
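As a concrete restatement of the scoring and testing steps above, the Python sketch below implements the merit score and the top-half comparison. It is our own reconstruction, not the project's code: the function names are ours, the record fields follow the earlier sketch, and the penalty for between 7 and 20 occurrences, which the table does not list, is assumed to take the minimum value of 1.

    # A minimal sketch of the merit-score algorithm and the top-half comparison
    # described above. Our own restatement; the penalty for 7 to 20 occurrences
    # (not listed in the table) is assumed to be the minimum value of 1.

    def occurrence_penalty(occurrences):
        # Penalty for the number of occurrences of the search phrase on the page.
        table = {0: 4, 1: 4, 2: 3, 3: 2, 4: 1, 5: 1, 6: 1}
        if occurrences in table:
            return table[occurrences]
        if occurrences > 30:   # "More than 30" row of the table
            return 2
        return 1               # 7 to 30 occurrences: assumed minimum penalty of 1

    def words_penalty(word_count):
        # Penalty based on the number of words on the page.
        if 600 <= word_count <= 1000:
            return 0
        return 2               # fewer than 600 or more than 1000 words

    def merit_score(page):
        # (a squared) + (b times 2) - (number of digits in c)/2 - e - penalties.
        a = page["page_rank"]          # a. Google Page Rank of the page
        b = page["front_page_rank"]    # b. Page Rank of the front page of the website
        c = page["pages_in_site"]      # c. number of pages in the website
        d = page["occurrences"]        # d. occurrences of the search phrase
        e = page["density"]            # e. occurrences per 100 words
        f = page["word_count"]         # f. number of words on the page
        return (a ** 2 + 2 * b
                - len(str(c)) / 2      # number of digits in c, divided by 2
                - e
                - occurrence_penalty(d)
                - words_penalty(f))

    def percent_sorted_correctly(pages):
        # 'pages' holds the 20 records for one search term, each flagged with
        # the position Google gave it (1-10 for the high performers, 101-110
        # for the low performers).
        ranked = sorted(pages, key=merit_score, reverse=True)
        hits = sum(1 for p in ranked[:10] if p["google_position"] <= 10)
        return 100 * hits / 10

Under these assumptions, a hypothetical page with Page Rank 4, a front page of Page Rank 5, 1,200 pages in the website, 8 occurrences, a density of 1.0 and 750 words would score 16 + 10 - 2 - 1.0 - 1 - 0 = 22; applying percent_sorted_correctly to the 20 records for a search term yields percentage figures of the kind shown in the table above.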
Conclusions on the predictive power of our algorithm

We were surprised at the predictive power of our algorithm. Overall it predicted correctly in 92% of cases which pages would appear in the top ten Google search returns. Our algorithm was developed using only simple trial-and-error methods, measuring only six characteristics of each page. This suggests that, using more sophisticated statistical methods and measuring more than six characteristics, it should be possible to develop algorithms which explain in considerable detail the characteristics that Google regards as denoting a good quality page.

Conclusions for webmasters

Our algorithm suggests the following guidance for webmasters wishing to produce web pages which Google will regard as of high quality:
- The overwhelmingly most important single characteristic is the Google Page Rank of the page, with a high Page Rank being good. We needed to square this measure in order to achieve the best fit. This conflicts with the widespread suggestion in discussion forums that 'page rank has ceased to be important'. It is generally accepted that a high Google Page Rank is achieved by having inward links to the website from many other websites which themselves have high Google Page Rank.
- The Google Page Rank of the front page of the website is also important, with a high Page Rank being good.
- The number of pages in the website is an important factor. It is a negative factor, with a relatively small number of pages being good and a very large number of pages being bad. There are various possible explanations for this unexpected result. It may be that a very large website is regarded by Google as less focused on the search term. It may be that Google regards a very large website as having achieved many inward links through its sheer size rather than its quality.
- The optimal number of occurrences of the search term on the page is between 4 and 30. Occurrences of 3, 2, 1 and 0 are progressively more disadvantageous. Occurrences of more than 30 are also disadvantageous.
- Density of occurrence of the search term (ie occurrences per 100 words) is a negative factor. This is presumably because some webmasters attempt to fool Google by inserting the search term with artificial and excessive density.
- The optimal number of words on the page is between 600 and 1000.

Serious penalisation

This project was confined to the analysis of web pages which appear in the top 110 Google search returns. It therefore did not explore any serious penalisations which Google may apply and which would result in web pages ranking lower than 110. It may well be that there are negative characteristics of web pages which Google regards as so serious that they result in those pages being demoted hundreds of pages down the search rankings. These might, for example, include copied text, incoherent text, excessive advertisements, and excessive links from the page. Project 1, because it confined itself to the analysis of relatively well performing web pages, would not reveal any such serious penalisation.