The World Wide Web

Retrieving Value from the Web Semantically


Considering the Web as a World 3 library of human knowledge, one methodology for retrieving information from it semantically is to produce subject directories or classified catalogs in World 3 that users can query using browsers. These directory services are human-generated Web equivalents of library card catalogs, and list Web pages in hierarchical subject categories, generally according to some “taxonomic” scheme179. The directory subject pages also generally include comments written by the editors describing each of the links provided. Directory users can then navigate the directory hierarchy to find subject pages likely to list relevant URLs, and then read the editor’s summary description associated with each link to determine its likely relevance180.
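The hierarchical structure described above can be sketched as a simple tree of nested subject categories, each holding editor-annotated links. The categories, URL, and summary below are purely illustrative, not taken from any actual directory.

```python
# A minimal sketch of a hierarchical subject directory: categories nest
# inside categories, and each subject page lists URLs together with
# editor-written summaries. All names and entries here are illustrative.

directory = {
    "Computers": {
        "Markup Languages": {
            "_links": [
                ("http://example.org/xml-intro",
                 "An editor's overview of XML basics."),
            ],
        },
    },
}

def lookup(tree, path):
    """Walk a category path (e.g. ['Computers', 'Markup Languages'])
    and return the annotated links listed on that subject page."""
    node = tree
    for category in path:
        node = node[category]
    return node.get("_links", [])

for url, summary in lookup(directory, ["Computers", "Markup Languages"]):
    print(url, "-", summary)
```

The user's navigation through subject pages corresponds to supplying successively longer paths to `lookup`, reading the editors' summaries at each level.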

The leading directories, all claiming more than one million links in November 2000181, are Netscape's Open Directory Project182 – 3,800,000 (as at 6/1/2003); LookSmart183 – more than 3,000,000 (as at 3/6/2002)184; and Yahoo185 – 1,700,000186. These all attempt to catalog the full range of Web content. Many other directories focus on single issues (e.g., Robin Cover's XML Page187). Most directories claim that their links have a high chance of being relevant to the human user because they have been hand-picked by the human catalogers. However, given that catalogers reject sites they see as unimportant, try to list each site only once (i.e., under only one of several possible subjects), and cannot foresee users’ specific needs for information, it may be difficult for users to locate specific knowledge via a directory188.


All of the indexing search engines have three main parts: a crawler (or "spider") that automatically retrieves Web pages, an indexer that builds a searchable index of the words those pages contain, and a query processor that matches users' search terms against the index.

The primary limitation all search engines face with HTML tags is that HTML provides only the most limited semantic information about the particular search terms that occur within them. Many of the engines attempt to compensate for this by offering some form of Boolean and "proximity" searching. Proximity searches may be limited to searching for a phrase (as indicated by quote marks) or may include some kind of "near" function. Most engines also attempt to increase the epistemic quality of the hits they return by applying some kind of ranking algorithm to what may often be tens or even hundreds of thousands of hits on the user's search term(s). The engines then show only the most relevant hits as determined by the algorithm (e.g., by the occurrence of search terms in metadata tags or within 100 words of the top of the page, the number of times the word is used on the page, etc.). For example, combined with other information, Google uses the number of links into the indexed page from other indexed pages to help rank relevance191. Note that these processes are carried out by World 3 logical algorithms - without the intervention of a knowing subject.
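The link-counting idea can be illustrated with a toy iteration in the style of PageRank: pages with more (and better-connected) inbound links accumulate higher scores. This is a simplified sketch, not Google's actual algorithm, and the four-page link graph is invented for the example.

```python
# Toy link-based relevance ranking in the style of PageRank. Each page
# repeatedly shares its score among the pages it links to; pages with
# more inbound links therefore accumulate higher scores. The link graph
# below is invented for illustration.

links = {            # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def rank(links, damping=0.85, iterations=50):
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = score[page] / len(outgoing)
            for target in outgoing:
                new[target] += damping * share
        score = new
    return score

scores = rank(links)
# Page "c", with three inbound links, outranks the others.
print(sorted(scores, key=scores.get, reverse=True))
```

Real engines combine such link scores with the other signals mentioned above (term placement, term frequency, metadata) before ordering the hits shown to the user.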

Which engine will provide the best results for a particular user request depends on how much information the engine indexes, what information the user is seeking, and how the engine ranks the relevance of results relating to the user's search terms. Because search engines are so central to extracting information from the Web, many reviews have been published192. The technology leaders, in terms of the number of documents indexed,193 are Google (Brin & Page 1998), which in December 2002 claimed to index more than 3 billion HTML pages (4 billion documents)194; Wisenut, claiming to index more than 1.5 billion pages in January 2002195; Fast's AllTheWeb, claiming to index more than 2 billion pages197; Lycos197 (Open Directory and Fast's indexing with additional features); NorthernLight198, with more than 390,000 pages in January 2003199; and AltaVista200. Multi- or meta-search engines multiplex searches across several proprietary search engines by individually querying the proprietary engines and then combining results from all the queries before generating the response for the end user201.
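The multiplexing just described can be sketched as merging the ranked result lists returned by several engines. The merge rule below (summed reciprocal rank, with deduplication) is one simple possibility, not the method of any particular meta-search engine; the engine results are stand-ins rather than real data, and real systems would also normalize scores and query the engines over the network.

```python
# Sketch of the core of a meta-search engine: the same query is sent to
# several engines, and their ranked result lists are merged. URLs that
# rank highly in several lists rise to the top of the combined response.

def merge_results(*ranked_lists):
    """Combine ranked URL lists by summed reciprocal rank, deduplicating."""
    scores = {}
    for results in ranked_lists:
        for position, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative result lists from two hypothetical engines:
engine_a = ["http://x.example/1", "http://x.example/2", "http://x.example/3"]
engine_b = ["http://x.example/2", "http://x.example/4"]

# "/2" appears in both lists, so it rises to the top of the merged list.
print(merge_results(engine_a, engine_b))
```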

Using Portals

Given that people searching for information in the Web will usually want to use one of the search or directory tools, a number of organizations are aggregating search engines, general and specialized directories, along with a range of commercial content and advertising, into what have recently been termed portals. A portal may be one of the popular retrieval sites (e.g., Excite, Netscape, Yahoo, Lycos, LookSmart, About, Google) that many people access from other areas of the Web, or it may be the home page provided by users’ internet service providers (ISPs). Some portals require paid subscriptions (e.g., ISP home pages, AOL), but the majority provide their services free to the users and seek financial return by presenting paid advertising to the users.


A comparatively new aspect of the Web I am personally still learning about is its ability to transmit live or from storage many other kinds of media besides simple text and pictures. For example, as I am writing this paragraph at home in a small country town out of Melbourne, Australia, I am currently listening live to Classic FM, broadcasting on 100 to 102 FM from London, England. This station is also transmitting streaming audio at 28 Kbps over the Web, through the world communication net, to my computer, where it is being realized via the computer's audio system while I write202 and query the Web for information on how streaming media works203.

Currently there is a variety of often incompatible formats and software applications for capturing, storing, delivering and receiving media over the Web. These have been developed by three main players: Apple, which launched its first QuickTime video system in January 1992 (the first systems were designed to author and deliver video from laser discs, and "streaming video" was only released later, with QuickTime 3.0204); RealNetworks, which began marketing its Real products in 1995; and Microsoft, which apparently introduced its Windows Media Player in 1998 or 1999 after working for a time with RealNetworks205.

Wrapping Up the Web

Within the last decade, the World Wide Web has vastly extended the memory and cognitive capabilities of individuals with access to personal computers and ISPs. This technology makes available to a single individual a respectable and rapidly growing fraction of the entirety of humanity's recorded knowledge. This overwhelming wealth of knowledge is made useful as well as accessible by virtual and autonomous cognitive capabilities that exist in the layered architecture of World 3 itself.

Aside from the physical technology provided by microelectronics and the communications networks, the crucial factor that has allowed this incomprehensibly rapid growth of capabilities and knowledge in the Web has been the establishment of cognitively significant non-proprietary standards for the expression and communication of the knowledge artefacts it stores and generates.

Demonstrating Semantic Retrieval

Because I was familiar with the topics to be covered up to this point in the project, the writing has been comparatively easy. However, the situation is different for many of the subjects relating to organizations covered in Episode 4. Given the limited time I have had available for work on this project and very real practical difficulties in physically accessing major research libraries, further progress would have been impossible without the tools the Web provides for extending cognition. Google, PubMed, and a number of Web-based "full-text" retrieval services available through libraries, such as ISI's Web of Science and Elsevier's Science Direct, have enabled me to rapidly find knowledge in World 3 relating to these interests.

Most readers will have used Google or equivalent indexing services, and many of those in academic environments will also be familiar with Web of Science. However, it seems that only a few people other than research librarians have learned to make full use of the semantic power these tools provide. To close off this episode on personal productivity tools I will describe the procedure I use to quickly build a World 3 web of knowledge relating to a subject of interest. My example will demonstrate how I built the web forming the basis for a topic I have only recently added to the present Application Holy Wars project - the first section in Episode 4 on thermodynamics and the emergence of autopoiesis.

In abstract form, the following steps are used to build a web of World 3 knowledge:

  1. Select one or more reasonably well-known older publications that introduce core concepts relating to the subject of interest. If you are not already familiar with some appropriate papers, search the Web or one of the academic literature databases using appropriate keywords. Select a small number of articles that are available in full text, or that at least provide the associated bibliography. Identify a small number of older papers that seem to define the concept(s) you are seeking knowledge about.

  2. To find recent work available free-to-the-Web, use the titles of key articles as search terms in Google; i.e., block-copy the article title into Google's search-term field and put quotation marks around the title. This will retrieve Web pages containing exact matches for the title (e.g., as found in bibliographic citations). For very short titles that may also correspond to ordinary strings of text, add the author's surname outside the quotation marks. This will return all free-to-the-Web articles and other pages referencing the key article - presumably because the "hit" includes text that has a specific semantic relationship to the referenced work. Note that the hits will often include articles from the formal literature: even though the journals in which they are published are not themselves available free to the Web, many articles are exposed to the Web via individual authors' home pages. Note: given that Google's Web links may not be current, as long as you can retrieve the URL for an article that once existed on the Web, chances are good that the article can still be retrieved via the Internet Archive's WayBack Machine.

    As described in the section on Information Science (see citation indexing), ISI's Web of Science provides a much more exact tool for semantically retrieving current academic literature on a subject. Here, the search term is usually no more than the primary author's surname and initial; for common names or prolific authors you can further qualify the search by adding the year of publication and journal/book title. ISI's Web of Science indexes cover three major disciplinary areas: the sciences (Science Citation Index), the social sciences (Social Sciences Citation Index), and the arts and humanities (Arts & Humanities Citation Index).

    Web of Science indexes are computer generated from electronic materials provided by journal publishers, and articles are indexed as soon as (or in some cases even before) they are published. Although ISI does not index all academic journals, it does attempt to index all of the more significant interdisciplinary journals and the more prestigious or key journals within each discipline. Thus, hits returned from a citation search can be expected to include a reasonable selection of the most recent and better-quality papers. In most cases, the articles returned will include the abstract and bibliography.

  3. Search returns (whether from Google or Web of Science) are then scanned to determine whether they accurately represent the kind of subject matter you are seeking. If so, scan the bibliographies of the returned articles for additional references relating to the desired subject matter. In some cases bibliography references will link directly to full texts of the cited articles (and in the case of Web of Science, links will also point to the records for cited articles that have themselves been indexed). Note: bibliographic references will probably include works not indexed directly by Google or Web of Science, including books and other materials still available only in paper formats. However, the fact that they have been referenced by papers you regard as important, together with the information in their titles, will probably help you to determine whether it is worth the effort to track down a hard copy of the referenced document.

  4. Older articles referenced by the hits collected in Step 3 can be used as additional search terms to find other, more recent articles, until diminishing returns set in, at which point you can be reasonably confident that your bibliography on the subject in question is reasonably complete. Note that non-electronic references from the returns of your first search can still be used as additional search terms in Google or the Web of Science, because you are searching for articles that reference the search-term article, not for the article itself.

  5. Once a reasonably complete bibliography has been assembled, you can begin to assemble full texts of the desired articles. Some journals are available free to the Web, either independently or via services like the US National Institutes of Health's PubMed Central. In some cases where journals are not available in toto, individual articles may still be found free to the Web via authors' or institutional Web sites. In my experience perhaps 1-5% of the articles I seek can be found on the public Web - which may be all that is required to explain or connect to the literature relating to a key idea. Where access to subscription journal databases can be achieved by logging into a research library (as has been provided to me by Monash University), possibly more than 50% of the important papers can be retrieved in full-text format. Full-text databases I now access regularly include Elsevier's Science Direct, Proquest, Emerald, Highwire, ACS Publications, ACM Digital Library, Journals@Ovid Fulltext, JSTOR, etc., as well as several publishers' individual sites, e.g., Kluwer Online.
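The steps above amount to a "snowballing" loop: quote each key title as an exact-phrase query, collect the papers that cite it, and repeat until new titles stop appearing. The sketch below illustrates that loop; the `find_citing` function is a stand-in for a real search in Google or Web of Science, and the titles and citation data are invented for the example.

```python
import urllib.parse

def exact_phrase_query(title, author=None):
    """Build a Google-style exact-phrase query URL for an article title,
    optionally adding the author's surname for short, ambiguous titles."""
    query = '"%s"' % title
    if author:
        query += " " + author
    return "http://www.google.com/search?q=" + urllib.parse.quote(query)

def build_bibliography(seed_titles, find_citing):
    """Snowball outward from seed articles until diminishing returns:
    stop when a round of searches turns up no titles not already seen."""
    seen = set(seed_titles)
    frontier = list(seed_titles)
    while frontier:
        next_round = []
        for title in frontier:
            for citing in find_citing(title):  # stand-in for a real search
                if citing not in seen:
                    seen.add(citing)
                    next_round.append(citing)
        frontier = next_round
    return sorted(seen)

# Illustrative citation data in place of live search results:
citations = {"Seed paper": ["Citing paper A"], "Citing paper A": []}
print(build_bibliography(["Seed paper"], lambda t: citations.get(t, [])))
```

The termination condition (an empty `next_round`) is the "diminishing returns" point of Step 4; the assembled set then drives the full-text gathering of Step 5.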

Assuming that you have a broadband connection to the indexing service(s), comprehensive bibliographies can be built in a few hours. Even working from home over a 56 K modem connection via a research library, I can do vastly more with Web of Science than I was ever able to do working in a library with physical indexing services221. I illustrate the above process with some sample returns from the searches I performed to build a bibliography for my next section on thermodynamics and the emergence of autopoiesis.

From the time when I taught biology in the 1970s I have believed that 


These themes will be elaborated in the remainder of this work.

EPISODE 4 - Organizations Develop Minds of their Own