
Retrieving Value from the Web Semantically

Considering the Web as a World 3 library of human knowledge, one way to retrieve information from it semantically is to build subject directories or classified catalogs in World 3 that users can query using browsers. These directory services are human-generated Web equivalents of library card catalogs: they list Web pages in hierarchical subject categories, generally according to some “taxonomic” scheme¹⁷⁹. The directory subject pages also generally include comments written by the editors describing each of the links provided. Directory users can then navigate the directory hierarchy to find subject pages likely to list relevant URLs, and read the editor’s summary description associated with each link to determine its likely relevance¹⁸⁰.

The leading directories, all claiming more than one million links in November 2000¹⁸¹, are Netscape's Open Directory Project ¹⁸² – 3,800,000 (as at 6/1/2003); LookSmart ¹⁸³ – >3,000,000 (as at 3/6/2002)¹⁸⁴ and Yahoo ¹⁸⁵ – 1,700,000¹⁸⁶. These all attempt to catalog the full range of Web content. Many other directories focus on single issues (e.g., Robin Cover's XML Page¹⁸⁷). Most directories claim that their links have a high chance of being relevant to the human user because they have been hand picked by the human catalogers. However, given that catalogers reject sites they see as unimportant, try to list each site only once (i.e., only under one of several possible subjects), and cannot foresee the users’ specific needs for information, it may be difficult for users to locate specific knowledge via a directory¹⁸⁸.

Indexing

All of the indexing search engines have three main parts.

- An automated spider, Web crawler or search bot¹⁸⁹ that visits each Web site and, working from the home page, follows hyperlinks to compile information on the contents of each linked Web page. Some crawlers list every word against the URL of the page being indexed. Other crawlers may only list terms found in titles and metadata or close to the beginning of the page. Some also record details associated with each link out of the page to other URLs¹⁹⁰.
- An index (some kind of table) that minimally associates keywords (i.e., words found in a Web page by the crawler) with the URL of the page containing the words. Depending on the search engine, the index may associate other kinds of information with the page's URL: e.g., links out of the page to other URLs, the number of times a particular word occurs in the page, the distance of the word from the top of the page, or even the number of other pages known to reference the target page's URL. All of the information in the index can then be processed to determine the best response to a user's query.
- A "query server" that receives user requests for links containing one or more search terms and issues a response page in return, including a list of URLs to pages containing the requested terms. The details of how queries are processed and URLs are ranked in the response back to the user depend on the particular service. The processing rules may even be treated as trade secrets.
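The index and query server described above can be illustrated with a minimal sketch. It assumes the crawler's output is already available as a small dictionary of page texts; the URLs, the `PAGES` data and the function names are hypothetical, and real engines use far more elaborate structures and ranking rules.

```python
from collections import defaultdict

# Hypothetical crawler output: URL -> text of the page (illustrative data only).
PAGES = {
    "http://example.org/a": "semantic retrieval of web knowledge",
    "http://example.org/b": "web crawler builds an index",
    "http://example.org/c": "library card catalogs and subject directories",
}

def build_index(pages):
    """Associate each keyword with the URLs of the pages containing it,
    recording how many times the word occurs on each page."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def query(index, terms):
    """Return URLs of pages containing every requested term,
    ordered by total occurrences of the terms (then by URL)."""
    terms = [t.lower() for t in terms]
    if not terms:
        return []
    hits = set(index.get(terms[0], {}))
    for term in terms[1:]:
        hits &= set(index.get(term, {}))

    def score(url):
        return sum(index[t][url] for t in terms)

    return sorted(hits, key=lambda u: (-score(u), u))

index = build_index(PAGES)
print(query(index, ["web", "index"]))
```

Counting occurrences is only one of the signals mentioned above; position on the page or links out of the page could be stored in the same table in a similar way.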

The primary limitation all search engines face with HTML tags is that HTML provides only the most limited semantic information about the particular search terms that occur within them. Many of the engines attempt to compensate for this by offering some form of Boolean and "proximity" searching. Proximity searches may be limited to searching for a phrase (as indicated by quote marks) or may include some kind of "near" function. Most engines also attempt to increase the epistemic quality of the hits they return by applying some kind of ranking algorithm to what may often be tens or even hundreds of thousands of hits on the user's search term(s). The engines will then show only the most relevant hits as determined by the algorithm (e.g., occurrence of search terms in metadata tags, within 100 words of the top of the page, number of times the word is used on the page, etc.). For example, combined with other information, Google uses the number of links into the indexed page from other indexed pages to help rank relevance¹⁹¹. Note that these processes are carried out by World 3 logical algorithms - without the intervention of a knowing subject.
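Two of the ideas in this paragraph, a "near" function and ranking by links into a page, can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any particular engine: the function names, the `max_gap` parameter and the sample data are hypothetical, and counting inbound links is only a crude stand-in for what a service such as Google actually computes.

```python
def near(text, first, second, max_gap=5):
    """Return True if `first` and `second` occur within `max_gap` words
    of each other anywhere in `text` (a simple proximity test)."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == first.lower()]
    pos_b = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

def rank_by_inbound_links(hits, links):
    """Order hits by the number of other indexed pages that link to them.
    `links` maps a source URL to the list of URLs it links out to."""
    inbound = {url: 0 for url in hits}
    for source, targets in links.items():
        for target in set(targets):          # count each source page once
            if target in inbound and target != source:
                inbound[target] += 1
    return sorted(hits, key=lambda u: (-inbound[u], u))

# Illustrative data: three pages and the links between them.
LINKS = {"p1": ["p3", "p2"], "p2": ["p3"], "p3": ["p1"]}
SAMPLE = "the semantic web is a web of meaning"

print(near(SAMPLE, "semantic", "web"))
print(rank_by_inbound_links(["p1", "p2", "p3"], LINKS))
```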

Which engine will provide the best results for a particular user request depends on how much information is indexed by the engine, what information the user is seeking and how the engine ranks the relevance of results relating to the user's search terms. Because search engines are so central to extracting information on the Web, many reviews have been published¹⁹². The technology leaders, in terms of the number of documents indexed¹⁹³, are Google (Brin & Page 1998), which in December 2002 claimed to index more than 3 billion HTML pages (4 billion documents)¹⁹⁴; Wisenut, claiming to index more than 1.5 billion pages in January 2002¹⁹⁵; Fast's AllTheWeb, claiming to index more than 2 billion pages¹⁹⁶; Lycos¹⁹⁷ (Open Directory and Fast's indexing with additional features); NorthernLight¹⁹⁸, with more than 390,000 pages in January 2003¹⁹⁹; and AltaVista²⁰⁰. Multi- or meta-search engines multiplex searches across several proprietary search engines by querying each engine individually and then combining the results from all the queries before generating the response for the end user²⁰¹.
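The combining step performed by a meta-search engine might look something like the sketch below. The scoring rule (each URL earns 1/rank from every engine that returns it) is an assumption chosen for illustration; the source does not say how any actual meta-search service merges its results, and the engine names and URLs are placeholders.

```python
def merge_results(results_by_engine):
    """Combine ranked URL lists from several engines into a single list.
    Each URL scores 1/rank for every engine that returned it, so pages
    ranked highly by several engines rise to the top."""
    scores = {}
    for urls in results_by_engine.values():
        for rank, url in enumerate(urls, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=lambda u: (-scores[u], u))

# Placeholder responses from two hypothetical engines.
RESPONSES = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4"],
}
print(merge_results(RESPONSES))
```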

Using Portals

Given that people searching for information in the Web will usually want to use one of the search or directory tools, a number of organizations are aggregating search engines and general and specialized directories, along with a range of commercial content and advertising, into what have recently been termed portals. A portal may be one of the popular retrieval sites (e.g., Excite, Netscape, Yahoo, Lycos, LookSmart, About, Google) that many people access from other areas of the Web, or it may be the home page provided by the user's Internet service provider (ISP). Some portals require paid subscriptions (e.g., ISP home pages, AOL), but the majority provide their services free to users and seek financial return by presenting paid advertising.

Multimedia

A comparatively new aspect of the Web that I am personally still learning about is its ability to transmit, live or from storage, many other kinds of media besides simple text and pictures. For example, as I am writing this paragraph at home in a small country town outside Melbourne, Australia, I am listening live to Classic FM, broadcasting on 100 to 102 FM from London, England. This station is also transmitting streaming audio at 28 K over the Web, through the world communication net, to my computer, where it is being realized via the computer's audio system while I write²⁰² and query the Web for information on how streaming media works²⁰³.

Currently there is a variety of often incompatible formats and software applications for capturing, storing, delivering and receiving media over the Web. These have been developed by three main players:
- Apple, which launched its first QuickTime video system in January 1992. The first systems were designed to author and deliver video from laser discs. "Streaming video" was only released with QuickTime 3.0, after ²⁰⁴.
- RealNetworks, which began marketing its Real products in 1995.
- Microsoft, which apparently introduced its Windows Media Player in 1998 or 1999 after working for a time with RealNetworks²⁰⁵.

Wrapping Up the Web

Within the last decade, the World Wide Web has vastly extended the memory and cognitive capabilities of individuals with access to personal computers and ISPs. This technology makes available to a single individual a respectable and rapidly growing fraction of the entirety of humanity's recorded knowledge. This overwhelming wealth of knowledge is made useful as well as accessible by virtual and autonomous cognitive capabilities that exist in the layered architecture of World 3 itself.

Layer 1, at the bottom, consists of
- artifacts recorded as persistent objects of knowledge by individuals: web pages, articles, scanned documents, music, videos, downloadable computer programs, etc.; and
- "dynamic" artifacts of knowledge autonomously generated on demand in real time by databases and other processes: catalogues, price lists, product detail, weather reports²⁰⁶, etc.
Layer 2, built-in "static" semantics representing the authors' cognitive understanding of the relationships of their knowledge relative to the rest of World 3, including
- persistent links and reference citations built into knowledge artifacts by their authors
- human generated Web and portal directories
Layer 3, semantics generated by processes running autonomously in World 3 independently from knowing subjects, e.g.,
- 'bot' produced indexes,
- dynamic page ranking processes,
- autonomously generated classifications,
- "intelligent agents",
- etc.

Aside from the physical technology provided by microelectronics and the communications networks, the crucial factor that has allowed this incomprehensibly rapid growth of capabilities and knowledge in the Web has been the establishment of cognitively significant non-proprietary standards for the expression and communication of the knowledge artefacts it stores and generates.

Demonstrating Semantic Retrieval

Because I was familiar with the topics to be covered up to this point in the project, the writing has been comparatively easy. However, the situation is different for many of the subjects relating to organizations covered in Episode 4. Given the limited time I have had available for work on this project and very real practical difficulties in physically accessing major research libraries, further progress would have been impossible without the tools the Web provides for extending cognition. Google, PubMed, and a number of Web-based "full-text" retrieval services available through libraries, such as ISI's Web of Science and Elsevier's Science Direct, have enabled me to rapidly find knowledge in World 3 relating to these interests.

Most readers will have used Google or equivalent indexing services, and many of those in academic environments will also be familiar with Web of Science. However, it seems that only a few people other than research librarians have learned to make full use of the semantic power these tools provide. To close off this episode on personal productivity tools I will describe the procedure I use to quickly build a World 3 web of knowledge relating to a subject of interest. My example will demonstrate how I built the web forming the basis for a topic I have only recently added to the present Application Holy Wars project - the first section in Episode 4 on thermodynamics and the emergence of autopoiesis.

In abstract form, the following steps are used to build a web of World 3 knowledge:

Select one or more reasonably well known older publications that introduce core concepts relating to the subject of interest. If you are not already familiar with some appropriate papers, search the Web or one of the academic literature databases using appropriate keywords. Select a small number of articles that are available in full text or that at least provide the associated bibliography. From these, identify a small number of older papers that seem to define the concept(s) you are seeking knowledge about.
To find recent work available free-to-the-Web, use the titles of key articles as search terms in Google; i.e., block copy the article title into Google's search term field and put quotation marks around the title. This will retrieve Web pages containing exact matches for the title (e.g., as found in bibliographic citations). For very short titles that may also correspond to ordinary strings of text, add the author's surname outside the quotation marks. This will return all free-to-the-Web articles and other pages referencing the key article - presumably because the "hit" includes text that has a specific semantic relationship to the referenced work. Note that the hits will often include articles from the formal literature. Even though the journals in which they are published are not themselves available free to the Web, many articles are exposed to the Web via individual authors' home pages. Note: given that Google's Web links may not be current, as long as you can retrieve the URL for an article that once existed on the Web, chances are good that the article can still be retrieved via the WayBack Machine (http://www.archive.org).
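The search-string construction described in this step can be sketched as below. The title and surname are placeholders, the helper names are hypothetical, and the two URL patterns (Google's `search?q=` address and the Wayback Machine's `web.archive.org/web/` prefix) are assumptions about how those services are commonly addressed rather than anything stated in the text.

```python
from urllib.parse import quote_plus

def citation_query(title, surname=None):
    """Build the search string: the exact title in quotation marks,
    with the author's surname added outside the quotes if supplied."""
    query = f'"{title.strip()}"'
    if surname:
        query += f" {surname.strip()}"
    return query

def google_search_url(query):
    """URL-encode the query for use in a browser address bar (assumed pattern)."""
    return "https://www.google.com/search?q=" + quote_plus(query)

def wayback_url(page_url):
    """Address asking the Wayback Machine for an archived copy (assumed pattern)."""
    return "https://web.archive.org/web/" + page_url

# Placeholder title and surname, for illustration only.
q = citation_query("A Sample Article Title", "Smith")
print(q)
print(google_search_url(q))
print(wayback_url("http://example.org/old-paper.html"))
```

Adding the surname is only needed for short titles that could match ordinary text; for long, distinctive titles the quoted string alone is usually specific enough.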

As described in the section on Information Science (see citation indexing), ISI's Web of Science provides a much more exact tool for semantically retrieving current academic literature on a subject. Here, the search term is usually no more than the primary author's surname and initial; for common names or prolific authors you can further qualify the search by adding the year of publication and journal/book title. ISI's Web of Science indexes cover three major disciplinary areas:
- Science Citation Index Expanded, with searchable author abstracts, covering the journal literature of the sciences. It indexes more than 5,700 major journals from 1945 onward across 164 scientific disciplines (currently with more than 17 million records);
- Social Sciences Citation Index, with searchable author abstracts, covering the journal literature of the social sciences. It indexes more than 1,725 journals spanning more than 50 disciplines (currently with more than 3 million records), as well as covering individually selected, relevant items from over 3,300 of the world's leading scientific and technical journals.
- Arts & Humanities Citation Index covering the journal literature of the arts and humanities. It indexes 1,144 of the world's leading arts and humanities journals, as well as covering individually selected, relevant items from over 6,800 major science and social science journals (currently with 2.5 million records).
Web of Science indexes are computer generated from electronic materials provided by journal publishers, and articles are indexed as soon as (or in some cases even before) they are published. Although ISI does not index all academic journals, it does attempt to index all of the more significant interdisciplinary journals and the more prestigious or key journals within each discipline. Thus, hits returned from a citation search can be expected to include a reasonable selection of the most recent and better quality papers. In most cases, the articles returned will include the abstract and bibliography.
Search returns (whether from Google or Web of Science) are then scanned to determine whether they accurately represent the kind of subject matter you are seeking. If so, scan the bibliographies of the returned articles for additional references relating to the desired subject matter. In some cases bibliography references will link directly to the cited articles, some in full text (and in the case of Web of Science, links will also point to the records for cited articles that have also been indexed). Note: bibliographic references will probably include works not indexed directly by Google or Web of Science, including books and other materials still available only in paper formats. However, the fact that they have been referenced by papers you regard as important, together with the information in their titles, will probably help you to determine whether it is worth the effort to track down a hard copy of the referenced document.
Older articles referenced by the hits collected in Step 3 can be used as additional search terms to find other, more recent articles until diminishing returns are reached, at which point you can be fairly confident that your bibliography on the subject in question is reasonably complete. Note that non-electronic references from the returns of your first search can still be used as additional search terms in Google or Web of Science, because you are searching for articles that reference the search term article, not for that article itself.
Once a reasonably complete bibliography has been assembled, you can begin to assemble full texts of the desired articles. Some journals are available free to the Web, either independently or via services like the US National Institutes of Health's PubMed Central. In some cases where journals are not available in toto, individual articles may still be found free to the Web via authors' or institutional Web sites. In my experience perhaps 1-5% of the articles I seek can be found on the public Web - which may be all that is required to explain or connect to the literature relating to a key idea. Where access to subscription journal databases can be achieved by logging into a research library (as has been provided to me by Monash University), possibly more than 50% of the important papers can be retrieved in full text format. Full text databases I now access regularly include Elsevier's Science Direct, Proquest, Emerald, Highwire, ACS Publications, ACM Digital Library, Journals@Ovid Fulltext, JSTOR, etc., as well as several publishers' individual sites, e.g., Kluwer Online.
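The iterative part of this procedure (search for works citing a key article, scan their bibliographies, then use the newly found references as further search terms until little new turns up) can be sketched as a simple loop. The two lookup tables stand in for what a citation search and a returned bibliography would provide; the article labels, function name and `min_new` stopping threshold are all hypothetical.

```python
def build_bibliography(seeds, citing, references, min_new=1):
    """Repeat citation searches until a round adds fewer than `min_new` items.
    `citing[x]` lists works that cite x (what a citation search returns);
    `references[x]` lists the works in x's own bibliography."""
    bibliography = set(seeds)
    search_terms = list(seeds)
    while search_terms:
        found = set()
        for term in search_terms:
            for hit in citing.get(term, []):
                found.add(hit)                          # the citing article
                found.update(references.get(hit, []))   # its bibliography
        new_items = found - bibliography
        if len(new_items) < min_new:                    # diminishing returns
            break
        bibliography |= new_items
        search_terms = sorted(new_items)
    return bibliography

# Toy data: labels are invented; "F1975" has no electronic record of its own.
CITING = {"A1970": ["B1995", "C1999"], "D1980": ["E2001"]}
REFERENCES = {
    "B1995": ["A1970", "D1980"],
    "C1999": ["A1970"],
    "E2001": ["D1980", "F1975"],
}
print(sorted(build_bibliography(["A1970"], CITING, REFERENCES)))
```

Note that an item such as "F1975" still enters the bibliography and is used as a search term even though it appears only inside another article's reference list, which mirrors the point made above about non-electronic references.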

Assuming that you have a broadband connection to the indexing service(s), comprehensive bibliographies can be built in a few hours. Even working from home over a 56 K modem connection via a research library, I can do vastly more with Web of Science than I was ever able to do working in a library with physical indexing services²²¹. I illustrate the above process with some sample returns from the searches I performed to build a bibliography for my next section on

From the time when I taught biology in the 1970s I have believed that

These themes will be elaborated in the remainder of this work.
