Untangling the Web
URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.


Spinning a Web Search

Mark Lager
California Lutheran University


Copyright 1996, Mark Lager. Used with permission.

Abstract

Thesauri, subject headings, and keyword searching have been the customary strategies for locating materials. Categorizing by subject or descriptor term creates a thesaurus of specified terms, a controlled vocabulary.

Searching this list identifies cataloged items. However, new computing techniques (artificial intelligence, natural language processing, relevancy feedback, query-by-example, and concept-based searching) have added a new sophistication to information retrieval. Robots, programs that search the Web using these new techniques (also known as spiders or wanderers), offer the 'net searcher a higher degree of precision in retrieval.

The presentation is targeted toward Web searchers, in particular reference librarians and those who navigate the Internet frequently. This presentation will look at search engines, comparing search techniques and noting differences. The workshop will identify the use of new computing strategies for information retrieval within each engine.

"Spinning a Web Search" requires an electronic classroom with large screen projection for Netscape (or other browser), or classroom demonstration will focus the workshop: handouts (with exercises) will be provided for clarification and training. The workshop will be one hour.


Introduction

Information retrieval has always been the focus of information scientists and reference librarians, who try to provide relevant materials in response to a user's query. As reference librarians, we often pride ourselves on knowing our collections and what they hold. Or we sleuth it out, knowing where to find the information. We use major indexing and abstracting tools, our library catalogs, or online services to discover the needed information. It has been a domain left to those who can ferret out, or research, the user's information need.

The World Wide Web (WWW) has burst that bubble! No longer is this a domain reserved to the library or a special collection. The Internet, and especially the WWW, has made a major collection of information accessible to everyone, anywhere. The Web uses hypertext, a common language for linking files, to jump easily between them; the WWW opens the publishing arena to anyone with a computer. These hypertext links join databases, files, sounds, and pictures; texts, library catalogs, songs, video, and more are now available to the computer-literate.

Unlike the orderly world of the library collection, this new source of information is chaotic, often unorganized, and of uneven quality. As Brian Pinkerton states, "The World Wide Web is decentralized, dynamic and diverse; navigation is difficult and finding information can be a challenge." (Pinkerton, 1994) The useful and the innocuous are lumped together in this huge collection. Academic information (e.g., journal articles and course materials) is combined with social and cultural information and with personal home pages. There is no separation. Mark Nelson calls this information anxiety: the overwhelming feeling one gets from having too much information or being unable to find or interpret data. (Nelson, 1994) To be of any informational value, the data must first be organized and retrievable, given some structure. Search tools have begun to bring some organization to these uncharted waters, and current trends in information retrieval offer better opportunities to make more efficient use of this information resource.

This workshop is designed to explain trends in retrieving information. The workshop will focus on techniques for retrieval used in the information sciences and in WWW search engines. We will look at some search engines and their make-up to see the techniques used. It is not designed to explore the search syntax of the various engines (e.g., + for plurals, - for negations, quotation marks for phrases, etc.).

Recall vs. Precision

The purpose of reference service and information science is to provide useful information in response to a query. Regardless of the method used, whether I rely on my own knowledge of where to locate the information or the computer searches its index, retrieval can be judged on two measurable counts: whether the information retrieved is considered relevant, and whether all the relevant material was retrieved. These two metrics, precision and recall, serve to express information retrieval performance.

Recall is the percentage of all relevant documents in a database that the search retrieves. Recall refers to how much information is retrieved by the search; total recall would locate every document in the database that matched the search criteria. Precision is the percentage of retrieved documents that the searcher is actually interested in; it focuses on the relevant, most useful items in the search results. High recall with high precision is the ultimate goal: information retrieval scientists aim to provide the most precise, relevant documents in the midst of the recalled search results. (Dataflight, 1995)

Let me use three examples to illustrate recall, and then precision. When GM does a recall of its cars, it takes back, out of the total number of GM-manufactured cars, those of a certain model or year. Or suppose I am searching for all red books in my collection of 1,000 books, and I pull 50 books from the shelves: recall is 50 out of 1,000, or 5%. This figure recalls from the collection only those items that match my need. Finally, if I am looking for disk brakes, I may begin to search the index with the word "disk." I get a high recall that matches my query, since it includes both disk drives and disk brakes.

Precision focuses on only the relevant documents among the recalled items. For the GM example, precision is the number of recalled cars that actually have the defective part; GM may have to check all the recalled cars to find only a few with the defect. Out of the 50 red books I collected, there are 25 that match the shade of red I want. As for the disks, who knew what kind of disk I meant? I must reinitiate the search to qualify my need.
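
These two measures can be made concrete. Below is a minimal sketch in Python of the standard formulation (for recall, relevant items retrieved over all relevant items); the counts follow the red-book example, and the document IDs are invented for illustration:

    # A minimal sketch of recall and precision, using the red-book example.
    # Document IDs are illustrative assumptions.
    relevant = set(range(25))     # the 25 books in exactly the shade of red I want
    retrieved = set(range(50))    # the 50 red books pulled from the shelves

    hits = relevant & retrieved   # relevant documents actually retrieved

    recall = len(hits) / len(relevant)       # 25/25 = 1.00: every wanted book was pulled
    precision = len(hits) / len(retrieved)   # 25/50 = 0.50: half the pulled books are wanted
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")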

The key phrase is "relevant." This is the quandary in information science and for librarians. Who determines what is relevant? I know what shade of red I want, but who else knows that? I know what kind of brakes I want, so why doesn't the computer read my mind? GM knows which cars have the problem, why don't they just let the owners crawl under the car to see if the part is defective?

Relevance

Traditional finding aids for information retrieval have focused on subject headings, descriptors, or established vocabulary (examples include LCSH, the ERIC thesaurus, and UMI's Controlled Vocabulary List). Cataloging items in the MARC format is one primary means of making them retrievable; in fact, the MARC format is an information retrieval standard, Z39.2. (Bowker, 1991) The subject headings are used to determine the primary focus of a book or article, i.e., to help precision if the user makes use of the established subject heading. In the 70's, automated catalogs were created to use the traditional access points: author, title, series, subject, and added entries. Over the years, catalogs were enhanced by adding search capabilities to additional fields, e.g., the 5xx (notes) fields. More data, higher recall. Current online catalogs offer keyword searching and Boolean searching to assist in precision. By adding more word access, it is hoped that the searcher's terms will be in the index. With more data to search, the search engine can return more documents, gaining greater recall; it is for the user to sift through the recalled documents to find useful ones. By adding pointers between words, building interrelationships, an index can become highly beneficial for retrieval of information. (EB, 1995)
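
This word-by-word access can be pictured as an inverted index: a table mapping each term to the records that contain it. Below is a minimal sketch in Python; the catalog records are invented, and a real catalog would index structured MARC fields rather than raw strings:

    # A toy inverted index over catalog records, mapping each word to record IDs.
    # The records are hypothetical examples.
    records = {
        1: "disk brakes for passenger cars",
        2: "repairing disk drives",
        3: "history of the automobile",
    }

    index = {}
    for rec_id, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(rec_id)

    # Keyword search: every record containing the term is recalled.
    print(index.get("disk", set()))   # {1, 2} -- brakes and drives alike, high recall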

The Web

The World Wide Web began in 1989 at the CERN Particle Physics Lab in Switzerland. The Web, however, did not gain widespread popular use until browsers like NCSA Mosaic became available in 1993, and Netscape in 1994. The task of making the Web more searchable began soon thereafter, with search tools such as the Wanderer and JumpStation in 1993. The Web doubles in size every five months, which makes indexing and updating a formidable task. (Gray, 1995) With over 3 million host computers on the Web (Pike, 1995), it is difficult to imagine finding only relevant, precise information in the midst of all the data. The computers on the Internet share a common language called TCP/IP; based on this commonality, computers communicate with each other. One method of communication used on the Internet is called client/server. Think of this as "one speaker at a time": only one computer "has the floor." Of course, at high speed it looks as if conversations are simultaneous. Just as a patron asks a librarian for assistance with an information request, so the client (a piece of software) asks the server (which houses the information), and the server responds. The information, however, is only as good as what is indexed and retrievable.

Web Indexing

There are two major categories of searching tools on the Web: directories (what we know as an index) and search engines. Both require an indexing system. Building an index is done by either human or computer. For computers, the software program, called a robot, spider, or wanderer, visits each site and gathers information. A robot obtains the home page address (URL) and then recursively visits some or all of its links. The index contains URLs, titles, headings, and other words from the HTML document. (HTML is an acronym for hypertext markup language.) Each index is different depending on what information is deemed important; e.g., Lycos indexes only the top 100 words and the first 20 lines of text, while Alta Vista and Infoseek index every word. Robots perform resource discovery, the term for a robot's ability to summarize and index the data on the Web, to automatically update information, and to change or eliminate dead links. (Fischer, 1995)

Robots work in either of two ways:

Acting like a sophisticated web browser, the robot automatically retrieves documents or other information until told to stop. (Boutell, 1995) Creating the index is usually done by a robot, since that is the more efficient means of searching. The robot, a simple executable program, sweeps a portion of the Web; retrieves, parses, and stores pieces of each HTML document; and then reindexes the data. (A minimal sketch of this fetch-parse-follow loop appears after the list below.) There are over 30 robots in existence. (Fischer, 1995) Below is a listing of a few robots:

Explorer Katipo Titan Aretha Northstar
Python Htmlgobbler Pagemator Websmurf Lycos
Arachnophillia Scooter Webforager
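
The fetch-parse-follow loop at the core of any such robot can be sketched in Python using only the standard library. This is a toy under simplifying assumptions, not any of the robots named above: a real robot would also honor exclusion rules (robots.txt), throttle its requests, and store parsed titles and headings in its index.

    # A minimal sketch of a robot's fetch-parse-follow loop.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href targets of <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        seen, queue = set(), [start_url]
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue   # a dead link: a real robot would drop it from the index
            parser = LinkParser()
            parser.feed(html)
            # Follow some or all of the page's links, resolved against its URL.
            queue.extend(urljoin(url, link) for link in parser.links)
        return seen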

Directories, like Yahoo, {Infoseek Guide}, {Librarian-Built Subject Guides}, or {John Makulowich's Awesome List}, are examples of lists created by subject. These can be created by machine or by human. The searcher can click on a topic and see a listing of sites. Directories are an excellent place to start, and I recommend that users begin with them, especially if they are not familiar with the Web. Just as we would send persons beginning a search to the printed index volumes to look up a topic, so the online directory lets the user browse among a reviewed list. Indexes provide a clear, condensed grouping by subject, saving time. However, the lists differ from one another, since each has been created using the specific criteria of its indexer; there is no standard for directory terms.

Search engines

The search engine provides more control for the user in performing a search. Engines use the index to match the terms of the query. This means that the more data in the index, the higher the recall: indexing every word, or the most-used words, can lead to higher recall depending on the search query. The larger the index, the greater the possibility of hitting upon the words of the query. And, given the size of the Web, the more often the index is updated, the greater the number of hits.

Search engines on the Web incorporate a number of techniques to assist in both recall and precision. There are search engines that employ traditional methods like thesauri or Boolean searching. Rather than being only a keyword search, the engine will make logical connections to a thesaurus to enhance recall. Using Boolean logic (and, or, not, adjacency operators) search engines can assist in making the query more precise. Different engines have different defaults.
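
Given an inverted index like the one sketched earlier, the Boolean operators reduce to set operations on its postings. A minimal illustration in Python; the postings lists are invented:

    # Boolean operators as set operations over invented postings lists.
    postings = {
        "disk":   {1, 2, 5},
        "brakes": {1, 3},
        "drives": {2, 5},
    }
    everything = {1, 2, 3, 4, 5}   # all documents in the collection

    print(postings["disk"] & postings["brakes"])   # AND: {1} -- more precision
    print(postings["disk"] | postings["brakes"])   # OR: {1, 2, 3, 5} -- more recall
    print(postings["disk"] - postings["drives"])   # AND NOT: {1}
    print(everything - postings["disk"])           # NOT needs the full set: {3, 4}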

Newer techniques employed by the engines include:

Natural language processing
Relevancy feedback/weighting
Probabilistic logic
Query by example
Fuzzy logic
Query expansion
Bayesian networks
Case-based reasoning
Parallel computing (Inktomi)
Concept-based searching

(For a listing of the variety of search engines, over 120, see {http://ugweb.cs.ualberta.ca/~mentor02/search/search-all.html}.)

New Trends in IR

Artificial Intelligence

AI is the capacity of a digital computer or computer-controlled robot to perform tasks commonly associated with the higher intellectual processes characteristic of humans, such as the ability to reason, discover meanings, generalize, or learn from past experience; the term is also frequently applied to the branch of computer science concerned with developing systems endowed with such capabilities. (EB, 1996) Artificial intelligence, in short, refers to creating computers that can think and reason. AI focuses on finding a logical, mathematical way to represent knowledge. The computer can be programmed with this mathematical model to assist in decision making, information retrieval, and analysis; then, when a query is asked, the computer follows the rules to form a response. AI has many facets, including robotics, expert systems, and voice recognition and simulation. Search engines incorporate some of the fascinating trends in AI.

Probabilistic Logic

Will it rain today? What is the possibility of my car needing an oil change? What is the chance of getting an A on my history test? There are many questions like these that cannot be answered with a simple affirmative or negative; uncertainty reigns. In an effort to make decisions that account for such doubt, in the midst of chaos, a branch of logic was developed to study probability. Since the 16th and 17th centuries, probability theory has been used to explain chance. Such questions rely on factual information, such as history, coupled with probability. In information retrieval, the same applies. By setting up a formula, an algorithm, that places values on words, their interrelationships, proximity, and frequency, the computer can be used to help locate relevant sites. By computing these factors together, the search engine can produce a relevancy ranking that is then displayed to the user. (De Bra, 1995)

Probabilistic logic is founded on the presumption that certain factors can be established logically and mathematically to focus a search. It is similar to fuzzy logic, where the central notion is that truth values (in fuzzy logic) or membership values (in fuzzy sets) are indicated by a value in the range [0.0, 1.0], with 0.0 representing absolute falseness and 1.0 representing absolute truth. (Brule, 1985)

One method of reasoning about such possibilities was created by the Rev. Thomas Bayes, an 18th-century mathematician. His theorem applies a mathematical, logical representation to the various factors. Here is an example of his mathematical model of probability (Case, 1995):

p(h|e,i) = p(h|i) * p(e|h,i) / p(e|i)

where:
p = probability
h = hypothesis
e = evidence
i = context
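
A small worked example may help; the numbers are invented. Suppose 10% of the documents in a collection are about brakes (the hypothesis h), the word "disk" (the evidence e) appears in 80% of the brake documents, and "disk" appears in 20% of all documents, all within the collection's context i:

    # Bayes' theorem with illustrative numbers:
    #   p(h|e,i) = p(h|i) * p(e|h,i) / p(e|i)
    p_h = 0.10           # prior: 10% of documents are about brakes
    p_e_given_h = 0.80   # "disk" appears in 80% of brake documents
    p_e = 0.20           # "disk" appears in 20% of all documents

    p_h_given_e = p_h * p_e_given_h / p_e
    print(p_h_given_e)   # 0.4: seeing "disk" raises the probability from 10% to 40%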

Returning the value of the possibility is called weighting. Weighting of terms by a search engine is based on a number of factors:

A. Relative frequency - the more times a word or phrase appears, the more weight it carries; the frequency of the term places a higher weight on the document.
B. Closer to the top - documents that have the query term(s) in the URL or in the title are weighted more strongly, and terms appearing near the top of the document are weighted as more relevant than those at the bottom.
C. More occurrences - if a document uses the key terms often, it is ranked more highly than one that seldom uses a particular term.
D. Adjacency or proximity - words from the query that are found next to each other in the document score higher.
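
A toy scoring function shows how such factors can be combined into a single relevancy value; the individual weights (2.0 for a title hit, and so on) are invented for illustration and are not taken from any actual engine:

    # A toy relevancy score combining the four weighting factors above.
    # All numeric weights are illustrative assumptions.
    def score(query_terms, title, body_words):
        s = 0.0
        title_words = title.lower().split()
        for term in query_terms:
            if term in title_words:
                s += 2.0                      # B: terms in the title weigh strongly
            for pos, word in enumerate(body_words):
                if word == term:
                    s += 1.0                  # A/C: each occurrence adds weight
                    s += 1.0 / (pos + 1)      # B: positions near the top count more
        for i in range(len(body_words) - 1):  # D: adjacent query terms score higher
            if (body_words[i], body_words[i + 1]) == tuple(query_terms[:2]):
                s += 0.5
        return s

    doc = "disk brakes wear quickly but disk drives do not".split()
    print(score(("disk", "brakes"), "Disk brake maintenance", doc))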

Query by example

Query-by-example (QBE) is the concept of giving the search engine an example for which to search; using this example, the system returns other, similar documents. For example, I want a book about gorillas, published in 1984, that has a green cover: I have set up an example of what I am looking for using all my qualifications. Search engines use the technique to set up queries that find similar pages or files. The search is reinitiated using the example as the new source for the query. This interactive searching gives the user more control over the search process: users can find more documents like the one selected, and the results returned are more focused because of the qualified terms. (Sugihara, 1995)
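
One simple way a system can "find more like this" is to treat the example document's own words as the new query and rank the rest of the collection by overlap. A minimal sketch follows; the Jaccard measure and the sample documents are assumptions, and real engines use richer similarity measures:

    # Query-by-example as word overlap: the selected document becomes the query.
    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b)

    example = "mountain gorillas of central africa"
    candidates = [
        "gorillas and chimpanzees in africa",
        "mountain climbing in the alps",
        "desktop publishing for beginners",
    ]
    # Rank the collection by similarity to the example document.
    for doc in sorted(candidates, key=lambda d: jaccard(example, d), reverse=True):
        print(f"{jaccard(example, doc):.2f}  {doc}")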

Query Expansion

Once a search has been completed, it often needs to be enhanced or changed. A library patron who comes to the desk asks one question, but usually there is some additional information need behind it. The purpose of the librarian is to elicit that actual request; the quest of the information scientist is to discover how the computer can assist in evoking that query and its modifications. Newer search engines provide the user with more control over the query by adding a means to resubmit the search with any changes.

Automatic Summaries

Many search engines incorporate a feature that creates summaries of the documents retrieved. These can be based on taking information from the first few lines or on locating key statements within the document.
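
The first approach is simple enough to sketch directly; the sample page text is invented:

    # A toy summary: return the first few lines of the document.
    def summarize(text, n_lines=2):
        return " ".join(line.strip() for line in text.splitlines()[:n_lines])

    page = "Disk brakes stop a car by friction.\nPads squeeze a rotor.\nMore detail follows."
    print(summarize(page))   # the first two lines serve as the summary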

Natural Language Processing

Natural language processing is the art and science of getting computers to understand natural language; it is a part of artificial intelligence. (Case) With NLP, computers process language not only by exact keyword match: NLP uses a set of concepts to sort out the interrelationships of words. The computer breaks a sentence apart into its semantic parts (nouns, verbs, adjectives, etc.) and then creates links. Since language can be ambiguous, vague, or metaphorical, NLP seeks to compute the relationships between words, giving each a correlate to the words around it. Put into a formula, the computer then makes assumptions based on its logic. Although similar to a keyword search, such an engine allows a user to pose the query as if asking a librarian.
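
A full semantic parser is well beyond a sketch, but even the first step, reducing a natural-language question to the content words an index can match, can be shown; the stopword list here is a small invented fragment:

    # Reducing a natural-language question to content terms by dropping
    # function words. The stopword list is an illustrative fragment.
    STOPWORDS = {"what", "is", "the", "a", "an", "of", "for",
                 "how", "do", "i", "on", "my"}

    def content_terms(question):
        words = question.lower().rstrip("?").split()
        return [w for w in words if w not in STOPWORDS]

    print(content_terms("How do I fix the disk brakes on my car?"))
    # ['fix', 'disk', 'brakes', 'car'] -- terms a keyword engine can match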

Concept-based searching

Using the idea of a thesaurus, a search engine can expand upon the keywords that a user inputs. In this manner, users do not have to know the exact words needed to retrieve relevant documents. And, instead of reinitiating the search based on "confidence" or "weighting," the search engine automatically includes the related terms.
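
A minimal sketch of such expansion, with a hand-built synonym table standing in for the knowledge base; an engine like Excite derives these relationships statistically from the collection itself:

    # Concept-based searching as automatic thesaurus expansion.
    # The synonym table is a hand-built assumption.
    THESAURUS = {
        "car": {"automobile", "auto", "vehicle"},
        "brakes": {"braking", "brake"},
    }

    def expand(query_terms):
        expanded = set(query_terms)
        for term in query_terms:
            expanded |= THESAURUS.get(term, set())
        return expanded

    print(expand({"car", "brakes"}))
    # the search now also matches "automobile", "vehicle", "braking", ...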

Search Engines

A survey of the Search Engines available from Netscape's Net Search will help in explaining some of the techniques discussed. By conducting a search for current trends in information retrieval, differences can be seen in the structure and techniques of each engine.

Alta Vista {http://www.altavista.com/}

Techniques and features
Boolean - must use and, or, not, near (10 words) in Advanced Search
Allows user-influenced results ranking
Ranking: title words or first few words
Parentheses for nesting
Can restrict to field (qualifiers)

Excite http://www.excite.com/

Techniques and features
Concept based searching-use statistical strength of interrelationships between words
Creates its own knowledge base (or internal thesaurus)
QBE - "similar documents"
Boolean searches
Keyword searches
Relevance - marked with red X
Robot is called Architext

Infoseek {http://infoseek.go.com/}

Techniques and features
Weight terms (required, desirable, undesirable)
Similar pages - QBE
Boolean operators
Natural language
Search mechanisms

Lycos {http://www.lycos.com/}

Techniques and features
Probabilistic retrieval
Indexes top 100 words and 20 lines of abstracts
Keyword searching
Boolean searching
Automatic truncation
Adjacency 0.0 - 1.0
Results categorized
Terms in bold
Relevancy: early on vs. farther down

Magellan {http://magellan.excite.com/}

Techniques and features
Reviewed by writers
Boolean searching
Green light for information for all age groups
Web, ftp, gopher, newsgroups, telnet sites
Browse directory or Use search engine
Relevancy = frequency of words
Browse button
Robot named Verity
Lists up to 20 pages at the bottom of the screen

Open Text {http://www.opentext.com/omw/f-omw.html}

Techniques and features
Boolean searching
Field operators: anywhere, summary, title, first heading, URL
Query-by-example

Conclusion

Information search and retrieval is of major importance in locating relevant materials. The ability to aid and assist a user in finding relevant information is the goal of librarians and information scientists. On the Web, search engines have made the process easier by incorporating a number of newer techniques, which include artificial intelligence, Bayesian statistics and probability theory, weighting, and query by example. With the goal of finding relevant materials, these new techniques locate information and also refine the search query. Since search engines have different criteria in creating their indexes, it is most useful to use more than one engine in searching the Web to gain relevant information. As a rule, the more critical or focused the query, the more engines should be applied. With advances in the tools for information retrieval, the future holds exciting possibilities for searching on the World Wide Web.


Bibliography

"Alta Vista: Tips". [{http://www.altavista.com/cgi-bin/query?pg=tips} 1995.

Birnham, L. "Natural Language Processing". [{http://yoda.cis.temple.edu:8080/nlp/nlp-course/lecture1}]. 1994.

The Bowker Annual: Library and Book Trade Almanac. 35th edition. 1990-1991. New York: Reed Publishing, c. 1990.

Boutell, Thomas. "World Wide Web FAQ. Robots." [{http://www.ibiblio.org/boutell/faq/robots.htm}] 1995.


Brule, James F. "Fuzzy Systems- A Tutorial." [{http://www.csu.edu.au/complex_systems/fuzzy.html}]. 1985.

Case, J. "Natural Language Processing". [{http://bones.wcupa.edu/~jcase/ciir1a-report.html}].

Dataflight Software, Inc. "Concordance Information Retrieval System." [{http://www.dataflight.com/white.papers.html}]. 19 February 1996.

De Bra, Dr. P.M.E. "Hypermedia structures and systems." [{http://wwwis.win.tue.nl/2L670/static/index.html}]. 1996.

Excite. "Handbook: NetSearch." [{http://www.excite.com/cgi/comsubhelp.cgi?display=html;path=/query.html;section=search;Help=Help}]. 1996.

Encyclopedia Britannica. [{http://www.britannica.com/}] 1996.

Fischer, Keith. "Preliminary robot.faq." [{http://info.webcrawler.com/mak/projects/robots/active.html}]. 6 Nov. 1995.

Gray, Matthew. "Measuring the Growth of the Web". [{http://www.mit.edu/people/mkgray/growth/}]. 1995.

"Infoseek Tips." [{http://infoseek.go.com/}] c.1996.

Koch, Traugot. "Robot-based WWW Catalogs." [{http://www.lub.lu.se/netlab/documents/nav_menu.html#robo}]. 1996.

"Magellan Frequently Asked Questions." [{http://magellan.mckinley.com:80/mckinley-txt/250.html#howperform}] 1995.

Needleman, Mark. "Information Retrieval and the ASIS Standards Committee." American Society for Information Science Bulletin, Feb. 1995, 21(3), p. 25-26.

Nelson, Mark R. "We Have the Information You Want, But Getting It Will Cost You: Being Held Hostage by Information Overload." [http://www.acm.org/crossroads/xrds1-1/mnelson.html]. Sept 1994.

Notess, Greg R. "Searching the World Wide Web: Lycos, Webcrawler, and more." Online, July 1995. 19(4), p.48-53.

Pike, Mary. Using the Internet. Second edition. Indianapolis, IN: Que, 1995.

Pinkerton, Brian. "Finding What People Want: Experiences with Webcrawler." [{http://www.thinkpink.com/bp/WebCrawler/WWW94.html}]. 1994.

Sugihara, J. ICS421 Lecture Notes 1. [{http://www2.ics.hawaii.edu/~sugihara/course/ics421s95/note/3-06n13}]. Jan. 1995.

Van Rijsbergen, C. J. "Information Retrieval." [{http://www.dcs.glasgow.ac.uk/Keith/Preface.html}].

"WebCrawler Help." [{http://www.webcrawler.com/Help/Examples.html}]. 1996.

Winship, Ian R. "World Wide Web searching tools, an evaluation." [{http://www.bubl.bath.ac.uk/BUBL/IWinship.html}]. 1995.

"Yahoo! Help." [{http://docs.yahoo.com/docs/info/help.html}] 1996.
