Untangling the Web
URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.


Yahoo! Cataloging the Web

Anne Callery
Cataloger
Yahoo! Inc.


Copyright 1996, Anne Callery. Used with permission.

Abstract

The Internet has the potential to be the ultimate information resource, but it needs to be organized in order to be useful. I will discuss how Yahoo! differs from most web search engines, and how best to search Yahoo! for information. Libraries are forging ahead and beginning to catalog the Internet, but Yahoo! catalogs differently and does not follow traditional library procedures. I will explain why this is so, and demonstrate Yahoo!'s entire cataloging process. This presentation should be of interest to general users as well as catalogers.

The Internet is full of information that needs to be organized and made accessible in order for it to be useful. Yahoo! organizes information on the Internet, particularly on the World Wide Web.

Yahoo! is Not Just a Search Engine

There is often confusion about the functions of subject guides, such as Yahoo!, as opposed to search engines, such as {Lycos}, {Alta Vista}, {WebCrawler}, and others. Yahoo! can act as a search engine (through its Open Text searches), but its strength lies primarily in the subject hierarchy.

There are several advantages to searching a hierarchical subject index, for example:

Contrary to the claims of many enthusiastic users, Yahoo! doesn't presume to catalog everything on the web. Rather, Yahoo! is a filter and organizer of useful information, and we plan to continue in that capacity. A search engine may hold many individual URLs in its database and search all of them, but not every one of those URLs is useful on its own. For example, someone might publish a 50-page thesis on the web. A search engine would index the URL of each of the 50 pages (each being a separate HTML document with its own URL), so a search could pull up a few random pages from the middle of the thesis, such as a single illustration or reference. (And, unfortunately, not every web publisher thinks to add handy links back to the main page on these little side pages.) Yahoo! would index only the top page of the document (or its significant sections), presenting the work as a whole.

Yahoo! Creates its Own Classification System

Librarians are among those cataloging the Internet. One major project is {OCLC's Internet Cataloging Project}. Although this project is great for library cataloging, there are a few reasons why this kind of cataloging isn't practical for Yahoo!.

The decision to depart from standard library classification systems was a carefully considered one. With so much to do already, we would have been happy to adopt an existing system and save a lot of time and energy. However, for various reasons, no one system could meet all our needs. We do look to other systems, such as the Library of Congress Classification (LCC), for ideas and guidelines for the organization of certain areas. I like to compare Yahoo!'s subject hierarchy with the early Dewey Decimal Classification, except in our case, it's much easier to expand and grow! Yahoo! may have started out a little heavier in some areas than others, but some of those initially smaller areas have really taken off.

How Yahoo! Catalogs Web Sites

Yahoo! currently receives thousands of submissions each day. Although our cataloging staff is continually growing and we have made many improvements to the add process, we're still a little short of meeting this demand. Every site added to Yahoo! is examined by a human being. The suggested category (that which the submitter selects) is used as a guide, and we reserve final editorial judgment. Having the user suggest a category helps us organize the submissions, which are grouped each week by subject. The subject lists are organized on a dedicated server and distributed among the catalogers. Most of us specialize in certain areas, which ensures that each category has a small group of people who know it fairly well. The cataloger selects an item from the list, and a display is brought up.

There are fields for title, URL, contact person, geographic location, descriptive comment, and indicators for the presence of Java and VRML. We're not using all of these fields in the actual Yahoo! display at this time, but they could be implemented later. Below the fields is a snapshot of the submitted page. Occasionally, just looking at this top page will tell us enough about the site to place it in a category (e.g., an X-Files fan page), but more often we will explore the site a little to get a feel for its content. We select categories (using another application which is an interface to the Yahoo! database) and add the site. Then we send off an e-mail to let the submitter know the site has been added or, in some cases, explaining why the site was not added.
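
The record a cataloger works from can be pictured as a small data structure. The following is only an illustrative sketch: the field list comes from the description above, but the class name, attribute names, the catalog helper, and the sample values are all assumptions, not Yahoo!'s actual add application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Submission:
    """One submitted site (hypothetical model of the fields named in the text)."""
    title: str
    url: str
    contact_person: str = ""
    geographic_location: str = ""
    comment: str = ""                # descriptive comment
    has_java: bool = False           # indicator for the presence of Java
    has_vrml: bool = False           # indicator for the presence of VRML
    suggested_category: str = ""     # chosen by the submitter; a guide only
    final_categories: List[str] = field(default_factory=list)  # set by the cataloger


def catalog(sub: Submission, categories: List[str]) -> Submission:
    """Record the cataloger's decision; the submitter's suggestion is not binding."""
    sub.final_categories = list(categories)
    return sub


# Hypothetical example values, for illustration only.
page = Submission(
    title="Example Fan Page",
    url="http://example.com/",
    suggested_category="Entertainment",
)
catalog(page, ["Entertainment/Television"])
```

Keeping the suggested category separate from the final categories mirrors the point that the submitter's choice guides the cataloger but does not decide the placement.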

We do have some standardization in the form of add guidelines. The most important of these is just to use common sense. We look at the site carefully to determine the best subject area, sometimes consulting reference material and each other. The category the user submits it under may not be the best category for that site. (For example, the Texas Beef Council once submitted their site under Health/Fitness and Exercise.) Occasionally, we'll e-mail the submitter and ask for more information to help us place the site correctly. We also look for content. Often people put up a page and submit it to Yahoo! before any substantial content is added. We don't want to list a site containing nothing but "under construction" signs, or a company's site with nothing but an address and phone number with the instructions to call them for more information. Users are quickly turned off by underdeveloped sites, and it reflects poorly not only on the site itself, but on Yahoo! as well.

Because our subject hierarchy is dictated by whatever we find, we often create new subcategories and develop the hierarchy as we go. This is a "bottom up" approach, as opposed to more traditional "top down" systems. In some cases we try to use the most common terms a person might look for. For example, we have a category Recreation/Hobbies/Model Airplanes. When we first received a site about model helicopters, we wondered whether to change the name of the category to Model Aircraft. However, we decided to include the helicopter site and retain the first name because Model Airplanes is the more idiomatic phrase.

We also try to maintain a consistent vocabulary in the naming of common subcategories. A good example of this is Universities. Within the directory for a particular university, we have chosen to divide the institution's individual sites thus.

Some of these subcategories and many of the sites are linked back to their appropriate subject areas. For example, the Athletics directory is linked to Recreation/Sports/College/ and named for the University. Individual departments are linked to their academic disciplines.

Such a detailed structure became necessary because universities are such a large presence on the web. As a category grows, sites may become more specific and logical subdivisions emerge, so we make them. We are currently creating similar structures for the Regional category. In the early days of the web (which, of course, is fairly recent!) the majority of sites originated from large institutions or organizations. As more people and businesses get on the web now, more of them are smaller, regional operations. A good example is Internet service providers. It doesn't make sense for us to list them all in one big alphabetic list under Business/Companies/Internet Service Providers. The typical user is only going to be interested in providers in her area, and wouldn't be too happy to have to sift through hundreds of entries. By listing these services regionally, we make sure that someone interested in Internet services in, for example, Chicago could simply go to that area in the Regional category and find local providers listed there.
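
The effect of regional listing can be sketched in a few lines. This is a minimal illustration only: the function name, the (name, region) pair layout, and the provider names are hypothetical, and in practice the placement is made by a cataloger rather than by a program.

```python
from collections import defaultdict


def group_by_region(providers):
    """Turn (name, region) pairs into {region: alphabetized list of names}."""
    by_region = defaultdict(list)
    for name, region in providers:
        by_region[region].append(name)
    return {region: sorted(names) for region, names in by_region.items()}


# Hypothetical provider names, for illustration only.
providers = [
    ("Lakeshore Net", "Chicago"),
    ("Bay Access", "San Francisco"),
    ("Windy City Online", "Chicago"),
]
listing = group_by_region(providers)
chicago = listing["Chicago"]  # only the providers a Chicago user cares about
```

A user interested in Chicago sees two entries instead of the whole alphabetic list.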

Specificity to a region is one of the main distinctions we look for when placing a site. Another important distinction is whether the site is commercial. All commercial sites are added under Business and Economy, in either Companies or Products & Services. This has to do with the nature of the material, which is usually just advertising the company or selling its products, as well as with the general attitude of the Internet community towards commercialism on the net.

Searching Yahoo!

Users can search for information in Yahoo! in two ways. One is by browsing the subject tree. The other is by keyword search. A keyword search looks in the Yahoo! database for words in the title and comment fields of individual entries and in the names of categories. The search results display has three parts. First, whole categories containing the keyword(s) are shown, then individual sites. The third section is the result of an Open Text search. A great feature of Yahoo! searching is the links to other search engines displayed at the bottom of the search results screen. Selecting one of these will not only take the user to the search engine, but will also automatically execute the same search.
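
The first two parts of the results display can be sketched as follows. This is a simplified illustration, not Yahoo!'s actual search code: the Entry class and keyword_search function are hypothetical names, and the matching rule (case-insensitive substring match, all keywords required) is an assumption, since the exact rule is not described here. The third part, the Open Text search, is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    """A listed site (hypothetical model)."""
    title: str
    comment: str
    category: str  # full path, e.g. "Recreation/Hobbies/Model Airplanes"


def keyword_search(
    keywords: List[str], categories: List[str], entries: List[Entry]
) -> Tuple[List[str], List[Entry]]:
    """Return (matching categories, matching entries), in display order."""
    words = [w.lower() for w in keywords]

    def matches(text: str) -> bool:
        # Assumption: every keyword must appear somewhere in the text.
        text = text.lower()
        return all(w in text for w in words)

    category_hits = [c for c in categories if matches(c)]
    entry_hits = [e for e in entries if matches(e.title + " " + e.comment)]
    return category_hits, entry_hits
```

Returning the category matches ahead of the individual sites follows the order in which the results display presents them.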

The Yahoo! search can also be incorporated into browsing. When users select a Yahoo! category, they have the option of searching only within that category. This is a significant aid in categories which contain a large number of sites, or when searching a subject that could be listed under more than one subcategory. The results of this search show where the keywords occur within the category. As with a regular search, Open Text results are displayed, along with links to other search engines.
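
Restricting a search to one category amounts to filtering on the category path before matching keywords. The sketch below is an assumption-laden illustration: the search_within function, the tuple layout, the slash-separated paths, and the sample entries are all hypothetical, and the match rule (case-insensitive, all keywords required) is assumed rather than documented.

```python
def search_within(category, keywords, entries):
    """Search only inside `category` and its subcategories.

    entries: list of (category_path, title, comment) tuples.
    Returns (category_path, title) pairs so the user can see where
    within the category each match occurs.
    """
    words = [w.lower() for w in keywords]
    root = category.rstrip("/")
    prefix = root + "/"
    hits = []
    for path, title, comment in entries:
        in_scope = path == root or path.startswith(prefix)
        text = (title + " " + comment).lower()
        if in_scope and all(w in text for w in words):
            hits.append((path, title))
    return hits


# Hypothetical entries, for illustration only.
entries = [
    ("Recreation/Hobbies/Model Airplanes", "RC Helicopters", "model helicopters"),
    ("Recreation/Sports", "Model Athlete", "sports"),
    ("Business and Economy/Companies", "Model Shop", "sells models"),
]
hobby_hits = search_within("Recreation/Hobbies", ["model"], entries)
```

Matching on the path prefix (with a trailing slash) keeps a category such as Recreation from accidentally including a sibling whose name merely begins with the same letters.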

Yahoo! is continually evolving, and we welcome feedback and suggestions on how we can make Yahoo! better. If you have ideas or suggestions, please feel free to use the forms in the "Write Yahoo!" section under the Info button on the Yahoo! banner, or send me e-mail at anne@yahoo.com.
