NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Site Report for the Text REtrieval Conference chapter
P. Nelson
Donna K. Harman, National Institute of Standards and Technology

The modules used for indexing are as follows:

* Parse Document: Looks for codes in the text database to locate fields such as title, headline, authors, etc. These fields can be indexed, ignored, or stored in a special database for fast access. Parse Document takes a command file which describes the structure of the text database to be indexed, and can handle a wide variety of text file formats.

* Tokenize: Divides a string of characters into words. This may include special processing for dates, phone numbers, floating-point numbers, hyphens, etc.

* Morphology: An advanced form of stemming that attempts to remove suffixes and perform spelling changes to reduce words to simpler forms found in the dictionary. For example, one morphology rule takes "babies," strips the "ies," adds "y," and produces "baby," which is found in the dictionary. Irregular forms of words are stored directly in the dictionary and are not subject to morphological analysis. Morphology is a much more accurate form of word reduction than stemming, because the dictionary can be used to validate the transformations. Morphology will not reduce proper nouns, and produces much more accurate reductions, especially for words ending in "e."

* Find Idioms: Idioms are phrases which have a meaning beyond that of the individual words added together. For example, "kangaroo court" has nothing whatsoever to do with kangaroos. Proper nouns, such as "United States," also have a meaning beyond the sum of their component parts. This module finds idioms in the text and indexes each idiom as a single unit. This prevents idioms such as "Dow Jones Industrial Average" from being confused with queries on "industrial history."
Words inside idioms can still be located individually, if desired.

* Index: The final step stores the reduced words and collected idioms into the indexes. The index is an inverted positional word index, which is conceptually similar to the index at the back of a textbook.

The speed of indexing by ConQuest has been measured at approximately 40 megabytes per hour. This evaluation was done on a Sun SPARCstation IPC (a 14-MIPS computer) with 32 megabytes of RAM. The same speed has been achieved on a 486 IBM PC running at a clock rate of 33 MHz with 16 megabytes of RAM.

Query

The query process is more complex than indexing, due to word expansion and ranking. Generally speaking, ConQuest attempts to refine and enhance the user's query. The result is then matched against the indexes to look for documents which contain similar patterns. Queries are not "understood" in the traditional sense of natural language processing. ConQuest makes no attempt to deeply understand the objects in the query, their interaction, or the user's intent. Rather, it attempts to understand the meaning of each individual word and the importance of that word. It then uses the set of meanings and their related terms (retrieved from the semantic networks) as a statistical set which is matched against document information stored in the indexes.
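The indexing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not ConQuest's implementation: the tiny dictionary, the single "ies" → "y" morphology rule, and the function names are all assumptions made for the example. It shows the two ideas the text emphasizes, dictionary-validated word reduction and an inverted positional word index.

```python
from collections import defaultdict

# Illustrative stand-in for a real lexicon (assumption, not ConQuest's dictionary).
DICTIONARY = {"baby", "court", "kangaroo"}

def reduce_word(word):
    """Apply one morphology rule of the kind described in the text:
    strip "ies", add "y", and keep the result only if the dictionary
    validates the transformation."""
    if word.endswith("ies"):
        candidate = word[:-3] + "y"
        if candidate in DICTIONARY:
            return candidate
    return word

def build_index(docs):
    """Build an inverted positional word index:
    token -> list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[reduce_word(token)].append((doc_id, pos))
    return dict(index)

index = build_index({1: "babies in kangaroo court"})
# "babies" is stored under its reduced, dictionary-validated form "baby"
```

Because the index records positions as well as document identifiers, phrase and proximity lookups (such as treating an idiom as a single unit) remain possible at query time.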
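The query-side matching just described, expanding each query word into a set of related terms and scoring documents statistically against that set, can be sketched as follows. The sketch is an assumption-laden simplification: the expansion table stands in for ConQuest's semantic networks (which it does not model), and the flat hit-count scoring is only a placeholder for a real ranking function.

```python
def score_documents(query_terms, expansions, index):
    """Expand each query term via a related-terms table (a stand-in
    for semantic-network expansion), then score documents by counting
    hits from the expanded term set in a positional index of the form
    token -> [(doc_id, position), ...]."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded.update(expansions.get(term, []))
    scores = {}
    for term in expanded:
        for doc_id, _pos in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Rank documents by descending score.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = {"court": [(1, 1), (2, 0)], "tribunal": [(2, 1)]}
ranking = score_documents(["court"], {"court": ["tribunal"]}, index)
# Document 2 matches both "court" and the related term "tribunal",
# so it ranks above document 1.
```

A production ranker would of course weight terms by importance rather than counting raw hits, as the text notes that ConQuest considers both the meaning and the importance of each word.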