SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Site Report for the Text REtrieval Conference
chapter
P. Nelson
National Institute of Standards and Technology
Donna K. Harman
The speed of queries is based on many factors, including database size, amount of
expansion, size of dictionaries, etc. Many of these parameters can be modified to achieve
the appropriate trade-off between accuracy and speed. For TREC, a 15-term query over a
1-gigabyte database took roughly 9 seconds to retrieve 500 documents on a
Sparc-IPC with 32 megabytes of RAM. Equivalent speeds have been measured on 486
IBM-PC computers running at 33 MHz with 16 megabytes of RAM.
Queries can be expanded to a very large number of terms, if desired. When the user wants
maximum recall, a 5-word query can be expanded to 200 or 300 related terms.
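The expansion described above can be illustrated with a minimal sketch. It assumes a toy semantic network stored as a dictionary of weighted links; the network contents, the function name expand_query, and the parameters max_terms and min_strength are all hypothetical and are not taken from ConQuest. Raising max_terms favors recall at the cost of speed, which is the trade-off the text describes.

```python
from collections import deque

# Hypothetical toy semantic network: term -> {related term: link strength in (0, 1]}.
TOY_NETWORK = {
    "car": {"automobile": 0.9, "vehicle": 0.7},
    "vehicle": {"truck": 0.6},
    "automobile": {"sedan": 0.8},
}


def expand_query(words, network, max_terms=300, min_strength=0.3):
    """Breadth-first expansion of query words through the network.

    Returns {term: strength}. Original words get strength 1.0; a related
    term gets the product of the link strengths on the path that first
    reached it. Expansion stops once max_terms terms have been collected.
    """
    expanded = {w: 1.0 for w in words}
    queue = deque(words)
    while queue and len(expanded) < max_terms:
        term = queue.popleft()
        for related, link in network.get(term, {}).items():
            strength = expanded[term] * link
            # Skip terms already collected and terms too weakly related.
            if related in expanded or strength < min_strength:
                continue
            expanded[related] = strength
            queue.append(related)
            if len(expanded) >= max_terms:
                break
    return expanded


if __name__ == "__main__":
    for term, strength in expand_query(["car"], TOY_NETWORK).items():
        print(f"{term}: {strength:.2f}")
```

A small max_terms keeps queries fast, while a large value admits more distant terms and so tends to raise recall.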
Ranking
Ranking and retrieval with ConQuest use a variety of statistics and criteria, which are
flexible and can be modified to handle varying requirements. The following are some of the
factors used in ranking:
Completeness: A good document should contain at least one term or related
term for each word in the original query.
Contextual Evidence: Words are supported by their related terms. If a
document contains a word and its related terms, then the word is given a
higher weight because it is surrounded by supporting evidence.
Semantic Distance: The semantic network contains information on how
closely two terms are related.
Proximity: A document is considered to be more relevant if it contains
matching terms which occur close together, preferably in the same
paragraph or sentence.
Quantity: The absolute quantity of hits in the document is also included, but
is not as strong a discriminator of relevance as the other factors.
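One way the five factors listed above could be combined is sketched below. This is an illustration under stated assumptions, not ConQuest's actual formula: the source gives no equations, so the per-factor definitions, the weights in DEFAULT_WEIGHTS, and the function names are hypothetical. The only constraint taken from the text is that quantity carries the least weight.

```python
# Hypothetical weights; the source only says quantity is the weakest factor.
DEFAULT_WEIGHTS = {
    "completeness": 0.35,
    "context": 0.20,
    "semantic": 0.20,
    "proximity": 0.20,
    "quantity": 0.05,
}


def ranking_factors(doc_tokens, query):
    """Return the five factor scores for one document, each in [0, 1].

    query maps each original word to {term: closeness}, where the word
    itself has closeness 1.0 and related terms have lower values.
    """
    present = set(doc_tokens)
    n_words = max(len(query), 1)
    complete = context = semantic = 0.0
    all_terms = set()
    for word, terms in query.items():
        all_terms.update(terms)
        matched = [t for t in terms if t in present]
        if matched:
            complete += 1                                # completeness
            semantic += max(terms[t] for t in matched)   # semantic distance
        related = [t for t in terms if t != word]
        if word in present and related:
            # Contextual evidence: share of related terms that co-occur.
            context += sum(t in present for t in related) / len(related)
    positions = [i for i, tok in enumerate(doc_tokens) if tok in all_terms]
    hits = len(positions)
    if hits >= 2:
        # Proximity: inverse of the mean gap between consecutive hits.
        mean_gap = (positions[-1] - positions[0]) / (hits - 1)
        proximity = 1.0 / mean_gap
    else:
        proximity = 0.0
    return {
        "completeness": complete / n_words,
        "context": context / n_words,
        "semantic": semantic / n_words,
        "proximity": proximity,
        "quantity": hits / (hits + 5.0),  # saturating hit count
    }


def score_document(doc_tokens, query, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the factor scores."""
    factors = ranking_factors(doc_tokens, query)
    return sum(weights[name] * value for name, value in factors.items())


if __name__ == "__main__":
    query = {
        "car": {"car": 1.0, "automobile": 0.9},
        "price": {"price": 1.0, "cost": 0.8},
    }
    doc = "the car price rose as automobile cost fell".split()
    print(round(score_document(doc, query), 3))
```

Keeping the factors separate from the weights reflects the statement that the criteria are flexible and can be modified to handle varying requirements.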
ConQuest is the first truly "concept-based" search system to operate over unrestricted
domains. If a document contains a query word together with some of its related terms, the
word is more likely to be used in the intended meaning; this is the "contextual evidence"
factor above. In this way, ConQuest can determine word meanings at query time.
Other Features
ConQuest contains a number of other search features, used to handle specific situations and
search requirements. These were not used for the Text REtrieval Conference. Some of
these features are listed below:
* Wildcards: Useful for locating misspellings in the queries or in the database. The
user can specify a word with a wildcard, such as "compute*," and then choose which of
the matching terms from the indexes should be included in the query. Query words
derived from wildcards are not otherwise expanded.
* Boolean: The traditional Boolean query mode is also available. It supports the basic
operators AND, OR, NOT, and WITHIN, as well as thesaurus expansion and
nested expressions.
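The wildcard workflow in the list above (match a pattern against the index, then let the user choose which matches to keep) might look like the following sketch. The index contents and the function names wildcard_candidates and select_terms are hypothetical; matching uses shell-style patterns from Python's standard fnmatch module, which is an assumption about the pattern syntax rather than a description of ConQuest's own matcher.

```python
from fnmatch import fnmatchcase


def wildcard_candidates(pattern, index_terms):
    """Return the index terms matching a pattern such as 'compute*', sorted."""
    return sorted(t for t in index_terms if fnmatchcase(t, pattern))


def select_terms(candidates, chosen):
    """Keep only the candidates the user chose, preserving candidate order.

    The selected terms go into the query as they are; following the text,
    they are not expanded any further.
    """
    chosen = set(chosen)
    return [t for t in candidates if t in chosen]


if __name__ == "__main__":
    # Hypothetical index vocabulary.
    index = {"compute", "computer", "computers", "computation", "commute"}
    candidates = wildcard_candidates("compute*", index)
    print(candidates)
    print(select_terms(candidates, ["computer", "computers"]))
```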