SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Text Retrieval with the TRW Fast Data Finder
chapter
M. Mettler
National Institute of Standards and Technology
Donna K. Harman
space; is no problem. The "million" subquery uses proximity and an alphanumeric
sequence pattern and will find items like the following:
a million dollar contract
a $2.3 million system
a $ 12 billion program
a $2000000 machine
a $ 2,000,000 machine
Note that the phrase "a 2,000,000 dollar award" would not be found by this definition. This
was an oversight. The winning query was then simply
[50 words -> award and computer and million]
This finds documents which contain a 50 word sliding window in which all three
subqueries match. Note how the "award" subquery that uses a 3 word sliding window can
be nested inside a query using a 50 word sliding window.
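The nesting of a 3 word window inside a 50 word window can be illustrated with a short Python sketch. It is not the Fast Data Finder's implementation; it only mimics, in software, the window semantics described above. The function names, the prefix-matching rule, and the stand-in definitions of the three subqueries are assumptions made for illustration, not the definitions used in the experiment.

```python
from typing import Callable, List

# A matcher takes a window of tokens and says whether it is satisfied.
Matcher = Callable[[List[str]], bool]


def within_window(tokens: List[str], size: int, matchers: List[Matcher]) -> bool:
    """True if some run of `size` consecutive tokens satisfies every matcher."""
    last_start = max(len(tokens) - size, 0)
    for start in range(last_start + 1):
        window = tokens[start:start + size]
        if all(matcher(window) for matcher in matchers):
            return True
    return False


def has_prefix(prefix: str) -> Matcher:
    """Matcher satisfied when any token in the window starts with `prefix`."""
    return lambda window: any(token.startswith(prefix) for token in window)


def nested(size: int, matchers: List[Matcher]) -> Matcher:
    """Wrap a smaller sliding window so it can be used inside a larger one."""
    return lambda window: within_window(window, size, matchers)


# Hypothetical stand-ins for the three subqueries (assumptions, not the originals).
award = nested(3, [has_prefix("award"), has_prefix("contract")])
computer = has_prefix("comput")
million = has_prefix("million")


def winning_query(text: str) -> bool:
    """Approximates [50 words -> award and computer and million]."""
    tokens = text.lower().split()
    return within_window(tokens, 50, [award, computer, million])
```

A document in which the three concepts occur close together matches, while one in which the award phrase and the computer reference are more than 50 words apart does not.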
4.2 Example of Bad Performance - Topic 36
Topic 36 was to find documents discussing how rewritable optical disks work.
To be relevant, a document must describe how rewritable
optical disk technology works at length and in significant and
comprehensive technical detail.
This topic was particularly challenging because the topic narrative describes attributes the
documents must have rather than specific concepts or keywords. We started by defining a
subquery to find documents mentioning rewritable optical disks.
define optical_disk
[10 word -> "rewrit" and "optical [disk | drive | technolog]"] end
To find documents that describe the technology "at length", we wrote a subquery to find
places where there were at least 5000 characters between the <TEXT> marker and the
</TEXT> marker.
define LONG TEXT [5000 char -> no TEXTEND] end
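The idea behind this subquery, a 5000 character window that contains no end-of-text marker, can be sketched in Python as follows. This is an approximation under stated assumptions: the function name is hypothetical, the end marker is taken to be the literal string `</TEXT>`, and the check measures marker-free stretches rather than sliding a window character by character.

```python
def has_long_text(doc: str, min_chars: int = 5000, end_marker: str = "</TEXT>") -> bool:
    """True if some stretch of at least `min_chars` characters has no end marker.

    Splitting on the marker yields the marker-free stretches; any stretch that
    is long enough contains a window of `min_chars` characters without it.
    """
    return any(len(piece) >= min_chars for piece in doc.split(end_marker))
```

Two short documents placed back to back do not match, because the end marker of the first interrupts every 5000 character window.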
To find documents that contained "significant and comprehensive technical detail" we
manually extracted a list of keywords (Table II) and required that documents contain
at least 10 of these terms.
The tightest query (intended for high precision) was
[1 document -> optical_disk and LONG and 30+ <technical terms> ]
The loosest (intended for high recall) was
[1 document -> optical_disk and 10+ <technical terms> ]
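The difference between the tight and loose queries comes down to two knobs: the keyword threshold and whether the long-text condition is required. A minimal Python sketch of that logic follows. All names are hypothetical, the quoted patterns are treated as word prefixes (an assumption about the query language), and the keyword list must be supplied by the caller because Table II is not reproduced here.

```python
import re
from typing import List

# Assumed reading of the optical disk subquery: "rewrit" and
# "optical disk|drive|technolog" inside one 10 word window.
REWRIT = re.compile(r"\brewrit")
OPTICAL = re.compile(r"\boptical (disk|drive|technolog)")


def mentions_optical_disk(tokens: List[str], size: int = 10) -> bool:
    """True if some `size`-word window contains both patterns."""
    for start in range(max(len(tokens) - size, 0) + 1):
        window = " ".join(tokens[start:start + size])
        if REWRIT.search(window) and OPTICAL.search(window):
            return True
    return False


def is_long(text: str, min_chars: int = 5000, end_marker: str = "</text>") -> bool:
    """True if some marker-free stretch has at least `min_chars` characters."""
    return any(len(piece) >= min_chars for piece in text.split(end_marker))


def count_terms(text: str, terms: List[str]) -> int:
    """Number of distinct keywords that occur somewhere in the text."""
    return sum(1 for term in terms if term in text)


def topic36(doc: str, terms: List[str], min_terms: int, require_long: bool) -> bool:
    """Document-level AND of the subqueries; thresholds are parameters."""
    text = doc.lower()
    if not mentions_optical_disk(text.split()):
        return False
    if require_long and not is_long(text):
        return False
    return count_terms(text, terms) >= min_terms
```

Under these assumptions the tight query corresponds to `topic36(doc, terms, min_terms=30, require_long=True)` and the loose query to `topic36(doc, terms, min_terms=10, require_long=False)`.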