NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Knowledge-Based Searching with TOPIC
J. Lehman
C. Reid
National Institute of Standards and Technology
D. K. Harman
No markup language (SGML) interpreter was used
during data preparation, and the optional alphabetical
word list (used only for display) and typographical error
index (used almost exclusively for OCR'd data) were not
employed. Special indices such as correlated terms and
paragraph/sentence positioning were not produced. As
the fuzzy proximity operator was used in the tests, only
a word position index was produced. No document was
divided into logical or arbitrary sections for processing or
search result enhancement, although that approach is
used in virtually all non-newswire Verity installations.
The purpose of logical division (a forerunner of the
intelligence available in a standard markup language) is
to create domain-specific logical documents, and
therefore to reduce the impact of larger, multi-subject
documents on results (they would appear in search
results simply because of their breadth of words).
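The word-position index mentioned above is what makes a fuzzy proximity operator workable: the closer two query terms occur in a document, the higher that document can score. The following is a minimal illustrative sketch of that idea, not Verity's actual data structures or scoring; all names are our own.

```python
# Sketch: a word-position index supporting a fuzzy proximity score.
# Illustrative only -- not Verity's implementation.
from collections import defaultdict

def build_position_index(docs):
    """Map term -> doc_id -> sorted list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def fuzzy_near(index, term_a, term_b, doc_id):
    """Score inversely proportional to the closest gap between the terms;
    adjacent words score 1.0, farther pairs score lower, absence scores 0."""
    pos_a = index[term_a].get(doc_id, [])
    pos_b = index[term_b].get(doc_id, [])
    if not pos_a or not pos_b:
        return 0.0
    gap = min(abs(a - b) for a in pos_a for b in pos_b)
    return 1.0 / max(gap, 1)

idx = build_position_index({1: "tax reform bill stalls",
                            2: "reform of the estate tax"})
```

Under this sketch, "tax" and "reform" adjacent in document 1 score 1.0, while the same pair four words apart in document 2 scores 0.25.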
3.2 TOPIC CONSTRUCTION
Verity personnel manually constructed the search rules
from the subject area descriptions and the training data.
No rule developer was identified or chosen as a subject
matter expert, and for some of the contributors this
was their first experience using Topic. [Search rule
libraries are created by approximately 6% of Topic's user
population; the remainder of Topic's users employ
the topics developed by others.] On average, the
TREC-2 volunteers were novices with the
Topic product, particularly in search rule
development. Volunteers were not encouraged to use specific
features of the product, and in at least one case,
inadequate communication produced potentially
inaccurate search expectations. As search rules were
interactively developed, the rule evidence was
automatically indexed for repeated use of the rule. The
twenty volunteers each produced between 3 and 8
retrospective and routing queries. The time spent on
individual query development and result production
ranged from fifteen minutes to eight hours over
a several-week period. The average time to produce the
TREC-2 result, obtained from interviewing the
volunteers, was approximately one hour.
3.3 EXPERIMENT PERFORMANCE
Typical response time performance on the searches was
two seconds per 8000-document partition, or
approximately two minutes to search the entire
collection. A single term, indexed as rule evidence, was
used to search the entire collection, and the 1.1 million
document collection was searched in 21 seconds.
For routing queries, the score threshold was set to zero;
any document containing evidence entered the routing
result list.
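With the threshold at zero, routing degenerates into a pure evidence filter: any document with a nonzero score passes, ranked best-first. A minimal sketch of that behavior, with illustrative names:

```python
# Sketch of zero-threshold routing: any document scoring above the
# threshold enters the routing result list. Illustrative names only.
def route(scored_docs, threshold=0.0):
    """Return (doc_id, score) pairs above the threshold, best-first."""
    hits = [(doc, s) for doc, s in scored_docs if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

scored = [("d1", 0.82), ("d2", 0.0), ("d3", 0.05)]
```

Here a document with no matching evidence (score 0) is excluded, while even the weakest positive evidence admits a document to the list.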
3.4 ANALYSIS OF OFFICIAL RESULTS
The post hoc analysis of Topic's TREC-2 results
generally found that the Topic system performed well.
When compared with other manual systems, the scores
are amongst the best. In the few cases where Topic
appeared to fail, we have generally been able to identify
easily correctable deficiencies that, had they been noticed
during the experiment proper, would have resulted in
superior performance by Topic in TREC-2.
Based on our analysis, we believe that the prospects for
TREC-3 look very bright.
Our analysis of selected results from our TREC-2
submissions focuses mainly on the "failure cases" since
these are most likely to give us insights into how to
improve Topic's (and users') performance in future TREC
experiments. This also allows us to investigate whether
there are any fundamental issues with using Topic to
model the information need statements used in TREC.
We analyzed two routing and three ad-hoc topics in
detail. Our summary follows.
The following general observations applied to all
searches:
-Ad-hoc searches were submitted against all three disks,
which generally produced poorer-quality results, as
documents from disk three appeared in some search
results.1
-Field value evidence was not used, and in some
domains/subject areas, domain knowledge about the
sources of information would favor (rank higher) sources
with the appropriate use of terminology (e.g., business
sources for financial performance, or foreign datelines,
which have a higher likelihood of describing prominent
foreign persons/activity, as in topics 66 or 121).
-The queries which attempted to use nomenclature
with hyphens (e.g. M-1) failed to return an exact match,
as the hyphen was not included as an indexed character.
-The fuzzy proximity (near) operator was undocumented;
only one volunteer used it, and other users expected
sentence/paragraph proximity in their searches. The
index did not contain sentence/paragraph positional
data, and all uses of sentence or paragraph operators
produced erroneous results because the search arbitrarily
assigned sentence and paragraph boundaries.
1Reprocessing the ad-hoc searches against only disks 1 and
2 produced a numeric result improvement of 0-70
percent, with a few changes from under the median to over
the median.
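The hyphen failure noted above is a tokenization issue: if "-" is not among the indexed characters, a term like M-1 is split into separate tokens at index time, so a lookup for the literal hyphenated form finds no matching index entry. A sketch under that assumption (illustrative code, not Verity's indexer):

```python
# Sketch: why an exact lookup for "M-1" fails when the hyphen
# is not an indexed character. Illustrative only.
import re

def tokenize(text):
    # The hyphen is NOT in the indexed character set,
    # so it splits tokens at index time.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(docs):
    """Map each token to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index.setdefault(tok, set()).add(doc_id)
    return index

def lookup_exact(index, term):
    # The literal form "m-1" was never stored as a single token,
    # so this lookup returns the empty set.
    return index.get(term.lower(), set())
```

Given a document "the M-1 tank", the index contains "m" and "1" as separate tokens, and an exact lookup for "M-1" returns nothing while "tank" matches normally.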