the cosine measure's document normalization factor, we find
that it is possible to implement document-term weighting
based only on information in the structured posting file.
In general, term weights (both in queries and documents)
may be computed based on the following factors [5], all of
which are available at run-time in our architecture.
* Term Frequency. This is the number of times a term
occurs in the database. This information is retained in the
lexicon.
* Document-Term Frequency. This is the number of documents
in which a term occurs. It may be computed at run-time by
traversing the posting list for the term, counting initial
occurrences of the term (the first posting for each distinct
document).
* In-Document Frequency. This is the number of times a
term occurs in a given document. This may be computed at
run-time by traversing the posting list for the term, counting
the number of occurrences of the term having the same
document coordinate.
* Maximum In-Document Frequency. This is the maximum
number of times a given term occurs in any document.
It may be computed directly from the in-document
frequencies.
Most conventional weighting schemes (e.g. tf.idf) may be
computed from these run-time factors. The advantage of
doing these computations at run-time is that it eliminates the
need to incorporate collection-specific information into the
database at indexing time. This is important for dynamic
collections as well as for distributed databases, as described
above.
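To make this concrete, here is a minimal sketch (in C) of deriving all four factors at query time from a single term's posting list. The Posting layout, the collection size, and the log-based idf formula are illustrative assumptions, not the test bed's actual structures:

    /* Sketch: recovering run-time weighting factors from one term's
     * posting list.  Postings are (document, position) pairs sorted by
     * document; layout and collection size are illustrative only. */
    #include <stdio.h>
    #include <math.h>

    typedef struct { int doc; int pos; } Posting;

    int main(void) {
        /* One term's postings: 6 occurrences spread over 3 documents. */
        Posting p[] = {{1,4},{1,17},{3,2},{3,9},{3,30},{7,5}};
        int n = sizeof p / sizeof p[0];   /* term frequency (lexicon) */
        int ndocs = 1000;                 /* collection size (assumed) */

        /* Document-term frequency: count initial occurrences, i.e. the
         * first posting seen for each distinct document coordinate. */
        int df = 0;
        for (int i = 0; i < n; i++)
            if (i == 0 || p[i].doc != p[i-1].doc)
                df++;
        double idf = log((double)ndocs / df);

        /* In-document frequency: the run length of postings sharing a
         * document coordinate; combine with idf for a tf.idf weight. */
        int maxtf = 0;
        for (int i = 0; i < n; ) {
            int doc = p[i].doc, tf = 0;
            while (i < n && p[i].doc == doc) { tf++; i++; }
            if (tf > maxtf) maxtf = tf;   /* maximum in-document freq. */
            printf("doc %d: tf=%d  tf.idf=%.4f\n", doc, tf, tf * idf);
        }
        printf("term freq=%d  max in-document freq=%d\n", n, maxtf);
        return 0;
    }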
The difficulty with these methods when applied to the
cosine norm is that the total-document-weight term cannot be
conveniently computed at run-time using an inverted file. It
must, therefore, be computed at index time and stored sepa-
rately (e.g. in the document-specific information data struc-
ture).
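For concreteness, the standard cosine normalization divides each document-term weight by the total document weight:

\[
w_{d,t} = \frac{\mathrm{tf}_{d,t}\cdot \mathrm{idf}_t}{\sqrt{\sum_{t' \in d}\left(\mathrm{tf}_{d,t'}\cdot \mathrm{idf}_{t'}\right)^2}}
\]

The sum in the denominator ranges over every term in document d, not just the query terms, so it cannot be recovered from the posting lists traversed during a search; hence the need to precompute and store it.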
There are two solutions to this difficulty. The first is to
insist on using the cosine norm, accepting the difficulties that
this implies. The second is to look for alternatives to the
cosine norm. We believe that this is the more promising
approach, but this is in the realm of work-not-yet-completed.
In any event, the value of the cosine norm for large structured
documents has not been established at this time.
IV. RETRIEVAL EXPERIMENTS:
Most of the time (about 3 to 4 person-months) for our
TREC participation was spent on building the test bed.
However, we did finish some runs with automatically constructed
routing and adhoc queries, for which we report the results here.
The experiments were done using the entire collection. We
compare words and phrases, mixed case vs. lower case, and
also explore document length normalization and proximity
measures.
A. Retrieval for Routing Topics:
All queries were constructed and optimized automatically.
The terms consisted of words (ignoring stop words) as well as
all phrases consisting of adjacent words. Numbers were
ignored and case was preserved. No stemming or query
expansion (e.g. with thesauri) was attempted.
Routing queries were constructed by looking at each word
and (adjacent) phrase from the whole text of the topic
templates and determining a weight based on the number of
relevant documents present in the first 100 documents retrieved
using just that term. An initial weighted query was constructed
for each topic by this process. Then each topic query was
optimized by choosing per-topic thresholds for the weights and
rejecting all weighted query terms below the threshold. The
optimum threshold for each topic was chosen by straightforward
incremental search. Table 1 shows the
results for the routing experiments.

Table 1: Routing Queries

    Method                                           Precision at  Average
                                                     100 docs      precision
    tmc6-routing-words-phrases                        .3396         .2553
    tmc7-routing-words                                .2920         .2045
    tmc6-routing-words-phrases-ip                     .3750         .2716
    tmc6-routing-words-phrases-doc-length-sent-prox   .3782         .2792
    tmc6-routing-words-phrases-ip-sent-prox           .3856         .3344

Routing queries with both
weighted words and phrases (queryid tmc6) did better than
queries using just words (queryid tmc7). Using the same
(official) queries, but adding sentence-level proximity
(sent-prox), document length scaling (doc-length), and inverse
weights based on which paragraph the term appears in (ip),
seems to improve results (see the next section for more
details about these techniques).
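A minimal sketch of the incremental threshold search described above, with invented term weights and a placeholder scoring function standing in for a full retrieval run against the training judgments:

    /* Sketch of the per-topic threshold search: each distinct term
     * weight is tried as a cutoff, and the cutoff giving the best
     * retrieval score is kept.  All values here are illustrative. */
    #include <stdio.h>

    /* Placeholder: a real implementation would run the thresholded
     * query and return, e.g., precision at 100 documents. */
    static double evaluate_query(const double *w, int n, double cutoff) {
        int kept = 0;
        for (int i = 0; i < n; i++)
            if (w[i] >= cutoff) kept++;
        return (double)kept / n;   /* stand-in score only */
    }

    int main(void) {
        double w[] = {0.90, 0.75, 0.40, 0.33, 0.20, 0.05}; /* weights */
        int n = sizeof w / sizeof w[0];
        double best_cutoff = w[0], best_score = -1.0;

        /* Straightforward incremental search over candidate cutoffs;
         * terms weighted below the cutoff are dropped from the query. */
        for (int i = 0; i < n; i++) {
            double s = evaluate_query(w, n, w[i]);
            if (s > best_score) { best_score = s; best_cutoff = w[i]; }
        }
        printf("best threshold %.2f (score %.4f)\n", best_cutoff, best_score);
        return 0;
    }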
B. Retrieval for Adhoc Topics:
Adhoc queries were automatically constructed by using
words and phrases from different sections of the topic tem-
plates and using tf.idf weights (as derived from the training
collection). The "best" sections for the new topics were chosen
by experimenting with the training topics. Queries derived
from the description/concept sections were used for most of
the experiments. A threshold for weights was used to select
terms for the final queries. Table 2 shows the results for the
adhoc queries.
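As a rough sketch of this adhoc construction, assuming invented topic terms, training-collection statistics, and weight threshold (none of these values are from the paper):

    /* Sketch of adhoc query construction: weight each candidate topic
     * term by tf.idf using training-collection statistics, then keep
     * only terms above a weight threshold.  Values are illustrative. */
    #include <stdio.h>
    #include <math.h>

    typedef struct {
        const char *term;  /* word or adjacent-word phrase */
        int tf;            /* occurrences in the topic sections used */
        int df;            /* document frequency in training collection */
    } TopicTerm;

    int main(void) {
        int ndocs = 750000;      /* training collection size (assumed) */
        double threshold = 5.0;  /* tuned by experimenting on training topics */
        TopicTerm t[] = {
            {"oil", 4, 52000},
            {"spill", 3, 4100},
            {"oil spill", 3, 900},
            {"cleanup", 1, 15000},
        };
        int n = sizeof t / sizeof t[0];

        /* Keep a term in the final query only if its weight clears the
         * threshold; "cleanup" falls below it in this example. */
        for (int i = 0; i < n; i++) {
            double w = t[i].tf * log((double)ndocs / t[i].df);
            if (w >= threshold)
                printf("keep \"%s\"  weight %.2f\n", t[i].term, w);
        }
        return 0;
    }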