SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Single Language Evaluation of a Multi-lingual Text Retrieval System
chapter
T. Dunning
M. Davis
National Institute of Standards and Technology
Donna K. Harman
4. Primitive Document Access
The posting vector for any particular word can be retrieved from the database with a single
procedure call. The posting vectors obtained in this way can be combined using a set of
functions which perform the customary boolean operations, as well as functions which find
adjacent words or find specified words that appear within a specified neighborhood of each
other. In addition, the basic query routines include a procedure which, given a document
number from within a posting, will read the contents of this document into memory and return
the resulting string.
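The combining functions described above can be illustrated with a minimal sketch. It assumes a posting vector is a mapping from document number to the sorted word positions of the term in that document; this representation and all function names here are hypothetical and are not taken from the CRL system.

```python
from typing import Dict, List

# Assumed representation (not from the paper): a posting vector maps a
# document number to the sorted word positions of the term in that document.
Posting = Dict[int, List[int]]


def and_postings(a: Posting, b: Posting) -> Posting:
    """Boolean AND: documents containing both terms, with merged positions."""
    return {d: sorted(set(a[d]) | set(b[d])) for d in a.keys() & b.keys()}


def or_postings(a: Posting, b: Posting) -> Posting:
    """Boolean OR: documents containing either term, with merged positions."""
    out = {d: list(p) for d, p in a.items()}
    for d, p in b.items():
        out[d] = sorted(set(out.get(d, [])) | set(p))
    return out


def and_not_postings(a: Posting, b: Posting) -> Posting:
    """Boolean AND NOT: documents containing the first term but not the second."""
    return {d: list(p) for d, p in a.items() if d not in b}


def adjacent_postings(a: Posting, b: Posting) -> Posting:
    """Documents where the second term immediately follows the first.

    The result keeps the positions of the first term in each matching pair.
    """
    out: Posting = {}
    for d in a.keys() & b.keys():
        following = set(b[d])
        hits = [p for p in a[d] if p + 1 in following]
        if hits:
            out[d] = hits
    return out


def near_postings(a: Posting, b: Posting, window: int) -> Posting:
    """Documents where the two terms occur within `window` words of each other.

    The result keeps the positions of the first term that have a match.
    """
    out: Posting = {}
    for d in a.keys() & b.keys():
        hits = [p for p in a[d] if any(abs(q - p) <= window for q in b[d])]
        if hits:
            out[d] = hits
    return out
```

Because every function takes and returns the same structure, the results can be nested freely, for example `and_postings(adjacent_postings(x, y), z)`.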
On top of this primitive layer, another level is built which supports scoring of documents
based on a query. Given a vector of strings which represent query terms, procedures in this
level return a sorted score vector whose entries pair documents with their scores. All scoring is
done based on inverse document frequency weighting. We plan to convert to a system based on
Bayesian decision rules if time permits in this original contract.
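A scoring layer of this kind might look roughly like the following sketch. The exact weighting formula is not stated here, so the sketch assumes the common form tf * log(N / df); the function name, argument names, and the term-to-counts data layout are likewise illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def idf_score_vector(
    query_terms: List[str],
    postings: Dict[str, Dict[int, int]],
    n_docs: int,
) -> List[Tuple[int, float]]:
    """Return (document, score) pairs sorted by descending score.

    `postings` maps a term to {document number: term frequency}.
    Assumed weighting (not from the paper): tf * log(n_docs / df).
    """
    scores: Dict[int, float] = {}
    for term in query_terms:
        docs = postings.get(term, {})
        if not docs:
            continue  # a term absent from the collection contributes nothing
        idf = math.log(n_docs / len(docs))
        for doc, tf in docs.items():
            scores[doc] = scores.get(doc, 0.0) + tf * idf
    # Sort by score, highest first; break ties by document number.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```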
5. Query Formulation
In the CRL TREC effort, conversion from topics to queries was done entirely automatically.
The retrieval topics were reduced to lists of one-, two-, three- and four-word phrases. Each
such phrase was included in the query if it met a statistical test based on generalized likelihood
ratios. On average, this resulted in about 80 terms being retained for the retrieval.
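As an illustration of a test of this kind, the sketch below computes the log-likelihood ratio statistic G^2 for a 2x2 contingency table. The exact table, statistic, and cutoff used in the CRL system are not given here, so everything in the sketch is an assumption: the table is taken to compare how often a phrase occurs in the topic text against how often it occurs in a reference collection, and the default threshold of 10.83 is simply the chi-square critical value for one degree of freedom at p = 0.001.

```python
import math


def llr_2x2(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio statistic for a 2x2 contingency table.

    G^2 = 2 * sum_ij k_ij * ln(k_ij * N / (row_i * col_j)),
    where cells with k_ij == 0 contribute zero.

    Assumed layout (illustrative only):
      k11 = occurrences of the phrase in the topic text
      k12 = occurrences of all other phrases in the topic text
      k21 = occurrences of the phrase in the reference collection
      k22 = occurrences of all other phrases in the reference collection
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    total = 0.0
    for k, row, col in cells:
        if k > 0:  # a non-zero cell implies non-zero row and column sums
            total += k * math.log(k * n / (row * col))
    return 2.0 * total


def keep_phrase(k11: int, k12: int, k21: int, k22: int,
                threshold: float = 10.83) -> bool:
    """Retain the phrase when the statistic exceeds the (assumed) threshold."""
    return llr_2x2(k11, k12, k21, k22) > threshold
```

A table whose rows have identical proportions gives a statistic of zero, so the phrase is dropped; a phrase that is far more frequent in the topic than in the collection gives a large value and is kept.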
Had time permitted, we would have included a manual relevance feedback operation in the
process of query formulation. We expect that this would have greatly improved the utility of
the multi-word phrases.
6. Document Scoring
Retrieval was accomplished by using the or_posting procedure to compute a posting
vector which contained references to all documents which potentially had a non-zero score.
These documents were then scored using a conventional inverse document frequency weighting
scheme and the results were sorted. Only the first 200 documents in the sorted document vector
were printed.
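The retrieval pipeline just described (union of postings, scoring, sorting, truncation) can be sketched as follows. The function and parameter names are hypothetical, postings are simplified to sets of document numbers, and the scoring function is passed in by the caller so that the sketch makes no claim about the weighting actually used.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Set, Tuple


def retrieve(
    query_terms: Iterable[str],
    postings: Dict[str, Set[int]],
    score_doc: Callable[[int], float],
    limit: int = 200,
) -> List[Tuple[int, float]]:
    """Score every document matching at least one query term; keep the top `limit`."""
    # OR together the postings of all query terms: only these documents
    # can receive a non-zero score.
    candidates: Set[int] = set()
    for term in query_terms:
        candidates |= postings.get(term, set())
    # nlargest returns the best `limit` (score, document) pairs, highest first.
    top = heapq.nlargest(limit, ((score_doc(d), d) for d in candidates))
    return [(d, s) for s, d in top]
```

Restricting scoring to the union avoids touching documents that share no term with the query, which is the purpose of computing the combined posting vector first.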
In the submitted results, the very significant difference in average
document length was partially compensated for by accumulating scores in quadrature rather than
simply summing them. There are much better methods available to normalize for document
length in a principled way.
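To make the effect of quadrature accumulation concrete, the sketch below reads "in quadrature" as taking the square root of the sum of squared term weights; this reading, the function names, and the example weights are assumptions made for illustration.

```python
import math
from typing import Iterable


def linear_score(weights: Iterable[float]) -> float:
    """Plain accumulation: the sum of the matching term weights."""
    return sum(weights)


def quadrature_score(weights: Iterable[float]) -> float:
    """Accumulation in quadrature: the square root of the sum of squares."""
    return math.sqrt(sum(w * w for w in weights))


# A short document matching one highly weighted term, and a long document
# matching four weakly weighted terms (illustrative numbers only).
short_doc = [3.0]
long_doc = [1.0, 1.0, 1.0, 1.0]
```

With these weights, plain summation ranks the long document first (4.0 against 3.0), while quadrature ranks the short document first (3.0 against 2.0). A score built from many small contributions grows only with the square root of their number, which damps the advantage a long document gains merely by matching more terms.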