NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Proximity-Correlation for Document Ranking: The PARA Group's TREC Experiment
M. Zimmerman
National Institute of Standards and Technology
Donna K. Harman
the TREC topic statements plus a few obvious synonyms which occurred to me as I typed in the queries. I
converted all characters in the TREC document set to uppercase before processing, so my regexps ignored
case issues.
To handle the proximity (boolean-AND-like) requirement among separate terms, and to generate
estimated relevances, I invented a very simple scoring system. Every time a line in a document matched
one of my regexps, I added an arbitrary 5 points to that regexp's score. A line in the document got a score
equal to the product of the regexp scores for that query. When moving on to the next line of a document, I
multiplied all regexp scores by 0.9, to make them fade away with a characteristic length scale of 10
lines (= 1/(1-0.9)). The estimated relevance for a document as a whole was the maximum relevance of
any line in that document.
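The scoring scheme just described can be sketched as follows (a reconstruction in Python, not the original Gawk code; function and variable names are illustrative):

```python
import re

def score_document(lines, regexps, points=5.0, decay=0.9):
    """Fading proximity score: each regexp match on a line adds
    `points` to that regexp's running score; a line's relevance is
    the product of all running scores (so a term never seen zeroes
    the product, giving the AND-like behavior); all scores decay
    by `decay` when moving to the next line."""
    scores = [0.0] * len(regexps)
    best = 0.0
    for line in lines:
        for i, rx in enumerate(regexps):
            if rx.search(line):
                scores[i] += points
        rel = 1.0
        for s in scores:
            rel *= s
        best = max(best, rel)                 # document score = best line
        scores = [s * decay for s in scores]  # fade before the next line
    return best
```

With the 0.9 decay, two terms ten lines apart still earn roughly a third of the score they would earn on the same line.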
For terms which were to be negatively weighted (boolean-NOT-like), as in TREC Topic 026, instead of
multiplying in the regexp's score to get a line's relevance, I multiplied in 5 minus that score. I
experimented briefly with different weights and relevance-length-scales for different terms, but
decided that there was neither sufficient time nor much benefit to be gained that way, and so I settled
on the 5-points-per-instance, 0.9-degradation-rate standard.
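Under the same illustrative assumptions, the negated-term variant only changes the per-term factor in the line's product:

```python
import re

def score_with_negation(lines, regexps, negated, points=5.0, decay=0.9):
    """As in the basic scheme, but for regexps flagged in `negated`
    the line relevance multiplies in (points - score) rather than
    score, so recent matches of an unwanted term suppress the line."""
    scores = [0.0] * len(regexps)
    best = 0.0
    for line in lines:
        for i, rx in enumerate(regexps):
            if rx.search(line):
                scores[i] += points
        rel = 1.0
        for i, s in enumerate(scores):
            rel *= (points - s) if negated[i] else s
        best = max(best, rel)
        scores = [s * decay for s in scores]
    return best
```

Note that an absent negated term contributes its full (5 - 0) credit, while a fresh match of it zeroes the line.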
Implementation
I performed my TREC experiments over a period of a few weeks in background on a NeXT workstation
(an old Cube), using the (rather slow) built-in writeable optical disk to hold data copied from the
TREC CD-ROMs. On the average, each TREC query took about one-third of a second per document to
execute. My implementation ran 10 queries at a time and generated a line of output for each document,
listing the document ID followed by (the integer part of) its estimated relevance, in 10 columns. I then
used additional UNIX shell scripts containing comm and sort and head to tabulate the 200 highest-
ranked documents for each topic. A final Gawk program converted my results into the standard TREC
format. The total wall-clock computation time which I used was about 11 days for each CD-ROM of
data. Late in the process of scoring documents, I discovered that a bug in my programs caused the last
document in a data set not to be ranked - but I had no time to fix the error, and it probably did not
affect my overall results significantly.
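The sort-and-head part of such a tabulation might look like the following (the file names and the two-column ID/score layout are assumptions for illustration, not the original scripts):

```shell
# scores.txt: one "DOCID score" line per document for one topic (assumed layout)
# sort numerically by the score column, descending, and keep the top 200
sort -k2,2nr scores.txt | head -n 200 > top200.txt
```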
Further Work
I believe that the proximity-correlation approach used in my TREC experiment has promise for other
relevance-ranking tasks. Almost certainly, the use of a compiled language (and perhaps a simpler,
faster pattern-matching facility) rather than Gawk would result in a speed increase of an order of
magnitude or more. Much higher speeds should be achievable by inverted-index methods. User
feedback might be valuable in modifying the default settings for term weights and relevance-lengths.
A thesaurus option to automatically generate synonyms could help automate the query-creation process.
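For comparison, the inverted-index idea mentioned above amounts to precomputing a term-to-documents map, so that a query touches only candidate documents rather than scanning every line of every document (a generic sketch, not part of the original experiment):

```python
from collections import defaultdict

def build_index(docs):
    """Map each whitespace-delimited term (uppercased, matching the
    experiment's case-folding) to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.upper().split():
            index[term].add(doc_id)
    return index

def candidates(index, terms):
    """Documents containing every query term (boolean-AND semantics);
    only these would need the line-by-line proximity scoring pass."""
    sets = [index.get(t.upper(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()
```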