NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Proximity-Correlation for Document Ranking: The PARA Group's TREC Experiment
M. Zimmermann
National Institute of Standards and Technology
Donna K. Harman
PROXIMITY-CORRELATION FOR DOCUMENT RANKING:
The PARA Group's TREC Experiment
by Mark Zimmermann
P.O.Box 598
Kensington, Maryland 20895-0598
USA
(zimm@alumni.caltech.edu)

Abstract
The PARA Group's simple document routing method achieved surprisingly good results in the first
TREC experiment. The system works by awarding points to documents with many query terms in near
proximity to each other. The current implementation of this system is described in general terms; this
note is followed by a listing of the complete source code used to rank documents for the 50 TREC test
questions, written in Awk. Possible improvements and directions for further research are suggested.
Acknowledgements
The PARA Group is a loose affiliation of people with common interests in free-text information
retrieval, hypermedia, and free software. (For further information, or to join, send a message to "para-
request@cs.cmu.edu" via the Internet.) For the TREC relevance-ranking document routing test, I
consulted with other PARA Group members and implemented concepts that we discussed communally. I
would like to thank Dr. Donna Harman, NIST, for allowing me to participate in TREC and for
encouraging me to write up my results. I also thank the members of the PARA Group for their helpful
advice. I made extensive use of, and am grateful for, software from the Free Software Foundation - in
particular, the GNU Emacs text editing system, and the Gawk version of the Awk programming
language. (Disclaimer: My Employer Is In No Way Responsible For This Work!)
Approach
I began with the subjective observation that, in my personal experience, the documents which I like
most tend to have local clusters of "interesting" words. I also began with the constraint that I had only
a few hours of programming time to invest in my TREC experiment; contrariwise, I had a NeXT
workstation with an optical disk and plenty of unused background CPU cycles available. This led me to
try a quick-and-dirty approach using the regular expression pattern-matching and other programming
facilities of Gawk, a free version of the Awk language. I decided to work on the document routing task
using the full TREC data set.
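To make the flavor of this quick-and-dirty Awk approach concrete, the sketch below scores a tiny hypothetical document by counting how many query-term regexps match each line and awarding a quadratic bonus when several terms co-occur. It is an illustration only, not the PARA Group's actual program (which appears in the source listing at the end of this note); the sample file, term patterns, and hits-squared bonus are all assumptions made for the example.

```shell
# Hypothetical sketch of proximity-correlation scoring in Awk (not the
# actual PARA Group code): count how many query-term regexps match each
# line, and add hits-squared points so that lines where several terms
# fall close together score disproportionately well.
cat > /tmp/sample_doc.txt <<'EOF'
The pending antitrust case drew wide attention.
Weather was mild and pleasant today.
EOF

awk '
{
  line = toupper($0)          # case-fold so the patterns can stay uppercase
  hits = 0
  if (line ~ /TRUST/) hits++  # matches "antitrust" as a substring
  if (line ~ /CASE/)  hits++
  if (line ~ /PEND/)  hits++  # matches "pending"
  score += hits * hits        # quadratic bonus for co-occurring terms
}
END { print score }           # per-document total: 9 for line 1, 0 for line 2
' /tmp/sample_doc.txt
```

Run over a whole collection, per-document totals like this one would simply be sorted to produce a routing ranking.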
I took the 50 TREC questions and manually constructed simple regular expressions ("regexps") for each
of the key terms in them. Thus, for Topic 001, on pending antitrust cases, I had /TRUST/, /CASE/,
and /PEND/; for Topic 002, acquisitions or mergers involving US and foreign companies, I came up with
/ACQUISITION|BUYOUT|MERGER|TAKEOVER/, etc. For equivalent terms which were implicitly
boolean-OR'd together, I wrote a single regexp with "|" joining the words. I spent approximately two
minutes per TREC query writing these patterns, a total of about two hours, and used words contained in