NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology - Donna K. Harman

PROXIMITY-CORRELATION FOR DOCUMENT RANKING:
The PARA Group's TREC Experiment

by Mark Zimmermann
P.O. Box 598
Kensington, Maryland 20895-0598 USA
(zimm@alumni.caltech.edu)

Abstract

The PARA Group's simple document routing method achieved surprisingly good results in the first TREC experiment. The system works by awarding points to documents with many query terms in near proximity to each other. The current implementation of this system is described in general terms; this note is followed by a listing of the complete source code used to rank documents for the 50 TREC test questions, written in Awk. Possible improvements and directions for further research are suggested.

Acknowledgements

The PARA Group is a loose affiliation of people with common interests in free-text information retrieval, hypermedia, and free software. (For further information, or to join, send a message to "para-request@cs.cmu.edu" via the Internet.) For the TREC relevance-ranking document routing test, I consulted with other PARA Group members and implemented concepts that we discussed communally. I would like to thank Dr. Donna Harman, NIST, for allowing me to participate in TREC and for encouraging me to write up my results. I also thank the members of the PARA Group for their helpful advice. I made extensive use of, and am grateful for, software from the Free Software Foundation - in particular, the GNU Emacs text editing system and the Gawk version of the Awk programming language. (Disclaimer: my employer is in no way responsible for this work!)

Approach

I began with the subjective observation that, in my personal experience, the documents which I like most tend to have local clusters of "interesting" words.
I also began with the constraint that I had only a few hours of programming time to invest in my TREC experiment; contrariwise, I had a NeXT workstation with an optical disk and plenty of unused background CPU cycles available. This led me to try a quick-and-dirty approach using the regular expression pattern-matching and other programming facilities of Gawk, a free version of the Awk language. I decided to work on the document routing task using the full TREC data set.

I took the 50 TREC questions and manually constructed simple regular expressions ("regexps") for each of the key terms in them. Thus, for Topic 001, on pending antitrust cases, I had /ANTITRUST/, /CASE/, and /PEND/; for Topic 002, on acquisitions or mergers involving US and foreign companies, I came up with /ACQUISITION|BUYOUT|MERGER|TAKEOVER/, etc. For equivalent terms which were implicitly boolean-OR'd together, I wrote a single regexp with "|" joining the words. I spent approximately two minutes per TREC query writing these patterns, a total of about two hours, and used words contained in
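The proximity-correlation idea - awarding points when several query-term regexps match near each other - can be sketched in a few lines of Awk. The following is a hypothetical illustration, not the actual listing that accompanies this note: the patterns are in the style described for Topic 002, but the ten-word window and the pairwise point-awarding rule are assumptions made here for the sake of the example.

```shell
# Hypothetical sketch of proximity-correlation scoring (not the actual
# PARA Group code): award one point each time two *different* query-term
# regexps match within an assumed ten-word window of each other.
score=$(printf 'THE FOREIGN COMPANY ANNOUNCED A MERGER\n' | awk '
BEGIN {
    # Regexps in the style described for Topic 002.
    pat[1] = "ACQUISITION|BUYOUT|MERGER|TAKEOVER"
    pat[2] = "FOREIGN"
    pat[3] = "COMPAN"
    npat = 3
    window = 10     # assumed proximity window, measured in words
    score = 0
    pos = 0
}
{
    for (i = 1; i <= NF; i++) {
        pos++
        w = toupper($i)
        for (p = 1; p <= npat; p++) {
            if (w ~ pat[p]) {
                # a point for every *different* term seen nearby
                for (q = 1; q <= npat; q++)
                    if (q != p && last[q] != "" && pos - last[q] <= window)
                        score++
                last[p] = pos
            }
        }
    }
}
END { print score }')
echo "proximity score: $score"    # prints: proximity score: 3
```

In the sample sentence, FOREIGN, COMPANY, and MERGER all fall within ten words of one another, so three cross-term pairs are counted; documents whose query terms are scattered far apart would score correspondingly lower.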