NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1). The QA System chapter. J. Driscoll, J. Lautenschlager, M. Zhao, National Institute of Standards and Technology, Donna K. Harman

The document weight file is a binary file containing a list of floating-point numbers. These numbers are ordered sequentially by document number, and represent the sum of the squared query weights for a particular document's keywords. The inverted index file is ordered by stem code. Each code has two file pointers: one pointing to the first block of data in the data file, and the second pointing to the last block of data. The inverted data file then consists of blocks of data containing pairs of document numbers that the code is found in and the code's frequency within that document. The blocks are linked together to form a list. The document name file consists of a list of pointers into the original text file. For Vol. 1, both the document weight file and the inverted index file were two megabytes. The inverted data file was approximately 385 megabytes, and the document name file was two megabytes.

4. Basic Procedure

For the TREC experiments, we did the following:

Step 1: This step involves matching every legitimate stem in the document collection with a unique integer value. This is done with a linear hashing function. A table containing this mapping, along with the number of documents each code is found in, is temporarily saved for use in Step 2.

Step 2: This step creates the four data files described above. The entire database is scanned, with the four files being created on the fly. Once this is accomplished, the table from Step 1 is no longer necessary and is discarded.

Step 3: Relevant documents for each query are selected using the Jaccard similarity coefficient. The top 200 documents for each query are then determined.

The above three steps were followed to create the results for the September deadline.
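Step 3 can be sketched as follows. The in-memory `docs` map and the function names are illustrative stand-ins: the actual system resolves stem codes and document frequencies through the inverted files described above rather than holding term sets in memory.

```python
def jaccard(query_terms, doc_terms):
    # Jaccard coefficient: |Q intersect D| / |Q union D| over the two term sets.
    q, d = set(query_terms), set(doc_terms)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

def rank_documents(query_terms, docs, top_n=200):
    # docs maps document number -> iterable of stem codes found in that document.
    scored = sorted(((jaccard(query_terms, terms), doc_id)
                     for doc_id, terms in docs.items()),
                    key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]
```

Taking the top 200 entries of the sorted list yields the per-query result set used for the September submission.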
In the next section we present our "official" experiments and results, and some "unofficial" experiments and results.

5. Experiments and Results

Our experiments were intended to be Category A experiments with two results submitted for each ranking task. One ranking result would be for keywords alone; the other would be for keywords combined with semantics. All query construction was automatic, and the treatment of ad-hoc and routing queries was identical. We also performed experiments concerning the expected behavior of string hashing functions and the use of part-of-speech tagging to improve retrieval performance. These experiments are not reported here.

As an example of our automatically built ad-hoc and routing queries, consider Topic 004, reproduced in Figure 3. Figure 4 indicates the keyword and semantic information generated by the QA System for this topic. The first part of Figure 4 indicates the stems found in the query along with their frequencies. The second part of Figure 4 indicates the semantic categories also found in the query, along with their expected frequencies and probability present.

It is important to note that the topic represented in Figure 3 and Figure 4 has generated many semantic categories, and the probability present for most of them is close to, or at, 100%. This is mainly due to the length of the text involved. We discovered that, for the TREC document collection, each document generated many semantic categories with high probability present. Because we treated semantic categories like keywords, this caused semantic weights to be essentially useless.

For the TREC September deadline, we were only able to submit a routing experiment using a keywording approach. The results of this experiment, computed with the aid of Chris Buckley's SMART evaluation program [1], are shown in Figure 5. The results are not good. In Section 6, we discuss what impeded our experiments.
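The first part of Figure 4, stems with their query frequencies, can be approximated by a simple token count. The tokenizer and stopword list below are hypothetical; a real run would also apply the system's own stemmer to each token before counting.

```python
import re
from collections import Counter

# Hypothetical minimal stopword list; the actual system's list is larger.
STOPWORDS = frozenset({"the", "of", "and", "a", "an", "in", "to", "is"})

def query_profile(topic_text):
    # Lowercase, split into alphabetic tokens, drop stopwords, count the rest.
    tokens = re.findall(r"[a-z]+", topic_text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)
```

Each (stem, frequency) pair in the resulting counter corresponds to one row in the first part of Figure 4.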
Further "unofficial" experiments were designed to test the use of semantics. The main goal of our experiments was to demonstrate that our original routing results could be improved through the use of semantic analysis. To do this, we made two modifications to our approach. The first change involved dividing the original TREC documents into paragraphs. The second change involved a semantic analysis when calculating the list of relevant documents.

Our experiments involved the use of only six routing queries (for Topics 001, 002, 003, 007, 017, and 022). These topics were selected because our original results for them were poor. Through the use of semantic analysis, we hoped to significantly improve our results. Figure 6 shows the precision-recall statistics for the six "poor" queries using the retrieval results which created the statistics in Figure 5.

When analyzing our results, we computed all precision-recall tables through the use of Chris Buckley's SMART evaluation program [1]. The relevancy lists used were those produced before November 1 (the original qrels for the routing queries). We did not use the modified results that were distributed later for Query 017. This should not affect our results, though, because our experiments were aimed only at improving our precision-recall averages, and the relevancy results used were consistent from one experiment to the other.

5.1 Re-Ranking of Documents

In an effort to demonstrate that semantics could affect retrieval in the TREC environment, we used the original QA System with a semantic lexicon containing TREC words as described in Section 3.2. We created a separate database for each query we considered, for a total of six databases. Each database contained only the documents that we originally judged in the top 250 for each query. Because of this, when we computed new relevancy lists we were simply rearranging the order of the same 250 documents, not bringing in new documents.
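A minimal sketch of this re-ranking step, assuming a placeholder similarity function and a per-document term lookup (both stand-ins for the QA System's actual coefficient and its per-query databases): only the originally retrieved identifiers are rescored, so no new documents can enter the list.

```python
def overlap_score(query_terms, doc_terms):
    # Placeholder set-overlap similarity; the paper's re-run used the
    # QA System's own coefficient, not necessarily this one.
    q, d = set(query_terms), set(doc_terms)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

def rerank(query_terms, original_top, get_terms):
    # Re-rank only the originally retrieved documents for this query.
    rescored = [(overlap_score(query_terms, get_terms(doc_id)), doc_id)
                for doc_id in original_top]
    rescored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in rescored]
```

Because `rerank` iterates only over `original_top`, the output is a permutation of the input list, mirroring the restriction to the same 250 documents per query.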
Figure 7 reveals the precision-recall statistics when originally retrieved documents for a query are used as a document collection and re-ranked by imposing the query again. There is a 25.8% increase when comparing the 11-pt average here to the 11-pt average of the originally retrieved text (Figure 6). To determine the ranking of a particular document with paragraph divisions, we defined the similarity coefficient of a document to be equal to the highest coefficient associated with one of its corresponding paragraphs. The paragraph divisions were automatically constructed from the original text. The precision-recall statistics for paragraphs being used as documents are shown in Figure 8. There is an 18.8%