NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC-2 Document Retrieval Experiments using PIRCS
chapter
K. Kwok
L. Grunfeld
National Institute of Standards and Technology
D. K. Harman
approach we achieve several objectives for our system: a)
reduce the 'dead' time between a collection being acquired
and its availability for searching, since the full inverted file
is not produced; b) satisfy the 2N bytes of space
requirement; c) support fast feedback learning and query
expansion by having the direct file available. The price we
pay is that the network must be created dynamically, unlike the
full inverted file, which is produced once. However, since
the full inverted file is too large to reside in memory,
reading parts of it in for retrieval is also time consuming.
3.2 Reduced Network Size
We produce our network in memory according to the
queries under attention and the terms used in them. Since
memory is limited, documents are divided into
subcollections (Section 3.3) and queries are lined up five
to ten at a time. We define the active term set (ATS) of a
network as the set of terms used by the current queries and
their feedback documents, if any. The latter provide
terms for query expansion. Only edges that connect items
to the active term set are initiated in the network, reducing
network space requirements substantially. With the current
implementation, for a 1 GB subcollection and 7 queries, the
node and edge files together take about 40 MB. Using
clock time as a measure, producing the network requires
about 40 minutes. Learning is fast, about 2 minutes, and
retrieval and ranking take another 8 minutes. We hope that
further improvements in design and faster hardware in the
future can improve these figures substantially.
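To make the edge-pruning idea concrete, the following is a minimal sketch (Python, not PIRCS code) of restricting network construction to the active term set. The direct-file record layout and the function names used here are illustrative assumptions rather than the system's actual data structures.

    def active_term_set(queries, feedback_docs=()):
        # ATS = union of terms used by the current query batch and any
        # feedback documents supplying terms for query expansion.
        ats = set()
        for item in list(queries) + list(feedback_docs):
            ats.update(item)          # each item: an iterable of term ids
        return ats

    def build_edges(direct_file, ats):
        # direct_file: sequential records of (doc_id, {term: within-doc freq}).
        # Only edges that touch the ATS are created, so the in-memory
        # node and edge files stay small.
        edges = {}
        for doc_id, term_freqs in direct_file:
            for term, freq in term_freqs.items():
                if term in ats:
                    edges[(doc_id, term)] = freq
        return edges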
3.3 Master and Subcollection File Structure
We view the TREC experiments as document retrieval from
multiple collections, with retrieval results reported in one
single ranked list of documents for each topic (query).
Although only four or five collections such as Wall Street
Journal, Associated Press, etc. are given in TREC, in reality
there could be many more. We consider three methods of file
design to approach the problem:
a) A centralized approach where all documents from all
collections are processed as if from a single document
stream, producing a centralized dictionary containing full
usage statistics and a giant direct file. From these,
networks can be initialized. The idea is simple, but it has
drawbacks in that eventually the direct file would exceed
file/disk size limitations and software has to be designed to
handle data crossing file/disk boundaries. Moreover, it is
inherently fragile to create single files of this size. The
advantage of this approach is that RSVs calculated are
directly comparable for all documents and a single ranked
list is produced without difficulty.
b) An independent collections approach where each source
collection, of about 0.5-1.0 GB say, forms a textbase with its
own local dictionary and direct file for network initiation
and retrieval. One simply repeats the process for as many
collections as necessary. This is the preferred approach,
and if one has n processors and sufficient disk space, n
separate textbases can be created for learning and retrieval
in parallel, saving substantial time. The problem is how to
combine the retrieval lists from each into a single ranked
list, since each textbase has its own term usage statistics and
calculates RSVs for ranking within its own environment.
Classical Boolean retrieval and coordinate matching pose no
problem. Some retrieval strategies may produce RSVs that
are comparable across collections in theory, but after
approximations are taken it is questionable whether this still
holds. A similar problem exists for retrieval from distributed
databases such as the WAIS environment.
c) For TREC-2 we settle on a hybrid subcollections
approach, treating each source as a subcollection within a
master. We create a master centralized dictionary as in a),
capturing full usage statistics and serving all the subcollections,
and create separate direct files for each subcollection as in
b). The central dictionary has about 620,000 unique terms
after processing 2 GB from Disk 1 and Disk 2, and is
relatively small. It captures global term usage statistics,
while the individual direct files capture local usage statistics
within items. Separate networks are then created for each
subcollection with edge weights based on the correct global
and local statistics as in a), assuring that retrieval lists
contain RSVs that are directly comparable. This approach
combines the advantages of both a) and b), and can also
function in a parallel distributed environment.
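The following is a minimal sketch of why the hybrid design yields directly comparable RSVs: every subcollection scores documents with the same global statistics taken from the master dictionary, while term frequencies come from its own direct file. The icf-style weight shown is a generic placeholder, not the actual PIRCS edge weight.

    import math

    def global_weight(term, master_dict, collection_size):
        # Global statistic from the master centralized dictionary, identical
        # for every subcollection (placeholder inverse-collection-frequency).
        return math.log(collection_size / master_dict[term])

    def rsv(doc_terms, query_terms, master_dict, collection_size):
        # doc_terms: {term: local frequency} read from one subcollection's
        # direct file; only global weights enter the score, so ranked lists
        # from different subcollections merge directly by RSV.
        return sum(freq * global_weight(t, master_dict, collection_size)
                   for t, freq in doc_terms.items() if t in query_terms)

Because the weights depend only on the master dictionary, the per-subcollection ranked lists can simply be merged and re-sorted by RSV to produce the single list required for each topic.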
4. Item Representation
As in TREC-1, a number of preprocessing steps, mainly for the
purpose of improving the representations of documents and
queries, are done as follows:
4.1 Vocabulary Control
In addition to a manual stopword list of about 630 words
and another 528 manually identified 2-word phrases, we
also process samples from Disk 1 and Disk 2 using all five
source types (WSJ, AP, FR, ZIFF and DOE, of about
100 MB each) to produce two-word phrases based on
adjacency within sentence context. Our objective is to
remedy losses of recall and precision due to the removal of
high frequency terms. Our criterion for phrases is that each
word pair must have a frequency of 40, and either one or
both components must be high frequency (>=10000).
casual scan of the resultant list led us to remove