4 Experiments

In this section, we present the evaluation of the method described above and compare it to other weighting schemes. We focus on the efficiency of modifying documents and on the correlation between retrieval efficiency and retrieval effectiveness. We also examine the influence of the vocabulary restriction on both retrieval effectiveness and retrieval efficiency. For the final evaluations we concentrated on the ad hoc queries.

Before discussing the results, we define what we mean by a partition, a run and an experiment. The document collection has been split up into several partitions, each consisting of at most 100,000 documents. Thus, the large collections DOE1 (Department of Energy, Disk 1) and ZIFF3 (Ziff-Davis Publishing, Disk 3) were divided into three and two partitions respectively. A run consists of the evaluation of 50 queries (either the set of routing queries or the set of ad hoc queries) against all documents of one partition. For each query, the 1000 top ranked documents have been retrieved. An experiment consists of several runs and the merging of the lists of ranked documents for each query. For TREC-2, the two sets of experiments "Topics 51-100 versus Disk 3" and "Topics 101-150 versus Disks 1 and 2" have been evaluated.

All efficiency evaluations are based on CPU time rather than on real time in order to eliminate side effects from other jobs running on the same machine. In these experiments, we used a SUN SPARCserver MP690 with 128 MBytes RAM.

We derived the document descriptions directly from the CDs. The indexing process included the elimination of stop words (van Rijsbergen's stop list [12, p. 18]) and Porter's word reduction algorithm [6]. The normalized inverse document frequencies have been derived from the documents of Disks 1 and 2 only. Uncompressing and indexing a single document takes around 100 msec on average, depending on the length of the document. The computation of the inverse document frequencies from the descriptions took about 1.5 hours of CPU time. The average time for inserting a document description into the access structure is on the order of 10 msec, again depending on the number of features per document. Inserting a document description into an inverted file would take more time, because the postings would have to be inserted into the different lists associated with each feature.

The restriction of the vocabulary was accomplished by omitting features occurring in more than 15% of all documents (from Disks 1 and 2), i.e. in more than 111,337 documents. We chose a limit of 15% of the collection, although even a stricter limit of 10% should not affect the retrieval effectiveness [1]. In our experiments we compare the 15% limit ("df15") to an unrestricted vocabulary ("all").

We now have three parameters which can be combined to specify eight different retrieval methods. Each method can be identified by a string built from the labels for the document feature weighting, the query feature weighting and the vocabulary restriction: (doc_feat_weight).(query_feat_weight).(vocab)
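To illustrate the merging step that turns several runs into an experiment, the following minimal Python sketch combines the per-partition ranked lists of a single query into one list; it is an illustration, not our actual implementation, and it assumes that retrieval status values are directly comparable across partitions (which holds here, since the inverse document frequencies were derived from Disks 1 and 2 as a whole). The function and document identifiers are our own, hypothetical choices.

    import heapq
    from itertools import islice

    def merge_runs(runs, cutoff=1000):
        # Each run is one partition's result list for a single query:
        # (score, doc_id) pairs sorted by descending score.  heapq.merge
        # preserves this order across partitions without re-sorting.
        merged = heapq.merge(*runs, key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in islice(merged, cutoff)]

    # Example with two partitions of the DOE1 collection (hypothetical ids):
    run_a = [(0.92, "DOE1-0001"), (0.40, "DOE1-0057")]
    run_b = [(0.88, "DOE1-0103"), (0.12, "DOE1-0200")]
    print(merge_runs([run_a, run_b], cutoff=3))
    # ['DOE1-0001', 'DOE1-0103', 'DOE1-0057']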
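The indexing and vocabulary restriction described above can be summarized by a similar sketch. The stop list and the stemmer below are crude stand-ins for van Rijsbergen's stop list [12] and Porter's algorithm [6], and dividing by log N is only one plausible reading of "normalized" inverse document frequency; all names are ours.

    import math
    from collections import Counter

    STOP_WORDS = {"the", "of", "and", "a", "in"}   # stand-in for the stop list [12]

    def stem(word):
        # Crude stand-in for Porter's word reduction algorithm [6].
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def describe(text):
        # Document description: stemmed, stop-word-free features with counts.
        tokens = [t.lower() for t in text.split()]
        return Counter(stem(t) for t in tokens if t not in STOP_WORDS)

    def restricted_idf(descriptions, df_limit=0.15):
        # Normalized idf over all descriptions, omitting features occurring
        # in more than df_limit (here 15%) of all documents -- the "df15"
        # restriction; df_limit=1.0 reproduces the unrestricted "all" case.
        n = len(descriptions)
        df = Counter(f for d in descriptions for f in d)
        return {f: math.log(n / df[f]) / math.log(n)
                for f in df if df[f] <= df_limit * n}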
In what follows, we present the results of the following nine methods:

M0 ntc.ntn.all
M1 lnc.ltn.all
M2 lnc.ltn.df15
M3 lnc.ntn.all
M4 lnc.ntn.df15
M5 ltc.ltn.all
M6 ltc.ltn.df15
M7 ltc.ntn.all
M8 ltc.ntn.df15

First, we compare the retrieval effectiveness of our method (M1) described in Section 2 to the standard tf*idf method (M0) by means of the precision-recall graph in Figure 2. As expected, method M1 is more effective than M0 and achieves a retrieval effectiveness among the best methods presented at TREC-2. In order to find out the reason for this difference in retrieval effectiveness, we must take a closer look at the influence of each parameter (document and query feature weighting).

In Figure 3, the 11-point average precisions of each method (M0 to M8) are plotted on the left axis, and they are connected to the median response times (for the top ranked document) plotted on the right axis. The most obvious conclusion from this graph is the following: the higher the precision, the slower the response, and vice versa. The method M0 performs clearly worse than the methods M1 to M8 with respect to both retrieval effectiveness and retrieval efficiency.

We concentrate on the response times of the top ranked document because the response times of all further ranked documents are of secondary interest: a user is supposed to read the top ranked document before looking at the other documents, and the retrieval system can retrieve further documents while the user is reading the top ranked document.

Figure 3 also shows the influence of the different parameters on the average precision. Regarding the weighting of the document features, the "lnc" weighting achieves a 4-10% higher precision than the "ltc" weighting. The "ntc" weighting loses 5% of precision compared to "ltc". In the case of query feature weighting, again the logarithmic "ltn" weighting is more effective than the "ntn" weighting.
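The three-letter method codes above follow the SMART convention: the first letter is the term-frequency component ("l" for 1 + log tf, "n" for raw tf), the second the idf component ("t" for idf, "n" for none), and the third the normalization ("c" for cosine, "n" for none). The following sketch, with helper names and idf values of our own choosing, shows how a retrieval status value would be obtained under method M1 (lnc.ltn); it illustrates the convention rather than reproducing our implementation.

    import math

    def smart_weights(code, tfs, idf):
        # SMART triple, e.g. "lnc" (documents) or "ltn" (queries):
        # code[0]: tf component -- "l" = 1 + log tf, "n" = raw tf;
        # code[1]: idf component -- "t" = multiply by idf, "n" = none;
        # code[2]: normalization -- "c" = cosine, "n" = none.
        w = {}
        for f, tf in tfs.items():
            wf = 1.0 + math.log(tf) if code[0] == "l" else float(tf)
            if code[1] == "t":
                wf *= idf.get(f, 0.0)
            w[f] = wf
        if code[2] == "c":
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            w = {f: v / norm for f, v in w.items()}
        return w

    def rsv(doc_w, query_w):
        # Retrieval status value: inner product of document and query vectors.
        return sum(wq * doc_w.get(f, 0.0) for f, wq in query_w.items())

    # Method M1 = lnc.ltn (hypothetical idf values):
    idf = {"retriev": 1.8, "dynam": 2.3, "collect": 1.1}
    doc_w = smart_weights("lnc", {"retriev": 3, "collect": 1}, idf)
    query_w = smart_weights("ltn", {"retriev": 1, "dynam": 1}, idf)
    print(rsv(doc_w, query_w))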