NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1), edited by Donna K. Harman, National Institute of Standards and Technology

Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr, C. Buckley

A.2 Ad-hoc runs

Two automatic ad-hoc runs were done: one in which the documents and queries were indexed with single terms only, and one in which they were indexed with both single terms and adjacency phrases. The overall procedure for both runs was:

1. Index D1 (the learning document set) and Q1 (the learning query set).
2. For each document d ∈ D1
   2.1 For each query q ∈ Q1
       2.1.1 Determine the relevance value r of d to q.
       2.1.2 For each term t in common between qT (the set of query terms) and dT (the set of document terms)
           2.1.2.1 Find the values of the elements of the relevance description involved in this run and add these values, plus the relevance information, to the least-squares matrix being constructed.
3. Solve the least-squares matrix to find the coefficient vector a.
4. Index D1 ∪ D2 (both sets of documents together) with term-frequency weights.
5. For each document d ∈ D1 ∪ D2 (both sets of documents together)
   5.1 For each term t ∈ dT
       5.1.1 Find the values of the relevance description x(t, d) involved in the run.
       5.1.2 Give t the weight a · x(t, d), where a is the coefficient vector determined in step 3.
   5.2 Add d to the inverted file.
6. Weight Q2 (the test query set) with tf·idf weights (ntc variant).
7. Run an inner-product inverted-file similarity match of Q2 against the inverted file formed in step 5, retrieving the top 200 documents.

In an operational environment, the learning document set and the test document set would be the same, so step 1 (or step 4) would be omitted. Once the coefficients have been found, they should remain valid unless the character of the collection changes, so new documents can be added to a dynamic collection simply by going through step 5 for each new document.
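The learning and weighting steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the training data is assumed to arrive as (x(t, d), r) pairs, with the details of how the least-squares matrix is accumulated left out.

```python
import numpy as np

def learn_coefficients(training_pairs):
    """Steps 2-3: build the least-squares system and solve for the
    coefficient vector a.

    `training_pairs` is an iterable of (x, r) tuples, where x is the
    relevance-description vector for one (query term, document) pair
    and r is the relevance value of that document to that query.
    (Hypothetical data layout; the appendix does not show the matrix
    construction in detail.)
    """
    X = np.array([x for x, _ in training_pairs], dtype=float)
    r = np.array([rel for _, rel in training_pairs], dtype=float)
    a, *_ = np.linalg.lstsq(X, r, rcond=None)
    return a

def term_weight(a, x):
    """Step 5.1.2: the indexing weight of term t in document d is the
    inner product of the learned coefficients with x(t, d)."""
    return float(np.dot(a, x))
```

Keeping the fit in a single batched least-squares solve matches the reformulation described below: only the accumulated matrix, not the document vectors or inverted files, needs to be held in memory during learning.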
The algorithm above differs from that implemented in the past for this learning approach. Earlier versions iterated over queries (instead of documents) in step 2 and used inverted files for speed (with the document vectors still being needed). However, the size of TREC required a reformulation, since the document vectors and inverted files could not be kept in memory. The coefficient determination is valid only if the relevance judgements used are representative of the entire set of relevance judgements. In this first TREC, the initial judgements are fragmentary and not very representative; it is impossible to tell how much this affects the results.

A.3 Single term automatic ad-hoc run

The single term ad-hoc run used 5 factors (described in 2.1):

* constant
* tf · logidf / maxtf
* tf / maxtf
* logidf
* lognumterms / maxtf
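A sketch of the five-element relevance description x(t, d) for the single-term run, under my reading of the OCR-damaged factor list above. The parameter names are assumptions: tf is the frequency of t in d, maxtf the largest term frequency in d, idf_log the log of the inverse document frequency of t, and numterms the number of distinct terms in d.

```python
import math

def relevance_description(tf, maxtf, idf_log, numterms):
    """Hypothetical encoding of the 5 factors used in the single-term
    ad-hoc run: constant, tf*logidf/maxtf, tf/maxtf, logidf, and
    lognumterms/maxtf."""
    return [
        1.0,                         # constant
        tf * idf_log / maxtf,        # tf * logidf / maxtf
        tf / maxtf,                  # tf / maxtf
        idf_log,                     # logidf
        math.log(numterms) / maxtf,  # lognumterms / maxtf
    ]
```

Each vector produced this way would supply one row of the least-squares matrix during learning and one dot product with the coefficient vector a during indexing.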