NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr
C. Buckley
National Institute of Standards and Technology
Donna K. Harman
A.2 Ad-hoc runs
Two automatic ad-hoc runs were done: one in which the documents and queries were indexed with
single terms only, and one in which they were indexed with both single terms and adjacency phrases.
The overall procedure for both runs was:
1. Index D1 (the learning document set) and Q1 (the learning query set).
2. For each document d ∈ D1
2.1 For each q ∈ Q1
2.1.1 Determine the relevance value r of d to q
2.1.2 For each term t in common between qT (set of query terms) and
dT (set of document terms)
2.1.2.1 Find values of the elements of the relevance description
involved in this run and add values plus relevance
information to the least squares matrix being constructed
3. Solve the least squares matrix to find the coefficient vector a
4. Index D1 ∪ D2 (both sets of documents together) with term-freq weights.
5. For each document d ∈ D1 ∪ D2 (both sets of documents together)
5.1 For each term t ∈ dT
5.1.1 Find values of the relevance description x(t, d) involved in run.
5.1.2 Give t the weight a · x(t, d),
where a is the coefficient vector determined in step 3.
5.2 Add d to the inverted file.
6. Weight Q2 (test query set) with tf·idf weights (ntc variant).
7. Run an inner product inverted file similarity match of Q2 against
the inverted file formed in step 5, retrieving the top 200 documents.
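The least-squares core of steps 2-3 can be sketched in Python. The feature values and relevance labels below are invented for illustration; in the actual run each row of X would hold a relevance description x(t, d) for a matching (query, document, term) triple, and r would hold the relevance value of d to q.

```python
import numpy as np

# Toy illustration of steps 2-3: accumulate one row per matching
# (query, document, term) triple, then solve the least-squares
# system  X a ≈ r  for the coefficient vector a.
rows = [
    # x(t, d) features     r (relevance of d to q)
    ([1.0, 0.8, 0.5],      1.0),
    ([1.0, 0.1, 0.9],      0.0),
    ([1.0, 0.6, 0.4],      1.0),
    ([1.0, 0.2, 0.7],      0.0),
]
X = np.array([x for x, _ in rows])
r = np.array([rel for _, rel in rows])

# Minimize ||X a - r||^2; lstsq solves the normal equations for us,
# which replaces the explicitly accumulated matrix of the original
# formulation.
a, *_ = np.linalg.lstsq(X, r, rcond=None)
```

In the original formulation the normal-equation matrix is built incrementally (one row at a time) rather than held as a full X, which is what allows step 2 to stream over documents without keeping everything in memory.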
In an operational environment, the learning document set and test document set would be the same,
so step 1 (or step 4) would be omitted. Once coefficients have been found, they should remain valid
unless the character of the collection changes. So new documents can be added to a dynamic collection
by just going through step 5 for each new document.
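Weighting one new document under fixed coefficients (step 5) is then a single inner product per term. The coefficient values and per-term features below are made up; only the form of the computation, weight = a · x(t, d), is taken from the text.

```python
import numpy as np

a = np.array([0.05, 0.4, 0.2])   # coefficient vector from the least-squares step

doc_features = {                  # hypothetical x(t, d) for each term of the new document
    "retrieval": [1.0, 0.9, 0.3],
    "probabilistic": [1.0, 0.7, 0.6],
}

# Each term's indexing weight is the inner product a . x(t, d);
# the (term, weight) pairs would then be posted to the inverted file.
weights = {t: float(np.dot(a, x)) for t, x in doc_features.items()}
```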
The algorithm above is different from that implemented in the past for this learning approach. Earlier
versions iterated over queries (instead of documents) in step 2 and used inverted files for speed (with
the document vectors still being needed). However, the size of TREC required a reformulation since
document vectors and inverted files could not be kept in memory.
The coefficient determination is valid only if the relevance judgements used are representative of the
entire set of relevance judgements. In this first TREC, the initial judgements are fragmentary and
not very representative; it is impossible to tell how much this affected the results.
A.3 Single term automatic ad-hoc run
The single term ad-hoc run used 5 factors (described in 2.1):
* constant
* tf · logidf · imaxtf
* tf · imaxtf
* logidf
* lognumterms · imaxtf
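A sketch of this 5-element relevance description in Python. The definitions below are assumptions inferred from the factor names: tf is the term's within-document frequency, logidf = log(N / df), imaxtf = 1 / (maximum tf in d), and lognumterms = log(number of distinct terms in d).

```python
import math

def relevance_description(tf, df, N, max_tf, num_terms):
    # Assumed readings of the factor names; not confirmed by the source.
    logidf = math.log(N / df)          # log inverse document frequency
    imaxtf = 1.0 / max_tf              # inverse of the document's maximum tf
    lognumterms = math.log(num_terms)  # log of the number of distinct terms in d
    return [
        1.0,                      # constant
        tf * logidf * imaxtf,     # tf · logidf · imaxtf
        tf * imaxtf,              # tf · imaxtf
        logidf,                   # logidf
        lognumterms * imaxtf,     # lognumterms · imaxtf
    ]
```

Each such vector x(t, d) supplies one row of the least-squares system in steps 2-3, and is re-evaluated in step 5 to weight terms at indexing time.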