NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2), D. K. Harman, ed., National Institute of Standards and Technology

Automatic Routing and Ad-hoc Retrieval Using SMART: TREC 2

C. Buckley, J. Allan, G. Salton

...ally assigned. The assumption is that t terms in all are available for the representation of the information. In choosing a term weighting system, low weights should be assigned to high-frequency terms that occur in many documents of a collection, and high weights to terms that are important in particular documents but unimportant in the remainder of the collection. The weight of terms that occur rarely in a collection is relatively unimportant, because such terms contribute little to the needed similarity computation between different texts.

A well-known term weighting system following that prescription assigns weight wik to term Tk in query Qi in proportion to the frequency of occurrence of the term in Qi, and in inverse proportion to the number of documents to which the term is assigned. [12, 10] Such a weighting system is known as a tf x idf (term frequency times inverse document frequency) weighting system. In practice the query lengths, and hence the number of non-zero term weights assigned to a query, vary widely. To allow a meaningful final retrieval similarity, it is convenient to use a length normalization factor as part of the term weighting formula. A high-quality term weighting formula for wik, the weight of term Tk in query Qi, is

                      (log(fik) + 1.0) * log(N/nk)
    wik = ------------------------------------------------------        (1)
           sqrt( sum_{k=1..t} [(log(fik) + 1.0) * log(N/nk)]^2 )

where fik is the occurrence frequency of Tk in Qi, N is the collection size, and nk the number of documents with term Tk assigned. The factor log(N/nk) is an inverse collection frequency ("idf") factor which decreases as terms are used widely in a collection, and the denominator in expression (1) is used for weight normalization.
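As a concrete illustration, the normalized tf x idf weight of expression (1) can be sketched in a few lines of Python. The function name and the dictionary-based interface below are illustrative assumptions, not the actual Smart implementation:

```python
import math

def ltc_weights(tf, df, N):
    """Cosine-normalized tf x idf ("ltc") weights per expression (1).

    tf -- dict: term -> occurrence frequency fik in the query (or document)
    df -- dict: term -> nk, the number of collection documents containing the term
    N  -- collection size

    Illustrative sketch only; names and data layout are assumptions.
    """
    # Numerator of (1): sublinear tf component times the idf factor.
    raw = {t: (math.log(f) + 1.0) * math.log(N / df[t]) for t, f in tf.items()}
    # Denominator of (1): Euclidean length of the raw weight vector.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# A rare term (nk = 10 out of N = 1000) receives a much higher weight
# than a widespread term (nk = 900), as the prescription above requires.
w = ltc_weights({"retrieval": 2, "system": 1},
                {"retrieval": 10, "system": 900}, 1000)
```

After normalization the weight vector has unit Euclidean length, so queries of very different lengths yield comparable similarity values.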
This particular form will be called "ltc" weighting within this paper. The weights assigned to terms in documents are much the same. In practice, for both effectiveness and efficiency reasons, the idf factor in the documents is dropped. [1]

The terms Tk included in a given vector can in principle represent any entities assigned to a document for content identification. In the Smart context, such terms are derived by a text transformation of the following kind: [10]

1. recognize individual text words;
2. use a stop list to eliminate unwanted function words;
3. perform suffix removal to generate word stems;
4. optionally use term grouping methods based on statistical word co-occurrence or word adjacency computations to form term phrases (alternatively, syntactic analysis computations can be used);
5. assign term weights to all remaining word stems and/or phrase stems to form the term vector for all information items.

Once term vectors are available for all information items, all subsequent processing is based on term vector manipulations.

The fact that the indexing of both documents and queries is completely automatic means that the results obtained are reasonably collection independent and should be valid across a wide range of collections. No human expertise in the subject matter is required for either the initial collection creation or the actual query formulation.

Phrases

The same phrase strategy (and phrases) used in TREC 1 ([1]) is used for TREC 2. Any pair of adjacent non-stopwords is regarded as a potential phrase. The final list of phrases is composed of those pairs of words occurring in 25 or more documents of the initial TREC 1 document set (D1, the TREC 1 initial collection). Phrase weighting is again a hybrid scheme in which phrases are weighted with the same scheme as single terms, except that normalization of the entire vector is done by dividing by the length of the single-term subvector only.
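The phrase selection and the hybrid normalization can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, documents are taken as pre-tokenized lists, and adjacency is computed after stopword removal, which may differ in detail from the actual Smart code:

```python
import math

def candidate_phrases(docs, stopwords, min_df=25):
    """Collect pairs of adjacent non-stopwords and keep those occurring
    in min_df or more documents (25 in the TREC experiments).

    docs -- list of token lists, one per document (illustrative layout)
    """
    df = {}
    for tokens in docs:
        content = [t for t in tokens if t not in stopwords]
        # Count each distinct pair once per document (document frequency).
        for pair in {(a, b) for a, b in zip(content, content[1:])}:
            df[pair] = df.get(pair, 0) + 1
    return {p for p, n in df.items() if n >= min_df}

def hybrid_normalize(single, phrase):
    """Divide both subvectors by the length of the single-term subvector
    only, so phrases never alter the single-term similarity contribution."""
    norm = math.sqrt(sum(w * w for w in single.values()))
    return ({t: w / norm for t, w in single.items()},
            {p: w / norm for p, w in phrase.items()})
```

Because the normalizer ignores the phrase subvector, adding or removing phrases leaves the single-term portion of every similarity score unchanged, which is the point of the hybrid scheme.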
In this way, the similarity contribution of the single terms is independent of the quantity or quality of the phrases.

Text Similarity Computation

When the text of document Dj is represented by a vector of the form (dj1, ..., djt) and query Qi by the vector (qi1, ..., qit), a similarity (S) computation between the two items can conveniently be obtained as the inner product between corresponding weighted