NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Automatic Routing and Ad-hoc Retrieval Using SMART: TREC 2
C. Buckley
J. Allan
G. Salton
National Institute of Standards and Technology
D. K. Harman
ally assigned. The assumption is that t terms
in all are available for the representation of the
information.
In choosing a term weighting system, low
weights should be assigned to high-frequency
terms that occur in many documents of a col-
lection, and high weights to terms that are im-
portant in particular documents but unimpor-
tant in the remainder of the collection. The
weight of terms that occur rarely in a collec-
tion is relatively unimportant, because such
terms contribute little to the needed similar-
ity computation between different texts.
A well-known term weighting system fol-
lowing that prescription assigns weight w_ik to
term T_k in query Q_i in proportion to the fre-
quency of occurrence of the term in Q_i, and
in inverse proportion to the number of doc-
uments to which the term is assigned. [12, 10]
Such a weighting system is known as a tf x idf
(term frequency times inverse document fre-
quency) weighting system. In practice the
query lengths, and hence the number of non-
zero term weights assigned to a query, vary
widely. To allow a meaningful final retrieval
similarity, it is convenient to use a length nor-
malization factor as part of the term weighting
formula. A high-quality term weighting for-
mula for w_ik, the weight of term T_k in query
Q_i, is
                    (log(f_ik) + 1.0) * log(N/n_k)
w_ik = ----------------------------------------------------------        (1)
       sqrt( sum_{k=1}^{t} [ (log(f_ik) + 1.0) * log(N/n_k) ]^2 )
where f_ik is the occurrence frequency of T_k in
Q_i, N is the collection size, and n_k the num-
ber of documents with term T_k assigned. The
factor log(N/n_k) is an inverse collection fre-
quency ("idf") factor which decreases as terms
are used more widely in a collection, and the
denominator in expression (1) is used for weight
normalization. This particular form will be called
"ltc" weighting within this paper.
The weights assigned to terms in documents
are much the same. In practice, for both effec-
tiveness and efficiency reasons the idf factor in
the documents is dropped. [1]
The terms Tk included in a given vector can
in principle represent any entities assigned to
a document for content identification. In the
Smart context, such terms are derived by a
text transformation of the following kind:[10]
1. recognize individual text words
2. use a stop list to eliminate unwanted func-
tion words
3. perform suffix removal to generate word
stems
4. optionally use term grouping methods
based on statistical word co-occurrence
or word adjacency computations to
form term phrases (alternatively syntac-
tic analysis computations can be used)
5. assign term weights to all remaining word
stems and/or phrase stems to form the
term vector for all information items.
Once term vectors are available for all informa-
tion items, all subsequent processing is based
on term vector manipulations.
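The five transformation steps above can be sketched roughly as follows. The stop list and suffix-removal rules here are toy stand-ins (Smart uses a full stop list and stemmer), phrase formation (step 4) is omitted, and all names are hypothetical:

```python
import re
from collections import Counter

STOP_LIST = {"the", "of", "a", "in", "is", "and", "to", "for"}  # toy stop list

def stem(word):
    # Crude suffix removal standing in for a real stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def index_text(text):
    words = re.findall(r"[a-z]+", text.lower())              # 1. recognize words
    stems = [stem(w) for w in words if w not in STOP_LIST]   # 2. stop, 3. stem
    return Counter(stems)  # raw tf vector; step 5 then applies term weights

vec = index_text("Weighting of terms for automatic indexing")
```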
The fact that the indexing of both doc-
uments and queries is completely automatic
means that the results obtained are reasonably
collection independent and should be valid
across a wide range of collections. No human
expertise in the subject matter is required for
either the initial collection creation, or the ac-
tual query formulation.
Phrases
The same phrase strategy (and phrases) used
in TREC 1 ([1]) is used for TREC 2. Any
pair of adjacent non-stopwords is regarded
as a potential phrase. The final list of phrases
is composed of those pairs of words occur-
ring in 25 or more documents of the initial
TREC 1 document set (D1, TREC 1 initial
collection). Phrase weighting is again a hy-
brid scheme where phrases are weighted with
the same scheme as single terms, except that
normalization of the entire vector is done by
dividing by the length of the single term sub-
vector only. In this way, the similarity con-
tribution of the single terms is independent of
the quantity or quality of the phrases.
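A rough sketch of this phrase strategy and hybrid normalization, with hypothetical names; the 25-document frequency threshold over the D1 set, which filters candidate pairs down to the final phrase list, is omitted here:

```python
import math

STOP_LIST = {"the", "of", "a", "in", "and"}  # toy stop list

def adjacent_pairs(words):
    # Candidate phrases: adjacent word pairs with neither member a stopword
    return [(a, b) for a, b in zip(words, words[1:])
            if a not in STOP_LIST and b not in STOP_LIST]

def hybrid_normalize(single_w, phrase_w):
    # Divide BOTH subvectors by the length of the single-term subvector
    # only, so the similarity contribution of the single terms is
    # unaffected by the quantity or quality of the phrases.
    norm = math.sqrt(sum(w * w for w in single_w.values())) or 1.0
    return ({t: w / norm for t, w in single_w.items()},
            {p: w / norm for p, w in phrase_w.items()})
```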
Text Similarity Computation
When the text of document D_j is represented
by a vector of the form (d_j1, ..., d_jt) and
query Q_i by the vector (q_i1, ..., q_it), a
similarity (S) computation between the two
items can conveniently be obtained as the in-
ner product between corresponding weighted