MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Other Potentially Related Research
chapter
Mary Elizabeth Stevens
National Bureau of Standards
1. For each term in the initial formulation of a search request, the
appropriate term-profile is obtained, which gives weighted values
for those other terms that had significantly co-occurred with it.
2. The profiles of each term in a multi-term request are compared
and those additional terms common to all or a specified number of
the profiles are selected and added to the initial set. 1/
3. The "first generation" terms resulting from step 2 are next treated
as though they also were request terms, and steps 1 and 2 are repeated
for them.
4. A selection is made from some reasonable proportion of the profiles
associated with the first generation terms to produce the "second
generation" terms. 2/
5. The expanded list of search terms is then compared with the index
terms assigned to each document in the collection, and whenever a
match is found the weight of the request term is assigned to the
matching document term. These weights are then summed to provide
a numeric measure of probable document relevance to the original
request.
6. Documents responding to the expanded request are printed out in the
order of document relevance scores.
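
The six steps above can be sketched in Python as follows. This is only a
minimal illustration under stated assumptions, not the program used in the
reported experiments: the toy term profiles, the function names, the weight
of 1.0 given to the original request terms, the use of a mean association
weight for added terms, and the selection thresholds are all hypothetical.

```python
from collections import defaultdict


def common_terms(seed_terms, profiles, min_profiles):
    """Steps 1-2: collect terms that occur in at least `min_profiles`
    of the seed terms' profiles. The mean weight is an assumption."""
    hits = defaultdict(list)
    for seed in seed_terms:
        for term, weight in profiles.get(seed, {}).items():
            hits[term].append(weight)
    return {
        term: sum(weights) / len(weights)
        for term, weights in hits.items()
        if len(weights) >= min_profiles and term not in seed_terms
    }


def expand_request(request_terms, profiles, min_first, min_second):
    """Steps 1-4: add first and second generation terms to the request."""
    expanded = {term: 1.0 for term in request_terms}  # assumed weight
    first = common_terms(set(request_terms), profiles, min_first)
    for term, weight in first.items():
        expanded.setdefault(term, weight)
    # Step 3: treat first generation terms as though they were request terms.
    second = common_terms(set(first), profiles, min_second)
    for term, weight in second.items():
        expanded.setdefault(term, weight)
    return expanded


def rank_documents(expanded, doc_index):
    """Steps 5-6: sum the weights of matching index terms per document
    and list responding documents in descending order of score."""
    scores = {
        doc: sum(expanded.get(term, 0.0) for term in terms)
        for doc, terms in doc_index.items()
    }
    responding = [(doc, score) for doc, score in scores.items() if score > 0]
    return sorted(responding, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical term profiles: term -> {co-occurring term: association weight}
profiles = {
    "thin": {"film": 0.9, "deposition": 0.6, "evaporation": 0.5},
    "film": {"thin": 0.9, "deposition": 0.8, "substrate": 0.4},
    "deposition": {"thin": 0.6, "film": 0.8, "vacuum": 0.7},
}

# Hypothetical index terms assigned to each document
doc_index = {
    "D1": {"thin", "film"},
    "D2": {"film", "vacuum"},
    "D3": {"deposition", "vacuum"},
    "D4": {"corrosion"},
}

expanded = expand_request(["thin", "film"], profiles, min_first=2, min_second=1)
ranked = rank_documents(expanded, doc_index)
for doc, score in ranked:
    print(doc, round(score, 2))
```

In this toy collection, document D3 is indexed by neither request term, yet
it still responds to the expanded request because it carries associated
terms. Document D4 shares no term with the expanded request and is not listed.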
Some experiments have been made using a computer program which accepts up
to 300 weighted terms in an expanded request vocabulary. Representative results have
been reported, in part, as follows:
"... We asked a qualified engineer to examine these documents and specify which
were related to `Thin Films' and which were not ... This engineer was not
familiar with our project ... yet ... we found a remarkably high correlation between
his evaluation and the document relevance numbers ... We then checked to see how
the documents containing information on `Thin Film' had been indexed. We found
that the first five documents on our list had been indexed by both `Thin' and `Film'.
Three more documents had been indexed by `Film' alone, and other related terms.
Two documents had not been indexed by either `Thin' or `Film', but only by a group
of related terms, yet they contained information on `Thin Films' and had a high
document relevance number. By using association factors and a series of statistical
steps, easily programmed for a computer, we were thus able to locate
1/ These are called "first generation terms" and tend to reflect only statistical
associations without including synonyms and near-synonyms which, over the course of
time, have occurred in the indexing vocabulary.
2/ Stiles, 1961 [571], p. 274: "Among these we find words closely related in meaning
to the request terms." An example given in Ref. [572], pp. 200-201, is the
derivation of `weathering', `fungicidal', `deterioration', and `preservatives' as
second generation terms when the initial request included the terms `plastics',
`fungus', `coating', and `tests'.