NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
The ConQuest System
P. Nelson
National Institute of Standards and Technology
D. K. Harman
strength of each word, frequency in query, expansion terms,
inverse document frequency, and query structure.
Once a document is found using coarse-grain rank, a second
phase of relevancy ranking is applied, called "fine-grain"
rank. This second phase uses a different ranking function
which has access to more local information within the
document. The inputs to this function include all of the
inputs used in coarse-grain ranking, plus word location,
proximity, frequency in document, and document structure.
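The coarse-grain phase described above can be sketched as follows. This is a minimal illustration, not ConQuest's actual formula: the function name, the IDF weighting, and the data layout are all assumptions made for the example.

```python
import math

def coarse_grain_rank(query_terms, doc_term_freqs, doc_freq, num_docs):
    """Sketch of a global (coarse-grain) score: a document scores
    higher the more query terms it contains, each weighted by
    inverse document frequency. Term positions are ignored.

    doc_term_freqs: term -> frequency in this document
    doc_freq:       term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        if doc_term_freqs.get(term, 0) > 0:
            # Rare terms (low document frequency) contribute more.
            score += math.log(num_docs / (1 + doc_freq.get(term, 0)))
    return score
```

A document containing more of the query terms receives a higher coarse-grain score than one containing fewer, regardless of where in the document those terms occur.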
[Figure 2: Fine and Coarse Grain Ranking. The query feeds the coarse-grain ranking phase, which produces a document list; the fine-grain phase then computes the final document rank.]
In general, the coarse-grain rank of a document represents
global information on the document. It is a score that
applies more to the document as a whole. The coarse-grain
rank will be high for a document if it contains a large
number of query words and related terms, ignoring the
position of those terms in the document.
The fine-grain rank, on the other hand, represents local
information, because the proximity (physical closeness) of
the terms is the strongest contributor. The fine-grain rank
of a document will be high if there is a single strong
reference contained in the document.
As shown in Figure 2 above, the final document score is
computed as a combination of the coarse- and fine-grain
scores.
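A fine-grain score driven by proximity, and its combination with the coarse-grain score, might look like the sketch below. The proximity measure, the 1/(1+distance) scoring, and the linear combination weight are assumptions for illustration only; the paper does not give ConQuest's actual functions.

```python
def fine_grain_rank(query_terms, positions):
    """Sketch of a local (fine-grain) score: the closer two distinct
    query terms occur in the document, the higher the score, so a
    single strong reference can dominate.

    positions: term -> sorted list of word offsets in the document
    """
    best = None
    present = [t for t in query_terms if t in positions]
    for i, a in enumerate(present):
        for b in present[i + 1:]:
            for pa in positions[a]:
                for pb in positions[b]:
                    d = abs(pa - pb)
                    if best is None or d < best:
                        best = d
    # No co-occurring pair of query terms -> no proximity evidence.
    return 0.0 if best is None else 1.0 / (1.0 + best)

def final_rank(coarse, fine, alpha=0.5):
    """Final document score as a weighted combination of the two
    phases (the mixing weight alpha is an invented parameter)."""
    return alpha * coarse + (1 - alpha) * fine
```

In this sketch, a document where "oil" and "spill" appear adjacently scores higher on the fine-grain measure than one where they are paragraphs apart, even if both documents have identical coarse-grain scores.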
Pre-TREC Experiments
In preparation for TREC-2, ConQuest performed numerous
experiments to improve the coarse-grain ranking algorithms
and data. These experiments included the following:
1. Statistical word studies (statistical regressions to
predict the probability that a document containing a
word is relevant)
2. Statistical word-pair studies
3. Various weighting formulae
4. Various query structuring techniques
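The first study above estimates how predictive a word is of relevance. A simple empirical version of that estimate is sketched below; the paper mentions statistical regressions, so this direct-frequency estimate (and its smoothing constant) is a simplification made for the example.

```python
def word_relevance_probability(docs_with_word, relevant_docs):
    """Empirical estimate of P(relevant | document contains word):
    the fraction of word-bearing documents judged relevant, with a
    small smoothing constant so unseen words avoid extreme values.

    docs_with_word: set of document ids containing the word
    relevant_docs:  set of document ids judged relevant
    """
    containing = len(docs_with_word)
    hits = len(docs_with_word & relevant_docs)
    return (hits + 0.5) / (containing + 1.0)
```

Words whose presence strongly predicts relevance would then receive higher weights in the coarse-grain ranking formula.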
These studies were all performed under the assumption that
the coarse-grain ranking formula used for TREC was weaker
than the fine-grain ranking formula. The concern was that
coarse-grain ranking did not retrieve a large enough
percentage of relevant documents in the initial retrieval set.
It was thought that once these documents were retrieved, the
fine-grain algorithms would effectively use proximity and
term frequency information to sort the documents and put
all of the truly relevant ones at the top of the list.
Unlike other systems, ConQuest did not have funding for
these TREC studies. This put the TREC studies in direct
conflict with more pressing concerns, such as supporting
customers and providing new functionality like client/server
support.
As a result, the testing from these early studies proved
ambiguous and unreliable. We believe that this was due to
the following:
* Since time and resources were limited, tests were
performed on only a small number of queries (5-10).
This did not provide a large enough sample set of
queries to produce reliable test results.
* ConQuest never tested the original assumption that
coarse-grain was the limiting step in improving
accuracy.
* The queries for this testing were taken from the
TREC-1 final test queries. However, many of these
queries were hastily constructed and thus added noise
to the test results.
Just before the TREC-2 results were due, ConQuest decided
to concentrate most of its effort on improving the tools
used to generate queries. The tools and processes created are
described in the next section.
Generating Queries for TREC-2
Generating queries was primarily an automatic process,
based on the initial TREC-2 topic descriptions. Manual
input was used primarily to remove things: words, word
meanings, and expansions. This produced queries containing
only the relevant terms. If needed, a user could also set
weights for query terms.
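The process above — automatic expansion followed by manual removal and optional weighting — can be sketched as follows. The function name, data structures, and default weight are assumptions for illustration, not ConQuest's actual query format.

```python
def build_query(topic_terms, expansions, removed_terms, weights=None):
    """Sketch of query generation: expand each topic term with its
    related terms, drop anything the user manually struck out, and
    attach an optional per-term weight (defaulting to 1.0).

    topic_terms:   terms drawn from the topic description
    expansions:    term -> list of automatically added related terms
    removed_terms: terms (or expansions) the user chose to remove
    """
    weights = weights or {}
    query = []
    for term in topic_terms:
        for candidate in [term] + expansions.get(term, []):
            if candidate not in removed_terms:
                query.append((candidate, weights.get(candidate, 1.0)))
    return query
```

Because removal and weighting happen before any retrieval, no relevance feedback enters the query, consistent with the ad-hoc rules described above.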
Note that all manual steps were performed for all queries
before any documents were retrieved. In other words, no
feedback information was used in generating the queries.
This makes ConQuest fully compliant with the rules for
ad-hoc queries in TREC-2.
Automatic Query Generation Steps
A special program was created to convert TREC-2 topic
descriptions into ConQuest query log files. The architecture
of this program is shown in Figure 3.