a particular query should match was split across two
document lines, then the match would fail. After the
conference, we found a relatively easy way to rectify this
design deficiency and were able to run tests a33 through
a36, and r6 and r7; a sketch of this line-joining fix
appears after this list. Unfortunately, this change greatly
affected the system's response to the query and negation
thresholds, and we were not able to run enough tests to
find the optimum values for these parameters. The end
result is that the best results we have for this improved
system are still not as good as the official results we
reported.
* Another serious problem with the original system came
to light during the official conference. As Donna Harman
pointed out in her closing talk, no one has yet really
explored the full ramifications of changes in term
weighting strategies. Our original system used a very
simple-minded linear-falloff weighting scheme. Our
assumption was that concept strings appeared in reverse
order of importance. However, we began to suspect,
based on different concepts that were mentioned in a
number of the conference presentations, that this was
overly simplistic. We decided to implement a straight-
forward exponential-decay weighting scheme. In this
approach, the first query string gets a weight of 100, the
second gets a weight of, say, 90, the third a weight of 81,
the fourth a weight of 72, and so on, with each succeed-
ing weight taking 90% of its predecessor's value (see the
weighting sketch after this list). Unfortunately, we did
not have time to tune the system's response to this
change either, and its results (a34, a35, a36, r6, and r7)
are worse than the official results as well. However, it
appears that there is plenty of room for experimentation
with different term weighting schemes, and we will
continue working with them.
* In our early work with the system before sending in the
official results, we did a fair amount of testing using just
the Associated Press document set. We were perhaps
misled by how well the system did on these documents,
and missed some chances to improve the system earlier.
The test labelled a35-AP shows what the results look
like for a35 if we restrict the new system to returning
only AP documents, and restrict the relevance judgements
to just AP documents. Even with this imperfectly tuned
new version, we see that the system is capable of signifi-
cantly better performance. It is unclear why there should
be such variation between the retrievability of the AP
documents and the other document collections.
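The first bullet above mentions the line-boundary fix only in
passing; since the actual repair is not described, the following
minimal sketch shows one plausible approach, assuming the fix is
simply to rejoin wrapped (and hyphen-broken) lines before any
matching is attempted. The function name unwrap_lines and the
hyphen handling are our assumptions, not the system's actual code.

    import re

    def unwrap_lines(text):
        # Rejoin words hyphenated at a line end, e.g.
        # "con-\nference" -> "conference" (an assumed convention).
        text = re.sub(r'-\n\s*', '', text)
        # Collapse remaining line breaks to single spaces so a
        # query string spanning two document lines can still match.
        return re.sub(r'\s*\n\s*', ' ', text)

    print(unwrap_lines("a par-\nticular query\nshould match"))
    # -> "a particular query should match"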
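The exponential-decay weighting scheme in the second bullet
reduces to a one-line computation. Here is a minimal sketch; the
function and parameter names are ours, and note that the exact
fourth weight is 72.9, which the text above rounds to 72.

    def query_weights(n_strings, initial=100.0, decay=0.9):
        # First query string gets `initial`; each succeeding string
        # gets `decay` times its predecessor's weight.
        return [initial * decay ** i for i in range(n_strings)]

    print(query_weights(4))
    # -> [100.0, 90.0, 81.0, 72.9] (up to floating-point rounding)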
At its best, our system performed as well as most of the sys-
tems that participated in TREC-1. However, there is ample
room for improvement, as we have noted above, especially
in comparison to many of the systems that came back for
TREC-2.
5.0 Further Research
The TREC-2 task is the first real application for our
N-gram-based multiple-query system. As in any experiment
of this nature, the results and problems suggest many more
possible avenues of research. These ideas fall into two cate-
gories.
5.1 Analyzing the Current System's
Performance
Further analysis of the existing system will allow us to bet-
ter understand its behavior and limitations. Some ways to do
that include:
* Generating query strings from the topic concept strings
likely limited performance significantly. For example,
Topic 74, about instances where the U.S. government
propounds conflicting policies, completely failed to
mention terms such as policy or regulation in its concept
list. Thus, our system had very little chance of finding
matching documents. Zimmerman's filtering system [4]
did well with handcrafted queries, so we should also try
manually generated queries.
* Currently the system has a hard-coded cutoff threshold
of 40 for the weighted aggregate score, as sketched after
this list. The purpose of the threshold was to prevent the
system from returning results that were guaranteed to be
noise because of their very low scores. This value was set
more or less arbitrarily, so we should experiment with
changing the threshold to determine its true effect. In all
likelihood, it could be a fair amount higher, preventing
the system from generating other useless low-scoring
results.
* Currently the system caps any query string score at three
times the maximum N-gram score. Again, this value was
determined only by a very rough empirical process, so we
should experiment with changing this cap to see how
much impact it has. The sketch after this list illustrates
both this cap and the cutoff threshold above.
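As a concrete illustration of the two limits just described, here
is a minimal sketch of how they might fit into aggregate scoring.
The function name, parameter names, and the exact order of
operations are assumptions, and the interaction with the term
weighting discussed earlier is omitted.

    def aggregate_score(string_scores, max_ngram_score,
                        cutoff=40.0, cap_factor=3.0):
        # Cap each query string score at cap_factor times the
        # maximum N-gram score, then sum; aggregates below the
        # cutoff are discarded as guaranteed noise.
        cap = cap_factor * max_ngram_score
        total = sum(min(score, cap) for score in string_scores)
        return total if total >= cutoff else None  # None: filtered out

    # Example: the 130.0 score is capped at 3 * 40.0 = 120.0.
    print(aggregate_score([55.0, 10.0, 130.0], max_ngram_score=40.0))
    # -> 185.0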
5.2 Extending the System
We can also make some significant changes to the system to
explore possibilities for other performance improvements.
* Currently the system treats upper and lower case alike
for both documents and queries. Since acronyms and
brand names sometimes have different meanings from
uncapitalized words with the same letters, perhaps
there is a way to take the case of letters into account
when computing a match. That is, we could count a