SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Text Retrieval with the TRW Fast Data Finder
chapter
M. Mettler
National Institute of Standards and Technology
Donna K. Harman
When the pipeline's processor cells detect that a series of database characters have passed
by that match the desired pattern, a hit is indicated and passed by external circuiuy back to
the memory of the host processor and to the user. The FDF pipeline runs at a constant speed
as it performs character comparisons and logical operations, regardless of query
complexity. The system we used for the TREC conference searched at 10 MB/sec.
The queries or patterns are specified in the FDF's Pattern Specification Language (PSL).
The hardware directly supports all the features in the PSL query language without the need
for software post-processing. The processors in the pipeline may all be used to evaluate a
single large query or may assigned to evaluate numerous smaller queries. The number of
pipeline cells a query needs is proportional to the size of the query. PSL provides numerous
search functions, which may be nested in any combination, including:
* Boolean logic including negative conditions
* Proximity on any arbitrary pattern
* Wildcards and "don't cares"
* Character alternation
* Term counting, thresholds, and sets
* Error t6lerance (fuzzy matching)
* Term weighting
* Numeric ranges
2.1 Advantages and Disadvantages of Hardware Scanning
There are four principle advantages in using a hardware scanning approach for information
retrieval. First, the FDF can perform pattern matching functions much faster and more
cost-effectively that a general purpose CPU. This benefit comes in part because of the
parallelism of the FDF architecture. Second, a hardware scanner like the FDF can begin
processing the data immediately upon its receipt. There is no need to wait for the data to
be preprocessed or[OCRerr]ndexed before it can be searched. This is especially important for
dissemination (routing) applications. Third, no extra disk space is needed to store inverted
index data or other vector data beyond the text itself. Finally, the system's response time
in evaluating a query is independent of the query's complexity and thus easily predictable.
The FDF can perform fuzzy pattern matching on a term like "krasnoyarsk" with three
missing, incorrect, or extra characters, as easily as performing an exact string match.
There are two disadvantages to the FDF scanning approach. First, it is moderately
expensive to buy the hardware and adapt application programs to work with it. The
approach is therefore not cost effective for low-end applications or systems that don't do
significant amounts of text processing. Second, since the search is not complete until the
entire database has been scanned, the time to complete a single simple query will be greater
than with indexing methods.
3.0 Query Generation Approach
Our objective for the TREC conference was to see if we could utilize the pattern matching
power of the FDF to achieve superior recall and precision. This in turn revolved around
how well we could construct the queries for the FDF to execute. We were able to identify
310