SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Text Retrieval with the TRW Fast Data Finder chapter M. Mettler National Institute of Standards and Technology Donna K. Harman When the pipeline's processor cells detect that a series of database characters have passed by that match the desired pattern, a hit is indicated and passed by external circuiuy back to the memory of the host processor and to the user. The FDF pipeline runs at a constant speed as it performs character comparisons and logical operations, regardless of query complexity. The system we used for the TREC conference searched at 10 MB/sec. The queries or patterns are specified in the FDF's Pattern Specification Language (PSL). The hardware directly supports all the features in the PSL query language without the need for software post-processing. The processors in the pipeline may all be used to evaluate a single large query or may assigned to evaluate numerous smaller queries. The number of pipeline cells a query needs is proportional to the size of the query. PSL provides numerous search functions, which may be nested in any combination, including: * Boolean logic including negative conditions * Proximity on any arbitrary pattern * Wildcards and "don't cares" * Character alternation * Term counting, thresholds, and sets * Error t6lerance (fuzzy matching) * Term weighting * Numeric ranges 2.1 Advantages and Disadvantages of Hardware Scanning There are four principle advantages in using a hardware scanning approach for information retrieval. First, the FDF can perform pattern matching functions much faster and more cost-effectively that a general purpose CPU. This benefit comes in part because of the parallelism of the FDF architecture. Second, a hardware scanner like the FDF can begin processing the data immediately upon its receipt. There is no need to wait for the data to be preprocessed or[OCRerr]ndexed before it can be searched. This is especially important for dissemination (routing) applications. Third, no extra disk space is needed to store inverted index data or other vector data beyond the text itself. Finally, the system's response time in evaluating a query is independent of the query's complexity and thus easily predictable. The FDF can perform fuzzy pattern matching on a term like "krasnoyarsk" with three missing, incorrect, or extra characters, as easily as performing an exact string match. There are two disadvantages to the FDF scanning approach. First, it is moderately expensive to buy the hardware and adapt application programs to work with it. The approach is therefore not cost effective for low-end applications or systems that don't do significant amounts of text processing. Second, since the search is not complete until the entire database has been scanned, the time to complete a single simple query will be greater than with indexing methods. 3.0 Query Generation Approach Our objective for the TREC conference was to see if we could utilize the pattern matching power of the FDF to achieve superior recall and precision. This in turn revolved around how well we could construct the queries for the FDF to execute. We were able to identify 310