NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2), D. K. Harman (ed.), National Institute of Standards and Technology
GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval
P. Jacobs

    sanction == [E(member sanction sanctions disinvestment)
                 <Sullivan Principles>
                 <punitive *2 measures>]

    safrica == [E(member Buthelezi Pretoria anti-apartheid apartheid)
                <De Klerk>
                <South (member Africa African)>]

    ;; rule 1
    $sanction * $safrica => (mark-topic 52)

This description says that any matching text must have both an indicator of South Africa ($safrica) and one of sanctions ($sanction), and that the sanction phrase and the South Africa phrase must appear in the same paragraph of the document. A sanction phrase can be any of the simple words sanction, sanctions, or disinvestment, or any phrase including punitive measures with no more than two intervening words (such as punitive economic measures). A South Africa phrase can likewise be either one of a group of simple words or a phrase, like De Klerk, South Africa, or South African. These queries or topic descriptions can be quite complex, and the method has been designed to handle many queries simultaneously, so the rule compiler is designed to produce expressions that can be applied efficiently within a large set of queries. This is important because many queries can share the same simple terms or combinations of terms, and because the Boolean matcher must match the simplest expressions first.
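The semantics of this rule can be illustrated with a short sketch. The following Python fragment is our own reconstruction, not the GE code: function and variable names are invented, and tokenization is simplified to lowercased whitespace splitting. It shows the two ingredients of the rule, a phrase test that tolerates a bounded number of intervening words (the *2 in <punitive *2 measures>) and a paragraph-level conjunction of the $sanction and $safrica indicator groups.

```python
def phrase_match(words, phrase, max_gap=0):
    """True if the phrase words occur in order with at most max_gap
    intervening words between consecutive phrase words."""
    for start in (i for i, w in enumerate(words) if w == phrase[0]):
        pos, ok = start, True
        for p in phrase[1:]:
            # look only within the allowed window after the previous hit
            window = words[pos + 1 : pos + 2 + max_gap]
            if p not in window:
                ok = False
                break
            pos = pos + 1 + window.index(p)
        if ok:
            return True
    return False

def matches_topic_52(paragraph):
    """Paragraph-level AND of a sanction indicator and a South Africa
    indicator, mirroring rule 1 above (simplified: no punctuation handling)."""
    words = paragraph.lower().split()
    sanction = (
        any(w in words for w in ("sanction", "sanctions", "disinvestment"))
        or phrase_match(words, ("sullivan", "principles"))
        or phrase_match(words, ("punitive", "measures"), max_gap=2)
    )
    safrica = (
        any(w in words for w in ("buthelezi", "pretoria",
                                 "anti-apartheid", "apartheid"))
        or phrase_match(words, ("de", "klerk"))
        or phrase_match(words, ("south", "africa"))
        or phrase_match(words, ("south", "african"))
    )
    return sanction and safrica
```

With this sketch, a paragraph such as "punitive economic measures against South Africa" matches: the two intervening-word allowance admits punitive economic measures, and South Africa supplies the second indicator.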
For the topic description given above, the output of the rule compiler will include the following tests:

    52   TERM AFRICAN
    2029 TERM MEASURES
    2134 TERM SANCTION
    2135 TERM SANCTIONS
    2136 TERM DISINVESTMENT
    2138 TERM SULLIVAN
    2139 TERM PRINCIPLES
    2141 TERM PUNITIVE
    2144 TERM BUTHELEZI
    2145 TERM PRETORIA
    2146 TERM ANTI-APARTHEID
    2147 TERM APARTHEID
    2149 TERM DE
    2150 TERM KLERK
    2152 TERM SOUTH
    2153 TERM AFRICA
    2137 OR 2134 2135 2136
    2140 AND 2138 2139
    2142 AND 2141 2029
    2143 OR 2137 2140 2142
    2148 OR 2144 2145 2146 2147
    2151 AND 2149 2150
    2154 OR 2153 52
    2155 AND 2152 2154
    2156 OR 2148 2151 2155
    2157 AND 2143 2156
    2158 AND 2156 2143
    TOPIC052 OR 2157 2158

Each line in the above data gives a unique number (or topic designator) for the test, a test identifier (either TERM for a simple word test, OR, or AND), and a list of simple terms or previous tests. For example, test 2137 depends on tests 2134, 2135, and 2136, and is true if any of those tests is true, namely, if the text includes any of the words sanction, sanctions, or disinvestment. The tests are automatically ordered so that any test that depends on other tests has a higher number than the tests it depends on; thus all TERM tests appear first. In this case, the TERM test AFRICAN appears with a much lower number simply because it is used in many different queries. The matcher, which can work on either complete documents or paragraphs (we used only paragraph matching in TREC-2), goes through every word in its input and, using a fast table look-up, sets the TERM tests to true for every word it encounters. At the end of input, either the end of the paragraph or the end of each document, it runs through the table of possible tests from low numbers to high numbers and sets tests to true if their conditions are satisfied. A topic test produces a match if it has become true at the end of this process, meaning that the paragraph or document has passed the pre-filter for that query. A single paragraph, of course, can satisfy multiple queries. This portion of the system was implemented in the space of a few days, and is almost entirely the same as in TREC-1.
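The two-pass matcher described above can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the GE implementation: the test table below is a hand-picked subset of the compiled output, and all names are invented. Pass one sets TERM tests by table look-up as words stream in; pass two sweeps the compound tests in ascending numeric order, so the dependency ordering guarantees that every operand has already been evaluated.

```python
# (test_id, op, operands): operands are words for TERM tests,
# lower-numbered test ids for OR/AND tests.
TESTS = [
    (2134, "TERM", ["sanction"]),
    (2135, "TERM", ["sanctions"]),
    (2136, "TERM", ["disinvestment"]),
    (2152, "TERM", ["south"]),
    (2153, "TERM", ["africa"]),
    (2137, "OR",  [2134, 2135, 2136]),   # any sanction word
    (2155, "AND", [2152, 2153]),         # simplified stand-in for $safrica
    (2157, "AND", [2137, 2155]),         # topic 52 pre-filter
]

def run_matcher(paragraph_words):
    value = {tid: False for tid, _, _ in TESTS}
    # build the word -> TERM-test look-up table
    term_index = {}
    for tid, op, operands in TESTS:
        if op == "TERM":
            for w in operands:
                term_index.setdefault(w, []).append(tid)
    # pass 1: fast table look-up sets TERM tests as each word arrives
    for w in paragraph_words:
        for tid in term_index.get(w, []):
            value[tid] = True
    # pass 2: sweep compound tests from low numbers to high numbers;
    # ordering guarantees operands are already evaluated
    for tid, op, operands in sorted(TESTS):
        if op == "OR":
            value[tid] = any(value[t] for t in operands)
        elif op == "AND":
            value[tid] = all(value[t] for t in operands)
    return value

v = run_matcher("sanctions against south africa".split())
# v[2157] is True: the sanction test (2137) and the South Africa
# test (2155) both fired in the same paragraph.
```

Note how the shared TERM tests make this cheap across many simultaneous queries: each input word costs one hash look-up regardless of how many topics reference it, and each compound test is evaluated exactly once per paragraph.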
Our focus since last year has been on query construction and ranking rather than matching or retrieval.

2.2 Query construction

Our approach assumes, in general, that manual query construction is acceptable for routing. In ad hoc retrieval, query time can be of the essence, but in many routing applications, queries are developed and refined over time. The amount of time spent on query construction using a manual method in our system is comparable to the amount of time spent on the topic descriptions used for automatic query generation.