Okapi at TREC-2

S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford

In: D. K. Harman (ed.), The Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215. National Institute of Standards and Technology.
First component

Expanding the first component of equation 9 on the basis of term independence assumptions, and also making the assumption that eliteness is independent of document length (on the basis of the Verbosity hypothesis), we can obtain a formula for the weight of a term $t$ which occurs $tf$ times. This formula is similar to equation 2 in the main text, except that $\lambda$ and $\mu$ are replaced by $\lambda d/\Delta$ and $\mu d/\Delta$. The factors $d/\Delta$ in components such as $\lambda^{tf}$ cancel out, leaving only the factors of the form $e^{-\lambda d/\Delta}$.
Analysis of the behaviour of this function with varying $tf$ and $d$ is a little complex. The simple function used for the experiments (formula 4) exhibits some of the correct properties, but not all. In particular, the maximum value obtained as $d \to 0$ should be strongly dependent on $tf$; formula 4 does not have this property.
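To see this numerically, here is a minimal sketch. It assumes formula 4 has a length-normalised shape of the form $tf/(k_1 d/\Delta + tf)$; the constant $k_1$ and the average document length $\Delta$ used below are invented for illustration, not the paper's tuned values.

    # Sketch only: assume formula 4 has the length-normalised shape
    #   w(tf, d) = tf / (k1 * d / avdl + tf)
    # k1 and avdl (standing in for the average length Delta) are hypothetical.

    def formula4_weight(tf, d, k1=1.2, avdl=500.0):
        """Assumed term-frequency component of the weight for a document of length d."""
        return tf / (k1 * d / avdl + tf)

    # As d -> 0 the weight tends to 1 for every tf >= 1, so the maximum is
    # nearly independent of tf: the missing property noted above.
    for tf in (1, 2, 10):
        print(tf, [round(formula4_weight(tf, d), 3) for d in (500.0, 50.0, 5.0, 0.5)])

Under this assumed shape, the weight approaches 1 as $d \to 0$ for any $tf \geq 1$, whereas under the model the short-document maximum ought to grow with $tf$.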
B Extracts from a searcher's notes
Choice of search terms
Suitable words and phrases occurring in title, description, narrative, concept and definition fields were underlined; often this provided more than enough material to begin with. Sometimes they were supplemented by extra words, e.g. for a query on international terrorism I added "negotiate", "hostage", "hijack", "sabotage", "violence", "propaganda", as well as the names of known terrorist groups likely to fit the US bias of the exercise.
I did not look at reference books or other on-line databases, and tended to avoid very specific terms like proper names from the query descriptions, as I found they could lead the search astray. For instance, the 1986 Immigration Law was also known as the Simpson-Mazzoli Act, but the name Mazzoli also turned up in accounts of other pieces of legislation, so it was better to use a combination of "real" words about this topic.
In some queries, it was necessary to translate an abstract concept, e.g. "actual or alleged private sector economic consequences of international terrorism", into words which might actually occur in documents, e.g. "damage", "insurance claims", "bankruptcy", etc. For this purpose the use of a general (rather than domain-specific) thesaurus might be a useful adjunct to the system.
Like the other participants I was surprised at the contents of the stop-word list, e.g. "talks", "recent", "people", "new", but not "these"! However, it was usually possible to find synonyms for stop-words and their absence was not seriously detrimental to any query.
Grouping of terms, use of operators
Given the complexity of the queries, it was obviously necessary to build them up from smaller units. My original intention was to identify individual facets and create sets of single words representing each, then put them together to form the whole query. [...] For example, for a query about
the prevention of nuclear proliferation I had a set of "nuclear" words (reprocessing, plutonium, etc.), a set of "control" words (control, monitor, safeguards, etc.) and sets of words for countries (argentina, brazil, iraq, etc.) suspected of violating international regulations on this point. This proved a bad strategy: the large sets (whether ORed or BMed [7] together) had low weightings because of their collectively high frequencies, and the final query was very diffuse.
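The diffuseness has a simple arithmetic cause: presence weights fall as the number of documents a set matches grows. As a rough illustration only, the sketch below uses a standard Robertson/Sparck Jones-style weight with an invented collection size and document frequencies; it is not the paper's exact weighting function.

    import math

    N = 750_000  # hypothetical collection size

    def presence_weight(n):
        """RSJ-style presence weight for a term or set matching n of N documents."""
        return math.log((N - n + 0.5) / (n + 0.5))

    print(round(presence_weight(800), 2))      # tight phrase set: ~6.84
    print(round(presence_weight(150_000), 2))  # broad ORed facet: ~1.39

A small phrase set matches few documents and earns a high weight, while a facet of ORed common words matches a large fraction of the collection and contributes little.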
A more successful approach was to build several small, high-weighted sets using phrases with OP=ADJ or OP=SAMES[entence] (e.g. economic trends, gross national product, standard of living, growth rate, productivity gains), and then to BM them together, perhaps with a few extra singletons (e.g. decline, slump, recession). Because of the TREC guidelines, I didn't look at any documents for the small sets as I went along, although under normal circumstances I would have done so.
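For the combination step, a weighted best-match operation scores a document by the weights of whichever sets it matches; no set is mandatory. The following sketch illustrates only that idea, with invented set weights and a crude all-terms-present matching test that ignores adjacency; the actual default operation was BM15 (see footnote 7 and Section 2.6).

    def best_match_score(doc_terms, weighted_sets):
        """Sum the weights of the query sets fully present in the document."""
        return sum(w for terms, w in weighted_sets if terms <= doc_terms)

    sets = [
        ({"gross", "national", "product"}, 6.8),  # small ADJ phrase set
        ({"standard", "living"}, 5.9),            # another phrase set
        ({"recession"}, 3.1),                     # extra singleton
    ]
    doc = {"the", "gross", "national", "product", "fell", "sharply"}
    print(best_match_score(doc, sets))  # 6.8 -- only the first set matches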
Our initial instructions were to use default best-matching if at all possible, rather than explicit operators. As already suggested, ADJ and SAMES were an absolute necessity given the length of documents to be searched, but AND and OR were generally avoided; on the occasions when I tried AND (out of desperation) it was not particularly useful. For one query where I thought it might be necessary (to restrict a search to documents about the US economy) it luckily proved superfluous because of the biased nature of the database; indeed, it would have made the results worse, as the US context of these documents was implied rather than stated.
Viewing results, relevance feedback
Normally I looked at about the top 5-10 records from the first full query. If 40% or more seemed relevant, the query was considered to be fairly satisfactory and I went on down the list trying to accumulate a dozen or so records for the extraction phase. As ... noted by other participants, there was a conflict between judging a record relevant because it fitted the query, and because it was likely to yield useful new terms for the next phase. On the one hand were the "newsbyte" type of documents containing one clearly relevant paragraph amidst a great deal of potential noise, and on the other the documents which were in the right area, contained all the right words, but failed the more abstract exclusion conditions of the query. I tried to judge on query relevance, but erred on the side of permissiveness for documents containing the right sort of terms.
The competition conditions discouraged a really thorough exploration of possibilities when a query was not initially successful. In one very bad case, having seen more than 20 irrelevant records and knowing that they would appear at the head of my output list, I felt that the query would show up badly in the [results] anyway and that it was not worth exploring further, as I might have done had there been a real question to answer.
7BM = "best match"; the default weighted set combination
operation was BM15 (see Section 2.6)