SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text chapter A. Kent A. Moffat R. Sacks-Davis R. Wilkinson J. Zobel National Institute of Standards and Technology Donna K. Harman associated information. The result of applying these techniques to the TREC data is described in Section 2.5. 2.1 Text databases Text databases provide access to text documents on the basis of their content. We assume that access is via the inverted file indexing scheme we have described elsewhere [16]. In this scheme, each distinct word in the database is held in a vocabulary, which may be an array, or may be a search structure such as a B-tree. For each word in the vocabulary the inverted index stores an index entry, a list of the identifiers of documents containing the word. Interleaved with the identifiers are the number of occurrences of each word in each document. Since the index entries contain ordinal document identifiers rather than disk addresses, a document mapping is needed to convert identifiers into addresses. This will require either a non-trivial amount of memory space, or must be stored on disk in an auxiliary file. Similarly, an index mapping is required to turn an ordinal word number into the address of the corresponding index entry. Query [[OCRerr]bularY Inverted --------- Index L Search __ Entries Algorithm mApproximate Lengths [OCRerr]: Ranking Algorithm F;Hidexm Mapping Compression L;-iodel Compressed Text Answers [OCRerr]Compression: Algorithm Figure 1: Organisation of text databases ~ment LL-jrmation Other components of the database are the compress[OCRerr]on model, used for text compression, and the document lengths, used for ranking. These are discussed later. In our experiments, the vocabulary and compression model were held in memory, together with an array of approximate lengths, also described below; all of the other structures were stored on disk. Figure 1 shows the relationship between these various components. 2.2 Query evaluation Given an inverted file index of the format described, it is straightforward to rank the documents in a database with respect to an informally phrased query, returning, say, the top ten documents as answers. As an example, consider the cosine measure, one of the most effective ranking 230