SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
An Information Retrieval Test-bed on the CM-5
chapter
B. Masand
C. Stanfill
National Institute of Standards and Technology
D. K. Harman
the index. Again, there are many variants on inverted file
representations [10].
Inverted files have proved the only technique which sup-
ports interactive access to very large databases; this is prima-
rily due to the excessive I/O requirements of the other
methods. Within the family of inverted file methods, several
variants are possible:
* Simple Inverted Files. In a simple inverted file, the index
consists of a list of documents in which each word occurs.
* Inverted Files with Weights. In an inverted file with
weights, the above information is supplemented with
weights, in support of vector retrieval methods.
* Structured Inverted Files. In a structured inverted file,
the index captures structural information (typically the
paragraph/sentence/word coordinates of each occurrence
of each term), in support of boolean retrieval methods
which rely on proximity operators (e.g. word adjacency).
In this research, we have decided to explore the use of
structured inverted files for the following reasons:
* They support the proximity operations required as part of
boolean retrieval systems. This support makes architec-
tures based on structured inverted files more likely to be
adopted by the major on-line vendors.
* They provide a flindamentally richer representation of
document structure than is available with the other meth-
ods.
* They are collection-independent and retrieval-method
independent.
This last point requires some explanation. In a distributed
environment, it would be useful to be able to search text files
at multiple locations as if they were a single text file. Once
weights - which are both collection-dependent and retrieval-
method dependent - are incorporated into the index, trans-
parent distributed access becomes impossible. The product of
this is that the results of a single query applied to multiple
databases cannot be meanin[OCRerr]ly combined, since there is no
way to compare the ranks or scores applied to the documents
returued.
One of the primary challenges associated with the use of
structured inverted files is their bulk: there are approximately
as many index entries in the inverted file as there are words in
the database, and each entry must represent a document, para-
graph, sentence, and word coordinates for that word. This
problem has been substantially solved by use of a novel coin-
bination of compression techniques [11], which allow struc-
tured indexes having a bulk on the order of 1/3 the size of the
full text to be constructed.
The second challenge associated with the use of structured
inverted files is to implement the standard information
retrieval model using them. The results of this effort are
reported in this paper.
118
The final challenge associated with structured inverted files
is using them to implement methods which go beyond the
standard model, taking into account the added richness the
structured representation to improve retrieval system peffor-
mance for databases having long, structured documents. This
final challenge remains a topic for future research, and there
are no results to report at this time.
B. The Database Architecture
The database consists of the following structures:
* A Compressed Structured Posting File. The first compo-
nent of the database is an array of compressed postings.
Each posting gives the location of a word expressed as a
document-paragraph-sentence-word 4-tuple. The postings
are sorted by word D, in ascending document order.
A Lexicon/Index. The second component of the database
is a lexicon[mdex. For each word in the database, it stores
the number of times itoccurs plus the location of its post-
ings in the posting file. The lexicon is structured so that the
information pertaining to a given word may be qulckly
located.
* Document Information. The final component of the data-
base is an array of document-related information. Each
entry contains a mapping from an internal document iden-
tifier (an integer) to an external document identifier (a
string). Additional information, such as the length of the
document or normalization information, may be added as
necessary.
The following operations are supported:
* Adding Postings. A 5-tuple consisting of a word plus its
four coordinates may be added to the database.
* Extracting Postings. A word is presented to the database.
An array having the decompressed coordinates of all
occurrences of that word is returned. The array is sorted by
ascending document coordinate, with paragraph, sentence,
and word coordinates used as additional sort keys as
needed.
* Extracting Lexicon Information. Lexicon information
such as the number of occurrences of a word may be
extracted.
* Setting Lexicon Information. Supplemental lexicon
information (e.g. part-of-speech information) may be
added to the lexicon. This feature is not used in the work
reported herein.
* Delining a Document. Defines document-specific infor-
mation such as external document identifier, document
location on disk, etc., and associates it with a document
coordinate.
* Extracting Document Information. Extracts the docu-
ment-specific information mentioned above, given a docu-
ment coordinate.