SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Single Language Evaluation of a Multi-lingual Text Retrieval System
chapter
T. Dunning
M. Davis
National Institute of Standards and Technology
Donna K. Harman
to search for phrases, or to find instances of words appearing near each other. We plan in the
near future to alter one of the position indices to store sentence position instead of word position
relative to the beginning of a file. In other CRL systems, the ability to determine whether words
appear in the same sentence has proved very useful.
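As an illustration of how word-position indices support phrase and proximity searches, the
following is a minimal sketch and not the CRL implementation. The names build_positions and
near, the in-memory dictionary layout, and the whitespace tokenizer are all assumptions made
for the example.

```python
from collections import defaultdict


def build_positions(docs):
    """Build a toy positional index: word -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def near(index, a, b, window=1, ordered=False):
    """Return sorted doc ids in which a and b occur within `window` words.

    With ordered=True, b must follow a; window=1 with ordered=True is a
    two-word phrase search.
    """
    docs_a = index.get(a, {})
    docs_b = index.get(b, {})
    hits = []
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        found = False
        for pa in docs_a[doc_id]:
            for pb in docs_b[doc_id]:
                gap = pb - pa
                if not ordered:
                    gap = abs(gap)
                if 0 < gap <= window:
                    found = True
                    break
            if found:
                break
        if found:
            hits.append(doc_id)
    return hits


docs = {
    1: "multi lingual text retrieval system",
    2: "retrieval of text",
    3: "text and image retrieval",
}
index = build_positions(docs)
phrase_hits = near(index, "text", "retrieval", window=1, ordered=True)
nearby_hits = near(index, "text", "retrieval", window=3)
```

Storing sentence numbers instead of word offsets in one of the indices would turn the same
comparison into a same-sentence test (equal positions rather than a small gap).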
3. Indexing Operations
Indexing for the Multi-Lingual Text-Retrieval system consists of a 4-pass process. The
passes include tokenization, relabelling, sorting and stripping.
3.1. Tokenization
Tokenization in English consists of putting each word in the input text on a single line
along with positional and document information. Kill list processing is done at this stage, as
is the creation of a word and document table. Lemmatization and suffix stripping can be done
at this stage, but no savings result, since this merely increases the average length of posting
vectors during retrieval. Further, once lemmatization or suffix stripping has been applied at
indexing time, it cannot be avoided later.
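The tokenization pass described above can be sketched roughly as follows. This is a hedged
illustration only: the tab-separated line layout, the sample kill list, the use of two separate
tables, and the choice to count positions before kill-list filtering are assumptions, not details
given in the text.

```python
import re

# Illustrative kill list only; the actual list used by the system is not given.
KILL_LIST = {"the", "a", "of", "and", "in", "to"}


def tokenize(docs, kill_list=KILL_LIST):
    """Emit one 'word<TAB>doc_id<TAB>position' line per indexed word.

    docs is an iterable of (document name, text) pairs. Returns the lines
    plus a word table and a document table mapping names to integer ids.
    """
    word_table = {}
    doc_table = {}
    lines = []
    for name, text in docs:
        doc_id = doc_table.setdefault(name, len(doc_table))
        words = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(words):
            if word in kill_list:
                # Assumption: killed words still consume a position, so
                # distances between the surviving words are preserved.
                continue
            word_table.setdefault(word, len(word_table))
            lines.append(f"{word}\t{doc_id}\t{pos}")
    return lines, word_table, doc_table


lines, word_table, doc_table = tokenize([("d1", "The index of the file")])
```

No lemmatization or suffix stripping is applied here, in line with the argument that doing so
at indexing time cannot be undone later.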
In Japanese, tokenization down to the level of strings of consecutive kanji or katakana is
supported, as well as tokenization down to individual kanji and strings of katakana. Since
strings of hiragana typically are used for inflection, they are ignored. This results in some
words being ignored, since there are words in Japanese (mostly verbs) which are written in
hiragana. Using kanji strings results in indexing a large number of phrases, while indexing each
kanji ideograph separately results in an increasing number of phrase searches later in the
retrieval process, with an accompanying drop in performance.
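The two Japanese tokenization levels can be sketched with Unicode block ranges. This is a
minimal sketch under stated assumptions: the original system worked on encodings such as JIS,
and the function names and the specific ranges (CJK Unified Ideographs U+4E00 to U+9FFF for
kanji, U+30A0 to U+30FF for katakana) are choices made for the example.

```python
import re

KANJI = r"\u4e00-\u9fff"      # CJK Unified Ideographs block (assumed range)
KATAKANA = r"\u30a0-\u30ff"   # Katakana block, including the long-vowel mark

# Runs of consecutive kanji, or runs of consecutive katakana.
RUNS = re.compile(f"[{KANJI}]+|[{KATAKANA}]+")
# Individual kanji, but katakana still kept as whole runs.
SINGLES = re.compile(f"[{KANJI}]|[{KATAKANA}]+")


def tokenize_runs(text):
    """Tokenize into strings of consecutive kanji or katakana."""
    return RUNS.findall(text)


def tokenize_single_kanji(text):
    """Tokenize into individual kanji and strings of katakana."""
    return SINGLES.findall(text)


sample = "東京でコンピュータを買った"
run_tokens = tokenize_runs(sample)
single_tokens = tokenize_single_kanji(sample)
hiragana_only = tokenize_runs("する")
```

Hiragana never matches either pattern, so a word written only in hiragana, such as the verb in
the last line, yields no tokens at all, which is the loss described above.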
Preprocessing to mark and normalize company, place and person names and dates is best
included in the tokenization process. We plan to incorporate elements of CRL's Tipster
Extraction software into this program at some time to investigate the impact this has on
performance, but we have not done so at the present time.
Indices for fielded data can be created by using a conventionalized word location such as
0. This approach allows all such data to be handled in a uniform manner.
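One way to read the fielded-data convention is sketched below. It is an assumption-laden
illustration: the text does not say how field names and values are encoded as index terms, so
the "name=value" term format, the postings function, and the numbering of body words from 1
are all hypothetical.

```python
FIELD_POSITION = 0  # conventionalized word location reserved for fielded data


def postings(doc_id, fields, body):
    """Return (term, doc_id, position) tuples for one document.

    Field values are posted at position 0; body words are numbered from 1
    (an assumption) so that they never collide with the field location.
    """
    out = [
        (f"{name}={value}".lower(), doc_id, FIELD_POSITION)
        for name, value in sorted(fields.items())
    ]
    out += [
        (word, doc_id, pos)
        for pos, word in enumerate(body.lower().split(), start=1)
    ]
    return out


example = postings(7, {"source": "WSJ"}, "Text retrieval")
```

Because field entries have the same shape as ordinary word entries, the later sorting and
retrieval passes need no special case for them.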
3.2. Relabelling
The output of the tokenization process is still ASCII or some form of extended ASCII
such as Unicode or JIS. The next step is the conversion to a uniform binary form which assists