Indexing

Indexing is the process that creates an Index from Document objects. An Index is a structure that matches data sources to contained data, and vice versa. For example, when indexing text in documents, a given document will be considered a source, while a term in this text will be called a feature. For a given source, the index will be able to retrieve all the features that are contained in the source. From a feature, it may find every source that contains it.

Each document is indexed by using reflection to find its constituent field-level data elements (e.g., title, author, etc.). Then each such data element is asked for its indexing features, which are derived by getIndexingFeatures() methods for the data element (e.g., DeString, DeInteger, etc.) in terms of which the field has been defined. The indexing features are then stored in the index.

The indexingModalityUnit class captures information used to control indexing: the document feature (field) to be indexed, the name of the index being created for this field, and the indexer to use.

The Index class and its structure

To hide data and algorithms within an Index, it is decomposed into several provider classes, and it is specialized (not shown here) by two subclasses representing different types of Indexes (Keyword-matching and IDF-based) where the weight of features and scores are computed in different way. Here is the hierarchy:

Index
IdxIntern
PersistentDualKeyContainer
PersistenIrfHashtable

The main inner object of an Index is an instance of a class called IdxIntern. It provides all the indexing methods an actual Index needs, the class Index being used only as a "face-plate", explaining all methods.

Within the IdxIntern the index is modeled by a class called PersistentDualKeyContainer (PDKC), which in turn comprises a pair of special Hashtables that support access to indexing features by source document and by data element (e.g., a term from a text document). A way existed to use regular Java Hashtables, but efficiency matters and persistence issues led us to define our own Hashtable. As it is persistent and mainly meets IRF needs, it is called PersistentIrfHashtable. The provided versions of these classes use buffered random access files to support persistence.

We assumed there are only a few indexes in use at any one time in most research IR applications and that an Index shouldn't go to and from disk too often, as it is a big structure. Thus, IRF currently uses standard Java serialization to store Indexes (in a lightweight state) on disk. This way, even if it is a bit slower than if we had tuned the mechanism as we did for every other object, it saved us a lot of development time and we could be sure of the right behavior of the persistence for these objects from the very beginning. Standard serialization may be a lasting solution to store indexes on disk for some applications.

At the lowest level is the PersistentIrfHashtable. It was first designed to be a two level hashing mechanism, allowing only parts of the table to be in memory at one moment, and thus saving memory space. Because of that, it implements its own persistence mechanism based on Java serialization.


National Institute of Standards and Technology Home Last updated: Tuesday, 01-Aug-2000 06:34:24 MDT

Date created: Monday, 31-Jul-00
For further information contact Paul Over (over@nist.gov) with
copy to Darrin Dimmick (ddimmick@nist.gov)