Internals of the PersistentDualKeyContainer (PDKC)

This class is the heart of the Index. It stores all the information about the database in a structure that supports both retrieval and computation, so that scores for documents and features can be calculated, stored, and retrieved in one place.
Several supporting classes are needed for this purpose; each is described below, from the smallest to the largest:

IoAddrIntern

This class is the internal representation of a data source. It holds a reference to a Document (the source) and two variables: the number of features the source contains and the score assigned to the source by the chosen kind of Index. Its two important methods are hashCode() and equals(), both of which are used by the hashtable mechanism described below. hashCode() returns a hash code based only on the document that the object represents. equals() is written to make it as easy as possible to compare Document instances with their internal representations (such as IoAddrInterns).
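The behavior described above can be illustrated with a minimal sketch. The class name SourceKey, its field names, and the use of a plain Object as the document are assumptions made for illustration; they are not taken from the actual IoAddrIntern source.

```java
import java.util.Hashtable;

// Hypothetical stand-in for IoAddrIntern; all names here are assumptions.
final class SourceKey {
    final Object document;   // the wrapped source document
    int featureCount;        // number of features found in this source
    double score;            // score assigned by the chosen kind of index

    SourceKey(Object document) {
        this.document = document;
    }

    @Override
    public int hashCode() {
        // Depends only on the document, so the counter and the score can
        // change without moving the entry to a different bucket.
        return document.hashCode();
    }

    @Override
    public boolean equals(Object other) {
        if (other instanceof SourceKey) {
            return document.equals(((SourceKey) other).document);
        }
        // Also accept the bare document. This is deliberately asymmetric
        // (the document does not know about SourceKey), which bends the
        // usual equals() contract for the sake of convenient lookups.
        return document.equals(other);
    }
}

class SourceKeyDemo {
    public static void main(String[] args) {
        Hashtable<SourceKey, String> table = new Hashtable<>();
        SourceKey stored = new SourceKey("doc-42");
        stored.featureCount = 17;
        table.put(stored, "feature list for doc-42");

        // A fresh wrapper around an equal document finds the same entry.
        System.out.println(table.get(new SourceKey("doc-42")));
        // The wrapper also compares equal to the bare document.
        System.out.println(stored.equals("doc-42"));
    }
}
```

Because the hash code ignores featureCount and score, both can be updated while the object is in use as a key.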

DeIntern

This class is the analogue of IoAddrIntern for data elements, i.e., pieces of a document rather than whole sources. It contains a reference to a data element and the two variables needed by the indexing mechanisms: a score and the number of different sources in which the data element can be found. In methods such as hashCode() and equals(), a DeIntern can often be used in place of the data element it contains.

HashBlock

A HashBlock is a small Hashtable that manages its own reading from and writing to disk. It can only store ProxyFeatureLists as values, associated with keys that are DeInterns or IoAddrInterns. Unlike a Hashtable, it covers only a chunk of the hashcode space, but only its client class, PersistentIrfHashtable, is aware of this.

PersistentIrfHashtable

This class implements a classic hashtable with added functionality. First, it divides the table into chunks that can be stored on disk so that they do not fill up memory; each chunk is a HashBlock. Second, it implements the getActualSource() method, which is at the center of the indexing mechanism.
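One way to picture the division into chunks is the sketch below. The class BlockedTable, the fixed block count, and the modulo mapping from hash code to block are assumptions; the real PersistentIrfHashtable may partition the hashcode space differently. Disk storage and getActualSource() are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a table split into blocks of the hash code space.
final class BlockedTable<K, V> {
    private final int blockCount;
    // Only blocks that have been touched are held in memory.
    private final Map<Integer, Map<K, V>> loadedBlocks = new HashMap<>();

    BlockedTable(int blockCount) {
        if (blockCount <= 0) {
            throw new IllegalArgumentException("blockCount must be positive");
        }
        this.blockCount = blockCount;
    }

    // Assumed mapping: floorMod keeps the index non-negative even for
    // negative hash codes.
    int blockIndexFor(Object key) {
        return Math.floorMod(key.hashCode(), blockCount);
    }

    V put(K key, V value) {
        Map<K, V> block =
            loadedBlocks.computeIfAbsent(blockIndexFor(key), i -> new HashMap<>());
        return block.put(key, value);
    }

    V get(Object key) {
        Map<K, V> block = loadedBlocks.get(blockIndexFor(key));
        return block == null ? null : block.get(key);
    }

    int loadedBlockCount() {
        return loadedBlocks.size();
    }
}
```

The blocks themselves never need to know which part of the hash space they cover; only the enclosing table computes the block index, which matches the division of responsibility described for HashBlock.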

PersistentDualKeyContainer

The core of IdxIntern, this class combines two hashtables to keep track of every indexing feature and where it comes from, and it prepares the job for retrieval by "sorting" values and organizing them in feature lists. At first, it was designed to be generic, so that anything could be stored in it and the keys could be of any type; the class actually complied with the Dictionary interface. For the sake of efficiency and persistence, however, it is now very specific to the framework. It defines a contract, and anyone wishing to use, for example, a different persistence scheme will have to define a class obeying the same contract. Its interface has therefore become more and more specific, as have the interfaces underneath it (PersistentIrfHashtable, IoAddrIntern, DeIntern, ...).
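The idea of combining two hashtables can be sketched as follows. This is an illustration under assumptions, not the actual PDKC: the class DualKeyIndex, its method names, and the use of plain strings for sources and data elements are all hypothetical, and scoring and feature lists are omitted.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical dual-key container: every (source, element) pair is
// recorded in two tables so it can be reached from either side.
final class DualKeyIndex {
    private final Map<String, Set<String>> elementsBySource = new HashMap<>();
    private final Map<String, Set<String>> sourcesByElement = new HashMap<>();

    void add(String source, String element) {
        elementsBySource.computeIfAbsent(source, k -> new LinkedHashSet<>()).add(element);
        sourcesByElement.computeIfAbsent(element, k -> new LinkedHashSet<>()).add(source);
    }

    Set<String> elementsOf(String source) {
        return elementsBySource.getOrDefault(source, Collections.emptySet());
    }

    Set<String> sourcesOf(String element) {
        return sourcesByElement.getOrDefault(element, Collections.emptySet());
    }

    // Corresponds to the feature counter kept for each source.
    int featureCount(String source) {
        return elementsOf(source).size();
    }

    // Corresponds to the "number of different sources" kept for each element.
    int sourceCount(String element) {
        return sourcesOf(element).size();
    }
}
```

Keeping both directions up to date on every insertion costs extra memory, but it lets either kind of key be looked up directly.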

PDKC and persistence

The PDKC class is (de)serialized by the IdxIntern using Java serialization (writeObject()/readObject()). IdxIntern uses the same methods to (de)serialize the two contained PersistentIrfHashtable objects. The latter contain methods to manage the file-based persistence of the HashBlocks they contain. The PersistentIrfHashtable class computes the name of the file in which the needed HashBlock is stored. A static method in HashBlock (readFrom(BufferedRandomAccessFile)) then reads the block from the given file if it is available; otherwise it creates the file and assumes the block is empty, since it has never been accessed. The block itself provides file-based persistence, with help from the PersistentObjectManager class, for reading and writing the DeInterns, IoAddrInterns and ProxyFeatureLists to which the block refers. If need be, the block can be asked to store itself on disk.
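The "read the block if its file exists, otherwise create an empty one" step can be sketched as below. The sketch substitutes standard java.io object streams and java.nio.file for the project's BufferedRandomAccessFile and PersistentObjectManager, and the class BlockStore and the file naming scheme "block-N.ser" are assumptions; the source does not say how file names are computed.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical block store; a HashMap stands in for a HashBlock.
final class BlockStore {
    private final Path directory;

    BlockStore(Path directory) {
        this.directory = directory;
    }

    // Assumed naming scheme, for illustration only.
    Path fileFor(int blockIndex) {
        return directory.resolve("block-" + blockIndex + ".ser");
    }

    @SuppressWarnings("unchecked")
    HashMap<String, String> load(int blockIndex)
            throws IOException, ClassNotFoundException {
        Path file = fileFor(blockIndex);
        if (!Files.exists(file)) {
            // Never accessed before: create the file and treat the block as empty.
            HashMap<String, String> empty = new HashMap<>();
            save(blockIndex, empty);
            return empty;
        }
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    // Writes the whole block to its file, replacing any previous contents.
    void save(int blockIndex, HashMap<String, String> block) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(fileFor(blockIndex)))) {
            out.writeObject(block);
        }
    }
}
```

A caller would load a block, modify it in memory, and call save() when the block has to be evicted or the index is closed.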

Date created: Monday, 31-Jul-00
For further information contact Paul Over ([email protected]) with
copy to Darrin Dimmick ([email protected])