This class is the heart of the Index. It organizes all the information
about the database in a complex structure from which this information
can be retrieved and on which computations can be carried out. Scores
for documents and features can thus be calculated, stored and retrieved
in this structure.
Several classes are needed for this purpose; each of them is described
here, from the smallest to the largest:
This class is the internal representation of a data source. It mainly
contains a reference to a Document (the source) and two variables: the
number of features that this source contains and the score assigned to
this source by the kind of Index chosen. The two important methods in
this class are hashcode() and equals(). Both are used in the hashtable
mechanism described below. The first returns a hashcode based only on
the document that the object represents. The second makes it as easy as
possible to compare Document instances with their internal
representations (such as IoAddrInterns).
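
The hashing and comparison behaviour described above can be illustrated
with a minimal Java sketch. All names in it (SketchDocument,
SketchSourceIntern, the address field) are hypothetical stand-ins, not
the framework's actual classes, and it uses the standard Java method
name hashCode(). It only shows the idea: the hash depends on the
document alone, and equals() accepts either another internal
representation or a bare document.

```java
// Hypothetical stand-in for the framework's Document type.
final class SketchDocument {
    final String address;

    SketchDocument(String address) { this.address = address; }

    @Override public boolean equals(Object other) {
        return other instanceof SketchDocument
                && ((SketchDocument) other).address.equals(address);
    }

    @Override public int hashCode() { return address.hashCode(); }
}

// Hypothetical sketch of an internal representation of a data source.
final class SketchSourceIntern {
    final SketchDocument source; // the document this object represents
    int featureCount;            // number of features found in the source
    double score;                // score assigned by the chosen index

    SketchSourceIntern(SketchDocument source) { this.source = source; }

    // The hash depends only on the document, never on score or count.
    @Override public int hashCode() { return source.hashCode(); }

    // Equal to another representation of the same document,
    // and also to the bare document itself.
    @Override public boolean equals(Object other) {
        if (other instanceof SketchSourceIntern) {
            return source.equals(((SketchSourceIntern) other).source);
        }
        return source.equals(other);
    }

    public static void main(String[] args) {
        SketchDocument doc = new SketchDocument("doc-1");
        SketchSourceIntern intern = new SketchSourceIntern(doc);
        System.out.println(intern.equals(doc));
        System.out.println(intern.hashCode() == doc.hashCode());
    }
}
```

Note that in this sketch the comparison is not symmetric: the document's
own equals() does not recognize the internal representation. Whether a
lookup with a bare document succeeds therefore depends on which side of
the comparison the table calls equals() on.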
This second class is analogous to the first one, but for data
elements, i.e., pieces of a document, rather than whole sources. It
just contains a reference to a data element and the two variables
needed by the indexing mechanisms: a score and the number of different
sources in which this data element can be found. A DeIntern can often
be used in place of the data element it contains, in methods such as
hashcode() and equals().
A HashBlock is a small Hashtable that manages its own writing to and reading from disk. It can only store ProxyFeatureLists as values, associated with keys that are DeInterns or IoAddrInterns. Unlike a Hashtable, it deals with only a chunk of the hashcode space, but only its client class, PersistentIrfHashtable, is aware of this.
This class implements a classic hashtable with added functionality. First, it divides the table into chunks that can be stored to disk so that they do not fill up memory; each chunk is a HashBlock. Second, it implements the getActualSource() method, which is at the center of the indexing mechanism.
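
One way to picture the division of the hashcode space into chunks is the following Java sketch. It is an assumption, not the framework's actual scheme: the class BlockLocatorSketch and its methods are hypothetical, and the text does not say whether the chunks are contiguous ranges (as here) or are chosen some other way, for instance by a modulo.

```java
// Hypothetical sketch: map a key's hash code to one of a fixed number
// of equally sized, contiguous chunks of the 32-bit hash code space.
final class BlockLocatorSketch {
    private final int blockCount;

    BlockLocatorSketch(int blockCount) {
        if (blockCount <= 0) {
            throw new IllegalArgumentException("blockCount must be positive");
        }
        this.blockCount = blockCount;
    }

    // Shift the signed hash into the range [0, 2^32), then scale it
    // down to a block index in the range [0, blockCount).
    int blockIndexForHash(int hash) {
        long offset = (long) hash - Integer.MIN_VALUE;
        return (int) ((offset * blockCount) >>> 32);
    }

    // Convenience overload for arbitrary keys.
    int blockIndexFor(Object key) {
        return blockIndexForHash(key.hashCode());
    }

    public static void main(String[] args) {
        BlockLocatorSketch locator = new BlockLocatorSketch(4);
        System.out.println(locator.blockIndexFor("some feature"));
    }
}
```

With such a mapping, only the client of the blocks needs to know how the space is divided; each block simply stores the keys it is handed.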
The core of IdxIntern, this class combines two hashtables to keep track of every indexing feature and where it comes from, and prepares the work for retrieval by "sorting" values and organizing them in feature lists. It was originally designed to be generic, so that anything could be stored in this object and the keys could be of any type; the class actually complied with the Dictionary interface. For the sake of efficiency and persistence, however, this class is now very specific to the framework. It defines a contract, and anyone wishing to use a different persistence scheme, for example, will have to define a class obeying the same contract. Its interface has thus become more and more specific, and so have the interfaces underneath (PersistentIrfHashtable, IoAddrIntern, DeIntern, ...).
The PDKC class is (de)serialized by the IdxIntern using Java serialization (write/readObject()). IdxIntern uses the same methods to (de)serialize the two contained PersistentIrfHashtable objects. The latter contain methods to manage the file-based persistence of the HashBlocks they contain. The PersistentIrfHashtable class computes the name of the file in which the needed HashBlock is stored. A static method in HashBlock (readFrom(BufferedRandomAccessFile)) then reads the block from the given file if it is available; otherwise it creates the file and assumes that the block is empty, since it has never been accessed. The block itself provides file-based persistence, with help from the PersistentObjectManager class, for reading and writing the DeInterns, IoAddrInterns and ProxyFeatureLists to which the block refers. If need be, the block can be asked to store itself on disk.
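
The read-or-create behaviour described above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the framework's code: BlockStoreSketch, the "block-N.ser" file naming and the String-to-Double block contents are all hypothetical, and plain Java object streams stand in for BufferedRandomAccessFile and PersistentObjectManager.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Hashtable;

// Hypothetical sketch of file-based persistence for hash blocks.
final class BlockStoreSketch {
    private final File directory;

    BlockStoreSketch(File directory) { this.directory = directory; }

    // Helper for demos and tests: keep the block files in a temp directory.
    static BlockStoreSketch inTempDirectory() {
        try {
            return new BlockStoreSketch(Files.createTempDirectory("blocks").toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compute the name of the file holding a given block (assumed naming).
    File fileFor(int blockIndex) {
        return new File(directory, "block-" + blockIndex + ".ser");
    }

    // Read the block if its file has content; otherwise create the file
    // and treat the block as empty, since it was never accessed before.
    @SuppressWarnings("unchecked")
    Hashtable<String, Double> readOrCreate(int blockIndex) {
        File file = fileFor(blockIndex);
        try {
            if (!file.exists() || file.length() == 0) {
                file.createNewFile();
                return new Hashtable<>();
            }
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (Hashtable<String, Double>) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Store the block on disk, replacing any previous contents.
    void store(int blockIndex, Hashtable<String, Double> block) {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(fileFor(blockIndex)))) {
            out.writeObject(block);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BlockStoreSketch store = inTempDirectory();
        Hashtable<String, Double> block = store.readOrCreate(0);
        block.put("feature", 0.25);
        store.store(0, block);
        System.out.println(store.readOrCreate(0));
    }
}
```

The sketch wraps checked I/O exceptions in unchecked ones only to keep the calling code short; a real implementation would likely propagate or handle them explicitly.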