gov.nist.nlpir.irf.index.braf
Class PersistentDualKeyContainer

java.lang.Object
  |
  +--gov.nist.nlpir.irf.index.braf.PersistentDualKeyContainer

public class PersistentDualKeyContainer
extends java.lang.Object
implements java.io.Serializable

This class is the "heart" of the Index. It mainly contains two PersistentIrfHashtables of feature lists. The first one, called valuesBySource, stores all the indexing features found, classified by source. It allows the user to retrieve, for example, all the features of a given document. The second one, called sourcesByValue, stores the same features classified by value, whatever their source is. It allows the user to retrieve all the sources in which a given feature appears.

Each entry in a table corresponds to a list of features. In the first table, the vector contains all the features found for the source entry. In the second table, the vector contains all the sources in which the entry feature can be found. The features are shared by the lists, i.e. each feature belongs to two lists, one in each table.

An example of using the class is given below:

 // The directory and index name below are placeholders
 PersistentDualKeyContainer setIndexingFeature = new PersistentDualKeyContainer("myDirectory", "myIndex");
 
 // Add an element to DualKeyContainer using two keys
 setIndexingFeature.put(source, key, value);
 // More precise example
 setIndexingFeature.put("doc1", "word1", "word1InDoc1");
 setIndexingFeature.put("doc1", "word2", "word2InDoc1");
 setIndexingFeature.put("doc2", "word1", "word1InDoc2");
 // We now have:
 // setIndexingFeature.getSourceVector("word1") == {"word1InDoc1", "word1InDoc2"}
 // setIndexingFeature.getFeatureVector("doc1") == {"word1InDoc1", "word2InDoc1"}

 // But you will actually have to define more precise objects:
 // the keys have to be DeIntern and the sources have to be
 // IoAddrIntern, the values have to be ProxyIndexingFeature.
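
The two-table layout described above can be modelled in memory as follows. This is only a sketch: DualKeySketch is a hypothetical stand-in built on java.util collections, it has no persistence, and it uses plain Strings where the real class expects IoAddrIntern, DeIntern and ProxyIndexingFeature objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the dual-key layout; not the NIST class.
class DualKeySketch<S, F, V> {
    // Lists of values grouped by source key (cf. valuesBySource).
    private final Map<S, List<V>> valuesBySource = new HashMap<>();
    // Lists of values grouped by feature key (cf. sourcesByValue).
    private final Map<F, List<V>> sourcesByValue = new HashMap<>();

    // Adds the same value object to one list in each table, so it is shared.
    void put(S source, F feature, V value) {
        valuesBySource.computeIfAbsent(source, k -> new ArrayList<>()).add(value);
        sourcesByValue.computeIfAbsent(feature, k -> new ArrayList<>()).add(value);
    }

    // All values stored under the given feature key; empty if there are none.
    List<V> getSourceVector(F feature) {
        return sourcesByValue.getOrDefault(feature, new ArrayList<>());
    }

    // All values stored under the given source key; empty if there are none.
    List<V> getFeatureVector(S source) {
        return valuesBySource.getOrDefault(source, new ArrayList<>());
    }

    public static void main(String[] args) {
        DualKeySketch<String, String, String> c = new DualKeySketch<>();
        c.put("doc1", "word1", "word1InDoc1");
        c.put("doc1", "word2", "word2InDoc1");
        c.put("doc2", "word1", "word1InDoc2");
        System.out.println(c.getSourceVector("word1"));  // [word1InDoc1, word1InDoc2]
        System.out.println(c.getFeatureVector("doc1"));  // [word1InDoc1, word2InDoc1]
    }
}
```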
 

The comments of this class define a contract that any class used to replace this one must conform to. One may wish to replace the current PersistentDualKeyContainer class because it embeds its own persistence mechanism. In that case, rather than modifying it, the easiest approach may be to write an entirely new class that exposes the same interface. If parts of PersistentDualKeyContainer are to be reused, the way those parts are currently managed (reference counting, collection, shutdown issues) must be studied closely in order to avoid side effects, even though most of these mechanisms are commented. The contract is defined only for public methods, because the interface does not constrain the inner mechanisms of an implementation.

Version:
$Revision: 1.8 $
Author:
This software was produced by NIST, an agency of the U.S. government, and by statute is not subject to copyright in the United States. Recipients of this software assume all responsibilities associated with its operation, modification and maintenance.
See Also:
DualKeyContainer, Serialized Form

Field Summary
private  java.lang.String DB_Directory
          Directory containing files that comprise the PDKC
private static int FEATURE_VECTOR_CAPACITY_INCREMENT
           
private static int FEATURE_VECTOR_INITIAL_CAPACITY
           
private  java.lang.Object lastSource
           
private  ProxyFeatureList lastValues
          Caching mechanism objects: This cache is only used by put().
private  java.lang.String poolNameSbV
          Name of the pool file for sources-by-value
private  java.lang.String poolNameVbS
          Name of the pool file for values-by-source
(package private) static long serialVersionUID
          serial version universal id - put here so Java does not insert one which may change due to revisions and make it impossible to deserialize earlier versions of serialized objects
private static int SOURCE_VECTOR_CAPACITY_INCREMENT
           
private static int SOURCE_VECTOR_INITIAL_CAPACITY
           
private  PersistentIrfHashtable sourcesByValue
          Table of features accessed by feature
private  int sourcesNumber
          Size of valuesBySource (number of distinct sources)
private  int uniqueValuesNumber
          Size of sourcesByValue (number of distinct features)
private  PersistentIrfHashtable valuesBySource
          Table of features accessed by document
private  int valuesNumber
          Number of values stored
 
Constructor Summary
PersistentDualKeyContainer(java.lang.String DB_Directory, java.lang.String indexName)
          Constructor for PersistentDualKeyContainer.
Contract: This constructor is only used by IdxIntern.
 
Method Summary
 void clear()
          Clears both tables.
 java.util.Enumeration elements()
          Returns all the features stored.
Contract: The Enumeration returned contains all the ProxyIndexingFeatures stored in the PDKC.
 java.lang.Object getActualFeature(java.lang.Object feature)
          When a value is stored, it appears in a FeatureList corresponding to its feature (a key of a hashtable).
 java.lang.Object getActualSource(java.lang.Object source)
          When a value is stored, it appears in a FeatureList corresponding to its source (a key of a hashtable).
 java.util.Vector getAllValues()
          Returns a Vector containing all the values stored in the PersistentDualKeyContainer.
Contract: This Vector is the concatenation of all Vectors or Lists of ProxyIndexingFeatures that may be found in the PDKC.
 int getFeatureBinCount()
          Returns the length of the base array in the PersistentIrfHashtable by feature.
 ProxyFeatureList getFeatureVector(java.lang.Object sourceKey)
          Contract: The returned ProxyFeatureList must contain every ProxyIndexingFeature that was stored in the PDKC with a sourceKey matching the one given in this method's parameter.
 int getNumberOfSourcesFor(java.lang.Object feature)
          Gives the number of sources containing the given feature.
 int getNumberOfValuesFor(java.lang.Object source)
          Gives the number of features stored for the given source.
 int getSourceBinCount()
          Returns the length of the base array in the PersistentIrfHashtable by source.
 java.util.Enumeration getSources()
          Returns the enumeration of Objects used as sources in this PersistentDualKeyContainer.
 int getSourcesNumber()
          Gives the number of sources in the DualKeyContainer.
Contract: Each time a source is added to the PDKC, the sourcesNumber variable must be increased (see put()).
 ProxyFeatureList getSourceVector(java.lang.Object featureKey)
          Returns the ProxyFeatureList associated with the given feature key.
Contract: The returned ProxyFeatureList must contain every ProxyIndexingFeature that was stored in the PDKC with a featureKey matching the one given in this method's parameter.
 int getUniqueValuesNumber()
          Returns the number of different values stored in the PersistentDualKeyContainer.
Contract: The name of this method is a bit ambiguous.
 java.util.Enumeration getValues()
          Returns the enumeration of Objects used as features for this PersistentDualKeyContainer.
Caution: these are the actual keys, i.e. an object used as a feature may not be returned if another object equal to the first one (considering hashCode() and equals()) had already been used as a feature for this DualKeyContainer.
 int getValuesNumber()
          Gives the total number of features stored in the table.
Contract: Each time a feature is added to the PDKC, the valuesNumber variable must be increased (see put()).
 boolean isEmpty()
           
 void put(java.lang.Object sourceKey, java.lang.Object featureKey, ProxyIndexingFeature proxyObject)
          Puts a ProxyIndexingFeature instance in the container.
 void showHashtableStatistics()
          To retrieve the statistics concerning the hashtables inside the PDKC.
 void showStatistics(int depth, int maxLengthIfDepth3)
          Prints statistics about the PersistentDualKeyContainer, i.e. its size and the size of its elements.
private  void showStats(PersistentIrfHashtable table, int depth, int maxLengthIfDepth3)
          Prints statistics for ONE HVtable.
 java.lang.String toString()
          Classic representation method.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, wait, wait, wait
 

Field Detail

serialVersionUID

static final long serialVersionUID
serial version universal id - put here so Java does not insert one which may change due to revisions and make it impossible to deserialize earlier versions of serialized objects
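
The role of an explicit serialVersionUID, together with the transient modifier used on the cache fields of this class, can be illustrated with a small round trip through Java serialization. SerialSketch is a hypothetical stand-in and the value 1L is an arbitrary placeholder, not the value used by PersistentDualKeyContainer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in class; field names only echo the ones documented here.
class SerialSketch implements Serializable {
    // Declared explicitly so that recompiling the class does not change it.
    static final long serialVersionUID = 1L;

    String poolName;               // written to the stream
    transient Object lastSource;   // skipped by default serialization

    SerialSketch(String poolName, Object lastSource) {
        this.poolName = poolName;
        this.lastSource = lastSource;
    }

    // Serializes the object to a byte array and reads it back.
    static SerialSketch roundTrip(SerialSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (SerialSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SerialSketch copy = roundTrip(new SerialSketch("pool", "doc1"));
        System.out.println(copy.poolName);    // pool
        System.out.println(copy.lastSource);  // null
    }
}
```

After deserialization the transient field is null, so any cache held in such a field has to be rebuilt lazily.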

sourcesByValue

private PersistentIrfHashtable sourcesByValue
Table of features accessed by feature

valuesBySource

private PersistentIrfHashtable valuesBySource
Table of features accessed by document

DB_Directory

private java.lang.String DB_Directory
Directory containing files that comprise the PDKC

poolNameSbV

private java.lang.String poolNameSbV
Name of the pool file for sources-by-value

poolNameVbS

private java.lang.String poolNameVbS
Name of the pool file for values-by-source

valuesNumber

private int valuesNumber
Number of values stored

sourcesNumber

private int sourcesNumber
Size of valuesBySource (number of distinct sources)

uniqueValuesNumber

private int uniqueValuesNumber
Size of sourcesByValue (number of distinct features)

SOURCE_VECTOR_INITIAL_CAPACITY

private static int SOURCE_VECTOR_INITIAL_CAPACITY

SOURCE_VECTOR_CAPACITY_INCREMENT

private static int SOURCE_VECTOR_CAPACITY_INCREMENT

FEATURE_VECTOR_INITIAL_CAPACITY

private static int FEATURE_VECTOR_INITIAL_CAPACITY

FEATURE_VECTOR_CAPACITY_INCREMENT

private static int FEATURE_VECTOR_CAPACITY_INCREMENT

lastValues

private transient ProxyFeatureList lastValues
Caching mechanism objects: This cache is only used by put(). It dramatically improves performance.

lastSource

private transient java.lang.Object lastSource
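
How put() uses lastSource and lastValues is not spelled out on this page. The sketch below shows one plausible reading, under the assumption that the cache remembers the list of the most recently used source so that consecutive calls for the same source skip the table lookup. CachedPutSketch and its tableLookups counter are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical one-entry cache in front of a by-source table.
class CachedPutSketch {
    private final Map<Object, List<Object>> valuesBySource = new HashMap<>();
    private Object lastSource;          // source used by the previous put()
    private List<Object> lastValues;    // list belonging to lastSource
    int tableLookups;                   // how often the table was consulted

    void put(Object source, Object value) {
        // Consult the table only when the source differs from the cached one.
        if (lastValues == null || !source.equals(lastSource)) {
            tableLookups++;
            lastValues = valuesBySource.computeIfAbsent(source, k -> new ArrayList<>());
            lastSource = source;
        }
        lastValues.add(value);
    }

    int sizeFor(Object source) {
        List<Object> list = valuesBySource.get(source);
        return list == null ? 0 : list.size();
    }

    public static void main(String[] args) {
        CachedPutSketch c = new CachedPutSketch();
        c.put("doc1", "a");
        c.put("doc1", "b");
        c.put("doc1", "c");
        c.put("doc2", "d");
        System.out.println(c.tableLookups);  // 2
    }
}
```

Such a cache pays off when features arrive grouped by document, which is the usual order while indexing one document at a time.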
Constructor Detail

PersistentDualKeyContainer

public PersistentDualKeyContainer(java.lang.String DB_Directory,
                                  java.lang.String indexName)
Constructor for PersistentDualKeyContainer.
Contract: This constructor is only used by IdxIntern. Thus, its signature may change if you need to provide it with more parameters, since switching to another class will require a change in the client class anyway (here, IdxIntern). As a classic constructor, this method is responsible for initializing the entire structure of the PDKC.
Parameters:
DB_Directory - the name of the directory
indexName - name of the index this PDKC supports, or the common stem of the names of the DB files.
See Also:
IdxIntern
Method Detail

getSourceVector

public final ProxyFeatureList getSourceVector(java.lang.Object featureKey)
Returns the ProxyFeatureList associated with the given feature key.
Contract: The returned ProxyFeatureList must contain every ProxyIndexingFeature that was stored in the PDKC with a featureKey matching the one given in this method's parameter. A key is considered to match another one if both return the same result from hashCode() and equals() returns true when called on either of them with the other as argument.
Parameters:
featureKey - value used to search the container.
Returns:
a list of lightweight proxies for indexing features. The list will never be null. If no feature matches the value, the returned list is empty.
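
The matching rule and the never-null guarantee can be sketched with a key type whose equality is based on content. TermKey is a hypothetical stand-in for DeIntern, and LookupSketch replaces the persistent table with a java.util.HashMap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical key: two instances match when they wrap the same text.
final class TermKey {
    private final String text;

    TermKey(String text) { this.text = text; }

    @Override public boolean equals(Object o) {
        return o instanceof TermKey && ((TermKey) o).text.equals(text);
    }

    @Override public int hashCode() { return text.hashCode(); }
}

class LookupSketch {
    private final Map<TermKey, List<String>> byFeature = new HashMap<>();

    void add(TermKey key, String value) {
        byFeature.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Never returns null: an unknown key yields an empty list.
    List<String> lookup(TermKey key) {
        List<String> found = byFeature.get(key);
        return found == null ? Collections.<String>emptyList() : found;
    }

    public static void main(String[] args) {
        LookupSketch s = new LookupSketch();
        s.add(new TermKey("word1"), "word1InDoc1");
        // A different but equal key instance finds the same list.
        System.out.println(s.lookup(new TermKey("word1")));   // [word1InDoc1]
        System.out.println(s.lookup(new TermKey("absent")));  // []
    }
}
```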

getFeatureVector

public final ProxyFeatureList getFeatureVector(java.lang.Object sourceKey)
Contract: The returned ProxyFeatureList must contain every ProxyIndexingFeature that was stored in the PDKC with a sourceKey matching the one given in this method's parameter. A key is considered to match another one if both return the same result from hashCode() and equals() returns true when called on either of them with the other as argument.
Parameters:
sourceKey - source used to search the container.
Returns:
a list of proxies for indexing features. As for getSourceVector, the list will never be null.

put

public void put(java.lang.Object sourceKey,
                java.lang.Object featureKey,
                ProxyIndexingFeature proxyObject)
Puts a ProxyIndexingFeature instance in the container. It may then be retrieved either by source with an equivalent IoAddrIntern or (Proxy)Document or by value with an equivalent DeIntern or (Proxy)DataElem. For equivalence of DEs/DeInterns and IoAddrIntern/Documents, see those classes.
Contract: If ProxyFeatureLists are used underneath, they can be made lightweight right after they are created, so that they do not load memory (their mechanism allows them to grow without being in memory; they only need to be loaded to be read, not to be extended). This method must maintain three variables: sourcesNumber, uniqueValuesNumber and valuesNumber. The first two are incremented by one when sourceKey and featureKey, respectively, are encountered for the first time. The third must be incremented at every call, because each call results in one extra feature in the PersistentDualKeyContainer.
Parameters:
sourceKey - an IoAddrIntern
featureKey - a DeIntern
proxyObject - Must be a ProxyIndexingFeature. Declared as an Object only for compliance with DualKeyContainer.
See Also:
DeIntern, IoAddrIntern, Document, DataElem
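
The counter rules of this contract can be sketched as follows. CountingSketch is a hypothetical in-memory stand-in: it keeps plain Objects in java.util maps and ignores the persistence and lightweight-list aspects entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in that only demonstrates the three counters.
class CountingSketch {
    private final Map<Object, List<Object>> valuesBySource = new HashMap<>();
    private final Map<Object, List<Object>> sourcesByValue = new HashMap<>();
    private int sourcesNumber;
    private int uniqueValuesNumber;
    private int valuesNumber;

    void put(Object sourceKey, Object featureKey, Object value) {
        List<Object> bySource = valuesBySource.get(sourceKey);
        if (bySource == null) {              // source seen for the first time
            bySource = new ArrayList<>();
            valuesBySource.put(sourceKey, bySource);
            sourcesNumber++;
        }
        List<Object> byFeature = sourcesByValue.get(featureKey);
        if (byFeature == null) {             // feature seen for the first time
            byFeature = new ArrayList<>();
            sourcesByValue.put(featureKey, byFeature);
            uniqueValuesNumber++;
        }
        bySource.add(value);
        byFeature.add(value);
        valuesNumber++;                      // incremented at every call
    }

    int getSourcesNumber() { return sourcesNumber; }
    int getUniqueValuesNumber() { return uniqueValuesNumber; }
    int getValuesNumber() { return valuesNumber; }

    public static void main(String[] args) {
        CountingSketch c = new CountingSketch();
        c.put("doc1", "word1", "word1InDoc1");
        c.put("doc1", "word2", "word2InDoc1");
        c.put("doc2", "word1", "word1InDoc2");
        System.out.println(c.getSourcesNumber());       // 2
        System.out.println(c.getUniqueValuesNumber());  // 2
        System.out.println(c.getValuesNumber());        // 3
    }
}
```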

toString

public java.lang.String toString()
Classic representation method.
Overrides:
toString in class java.lang.Object

getValuesNumber

public final int getValuesNumber()
Gives the total number of features stored in the table.
Contract: Each time a feature is added to the PDKC, the valuesNumber variable must be increased (see put()). This method returns the value of this variable.
Returns:
The number of values in the table.

isEmpty

public final boolean isEmpty()
Returns:
true if no key table has been initialized.
Contract: Returns true if no feature is stored in the PDKC, i.e. it has just been created or emptied.

elements

public final java.util.Enumeration elements()
Returns all the features stored.
Contract: The Enumeration returned contains all the ProxyIndexingFeatures stored in the PDKC. The order isn't important.
See Also:
DualKeyContainer.put(java.lang.Object, java.lang.Object, java.lang.Object)

getAllValues

public java.util.Vector getAllValues()
Returns a Vector containing all the values stored in the PersistentDualKeyContainer.
Contract: This Vector is the concatenation of all Vectors or Lists of ProxyIndexingFeatures that may be found in the PDKC.
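
A minimal sketch of such a concatenation is shown below. It assumes that only one of the two tables is traversed, so that each feature appears once in the result even though it belongs to two lists. AllValuesSketch and its concatenate() helper are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Vector;

// Hypothetical helper: flattens the lists of ONE table into a single Vector.
class AllValuesSketch {
    static <K, V> Vector<V> concatenate(Map<K, List<V>> table) {
        Vector<V> all = new Vector<>();
        for (List<V> list : table.values()) {
            all.addAll(list);   // append each per-key list in iteration order
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, List<String>> bySource = new LinkedHashMap<>();
        bySource.put("doc1", Arrays.asList("word1InDoc1", "word2InDoc1"));
        bySource.put("doc2", Arrays.asList("word1InDoc2"));
        System.out.println(concatenate(bySource).size());  // 3
    }
}
```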

showStatistics

public void showStatistics(int depth,
                           int maxLengthIfDepth3)
Prints statistics about the PersistentDualKeyContainer, i.e. its size and the size of its elements. Keys are also printed if their length is less than 15.
Parameters:
depth - 1, prints the size of the contained vectors for each hash table,
2, gives for each vector its key and the number of its elements,
3, gives the key of each vector and prints its content, each element truncated to 25 characters if longer,
4, is a special case of 2, i.e. only the sizes of the vectors are printed, one after another.
Any other value will result in an empty display.

showStats

private void showStats(PersistentIrfHashtable table,
                       int depth,
                       int maxLengthIfDepth3)
Prints statistics for ONE HVtable.
Parameters:
table - The hashtable of vectors to be presented.
depth - Same as showStatistics() depth.

showHashtableStatistics

public void showHashtableStatistics()
To retrieve the statistics concerning the hashtables inside the PDKC.

getSourcesNumber

public final int getSourcesNumber()
Gives the number of sources in the DualKeyContainer.
Contract: Each time a source is added to the PDKC, the sourcesNumber variable must be increased (see put()). This method returns the value of this variable. "A source is added" means a ProxyIndexingFeature is given for storage with a sourceKey that cannot be found yet as being a source for at least one ProxyIndexingFeature.

getUniqueValuesNumber

public final int getUniqueValuesNumber()
Returns the number of different values stored in the PersistentDualKeyContainer.
Contract: The name of this method is a bit ambiguous. It must return the uniqueValuesNumber variable. This variable counts the number of different features in the PDKC, i.e. the number of different featureKeys put() has been called with.

getSources

public final java.util.Enumeration getSources()
Returns the enumeration of Objects used as sources in this PersistentDualKeyContainer. Caution: these are the actual keys, i.e. an object used as a source may not be returned if another object equal to the first one (considering hashCode() and equals()) had already been used as a source for this PersistentDualKeyContainer. Basically, it's an enumeration of IoAddrInterns.
Contract: Every different IoAddrIntern used as a source for put() must be present in the resulting Enumeration. Two IoAddrInterns are "different" if equals() returns false when called on one of them with the other as argument.

getValues

public final java.util.Enumeration getValues()
Returns the enumeration of Objects used as features for this PersistentDualKeyContainer.
Caution: these are the actual keys, i.e. an object used as a feature may not be returned if another object equal to the first one (considering hashCode() and equals()) had already been used as a feature for this DualKeyContainer. In that case, the earlier object is the one returned. It is an enumeration of DeInterns.
Contract: Same as getSources(), except it returns an Enumeration of DeInterns and not IoAddrInterns.

getSourceBinCount

public final int getSourceBinCount()
Returns the length of the base array in the PersistentIrfHashtable by source.
Returns:
count of bins in the valuesBySource table

getFeatureBinCount

public final int getFeatureBinCount()
Returns the length of the base array in the PersistentIrfHashtable by feature.
Returns:
count of bins in the sourcesByValue table

getActualSource

public final java.lang.Object getActualSource(java.lang.Object source)
When a value is stored, it appears in a FeatureList corresponding to its source (a key of a hashtable). But the key in the hashtable may not be the object the value was stored with. Given an object used as a source, this method returns the key actually held in the hashtable. This way, information about the FeatureList of a source (like its size, ...) can be stored in the key used to access this list.
Contract: This method embodies a problem addressed several times: two IoAddrInterns may return true for equals() and still be different objects. The only condition for this is that they "represent" the same IRF_Document. But IoAddrInterns contain more data than just a reference to a document. Thus, to gather information in the IoAddrIntern concerning a document, this method must be called to ensure the correct IoAddrIntern is actually used.
See Also:
PersistentIrfHashtable.getActualKey(java.lang.Object)
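
The difference between an equal key and the actual key can be sketched as follows. DocKey is a hypothetical stand-in for IoAddrIntern, ActualKeySketch replaces the persistent hashtable with a java.util.HashMap, and returning null for an unknown key is a choice of this sketch, not documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key: equal when the document id matches, but each instance
// carries extra data that equals() and hashCode() ignore.
final class DocKey {
    final String docId;
    int featureCount;

    DocKey(String docId) { this.docId = docId; }

    @Override public boolean equals(Object o) {
        return o instanceof DocKey && ((DocKey) o).docId.equals(docId);
    }

    @Override public int hashCode() { return docId.hashCode(); }
}

class ActualKeySketch {
    private final Map<DocKey, DocKey> keys = new HashMap<>();

    // Keeps the first instance seen for each group of equal keys.
    void register(DocKey key) { keys.putIfAbsent(key, key); }

    // Returns the stored instance equal to the probe, or null if none.
    DocKey getActualKey(DocKey probe) { return keys.get(probe); }

    public static void main(String[] args) {
        ActualKeySketch table = new ActualKeySketch();
        DocKey stored = new DocKey("doc1");
        DocKey probe = new DocKey("doc1");   // equal, but a different object
        table.register(stored);
        table.register(probe);               // ignored: an equal key exists
        table.getActualKey(probe).featureCount++;
        System.out.println(stored.featureCount);  // 1
        System.out.println(probe.featureCount);   // 0
    }
}
```

Updating the instance returned by the lookup, rather than the probe, is what keeps per-key information in one place.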

getActualFeature

public final java.lang.Object getActualFeature(java.lang.Object feature)
When a value is stored, it appears in a FeatureList corresponding to its feature (a key of a hashtable). But the key in the hashtable may not be the object the value was stored with. Given an object used as a feature, this method returns the key actually held in the hashtable. Thus, information about the values stored for a feature can be stored in the key (the feature).
Contract: The problem to understand here is the same as in getActualSource(): DeInterns and DEs can be considered equal as soon as they represent the same data, not necessarily in the same place. The storage in the PDKC must take this into account.
See Also:
PersistentIrfHashtable.getActualKey(java.lang.Object)

clear

public void clear()
Clears both tables.

getNumberOfValuesFor

public final int getNumberOfValuesFor(java.lang.Object source)
Gives the number of features stored for the given source.
Parameters:
source - the source for which the number of values will be computed.
Returns:
The number of values,
0 if the source doesn't appear.

getNumberOfSourcesFor

public final int getNumberOfSourcesFor(java.lang.Object feature)
Gives the number of sources containing the given feature.
Parameters:
feature - the feature for which the number of sources will be computed.
Returns:
The number of sources,
0 if the feature doesn't appear.