gov.nist.nlpir.irf.index
Class IdfIdxIntern

java.lang.Object
  |
  +--gov.nist.nlpir.irf.index.IdxIntern
        |
        +--gov.nist.nlpir.irf.index.IdfIdxIntern

public class IdfIdxIntern
extends IdxIntern

This class extends IdxIntern to support indexes based around the Inverse Document Frequency (IDF) term weighting measure. Note: The Math.log() function returns the natural log ("ln" or log to the base e) of its argument (see java.lang.Math.log()). We have used it here for the sake of performance, though traditionally IR applications use log base 2.
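The two logarithm bases mentioned in the note differ only by a constant factor, since log2(x) = ln(x) / ln(2). The following is a minimal sketch of that conversion; the class name LogBaseSketch and the helper log2 are hypothetical and are not part of this package.

```java
// Hypothetical illustration only: not part of gov.nist.nlpir.irf.index.
final class LogBaseSketch {

    /** Converts a natural logarithm into a base-2 logarithm. */
    static double log2(double x) {
        // Math.log() returns ln(x); dividing by ln(2) changes the base to 2.
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        double x = 8.0;
        System.out.println("ln(" + x + ")   = " + Math.log(x));
        System.out.println("log2(" + x + ") = " + log2(x));
    }
}
```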

Version:
$Revision: 1.10 $
Author:
This software was produced by NIST, an agency of the U.S. government, and by statute is not subject to copyright in the United States. Recipients of this software assume all responsibilities associated with its operation, modification and maintenance.
See Also:
Serialized Form

Field Summary
private static int numCalcDocumentScoreThreads
          Number of threads to be started for calculation of document scores
private static int numCalcFeatureScoreThreads
          Number of threads to be started for calculation of feature scores
 
Fields inherited from class gov.nist.nlpir.irf.index.IdxIntern
flagUpToDate, indexingFeatures, serialVersionUID
 
Constructor Summary
IdfIdxIntern(java.lang.String loc, java.lang.String name)
          Basic constructor with the Index location and name.
 
Method Summary
 void calcCombinedScore(ResultForDocMatchingQueryCondition aResult)
          Compensates for differences in the scores of retrieved features, e.g. in the case where the retrieved features are not exact matches for the query feature.
 void calcDocumentScore(IoAddrIntern anIO_Addr)
          Computes the document score and sets it in the passed IoAddrIntern.
 void calcFeatureScore(DeIntern aDEI)
          Calculates the score of the given feature/DeIntern and sets it in the DeIntern (key to the sources-by-value table).
 void calcInputForRSV(ResultForDocMatchingQueryCondition aResult)
          Sets the fields of aResult that are needed to compute the Retrieval Status Value (RSV).
 double calcRSV(ResultForDocMatchingQueryModalityUnit aResult)
          Calculates the Retrieval Status Value (RSV) of a single modality unit of the result of a query.
 void calcWeightOfFeatureInDocument(ResultForDocMatchingQueryCondition aResult)
          Computes the weight of the given feature for the documents it appears in and sets it in the result.
 void calcWeightOfFeatureInQuery(ResultForDocMatchingQueryCondition aResult)
          Calculates the weight of the feature in the query and sets it in the result.
 java.lang.String toString()
          Returns a string representation of this index
 void update()
          Updates the index after it has been created.
 void updateFeatures()
          Computes the scores of the stored features in one or more separate threads.
 void updateSources()
          Computes the scores of the stored documents in one or more separate threads.
 
Methods inherited from class gov.nist.nlpir.irf.index.IdxIntern
addIndexingFeatures, calcNrOfSources, clear, evalQueryCondition, evalQueryCondition, evalQueryModalityUnit, existsSource, getActualSource, getFeature, getFeature, getFeatures, getFeatures, getFeatures, getFeatures, getFeaturesFromEnumeration, getFeaturesNumber, getFlagUpToDate, getIFsWithFeature, getIFsWithFeature, getIFsWithSource, getIndexingFeatures, getNrOfFeatures, getNrOfFeatures, getNrOfIndexingFeatures, getNrOfSources, getNrOfSources, getNrOfUniqueFeatures, getNrOfUniqueSources, getNrOfUniqueSources0, getSources, getSources, getSourcesNumber, init, makeQueryConditions, makeResultForDocMatchingQueryCondition, makeResultsForDocMatchingQueryModalityUnit, present, presentFeatures, presentSources, presentStatistics
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, wait, wait, wait
 

Field Detail

numCalcFeatureScoreThreads

private static final int numCalcFeatureScoreThreads
Number of threads to be started for calculation of feature scores

numCalcDocumentScoreThreads

private static final int numCalcDocumentScoreThreads
Number of threads to be started for calculation of document scores
Constructor Detail

IdfIdxIntern

public IdfIdxIntern(java.lang.String loc,
                    java.lang.String name)
Basic constructor with the Index location and name.
Parameters:
loc - index location info
name - index name
Method Detail

toString

public java.lang.String toString()
Returns a string representation of this index
Returns:
a list of the features contained, with their description, the number of docs containing them and their score. If no feature is present, returns "no Features".
Overrides:
toString in class java.lang.Object

update

public void update()
Updates the index after it has been created. The index is not updated each time an object is added to it; deferring the update is more efficient when several objects are added before the index is used.
Overrides:
update in class IdxIntern

updateFeatures

public void updateFeatures()
Computes the scores of the stored features in one or more separate threads. The scores are stored in the DeIntern keys.
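One way to spread score calculation over several threads is sketched below. This is an assumption about the general technique, not the actual implementation of updateFeatures(): the class ThreadedScoreSketch, the method scoreAll and the plain arrays standing in for the DeIntern keys are all hypothetical. Each worker handles every numThreads-th feature and writes only to its own array slots, so no locking is needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: arrays stand in for the index's DeIntern keys.
final class ThreadedScoreSketch {

    /** Computes ln(N / n) + 1 for every document frequency n, using numThreads workers. */
    static double[] scoreAll(int[] docFrequencies, long totalDocs, int numThreads) {
        if (numThreads < 1) {
            throw new IllegalArgumentException("numThreads must be >= 1");
        }
        double[] scores = new double[docFrequencies.length];
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            final int offset = t;
            // Worker t scores features offset, offset + numThreads, ...
            Thread worker = new Thread(() -> {
                for (int i = offset; i < docFrequencies.length; i += numThreads) {
                    scores[i] = Math.log((double) totalDocs / docFrequencies[i]) + 1.0;
                }
            });
            workers.add(worker);
            worker.start();
        }
        // Wait for all workers; join() also makes their writes visible here.
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while scoring", e);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        double[] scores = scoreAll(new int[] {10, 100, 1000}, 1000, 2);
        for (double s : scores) {
            System.out.println(s);
        }
    }
}
```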

updateSources

public void updateSources()
Computes the scores of the stored documents in one or more separate threads. The scores are stored in the IoAddrIntern keys.

calcFeatureScore

public void calcFeatureScore(DeIntern aDEI)
Calculates the score of the given feature/DeIntern and sets it in the DeIntern (key to the sources-by-value table). The score is the Inverse Document Frequency (IDF), after Spärck Jones (1972):
    IDF(t) = log(N/n) + 1
where N == the number of documents in the collection, and n == the number of documents in the collection which contain term t. See the note in the class description on the nature of Math.log().
Parameters:
aDEI - the DeIntern whose score is to be computed.
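The IDF formula can be sketched as follows, using Math.log() (natural log) as the class description states. The class IdfSketch, the method idf and its parameters are hypothetical; the real method reads N and n from the index and stores the result in the DeIntern.

```java
// Hypothetical illustration only: not the actual calcFeatureScore() code.
final class IdfSketch {

    /**
     * IDF(t) = ln(N / n) + 1.
     *
     * @param totalDocs    N, the number of documents in the collection
     * @param docsWithTerm n, the number of documents containing term t
     */
    static double idf(long totalDocs, long docsWithTerm) {
        if (docsWithTerm < 1 || docsWithTerm > totalDocs) {
            throw new IllegalArgumentException("require 1 <= n <= N");
        }
        // Cast before dividing so that N / n is not truncated to an integer.
        return Math.log((double) totalDocs / docsWithTerm) + 1.0;
    }

    public static void main(String[] args) {
        System.out.println("rare term:   " + idf(1000, 10));
        System.out.println("common term: " + idf(1000, 1000));
    }
}
```

A term that occurs in every document scores exactly 1, and the score grows as the term becomes rarer.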

calcDocumentScore

public void calcDocumentScore(IoAddrIntern anIO_Addr)
Computes the document score and sets it in the passed IoAddrIntern. The document score is the sum over all the document's features (e.g., terms in a text doc) of the squared weight of each feature. The weight of a feature is 1 + the log of the frequency of the feature in the document.
Parameters:
anIO_Addr - the IoAddrIntern of the document whose score is to be calculated
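A minimal sketch of the document score described above, assuming the feature frequencies of one document are available as a plain array. The class DocScoreSketch and its methods are hypothetical; the real method obtains the frequencies from the index and stores the result in the IoAddrIntern.

```java
// Hypothetical illustration only: not the actual calcDocumentScore() code.
final class DocScoreSketch {

    /** Weight of one feature in a document: 1 + ln(tf). Requires tf >= 1. */
    static double featureWeight(int termFrequency) {
        if (termFrequency < 1) {
            throw new IllegalArgumentException("term frequency must be >= 1");
        }
        return 1.0 + Math.log(termFrequency);
    }

    /** Sum of the squared weights of all features in the document. */
    static double documentScore(int[] termFrequencies) {
        double sum = 0.0;
        for (int tf : termFrequencies) {
            double weight = featureWeight(tf);
            sum += weight * weight;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A document with three distinct features occurring 1, 2 and 5 times.
        System.out.println(documentScore(new int[] {1, 2, 5}));
    }
}
```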

calcInputForRSV

public void calcInputForRSV(ResultForDocMatchingQueryCondition aResult)
Sets the fields of aResult that are needed to compute the Retrieval Status Value (RSV).
Parameters:
aResult - the Result... used.
Overrides:
calcInputForRSV in class IdxIntern

calcCombinedScore

public void calcCombinedScore(ResultForDocMatchingQueryCondition aResult)
Compensates for differences in the scores of retrieved features, e.g. in the case where the retrieved features are not exact matches for the query feature. For now, this compensation takes the form of calculating the average score of the retrieved features. Sets the combinedScore in the passed parameter to this value.
Parameters:
aResult - partial result on which to calculate the combined score
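The averaging step described above can be sketched as below, assuming the scores of the retrieved features are available as an array. The class CombinedScoreSketch and the method combinedScore are hypothetical; the real method reads the scores from aResult and sets its combinedScore.

```java
// Hypothetical illustration only: not the actual calcCombinedScore() code.
final class CombinedScoreSketch {

    /** Arithmetic mean of the scores of the retrieved features. */
    static double combinedScore(double[] featureScores) {
        if (featureScores.length == 0) {
            throw new IllegalArgumentException("at least one retrieved feature is required");
        }
        double sum = 0.0;
        for (double score : featureScores) {
            sum += score;
        }
        return sum / featureScores.length;
    }

    public static void main(String[] args) {
        // Two inexact matches for one query feature, each with its own score.
        System.out.println(combinedScore(new double[] {2.0, 4.0}));
    }
}
```

With a single retrieved feature the average is that feature's own score.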

calcWeightOfFeatureInDocument

public void calcWeightOfFeatureInDocument(ResultForDocMatchingQueryCondition aResult)
Computes the weight of the given feature for the documents it appears in and sets it in the result. The weight is equal to the log of the frequency of the term in the document, plus 1. Note: The term frequency *ought* never to be zero (which would give a log value of -Infinity), because IdfIdxIntern.calcInputForRSV(), the method that calls this one, is not invoked by IdxIntern.makeResultForDocMatchingQueryCondition() unless at least one BasicRetrievalResult has been returned for the query condition.
Parameters:
aResult - partial result for which to compute the weight...
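The zero-frequency case mentioned in the note can be made explicit with a guard, as in the sketch below. Whether the real method performs such a check is not stated in this documentation; the class DocWeightSketch and the method weightInDocument are hypothetical.

```java
// Hypothetical illustration only: not the actual calcWeightOfFeatureInDocument() code.
final class DocWeightSketch {

    /** Weight of a feature in a document: ln(tf) + 1, rejecting tf < 1. */
    static double weightInDocument(int termFrequency) {
        if (termFrequency < 1) {
            // Math.log(0) is -Infinity, which would propagate into the RSV.
            throw new IllegalArgumentException("term frequency must be >= 1");
        }
        return Math.log(termFrequency) + 1.0;
    }

    public static void main(String[] args) {
        System.out.println("ln(0) = " + Math.log(0));
        System.out.println("weight for tf = 4: " + weightInDocument(4));
    }
}
```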

calcWeightOfFeatureInQuery

public void calcWeightOfFeatureInQuery(ResultForDocMatchingQueryCondition aResult)
Calculates the weight of the feature in the query and sets it in the result. The weight is equal to the combined score (see calcCombinedScore()) * ((log of the frequency of the term in the query) + 1). Note: The query frequency *ought* never to be zero (which would give a log value of -Infinity), because IdfIdxIntern.calcInputForRSV(), the method that calls this one, is not invoked by IdxIntern.makeResultForDocMatchingQueryCondition() unless at least one BasicRetrievalResult has been returned for the query condition.
Parameters:
aResult - partial result on which to calculate the weight...
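A minimal sketch of the query-side weight, assuming the combined score and the query frequency are passed in directly. The class QueryWeightSketch and the method weightInQuery are hypothetical; the real method takes both values from aResult.

```java
// Hypothetical illustration only: not the actual calcWeightOfFeatureInQuery() code.
final class QueryWeightSketch {

    /** Weight of a feature in the query: combinedScore * (ln(qf) + 1). */
    static double weightInQuery(double combinedScore, int queryFrequency) {
        if (queryFrequency < 1) {
            throw new IllegalArgumentException("query frequency must be >= 1");
        }
        return combinedScore * (Math.log(queryFrequency) + 1.0);
    }

    public static void main(String[] args) {
        // A feature with combined score 2.5 that occurs twice in the query.
        System.out.println(weightInQuery(2.5, 2));
    }
}
```

For a feature that occurs once in the query, the weight reduces to the combined score itself.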

calcRSV

public double calcRSV(ResultForDocMatchingQueryModalityUnit aResult)
Calculates the Retrieval Status Value (RSV) of a single modality unit of the result of a query. This is the score attained by a document for a single modality of the query.
    RSV == sigma(1 to n)(wiq * wid) / sqrt(sigma(1 to p)(wiq'²) * docscore)
where:
    n == the number of query terms occurring in the document
    wiq == combinedScore * (1 + log(qf))
    wid == 1 + log(tf)
    p == the number of terms in the query
    wiq' == IDF * (1 + log(qf))
    docscore == sigma(1 to q)((1 + log(tf))²)
    q == the number of unique terms in the document
    qf == the frequency of a term in the query
    tf == the frequency of a term in the document
Note: In practice wiq and wiq' take the same value, as the method for producing the combinedScore returns the IDF.
Parameters:
aResult - partial result for which to calculate the RSV
Overrides:
calcRSV in class IdxIntern
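The RSV formula can be sketched end to end as below. This is an illustration under stated assumptions, not the actual calcRSV() code: the class RsvSketch, its methods and the parallel-array parameters are hypothetical, and the document score is assumed to have been computed beforehand as described for calcDocumentScore().

```java
// Hypothetical illustration only: plain arrays stand in for the result objects.
final class RsvSketch {

    /** Log-damped frequency weight: 1 + ln(f). Requires f >= 1. */
    static double logWeight(int frequency) {
        if (frequency < 1) {
            throw new IllegalArgumentException("frequency must be >= 1");
        }
        return 1.0 + Math.log(frequency);
    }

    /**
     * @param matchedScores combinedScore of each of the n query terms found in the document
     * @param matchedQf     query frequency of each of those n terms
     * @param matchedTf     document frequency of each of those n terms
     * @param queryIdf      IDF of each of the p terms in the query
     * @param queryQf       query frequency of each of the p terms in the query
     * @param docScore      precomputed document score (sum of squared document weights)
     */
    static double rsv(double[] matchedScores, int[] matchedQf, int[] matchedTf,
                      double[] queryIdf, int[] queryQf, double docScore) {
        // Numerator: sum of wiq * wid over the query terms occurring in the document.
        double numerator = 0.0;
        for (int i = 0; i < matchedScores.length; i++) {
            double wiq = matchedScores[i] * logWeight(matchedQf[i]);
            double wid = logWeight(matchedTf[i]);
            numerator += wiq * wid;
        }
        // Query norm: sum of wiq'^2 over all terms in the query.
        double queryNorm = 0.0;
        for (int j = 0; j < queryIdf.length; j++) {
            double wiqPrime = queryIdf[j] * logWeight(queryQf[j]);
            queryNorm += wiqPrime * wiqPrime;
        }
        return numerator / Math.sqrt(queryNorm * docScore);
    }

    public static void main(String[] args) {
        // Two-term query; only the first term occurs in the document.
        double value = rsv(new double[] {3.0}, new int[] {1}, new int[] {1},
                           new double[] {3.0, 4.0}, new int[] {1, 1}, 1.0);
        System.out.println(value);
    }
}
```

Dividing by the square root of the query norm times the document score is a cosine-style normalization, so a one-term query matched by a document that contains only that term, once, gets an RSV of 1.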