Retrieving documents

The retrieval process operates on a document collection index, using retrieval modalities, to create a list of documents in response to a supplied query. The application then asks the result to sort itself in descending order of the retrieval score, or Retrieval Status Value (RSV), which reflects the degree to which a document matches the features of the query.

Retrieval is initiated through a retrieval request, and results are returned as an unsorted retrieval result.

Retrieval request

The document retrieval request consists of a query, represented as a document object, together with weighting information for the query components. The request is initiated through a call to the retrieve() method of the query.

The query is represented as a document of the same class as the documents in the collection to be searched; that class should be a subclass of Document. In the supplied sample collections, for example, the corresponding document type would be either HciDoc or CranfieldDocument. The query features (e.g., the query terms in the case of a text query) for each document field are placed in the corresponding field of the query document.

The weighting information describes how the scores obtained for each field/index should contribute to a document's final score (RSV). The information is represented as a RetrievalModalities object, and has two components: an IndexingModalities object, which specifies the indexing modality units associated with the documents in a collection, and a Combinator object. Each IndexingModalityUnit specifies the name of a document feature to be indexed (i.e., a field such as title or author), the name of the indexer to use, and the name of the index to hold the results of indexing. In what follows, where processing is by IndexingModalityUnit, we describe it as being by field, to be more concrete.
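The shape of this weighting information can be sketched as plain data. The following is a minimal illustration in Python, not IRF code: the class names mirror those in the text, but the attribute names (feature, indexer, index, function, weights) and the indexer and index names in the example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class IndexingModalityUnit:
    feature: str  # document field to be indexed, e.g. "title"
    indexer: str  # name of the indexer to use
    index: str    # name of the index that holds the indexing results


@dataclass
class Combinator:
    function: str            # combining function: "max", "min" or "lin"
    weights: Dict[str, int]  # integer weight per field


@dataclass
class RetrievalModalities:
    indexing_modalities: List[IndexingModalityUnit]
    combinator: Combinator


# Hypothetical two-field collection; indexer and index names are made up.
modalities = RetrievalModalities(
    indexing_modalities=[
        IndexingModalityUnit("title", "idf_indexer", "title_index"),
        IndexingModalityUnit("author", "idf_indexer", "author_index"),
    ],
    combinator=Combinator(function="lin", weights={"title": 2, "author": 1}),
)

indexed_fields = [unit.feature for unit in modalities.indexing_modalities]
```

Here indexed_fields lists the fields that take part in retrieval, and the Combinator carries one integer weight for each of them.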

Combinator object

The Combinator specifies how the score for each document field will be combined to produce the final document score. It associates an integer weight with each field and specifies a combining function to be applied to the weighted field-level scores to produce the document-level RSV.

The weight associated with a field defines its contribution to the document RSV, relative to those of any other fields the document class may have. The supplied weights are used in the creation of weighting factors used when the document RSV is calculated.

IRF supports three combining functions:

max: returns the highest weighted score;
min: returns the lowest (non-zero) weighted score;
lin: returns the mean weighted score.

The unstructured form of the Combinator is intended for use when the same combining function is to be applied across all fields. Thus, in a collection of documents with three fields (a, b and c), the operation lin(a, b, c) would return the mean value of the weighted field-level scores as the document RSV.

The structured form of the Combinator allows different combining operations to be applied between different indexing modalities. In this case, the combinations are always binary, so that the weighted score of a given field may be combined either with the weighted score of another field or with the result of another such combination. Thus, referring to the example above, combinations such as max(a, lin(b, c)) are possible, returning the larger of a and the mean of b and c as the document RSV.
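The two forms can be illustrated with a short sketch. This is Python written for illustration only, not the IRF implementation: the function names comb_max, comb_min and comb_lin are hypothetical, the inputs are assumed to be field scores that have already been weighted, and returning 0.0 from comb_min when every score is zero is an assumption.

```python
def comb_max(*scores):
    """max: the highest weighted score."""
    return max(scores)


def comb_min(*scores):
    """min: the lowest non-zero weighted score (0.0 if all are zero, by assumption)."""
    nonzero = [s for s in scores if s != 0]
    return min(nonzero) if nonzero else 0.0


def comb_lin(*scores):
    """lin: the mean weighted score."""
    return sum(scores) / len(scores)


# Weighted field-level scores for fields a, b and c (made-up values).
a, b, c = 0.75, 0.25, 0.5

# Unstructured form: one function applied across all fields.
unstructured = comb_lin(a, b, c)           # 0.5

# Structured form: binary combinations that can be nested.
structured = comb_max(a, comb_lin(b, c))   # max(0.75, 0.375) = 0.75
```

In the structured case each call takes exactly two arguments, one of which may itself be the result of another combination.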

The internal representation of a query is the QueryRep object, which recombines the information supplied in the components of RetrievalModalities in a form which mirrors the structure of the Combinator.

Retrieval result - its structure

The retrieval result, a RetrievalResult object, constitutes the highest level of a hierarchy of result objects. Each level of the hierarchy combines and contains the results of the level immediately beneath it.

There is one RetrievalResult per search and it contains a list of ResultForDocMatchingQuery objects - one for each document that is part of the result. The RetrievalResult is returned unsorted but is then normally asked to sort itself in descending order of RSV.

Each ResultForDocMatchingQuery object contains a reference to a document, the document's RSV, the query, and a list of ResultForDocMatchingQueryModalityUnit objects - one for each field contained by the document associated with the enclosing ResultForDocMatchingQuery. Some documents may not contain instances of all fields.

Each ResultForDocMatchingQueryModalityUnit contains a list of ResultForDocMatchingQueryCondition objects - one for each query feature that was present in the document field (and therefore in the index) for the enclosing ResultForDocMatchingQueryModalityUnit.

Finally, each ResultForDocMatchingQueryCondition contains a list of BasicRetrievalResult objects - one for each occurrence, in the document field, of the query feature associated with the enclosing ResultForDocMatchingQueryCondition.

More concisely,

ResultForDocMatchingQuery (result at document level) contains one or more
    ResultForDocMatchingQueryModalityUnit (result at document field level) contains one or more
        ResultForDocMatchingQueryCondition (result at query feature level) contains one or more
            BasicRetrievalResult (result at single occurrence of query feature level)
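The hierarchy can also be pictured as nested containers. The sketch below is illustrative Python rather than IRF code: the class names follow the text, while the attribute names, the doc_id strings and the sort_by_rsv() method name are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BasicRetrievalResult:
    score: float  # score held in the index for one occurrence of a feature


@dataclass
class ResultForDocMatchingQueryCondition:
    feature: str  # the query feature, e.g. a query term
    occurrences: List[BasicRetrievalResult] = field(default_factory=list)


@dataclass
class ResultForDocMatchingQueryModalityUnit:
    field_name: str  # the document field this result refers to
    conditions: List[ResultForDocMatchingQueryCondition] = field(default_factory=list)


@dataclass
class ResultForDocMatchingQuery:
    doc_id: str  # stands in for the reference to the document
    rsv: float
    units: List[ResultForDocMatchingQueryModalityUnit] = field(default_factory=list)


@dataclass
class RetrievalResult:
    results: List[ResultForDocMatchingQuery] = field(default_factory=list)

    def sort_by_rsv(self):
        """Sort the document-level results in descending order of RSV."""
        self.results.sort(key=lambda r: r.rsv, reverse=True)


# An unsorted result with two documents (made-up RSVs).
result = RetrievalResult([
    ResultForDocMatchingQuery("doc-1", 0.5),
    ResultForDocMatchingQuery("doc-2", 0.875),
])
result.sort_by_rsv()
ranked = [r.doc_id for r in result.results]  # ["doc-2", "doc-1"]
```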

Calculation of document RSV

The final retrieval score/RSV of a document is calculated as follows:

Each BasicRetrievalResult carries with it the score held in the index for the particular occurrence of the feature it describes. An example of this might be the Inverse Document Frequency (IDF) of a term in a text-based index where IDF is used in the indexing procedure.

Each ResultForDocMatchingQueryCondition contains a score (combinedScore) derived from the scores of the set of BasicRetrievalResults it holds. The nature of this combination depends on the indexing and retrieval techniques used. For IDF-based indexing and retrieval, the score is taken as the mean of the individual BasicRetrievalResult scores. This allows for the possibility that a matching technique other than exact match retrieves features with different IDF values; with exact matching, the combined score would be just the IDF of the query feature for that index.
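As a small illustration of the IDF case, the mean can be computed as below. This is a Python sketch under stated assumptions, not the IRF code: the function name combined_score is hypothetical and the IDF values are made up.

```python
def combined_score(occurrence_scores):
    """Mean of the scores carried by the BasicRetrievalResults of one query feature."""
    if not occurrence_scores:
        raise ValueError("a condition result holds at least one occurrence")
    return sum(occurrence_scores) / len(occurrence_scores)


# Exact match: every occurrence is the same term, so all scores are its IDF.
exact = combined_score([1.5, 1.5, 1.5])  # 1.5, the IDF of the query term

# Non-exact match: occurrences of two different matched terms with different IDFs.
inexact = combined_score([1.5, 2.5])     # 2.0
```

With exact matching the mean collapses to the IDF of the query feature, as the text notes.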

Each ResultForDocMatchingQueryModalityUnit contains a score derived from a combination of the scores of the set of ResultForDocMatchingQueryCondition elements it holds. The precise nature of this score, again, depends on the indexing and retrieval techniques in use, and their associated weighting algorithms. The IDF-based retrieval code supplied with the IRF uses such quantities as the frequency of a term in the query, its frequency in a document (i.e. the number of BasicRetrievalResults in the ResultForDocMatchingQueryCondition corresponding to the term) and, of course, the IDF of the term (the combinedScore from the ResultForDocMatchingQueryCondition object corresponding to the term).

The final document score is calculated in three stages:

First, a normalized score for each field in the document is calculated by dividing the score for the document field by the highest score for the corresponding field in any of the retrieved documents.
Next, the normalized score for each field of every document is multiplied by the weighting factor for that field, as expressed in the Combinator object of the retrieval request. The weighting factor for each field is just its weight divided by the sum of the weights for all fields. Thus, referring back to our example above, if the weight for field a were 2 and the weights for b and c were each 1, the weighting factor for a would be 2/4 = 0.5, while those for b and c would each be 1/4 = 0.25.
The weighted scores for the individual fields are then summed to give the final document score.
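The three stages can be traced with a short worked example. The sketch below is Python written for illustration, not the IRF implementation: the function names and the raw field scores are made up, and treating a field that is missing from a document as scoring zero is an assumption. The weights (a = 2, b = 1, c = 1) are those of the example in the text.

```python
def weighting_factors(weights):
    """Each field's weight divided by the sum of the weights for all fields."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def document_rsvs(field_scores, weights):
    """field_scores maps doc id -> {field name: raw field-level score}."""
    factors = weighting_factors(weights)
    # Highest score for each field in any of the retrieved documents.
    top = {
        name: max(scores.get(name, 0.0) for scores in field_scores.values())
        for name in weights
    }
    rsvs = {}
    for doc_id, scores in field_scores.items():
        rsv = 0.0
        for name, factor in factors.items():
            # Stage 1: normalize by the highest score for this field.
            # Assumption: a field missing from a document scores zero.
            normalized = scores.get(name, 0.0) / top[name] if top[name] > 0 else 0.0
            # Stages 2 and 3: apply the weighting factor, then sum over fields.
            rsv += normalized * factor
        rsvs[doc_id] = rsv
    return rsvs


weights = {"a": 2, "b": 1, "c": 1}
factors = weighting_factors(weights)  # {"a": 0.5, "b": 0.25, "c": 0.25}

# Made-up raw field-level scores; doc-2 has no field c.
field_scores = {
    "doc-1": {"a": 4.0, "b": 1.0, "c": 2.0},
    "doc-2": {"a": 2.0, "b": 2.0},
}
rsvs = document_rsvs(field_scores, weights)
# doc-1: 1.0*0.5 + 0.5*0.25 + 1.0*0.25 = 0.875
# doc-2: 0.5*0.5 + 1.0*0.25 + 0.0*0.25 = 0.5
```

Because the weighting factors sum to 1 and each normalized score lies between 0 and 1, the final score computed this way also lies between 0 and 1.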

Note: The scores for BasicRetrievalResult, ResultForDocMatchingQueryCondition and ResultForDocMatchingQueryModalityUnit are calculated by methods residing in a subclass of class IdxIntern. In particular, methods IdxIntern.calcInputForRSV() and IdxIntern.calcRSV() must be supplied in such a subclass (see, for example, IdfIdxIntern).

Date created: Monday, 31-Jul-00
For further information contact Paul Over ([email protected]) with
copy to Darrin Dimmick ([email protected])