PRISE Search Design Notes


The NIST PRISE search engine treats documents and queries as lists of words and responds to a query with a list of documents ranked in order of their statistical similarity to the query. The PRISE search engine assumes the documents to be searched have been indexed using the PRISE indexer. The steps taken to process the query and return a list of documents (a result set) are:

  1. Preprocess the query
    This process identifies and normalizes the query terms.
  2. Generate query enhancement terms, query expansion terms, using relevance feedback; if the caller requests them.
  3. Append the reformulated query to the initial query; if enhancement terms are requested and the return of the reformulated query is requested.
  4. Return the reformulated query; if requested.
  5. Reweight the query using relevance feedback processing; if the caller requests it.
  6. Create a new result set; if a search is requested.
  7. Free information created by the search which is not needed by the caller.
  8. Return the result set; if a search is performed.

Processing Details and Locations of Source Code by Step

This section describes in greater detail the major steps executed in the PRISE search engine. Direct access to the PRISE search engine should be achieved using the PRISE Programming Interface. This interface consist of three functions init_search0(), search0(), and close_search0() located in DISTRIBUTION_ROOT/prise_search/src/lib/search/search0.c.
What follows is a short description of search0() processing.

  1. Preprocessing the query involves:

    Source: init_qrep() DISTRIBUTION_ROOT/prise_search/src/lib/search/qindex.c

  2. Creating the list of enhanced terms. See Generate the enhancement terms under relevance feedback.
  3. Appending the reformulated query into the query involves:

    Source: appendReformulatedQuery() in DISTRIBUTION_ROOT/prise_search/src/lib/search/prise_query.c

  4. The caller now has access to the reformulated query, because of the append, through the query.
  5. Reweighting the query. See Reweight the query under relevance feedback.
  6. Creating a result set requires running the user query against the document collection to create a list of documents ranked by similarity to the query.

    Source: returnResultSet() in DISTRIBUTION_ROOT/prise_search/src/lib/search/search.stem.neg.c

    Returning a result set involves:


Query indexing

Source: qinit() DISTRIBUTION_ROOT/prise_search/src/lib/search/qindex.c.

The process of indexing the query consist of:
  1. Parsing the string of document keywords in the query.

    Source: qparser() in .../misc/src/lib/client_util/qparser.dfa.c

  2. Creating the internal data structures to represent the parsed terms.


Relevance Feedback

Two modes of relevance feedback operation, controlled by four flags and a pointer, are available in the current search engine.
The modes are: The four flags are: The pointer is:

Transparent Relevance Feedback

In transparent mode the caller issues a query and receives as a response a reformulated query which contains the original query plus additional key words, enhancement terms, which may prove helpful in finding documents if added to the query when searching. In this mode the caller has the option to add zero or more of the enhancement terms to the query before requesting a search.
The basic sequence of events in transparent mode is:

  1. The caller requests a search with some initial query getting a result set.
    This initial search is denoted in the PRISE query by the following:
  2. The caller selects one or more documents from the result set and provides information about the relevance of the document(s) - relevance feedback.
  3. The caller submits the query with the list of selected documents (called "relevance documents" in what follows) and associated relevance judgements, requesting a list of candidate enhancement terms. .
    This step is denoted in the query as follows:
  4. The search engine generates the candidate enhancement terms from the relevance documents.
  5. The search engine returns the reformulated query to the caller.

    Source: appendReformulatedQuery() in DISTRIBUTION_ROOT/prise_search/src/lib/search/prise_query.c

  6. The caller selects zero or more enhancement terms and adds them to the query.
  7. The caller submits the expanded query with the list of relevance documents requesting a search be executed and result set returned.
  8. The search engine reweights the query to adjust for any user changes to the query.
  9. The search engine performs the requested search using the weights based on the relevance documents and returns the result set.

Opaque Relevance Feedback

In opaque mode the caller may issue a query and ask that it be automatically expanded with enhancement terms. In this mode the enhancement terms are append to the query, without a chance for caller intervention, and the expanded query is used to create a result set.
The basic sequence of events in opaque mode is:

  1. The caller requests a search with some initial query and gets a result set.
    This initial search is denoted in the PRISE query in the same manner as in transparent mode.
  2. The caller selects documents from the result set and provides relevance feedback for these (relevance) documents.
  3. The caller submits the query with the list of relevance documents requesting enhancement terms be automaticly added to the query and a search be executed to return a result set.
  4. The search engine generates the enhancement terms from the relevance documents.
  5. The search engine adds the enhancement terms to the original query.

    Source: appendQterms() in DISTRIBUTION_ROOT/prise_search/src/lib/search/prise_query.c

  6. The search engine performs the requested search using the weights based on the relevance documents and returns the result set.

Generating enhancement terms

This process is intended to:

Source: Create_enhancedList() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/feedback.c

The fact that many terms from the relevance documents can have the same stem raises the issue of which term for a given stem is returned for user consideration in transparent mode. PRISE currently returns a term group to the caller.
A term group is a comma separated list of all the unstemmed terms which occur for the stem of an enhancement term in the relevance documents.
The steps in generating enhancement terms are:
  1. Parse the relevance documents creating a list of terms and stems. Include in the list the information used in term sorting and reweighting for each stem returned.

    Source: RelDocQterms() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/feedback.c

    The steps are:
  2. Remove parsed duplicates of original terms and update original term data.

    Source: RemoveDuplicatesStems() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c

  3. Sort the unique terms remaining in the list created by the parsing.

    Source: SortTerms() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c

    This allows the search engine to place at the top of the list the terms which, by the sorting method used, are determined to be most likely to help find information relevant to the user's query.
    The following is a list of the term sorting methods available with the string used in the options.spec file of a collection to select the method:
  4. Return the top N unique terms from the sorted list.

    Source: BestTerms() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c

    This reduces the list of unique terms form relevance documents to at most the number of terms the caller wishes returned or added to the query. This avoids any additional processing on terms which are guaranteed not to be used.
    The current default value of N is 10.
    The value of N can be set in the options.spec file in PRISE indexed collections using the maxrfterms option. A minrfterms option may also appear in the options.spec file but the current implementation of BestTerms ignores this option.
  5. Reweight both the terms returned by the previous step and the original query terms.

    Source: ReweightTerms() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/reweightTerms.c

    The goal of this step is to calculate weights for terms based on the relevance documents which will improve the quality of the search results.
    The following is a list of the term reweighting methods available with the string used in the options.spec file of a collection to select the method:

Reweighting the query

This process calculates the weights of the terms in the expanded query during transparent relevance feedback.

Reweighting the query is a very similar to generating the enhanced list. Currently the implementation is a distinct function with an explicit step for re-parsing the relevance documents. Future versions may improve on this leaving the interface and its callers unaffected. Currently we re-parse in order to account for user changes to the query terms and relevant document list.

  1. A query term can be added/changed.
    If a query term is added/changed then the reweight value of this new term must be calculated. Reweighting in the this case could be accomplished by remembering all terms found in the relevant documents with the information needed for reweighting. Supporting last item in this list makes re-parsing necessary and so the decision was made to re-parse in this case as well.
  2. A user term can be deleted.
    If a user term is deleted it is just forgotten and reweighting for that term is not required.
  3. Relevance documents can be added/deleted/changed.
    If the relevance document set is changed by the user while selecting enhancement terms or for some other reason then the reweight value of the terms, which is derived from information in the relevance documents on the list at the time this function is called, must be recalculated. To obtain the information needed to recalculate the reweight values parsing of the relevance documents is required.
This function is executed every time there are relevance documents in the query and transparent relevance feedback mode is in use even if no query enhancement terms were added by the caller. Thus the caller's query terms will be reweighted with respect to the relevance documents supplied even though no enhancement terms are in use.

Source: ReweightQuery() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/feedback.c

The steps in reweighting the query are:
  1. Parse the relevance documents creating a list of terms and stems.
  2. Place in the original query terms the data from the parser needed for term reweighting.

    Source: CopyQtermRFData() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c

  3. Reweight the query terms with one of the reweight methods.

Special restrictions

  1. Document phrase search
  2. This restriction requires that, of the documents found by the keyword search, only those containing the phrase supplied by the caller should be returned. For a document to pass phrase processing it must contain the same string the caller has supplied, with the following conditions

  3. Document proximity search
    For this restriction the caller supplies two terms and a distance measured in words. Of the documents found by the document keyword search only those which contain the two terms where the distance between the terms is less then or equal the distance supplied will be returned. The following conditions apply to proximity processing:
  4. Field keyword search
    For this restriction the caller supplies a field name and a list of keywords to search for in the field. Of the documents found by the document keyword search only documents which contain the specified field where the field contains at least one of the listed keywords are returned. The following conditions apply to fielded keyword restrictions:

The use of special restrictions has the effect of refining the set of documents returned. This occurs because special restrictions can only remove documents after the initial document keyword search. It cannot add documents.

Special restrictions should be used carefully. Although an effort has been made to minimize the use of string searching, a very time consuming process, it is still the last step in supporting special restrictions. Therefore, if it is applied using terms which occur in many documents in the collection and produces a large number of documents even after Stage 1 of restriction processing, then this system may prove very slow.

Limitations

The PRISE Search Engine has some limitations
  1. PRISE will process a query of up to 6000 terms.
  2. Only the first 32 query terms used to retrieve documents will be accounted for in the bitmaps associated with the documents. Any additional query terms will add documents to the result set and be eligible to influence document scores but will not be represented in the bitmap (and will not be elligible for highlighting by the zclient).
    Note: When using special restrictions, any special terms which do not occur in the bitmaps will not contribute to Stage 1 of special restrictions. This will result in more documents remaining to be processed in Stage 2 of special restrictions, string searching, which will slow response time.
  3. Keyword searching, document or fielded, on numbers is not possible because numbers are not indexed by the PRISE Indexer.
  4. The string searches used to support special restrictions will time out after 5 minutes and the result set returned will consist of the documents which have passed special restrictions at that time. This time out interval may be adjusted by changing the #define TIME_OUT_SECS.
    Source: DISTRIBUTION_ROOT/prise_search/include/restrictions.h
  5. Punctuation in the query is ignored for keyword searching.


PRISE Programming Interface


Includes

#include <stdio.h>
#include <malloc.h>
#include <math.h>
#include <fcntl.h>
#include <string.h>

#include "type_defs.h"
#include "symb_defs.h"

#include "tkn.h"
#include "query.h"

#include "coll.h"
#include "options.h"
#include "result.h"
#include "search_proto.h"
#include "search0.h"

#include "io.h"
#include "errors.h"

Main Data Structures

Collection

typedef struct {
  DICTFILE*     dict;
  POSTFILE*     postings;
  TPIFILE*      tpi;
  STOPFILE*     stoplist;
  COLLSTATS*    stats;
  DOCNO_TABLES* docno_tables;
  FIELD_TABLE*  field_table;
  char*         lexnmzn;      /* Name of lexical normalization function.  */
  int           stemfn;
  char*         locale;
  char*         dir;
  char*         name;
  int           mode;      /*mode refers to SEARCH_MODE or INDEXER_MODE*/
  OPTIONS*      options;
} COLL;

Query

typedef struct TEXT_QUERY
  {
  /*-------------------------------------------------------------------------*
   * Tags delimiting the part of the document to search
   *-------------------------------------------------------------------------*/
  char* start_end_tags;          /* When full document is intended, this     */
                                 /* contains a pointer to a null string      */
  /*-------------------------------------------------------------------------*
   * Data indicating the sort of search to [perform on the document part
   *-------------------------------------------------------------------------*/
  /*-------------------------------------------------------------*
   * Keyword search
   *   No limit on number of terms
   *   Stemming
   *-------------------------------------------------------------*/
  struct
    {
    char* terms;
    char* weights;
    int   numterms;
    } keywds;       
  /*-------------------------------------------------------------*
   * Proximity search
   *   2  terms
   *   no stemming
   *   order does *not* matter
   *-------------------------------------------------------------*/
  struct 	
    {                           
    char* terms;              /* in the order they must appear */
    int   dist;
    } prox;
  /*-------------------------------------------------------------*
   * Phrase search
   *   No limit on number of terms
   *   No stemming
   *   Order matters
   *-------------------------------------------------------------*/
  struct 				
    {
    char* terms;
    } phrase;
  struct TEXT_QUERY* next;  /* text query for next document section */
                            /* NULL if none follows                 */   
  } TEXT_QUERY;


/*---------------------------------------------------------------------*
 * Typedefs to support relevant/nonrelevant document list and relevance
 * feedback operations.
 *---------------------------------------------------------------------*/
typedef struct docrelnode{ /* structure for passing relevant document
			      lists */
  unsigned long docno;     /* document id number */
  float relevance;         /* how relevant the document is */
  struct docrelnode *next; /* pointer to next list member */
} DOCRELNODE;

typedef struct 
  {
  int doSearch;                /* Execute a search */
  int returnReformulatedQuery; /* Send the reformulated query back to
				* the client */
  int reformulateQuery;        /* Use the relevance documents to create a
				* list of unique terms which can be added
				* to the query.                        */
  int newQuery;                 /* This query is not related to any
				 * previous query aka we have not done
				 * relevance feedback to get to this
				 * query
  /*-------------------------------------------------------------------*
   * Limits used to control the number of additional relevant terms
   * used/returned for review. Supporting relevance feedback created the
   * need for these so they are being kept with the relevance document
   * lists.
   * These may also be useful if we were to support boolean queries and
   * the expand operation.
   *-------------------------------------------------------------------*/
   int minrelterms;              /* The minimum number of additional
				  * relevant terms to be used/return for
				  * review. This may be useful if we use
				  * a selection method based on something
				  * other then position in ranked list, 
				  * ex. top N terms.                   */
   int maxrelterms;              /* The maximum number of additional
				  * relevant terms to be used/return for
				  * review.                            */


  DOCRELNODE *doc_rels;      /* List of document relevance information */
  int  num_wanted;            /* Desired number of items in the result */
  enum query_type 
    {
    Q_TYPE_TEXT,
    Q_TYPE_DOCID,
    Q_TYPE_BOOLEAN,
    Q_TYPE_LIMIT	      /* The number of query types */
    } qtype;
  union q_type_union 
    {
    TEXT_QUERY* text_query;
          void* boolean;      /* Not defined, yet.  Not implemented, yet. */
    } u;
union q_type_union reformed;  /* Reformed query returned by searcher. */
  } QUERY;

Result

typedef struct {
  DOC* docs;		  /* Data representing retrieved documents.    */
  int  doc_ct;		  /* The number of documents retrieved.	       */

  int  query_id;	  /* For experimentation and evaluation.       */
  QUERY *query;           /* A PRISE query structure                   */
  QTERM* qterms;	  /* A list of structures representing query   */
                          /* terms.*/
                          /* This is typically a subset of 'tknlist'   */
                          /* list. */
			  /* Stopwords, for example, have been removed.*/
  			  /* QTERMs are not strictly necessary, since  */
                          /* they are redundant given TKNs, but are    */
                          /* provided for compatibility with other     */
                          /* components, e.g., feedback.               */
  			  /* Allocated and set via qtinit().	       */
                          /* qterms added by dld to allow for          */
                          /* highlighting                              */
  TKNLIST* tknlist;	  /* A list of structures representing the     */
                          /* original                                  */
} RESULT;

Functions

These functions are located in DISTRIBUTION_ROOT/prise_search/src/lib/search/search0.c.
int init_search0
	(	     
  	char*  coll_name	Collection
  	char*  coll_dir		Collection directory	     
  	COLL** coll    		Structure representing collection components
  	RESULT** result		Search results.			        	
	)

	Returns:
		RET_OK
  		RET_ERROR	Failed to initialize the collection
				Failed to malloc RESULT
				Failed to allocate result space
		RET_FAIL	Initonly option - show specs no retrieval
The interface function init_search0() must be called prior to the first call to search0.
int search0
	( 
	COLL*   coll 		Structure representing collection components
	QUERY*  query 		Query 
	RESULT* result		Search results
       	)

	Returns:
		RET_OK		Retrieved %d document(s)
		RET_ERROR	Error returned by search()
		RET_FAIL	No documents retrieved
The interface function search0() activates the PRISE search engine function, search() in DISTRIBUTION_ROOT/prise_search/src/lib/search/search.stem.neg.c.
int close_search0
	( 
	COLL*   coll		Structure representing collection components
	RESULT* result		Search results
	)

	Returns:
		RET_OK


References


National Institute of Standards and Technology Home Page Last updated: Tuesday, 01-Aug-2000 07:16:48 MDT

Date created: Monday, 31-Jul-00