PRISE Search Design Notes
The NIST PRISE search engine treats documents and queries as lists of
words and responds to a query with a list of documents ranked in order
of their statistical similarity to the query. The PRISE search engine
assumes the documents to be searched have been indexed using the
PRISE indexer.
The steps taken to process the query and return a list of documents
(a result set) are:
-
Preprocess the query
This process identifies and normalizes the query terms.
-
Generate query enhancement terms, query expansion terms,
using relevance feedback; if the caller requests
them.
-
Append the reformulated query to the initial
query; if enhancement terms are requested and the return of the
reformulated query is requested.
-
Return the reformulated query; if requested.
-
Reweight the query using
relevance feedback processing; if the caller
requests it.
-
Create a new result set; if a search is requested.
-
Free information created by the search which is not needed by the caller.
-
Return the result set; if a search is performed.
Processing Details and Locations of Source Code by Step
This section describes in greater detail the major steps executed in the
PRISE search engine. Direct access to the PRISE search engine should be
achieved using the PRISE Programming Interface.
This interface consist of three functions init_search0(), search0(), and
close_search0() located in
DISTRIBUTION_ROOT/prise_search/src/lib/search/search0.c.
What follows is a short description of search0() processing.
-
Preprocessing the query involves:
-
Allocating the space for the PRISE internal query representation.
-
Initializing the internal query representation data structures.
-
Indexing the query which transforms the
caller's query into the internal data structures used in the
search process.
Source: init_qrep()
DISTRIBUTION_ROOT/prise_search/src/lib/search/qindex.c
-
Creating the list of enhanced terms.
See Generate the enhancement terms under
relevance feedback.
-
Appending the reformulated query into the query
involves:
-
Duplicating the original query in the reformed query.
-
Removing the original key words from the reformed query key words
to make room for the new key words.
-
Removing the original weights from the reformed query to make room for
the new weights.
-
Placing the original key words and their new weights in the reformed
query key words. This step is performed because the weights may change
due to the use of relevance information.
If the original query key words contained duplicate
terms and duplicate terms are being removed during query
preprocessing only one of the original duplicate query terms will
appear in the reformulated query.
-
Appending the enhancement terms or term groups
and their weights in the reformed query key words.
Source: appendReformulatedQuery() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/prise_query.c
-
The caller now has access to the
reformulated query, because of the append,
through the query.
-
Reweighting the query.
See Reweight the query under
relevance feedback.
-
Creating a result set requires running the user
query against the document collection to create a list of documents
ranked by similarity to the query.
Source: returnResultSet() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/search.stem.neg.c
Returning a result set involves:
- Performing the document keyword search
the steps of which are:
-
Sorting the tokens into descending order by working weight.
This step is optional.
Source: linksort() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/linksort.c
-
Creating a set of accumulators, a set of simple data structures to be
used temporarily by the searcher, to hold term occurrence and
document score information.
Source: MakeAccs() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/Accs.neg.c
- Reading the terms postings data.
Source: readpost() in
DISTRIBUTION_ROOT/prise_tools/src/lib/coll/ix.c
This file was previously located in DISTRIBUTION_ROOT/prise_search/src/lib/search.
-
Including the posting data for the terms in the accumulators.
Source: include_term_posting() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/search.stem.neg.c
Which involves:
- Unpacking the postings data for the documents containing the
term.
Source: getpost() in
DISTRIBUTION_ROOT/prise_tools/src/lib/coll/ix.c
- Calculating the strength of the query term and document match.
Source: qd_term_sim() in
DISTRIBUTION_ROOT/prise_tools/src/lib/weights/weights.c.
This file was previously located in
DISTRIBUTION_ROOT/prise_search/src/lib/search.
-
Adding the score from the previous step and recording term
occurrence in the accumulators for the documents which contain
the term.
Source: OldNewAccs() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/Accs.neg.c.
-
- Loading documents into the document array from the accumulators.
For each document the array will contain, from the accumulators,
document number, score, and the query term occurrence bitmap.
Source: loaddocs() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/prune.c.
-
Destroying the accumulators now that the desired information has
been placed in the document array.
Source: DestroyAccs() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/Accs.neg.c
- Applying stage one of special
restrictions to remove documents from the document array
according to caller requirements involves:
- Examining the query for restrictions and translating them into a
usable format if they exist.
Source: query2restrictions() in
DISTRIBUTION_ROOT/prise_search/src/lib/special/restrictions.c
-
Compare the term occurrence bitmaps produced by
query2restrictions for the terms occurring in special
restrictions to the term occurrence bitmaps in the document array
produced by the document keyword search. If a document is proven
to not contain a minimal subset of the terms needed to meet
caller restrictions, remove it from the result set.
Source: applyRestrictionMasks() in
DISTRIBUTION_ROOT/prise_search/src/lib/special/restrictions.c
-
Ranking of the documents of the result set is achieved
by sorting them into descending order based on their document score.
Source: rank_results() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/rank.c.
- Applying stage two of special
restrictions involves doing string search on the piece of text,
document or field, to determine if it meets the caller's special
requirements. Documents which do not meet the caller's requirements are
removed from the result set of documents which will be returned.
Source: applyRestrictionScans() in
DISTRIBUTION_ROOT/prise_search/src/lib/special/restrictions.c
Query indexing
Source: qinit()
DISTRIBUTION_ROOT/prise_search/src/lib/search/qindex.c.
The process of indexing the query consist of:
-
Parsing the string of document keywords in the query.
Source: qparser() in
.../misc/src/lib/client_util/qparser.dfa.c
-
Creating the internal data structures to
represent the parsed terms.
-
Creating a token list for the terms.
Source: qindex(), term_index()
DISTRIBUTION_ROOT/prise_search/src/lib/search/qindex.c.
-
If the token is a stop word an indicator flag is set in its token list
entry.
-
If the token is not a stop word it is lexically normalized.
Currently this involves stemming only.
Source: lexnmzn() in
DISTRIBUTION_ROOT/prise_tools/src/lib/lexnmzn/lexnmzn.c.
This library was previously located in
DISTRIBUTION_ROOT/prise_search/src/lib.
-
The token's dictionary information is placed in the token.
If no dictionary information is found for the token its dictionary
information is set to NULL.
Source: setde() in
DISTRIBUTION_ROOT/prise_search/src/lib/prise_search/qindex.c.
-
If the token is a repeat it is marked as such.
A token, term, is considered a repeat,
duplicate, if a token already found in the query has the same
stem. The reason the stem is used to determine if a term is
present in the query is that PRISE indexed collection dictionary
information is stored based on stems. Thus all tokens with the
same stem will retrieve the same dictionary information.
Therefore only the first occurrence of a stem in the query will
retrieve any documents associated with the stem which have not
already been retrieved.
-
Create a list of qterms for the tokens.
Source: qtinit(), qterm_init()
DISTRIBUTION_ROOT/prise_search/src/lib/search/qterm.c.
This process creates qterms for any tokens which pass the token filter.
The token filter removes:
-
Stop words.
-
Terms which do not occur in the dictionary.
-
Duplicate tokens. The removal of duplicate tokens is optional.
Relevance Feedback
Two modes of relevance feedback operation, controlled by four flags and a
pointer, are available in the current search engine.
The modes are:
The four flags are:
-
doSearch
-
returnReformulatedQuery
-
reformulateQuery
-
newQuery
The pointer is:
Transparent Relevance Feedback
In transparent mode the caller issues a query and receives as a response
a reformulated query which contains the original query plus additional
key words, enhancement terms, which may prove helpful in finding
documents if added to the query when searching. In this mode the caller
has the option to add zero or more of the enhancement terms to the query
before requesting a search.
The basic sequence of events in transparent mode is:
-
The caller requests a search with some initial query getting a result set.
This initial search is denoted in the PRISE query by the following:
-
doSearch = 1
-
returnReformulatedQuery = 0
-
reformulateQuery = 0
-
newQuery = 1
-
doc_rels = NULL
-
The caller selects one or more documents from the result set and provides
information about the relevance of the document(s) - relevance feedback.
-
The caller submits the query with the list of selected
documents (called "relevance documents" in what follows) and
associated relevance judgements, requesting a list of
candidate enhancement terms.
.
This step is denoted in the query as follows:
-
doSearch = 0
-
returnReformulatedQuery = 1
-
reformulateQuery = 1
-
newQuery = 0
-
doc_rels != NULL
-
The search engine generates the candidate enhancement
terms from the relevance documents.
-
The search engine returns the reformulated query to the caller.
Source: appendReformulatedQuery() in DISTRIBUTION_ROOT/prise_search/src/lib/search/prise_query.c
-
The caller selects zero or more enhancement terms and adds them to the
query.
-
The caller submits the expanded query with the list of relevance
documents requesting a search be executed and result set returned.
-
doSearch = 1
-
returnReformulatedQuery = 0
-
reformulateQuery = 0
-
newQuery = 0
-
doc_rels != NULL
-
The search engine reweights the query to adjust
for any user changes to the query.
-
The search engine performs the requested search using the weights based
on the relevance documents and returns the result
set.
Opaque Relevance Feedback
In opaque mode the caller may issue a query and ask that it be
automatically expanded with enhancement terms. In this mode the
enhancement terms are append to the query, without a chance for caller
intervention, and the expanded query is used to create a result set.
The basic sequence of events in opaque
mode is:
-
The caller requests a search with some initial query and gets a result set.
This initial search is denoted in the PRISE query
in the same manner as in transparent mode.
-
The caller selects documents from the result set and provides relevance
feedback for these (relevance) documents.
-
The caller submits the query with the list of relevance documents
requesting enhancement terms be automaticly added to the query and a
search be executed to return a result set.
-
doSearch = 1
-
returnReformulatedQuery = 0
-
reformulateQuery = 1
-
newQuery = 0
-
doc_rels != NULL
-
The search engine generates the enhancement
terms from the relevance documents.
-
The search engine adds the enhancement terms to the original query.
Source: appendQterms() in DISTRIBUTION_ROOT/prise_search/src/lib/search/prise_query.c
-
The search engine performs the requested search using the weights based
on the relevance documents and returns the result
set.
Generating enhancement terms
This process is intended to:
-
Return a ranked list of unique terms, from the relevance documents which
will be most useful when added to the existing query in retrieving the
desired information.
Unique, in this case, means the term is not a
duplicate term for one of the caller' current
query terms.
-
Calculate weights, based on the relevance documents, which will be more
effective in retrieving the desired information for both the
existing query terms and the list of unique terms.
Source: Create_enhancedList() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/feedback.c
The fact that many terms from the relevance documents can have the same
stem raises the issue of which term for a given stem is returned for user
consideration in transparent mode. PRISE currently returns a term group
to the caller.
A term group is a comma separated list of all the
unstemmed terms which occur for the stem of an enhancement term in the
relevance documents.
The steps in generating enhancement terms are:
-
Parse the relevance documents creating a list of
terms and stems.
Include in the list the information used in term sorting and reweighting
for each stem returned.
Source: RelDocQterms() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/feedback.c
The steps are:
-
Apply a document parser very similar to the one used by the
PRISE indexer.
See:
-
Source: Parse() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/feedback.c
This function:
-
Determines the number of relevance documents on the list.
-
Calls parser()
-
Condenses parser's output based on stems.
-
Source: parser() in DISTRIBUTION_ROOT/prise_tools/src/lib/rf.parser/rf.parser.c
This parser along with dividing documents into a list of terms returns
with each term:
-
Its stem.
-
The number of documents it occurs in.
-
This ids of the documents which contain the term.
-
The number of occurrences of the term for each document it appears in.
-
Create the internal data structures to represent the parsed stems and
terms created by the previous step. This step uses the same functions to
create the internal data structures as query
indexing.
-
Place the relevance information for each of the stems in its corresponding QTERM.
-
Remove parsed duplicates of original terms and update original term data.
-
Remove, from the list of terms found by parsing the relevance documents,
any terms already found in the query so that only
unique terms remain.
-
This step also updates the relevance feedback data for the terms already
in the query so reweighting may be accomplished. A term is considered to
be in the query if its stem is already part of the query.
Source: RemoveDuplicatesStems() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c
-
Sort the unique terms remaining in the list created by the parsing.
Source: SortTerms() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c
This allows the search engine to place at the top of the list the terms
which, by the sorting method used, are determined to be most likely to
help find information relevant to the user's query.
The following is a list of the term sorting methods available with the
string used in the options.spec file of a collection to select the
method:
-
Term Idf. Options.spec string: TermIdf.
Reference: Harman1
-
Postings - the number of retrieved relevant documents containing the
term. Options.spec string: TermPostrec
Reference: Harman1
-
Idf within postings. Options.spec string: TermIdfWIPostrec
Reference: Harman1
-
Idf * Frequency within postings, where frequency is the Log2 of total
frequency of the term in the retrieved relevant set. Options.spec
string: TermIdfFreqWIPostrec
Reference: Harman1
-
Idf * Frequency * Postings. Options.spec string: TermIdfFreqPostrec
Reference: Harman1
-
Idf * Frequency. Options.spec string: TermIdfFreq
Reference: Harman1
-
Porter Options.spec string: Porter.
Source: Porter() in
DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c
Reference: Harman1
-
Smeaton and vanRijsbergen Options.spec string:SmeatonvanRijsbergen
Source: SmeatonvanRijsbergen() in
DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c
Reference: Harman1
-
Robertson where w_ij is calculated using the Croft reweighting formula.
Options.spec string: RobertsonCroft
Source: RobertsonCroft() and Robertson() in
DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c
Reference: Harman1
-
Robertson where w_ij is the idf.
Options.spec string: RobertsonIdf
Source: calculate_rf_RobertsonIdfrankValues() and Robertson() in
DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c
Reference: Harman1
-
Return the top N unique terms from the sorted list.
Source: BestTerms() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c
This reduces the list of unique terms form relevance documents to at most
the number of terms the caller wishes returned or added to the query.
This avoids any additional processing on terms which are guaranteed not to
be used.
The current default value of N is 10.
The value of N can be set in the options.spec file in PRISE indexed
collections using the maxrfterms option. A minrfterms option may also
appear in the options.spec file but the current implementation of
BestTerms ignores this option.
-
Reweight both the terms returned by the previous step and the original
query terms.
Source: ReweightTerms() in
DISTRIBUTION_ROOT/prise_search/src/lib/relevance/reweightTerms.c
The goal of this step is to calculate weights for terms based on the
relevance documents which will improve the quality of the search
results.
The following is a list of the term reweighting methods available with
the string used in the options.spec file of a collection to select the
method:
-
Robertson & Sparck Jones. Options.spec string: Rsj.
Source: Rsj()
in DISTRIBUTION_ROOT/prise_tools/src/lib/weights/reweights.c
Reference: Okapi1
-
Robertson & Sparck Jones with negative weights being set to zero.
Options.spec string: RsjNeg0
Source: RsjNeg0() in
DISTRIBUTION_ROOT/prise_tools/src/lib/weights/reweights.c
-
Robertson & Sparck Jones using the Bookstein modification.
Options.spec string: BooksteinRsj
Source: BooksteinRsj() in
DISTRIBUTION_ROOT/prise_tools/src/lib/weights/reweights.c
-
Idf. Use the term's idf as its weight.
Options.spec string: ReweightIdf
Source: ReweightIdf() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/reweightTerms.c
-
Croft. Options.spec string: Croft
Source: Croft() in
DISTRIBUTION_ROOT/prise_tools/src/lib/weights/reweights.c
Reference: Harman1
-
Croft with negative weights being set to zero.
Options.spec string: CroftNeg0.
Source: CroftNeg0() in
DISTRIBUTION_ROOT/prise_tools/src/lib/weights/reweights.c
Reference: Harman1
-
Croft using the Bookstein modification. Options.spec string:
Bookstein.
Source: Bookstein() in
DISTRIBUTION_ROOT/prise_tools/src/lib/weights/reweights.c
Reference: Harman1
This process calculates the weights of the terms in the expanded query
during transparent relevance feedback.
Reweighting the query is a very similar to generating the enhanced list.
Currently the implementation is a distinct function with an explicit step
for re-parsing the relevance documents. Future versions may improve on this leaving the interface and its callers unaffected.
Currently we re-parse in order to account for user changes to the query
terms and relevant document list.
-
A query term can be added/changed.
If a query term is added/changed then the reweight value of this new term
must be calculated. Reweighting in the this case could be accomplished by
remembering all terms found in the relevant documents with the
information needed for reweighting.
Supporting last item in this list makes re-parsing necessary and so the
decision was made to re-parse in this case as well.
-
A user term can be deleted.
If a user term is deleted it is just forgotten and reweighting for that
term is not required.
-
Relevance documents can be added/deleted/changed.
If the relevance document set is changed by the user while selecting
enhancement terms or for some other reason then the reweight value of the
terms, which is derived from information in the relevance documents on
the list at the time this function is called, must be recalculated.
To obtain the information needed to recalculate the reweight values
parsing of the relevance documents is required.
This function is executed every time there are relevance documents in the
query and transparent relevance feedback mode is in use even if no query
enhancement terms were added by the caller. Thus the caller's query
terms will be reweighted with respect to the relevance documents supplied
even though no enhancement terms are in use.
Source: ReweightQuery() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/feedback.c
The steps in reweighting the query are:
-
Parse the relevance documents creating a list of
terms and stems.
-
Place in the original query terms the data from the parser needed for
term reweighting.
Source: CopyQtermRFData() in DISTRIBUTION_ROOT/prise_search/src/lib/relevance/termSort.c
-
Reweight the query terms with one of the
reweight methods.
Special restrictions
-
Document phrase search
This restriction requires that, of the documents found by the keyword
search, only those containing the phrase supplied by the caller should be
returned. For a document to pass phrase processing it must contain the
same string the caller has supplied, with the following conditions
-
Any string of contiguous white space in the query phrase matches any
string of contiguous white space in the document phrase regardless of the
lengths of the white space strings. Thus
Four score and seven years
matches
Four score and seven years
-
The comparison is case insensitive. Thus
Four score and seven years
matches
four Score and seven YEARS
-
No stemming is applied prior to the comparison. Thus
seven years
does not match
seven year
-
The order of terms matters. Thus
stocks and bonds
does not match
bonds and stocks
-
Punctuation in the query phrase must appear in the document phrase for
them to match. Thus the query phrase
beef prices, rise
does not match the document phrase
beef prices rise
-
Punctuation in the document phrase must appear in the query phrase for
them to match unless the punctuation is the last character, immediately
follows the last word, of the document phrase.
Thus the query phrase
beef prices rise
does not match the document phrase
beef prices, rise
but does match the document phrase
beef prices rise,
-
Hyphens which are embedded as part of terms in queries or documents must
appear in the corresponding query or document terms to match. Thus
mother in law
does not match
mother-in-law
-
Document proximity search
For this restriction the caller supplies two
terms and a distance measured in words. Of the documents found by the
document keyword search only those which contain the two terms where the
distance between the terms is less then or equal the distance supplied
will be returned.
The following conditions apply to proximity processing:
-
The comparison of query terms to document terms is case insensitive.
Thus
Stocks
matches
stocks
-
No stemming is applied prior to the comparison. Thus
stocks
does not match
stock
-
The order of terms does not matter. Thus
stock exchange
using distance 10
and
exchange stock
using distance 10
produce identical results.
-
Punctuation in the query terms must appear in the document terms for them to match.
Thus the query term
NIST's
does not match the document term
NIST
-
Punctuation in the document terms is treated as a token separator if it
does not appear in the query terms.
Thus the query term
NIST
matches the document term
NIST's
-
Hyphens which are embedded as part of terms in queries or documents must
appear in the corresponding query or document terms to match. Thus
mother in law
does not match
mother-in-law
-
Field keyword search
For this restriction the caller supplies a field name and a list of
keywords to search for in the field. Of the documents found by the
document keyword search only documents which contain the specified field
where the field contains at least one of the listed keywords are returned.
The following conditions apply to fielded keyword restrictions:
-
The comparison of query terms to document terms is case insensitive.
Thus
Stocks
matches
stocks
-
Stemming is applied prior to the comparison. Thus
stocks
matches
stock
-
The order of terms does not matter. Thus
stock exchange
and
exchange stock
produce identical results.
-
All punctuation is treated as token separators. Thus
mother-in-law
becomes three tokens
mother in law
The use of special restrictions has the effect of refining the set of
documents returned. This occurs because special restrictions can only
remove documents after the initial document keyword search. It cannot
add documents.
Special restrictions should be used carefully. Although an effort has
been made to minimize the use of string searching, a very time consuming
process, it is still the last step in supporting special restrictions.
Therefore, if it is applied using terms which occur in many documents in
the collection and produces a large number of documents even after Stage
1 of restriction processing, then this system may prove very slow.
Limitations
The PRISE Search Engine has some limitations
-
PRISE will process a query of up to 6000 terms.
-
Only the first 32 query terms used to retrieve documents will be
accounted for in the bitmaps associated with the documents. Any additional
query terms will add documents to the result set and be eligible to
influence document scores but will not be represented in the bitmap
(and will not be elligible for highlighting by the zclient).
Note: When using special restrictions, any special terms which do not
occur in the bitmaps will not contribute to Stage 1 of special
restrictions. This will result in more documents remaining to be
processed in Stage 2 of special restrictions, string searching, which
will slow response time.
-
Keyword searching, document or fielded, on numbers is not possible
because numbers are not indexed by the PRISE Indexer.
-
The string searches used to support special restrictions will time out
after 5 minutes and the result set returned will consist of the
documents which have passed special restrictions at that time. This time
out interval may be adjusted by changing the #define TIME_OUT_SECS.
Source: DISTRIBUTION_ROOT/prise_search/include/restrictions.h
-
Punctuation in the query is ignored for keyword searching.
Includes
#include <stdio.h>
#include <malloc.h>
#include <math.h>
#include <fcntl.h>
#include <string.h>
#include "type_defs.h"
#include "symb_defs.h"
#include "tkn.h"
#include "query.h"
#include "coll.h"
#include "options.h"
#include "result.h"
#include "search_proto.h"
#include "search0.h"
#include "io.h"
#include "errors.h"
Main Data Structures
Collection
typedef struct {
DICTFILE* dict;
POSTFILE* postings;
TPIFILE* tpi;
STOPFILE* stoplist;
COLLSTATS* stats;
DOCNO_TABLES* docno_tables;
FIELD_TABLE* field_table;
char* lexnmzn; /* Name of lexical normalization function. */
int stemfn;
char* locale;
char* dir;
char* name;
int mode; /*mode refers to SEARCH_MODE or INDEXER_MODE*/
OPTIONS* options;
} COLL;
Query
typedef struct TEXT_QUERY
{
/*-------------------------------------------------------------------------*
* Tags delimiting the part of the document to search
*-------------------------------------------------------------------------*/
char* start_end_tags; /* When full document is intended, this */
/* contains a pointer to a null string */
/*-------------------------------------------------------------------------*
* Data indicating the sort of search to [perform on the document part
*-------------------------------------------------------------------------*/
/*-------------------------------------------------------------*
* Keyword search
* No limit on number of terms
* Stemming
*-------------------------------------------------------------*/
struct
{
char* terms;
char* weights;
int numterms;
} keywds;
/*-------------------------------------------------------------*
* Proximity search
* 2 terms
* no stemming
* order does *not* matter
*-------------------------------------------------------------*/
struct
{
char* terms; /* in the order they must appear */
int dist;
} prox;
/*-------------------------------------------------------------*
* Phrase search
* No limit on number of terms
* No stemming
* Order matters
*-------------------------------------------------------------*/
struct
{
char* terms;
} phrase;
struct TEXT_QUERY* next; /* text query for next document section */
/* NULL if none follows */
} TEXT_QUERY;
/*---------------------------------------------------------------------*
* Typedefs to support relevant/nonrelevant document list and relevance
* feedback operations.
*---------------------------------------------------------------------*/
typedef struct docrelnode{ /* structure for passing relevant document
lists */
unsigned long docno; /* document id number */
float relevance; /* how relevant the document is */
struct docrelnode *next; /* pointer to next list member */
} DOCRELNODE;
typedef struct
{
int doSearch; /* Execute a search */
int returnReformulatedQuery; /* Send the reformulated query back to
* the client */
int reformulateQuery; /* Use the relevance documents to create a
* list of unique terms which can be added
* to the query. */
int newQuery; /* This query is not related to any
* previous query aka we have not done
* relevance feedback to get to this
* query
/*-------------------------------------------------------------------*
* Limits used to control the number of additional relevant terms
* used/returned for review. Supporting relevance feedback created the
* need for these so they are being kept with the relevance document
* lists.
* These may also be useful if we were to support boolean queries and
* the expand operation.
*-------------------------------------------------------------------*/
int minrelterms; /* The minimum number of additional
* relevant terms to be used/return for
* review. This may be useful if we use
* a selection method based on something
* other then position in ranked list,
* ex. top N terms. */
int maxrelterms; /* The maximum number of additional
* relevant terms to be used/return for
* review. */
DOCRELNODE *doc_rels; /* List of document relevance information */
int num_wanted; /* Desired number of items in the result */
enum query_type
{
Q_TYPE_TEXT,
Q_TYPE_DOCID,
Q_TYPE_BOOLEAN,
Q_TYPE_LIMIT /* The number of query types */
} qtype;
union q_type_union
{
TEXT_QUERY* text_query;
void* boolean; /* Not defined, yet. Not implemented, yet. */
} u;
union q_type_union reformed; /* Reformed query returned by searcher. */
} QUERY;
Result
typedef struct {
DOC* docs; /* Data representing retrieved documents. */
int doc_ct; /* The number of documents retrieved. */
int query_id; /* For experimentation and evaluation. */
QUERY *query; /* A PRISE query structure */
QTERM* qterms; /* A list of structures representing query */
/* terms.*/
/* This is typically a subset of 'tknlist' */
/* list. */
/* Stopwords, for example, have been removed.*/
/* QTERMs are not strictly necessary, since */
/* they are redundant given TKNs, but are */
/* provided for compatibility with other */
/* components, e.g., feedback. */
/* Allocated and set via qtinit(). */
/* qterms added by dld to allow for */
/* highlighting */
TKNLIST* tknlist; /* A list of structures representing the */
/* original */
} RESULT;
Functions
These functions are located in DISTRIBUTION_ROOT/prise_search/src/lib/search/search0.c.
int init_search0
(
char* coll_name Collection
char* coll_dir Collection directory
COLL** coll Structure representing collection components
RESULT** result Search results.
)
Returns:
RET_OK
RET_ERROR Failed to initialize the collection
Failed to malloc RESULT
Failed to allocate result space
RET_FAIL Initonly option - show specs no retrieval
The interface function init_search0() must be called prior to the first
call to search0.
int search0
(
COLL* coll Structure representing collection components
QUERY* query Query
RESULT* result Search results
)
Returns:
RET_OK Retrieved %d document(s)
RET_ERROR Error returned by search()
RET_FAIL No documents retrieved
The interface function search0() activates the PRISE search engine
function, search() in
DISTRIBUTION_ROOT/prise_search/src/lib/search/search.stem.neg.c.
int close_search0
(
COLL* coll Structure representing collection components
RESULT* result Search results
)
Returns:
RET_OK
References
-
Harman1: Relevance Feedback Revisited by
Donna Harman.
15th Ann Int'l SIGIR'92/Denmark-6/92.
-
Okapi1: Okapi at TREC-3 by
S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford
Overview of the Third Text REtrieval Conference (TREC-3)
NIST Special Publication 500-225, April 1995
Last updated: Tuesday, 01-Aug-2000 13:16:48 UTC
Date created: Monday, 31-Jul-00