NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

DR-LINK: A System Update for TREC-2
E. Liddy, S. Myaeng
National Institute of Standards and Technology, D. K. Harman

takes into account the lexical, semantic and discourse sources of linguistic information in both documents and queries. Secondly, it can serve as input to a filter which uses a more complex version of the original cut-off criterion to determine how many documents should be further processed by the system's final modules.

For the Integrated Matcher to produce a combined ranking, each document's similarity value for a given Topic Statement can be thought of as being composed of two elements. One element is the SFC similarity value and the other is the similarity value that represents the combined proper noun, complex nominal, and text structure similarities. Additionally, the system will have computed the regression formula, the mean, and the standard deviation of the distribution of the SFC similarity values for the individual Topic Statement. Using these statistical values, the system produces the cut-off criterion value.

Since we know from the eighteen-month results that 74% of the relevant documents had what we refer to as a k-value (then PN value; now PN, CN, TS values) and the remaining 26% of the relevant documents had no k-value, we use this information to predict what proportion of the predicted relevant documents should come from which segment of the ranked documents for full recall. The combined ranking can be envisioned as consisting of four segments, as shown in Figure 2.

    Docs. having a k-value & an SFC value above the cut-off        | Group 1
    ------------ cut-off criterion SFC similarity value ------------
    Docs. having a k-value & an SFC value below the cut-off        | Group 2

    Docs. having no k-value & an SFC value above the cut-off       | Group 3
    ------------ cut-off criterion SFC similarity value ------------
    Docs. having no k-value & an SFC value below the cut-off       | Group 4

Fig. 2: Schematic of Segmented Ranks from SFC & Integrated Ranking (k-value)

Four groups are required to reflect the two-way distinction mentioned above. The first distinction is between those documents which have a k-value, and which should contain 74% of the relevant documents, and those documents without a k-value, which should contribute 26% of the relevant documents. The second distinction is between those documents whose SFC similarity value is above the predicted cut-off criterion and those whose SFC similarity value is not.

When the cut-off criterion is the desired application, the system produces the ranked list in response to a desired recall level by concatenating the documents above the appropriate cut-off for that level of recall from Group 1, followed by the documents above the appropriate cut-off for that level of recall from Group 3. However, since our test results show that there is a potential 8% error in the predicted cut-off criterion for 100% recall, we use extrapolation to add the appropriate proportion of the top-ranked documents from Group 2 to Group 1 before concatenating documents from Group 3. These same values are used to produce the best end-to-end ranking of all the documents using the various segments.

Document ranks are produced by the Integrated Matcher, and the cut-off criterion is used either by an individual user who requires a certain recall level for a particular information need or, as in the twenty-four-month TIPSTER test situation, by the system to determine how many documents from the Integrated Matcher ranking will be passed on to the final modules for further processing.
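
The segmentation and concatenation described above can be illustrated with a minimal Python sketch. It is not the DR-LINK implementation. The names Doc, segment, cutoff_list, full_ranking and group2_fraction are hypothetical, and several details are assumptions not stated in the text: a single cut-off value is applied to both the k-value and no-k-value documents, documents with a k-value are ordered by k-value with the SFC value as tie-breaker, the selected Group 2 documents are appended after Group 1, and the end-to-end ranking follows the top-to-bottom order of Fig. 2. The proportion taken from Group 2 is left as a caller-supplied parameter because the extrapolation used to derive it is not given here.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    doc_id: str
    sfc: float                 # SFC similarity value for one Topic Statement
    k: Optional[float] = None  # combined PN/CN/TS similarity; None means no k-value


def segment(docs: List[Doc], cutoff: float) -> Dict[int, List[Doc]]:
    """Split documents into Groups 1-4 of Fig. 2, best-first within each group."""
    groups: Dict[int, List[Doc]] = {1: [], 2: [], 3: [], 4: []}
    for d in docs:
        above = d.sfc >= cutoff  # assumption: a tie counts as above the cut-off
        if d.k is not None:
            groups[1 if above else 2].append(d)
        else:
            groups[3 if above else 4].append(d)
    # Assumption: order by k-value, then SFC value; by SFC value alone if no k-value.
    for g in (1, 2):
        groups[g].sort(key=lambda d: (d.k, d.sfc), reverse=True)
    for g in (3, 4):
        groups[g].sort(key=lambda d: d.sfc, reverse=True)
    return groups


def cutoff_list(groups: Dict[int, List[Doc]], group2_fraction: float) -> List[Doc]:
    """Group 1, then the top share of Group 2, then Group 3."""
    # group2_fraction is a hypothetical stand-in for the extrapolated proportion.
    n = math.ceil(group2_fraction * len(groups[2]))
    return groups[1] + groups[2][:n] + groups[3]


def full_ranking(groups: Dict[int, List[Doc]]) -> List[Doc]:
    """End-to-end ranking of all documents, in the segment order of Fig. 2."""
    return groups[1] + groups[2] + groups[3] + groups[4]


if __name__ == "__main__":
    docs = [
        Doc("A", 0.9, 0.8), Doc("B", 0.7, 0.9),  # k-value, above the cut-off
        Doc("C", 0.3, 0.6), Doc("D", 0.2, 0.7),  # k-value, below the cut-off
        Doc("E", 0.8), Doc("F", 0.1),            # no k-value
    ]
    groups = segment(docs, cutoff=0.5)
    print([d.doc_id for d in cutoff_list(groups, group2_fraction=0.5)])
    print([d.doc_id for d in full_ranking(groups)])
```

Keeping the four groups as separate lists lets the same segmentation serve both uses named in the text: a truncated list for a requested recall level and a complete ranking of all documents.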