Finding Links

John Tebbutt
The National Institute of Standards and Technology,
Gaithersburg, MD 20899, USA
E-mail: tebbutt@nist.gov

ABSTRACT

Possibilities for the automatic designation of pre-existing text elements as implicitly-typed links through the use of information retrieval technology are discussed. Results of preliminary work in this area are presented, and plans for future research outlined.

KEYWORDS:

Automatic hypertext construction, embedded links, installed links, hypertext, information retrieval, IR

INTRODUCTION

Previous research on the automatic creation of hypertexts from "flat" document collections using information retrieval techniques has spawned highly effective tools (see, for example, [1]). These tools typically link a given document to one or more others using large-scale features of the source document (a collection of paragraphs, or the document itself) as a query to an IR system, and then appending a series of links to those documents which were ranked highest. Such links between documents provide a valuable browsing function within a document collection, but are of necessity very general in nature, even if types are assigned to the links, as suggested by Allan [2].

LINK SPECIFICITY

The connection between the text of a link anchor embedded in the text of a source document and its target is usually very specific. As an example, a link text like the history of race-based rioting in Los Angeles leaves little room for doubt as to what to expect of the target document. The length of the link text, its position and its semantic content tune our expectations of the link target far more finely than the browsing-type links described above.

EMBEDDED LINKS AND RARE TERMS

Our emphasis here is solely on the feasibility of automatically deriving precise, tightly-focused links from pre-existing text elements, without attempting to assess the usefulness of such links. For an elegant treatment of the issues surrounding usefulness of links, see [3].

Our assumption in trying to designate embedded links in a document collection is that the more content-rich elements of a text will prove most useful to the system in filtering out the best link targets. The trick, then, is to identify such text elements.

Our initial focus has been to look at terms or words that are relatively rare in the collection. The rarity of a term is a central factor in IR systems in determining the relevance of documents to a given query and in discriminating between documents based on the terms they contain.

METHOD AND RESULTS

The document collection used was a subset of the Associated Press 1990 TREC collection, some 30,000+ documents. The search engine used was NIST?s PRISE system, incorporating the BM25 weighting scheme. Assessment was performed by the principal researcher.

The basic method is straightforward, if not crude. First, we define a rare term. Initially this was any term occurring in between two and five documents (i.e. any term with a Collection Frequency, CF, of between 2 and 5). Next, each occurrence of the term is located, and a block of text containing the term (initially a sentence) is passed to the search engine as a query - this text block also serves as the link anchor. Finally, the highest-ranked document returned by the search engine, excluding the document from which the query was taken, is designated as the link target and the link is completed.

Preliminary Findings

A preliminary study using this method yielded encouraging results: 32% of target documents were judged to be either perfect, near-perfect or at least acceptable targets for the designated link text. If a relaxed standard was adopted, such that the target document might be considered generally relevant to the source document as a whole, then the hit-rate soared to 62%.

While encouraging, these figures leave much room for improvement. Accordingly, we examined the data in an attempt to identify areas in which improvements could be made.

Proper Names. Two kinds of error arose from the inclusion of proper names in our rare term database: the same name might identify different entities, or the name might refer to the same entity but in an inappropriate context.

Non-standard Spellings. A non-standard spelling effectively creates a new term, which may be rare in the collection though its normally-spelled counterpart is relatively mundane. This leads to two types of error: false links may be created between otherwise unrelated documents, and legitimate links to documents containing the standard spelling of the word will be missed.

Principal Study

Our principal study included several refinements based on our preliminary findings.

Notably, proper names and non-standard spellings were removed from the rare term database by filtering the indexer terms against a list of over 100,000 English words. Additionally, at most one link per rare term was included per document

The scope of the link anchor text was expanded from the sentence to the paragraph level. In many cases this made little or no difference, however, it was felt that more context information would be added to the query and thus retrieval accuracy would be improved.

Finally, the experiment was conducted for multiple Collection Frequency ranges: CF = 2-4; CF = 4-6; CF = 14-16; CF = 100-102; and CF = 200-202.

Three major factors came to light in this study:

Increased Precision. Over all CF ranges, the percentage of links that were judged "direct hits" was much higher than in the earlier study: 35% - 40% of target documents were judged to be perfect or near-perfect targets for the designated link texts, while some 60% of target documents were judged to be either perfect, near-perfect or at least acceptable targets for the designated link text. The broader category mentioned in the earlier study was not evaluated.

Error-rate Unchanged. Over all CF ranges, the percentage of target documents judged as being unrelated to the text of the link anchor was about 35%. This figure is essentially the same as that obtained in the preliminary study.

Results Independent of Collection Frequency. The percentage figures for suitability of a document as a target for a given anchor text were essentially the same across all collection frequencies.

DISCUSSION AND CONCLUSIONS

The preliminary work described here has yielded encouraging results. However, the high error rates and the consistency of link suitability measures across a range of Collection Frequencies used to define a rare term give us cause for concern.

Contribution of Rare Terms

Our observation that link target selection effectiveness was the same across all "rare" term collection frequencies implies that the rarity of the term used in the designation of the link anchor has little or no effect in discriminating between link targets. It seems likely that this is a result of the rare term being swamped by the additional text in the query. If this is the case, it should be possible to increase the contribution of the rare term by shortening the query or by using a form of query processing which takes the Collection Frequencies of the query terms into account.

Error Rates

The high error rates obtained so far may in part be due to the same blurring of focus in the link anchor postulated above as a cause of the CF-independence of the results. If the discriminative power of the query is diluted by a swamping of the rare term, then it might be expected that a higher error rate would result.

Also, because we are dealing with a closed collection of documents, it may be that a satisfactory link target simply does not exist, even for the best focused query.

Ongoing and future research

We are currently experimenting with ways to tighten the focus of the link anchor texts. Methods under investigation include use of a fixed-width window of words surrounding the rare term; using a sliding window in the document within which statistics concerning all the terms are tallied, the text window being designated as a link anchor and submitted as a query whenever that tally exceeds some threshold value; and introducing query processing into the search engine. The relative effectiveness of these techniques will need to be assessed.

In the future we will turn our attention to reducing the occurrence of "bad links". This will involve detection and correction mechanisms. The latter may include some reformulation of the link anchor text, or some recourse to external material, e.g. another document collection, an encyclopedia look-up, etc.

REFERENCES

1. Agosti, M., Crestani, F. and Melucci, M. Design and Implementation of a Tool for the Automatic Construction of Hypertexts for Information Retrieval. Information Processing and Management, 32(4):459-476, 1996.

2. Allan, J. Automatic Hypertext Link Typing. Proceedings of the ACM Hypertext ?96 Conference, Washington, DC,pp.42-52.

3. Golovchinsky, G. Queries? Links? Is There a Difference? Proceedings of CHI ?97, Atlanta, GA.