Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
needed for handling documents which are not available with a suitable abstract.
Some larger selections from the full text of documents consisting of more
material than the abstract, yet less than full text, may be possible; for
example, section headings and figure captions might be added to the abstract.
In the present study, several different selections of documents
will be compared, the shortest being titles only, and the longest a collection
of full text "short" conference papers. Evaluation of these different
document lengths will center on the retrieval performance achieved. Other
evaluation criteria such as search time and input cost will be of considerable
importance in operational environments, but in the experimental tests being
performed on the SMART system no reasonable simulation test of these criteria
can yet be made.
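
The selections described above can be pictured as nested views of a single document, each longer selection containing the shorter ones. The following is a minimal sketch under stated assumptions, not a description of how the SMART system stores or processes documents: the `Document` record, its field names, the level names, and the `select` function are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    # Hypothetical record; the field names are assumptions, not SMART structures.
    title: str
    abstract: str = ""
    headings: List[str] = field(default_factory=list)
    captions: List[str] = field(default_factory=list)
    body: str = ""


def select(doc: Document, level: str) -> str:
    """Return the text of one document selection.

    The levels are nested: every selection includes the title, and every
    selection longer than "title" includes the abstract.
    """
    if level == "title":
        parts = [doc.title]
    elif level == "abstract":
        parts = [doc.title, doc.abstract]
    elif level == "abstract_plus":
        # Abstract enlarged with section headings and figure captions.
        parts = [doc.title, doc.abstract, *doc.headings, *doc.captions]
    elif level == "full_text":
        parts = [doc.title, doc.abstract, doc.body]
    else:
        raise ValueError(f"unknown selection level: {level}")
    # Skip empty fields so a missing abstract does not leave stray spaces.
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    doc = Document(
        title="Document Length",
        abstract="Several selections of document text are compared.",
        headings=["SMART Test Comparisons"],
        body="Three series of comparisons of document length are presented.",
    )
    for level in ("title", "abstract", "abstract_plus", "full_text"):
        print(level, len(select(doc, level).split()))
```

Keeping the levels nested mirrors the design of the comparison: any difference in retrieval performance between two selections can then be attributed to the material added, since nothing is taken away.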
2. SMART Test Comparisons
Three series of comparisons of document length are presented. Firstly,
the use of abstracts (including titles) is compared to the use of document
titles alone. Results are presented for the three collections of documents
being used for current experiments in the subject areas of computer science
(IRE-3, 780 documents, 3[OCRerr] requests), aerodynamics (Cran-1, 200 documents,
[OCRerr]2 requests), and documentation (ADI, 82 documents, 35 requests). Secondly,
using the ADI Collection, the abstracts are compared to the use of full text.
In the main results, the text used includes the abstract, and both naturally
include the title, so that three distinct document lengths are available
for comparison. The ADI Text Collection consists of a set of short conference
papers of average length 1,380 words; it is therefore not typical of scien-
tific papers in general, and does not pose any problems due to non-textual
material. The third comparison is made with the Cran-1 abstracts which are