Date created: Monday, 31-Jul-00

Summarization Concepts

Traditionally, the preparation of document abstracts has been a human function. A person knowledgeable in the subject matter of the document reads it and then writes a short summary, typically one paragraph long. The abstractor tries to include in the summary all of the important ideas and concepts presented in the original document. What counts as important, however, depends on the abstractor's judgment.

Tests show that human abstractors do not always agree completely on the content of an abstract. Some people consider 85% agreement between abstractors to be about the best that can be obtained.

The tremendous increase in the number of documents, and their easier availability through electronic means, have put an insurmountable burden on organizations that rely on humans to do abstracting. In only a few cases, such as technical and scholarly papers, can the preparation of the abstract be the responsibility of the author. Traditionally, news articles put the most 'important' information in the first paragraph; however, because news articles have short paragraphs, it is usually necessary to use additional paragraphs to get a good summary of the article.

With exponential increases in the number of documents available, high-quality abstracts become more important, as the demand for finding 'the right document' is also increasing. Thus, it is not surprising that efforts are underway to develop machine-aided methods that improve the quality of information available to the user and reduce the time needed to get it. One of these efforts is summarization.

Summarization can be more than just abstracting. An abstract is usually thought of as being associated with a single document, but there may be a need to cluster or categorize large groups of documents with similar subject matter under a single summary. Summarization may be applied at different points in the normal text processing sequence to improve the relevance of the information delivered to the user, including:

Building shorter indexes with more relevant words,
Reducing retrieval time by retrieving against the summary,
Aiding browsing by using more succinct information,
Eliminating duplicate or near-duplicate documents and redundant information,
Identifying relevant information.
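As an illustration of the duplicate-elimination item above, the following is a minimal sketch that flags near-duplicate documents by comparing sets of overlapping word sequences ("shingles") with Jaccard similarity. This is one common approach, not a method described by the TIPSTER program; the function names, the shingle size `k`, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle (or none if there are no words).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Keep each document unless it is too similar to one already kept.

    threshold is an assumed cut-off; it would need tuning on real data.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

Calling `drop_near_duplicates` on a list of document strings returns the list with later near-copies removed. Comparing every document against every kept document is quadratic, so a large collection would likely need an approximate scheme instead.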

Summarization will use natural language processing methods and/or statistical techniques to achieve a significant reduction in the quantity of text presented to a user with minimal reduction in information content.

Techniques that may be used, independently or in combination, to build summaries include:

Selecting important paragraphs from a document.
Selecting important sentences from a document.
Selecting high frequency, meaningful words from a document.
Selecting unusual words from a document.
Counting repeated word usage to identify important sentences.
Using information extraction techniques to identify important document entities, e.g., person names, place names, company names, organizations, numeric data and temporal data. More complex extraction methods can be used for determining relationships between entities.
Using vector techniques to group either documents or paragraphs under common concepts.
Using retrieval techniques to identify the documents that respond to a complex query; the retrieved material then serves as the desired summary. This technique may be good for easily adjusting the scope and grain of the summary.
Performing some level of modification of the selected sentences or paragraphs using natural language or statistical techniques.
Using natural language techniques to synthesize new sentences or paragraphs. This may include coreference resolution of pronouns and other constructs.
Applying statistical techniques to condense document or collection content.
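Several of the techniques listed above (selecting important sentences, selecting high-frequency meaningful words, and counting repeated word usage) can be combined in a very simple extractive summarizer. The sketch below is only an illustration under stated assumptions: the stop-word list, the naive sentence splitter, the scoring rule (average document-wide frequency of a sentence's content words), and all function names are hypothetical choices, not part of any TIPSTER system.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real systems use fuller ones.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "in", "is", "it", "its", "of", "on", "or", "that", "the", "this", "to",
    "was", "were", "with",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(sentence):
    """Lower-cased words of the sentence with stop words removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    # Count how often each content word is used across the whole document.
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]
```

A sketch like this only selects existing sentences. It does nothing about dangling pronouns or redundancy between the chosen sentences, which is where the modification and synthesis techniques in the list would come in.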

It should be noted that effective summarization is not an easy task; it frequently involves semantic analysis and the application of world knowledge to achieve the clearest presentation. While some research has been done and a few trial systems have been developed, there is still a great deal of work to be performed before really good summarization software is available. And of course, multilingual considerations only increase the difficulty.

Another issue under the summarization umbrella is how well a particular approach works. Some initial work is underway as part of the TIPSTER program to examine the feasibility of performing summarization evaluation in a manner similar to MUC and TREC.
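The text does not say what form such an evaluation would take. As one possible reading, the sketch below scores an extractive summary against a human-chosen reference using precision and recall, the kind of measures reported in MUC and TREC. Representing summaries as sets of sentence indices, and the function name itself, are assumptions made for this example.

```python
def precision_recall_f1(selected, reference):
    """Compare system-selected sentence indices with a human reference set.

    precision: fraction of selected sentences that the reference also chose
    recall:    fraction of reference sentences that the system selected
    f1:        harmonic mean of precision and recall
    """
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because human abstractors themselves do not agree completely, scores from a measure like this would probably need to be read against the level of agreement between two human references rather than against a perfect score.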