Notes on plans for DUC 2005 and beyond
------------------------------------------------------------------------------
A. Goals:

1) find some real need for summarization and motivate/define the
   evaluation framework from the point of view of one or more realistic
   task scenarios

2) move away from generic summaries of newspaper/wire to summaries of
   additional genres with respect to broad subject areas, but overlap
   in some ways with previous source types and tasks

3) continue working on evaluation
   a) continue to support development, use, and testing of tools for
      automatic evaluation (e.g., ROUGE)
   b) continue to explore better ways of coverage evaluation, such as
      the Columbia pyramid suggestions
   c) work hard to build up extrinsic evaluation

4) allow partial participation (by component, source type, ...)

5) be open to evolution of goals in the nature of the task (fusion,
   extraction, Q&A), input (not just text), and output (lists,
   outlines, timelines, etc.)
------------------------------------------------------------------------------
B. Characterization of summaries (i.e., situation reports) in terms of
Summarising Factors (Karen Sparck Jones, 20 May 2004 - see "Background
items" on the DUC Roadmap 2005-2007 webpage):

1) PURPOSE FACTORS

   a) the SITUATION, i.e.,
      the context within which the summary is to be used

      For DUC 2005: Situation report as of a given date, for use within
      a crisis management organization

   b) the AUDIENCE for a summary

      Can be characterised as targeted at an individual manager in the
      crisis management organization, as opposed to a news release, for
      example

   c) the USE, or function, for which the summary is intended

      To provide background, current status (problems and responses),
      and, as far as possible, likely development of the situation as
      related to the organization's role in the situation

2) INPUT FACTORS

   a) the SUBJECT TYPE of the source

      Definition: a variety of subject types possible per
      topic-situation, but drawn from a limited set known to systems at
      development time

      For DUC 2005: work within a natural disasters framework; use
      1998-2000 as the timeframe (matches the AQUAINT news data)

   b) the FORM of the source

      Definition: a variety of sources possible per topic-situation,
      but drawn from a limited set known to systems at development time

      For DUC 2005: the main source types for 2005 will be government
      (UN) documents and newspaper/wire. Additional possible sources
      include scientific documents, Usenet news threads, etc. (but see
      general issues).

      There are many tables, images of maps, graphs, etc. that could
      realistically be incorporated in a situation report, but their
      inclusion in DUC is deferred until systems are ready to work on
      handling them.

      Other groups would be encouraged to contribute and manage
      additional genres and media (non-English, speech, etc.). In a
      similar vein, additional tasks could be proposed that would fit
      into this scenario, such as "headlines" for easy click-down.
      "Manage" is the operative word here, in that groups would be
      responsible for providing the additional documents to all, and
      for evaluating the results when using non-English text or
      additional tasks. Note that it needs to be made clear how these
      additional documents or tasks fit into the situation report
      scenario.
   c) the UNITS taken as source

      Mostly multiple units per topic-situation, but there could be
      single units of a particular form

3) OUTPUT FACTORS

   a) the MATERIAL of the summary, i.e., the information it gives, in
      relation to that in the source

      For DUC 2005: use a single outline for all situations. It is
      based mainly on the WHO situation report outline, but modified
      based on the structure of other situation reports found on the
      Web. Each section will have a brief paragraph describing the
      kinds of information that belong there. (Note that this is well
      above the level of MUC templates or current factoid QA.)

      Proposed outline:
      1. What happened
      2. Geographical area affected, including information on the
         affected infrastructure (bridges, roads, land under water,
         etc.)
      3. Populations affected, including information on morbidity,
         mortality, homelessness, etc.
      4. Main needs
      5. Local/national response
      6. Regional/international response
      7. Social/political/geographical constraints
      8. Expected developments

   b) the FORMAT of the summary, i.e., the way the summary information
      is expressed

      For DUC 2005: blocks of running text summarizing information
      relevant to a given heading. These blocks of text could be
      composed of extracted sentences, extracted (long) phrases,
      constructed phrases, or generated text.

   c) the STYLE of the summary, i.e., the relationship to the content
      of the source

      Informative (maybe later: aggregative, critical)

   d) the EXPRESSION of the summary, i.e., all the linguistic features
      of the summary this subsumes

      Structured, somewhat technical, English narrative

   e) the BREVITY of the summary, i.e., the relative or absolute scale
      (length) of the summary

      Each block of text is limited to 665 bytes (up for discussion by
      the group)
------------------------------------------------------------------------------
C. EVALUATION

For DUC 2005: the evaluation will try to answer these questions using
the following means:

a) How much of the requested info does the submission contain?
   Assessors will create multiple reference reports for each section in
   the proposed outline. From these, some sort of list (hereafter
   called the infolist, and likely using the Columbia pyramid scheme)
   of main information items (reflecting the diversity of the reference
   reports if possible) will be created - again, one list for each
   section.

   1) Submissions will be evaluated using precision (how??) and recall
      against the infolist.
   2) Use of SEE, in which the model is the infolist and the peer is
      the block of text.
   3) Submissions will be evaluated against the infolist and/or the
      reference summaries using some automatic method (e.g., ROUGE).

b) How usable is the submission?

   Submissions will be evaluated in terms of the time it takes an
   assessor to find each item on the infolist or to determine it is not
   included in the report. This is an extrinsic evaluation, or a pseudo
   one. If no information is found for a given category of the report,
   the system should respond "nothing found".

c) How linguistically well-formed is the submission?

   Evaluate the submission in terms of quality questions as in DUC 2004.
------------------------------------------------------------------------------
D. PROPOSED TIMELINE (assuming a meeting at HLT in October 2005)

Starting NOW   An organized set of pilot projects looking into this
               infolist idea; note that there has to be some kind of
               convergence for evaluation

Dec 2004       NIST provides scenario template, test date range
               (1998-2000), test event types (natural disasters),
               several examples
               Groups look for additional non-newspaper/wire data
               (non-English, etc.)
               Groups train systems for the new task

June 1, 2005   NIST provides list of test events

By June 15     Groups select additional documents from
               non-newspaper/wire data and send them to NIST for
               distribution

June 15        NIST distributes all test documents

July 1         Results due at NIST two weeks later

July 30        Results from evaluation out to participants

Oct ??
               DUC 2005 meeting (at HLT)
------------------------------------------------------------------------------
E. General issues to be resolved:

1. Some scientific documents are available, but it is not clear how the
   detail they cover fits realistically into the outline. There are
   Usenet news threads, but again it is not clear how that sort of
   content would actually be used in writing a situation report.

2. infolist: this is the holy grail we are all seeking - the list that
   provides the "nuggets" of important information that need to be
   contained in a summary. This is related to the QA nuggets, to the
   Columbia pyramid scheme, etc. The reason for continuing to pursue
   this is clear; how to do it is less clear.

   One way to tackle this would be a pilot based on the Columbia
   pyramid scheme. First, Columbia would develop written guidelines,
   possibly using a sample of DUC 2004 topics. Then other groups would
   try to use those guidelines on additional topics. Finally, NIST
   would use the "final" version of the guidelines with our assessors.
   It is important that multiple groups work on this project so that
   this issue can be discussed with better understanding within the
   whole community.

3. How does component analysis fit into this?
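As a concrete illustration of the infolist/pyramid idea from sections C
and E.2, the sketch below scores a peer summary block against a pyramid
of nuggets built from multiple reference reports: each nugget's weight
is the number of reference reports that contain it, and the score is
the weight the peer captured divided by the best weight achievable with
the same number of nuggets. This is only a minimal, hypothetical
sketch - the function name, the nugget labels, and the exact-match
treatment of nuggets are all illustrative assumptions; in the real
pyramid method, matching content units to a peer requires human
annotation, and the precision/recall questions in C.a.1 remain open.

```python
from collections import Counter


def pyramid_score(peer_nuggets, reference_reports):
    """Pyramid-style score for one outline section (illustrative only).

    peer_nuggets: set of nugget labels judged present in the peer block.
    reference_reports: list of sets, one set of nugget labels per
    reference report written by an assessor.
    """
    # Build the pyramid: a nugget's weight (tier) is the number of
    # reference reports it appears in.
    weights = Counter()
    for report in reference_reports:
        for nugget in report:
            weights[nugget] += 1

    # Weight actually captured by the peer.
    captured = sum(weights[n] for n in peer_nuggets if n in weights)

    # Ideal weight: the |peer_nuggets| heaviest nuggets in the pyramid.
    top = sorted(weights.values(), reverse=True)[: len(peer_nuggets)]
    ideal = sum(top)
    return captured / ideal if ideal else 0.0


# Hypothetical nugget data for one outline section of one topic.
refs = [
    {"flooding", "dams breached", "10000 homeless"},
    {"flooding", "10000 homeless"},
    {"flooding", "roads cut"},
]
peer = {"flooding", "roads cut"}  # nuggets matched in the peer block

# Pyramid weights here: flooding=3, 10000 homeless=2, others=1.
# Captured 3+1=4 out of an ideal 3+2=5 for two nuggets.
print(pyramid_score(peer, refs))  # -> 0.8
```

One attraction of this formulation for the pilot in E.2 is that the
same nugget annotations also yield plain recall against the infolist
(captured nuggets over all nuggets), so the pilot could report both
measures from one pass of assessor judgments.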