
Date created: Monday, 31-Jul-00

Notes from TIPSTER 12 month Workshop

The following material is from the 12 month TIPSTER Workshop. Study groups composed of attendees examined several areas related to TIPSTER and presented their conclusions to all attendees at the end of the workshop. These report-outs are summarized below for the following areas: Document Detection, Extraction, Multi-Lingual, Summarization, and Human/Computer Interface.

Document Detection Report

Does the commercial world benefit? Yes.

Infoseek, MatchPlus, INQUERY, Inktomi (Berkeley)

The TREC collection provides ground truth; that is the big contribution from TIPSTER.

Does Government (IC) have unique needs? Yes.

leading edge requirements
strange formats
scaling/scalability
multilingual
accuracy
recall (lawyers too)
heterogeneous environments
security

So, which should industry address?
Yes -- worthwhile for vendors to address

for leading edge issues
for companies in that market

Maybe --

special requirements may be difficult for small companies, but large companies can do it?

No --

custom data formats, computing environments not relevant for many companies

Areas that need to be addressed in future:

speech front ended
multi-lingual
corpus and analysis
clustering
work on GUIs
detecting duplicates
info-objects
structured queries
mixed

What are some resources which researchers could use/need?

multi-dictionaries
speech corpus
classification corpus
language & query collections

Should we develop five-primary models? (are users too unpredictable?)

We need to develop other means of evaluating (in addition to precision and recall). No breakthrough if we continue to evaluate with only TREC.
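For reference, the two measures named above can be computed from a system's result list and a set of relevance judgments. The following is a minimal sketch, not TREC's own evaluation tooling; the function name `precision_recall` and the sample document IDs are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: list of document IDs returned by a system (hypothetical data).
    relevant:  set of document IDs judged relevant (the "ground truth").
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the collection.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d2", "d4", "d7", "d8", "d9"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Both numbers depend only on set overlap, which is one reason the notes ask for additional measures: nothing here captures interaction, time spent, or task outcome.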

How can TREC be changed to be more real-world? Apply for "Innovative Funds".

Types of analyst; how they work

Some research should be using intelligence analysts.

Currently it takes analysts too long to search and fix.

Is new "Thinking Tool" category in tech strategy?

Commercial world is not there.

Interface matters and can be studied.

Need to identify types of tasks which users do. Ideally, have various kinds of modules to hook together in various ways.

Internet saturated by advertising.

Document Detection's Future

Application Areas	Retrieval Subtasks	Retrieval Effectiveness
Speech	GUI organize analyze	Interactive
Multi-lingual	Duplicate Detection	Query Analysis
Corpus Analysis classification clustering TDT	n/a	n/a
OCR	n/a	Metrics for IR (not DR)
Video	n/a	Structured Queries mixes db meta-data
Foreign Language	n/a	Common query language
Image	n/a	n/a
Spatial	n/a	n/a

Understand Application Areas
1. DB Entry
2. Summarization

chronological
causal

3. Correlation among Event Types (Fusion)
4. Coding/Classification
5. IR - typed Information

Resources used in Document Detection
We love TREC!
Additional resources needed

multi-lingual dictionaries
speech corpora
audio, phones, text, queries
classification corpora
large query collections in multiple forms

Extraction Report

Research Areas

Basic System Performance
Data Fusion
Portability - Critical

Critical Areas for Research Push

Portability
Adaptability
End User Training
Self-calibrating

Evaluation Is Driving Technology

Basic Technology
Fusion
Analyst Productivity

Government and Private Sector Very Similar Needs

Sub-language
Foreign languages
Applications in various basic areas

Ways to get evaluation to drive the research/technology

Portability
Adaptability
Self calibration
Basic system performance
Data Fusion
Portable

Depending on what you're extracting (domain) the techniques need to change

Multi-Lingual Report

No one is addressing cross-lingual (industry seems more interested in one-to-one)

Perhaps folks don't believe in cross-lingual (since MT hasn't worked)

At MT Summit - all wanted MT to "their" language

There is a need for training data

It was suggested we ask contractors to work up documents and send back

Could Lexis-Nexis customers work up documents?

Critical gap for Multi-Lingual is lack of training data

GVE Languages - about 12 are really important for CIA and NSA

Spend money on GVE

There is limited support for other than core languages

Ability to "ramp up" in "core language" would be really advantageous

There are 250 languages in which the Government desires capability

Is the MUTT system used for MVC?

Need a Text Widget which supports languages

Need auto and semi-auto bilingual dictionaries and OCR

Multi-lingual:

Text (Language)
Technology for many languages
Chinese to Chinese

Cross-Lingual:

Technology for examination/processing of texts in many languages, using different ones
English to Chinese

1. Current State of the Art
(a) Multilingual:

localization and interim
limited OCR support for "minor" languages

(b) Cross-language

MT Research
Current level of MT effort is limited, many unexplored areas

2. Timelines
What would be good:
(a) Better/wider MT research
(b) Better plumbing for minor languages
(c) Core language focus with rapid response capacity for 200+ "minor" languages, but where do the resources come from?
(d) Does this mean acquisition/learning is key?

3. The Impact of this Technology? Quality?
(a) Need to examine the potential and real impact in the organization

How is this measured?
COTS?
Operationally?
Is poor quality rapid MT better than none at all? ... Yes (tentative)

4. Types of Infrastructure Needed?
(a) OCR+ Acquisition Technology
(b) I/O, Display Technology
(c) "Interlingual" Tools

5. Commercial World Benefits?
(a) Doesn't the Commercial World always benefit?
(b) ...but should the Government care? Shouldn't MT address specific needs of Gov.?
(c) ...and are the needs of research and product different from commercial?

6. Government Special Needs
(a) Does Application precede capability?
(b) Often Government needs are so narrow or specialized a commercial market can't exist
(c) Difference between multi-government user and localized software users.

Summarization Report

Good attendance (75 folks)

We should do fun stuff first to generate interest

How should we get output from system

Sources of info for summary

"simple summary from single document" how?
working to single from many

Users want different things from summary

Lexis-Nexis - need specific, unique summaries for certain users
milestones:

keywords
identify main point (currently available)
sets of documents summarized (in +3 years)
summarize events/activities (in +5 years)
summarize/create a view point

Types of systems (components which have to share to make this work)

Swap/Share modules

What have we done?

Good progress

2 years ago - not much work

Now basic summary is easier

In 3 years user profiles should be available

In 4 years lexical/semantic

Description of dry run for Summarization

Dry run pointed out what can be controlled

Some cross-doc evaluation

Types of Functions (Users)
1. News browser - intelligence analyst

indicative, incomplete
query-sensitive
somewhat domain/genre-specific

2. Legal specialist - Lexis-Nexis

precise, complete
styles: "abstract", "head note"
very genre-specific

3. Financial analysts - numerical

include figures, tables
somewhat genre-specific
user-tailoring

4. IR/VRT engine

Milestone Systems
Milestone 0: Main keywords (DONE)
Milestone 1: Identify and present main points of (set of) newspaper like documents (TODAY)
Milestone 2: Summarize a set of documents, identifying the main themes. User query on object, entity, possible event (+3 years)
Milestone 3: Summarize events or activities of object, process, event, location, timeline, ... for set of documents. (+5 years)
Milestone X: Summarize argument of new point, create new point
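As a rough illustration of what Milestone 0 ("main keywords") involves, the sketch below ranks terms by raw frequency after dropping stop words. It is a minimal example under stated assumptions, not the method of any workshop participant; the function name `main_keywords`, the tiny stop-word list, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "that"}


def main_keywords(text, k=3):
    """Return the k most frequent non-stop-word terms in text."""
    # Lowercase and keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(k)]


# Hypothetical sample text.
sample = (
    "The retrieval system ranks documents. The system returns documents "
    "for a query, and the system is evaluated on retrieval."
)
print(main_keywords(sample, k=3))
```

Frequency alone says nothing about word sense or genre, which is consistent with the later milestones calling for semantic analysis and genre recognizers.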

Function/Technologies

1. Maximum marginal relevance engine
Now: Limited capability
Need: document decompression and clustering, knowledge base, R.A. structure, semantic analysis

2. Coreference engine
Now: limited, but useful
Need: temporal model, Word Sense disambiguation

3. NP Level Summarization
Now: working
Need: Word Sense disambiguation

4. Topic ID
Now: preliminary like MS
Need: Word Sense disambiguation, statistical association, knowledge base, genre recognizers

5. Term expansion
Now: limited, but useful - quality too low
Need: semantics

6. User profiling
Now: almost nothing
Need: user studies

7. Script/frame/pattern
Now: very little
Need: theory, definition, matching, learn collections, knowledge base
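Item 1 in the list above, the maximum marginal relevance (MMR) engine, can be sketched as a greedy selection that trades relevance to a query against redundancy with what has already been chosen. This is a minimal sketch of one common MMR formulation, assuming bag-of-words cosine similarity; the names `cosine`, `mmr_select`, and `lam`, and the sample sentences, are assumptions for illustration and not a description of any system discussed at the workshop.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mmr_select(query, sentences, k=2, lam=0.5):
    """Greedily pick k sentences that are relevant to the query but not redundant.

    Candidate score: lam * sim(s, query) - (1 - lam) * max sim(s, chosen).
    """
    q_vec = Counter(query.lower().split())
    vecs = [Counter(s.lower().split()) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is 0.0 while nothing has been selected yet.
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * cosine(vecs[i], q_vec) - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Hypothetical sentences: the second nearly repeats the first.
sentences = [
    "machine translation quality is improving",
    "machine translation quality is improving quickly",
    "training data for translation is scarce",
]
summary = mmr_select("machine translation quality", sentences, k=2, lam=0.5)
print(summary)
```

With `lam=1.0` the selection reduces to plain relevance ranking and the near-duplicate sentence is chosen second; lowering `lam` penalizes it in favor of the less similar sentence.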

Status of Technology
T-2 years:

hardly anything, commercial or government
simple Topic ID by MS
some by initial research

could assemble "level 1" engine from [MMR and coreference, Topic ID & NP]
assembly of systems: open problem

T+3 years:

assemble "level 2" engine [MMR+ coreference + Topic ID & NP & User]
start research (seriously) on user profiling and script/patterns on large scale
need lexical semantics! multi-lingual, "deeper" WordNet

T+6 years (?):

"level 3"

Human/Computer Interface Report

The Task

What led the person to engage in information interaction

Comparison

Exact, Best Match

Representation

Indexing, Classification

Presentation

Titles, Summaries

Visualization

Ranked List, Clusters

Navigation

Serial Scanning
Links

User

Long Term and current goals and tasks
Knowledge
Previous experience

Interaction

Evaluation
Selection
Recognition
Use

Information Objects

Type
Level
Medium

People involved in information creation and use

problem/task seldom "IR"
support richer variety of interaction
useful paradigms and combination
interfaces matter and can be studied
+5 years only interactive tree
+10 years 7±2 modes of interaction

Future

Long term studies
Research in underlying phenomena
Test beds for experimental and comparative studies