TIPSTER Text Program A multi-agency, multi-contractor program



Date created: Monday, 31-Jul-00

Notes from TIPSTER 12 month Workshop

The following material is from the 12 month TIPSTER Workshop. Study groups composed of attendees examined several areas related to TIPSTER and presented their conclusions to all attendees at the end of the workshop. These Report Outs are summarized below for the following areas: Document Detection, Extraction, Multi-Lingual, Summarization, and Human/Computer Interface.


Document Detection Report


Does the commercial world benefit? Yes:

Infoseek, MatchPlus, INQUERY, Inktomi (Berkeley)

TREC Collection provides ground truth; that is the big contribution from TIPSTER

Does Government (IC) have unique needs? Yes.

  • leading edge requirements
  • strange formats
  • scaling/scalability
  • multilingual
  • accuracy
  • recall (lawyers too)
  • heterogeneous environments
  • security

So, which should industry address?
Yes -- worthwhile for vendors to address

  • for leading edge issues
  • for companies in that market

Maybe --

  • special requirements may be difficult for small companies, but large companies can do it?

No --

  • custom data formats, computing environments not relevant for many companies

Areas that need to be addressed in future:

  • speech front ended
  • multi-lingual
  • corpus and analysis
  • clustering
  • work on GUIs
  • detecting duplicates
  • info-objects
  • structured queries
  • mixed

What are some resources which researchers could use/need?

  • multi-dictionaries
  • speech corpus
  • classification corpus
  • language & query collections

Should we develop five-primary models? (are users too unpredictable?)

We need to develop other means of evaluating (in addition to precision and recall). No breakthrough if we continue to evaluate with only TREC.
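For reference, the two measures named above can be sketched as follows. This is a minimal illustration assuming binary relevance judgments of the kind the TREC collection supplies; the function name and document IDs are made up for the example and do not come from the workshop.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document IDs the system returned
    relevant:  document IDs judged relevant (the ground truth)
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: fraction of returned documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that were returned.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three judged relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Both numbers describe only the returned set. Neither says how long an analyst took to find an answer or how usable the interface was, which is the kind of gap the call for additional evaluation measures points at.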

How to change TREC to be more real world? Apply for "Innovative Funds".

Types of analyst; how they work

Some research should be using intelligence analysts.

Currently it takes analysts too long to search and fix.

Is new "Thinking Tool" category in tech strategy?

Commercial world is not there.

Interface matters and can be studied.

Need to identify types of tasks which users do. Ideally, have various kinds of modules to hook together in various ways.

Internet saturated by advertising.

Document Detection's Future

Application Areas                                  | Retrieval Subtasks      | Retrieval Effectiveness
Speech                                             | GUI (organize, analyze) | Interactive
Multi-lingual                                      | Duplicate Detection     | Query Analysis
Corpus Analysis (classification, clustering, TDT)  | n/a                     | n/a
OCR                                                | n/a                     | Metrics for IR (not DR)
Video                                              | n/a                     | Structured Queries (mixes, db, meta-data)
Foreign Language                                   | n/a                     | Common query language
Image                                              | n/a                     | n/a
Spatial                                            | n/a                     | n/a

Understand Application Areas
1. DB Entry
2. Summarization

  • chronological
  • causal

3. Correlation among Event Types (Fusion)
4. Coding/Classification
5. IR - typed Information

Resources used in Document Detection
We love TREC!
Additional resources needed

  • multi-lingual dictionaries
  • speech corpora
  • audio, phones, text, queries
  • classification corpora
  • large query collections in multiple forms

Extraction Report

Research Areas

  • Basic System Performance
  • Data Fusion
  • Portability - Critical

Critical Areas for Research Push

  • Portability
  • Adaptability
  • End User Training
  • Self-calibrating

Evaluation Is Driving Technology

  • Basic Technology
  • Fusion
  • Analyst Productivity

Government and Private Sector Very Similar Needs

  • Sub-language
  • Foreign languages
  • Applications in various basic areas

Ways to get evaluation to drive the research/technology

  • Portability
  • Adaptability
  • Self calibration
  • Basic system performance
  • Data Fusion
  • Portable

Depending on what you're extracting (domain) the techniques need to change


Multi-Lingual Report

No one is addressing cross-lingual (industry seems more interested in one-to-one)

Perhaps folks don't believe in cross-lingual (since MT hasn't worked)

At MT Summit - all wanted MT to "their" language

There is a need for training data

It was suggested we ask contractors to work up documents and send back

Could Lexis-Nexis customers work up documents?

Critical gap for Multi-Lingual is lack of training data

GVE Languages - about 12 are really important for CIA and NSA

Spend money on GVE

There is limited support for other than core languages

Ability to "ramp up" in "core language" would be really advantageous

There are 250 languages in which the Government desires capability

Is MUTT system used for MVC?

Need a Text Widget which supports languages

Need auto and semi-auto bilingual dictionaries and OCR

Multi-lingual:

  • Text (Language)
  • Technology for many languages
  • Chinese to Chinese

Cross-Lingual:

  • Technology for examination/processing of texts in many languages, using different ones
  • English to Chinese

1. Current State of the Art
(a) Multilingual:

  • localization and interim
  • limited OCR support for "minor" languages

(b) Cross-language

  • MT Research
  • Current level of MT effort is limited, many unexplored areas

2. Timelines
What would be good:
(a) Better/wider MT research
(b) Better plumbing for minor languages
(c) Core language focus with rapid response capacity for 200+ "minor" languages... but where do resources come from?
(d) Does this mean acquisition/learning is key?

3. The Impact of this Technology? Quality?
(a) Need to examine the potential and real impact in the organization

  • How is this measured?
  • COTS?
  • Operationally?
  • Is poor quality rapid MT better than none at all? ....Yes (tentative)

4. Types of Infrastructure Needed?
(a) OCR+ Acquisition Technology
(b) I/O, Display Technology
(c) "Interlingual" Tools

5. Commercial World Benefits?
(a) Doesn't the Commercial World always benefit?
(b) ...but should the Government care? Shouldn't MT address specific needs of Gov.?
(c) ...and are the needs of research and product different from commercial?

6. Government Special Needs
(a) Does Application precede capability?
(b) Often Government needs are so narrow or specialized a commercial market can't exist
(c) Difference between multi-government user and localized software users.


 


Summarization Report

Good attendance (75 folks)

We should do fun stuff first to generate interest

How should we get output from the system?

Sources of info for summary

  • "simple summary from single document": how?
  • working to single from many

Users want different things from summary

Lexis-Nexis - need specific, unique summaries for certain users
Milestones:

  1. keywords
  2. identify main point (currently available)
  3. sets of documents summarized (in + 3 years)
  4. summarize events/activities (in +5 years)
  5. summarize/create a view point

Types of systems (components which have to share to make this work)

Swap/Share modules

What have we done?

Good progress

2 years ago - not much work

Now basic summary is easier

In 3 years user profiles should be available

In 4 years lexical/semantic

Description of dry run for Summarization

Dry run pointed out what can be controlled

Some cross-doc evaluation

Types of Functions (Users)
1. News browser - intelligence analyst

  • indicative, incomplete
  • query-sensitive
  • somewhat domain/genre-specific

2. Legal specialist - Lexis-Nexis

  • precise, complete
  • styles: "abstract", "head note"
  • very genre-specific

3. Financial analysts - numerical

  • include figures, tables
  • somewhat genre-specific
  • user-tailoring

4. IR/VRT engine

Milestone Systems
Milestone 0: Main keywords (DONE)
Milestone 1: Identify and present main points of (set of) newspaper like documents (TODAY)
Milestone 2: Summarize a set of documents, identifying the main themes. User query on object, entity, possible event (+ 3 years)
Milestone 3: Summarize events or activities of object, process, event, location, timeline, ... for set of documents. (+5 years)
Milestone X: Summarize argument of new point, create new point

Function/Technologies

1. Maximum marginal relevance engine
Now: Limited capability
Need: document decompression and clustering, knowledge base, R.A. structure, semantic analysis

2. Coreference engine
Now: limited, but useful
Need: temporal model, Word Sense disambiguation

3. NP Level Summarization
Now: working
Need: Word Sense disambiguation

4. Topic ID
Now: preliminary like MS
Need: Word Sense disambiguation, statistical association, knowledge base, genre recognizers

5. Term expansion
Now: limited, but useful - quality too low
Need: semantics

6. User profiling
Now: almost nothing
Need: user studies

7. Script/frame/pattern
Now: very little
Need: theory, definition, matching, learn collections, knowledge base
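Item 1 above, maximum marginal relevance (MMR), can be sketched as a greedy selection loop: each step picks the candidate most similar to the query and least similar to what has already been picked. This is only an illustration under stated assumptions. The bag-of-words cosine similarity, the whitespace tokenizer, the `lam` trade-off weight and the sample sentences are all choices made for this example, not details reported at the workshop.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, sentences, k=2, lam=0.7):
    """Greedily pick k sentences by marginal relevance.

    score = lam * sim(sentence, query)
            - (1 - lam) * max sim(sentence, already selected)
    """
    q_vec = Counter(query.lower().split())
    vecs = [Counter(s.lower().split()) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))

    def score(i):
        relevance = cosine(vecs[i], q_vec)
        # Redundancy is 0.0 while nothing has been selected yet.
        redundancy = max((cosine(vecs[i], vecs[j]) for j in selected),
                         default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Hypothetical input: the first two sentences are near-duplicates.
sentences = [
    "the bank raised interest rates",
    "the bank raised interest rates today",
    "inflation fell after rates rose",
]
picked = mmr_select("bank interest rates", sentences, k=2, lam=0.5)
print(picked)
```

With `lam=0.5` the second pick skips the near-duplicate sentence in favor of the less relevant but novel one. That behavior is what makes the approach attractive for summarizing sets of documents (Milestone 2), where the same fact is often repeated across sources.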

Status of Technology
T-2 years:

  • hardly anything, commercial or government
  • simple Topic ID by MS
  • some by initial research

T:

  • could assemble "level 1" engine from [MMR and coreference, Topic ID & NP]
  • assembly of systems: open problem

T + 3 years:

  • assemble "level 2" engine [MMR+ coreference + Topic ID & NP & User]
  • start research (seriously) on user profiling and script/patterns on large scale
  • need lexical semantics! multi-lingual, "deeper" WordNet

T+6 years (?):

  • "level 3"

Human/Computer Interface Report


The Task

  • What led the person to engage in information interaction

Comparison

  • Exact, Best Match

Representation

  • Indexing, Classification

Presentation

  • Titles, Summaries

Visualization

  • Ranked List, Clusters

Navigation

  • Serial Scanning
  • Links

User

  • Long Term and current goals and tasks
  • Knowledge
  • Previous experience

Interaction

  • Evaluation
  • Selection
  • Recognition
  • Use

Information Objects

  • Type
  • Level
  • Medium

People involved in information creation and use

  • problem/task seldom "IR"
  • support richer variety of interaction
  • useful paradigms and combination
  • interfaces matter and can be studied
  • +5 years only interactive tree
  • +10 years 7±2 modes of interaction

 

Future.

  • Long term studies
  • Research in underlying phenomena
  • Test beds for experimental and comparative studies