Date created: Monday, 31-Jul-00
Glossary of Terms
- Abstract - A document summary that succinctly captures the significant
concepts in the document. Abstracts are usually prepared by humans. See
- Annotation - Additional information associated with a document or a
collection. Under the TIPSTER concept, annotations are the principal way
components pass data to one another. Annotations are usually the result of
extraction processes; however, users may also create annotations. See the
Architecture Design document.
- Attribute - A characteristic of a collection, document, or annotation,
represented by a single value or a set of values.
- Collection - A group of documents, usually with some characteristic(s) in common.
Under TIPSTER the implementation of a Collection is broad: a Collection may be
the actual documents (text) or a list of document identifiers (ID). A document may
appear in more than one Collection.
- Corpus - All the documents in the domain of interest.
- Coreference - An alternate reference to an entity. Example: "John Smith is the president of
Big Linguistics, Inc. He had a problem with the board of directors. Eventually the
board decided the president should be replaced." Here "He" and "the president" both refer to John Smith.
- Component - a major piece of code in the TIPSTER Concept. Equivalent to a
Computer System Component (CSC) in conventional life-cycle definitions. Example -
a detection component. Also see module.
- Document Detection - The selection of one or more documents which meet a
Detection Need or Query. Equivalent to the older term Information Retrieval.
- Detection Need - A statement that specifies the user's criteria for selecting
documents from a Corpus. Under TIPSTER a Detection Need may contain any or all
of the following: keywords, Boolean terms, free text describing a document or
concept, and examples of desired or undesired documents. Interpretation of a
Detection Need results in a query which may be quite complex in structure. See
- Extraction - The selection of specific types of information from text, e.g. person
name, place names, companies, organizations, or relationships between text entities.
See Information Extraction
- Fill Rules - the criteria that describe the constraints used to select information for
template slots and the conditions under which Template Objects are instantiated. See
- Graphical User Interface (GUI) - Graphical interfaces are not part of the TIPSTER
Architecture; however, they are usually necessary for applications. TIPSTER
components frequently interface to a GUI.
- Information Extraction - Same as 'text extraction'. The selection of specific types of
information from text, e.g., person names, place names, companies, organizations,
temporal data, currency data, other entities, co-references, relationships between
entities. The latter two items are more difficult. The usual objective of extraction is to
build databases that are more suitable than free text for querying, e.g. using SQL.
- Knowledge Base - files or lists of static information used in natural language
processing, such as gazetteers, part-of-speech word lists, grammar rules, document
structures, SGML tag sets, stemming lists, stop word lists, abbreviation lists and
dictionaries. These items are frequently domain dependent.
- Module - is equivalent to a Computer System Unit (CSU) in the conventional
life-cycle definition. A CSU is an element specified in the design of a CSC that is
separately testable. A parser is an example of a module. Modules are used to build
components. Generally modules are composed of no more than 300 lines of code.
- Multi-lingual - is considered to be multiple languages in one document or multiple
documents in different languages.
- Pattern - is an expression of a specific form that is used for matching text during the
extraction process. TIPSTER has a Pattern Specification Language which describes
how to write rules to control extraction engines.
- Profile - a group of Detection Needs which describe a user's area of interest.
- Query - Translation of a Detection Need results in one or more queries, each of which is
either in a user-understandable form or in a specific format, depending upon the actual
retrieval engine. A query typically produces a list of documents which meet the
Detection Need criteria. (Also see Routing.)
- Retrieval Engine - the component that implements the retrieval code. Retrieval
Engines differ from one another in the particular algorithms used for
retrieval, e.g., text index approach, term weighting methods or document vectors.
- Routing - the directing of a document to more than one user. Typically, each user has
a profile which describes that user's area of interest. A document is tested against all
user profiles to determine where it should be sent. In essence, one document is
tested against multiple queries obtained from the Profiles, whereas Document
Detection tests many documents against one query. See Profile.
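
The contrast drawn in the Routing entry (one document tested against many queries, versus Document Detection testing many documents against one query) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not part of the TIPSTER Architecture: the `Profile` class, the `matches`, `detect` and `route` functions, and the toy rule that a query is a set of keywords which must all appear in the document.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical stand-in for a Profile: a user's keyword queries."""
    user: str
    queries: list[set[str]] = field(default_factory=list)


def matches(query: set[str], text: str) -> bool:
    """Toy matching rule (an assumption): every keyword occurs in the text."""
    words = set(text.lower().split())
    return query <= words


def detect(query: set[str], corpus: dict[str, str]) -> list[str]:
    """Document Detection: many documents are tested against one query."""
    return [doc_id for doc_id, text in corpus.items() if matches(query, text)]


def route(text: str, profiles: list[Profile]) -> list[str]:
    """Routing: one document is tested against the queries of every profile."""
    return [p.user for p in profiles if any(matches(q, text) for q in p.queries)]


corpus = {
    "d1": "board replaces company president",
    "d2": "new retrieval engine released",
}
profiles = [
    Profile("alice", [{"board", "president"}]),
    Profile("bob", [{"retrieval"}, {"extraction"}]),
]

print(detect({"retrieval"}, corpus))  # ['d2']
print(route(corpus["d1"], profiles))  # ['alice']
```

A real Retrieval Engine would replace `matches` with its own algorithm (for example term weighting or document vectors); only the direction of the loop differs between the two operations.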