TIPSTER Architecture Change Request

Title: Standard Annotations
Page: 1 of 6
Date Prepared: 17 February 1998
CR No.: 9
Priority: Routine
Date Logged:
Document Affected: Design Document
Version: 2.3
Paragraphs Affected: Section 6.0

References:
    Guidelines for Electronic Text Encoding & Interchange (TEI), http://etext.virginia.edu/TEI.html
    ISO 639, http://etext.lib.virginia.edu/tei/iso639.html
    ISO 3166, http://crl.nmsu.edu/Research/Projects/tipster/annotation/iso3166.html
    ISO 8601, http://crl.nmsu.edu/Research/Projects/tipster/annotation/iso8601.html
    Corpus Encoding Standard (CES), http://www.cs.vassar.edu/CES/
    Z39.50, http://lcweb.loc.gov/z3950/agency/agency.html

Change Required:
    The TIPSTER Architecture describes the generic syntax of annotations for marking a document; however, it does not describe basic, standard annotations which can be commonly used by applications following the TIPSTER Architecture. This RFC will correct part of this problem by defining a basic set of annotations. Additional annotations may be defined later.

Specific Recommendations:
    Modify Architecture Design document pages by replacing section 6.0 TYPES OF DOCUMENT ANNOTATIONS as provided.

Reason for the Proposed Change:
    The use of standard annotations will facilitate interoperability of components and provide more robust applications at lower cost.
----------------------------------------------------------------------------------------------------

6.0 TYPES OF DOCUMENT ANNOTATIONS

References:
    (1) Guidelines for Electronic Text Encoding & Interchange (TEI), http://etext.virginia.edu/TEI.html
    (2) ISO 639, http://etext.lib.virginia.edu/tei/iso639.html
    (3) ISO 3166, http://crl.nmsu.edu/Research/Projects/tipster/annotation/iso3166.html
    (4) ISO 8601, http://crl.nmsu.edu/Research/Projects/tipster/annotation/iso8601.html
    (5) Corpus Encoding Standard (CES), http://www.cs.vassar.edu/CES/
    (6) Z39.50, http://lcweb.loc.gov/z3950/agency/agency.html

The TIPSTER Architecture defines some standard annotations and associated attributes. If these names are used, they must be used as defined. Other annotations and attributes may be defined for a TIPSTER application. If well-defined annotations are created and are expected to see extended use, they should be submitted, via an RFC, for inclusion in the Architecture.

The annotations and attributes described herein are based upon the Corpus Encoding Standard (CES). The notation used here differs from that of CES to avoid confusion: CES embeds the tags in the text, whereas TIPSTER carries each annotation separately from the text and associates it with the text by its span.

6.1 Annotations for Major Document Elements

The following sections describe annotations at the minimum encoding level required for Level 1 CES conformance, which requires markup for gross document structure (major text divisions) down to the level of the paragraph. The annotation types described below are: Doc, Header, Text, Division and P (Paragraph).

annDoc
    CES requires markup to encode a single document. However, this annotation is redundant with the data represented by a TIPSTER document object. The span of the annotation will be the entire span of the document.
    In addition to the global attributes for annotations, only one attribute is listed:

    attDocref
        The attribute value is a Document Reference that contains a pointer to the TIPSTER document object. The common document attributes for this document can be obtained through this reference.

    The corresponding CES tag is

annHeader
    The header is information about the document that typically appears at the beginning of the document. This information should be contained in the span of the Header annotation. It includes elements such as the document source, date, author, and title. Many of these sub-elements will also be encoded as annotations. See Section 6.5.

    The corresponding CES tag is

annText
    The Text annotation contains the span of text that makes up the contents of a document. It includes everything except the header information. No attributes other than the global attributes are defined for this annotation type.

    The corresponding CES tag is

annDivision
    This is an optional element in the CES and marks any subdivision of a written text, e.g. chapter, section, sub-section, article. For TIPSTER, the Division annotation has one additional attribute beyond the global attributes:

    attType
        The attribute value is a String categorizing the division in some respect. The categories are initially: PART, SECTION and CHAPTER.

    The corresponding CES tag is
annP
    A paragraph in a written text.

    The corresponding CES tag is
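
The stand-off model of Section 6.0 (annotations kept apart from the text and tied to it by span) and the annotation types of Section 6.1 can be illustrated with a minimal sketch. The record layout, the function names, and the "doc-1" reference value below are hypothetical and are not part of the TIPSTER Architecture; spans are assumed to be character offsets with an exclusive end.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-off annotation record (not the TIPSTER API)."""
    ann_id: int
    ann_type: str
    start: int   # offset of the first character covered
    end: int     # offset just past the last character covered (assumption)
    attributes: dict = field(default_factory=dict)


def covered_text(text, ann):
    """Return the text an annotation points at; the text itself is untouched."""
    return text[ann.start:ann.end]


def major_elements(header, paragraphs):
    """Join a header and paragraphs with newlines and annotate the result."""
    text = "\n".join([header] + paragraphs)
    body_start = min(len(header) + 1, len(text))
    anns = [
        # "doc-1" stands in for a Document Reference; the value is made up.
        Annotation(1, "annDoc", 0, len(text), {"attDocref": "doc-1"}),
        Annotation(2, "annHeader", 0, len(header)),
        Annotation(3, "annText", body_start, len(text)),
    ]
    pos = body_start
    for number, para in enumerate(paragraphs, start=1):
        anns.append(Annotation(len(anns) + 1, "annP", pos, pos + len(para),
                               {"attN": number}))
        pos += len(para) + 1  # skip the newline separator
    return text, anns


text, anns = major_elements("Title line", ["First para.", "Second para."])
for ann in anns:
    print(ann.ann_id, ann.ann_type, ann.start, ann.end)
```

Because the annotations only hold offsets, several layers of markup (Doc, Text, P) can cover the same characters without altering the original text.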

6.2 Annotations for Document Header Elements

These elements are standard annotations for text that normally occurs in the header section of the document. This is meta-information about the document that may be, but is not always, contained in the text of the original document. If the information is NOT contained in the document, the span values are set to -1. This is a simplified version of the information required by the CES header. The CES header defines a more highly structured set of elements. The annotations described here are intended to fill those elements in the CES structure without modifying the structure of the original electronic text.

annDocno
    A document identifier string in the text. This could be an ISBN number or some other string used to identify the document.

annTitle
    The title of the work, including subtitles. Used for documents that contain the title directly in the text.

annAuthor
    Text referring to the author or authors of the document.

annPubDate
    A calendar date for the document.

    attISO8601
        Contains the ISO 8601 normalized form of the date.

annPublisher
    A proper name of a person, place or institution.

annPubPlace
    The place of publication for the document.

6.3 Structural Annotations

The following annotations are suggested for marking document structure. These are primarily based on a set defined for the CES standard. Please refer to the CES documentation for further explanation.

Annotations for paragraph level elements:

annP
    A paragraph in a written text (mentioned in Section 6.1, shown here for completeness).

annSp
    Contains spoken text.

annCaption
    A heading or title attached to a picture or diagram.

annQuote
    A quotation from some author other than that of the surrounding text.

annList
    A collection of distinct items.

annFigure
    The location of graphic data.

annBibl
    A bibliographic citation.

annNote
    Notes that are part of the original data (e.g., footnotes).

annTable
    Contains text displayed in tabular form.
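
The Section 6.2 convention of setting span values to -1 when header information is not present in the document text, together with the attISO8601 attribute of annPubDate, can be sketched as follows. The helper name, the dictionary layout, the sample text, and the "Example Press" value are assumptions made for illustration only.

```python
from datetime import date


def header_annotation(ann_type, value, text, attributes=None):
    """Build a hypothetical header annotation record for `value`.

    If `value` occurs in `text`, the span covers its first occurrence;
    otherwise both span values are -1, following Section 6.2.
    """
    start = text.find(value) if value else -1
    end = start + len(value) if start >= 0 else -1
    return {"type": ann_type, "start": start, "end": end,
            "attributes": dict(attributes or {})}


# Illustrative document text (an assumption, loosely based on this form).
text = "Standard Annotations\nDate Prepared: 17 February 1998\n"

title = header_annotation("annTitle", "Standard Annotations", text)
pub_date = header_annotation(
    "annPubDate", "17 February 1998", text,
    # ISO 8601 calendar date form, YYYY-MM-DD
    {"attISO8601": date(1998, 2, 17).isoformat()},
)
# The publisher is known from elsewhere but does not appear in the text.
publisher = header_annotation("annPublisher", "Example Press", text)

for ann in (title, pub_date, publisher):
    print(ann)
```

A consumer can then treat a start value of -1 as "no anchoring text" and skip any attempt to slice the document for that annotation.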
6.4 Sub-paragraph Annotations

The following annotations are suggested for marking document structure below the paragraph level. These are primarily based on a set defined for the CES standard. Please refer to the CES documentation for further explanation. The Named Entity elements defined for MUC-6 are shown with their equivalent CES elements.

annAbbr
    Contains an abbreviation. One attribute can be specified.

    attExpan
        Contains the expansion of the abbreviation.

annDate
    Contains date data. The MUC equivalent for TIMEX of type DATE.

    attISO8601
        Contains the ISO 8601 normalized form of the date.

annMeasure
    Text containing a quantity of some type. This is the MUC equivalent for NUMEX.

    attType
        The type attribute should take one of the following values: WEIGHT, LENGTH, COUNT, AREA, VOLUME, CURRENCY, TEMPERATURE, PERCENT.

annName
    A proper name. The MUC equivalent for ENAMEX.

    attType
        The type of proper noun. The values are:
            PERSON (MUC equivalent PERSON_NAME)
            PLACE (MUC equivalent LOCATION)
            ORG (MUC equivalent ORGANIZATION_NAME)

annNumber
    Contains a number.

    attNormal
        The normalized form of the number; the normalization function is application dependent.

annTerm
    A technical word or phrase.

annTime
    The time of day in any form. The MUC equivalent for TIMEX of type TIME.

    attISO8601
        Contains the ISO 8601 normalized form of the time.

annForeign
    A section of text that is in a different language than the surrounding text. The global attLang attribute is used to specify the language.

6.5 Common Document Attributes

Five common attributes are defined for TIPSTER Documents. Note: The TIPSTER architecture defines an External ID that can be used for document identification.

In many cases, the attributes defined below will have values taken from the text of the original document. An annotation reference containing the value shall be used as the value of the attribute when it is available.

attTitle
    See Section 6.2 for the definition. Repeated here for completeness.
attDocno
    See Section 6.2 for the definition. Repeated here for completeness. In some TREC and MUC applications this can be equivalent to the External ID document property.

attLang
    The attLang attribute is used to identify a primary language value for the document. The value should be either a two-letter code from ISO 639 or a three-letter code from ISO 639-2. Further specification for country is possible with an extension taken from ISO 3166 (e.g., en.us for English in the United States, en.gb for English in the United Kingdom). A similar alternative that can include language and script variants has been defined for HTML 4.0. See RFC1766 for ftp access.

attCharset
    A string specifying the character set used to encode the contents of the document, e.g., ISO-8859-1.

6.6 Common Annotation Attributes

Five global attributes are defined for TIPSTER Annotations. Two of these attributes, attLang and attCharset, are also used as global document attributes. The document-level attribute sets a value for the whole document; the annotation-level attribute can be used to override that value. Note that all annotations also have an ID property that is unique within a document.

attAnnotator [required]
    A string that identifies the name of the process or user that created the annotation.

attComment
    A string that can be used to record any general information about the annotation.

attN
    A number or label that can be used by applications to order a sequence of annotations, e.g., paragraph number.

attLang
    See Section 6.5 for the definition. Repeated here for completeness.

attCharset
    See Section 6.5 for the definition. Repeated here for completeness.

attAlt
    The attAlt attribute is a sequence of alternate annotations: either annotations of the same type that have different sub-spans (defining the MUC-6 ALT attribute), or annotations that have the same span but different annotation types or attributes (defining alternate TYPE attribute values).
    These are used for generating answer key templates or markup.

attStatus
    The attStatus attribute directly maps to the MUC-6 STATUS attribute used for optional markup. The value of this attribute can be either null or the STRING “OPT”.

6.7 Example of Annotations and Attributes
0         10        20        30        40
This table presents an example of applying Ann. to a piece of
        50        60        70        80        90        100
text. The example is from the number 6.7 to the end of the
      110       120       130       140       150       160
table caption.
       170    177

    ID  Type        Span Start  Span End  Attributes
    1   annTitle    0           41
    2   annText     0           310
    3   annP        42          177
    4   annTable    178         299
    5   annCaption  300         310
    6   annNum      144         146
    7   annAbbr     86          89        attExpan=Annotation

Annotations
300       310
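
As a rough check of how the spans in the table above relate to the example text, the sketch below rebuilds the heading and paragraph as a single string and slices it with the annTitle and annP spans. It assumes that one newline character separates consecutive lines and that the Span End value is exclusive; neither assumption is stated in the source. The sub-paragraph rows (annNum, annAbbr) depend on the exact spacing of the original text, which is not recoverable here, so they are left out.

```python
HEADING = "6.7 Example of Annotations and Attributes"
PARAGRAPH_LINES = [
    "This table presents an example of applying Ann. to a piece of",
    "text. The example is from the number 6.7 to the end of the",
    "table caption.",
]

# Assumption: a single newline separates consecutive lines of the example.
TEXT = "\n".join([HEADING] + PARAGRAPH_LINES)

# Rows taken from the table: (ID, Type, Span Start, Span End).
ROWS = [
    (1, "annTitle", 0, 41),
    (3, "annP", 42, 177),
]


def covered(text, start, end):
    """Slice the text with an exclusive end offset (an assumption)."""
    return text[start:end]


for ann_id, ann_type, start, end in ROWS:
    print(ann_id, ann_type, repr(covered(TEXT, start, end)))
```

Under these assumptions the annTitle span selects exactly the section heading and the annP span selects the three-line paragraph that follows it.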