TIPSTER Architecture Change Request

Title: Class Document
Page 1 of ?
Date Prepared: 26 February 1998
CR No. 13
Priority: Routine
Date Logged:
Document Affected: Design Document
Version: 2.3
Paragraphs Affected: Section 4.0 and Appendices B & C
References: None
Change Required: Persistent Documents

Specific Recommendations:
Modify Architecture Design document pages by replacing affected sections and appendices with replacement material as provided. There are several ways Persistent Documents can be added to the Architecture, but the easiest is to create an alternate Class Document which is persistent and named.

Reason for the Proposed Change:
Currently, the Architecture does not provide for persistent Documents. Only Collections may be persistent. This forces documents to be handled as part of a Collection. Forcing the Collection concept onto an application that wants to process unrelated, single documents is an unnecessary burden and causes naming problems and other inefficiencies.

----------------------------------------------------------------------

4.1 Documents

The document is the central object class in the TIPSTER architecture. As a unit of information, it serves several basic functions within the architecture:

• it is the repository of information about a text, in the form of attributes and annotations (although annotations will in general refer to portions of documents)
• it is the atomic unit in building collections, but it may also exist independent of collections
• it is the atomic unit of retrieval in detection operations

Two types of Documents are defined: those which are part of one or more Collections (see Section 4.2) and those which exist without a parent Collection. A Document in a Collection has persistence by virtue of being a member of a Collection, and can be accessed only as a member of a Collection.
Each document is given a unique identity by its Id property, which is copied by the CopyDocument and CopyBareDocument operations, and is also copied when a new collection is created by document retrieval operations.

A Document which is not part of a Collection has persistence in its own right and must be named. Operations which require a Collection name, such as Augment(DocumentCollectionIndex, Collection), cannot be used with Documents which are not part of a Collection.

A Document may have a BaseDocument Attribute, which identifies a Document from which this Document was derived. Those TIPSTER operations that use a Document's RawData component should, if the RawData component is nil, use instead the RawData component of the Document's BaseDocument. Such operations include DocumentCollectionIndex.Augment, QueryCollectionIndex.RetrieveQueries, Document.Annotate, Document.WriteSGML, and Collection.AnnotateCollection. If the Document's RawData is nil and there is no BaseDocument, then these operations should ignore the Document.

[Original]
Class Document
Type of AttributedObject

Properties
    Parent: Collection
        the Collection of which this Document is a member
    Id: string
        an internal document identifier, assigned automatically when a new Document is created, which is unique within the Collection in which the Document resides
    ExternalId: string (R, W)
        a document identifier assigned by the application
    RawData: ByteSequence OR nil
        the contents of the document prior to any TIPSTER processing. The byte-sequence may include subsequences representing text in multiple languages, as well as non-text material such as pictures, audio, and tables.
    Annotations: AnnotationSet
        information about portions of the document (information about the document as a whole is stored in Attributes; a Document inherits an Attributes property by virtue of being a type of AttributedObject). Annotations may contain information about a Document related to the current Document.
        The knowledge of the relationship between Documents must be maintained by the application (possibly using Attributes).

Operations
    CreateDocument (Parent: Collection, ExternalId: string, RawData: ByteSequence OR nil, annotations: AnnotationSet, attributes: sequence of Attribute): Document
        creates a new document within the Collection Parent and assigns the document a new unique Id.
    CopyBareDocument (NewParent: Collection, Document): Document
        makes a copy of Document, including only its ExternalId and RawData, assigns a new unique Id to the copy, and places the copy in collection NewParent. The attributes and annotations of the original document are not copied by this operation.
    CopyDocument (NewParent: Collection, Document): Document
        makes a copy of Document, including its ExternalId, RawData, attributes, and annotations, assigns a new unique Id to the copy, and places the copy in collection NewParent.
    Annotate (Document, AnnotatorName: string)
        invokes annotation procedure AnnotatorName on the Document; see Section 5.6.
    WriteSGML (Document, AnnotationSet, AnnotationPrecedence: sequence of string): string
        converts a document together with a set of Annotations into SGML format. AnnotationPrecedence, which is a list of annotation types, is used to resolve conflicts when two annotations cover the same span: the tag corresponding to the annotation type which appears first in the list is written out first. The resulting document is in a "normalized" SGML, with all attributes and end tags explicit. [4]
    ReadSGML (string, Parent: Collection, ExternalId: string): Document
        reads a string marked up with "normalized" SGML, with all attributes and end tags explicit, and generates a Document with the specified ExternalId, no attributes, and an AnnotationSet containing one annotation for each SGML text element marked in the input text.
        If the input violates these constraints (e.g., unmatched start tags) or violates SGML syntax (e.g., unmatched quotation marks within tags), an error will be signaled. [5]

Alternate Class Document – not part of a Collection

Class Document
Type of PersistentObject, AttributedObject

Properties
    Id: string (R, W)
        a document identifier assigned by the application, equivalent to the ExternalId of a non-persistent Document
    RawData: ByteSequence OR nil
        the contents of the document prior to any TIPSTER processing. The byte-sequence may include subsequences representing text in multiple languages, as well as non-text material such as pictures, audio, and tables.
    Annotations: AnnotationSet
        information about portions of the document (information about the document as a whole is stored in Attributes; a Document inherits an Attributes property by virtue of being a type of AttributedObject). Annotations may contain information about a Document related to the current Document. The knowledge of the relationship between Documents must be maintained by the application (possibly using Attributes).

Operations
    CreateDocument (Id: string, RawData: ByteSequence OR nil, annotations: AnnotationSet, attributes: sequence of Attribute): Document
        creates a new document with the assigned Id.
    CopyBareDocument (Document): Document
        makes a copy of Document, including only its RawData, and modifies the Id with a version number. The attributes and annotations of the original document are not copied by this operation.
    CopyDocument (Document): Document
        makes a copy of Document, including its RawData, attributes, and annotations, and modifies the Id with a version number.
    Annotate (Document, AnnotatorName: string)
        invokes annotation procedure AnnotatorName on the Document; see Section 5.6.

NOTE: Operations in other classes which require a Collection name, e.g., Augment(DocumentCollectionIndex, Collection), cannot be used with Documents which are not part of a Collection.
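
The CreateDocument and CopyBareDocument operations of the alternate class can be illustrated with a minimal C sketch. Everything in it is hypothetical and is not the TIPSTER C binding: the SketchDocument type, the sketch_ function names, and in particular the "<Id>.v<N>" suffix are assumptions, since the text above says only that the Id is modified "with a version number" and does not fix a format. Attributes and annotations are left out to keep the sketch short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a named, persistent Document. */
typedef struct {
    char *id;              /* application-assigned Id */
    unsigned char *raw;    /* RawData, or NULL for nil */
    size_t raw_len;        /* number of bytes in raw */
    unsigned copies_made;  /* assumption: used to number copies */
} SketchDocument;

/* Duplicate a byte buffer; returns NULL if src is NULL or allocation fails. */
static unsigned char *dup_bytes(const unsigned char *src, size_t len)
{
    if (src == NULL) return NULL;
    unsigned char *out = malloc(len ? len : 1);
    if (out != NULL && len > 0) memcpy(out, src, len);
    return out;
}

/* Create a Document with the given Id and optional RawData (both copied). */
SketchDocument *sketch_create(const char *id, const unsigned char *raw,
                              size_t raw_len)
{
    assert(id != NULL);  /* a Document outside a Collection must be named */
    SketchDocument *doc = calloc(1, sizeof *doc);
    if (doc == NULL) return NULL;
    doc->id = malloc(strlen(id) + 1);
    if (doc->id == NULL) { free(doc); return NULL; }
    strcpy(doc->id, id);
    doc->raw = dup_bytes(raw, raw_len);
    if (raw != NULL && doc->raw == NULL) {
        free(doc->id);
        free(doc);
        return NULL;
    }
    doc->raw_len = (raw != NULL) ? raw_len : 0;
    return doc;
}

/* Release a Document and the buffers it owns. */
void sketch_free(SketchDocument *doc)
{
    if (doc == NULL) return;
    free(doc->id);
    free(doc->raw);
    free(doc);
}

/* Copy only RawData; the copy's Id is "<source Id>.v<N>" (assumed format). */
SketchDocument *sketch_copy_bare(SketchDocument *src)
{
    assert(src != NULL);
    unsigned version = src->copies_made + 1;
    int needed = snprintf(NULL, 0, "%s.v%u", src->id, version);
    if (needed < 0) return NULL;
    char *new_id = malloc((size_t)needed + 1);
    if (new_id == NULL) return NULL;
    snprintf(new_id, (size_t)needed + 1, "%s.v%u", src->id, version);
    SketchDocument *copy = sketch_create(new_id, src->raw, src->raw_len);
    free(new_id);
    if (copy != NULL) src->copies_made = version;  /* count successful copies */
    return copy;
}
```

Under this assumed scheme, copying a Document whose Id is memo twice yields copies with the Ids memo.v1 and memo.v2, each holding its own copy of the RawData. An implementation could equally keep the version in a separate property; the choice here only serves to make the copy distinguishable by name, as a persistent Document outside a Collection requires.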
As noted earlier, new sources of data will need to be converted by the application into Collections of Documents or Named Persistent Documents before they can be processed within the TIPSTER Architecture. The functions which perform these conversions will necessarily be specific to the type of data source, and hence a TIPSTER application will be required to provide these conversion operations when a new type of data source is to be used.

For Appendix B:

    CreateDocument (Id: string, RawData: ByteSequence OR nil, annotations: AnnotationSet, attributes: sequence of Attribute): Document
    CopyBareDocument (Document): Document
    CopyDocument (Document): Document

For Appendix C:

    tip_Document CreateDocument (tip_string, tip_ByteSequence, tip_AnnotationSet, tip_AttributeSet);
    tip_Document CopyBareDocument (tip_Document);
    tip_Document CopyDocument (tip_Document);

[4] The specification of this operation is subject to revision based on the experience of implementors in using these SGML representations in applications.
[5] The specification of this operation is subject to revision based on the experience of implementors in using these SGML representations in applications.