DN2TCL description

Description:
dn2tcl is an extension of Tcl/Tk 8.0p2 that adds three new commands: parsedn2, querydn2, and getdocument. Parsedn2 accepts a string containing a dn2 and parses it. It returns the number of errors encountered. It can also return the dn2 in a modified form that allows it to be displayed by the DN2Win tcl application. Querydn2 accepts a dn2 string and returns two items. The first is the dn2 returned by the search engine, which may contain modifications of the original dn2. The second is a list of the ids of the documents that match the query. The final command, getdocument, accepts a document id and returns the document.

Running:
One environment variable, ZTIP_COLL_LIST, should be set. It should contain the pathname of the coll_list file. If this variable is not set, the tcl interface should only be run from the directory where the coll_list file is located. Start the tcl shell with the command dn2tcl, from the directory where the executable is located. If you have set the ZTIP_COLL_LIST environment variable, you may cd elsewhere to run tcl code; otherwise you should run your tcl code from the current directory.

Requirements:
You must have the SP-1.3 libraries and the TkTree4.2 libraries. These sources can be found at the same location from which you retrieved the dn2tcl code. The z3950 api library is included with this distribution. The SMART/DN2 library is also required and must be obtained from Chris Buckley. You can contact him at 1-301-947-3740 or at chrisb@balder.sabir.com.

Compilation:
To create dn2tcl, examine the Makefile and change the locations of the library and include file directories. You should then be able to just type 'make' and dn2tcl will be created for you.

Notes:
The coll_list, spec.logicon, and spec.dn2 files need to be in the directory from which you run the query tool.

querytool stuff

This code requires the following:

dn2tcl: Tcl/Tk interpreter which includes the Tree-4.2 extensions and Logicon's dn2 extensions.

To run:
Start dn2tcl and at the prompt type the following command:

    % source setup

Once the setup is complete, use the following command to start up a dn2 editing window:

    % DN2::setup .win

The .win parameter can be anything you like, but it must start with a '.'. This is the name of the window that will be created. That's it.

docman stuff

The Prides Document Manager
14 July 1998

This document describes the Prides Tipster Document Manager. This document manager complies with the Tipster architecture version 1.52 as amended by a set of RFCs, which is in effect TIPSTER 2.2. The document manager was developed by Logicon as part of the Prides Tipster demonstration project. This document gives a brief description of the document manager. The MANIFEST file gives a list of the files included in this archive. If you have any questions about it, contact John Palm at Logicon.
Phone: 703-312-2085
Email: jpalm@logicon.com

1. Background

The document manager is a library implementing the TIPSTER architecture document manager specification. Its intended purpose is to simplify the use of the TIPSTER architecture by supplying a ready-made manager that can be linked with other modules complying with the TIPSTER architecture.

2. Making and Installation

The document manager has been coded in C++ and comes with both a C++ and a C API. Along with the document manager library, a simple document viewer and a test program are included. The viewer, CollectionViewer, requires Motif to compile and run. The TestCollection program has a simple text-based menu interface. The source code for the library and viewer is provided. Two compiled versions of the library and viewer are also supplied. These are present in the suncc and egcs directories.
As is likely evident from the names of the directories, the suncc directory contains the code compiled by the SparcWorks C++ 4.2 compiler, CC. The other directory, egcs, contains the code compiled by the EGCS Version 1.0.3a C++ compiler, c++. Providing the two versions was necessary because the two compilers perform different types of name mangling. The Makefiles used for these two versions are present as Makefile.egcs and Makefile.suncc. It should be an easy task to edit either one of these for use with other compilers. Please see the comments included in the makefiles.

To perform a clean compile of the entire package, execute the following command:

    make -f makefile_name clean

where makefile_name is the name of the makefile to be used. This will remove all evidence of a previous compilation. Next, execute the command:

    make -f makefile_name

This will compile all of the source code and create the library, libdocman.a, the viewer, CollectionViewer, and TestCollection, a simple test program.

The viewer can be installed anywhere. When it is run, it checks the value of the environment variable TIP_DATA_DIRECTORY. This should be set to the location of your document collections that are in document manager format. If this variable is not set, the viewer defaults to looking for a directory called data in the current directory. If this is not found, the program will exit.

The test program also uses the TIP_DATA_DIRECTORY environment variable. The program allows you to view a collection and the documents within it, and also to create new collections and add documents to them.

The library can be installed anywhere as well. The only caveat is that when compiling and linking programs that will use the document manager, care must be taken to inform the compiler and linker where to find the library and the include files.
For compilation, the location of the include files is the only thing that must be specified, typically with an option like '-I/usr/local/docman2.5'. For linking, two items must be specified. The first is the directory in which libdocman.a can be found, and the second is the directive to actually use the library. The first is typically specified as '-L/usr/local/docman2.5' and the second as '-ldocman'. Note that the directory names after the -I and the -L should be the locations where the include files and the library can be found, respectively.

There should be no need to alter the environment variable LD_LIBRARY_PATH, since the viewer is linked statically with the docman library. If a runtime error is encountered that mentions the inability to find libXm.so, then the location of the Motif library will have to be added to the LD_LIBRARY_PATH variable.

3. Files created and used by the document manager.

The document manager creates a number of files for storage of collections and their documents. These files are created in the directory identified by the environment variable "TIP_DATA_DIRECTORY". If that variable is not set, the document manager uses the directory "./data". For each persistent collection, the document manager creates two types of files in the TIP_DATA_DIRECTORY:

COLLECTION.N
    For each volume in the Collection, the document manager creates a file COLLECTION.N, where COLLECTION is the name of the Collection and N is the number of the volume.

COLLECTION.idx
    For each Collection, the document manager creates the file COLLECTION.idx, which contains an index from external IDs to their containing Documents. COLLECTION is replaced by the name of the collection.

4. Include file conventions.

The source code contains all of the include files for the document manager. There is an include file for each class. In each case, the name of the include file is:

    cmcCLASSNAME.h

where CLASSNAME is the name of the class.
(cmc refers to the PRIDES CSCI from which the Document Manager originated.) Most of these classes will be recognizable from the Tipster architecture. Several are unique to the PRIDES document manager: they are helper classes used for the document manager implementation. In addition, the file "tipster.h" is the include file for the C binding of the document manager. It is a link to the cmcCBinding.h file in the directory.

Within a class description, we have followed a convention for specifying when ownership of an object is passed. When ownership of an object is passed, the receiving object takes on the responsibility of freeing the object, which is expected to have been allocated on the heap via the "new" operator. Ownership can be passed in one of two ways: by passing a parameter to a function, in which case the object on which the function is run takes ownership of the passed object, or by the return value of a function, in which case the caller of the function takes ownership of the returned object.

In this document manager, if items are passed or returned as non-const pointers, ownership of the item is transferred. If items are passed or returned as references or as pointers to const objects, ownership is not transferred. (There may be exceptions to this rule: they are noted in comments in the include files.)

When a container object (e.g., a Collection, Set, or AnnotationSet) is passed an object for storage, the container object generally makes its own copy of the object. The only time this is not true is when the container is passed ownership of the object: in that case, the container keeps the original copy of the item. Generally, when items are returned by containers but ownership of the item is not returned, the function returns a reference to the item as kept in the container. Thus, passing items into containers generally uses copy semantics, while returning items from containers generally uses reference semantics.

5. A note on sets.

The document manager implements Tipster Sets as a template, Set. This supports all of the functions of Sets as defined in the Tipster specification. The include file for this is cmcSet.h. In addition, the file cmcSet.cc is included to allow applications to use the set template.

6. Changes from the Tipster architecture.

While we stayed as true to the architecture as we could, we did have to stray in a number of areas. One is the use of copy/reference semantics (which is not specified in the current version of the architecture). The following is a list of other areas in which we strayed from the architecture. This list is limited to areas where we removed functionality or changed the way things were done: there are a number of areas where we have added functions, and these are documented in the include files.

   1. The Create...() functions have all been changed into constructors. In some cases, where there are multiple ways to create an object, we have created multiple constructors. Most notably, any subclass of PersistentObject has a constructor to create an object of that subclass as well as a constructor corresponding to the Open() function.

   2. In all cases, Tipster functions are member functions of the corresponding Tipster classes. Therefore, the parameter identifying the object on which the function is to be performed is not needed: C++ calling conventions identify this object.

   3. The Document operations related to SGML (WriteSGML() and ReadSGML()) are not yet implemented.

   4. The operations to annotate Documents and Collections are not yet implemented, pending some clarification in the architecture of how these should be done.

   5. AttributeValue is split into a class hierarchy. Each type of value supported by the Tipster architecture is now a subclass of the abstract base class AttributeValue. See those class descriptions for more details.

   6. Collection is split into a class hierarchy. Collection is an abstract base class.
      Its primary children are VirtualCollection, which contains Collections constructed via the CreateVirtualCollection() constructor, and PersistentCollection, which contains Collections constructed via the CreateCollection() constructor.

   7. CopyDocument() and CopyBareDocument() result in a new Document with a new Document ID. This is in accord with an RFC submitted by Logicon.

   8. Document IDs are no longer unique across the entire system. Instead, Document IDs are unique only within a Collection.

   9. The CreateCollection() constructor has two additional optional parameters. The first of these, VolumeCount, identifies the number of file volumes that should be used to store the contents of the Collection. For all but the largest Collections, this should be set to 1 (which is the default value). The second parameter is BlockSize. This is the size of a disk block used to store data. Too large a block will lead to wasted storage: too small a block will lead to excessive fragmentation. For Collections that are primarily meant as lists of Documents stored elsewhere, this should be 256. (This is the default.) For Collections containing Document texts, this should be the block size of the local machine.

   10. The document manager caches references to PersistentCollections. This means that multiple calls to open the same PersistentCollection will produce references to the same PersistentCollection object. This will generally be transparent to the application: however, if a change is made to one instance of a PersistentCollection within the same application, that change is propagated to all in-memory copies of the same PersistentCollection.

   11. The document manager contains classes for DetectionNeeds, DetectionQueries, and DetectionNeedCollections. These are somewhat altered from the Tipster versions to make them independent of the search engine used.

7. Error handling.

This document manager implements error handling using C++ exceptions.
The file cmcError.h contains the exception classes that may be thrown by the document manager. Currently, these classes are not translated by the C binding to a more C-friendly format.

8. GCC compilation.

Use the Makefile.gcc and gcc 2.7.2.3. Because GCC doesn't instantiate templates without some minor work, you will get unresolved symbol errors at link time if you try to create new instantiations of the Set template without adding the template instantiation to the bottom of cmcSet.cc. Enjoy! Mark Davis (madavis@crl.nmsu.edu).

9. Thanks

To Mark Davis for modifying the code and creating a Makefile for gcc 2.7.2.3.

dn2.dtd stuff

Detection Need 2
----------------

A Detection Need Type 2 (DN2) is an SGML query used for querying a document collection. The Document Type Definition (DTD) is the specification of the grammar of a DN2. The DTD allows an SGML parser/validator to read a DN2 and determine whether it is syntactically correct.

The basic form of a DN2 query is a start tag followed by the various terms that are to be searched for. The DN2 concludes with an end tag. The terms can be straight text surrounded by start and end tags, or they can be more complex boolean expressions of text terms. The FULL-TERM tag allows the user to combine a text term with some direction about which field to search for the text. For example, in one query a user could locate all documents that contain "Mark Twain", but using a FULL-TERM tag the user could locate all documents that have "Mark Twain" in the author field.

Note that it is up to the search engine to document and enable the full capabilities of the DN2. The DTD is simply the specification of the format of the DN2. It does not say anything about the semantics of the query.

Documentation on the Query Tool
-------------------------------

The purpose of the query tool is to simplify the creation and evaluation of a Detection Need Type 2 (DN2).
It allows the user to graphically create, view, and edit DN2s using a portable Tcl/Tk interface. It also allows the user to send the DN2 to a search engine for evaluation. Since the tool uses the TIPSTER interface for communicating with the search engine, it should be easy to relink it with another search engine. Currently, it is linked with the Z39.50 search engine.

The query tool is made up of three parts: James Clark's SP SGML parser, Allan Brighton's tree extension to Tcl/Tk, and the query tool code, which builds on those two pieces. It is written in a combination of C++ and Tcl/Tk scripting. An extension of the Tcl/Tk interpreter was developed which contains two additional scripting commands, parsedn2 and senddn2. The parsedn2 command accepts an SGML DN2 string and determines whether it conforms to the DN2 DTD. The senddn2 command accepts an SGML DN2 string, sends it to the search engine, and returns the matches. James Clark's SP SGML parser is used to validate the DN2 and to extract all of the data.

A DN2 is an SGML representation of a query to be performed upon a database of documents. SGML was chosen as the base representation due to its portability. The query tool allows one to deal graphically with the DN2 rather than having to type the SGML representation directly. Rather than create an entire parser from scratch, James Clark's SP library was used to parse the DN2 SGML instance.

The query tool was written using a combination of C++ and Tcl/Tk. The parsing and validation of SGML documents is performed by James Clark's SP Version 1.3 SGML Validator. This package ensures that an SGML document meets the specifications of its Document Type Declaration (DTD). An API included with it allows one to create applications that can receive SGML input.

The Tree-4.2 package from Allan Brighton provides tree drawing capability to Tcl/Tk. It allows the creation of a tree to which one can add and remove nodes.
The tree package takes care of correctly drawing the tree once any edits have been made.

The query tool code uses the SP parser/validator to extract all of the data from the DN2 and places it in an internal data structure. From this

OUTLINE

1. Description of the program
   a. query tool
   b. SP
   c. Tree-4.2

Dataflow

Starting the dn2tcl program invokes a Tcl/Tk shell that has been extended with the parsedn2 and senddn2 commands. Sourcing querytool.tcl creates the query tool window and opens a new dn2.

The user selects a node using the left mouse button. Once a node is selected, the following operations can be performed on it by choosing one of the options under the Node menu:

Add Node: This adds a new child node to the selected node.
Remove Node: This deletes the selected node and all of its children.
Prune Node: This deletes the tree beneath the selected node, but keeps the selected node.
Edit Attributes: This brings up a window displaying the attributes of the selected node. The values of the attributes can be altered and new attributes can be added.

A selected node can be moved by putting the mouse cursor on it and dragging it using the right mouse button. The algorithm for moving the node works as follows: determine whether any nodes overlap the rectangular box. If there are any, find the closest one and reattach the selected node there, if possible. If there are no overlapping nodes, search for the closest node to the left of the new position.

Once a DN2 has been crafted graphically, the SGML DN2 can be viewed by selecting the DN2->View menu choice. The resulting text DN2 can be parsed for correctness, edited, and sent to the search engine.

Core Data Structures: C++

To develop an application using the SP parser API, the programmer creates an object that inherits from the SGMLApplication object. This object has virtual methods for a series of events that are parts of an SGML document.
These include such events as startElement, endElement, and data. The programmer writes implementations of these event handlers as his application. In the query tool, a dn2app object was created which inherits from SGMLApplication. One of its members is a dn2doc object into which the event handlers insert their data. It also has two methods for extracting the data from the parsed DN2. One returns the data as SGML text, and the other returns the data in an internal format that is used by the Tcl code to create its tree structure.

The dn2doc object contains a tree for storage of the DN2 information. Each node maintains a vector of pointers to its child nodes. The syntax of SGML requires that all of a parent's child nodes be ended before the parent node is ended. For example, the following is illegal SGML:

    <B>BOLD <I>BOLD ITALIC</B> italic</I>

It must be written like this:

    <B>BOLD <I>BOLD ITALIC</I></B><I> italic</I>

This makes it straightforward to store the structure of the document. As each start element event occurs, a new node is created, placed in the list of the current node's children, and the current node pointer is changed to point to the new node. As each end element event occurs, the current node pointer moves to the current node's parent. Contained within each node are the type of the element, a vector of the element's attribute names and values, and the data that appears between the start and end tags.

These objects are used by the new Tcl command parsedn2. This accepts a string containing an SGML DN2 document and returns successfully if it conforms to the DN2 DTD. Optionally, the string can be returned in either SGML format or in the internal Tcl format. The internal Tcl format is as follows:

    node          := {nodetype attributelist nodelist}
    attributelist := { { attribute_name attribute_value }* }
    nodelist      := { node* }

Basically, it is a list of a nodetype, a list of attribute name and value pairs, and a list of subnodes.
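
The event-driven tree construction and the internal Tcl format described above can be sketched in C++ roughly as follows. This is a minimal illustration, not the dn2doc code: the names Node, TreeBuilder, and toTclList are hypothetical, the SP event classes are replaced by plain strings so the example stands alone, and attribute values are simply wrapped in braces (values with unbalanced braces are not handled). Because the grammar above does not show where element text goes, the sketch leaves the node data out of the serialized list.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type; names are illustrative, not taken from dn2doc.
struct Node {
    std::string type;                                         // element type
    std::vector<std::pair<std::string, std::string>> attrs;  // name/value pairs
    std::string data;                                         // text between the tags
    Node* parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;
};

// Builds the tree from start/data/end events by tracking the current node.
class TreeBuilder {
public:
    void startElement(const std::string& type,
                      std::vector<std::pair<std::string, std::string>> attrs = {}) {
        auto node = std::make_unique<Node>();
        node->type = type;
        node->attrs = std::move(attrs);
        node->parent = current_;
        Node* raw = node.get();
        if (current_) {
            current_->children.push_back(std::move(node));
        } else {
            assert(!root_ && "only one root element is allowed");
            root_ = std::move(node);
        }
        current_ = raw;  // descend into the new node
    }

    void data(const std::string& text) {
        assert(current_ && "data outside of any element");
        current_->data += text;
    }

    void endElement() {
        assert(current_ && "end tag without a matching start tag");
        current_ = current_->parent;  // climb back to the parent
    }

    const Node* root() const { return root_.get(); }

private:
    std::unique_ptr<Node> root_;
    Node* current_ = nullptr;
};

// Serializes a node as {nodetype attributelist nodelist}.
std::string toTclList(const Node& n) {
    std::string out = "{" + n.type + " {";
    for (std::size_t i = 0; i < n.attrs.size(); ++i) {
        if (i) out += ' ';
        out += "{" + n.attrs[i].first + " {" + n.attrs[i].second + "}}";
    }
    out += "} {";
    for (std::size_t i = 0; i < n.children.size(); ++i) {
        if (i) out += ' ';
        out += toTclList(*n.children[i]);
    }
    out += "}}";
    return out;
}
```

Because SGML forbids overlapping elements, a single current-node pointer is enough to rebuild the hierarchy: every end event closes the innermost open element. Feeding the builder a DN2 start event, a TEXT start event, some data, and two end events would yield a two-level tree whose serialized form nests the TEXT node inside the DN2 node's nodelist.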
The command works by taking the input string, writing it to a temporary file, and then calling the parse routine with the name of the file. The parse routine is able to read data from either the standard input or a file. Once the Tcl routine receives the internally formatted list, it traverses the list, populating its own data structure with the data.

The Tcl code is broken into object-like entities. Most of the objects are focused on the windows that are displayed. The exception is the tree data structure, which is encapsulated in its own object. The main window is constructed and displayed by the DN2.tcl code. From this, the other windows can be invoked. These are the view_attribs window, the add_attribs window, the add_sub_node window, the dn2win window, and the datastructure object.

The view_attribs window displays the attributes of the selected node and allows editing of those attributes. Whenever the selected node changes, this window is updated with the attributes of the new node. The add_attribs window is invoked from the view_attribs window and displays the available attributes that can be added to the selected node.

The add_sub_node window displays the allowable subnodes for the selected node. The allowable subnodes are those that are syntactically allowed for the selected node. The window does not do any semantic checking. For example, it is correct syntax for a TEXT node to follow the DN2 node, but it is semantically incorrect to have two of them. The add_sub_node window will always show that a TEXT node can be added when the DN2 node is selected, regardless of whether a TEXT node already exists.

The dn2win window displays the SGML DN2 that the displayed tree represents. Edits to the SGML DN2 can be made and will be reflected back on the tree display once the apply button is pressed. When the parse button is pressed, the current DN2 text is parsed and a message is returned reflecting the success of the parse.
The datastructure object is implemented as a general tree. Each node has a pointer to its parent, its first child, and its right sibling. The data stored with each node is its type, a list of attribute name and value pairs, and a string containing the text of the node. The text of the node is all of the data between the start tag and end tag that is not another tag. A DN2 can be saved and can be read in.

The query tool was created on a Solaris 2.4 Sparc 10 using Tcl/Tk 8.0p2, Allan Brighton's Tree 4.2, James Clark's SP Version 1.3, and gnu c++.

This file documents the implementation of the Annotator Job Manager, by Logicon, Inc. 1998.

The system is implemented in Perl, v. 5.004. Its principal components are CGI programs; these use the CGI.pm module (tested with v. 2.42; earlier versions are known not to work). In addition, several programs are provided for use by an administrator on the local machine.

CGI programs:
    jobupload.cgi
    jobcheck.cgi
    jobcancel.cgi

Command-line programs:
    jobcheck
    jobcancel

Background programs:
    fulfill_job

Each of these programs uses a module file in the same directory, named JobCommon.pm. This module implements a number of functions for common use by the programs. The function of each program is briefly described below.

jobupload.cgi

This program allows the user to initiate a job by uploading an archive file (either .tar, .tar.Z, or .tar.gz). Creating a job consists of saving the archive file in an upload directory (named 'incoming'), generating a unique job id, and creating a file (in the 'jobs' directory) which will track the state of the job. The job id is derived by selecting two words at random from a words file; presently this file is named 'words.6', and contains all the lower-case six-letter words from the /usr/dict/words file. The two words chosen are joined with a hyphen. This string is used as the unique job id. It is presented to the user as a convenient way to identify her job when requesting actions or information on the job.
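
The job id derivation described above can be sketched as follows. This is an illustration in C++ rather than the Perl of the actual system, and the function name makeJobId is hypothetical. The word list is passed in as a vector instead of being read from 'words.6', and the sketch makes no attempt to detect a collision with an existing job id, since the text does not say how the real program handles that case.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Picks two words at random from the list and joins them with a hyphen.
// The caller supplies the word list and the random number generator.
std::string makeJobId(const std::vector<std::string>& words, std::mt19937& rng) {
    assert(!words.empty());
    std::uniform_int_distribution<std::size_t> pick(0, words.size() - 1);
    const std::string& first = words[pick(rng)];
    const std::string& second = words[pick(rng)];
    return first + "-" + second;
}
```

With a list of six-letter words, every id produced this way is 13 characters long with the hyphen in the seventh position, which keeps the id safe to use directly as a file name.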
The job id is used as the file name for the job status file and for various other temporary files generated in the course of processing the job.

jobcheck.cgi

This program allows the user to check on the status of any job for which she knows the job id. It also allows the user to request further annotation actions on the job; each possible action is represented by a button. If a new annotator is added to the system, this file must be modified to recognize it. Additional actions are queued up, so that no action is commenced until the previously requested actions have all completed. New jobs are initialized with no actions pending. When all the actions requested for a job have been completed, the archive file containing the final results is made available for the user to download, via a link on this page.

jobcancel.cgi

This program allows the user to cancel any job for which she knows the job id. When a job is cancelled, all the files associated with it are deleted from the file system.

jobcheck

This program can be run from the command line, in the main directory of the Annotator Job Manager. If a job id is given on the command line, the job's status is displayed. If no arguments are given, a list of currently active jobs is displayed.

jobcancel

This program can be run from the command line, in the main directory of the Annotator Job Manager. If a job id is given on the command line, the job is cancelled, just as with jobcancel.cgi. If no arguments are given, a list of currently active jobs is displayed.

fulfill_job

This program is responsible for ensuring that the requested actions get performed. Typically this consists of extracting the job's files from its archive, launching an external annotator, or sending email to an administrator. If a new annotator is added to the system, this file must be modified to recognize it. Whenever a job has one or more unfinished actions, it must have a background process running, which will perform those pending actions.
Whenever an action is added to a job, the adding process checks the job for the existence of the fulfiller; if one is not running, one is started. The fulfiller removes the next action from the job's queue and performs it; when it is done, it marks that action as complete in the job's status file. The fulfiller continues this cycle until no more uncompleted actions remain for the job; then it exits.

In the current implementation, the fulfiller processes are started using the Unix 'at' command. The 'at' command will run fulfill_job with the job id as an argument.

The system as delivered is configured with two annotators, which perform a "null" operation on the document collection (that is, they do nothing). "Null-Short" completes in very little time, and "Null-Long" completes in several minutes. These allow the user to test the pipeline of the system.
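
The fulfiller cycle described above can be sketched as a simple queue-draining loop. This is only an illustration in C++, not the Perl implementation: the Job struct is a hypothetical in-memory stand-in for the job status file, the names fulfill and addAction are invented for the example, and the fulfiller runs synchronously here, whereas the real system starts it as a separate background process through the 'at' command.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a job's status file.
struct Job {
    std::string id;
    std::deque<std::string> pending;     // actions queued, oldest first
    std::vector<std::string> completed;  // actions marked complete
    bool fulfillerRunning = false;
};

// Performs queued actions one at a time until none remain, then returns.
void fulfill(Job& job, const std::function<void(const std::string&)>& perform) {
    job.fulfillerRunning = true;
    while (!job.pending.empty()) {
        std::string action = job.pending.front();
        job.pending.pop_front();
        perform(action);                  // e.g. launch an annotator
        job.completed.push_back(action);  // mark the action as complete
    }
    job.fulfillerRunning = false;
}

// Queues an action; returns true if the caller should start a fulfiller.
bool addAction(Job& job, const std::string& action) {
    job.pending.push_back(action);
    return !job.fulfillerRunning;
}
```

Queuing "Null-Short" and then "Null-Long" on a job and calling fulfill once would perform them in that order and leave the pending queue empty. Keeping a single fulfiller per job is what guarantees that no action starts before the previously requested ones have finished.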