DN2TCL description:
Description:
dn2tcl is an extension of Tcl/Tk 8.0p2 which adds three new commands:
parsedn2, querydn2, and getdocument.  The parsedn2 command accepts a string
containing a dn2 and parses it.  It returns the number of errors encountered.
It can also return the dn2 in a modified form which allows it to be displayed
by the DN2Win Tcl application.  The querydn2 command accepts a dn2 string and
returns two items.  The first is the dn2 returned by the search engine, which
may contain modifications of the original dn2.  The second is a list of the
ids of the documents that match the query.  The final command, getdocument,
accepts a document id and returns the document.

Running:
There is one environment variable which should be set.  This is the ZTIP_COLL_LIST
variable.  It should contain the pathname to the coll_list file.  If this variable
is not set then the tcl interface should only be run from the directory where the 
coll_list file is located.

Start the Tcl shell with the command dn2tcl.  It should be started from the
directory where the executable is located.  If you have set the ZTIP_COLL_LIST
environment variable then you may cd elsewhere to run Tcl code; otherwise you
should run your Tcl code from the current directory.

Requirements:
You must have the SP-1.3 libraries and the TkTree4.2 libraries.  These sources
can be found at the same location from which you retrieved the dn2tcl code.
The z3950 api library is included with this distribution.  The SMART/DN2
library is also required and must be obtained from Chris Buckley.  You can
contact him at 1-301-947-3740 or at chrisb@balder.sabir.com.

Compilation:
To create dn2tcl, examine the Makefile and change the location of library and
include file directories.  You should then be able to just type 'make' and
dn2tcl will be created for you.

Notes:
The coll_list, spec.logicon, and spec.dn2 files need to be in the directory from
which you run the query tool.
querytool stuff
This code requires the following:
dn2tcl: Tcl/Tk interpreter which includes Tree-4.2 extensions and Logicon's
        dn2 extensions.

To run:
	Start dn2tcl and at the prompt type the following command
% source setup
	Once the setup is complete, use the following command to start up
a dn2 editing window:
% DN2::setup .win
	The .win parameter can be anything that you like but must start
with a '.'.  This is the name of the window that will be created.

That's it.
docman stuff
			The Prides Document Manager
 			      14 July 1998

    This document describes the Prides Tipster Document Manager.  This
document manager complies with the Tipster architecture version 1.52
as amended by a set of RFCs, which is in effect TIPSTER 2.2.  The
document manager was developed by Logicon as part of the Prides
Tipster demonstration project.

    This document gives a brief description of the document manager.
The MANIFEST file gives a list of the files included in this
archive. If you have any questions about it, contact John Palm at
Logicon.

	Phone: 703-312-2085
	Email: jpalm@logicon.com

1. Background
	The document manager is a library implementing the TIPSTER
architecture document manager specification.  The intended purpose is to
simplify the use of the TIPSTER architecture by supplying a ready made
manager that can be linked with other modules complying with the
TIPSTER architecture.

2. Making and Installation
	The document manager has been coded in C++ and comes with a C++
and C API.  Along with the document manager library, a simple document
viewer and a test program are included.  The viewer,
CollectionViewer, requires Motif to compile and run.  The
TestCollection program has a simple text based menu interface.  The
source code for the library and viewer is provided.
	Two compiled versions of the library and viewer are also
supplied.  These are present in the suncc and egcs directories.  As is
likely evident by the names of the directories, the suncc directory
contains the code compiled by the SparcWorks C++ 4.2 compiler, CC.
The other directory, egcs, contains the code compiled by the EGCS
Version 1.0.3a C++ compiler, c++.  Providing the two versions was
necessary due to the different name mangling schemes used by the two
compilers.  The Makefiles used for these two versions are
present as Makefile.egcs and Makefile.suncc.  It should be an easy
task to edit either one of these for use with other compilers.  Please
see the comments included in the makefiles.
	To perform a clean compile of the entire package,
execute the following command:

	make -f makefile_name clean

where makefile_name is the name of the makefile to be used.
This will remove all evidence of a previous compilation.  Next,
execute the command:

	make -f makefile_name

This will compile all of the source code and create the library,
libdocman.a, the viewer, CollectionViewer, and TestCollection, a
simple test program.
	The viewer can be installed anywhere.  When it is run it
checks the value of the environment variable, TIP_DATA_DIRECTORY.
This should be set to the location of your document collections that
are in document manager format.  If this variable is not set, then the
viewer defaults to looking for a directory called data in the current
directory.  If this is not found then the program will exit.
	The test program also uses the TIP_DATA_DIRECTORY environment
variable.  The program allows you to view a collection and the documents
within it, and also to create new collections and add documents to them.
	The library can be installed anywhere as well.  The only
caveat is that when compiling and linking programs that will use the
document manager care must be taken to inform the compiler and linker
where to find the library and the include files.  For compilation, the
location of the include files is the only thing that must be
specified, typically with an option like this
'-I/usr/local/docman2.5'.  For linking, two items must be specified.
The first is the directory in which libdocman.a can be found and the
second is a directive to actually use the library.  The first option is
typically specified as '-L/usr/local/docman2.5' and the second by
'-ldocman'.  Note that the directory names after the -I and the -L
should be the locations where the include files and the library can be
found, respectively.
	There should not be a need to alter the environment variable,
LD_LIBRARY_PATH, since the viewer is linked statically with the docman
library.  If a runtime error is encountered that mentions the inability
to find libXm.so then the location of the Motif library will have to
be added to the LD_LIBRARY_PATH variable.

3. Files created and used by the document manager.
    The document manager creates a number of files for storage of collections
and their documents.  These files will be created in a directory identified
in the environment variable "TIP_DATA_DIRECTORY".  If that variable is not
set, the document manager will use the directory "./data".
    For each persistent collection, the document manager creates two types
of files in the TIP_DATA_DIRECTORY:

	COLLECTION.N
		For each volume in the Collection, the document manager
		creates a file COLLECTION.N, where COLLECTION is the name
		of the Collection and N is the number of the volume.
	COLLECTION.idx
		For each Collection, the document manager creates the file
		COLLECTION.idx that contains an index from external ID's
		to their containing Documents.  COLLECTION is replaced by the
		name of the collection.
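
    As an illustration of the naming scheme above, the following C++ sketch
builds the file paths the document manager would be expected to use.  The
helper names (dataDirectory, volumePath, indexPath) are hypothetical and are
not part of the document manager API.  Only the TIP_DATA_DIRECTORY variable,
the ./data default, and the COLLECTION.N and COLLECTION.idx patterns come
from the text above.  Treating an empty variable as unset is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: resolve the data directory as described above.
// Assumption: an empty TIP_DATA_DIRECTORY is treated the same as an unset one.
std::string dataDirectory() {
    const char* dir = std::getenv("TIP_DATA_DIRECTORY");
    if (dir != nullptr && *dir != '\0') {
        return std::string(dir);
    }
    return std::string("./data");
}

// Hypothetical helper: path of the volume file COLLECTION.N.
// The text does not say whether volume numbering starts at 0 or 1.
std::string volumePath(const std::string& dir,
                       const std::string& collection, int volume) {
    assert(volume >= 0);
    return dir + "/" + collection + "." + std::to_string(volume);
}

// Hypothetical helper: path of the external-ID index file COLLECTION.idx.
std::string indexPath(const std::string& dir, const std::string& collection) {
    return dir + "/" + collection + ".idx";
}
```

    For a collection named news stored under ./data, volumePath yields
./data/news.1 for volume 1 and indexPath yields ./data/news.idx.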

4. Include file conventions.
    The source code contains all of the include files for the
document manager.  There is an include file for each class.  In each case, 
the name of the include file is:

		cmcCLASSNAME.h

where CLASSNAME is the name of the class.  (cmc refers to the PRIDES CSCI 
from which the Document Manager originated.)
    Most of these classes will be recognizable from the Tipster architecture.
Several are unique to the PRIDES document manager: they are helper classes
used for the document manager implementation.
    In addition, the file "tipster.h" is the include file for the
C binding of the document manager.  It is a link to the cmcCBinding.h
file in the directory.
    Within a class description, we have followed a convention for 
specifying when ownership of an object is passed.  When ownership of
an object is passed, then the receiving object takes on the responsibility
of freeing the object, which is expected to be allocated using heap storage
via the "new" operator.  Ownership can be passed in one of two ways: by passing
a parameter to a function, in which case the object on which the function
is run takes ownership of the new object, or by return value of a function,
in which case the caller of the function takes ownership of the returned
object.  In this document manager, if items are passed or returned as 
non-const pointers, ownership of the item is transferred.  If items are
passed or returned as references or pointers to const objects, then 
ownership is not transferred.  (There may be exceptions to this rule:
they are noted in comments in the include files.)
    When a container object (e.g., a Collection, Set, or AnnotationSet)
is passed an object for storage, the container object generally makes its
own copy of the object.  The only time this is not true is if the
container is passed ownership of the object: in that case, the container
keeps the original copy of the item.  Generally, when items are returned
by containers but ownership of the item is not returned, then the function
returns a reference to the item as kept in the container.  Thus, passing
items into containers generally uses copy semantics, while returning
items from containers generally uses reference semantics.
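
    The ownership convention above can be sketched in C++ as follows.  Item
and Container are stand-in types invented for this illustration; they are not
classes from the document manager, and the real include files may differ in
detail.  The sketch only shows how the pointer and reference signatures signal
who is responsible for freeing an object.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in item type (hypothetical, for illustration only).
struct Item {
    std::string name;
    explicit Item(const std::string& n) : name(n) {}
};

// Stand-in container following the convention described above.
class Container {
public:
    Container() {}
    Container(const Container&) = delete;
    Container& operator=(const Container&) = delete;
    ~Container() {
        for (Item* p : items_) delete p;  // the container frees what it owns
    }

    // Const reference: ownership is NOT transferred; a copy is stored.
    void add(const Item& item) { items_.push_back(new Item(item)); }

    // Non-const pointer: ownership IS transferred; the original is kept.
    void adopt(Item* item) {
        assert(item != nullptr);
        items_.push_back(item);
    }

    // Const reference return: the container keeps ownership.
    const Item& at(std::size_t i) const { return *items_.at(i); }

    // Non-const pointer return: the caller takes ownership and must delete.
    Item* release(std::size_t i) {
        Item* p = items_.at(i);
        items_.erase(items_.begin() + static_cast<std::ptrdiff_t>(i));
        return p;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::vector<Item*> items_;
};
```

    A caller that uses add() may go on modifying or destroying its own
object, while a caller that uses adopt() must not delete the pointer
afterwards.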

5. A note on sets.
    The document manager implements Tipster Sets as a template, 
Set<X>.  This supports all of the functions of Sets as defined in the
Tipster specification.  The include file for this is cmcSet.h.  In
addition, the file cmcSet.cc is included to allow applications to use the 
set template.

6. Changes from the Tipster architecture.
    While we stayed as true to the architecture as we could, we did
have to stray in a number of areas.  One is in the use of copy/reference
semantics (which is not specified in the current version of the 
architecture).  The following is a list of other areas in which we
strayed from the architecture.  This list is limited to areas where
we removed functionality or changed the way things were done: there are
a number of areas where we have added functions that are documented in
the include files.

		1. The Create...() functions have all been changed into
	constructors.  In some cases, where there are multiple ways
	to create an object, we have created multiple constructors.  Most
	notably, any subclass of PersistentObject has a constructor to
	create an object of that subclass as well as a constructor
	corresponding to the Open() function.
		2. In all cases, Tipster functions are member functions
	of the corresponding Tipster classes.  Therefore, the parameter
	identifying the object on which the function is to be performed
	is not needed: C++ calling conventions identify this object.
		3. The Document operations related to SGML (WriteSGML()
	and ReadSGML()) are not yet implemented.
		4. The operations to annotate Documents and Collections
	are not yet implemented pending some clarification in the
	architecture of how these should be done.
		5. AttributeValue is split into a class hierarchy.  Each
	type of value supported by the Tipster architecture is now a
	subclass of the abstract base class AttributeValue.  See those
	class descriptions for more details.
		6. Collection is split into a class hierarchy.  Collection
	is an abstract base class.  Its primary children are 
	VirtualCollection, which contains Collections constructed via
	the CreateVirtualCollection() constructor, and PersistentCollection,
	which contains Collections constructed via the CreateCollection()
	constructor.
		7. CopyDocument() and CopyBareDocument() result in a
	new Document with a new Document ID.  This is in accord with an
	RFC submitted by Logicon.
		8. Document ID's are no longer unique across the entire
	system.  Instead, Document ID's are unique only within a
	Collection.
		9. The CreateCollection() constructor has an additional
	two optional parameters.  The first of these, VolumeCount, 
	identifies the number of file volumes that should be used to
	store the contents of the Collection.  For all but the largest
	Collections, this should be set to 1 (which is the default value).
	The second parameter is BlockSize.  This is the size of a disk
	block used to store data.  Too large a block will lead to
	wasted storage: too small a block will lead to excessive 
	fragmentation.  For Collections that are primarily meant as
	lists of Documents stored elsewhere, this should be 256.  (This
	is the default.)  For Collections containing Document texts, this 
	should be the block size of the local machine.
    		10. The document manager caches references to 
	PersistentCollections.  This means that multiple calls to open 
	the same PersistentCollection will produce references to the same 
	PersistentCollection objects.  This will generally be transparent 
	to the application: however, if a change is made to one instance 
	of a PersistentCollection within the same application, that change 
	is propagated to all in-memory copies of the same PersistentCollection.
		11. The document manager contains classes for
	DetectionNeeds, DetectionQueries, and DetectionNeedCollections.
	These are somewhat altered from the Tipster versions to make
	them independent of the search engine used.

7. Error handling.
    This document manager implements error handling using C++ exceptions.
The file cmcError.h contains the exception classes that may be thrown
by the document manager.
    Currently, these classes are not translated by the C binding to 
a more C-friendly format.

8. GCC compilation.
	Use the Makefile.gcc and gcc 2.7.2.3.  Because GCC doesn't
instantiate templates without some minor work, you will get unresolved
symbol errors at link time if you use the Set template with a new
element type without adding the corresponding template instantiation to
the bottom of cmcSet.cc.  Enjoy!  Mark Davis (madavis@crl.nmsu.edu).
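
	The kind of instantiation line meant here can be sketched as
follows.  The Set template below is a small stand-in written for this
example (the real one is in cmcSet.h and cmcSet.cc), and MyRecord and
countDistinct are hypothetical names.  The line of interest is the
explicit instantiation near the end, which asks the compiler to emit
all members of Set<MyRecord> in that translation unit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the Set<X> template; not the code from cmcSet.h.
template <class X>
class Set {
public:
    void insert(const X& x) {
        if (!contains(x)) items_.push_back(x);
    }
    bool contains(const X& x) const {
        return std::find(items_.begin(), items_.end(), x) != items_.end();
    }
    std::size_t size() const { return items_.size(); }

private:
    std::vector<X> items_;
};

// Hypothetical application type to be stored in a Set.
struct MyRecord {
    int id;
    bool operator==(const MyRecord& other) const { return id == other.id; }
};

// Explicit instantiation: forces code for every member of Set<MyRecord>
// to be generated here, so the linker can resolve the symbols.
template class Set<MyRecord>;

// Small usage example: count distinct records by inserting them in a Set.
std::size_t countDistinct(const std::vector<MyRecord>& records) {
    Set<MyRecord> seen;
    for (std::size_t i = 0; i < records.size(); ++i) seen.insert(records[i]);
    assert(seen.size() <= records.size());
    return seen.size();
}
```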

9. Thanks
	To Mark Davis for modifying the code and creating a
Makefile for gcc 2.7.2.3.

	
dn2.dtd stuff

<!-- @(#)dn2.dtd	1.11 -->
<!-- DTD for a Detection Need 2 -->
<!-- Created by John Palm of Logicon, Inc.  on 16 April 1998 -->
<!-- based on the specification by Chris Buckley of Sabir -->
<!-- Modified by Janet Walz of Sabir on 27 May 1998 -->
<!-- Modified by Chris Buckley of Sabir on 2 June 1998 --> 

<!ENTITY % yesno "yes|no" >

<!ENTITY % operator "AND|OR|AND-NOT|HEAD-RELATION|OTHER-OPER|INDEPENDENT" >

<!ENTITY % argument "%operator;|TEXT|FULL-TERM" >

<!ENTITY % operarg "SE-APP? & APP-SE? & CONTEXT? & (%argument;)+" >

<!ENTITY % seapplist "(SE-APP?,APP-SE?)" >

<!ENTITY % saclist "(SE-APP?,APP-SE?,CONTEXT?)" >

<!ELEMENT APP-SE - - CDATA >
<!ATTLIST APP-SE
   NUMBER-TO-RETRIEVE	NUMBER				#IMPLIED
   MIN-THRESHOLD 	NMTOKEN				#IMPLIED
   MAX-THRESHOLD	NMTOKEN				#IMPLIED
   REFORMULATION 	(REFORMULATION | NOREFORMULATION) REFORMULATION
   EXPANSION 		(EXPANSION | NOEXPANSION)	EXPANSION
   STEMMING 		(STEMMING | NOSTEMMING)		STEMMING
   HELP			(HELP | NOHELP)			NOHELP
   EXPL			(EXPL | NOEXPL)			NOEXPL
   MATCHES		(MATCHES | NOMATCHES)		NOMATCHES
   DOCFREQ		(DOCFREQ | NODOCFREQ)		NODOCFREQ
   COLLFREQ		(COLLFREQ | NOCOLLFREQ)		NOCOLLFREQ >

<!ELEMENT SE-APP - - (SE-APP-HELP? & SE-APP-EXPL? & SE-APP-DOCFREQ? & SE-APP-COLLFREQ? &
			SE-APP-INTERNAL-ID? & SE-APP-MATCHES*)>
<!ELEMENT SE-APP-HELP - - CDATA > <!-- Static help string from the search engine -->
<!ELEMENT SE-APP-EXPL - - CDATA > <!-- Dynamic explanation from search engine -->
<!ELEMENT SE-APP-DOCFREQ - - CDATA > <!-- number of documents meeting operand -->
<!ELEMENT SE-APP-COLLFREQ - - CDATA > <!-- number of occurrences of operand -->
<!ELEMENT SE-APP-INTERNAL-ID  - - CDATA > <!-- internal id for this node -->
<!ELEMENT SE-APP-MATCHES - - CDATA > <!-- a document id which meets the operand -->
<!ATTLIST SE-APP-MATCHES
   SPAN-START	NUMBER	#REQUIRED
   SPAN-END	NUMBER	#REQUIRED
   WEIGHT	NMTOKEN	#REQUIRED>


<!ELEMENT CONTEXT - - (ANN-ATTR) >
<!ATTLIST CONTEXT
	DISTANCE NUMBER #IMPLIED
	ORDERED (ORDERED | UNORDERED) #IMPLIED>

<!ELEMENT FULL-TERM - - (SE-APP? & APP-SE? & TEXT? & ANN-ATTR*) >
<!ATTLIST FULL-TERM
   WEIGHT NMTOKEN #IMPLIED >

<!ELEMENT TEXT - - CDATA  -- this will ignore any tags within it -->

<!-- ..............Beginning of Operators................. -->


<!ELEMENT INDEPENDENT - - (%operarg;) > 
<!ATTLIST INDEPENDENT
   WEIGHT NMTOKEN #IMPLIED >

<!ELEMENT AND - - (%operarg;) >
<!ATTLIST AND
   EXACT (EXACT) #IMPLIED
   FUZZY NMTOKEN #IMPLIED -- fuzzy and exact are mutually exclusive --
   WEIGHT NMTOKEN #IMPLIED >

<!ELEMENT OR - - (%operarg;) >
<!ATTLIST OR
   EXACT (exact) #IMPLIED
   FUZZY NMTOKEN #IMPLIED -- fuzzy and exact are mutually exclusive --
   WEIGHT NMTOKEN #IMPLIED >

<!ELEMENT AND-NOT - - (%operarg;) >
<!ATTLIST AND-NOT
   EXACT (exact) #IMPLIED
   FUZZY NMTOKEN #IMPLIED -- fuzzy and exact are mutually exclusive --
   WEIGHT NMTOKEN #IMPLIED >

<!ELEMENT HEAD-RELATION - - (%operarg;) >
<!ATTLIST HEAD-RELATION
   HR-TYPE (MORPH-VAR|SYNONYM|RELATED|CONSTRAINT|OTHER) #REQUIRED
   WEIGHT NMTOKEN #IMPLIED >

<!-- other search engine defined operator -->
<!ELEMENT OTHER-OPER - - (%operarg; & OTHER-ARGS) >
<!ATTLIST OTHER-OPER
   WEIGHT NMTOKEN #IMPLIED >
<!ELEMENT OTHER-ARGS - - CDATA >

<!-- ....................End of operators................... -->

<!ELEMENT MERGE-INFO - - CDATA >
<!ATTLIST MERGE-INFO
	MERGETYPE (engine-choice) #IMPLIED >


<!-- Special Stuff for INFONEEDs -->

<!ELEMENT INFO-NEED - - (DOC-COLLECTION* & RESTRICT-SET? & FEEDBACK-INFO? & (%argument;)?)>
<!ATTLIST INFO-NEED
   WEIGHT NMTOKEN #IMPLIED >

<!-- single TIPSTER collection Index -->
<!ELEMENT DOC-COLLECTION - - CDATA >

<!ELEMENT DOCID - - CDATA >

<!ENTITY % b-opernd "B-AND|B-OR|B-AND-NOT|TEXT|FULL-TERM" >

<!ELEMENT B-AND - - ((%b-opernd;),(%b-opernd;)) >
<!ELEMENT B-OR - - ((%b-opernd;),(%b-opernd;)) >
<!ELEMENT B-AND-NOT - - ((%b-opernd;),(%b-opernd;)) >

<!ELEMENT RESTRICT-SET - - ((%b-opernd;)? & DOCID*) >


<!ELEMENT FEEDBACK-INFO - - (DOCID-REL* & TEXT-REL*)>

<!ELEMENT DOCID-REL - - CDATA >
<!ATTLIST DOCID-REL
   REL 		(REL | NONREL) 	#IMPLIED
   SPAN-START 	NUMBER		#REQUIRED
   SPAN-END   	NUMBER		#REQUIRED >

<!ELEMENT TEXT-REL - - CDATA >
<!ATTLIST TEXT-REL
   REL (REL | NONREL) REL >


<!-- End of Special Stuff for INFONEEDs -->


<!ELEMENT COMMENT - - CDATA  -- this will ignore any tags within it -->

<!-- make the attlist actual elements -->
<!ELEMENT ANN-ATTR - - (ANN-TYPE? & ATTR-NAME? & ATTR-TYPE? & ATTR-VALUE?) >
<!ELEMENT ANN-TYPE - - CDATA >
<!ELEMENT ATTR-NAME - - CDATA >
<!ELEMENT ATTR-TYPE - - CDATA >
<!ELEMENT ATTR-VALUE - - CDATA >

<!-- the +(comment) at the end of the DN2 spec is an exception -->
<!-- I tried to put in a #PCDATA but it seriously confused the parsing whenever any additional -->
<!--  characters appeared. -->
<!ELEMENT DN2 - - (APP-SE? & SE-APP? & MERGE-INFO? & ((%argument;)|INFO-NEED+)) +(COMMENT) >
<!ATTLIST DN2
         ID ID #IMPLIED
         OUTPUT-QUERY (OUTPUT-QUERY | NO-OUTPUT-QUERY) NO-OUTPUT-QUERY
         OUTPUT-DOCS (OUTPUT-DOCS | NO-OUTPUT-DOCS) OUTPUT-DOCS>


Detection Need 2
----------------

A Detection Need Type 2 (DN2) is an SGML query used for querying a document
collection. The Document Type Definition (DTD) is the specification of the
grammar of a DN2.  The DTD allows an SGML parser/validator to read a DN2 and
determine whether it is syntactically correct.  

The basic form of a DN2 query is a start tag of <DN2> followed by the various
terms that are to be searched for.  The DN2 concludes with an end tag of
</DN2>.  The terms can be straight text surrounded by <TEXT> and </TEXT>
tags, or they can be more complex boolean expressions of text terms.  The
FULL-TERM tag allows the user to combine a text term with some direction
about in which field to search for the text.  For example, in one query a
user could locate all documents that contain "Mark Twain", but using a
FULL-TERM tag the user could locate all documents that have "Mark Twain" in
the author field.
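
To make the two forms concrete, the following C++ sketch assembles the
corresponding DN2 strings.  The function names are hypothetical, and the
element nesting follows the DTD above (TEXT directly inside DN2, or TEXT
plus ANN-ATTR inside FULL-TERM).  How a given search engine maps an ANN-TYPE
value such as "author" onto a field is an assumption made for illustration;
the DTD fixes only the syntax.

```cpp
#include <cassert>
#include <string>

// Wrap a search phrase in TEXT tags.
// TEXT is declared CDATA, so a "</" inside the phrase could end the
// element early; this sketch simply rejects such input.
std::string textTerm(const std::string& phrase) {
    assert(phrase.find("</") == std::string::npos);
    return "<TEXT>" + phrase + "</TEXT>";
}

// Simplest DN2: a single text term between the DN2 start and end tags.
std::string simpleDn2(const std::string& phrase) {
    return "<DN2>" + textTerm(phrase) + "</DN2>";
}

// DN2 with a FULL-TERM that pairs the text with an annotation type.
// Assumption: the search engine interprets annType as the field to search.
std::string fieldDn2(const std::string& phrase, const std::string& annType) {
    return "<DN2><FULL-TERM>" + textTerm(phrase) +
           "<ANN-ATTR><ANN-TYPE>" + annType + "</ANN-TYPE></ANN-ATTR>"
           "</FULL-TERM></DN2>";
}
```

Calling simpleDn2("Mark Twain") gives <DN2><TEXT>Mark Twain</TEXT></DN2>,
the plain query from the example above.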

One note is that it is up to the search engine to document and enable the
full capabilities of the DN2.  The DTD is simply the specification of the
format of the DN2.  It does not say anything about the semantics of the query.


Documentation on the Query Tool
-------------------------------

The purpose of the query tool is to simplify the creation and evaluation of a Detection Need Type 2 (DN2).  It allows the user to graphically create, view, and edit DN2s using a portable Tcl/Tk interface.  It also allows the user to send the DN2 to a search engine for evaluation.  Since the tool uses the TIPSTER interface to communicate with the search engine, it should be easy to relink it with another search engine.  Currently, it is linked with the Z39.50 search engine.

The query tool is made up of three parts: James Clark's SP SGML parser, Allan Brighton's tree extension to Tcl/Tk, and the query tool code which builds on those two pieces.  It is written in a combination of C++ and Tcl/Tk scripting.  An extension of the Tcl/Tk interpreter was developed which contains two additional scripting commands, parsedn2 and senddn2.  The parsedn2 command accepts an SGML DN2 string and determines whether it conforms to the DN2 DTD.  The senddn2 command accepts an SGML DN2 string, sends it to the search engine, and returns the matches.

James Clark's SP SGML parser is used to validate the DN2 and to extract all of the data.


A DN2 is an SGML representation of a query to be performed upon a database of documents.  SGML was chosen as the base representation due to its portability.  The query tool allows one to deal graphically with the DN2 rather than having to type the SGML representation directly.  Rather than create an entire parser from scratch, James Clark's SP library was used to parse the DN2 SGML instance.

The query tool was written using a combination of C++ and Tcl/Tk.  The parsing and validation of SGML documents is performed by James Clark's SP Version 1.3 SGML Validator.  This package ensures that an SGML document meets the specifications of its Document Type Definition (DTD).  An API included with it allows one to create applications that can receive SGML input.


The Tree-4.2 package from Allan Brighton provides tree drawing capability to Tcl/Tk.  It allows the creation of a tree to which one can add and remove nodes.  The tree package takes care of correctly drawing the tree once any edits have been made.  


The query tool code uses the SP parser/validator to extract all of the data from the DN2 and places it in an internal data structure.  From this 

OUTLINE

1. Description of the program
 a. query tool
 b. SP
 c. Tree-4.2


Dataflow

Starting the dn2tcl program invokes a Tcl/Tk shell that has been expanded with the parsedn2 and senddn2 commands.  Sourcing querytool.tcl creates the query tool window and opens a new dn2.  The user selects a node using the left mouse button.  Once selected the following operations can be performed on the selected node by selecting one of the options under the Node menu:
Add Node:  This adds a new child node to the selected node.
Remove Node:  This deletes the selected node and all of its children.
Prune Node: This deletes the tree beneath the selected node, but keeps the selected node.
Edit Attributes:  Brings up a window displaying the attributes of the selected node.  The values of the attributes can be altered and new attributes can be added.

A selected node can be moved by putting the mouse cursor on the selected node and dragging it using the right mouse button.  The algorithm for moving the node works as follows:  first, find whether any nodes overlap the rectangular box.  If so, find the closest one and reattach the selected node there, if possible.  If there are no overlapping nodes, search for the closest node to the left of the new position.
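
The target selection just described can be sketched in C++ as shown below.  This is not the query tool's Tcl code.  The Box type and function names are hypothetical, "closest" is assumed to mean the smallest distance between box centers, "to the left" is assumed to mean that a node's right edge does not pass the dragged box's left edge, and the "if possible" legality check is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Axis-aligned rectangle: top-left corner plus width and height.
struct Box { double x, y, w, h; };

bool overlaps(const Box& a, const Box& b) {
    return a.x < b.x + b.w && b.x < a.x + a.w &&
           a.y < b.y + b.h && b.y < a.y + a.h;
}

// Assumption: closeness is measured between box centers.
double centerDistance(const Box& a, const Box& b) {
    double dx = (a.x + a.w / 2) - (b.x + b.w / 2);
    double dy = (a.y + a.h / 2) - (b.y + b.h / 2);
    return std::sqrt(dx * dx + dy * dy);
}

// Returns the index of the node to reattach to, or -1 if there is none.
int chooseTarget(const Box& dragged, const std::vector<Box>& others) {
    assert(dragged.w >= 0 && dragged.h >= 0);
    int best = -1;
    double bestDist = 0.0;

    // Pass 1: prefer the closest node that overlaps the dragged box.
    for (std::size_t i = 0; i < others.size(); ++i) {
        if (!overlaps(dragged, others[i])) continue;
        double d = centerDistance(dragged, others[i]);
        if (best < 0 || d < bestDist) { best = static_cast<int>(i); bestDist = d; }
    }
    if (best >= 0) return best;

    // Pass 2: otherwise take the closest node lying to the left.
    for (std::size_t i = 0; i < others.size(); ++i) {
        if (others[i].x + others[i].w > dragged.x) continue;
        double d = centerDistance(dragged, others[i]);
        if (best < 0 || d < bestDist) { best = static_cast<int>(i); bestDist = d; }
    }
    return best;
}
```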


Once a DN2 has been crafted graphically, the SGML DN2 can be viewed by choosing the DN2->View menu item.  The resulting text DN2 can be parsed for correctness, edited, and sent to the search engine.


Core Data Structures:

C++
To develop an application using the SP parser API the programmer creates an object that inherits from the SGMLApplication object.  This object has virtual methods for a series of events that correspond to parts of an SGML document.  These include such events as startElement, endElement, and data.  The programmer supplies implementations of these event handlers to build the application.

In the query tool, a dn2app object was created which inherits from SGMLApplication.  One of its members is a dn2doc object into which the event handlers insert their data.  It also has two methods for extracting the data from the parsed DN2.  One returns the data as SGML text and the other returns the data in an internal format that the Tcl code uses to create its tree structure.

The dn2doc object contains a tree for storage of the DN2 information.  Each node maintains a vector of pointers to its child nodes.  The syntax of SGML requires that all of a parent's child nodes be ended before the parent node is ended.  For example, the following is illegal SGML: <b>BOLD <i>BOLD ITALIC</b> italic</i>.  It must be written like this: <b>BOLD <i>BOLD ITALIC</i></b><i> italic</i>.  This makes it straightforward to store the structure of the document.  As each start element event occurs, a new node is created, placed in the list of the current node's children, and the current node pointer is changed to point to the new node.  As each end element event occurs, the current node pointer moves to the current node's parent.  Contained within each node are the type of the element, a vector of the element's attribute names and values, and the data that appears between the start and end tags.
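
The event-driven construction described above can be sketched as follows.  This is a simplified stand-in, not the dn2doc source: Node and TreeBuilder are hypothetical names, and the handler signatures take plain strings rather than the event structures that SP's SGMLApplication actually passes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One element of the parsed document.
struct Node {
    std::string type;                                            // element name
    std::vector<std::pair<std::string, std::string> > attributes;
    std::string data;                                            // text between the tags
    Node* parent;
    std::vector<Node*> children;

    Node(const std::string& t, Node* p) : type(t), parent(p) {}
    Node(const Node&) = delete;
    Node& operator=(const Node&) = delete;
    ~Node() {
        for (std::size_t i = 0; i < children.size(); ++i) delete children[i];
    }
};

// Builds the tree by tracking a "current node" pointer across events.
class TreeBuilder {
public:
    TreeBuilder() : root_(nullptr), current_(nullptr) {}
    TreeBuilder(const TreeBuilder&) = delete;
    TreeBuilder& operator=(const TreeBuilder&) = delete;
    ~TreeBuilder() { delete root_; }

    // Start tag: create a node under the current node and descend into it.
    void startElement(const std::string& type) {
        Node* n = new Node(type, current_);
        if (current_ != nullptr) {
            current_->children.push_back(n);
        } else {
            assert(root_ == nullptr);  // only one document element is expected
            root_ = n;
        }
        current_ = n;
    }

    // Character data: append to the node that is currently open.
    void data(const std::string& text) {
        assert(current_ != nullptr);
        current_->data += text;
    }

    // End tag: proper nesting means we simply move back up to the parent.
    void endElement() {
        assert(current_ != nullptr);
        current_ = current_->parent;
    }

    const Node* root() const { return root_; }

private:
    Node* root_;
    Node* current_;
};
```

Because SGML elements must nest properly, no explicit stack is needed; the parent pointer in each node plays that role.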


These objects are used by the new Tcl command parsedn2.  This accepts a string containing an SGML DN2 document and returns successfully if it meets the DN2 DTD.  Optionally the string can be returned in either SGML format or in the internal Tcl format.  The internal Tcl format is as follows:

node := {nodetype attributelist nodelist}
attributelist := { { attribute_name attribute_value }* }
nodelist := { node* }

Basically it is a list of a nodetype, a list of attribute name and value pairs, and a list of subnodes.
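
As an illustration of this grammar, the C++ sketch below serializes a small tree into the nested list form.  Dn2Node, isPlainWord, and toTclList are hypothetical names, not the dn2tcl source.  The sketch performs no Tcl quoting and therefore assumes that element names, attribute names, and attribute values contain no whitespace, braces, quotes, or backslashes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Minimal node: an element type, attribute pairs, and subnodes.
struct Dn2Node {
    std::string type;
    std::vector<std::pair<std::string, std::string> > attributes;
    std::vector<Dn2Node> children;
};

// True if the word can appear in a Tcl list without quoting (assumed rule).
bool isPlainWord(const std::string& s) {
    return !s.empty() && s.find_first_of(" \t\n{}\\\"") == std::string::npos;
}

// node := {nodetype attributelist nodelist}
std::string toTclList(const Dn2Node& n) {
    assert(isPlainWord(n.type));
    std::string out = "{" + n.type + " {";
    // attributelist := { { attribute_name attribute_value }* }
    for (std::size_t i = 0; i < n.attributes.size(); ++i) {
        assert(isPlainWord(n.attributes[i].first));
        assert(isPlainWord(n.attributes[i].second));
        if (i > 0) out += " ";
        out += "{" + n.attributes[i].first + " " + n.attributes[i].second + "}";
    }
    out += "} {";
    // nodelist := { node* }
    for (std::size_t i = 0; i < n.children.size(); ++i) {
        if (i > 0) out += " ";
        out += toTclList(n.children[i]);
    }
    out += "}}";
    return out;
}
```

A DN2 node with attribute ID set to q1 and a single TEXT subnode comes out as {DN2 {{ID q1}} {{TEXT {} {}}}}.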

The command works by taking the input string, writing it to a temporary file, and then calling the parse routine with the name of the file.  The parse routine is able to read data from either the standard input or from a file.


Once the Tcl routine receives the internally formatted list, it traverses the list populating its own data structure with the data.  

The Tcl code is broken into object-like entities.  Most of the objects are focused on the windows that are displayed.  The exception is the tree data structure, which was encapsulated into its own object.  The main window is constructed and displayed by the DN2.tcl code.  From this the other entities can be reached: the view_attribs window, the add_attribs window, the add_sub_node window, the dn2win window, and the datastructure object.

The view_attribs window displays the attributes of the selected node and allows editing of those attributes.  Whenever the selected node is changed this window is updated with the attributes of the new node.

The add_attribs window is invoked from the view_attribs window and it displays the available attributes that can be added to the selected node.

The add_sub_node window displays the allowable subnodes for the selected node.  The allowable subnodes are those that are syntactically allowed for the selected node; no semantic checking is done.  For example, it is correct syntax for a TEXT node to follow the DN2 node, but it is semantically incorrect to have two of them.  The add_sub_node window will always show that a TEXT node can be added when the DN2 node is selected regardless of whether a TEXT node already exists or not.

The dn2win window displays the SGML DN2 that the displayed tree represents.  Edits to the SGML DN2 can be made and will be reflected back on the tree display once the apply button is pressed.  By pressing the parse button, the current DN2 text is parsed and a message is returned reflecting the success of the parse.

The datastructure object is implemented as a general tree.  Each node has a pointer to its parent, its first child, and its right sibling.  The data stored with each node is its type, a list of attribute name and value pairs, and a string containing the text of the node.  The text of the node is all of the data between the start tag and end tag that is not another tag.
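
The first-child/right-sibling layout described above is sketched here in C++ for illustration; the actual datastructure object is written in Tcl, and the names TreeNode, appendChild, childCount, and prune are hypothetical.  The prune function mirrors the Prune Node menu operation, which removes everything beneath a node but keeps the node itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct TreeNode {
    std::string type;
    std::vector<std::pair<std::string, std::string> > attributes;
    std::string text;
    TreeNode* parent = nullptr;
    TreeNode* firstChild = nullptr;
    TreeNode* rightSibling = nullptr;
    explicit TreeNode(const std::string& t) : type(t) {}
};

// Add a new child at the end of the parent's sibling chain.
TreeNode* appendChild(TreeNode* parent, const std::string& type) {
    assert(parent != nullptr);
    TreeNode* child = new TreeNode(type);
    child->parent = parent;
    if (parent->firstChild == nullptr) {
        parent->firstChild = child;
    } else {
        TreeNode* last = parent->firstChild;
        while (last->rightSibling != nullptr) last = last->rightSibling;
        last->rightSibling = child;
    }
    return child;
}

// Children are reached by following rightSibling from firstChild.
std::size_t childCount(const TreeNode* node) {
    std::size_t n = 0;
    for (const TreeNode* c = node->firstChild; c != nullptr; c = c->rightSibling) ++n;
    return n;
}

// Delete the whole subtree beneath node, keeping node itself.
void prune(TreeNode* node) {
    TreeNode* c = node->firstChild;
    while (c != nullptr) {
        TreeNode* next = c->rightSibling;
        prune(c);
        delete c;
        c = next;
    }
    node->firstChild = nullptr;
}
```

With only three pointers per node, a node can have any number of children without needing a per-node array.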

A DN2 can be saved and can be read in.

The query tool was created on a Solaris 2.4 Sparc 10 using Tcl/Tk 8.0p2, Allan Brighton's Tree 4.2, James Clark's SP Version 1.3, and GNU C++.


This file documents the implementation of the Annotator Job Manager,
by Logicon, Inc. 1998.

The system is implemented in Perl, v. 5.004.
Its principal components are CGI programs; these use the CGI.pm
module (tested with v. 2.42; earlier versions are known not to work).
In addition, several programs are provided for use by an administrator
on the local machine.

CGI programs:
	jobupload.cgi	
	jobcheck.cgi
	jobcancel.cgi

Command-line programs:
	jobcheck
	jobcancel

Background programs:
	fulfill_job
 
Each of these programs uses a module file in the same directory,
named JobCommon.pm.  This module implements a number of functions
for common use by the programs.

The function of each program is briefly described below.

jobupload.cgi
	This program allows the user to initiate a job, by uploading
	an archive file (either .tar, .tar.Z, or .tar.gz).
	Creating a job consists of saving the archive file in an upload
	directory (named 'incoming'), generating a unique job id, and
	creating a file (in the 'jobs' directory) which will track
	the state of the job.  The job id is derived by selecting two
	words at random from a words file; presently this file is named
	'words.6', and contains all the lower-case six-letter words from
	the /usr/dict/words file.  The two words chosen are joined with
	a hyphen.  This string is used as the unique job id.
	It is presented to the user as a convenient way to identify her
	job when requesting actions or information on the job.
	The job id is used as the file name for the job status file 
	and various other temporary files generated in the course of
	processing the job.
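
	The id scheme can be sketched as below.  The real system is
	written in Perl; this C++ version is only an illustration, and
	makeJobId is a hypothetical name.  The word list is passed in
	rather than read from words.6.  The text does not say how a
	collision with an existing job id is handled, so this sketch
	does not check for one.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Pick two words at random from the list and join them with a hyphen.
// The random number generator is supplied by the caller so that the
// function can be seeded deterministically in tests.
template <class Rng>
std::string makeJobId(const std::vector<std::string>& words, Rng& rng) {
    assert(!words.empty());
    std::uniform_int_distribution<std::size_t> pick(0, words.size() - 1);
    const std::string& first = words[pick(rng)];
    const std::string& second = words[pick(rng)];
    return first + "-" + second;
}
```

	With a list of six-letter words every id is thirteen characters
	long, with the hyphen in the seventh position.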

jobcheck.cgi
	This program allows the user to check on the status of any
	job for which she knows the job id.
	It also allows the user to request further annotation actions
	on the job; each possible action is represented by a button.
	If a new annotator is added to the system, this file must be
	modified to recognize it.
	Additional actions are queued up, so that no action is 
	commenced until the previously requested actions have all
	completed.  New jobs are initialized with no actions pending.
	When all the actions requested for a job have been completed,
	the archive file containing the final results is made
	available to the user to download, via a link on this page.

jobcancel.cgi
	This program allows the user to cancel any job for which she
	knows the job id.  When a job is cancelled, all the files associated
	with it are deleted from the file system.

jobcheck
	This program can be run from the command line, in the main
	directory of the Annotator Job Manager.  If a job id is given
	on the command line, the job's status is displayed.  If no
	arguments are given, a list of currently active jobs is displayed.

jobcancel
	This program can be run from the command line, in the main
	directory of the Annotator Job Manager.  If a job id is given
	on the command line, the job is canceled, just as per 
	jobcancel.cgi.  If no arguments are given, a list of currently 
	active jobs is displayed.


fulfill_job
	This program is responsible for ensuring that the requested
	actions get performed.  Typically this consists of extracting
	the job's files from its archive, launching an external 
	annotator, or sending email to an administrator.
	If a new annotator is added to the system, this file must be
	modified to recognize it.
	Whenever a job has one or more unfinished actions, it must have a 
	background process running, which will perform those pending 
	actions.  Whenever an action is added to a job, the adding process 
	checks the job for the existence of the fulfiller; if one is not 
	running, one is started.  The fulfiller removes the next action 
	from the job's queue and performs it; when it is done, it marks 
	that action as complete in the job's status file.  The fulfiller 
	continues this cycle until no more uncompleted actions remain for 
	the job; then it exits.  In the current implementation, the 
	fulfiller processes are started using the Unix 'at' command.  The 
	'at' command will run fulfill_job with the job id as an argument.
	The system as delivered is configured with two annotators, which 
	perform a "null" operation on the document collection (that is,
	they do nothing).  "Null-Short" completes in very little time,
	and "Null-Long" completes in several minutes.  These allow the
	user to test the pipeline of the system.
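
	The queue handling described for fulfill_job can be sketched as
	follows.  This is a single-process C++ illustration, not the
	Perl implementation: the Job structure and the function names
	are hypothetical, the job status file is replaced by in-memory
	fields, and the fulfiller is an ordinary function call rather
	than a separate process started with 'at'.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// In-memory stand-in for a job status file.
struct Job {
    std::string id;
    std::deque<std::string> pending;     // requested but not yet performed
    std::vector<std::string> completed;  // performed, in order
    bool fulfillerRunning = false;
};

// Queue an action.  Returns true when no fulfiller is running, which
// tells the caller that it must start one.
bool addAction(Job& job, const std::string& action) {
    assert(!action.empty());
    job.pending.push_back(action);
    return !job.fulfillerRunning;
}

// Fulfiller: take the next action, perform it, mark it complete, and
// repeat until nothing is pending; then stop.
void fulfill(Job& job, const std::function<void(const std::string&)>& perform) {
    job.fulfillerRunning = true;
    while (!job.pending.empty()) {
        std::string action = job.pending.front();
        job.pending.pop_front();
        perform(action);
        job.completed.push_back(action);
    }
    job.fulfillerRunning = false;
}
```

	Because actions are taken from the front of the queue one at a
	time, no action starts before the previously requested ones
	have finished.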