Design Notes


Overview

The NIST PRISE indexer builds an index from text files marked up in SGML, for use by the PRISE search engine. Indexing consists of 8 individual programs that together produce the data set used with the Z39.50 interface. The core of the process is two steps, neither of which needs an explicit sort. The first step (rel.build.tmm) produces the basic inverted file; the second step (rebuild.tmm) adds the term weights to the inverted file and reorganizes it for maximum efficiency. The basic inverted file is built without an explicit sort by inserting terms into a right-threaded binary tree. The 8 programs used to create a PRISE index are described below.

  1. Verify the input files conform to SGML.
    sgmls, an SGML verifier (not developed by NIST), checks the input.

  2. Create tokens/terms from the sgmls output.
    sgmls.parser returns the tokens that are not SGML tags.

  3. Create the basic inverted file from the first set of terms.
    rel.build.tmm creates the term tree and temporary posting files, excluding tokens that are common words.

  4. Compute the term weights and create the dictionary.
    rebuild.tmm uses the term tree and temporary postings to create the final postings file ("postings"), which contains the term weights, and the ASCII dictionary ("dictionary").

  5. Build the binary dictionary.
    prep builds the binary dictionary.

  6. Create a document map and table for displaying results.
    docmap creates the document map and document table.

  7. Create a title map and table for displaying titles.
    doctitles creates the titles map and titles table.

  8. Create a numbering sequence used only by the command line search tool "search.small".
    docmapseq creates the document map sequence. (for use with search.small)
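
The way a right-threaded binary tree removes the need for a sort (step 3) can be illustrated with a minimal sketch. This is not the rel.build.tmm source: the node layout, the function names (tree_insert, tree_walk, index_tokens, is_common_word), and the three-entry common-word list are all assumptions made for illustration, and a simple occurrence count stands in for the temporary postings. Each node's right pointer is either a real child or a "thread" to its in-order successor, so the terms can be read back in sorted order with a plain loop.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD 99  /* word-size limit listed under the constraints */

/* Node of a right-threaded binary search tree keyed by term (hypothetical layout). */
typedef struct Node {
    char term[MAX_WORD + 1];
    unsigned count;        /* occurrences seen; stand-in for real postings */
    struct Node *left;
    struct Node *right;    /* right child, or thread to the in-order successor */
    int right_is_thread;   /* 1 when `right` is a thread rather than a child */
} Node;

static Node *new_node(const char *term) {
    Node *n = calloc(1, sizeof *n);   /* zeroed: left = right = NULL */
    assert(n != NULL);
    strncpy(n->term, term, MAX_WORD); /* over-long terms are truncated */
    n->count = 1;
    n->right_is_thread = 1;
    return n;
}

/* Insert a term, or bump its count if already present. Returns the root. */
Node *tree_insert(Node *root, const char *term) {
    if (root == NULL) return new_node(term);
    Node *cur = root;
    for (;;) {
        int cmp = strncmp(term, cur->term, MAX_WORD);
        if (cmp == 0) { cur->count++; return root; }
        if (cmp < 0) {
            if (cur->left == NULL) {
                Node *n = new_node(term);
                n->right = cur;          /* successor of a left child is its parent */
                cur->left = n;
                return root;
            }
            cur = cur->left;
        } else {
            if (cur->right_is_thread) {
                Node *n = new_node(term);
                n->right = cur->right;   /* new node inherits the parent's thread */
                cur->right = n;
                cur->right_is_thread = 0;
                return root;
            }
            cur = cur->right;
        }
    }
}

/* Collect terms in ascending order: no recursion, no stack, no sort. */
size_t tree_walk(const Node *root, const char **out, size_t max_out) {
    size_t n = 0;
    const Node *cur = root;
    while (cur != NULL && cur->left != NULL) cur = cur->left;
    while (cur != NULL && n < max_out) {
        out[n++] = cur->term;
        if (cur->right_is_thread) {
            cur = cur->right;            /* follow the thread to the successor */
        } else {
            cur = cur->right;            /* enter the right subtree... */
            while (cur->left != NULL) cur = cur->left;  /* ...and go leftmost */
        }
    }
    return n;
}

/* Hypothetical common-word list; the real list is not given in these notes. */
static const char *COMMON_WORDS[] = { "a", "of", "the" };

int is_common_word(const char *tok) {
    for (size_t i = 0; i < sizeof COMMON_WORDS / sizeof COMMON_WORDS[0]; i++)
        if (strcmp(tok, COMMON_WORDS[i]) == 0) return 1;
    return 0;
}

/* Feed tokens into the tree, skipping common words. */
Node *index_tokens(Node *root, const char **tokens, size_t ntokens) {
    for (size_t i = 0; i < ntokens; i++)
        if (!is_common_word(tokens[i]))
            root = tree_insert(root, tokens[i]);
    return root;
}
```

Calling index_tokens on a token stream and then tree_walk on the result yields the distinct terms in dictionary order. Because each insert keeps the threads consistent, the ordering is a by-product of building the tree, which is the property the overview relies on to skip a separate sorting pass.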


    Processing Details and Location of Source Code by Step

    This section describes in greater detail the major steps executed by the PRISE indexer. Each major step is implemented as a single program. What follows is a short description of each program.

    1. sgmls


    2. sgmls.parser


    3. rel.build.tmm


    4. rebuild.tmm


    5. prep


    6. docmap


    7. doctitles


    8. docmapseq


    Constraints or Boundaries:

    1. Number of Documents: 2^20 (1,048,576)
    2. Word Size: 99 characters (see symb_defs.h)
    3. Weight: 4095
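
As a rough illustration of how the word-size and weight limits above might be enforced, the sketch below checks a term's length and clamps a weight into range. The macro and function names (MAX_WORD_LEN, MAX_WEIGHT, term_fits, clamp_weight) are hypothetical and are not taken from symb_defs.h, and clamping is only one possible policy; the notes do not say whether the indexer clamps, truncates, or rejects out-of-range values. Note that 4095 equals 2^12 - 1, the largest value that fits in 12 bits.

```c
#include <assert.h>
#include <string.h>

#define MAX_WORD_LEN 99    /* word-size limit from the constraints list */
#define MAX_WEIGHT   4095  /* weight limit from the constraints list (2^12 - 1) */

/* Return 1 if the term is within the word-size limit, 0 otherwise. */
int term_fits(const char *term) {
    assert(term != NULL);
    return strlen(term) <= MAX_WORD_LEN;
}

/* Clamp a raw weight into the range 0..MAX_WEIGHT (assumed policy). */
unsigned clamp_weight(long raw) {
    if (raw < 0) return 0;
    if (raw > MAX_WEIGHT) return MAX_WEIGHT;
    return (unsigned)raw;
}
```

A caller would test each token with term_fits before inserting it into the term tree and pass each computed weight through clamp_weight before writing it to the postings.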


    National Institute of Standards and Technology
    Last updated: Tuesday, 01-Aug-2000 13:16:47 UTC

    Date created: Monday, 31-Jul-00