Indexing Your Own Collection
Frequently Asked Questions When Building Your Own Index
  1. How do I build my data?

    The search and retrieval system used at NIST (ZPrise) requires SGML-formatted data. The preliminary work (i.e., indexing) specifically requires the following tags:

  2. Where will my collection reside?

    In a location with roughly 35% free space, create the following directory structure:

    collectionname_yr-yr/collectionname_yr-yr_text
    where your text files reside
    collectionname_yr-yr/collectionname_yr-yr_index
    where your user-defined index files reside and where your index is built
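
    The layout above can be sketched in Python as follows. This is only an illustration, not part of ZPrise: the function names, the collection name "mycoll_98-99", and the reading of "~35% free space" as a fraction of the file system's total size are all assumptions.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def free_fraction(path: Path) -> float:
    """Return the fraction of the file system holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def create_collection_layout(parent: Path, collection: str) -> tuple[Path, Path]:
    """Create <collection>/<collection>_text and <collection>/<collection>_index."""
    base = parent / collection
    text_dir = base / f"{collection}_text"    # where the text files reside
    index_dir = base / f"{collection}_index"  # where the index is built
    # exist_ok makes the call safe to repeat on an existing layout.
    text_dir.mkdir(parents=True, exist_ok=True)
    index_dir.mkdir(parents=True, exist_ok=True)
    return text_dir, index_dir


if __name__ == "__main__":
    # A temporary directory stands in for the real parent directory.
    with tempfile.TemporaryDirectory() as tmp:
        parent = Path(tmp)
        if free_fraction(parent) < 0.35:  # assumed reading of "~35% free space"
            print("warning: less than 35% of this file system is free")
        for created in create_collection_layout(parent, "mycoll_98-99"):
            print(created.relative_to(parent))
```

    In real use, `parent` would be the directory chosen to hold the collection rather than a temporary directory.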

    So that a collection of both texts and indices is transportable (i.e., can be moved to a directory other than the one in which it was created), the index must be built at the root level. To do this, create symbolic links from the root directory to the actual physical location. For example, create a directory under root named /collectionname_yr-yr. Next, cd into that directory and create symbolic links to where the collection physically resides.
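
    The symbolic-link step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a ZPrise tool: `link_collection` and the names "disk2", "fakeroot", and "mycoll_98-99" are hypothetical, and a temporary directory stands in for the root directory so the example runs without special privileges.

```python
import os
import tempfile
from pathlib import Path


def link_collection(root: Path, collection: str, physical_dir: Path) -> list:
    """Create root/<collection> and symlink each entry of physical_dir into it."""
    link_dir = root / collection
    link_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for target in sorted(physical_dir.iterdir()):
        link = link_dir / target.name
        # Skip entries that are already linked so the call can be repeated.
        if not link.is_symlink() and not link.exists():
            os.symlink(target.resolve(), link, target_is_directory=target.is_dir())
        links.append(link)
    return links


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical physical location of the collection.
        physical = Path(tmp) / "disk2" / "mycoll_98-99"
        (physical / "mycoll_98-99_text").mkdir(parents=True)
        (physical / "mycoll_98-99_index").mkdir()
        # Stand-in for "/"; on a real system this would be Path("/").
        fake_root = Path(tmp) / "fakeroot"
        for link in link_collection(fake_root, "mycoll_98-99", physical):
            print(link.name, "->", os.readlink(link))
```

    Passing `Path("/")` as `root` would mirror the procedure described above, but creating a directory under root normally requires administrator rights.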

  3. What files are required for indexing and why?
    commonwords
    The commonwords file contains 24 commonly used words, such as "a", "the", and "and". The indexer does not index these words.
    fields.spec
    options.spec
    The options.spec file defines the options used by the indexer, such as which stemmer to use.
    sgmls.actions
    The sgmls.actions file contains a list of all elements (or tags) used in the collection. These elements should coincide with the elements found in the document-type-definition (DTD) file.
    text.list
    The text.list file lists each text file's name (with its path), followed by the name of its DTD file.
    title_tags
    The title_tags file contains the elements (both opening and closing tags) to be retrieved as titles. The list is in descending order of priority: the first element listed is tried first as the title, the second element is used if the first is not found, and so on.
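
    The title_tags fallback order described above can be illustrated with a short Python sketch. It does not reproduce how ZPrise reads the file. The line format (one opening tag and one closing tag per line, separated by whitespace) is an assumption, and the tag names HEADLINE and TITLE are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_title_tags(text: str) -> List[Tuple[str, str]]:
    """Parse lines of the assumed form '<TAG> </TAG>' into (open, close) pairs."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # ignore blank or malformed lines
            pairs.append((parts[0], parts[1]))
    return pairs


def pick_title(document: str, tag_pairs: List[Tuple[str, str]]) -> Optional[str]:
    """Return the text of the first listed element that occurs in the document."""
    for open_tag, close_tag in tag_pairs:
        start = document.find(open_tag)
        if start == -1:
            continue  # element absent: fall back to the next pair
        start += len(open_tag)
        end = document.find(close_tag, start)
        if end == -1:
            continue  # unclosed element: treat as not found
        return document[start:end].strip()
    return None


if __name__ == "__main__":
    # Hypothetical title_tags content: HEADLINE is tried before TITLE.
    tags = parse_title_tags("<HEADLINE> </HEADLINE>\n<TITLE> </TITLE>\n")
    doc = "<DOC><TITLE>Second choice</TITLE></DOC>"
    print(pick_title(doc, tags))  # prints: Second choice
```

    Because the document in the example has no HEADLINE element, the second entry in the list supplies the title.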
  4. How do I create those files required for indexing?
  5. What pre-index checks should I perform?
  6. What post-index checks should I perform?
    1. Verify the existence of:
    2. Check error for error/warning/fatal messages
    3. Check collstats for correct document count
    4. Check docmap_params for correct document count
    5. Ensure retrieval of docno using showdoc
  7. What index files are generated by build.script.sh?
  8. What post-index cleanup should I perform?
  9. How do I relocate my collection?

National Institute of Standards and Technology
Last updated: Monday, 06-Nov-2000 07:37:06 EST

Date created: Monday, 31-Jul-00