Scientific Report No. ISR-11 Information Storage and Retrieval
SOCCER - A Concordance Program
Guy E. Hochgesang
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
of the text begins from the INPUT tape.
As the cards of the text are read in, they are numbered sequentially
and then written out on the OUTPUT tape to provide a listing of the text.
Cards which have a "*" (the "space control character") in column one are
listed double-spaced; i.e., a blank line precedes their listing. In addition,
cards which have either a "*" or a "$" (the "skip control character") are
not included in the concordance. In effect this causes cards with an asterisk
or dollar sign in column one to be interpreted as comment cards.
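The listing and control-character rules above can be illustrated with a minimal Python sketch. It is not the original program: the function names, the width of the sequence number, and the choice to start numbering at 1 are assumptions made only for illustration.

```python
def classify_card(card: str) -> tuple[bool, bool]:
    """Return (double_space, skip) for one card image.

    Assumption: only column one (the first character) is inspected.
    """
    first = card[:1]
    double_space = first == "*"      # "*" is the space control character
    skip = first in ("*", "$")       # "*" and "$" both exclude the card
    return double_space, skip


def list_cards(cards):
    """Yield the lines of the listing: a sequence number, then the card.

    The 6-column number field is a hypothetical layout choice.
    """
    for number, card in enumerate(cards, start=1):
        double_space, _ = classify_card(card)
        if double_space:
            yield ""                 # blank line precedes the listed card
        yield f"{number:6d}  {card}"


def concordance_cards(cards):
    """Keep only the cards that take part in the concordance."""
    return [card for card in cards if not classify_card(card)[1]]
```

Under these assumptions every card is numbered and listed, including the comment cards, while only cards without a control character in column one reach the token scan.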
Any card which does not have the skip control character or space
control character in column one is included in the concordance. These cards
are scanned for the tokens of the concordance in the following steps:
1. The right-most non-blank character is found. If this character
is not a hyphen (a minus sign), step 2 is taken. If this
character is a hyphen, the character and all blanks to the
right of it are deleted. The next card is scanned from left
to right for alphabetic characters, with the scan terminating
at the first special character. These alphabetic characters
(if any) are then appended to the card with the hyphen and
step 2 is taken. This procedure allows one to hyphenate
words from one card to another, provided that the hyphen
follows immediately after the last alphabetic character on
the first card and that the syllable on the second card
starts in column one. Such hyphenated words appear in the
concordance with the syllables properly joined together and
the hyphen deleted.
2. If n consecutive blanks appear on the card, n-1 of these
blanks are deleted to allow as much significant context as
possible to be included with the tokens in step 3.
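Steps 1 and 2 can be sketched in Python as follows. This is an illustrative reading of the description, not the original code. The function names are hypothetical, and two points are assumptions: any non-letter (including a digit) is treated as a "special character" that ends the scan of the next card, and the next card itself is left unchanged, since the text does not say what happens to the syllable that was borrowed from it.

```python
import re


def join_hyphenated(card: str, next_card: str) -> str:
    """Step 1: join a word hyphenated across two cards."""
    stripped = card.rstrip(" ")          # locate the right-most non-blank
    if not stripped.endswith("-"):
        return card                      # no hyphen: card passes on as is
    # Leading run of letters on the next card, starting in column one.
    # The pattern always matches, possibly with an empty string.
    syllable = re.match(r"[A-Za-z]*", next_card).group(0)
    # Drop the hyphen (and the blanks after it), then append the syllable.
    return stripped[:-1] + syllable


def squeeze_blanks(card: str) -> str:
    """Step 2: reduce each run of n blanks to a single blank."""
    return re.sub(r" {2,}", " ", card)
```

For example, with a first card ending in `CONCOR-` and a second card beginning `DANCE,`, `join_hyphenated` yields a card ending in `CONCORDANCE`, and `squeeze_blanks` then compacts the result before the token scan.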
3. The card is then scanned for tokens. As each token is found
it is written out on an intermediate tape, SMRTAP, along with