ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
SOCCER - A Concordance Program
chapter
Guy E. Hochgesang
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
2. The Concordance
A. Definitions
SOCCER divides the standard character set into two groups: alphabetic
characters and special characters. Alphabetic characters are defined
to be the characters included in the concordance, while special characters
are those characters to be ignored during the generation of a concordance.
A token is defined to be any string of consecutive alphabetic characters
delimited by special characters, while a type is defined to be a class of
identical tokens. As an example, if one defines the alphabetic characters as
the letters of the alphabet, the string of characters "to be or not to be"
contains the six tokens to, be, or, not, to, and be, but it contains only the
four types to, be, or, and not.
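The token and type definitions above can be illustrated with a short modern sketch. It is written in Python purely for illustration and is not the program's own code. The names ALPHABETIC, tokens, and types are hypothetical, and the alphabetic set is assumed to be the lower-case letters, as in the example.

```python
import re
import string
from collections import Counter

# Assumption: the alphabetic characters are the 26 lower-case letters.
# Every other character (blank, punctuation, digits) is a special character.
ALPHABETIC = string.ascii_lowercase


def tokens(text, alphabetic=ALPHABETIC):
    """Return every maximal run of consecutive alphabetic characters.

    Special characters only delimit tokens; they never appear in one.
    """
    pattern = "[" + re.escape(alphabetic) + "]+"
    return re.findall(pattern, text)


def types(token_list):
    """Group identical tokens into types, with the token count of each type."""
    return Counter(token_list)


sample = "to be or not to be"
toks = tokens(sample)   # six tokens
typs = types(toks)      # four types: to, be, or, not
```

Changing the alphabetic set changes the result. If the hyphen were added to it, for instance, a hyphenated word would be a single token rather than two.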
B. The Input Text
The input text to SOCCER should be punched on cards in columns 1-72.
Columns 73-80 of the cards are ordinarily ignored and may be blank or contain
serial numbers. These cards must then be transferred to the INPUT tape in
unblocked BCD records of thirteen or more machine words. No special typing
conventions are necessary in punching the text cards. The end of the text
must be indicated by a card with "*STOP" punched left-justified in columns
1-6, or by an end-of-file on the INPUT tape following the last card of the
text.
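The card layout described above can be sketched as a small reader over 80-column card images. This is an illustration in Python, not the program's own input routine. The function read_text_cards is hypothetical, the stop keyword is a parameter whose default "*STOP" is an assumption, and running out of cards stands in for the end-of-file on the INPUT tape.

```python
def read_text_cards(cards, stop_word="*STOP"):
    """Yield the text field (columns 1-72) of each card up to the stop card.

    cards     -- an iterable of card images, one string per card
    stop_word -- assumed end-of-text keyword, left-justified in columns 1-6
    """
    for card in cards:
        card = card.rstrip("\n").ljust(80)     # pad short lines to a full card
        if card[:6].rstrip() == stop_word:     # columns 1-6 hold the keyword
            break
        yield card[:72]                        # columns 73-80 are ignored


deck = [
    "TO BE OR NOT TO BE".ljust(72) + "00000010",     # serial number in 73-80
    "THAT IS THE QUESTION".ljust(72) + "00000020",
    "*STOP",
    "THIS CARD IS NEVER READ",
]
text_fields = list(read_text_cards(deck))
```

Only the first two cards contribute text, and their serial numbers do not reach the text fields.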
C. Processing the Text
Before starting to process the input text, SOCCER first reads the control
cards from A2. (An explanation of the control cards will be found in Part
[illegible] of this report.) When the START control card is found, processing