This document describes how the reference (human generated data) will be compared to the hypothesis (system generated data) in the HUB-4 "templette" task. The scoring procedure assumes that a two-step information extraction method will be used to produce both the reference and the hypothesis.
In the first step, the source corpus is converted to a text corpus. The source corpus is a set of source documents. A source document may be a newswire article in ASCII text format, an audio or video file, a printed page, or something else. The exact definition may be found in the test procedures. The text corpus is one computer file, consisting of a sequence of text documents. The format of text documents is described below. The nature of this conversion depends on the format of the source corpus. The conversion may be trivial if the source corpus is already ASCII text, or may be non-trivial (e.g., if the source corpus is a set of audio files).
In the second step of the assumed pipeline, information from the text corpus is extracted and organized into a template set. A template set is one computer file. The format of template sets is detailed below.
The scoring procedure requires the reference and hypothesis template sets as input. If the reference and hypothesis text corpora differ, then they will also be required.
As mentioned above, a text corpus is a single computer file, consisting of a sequence of text documents. A pair of SGML tags, called the document tags, encloses each text document. Often the generic identifier for document tags is "DOC", but any generic identifiers may be used. All document tags in a text corpus should have the same generic identifier.
Each text document should have an SGML document identification element whose contents uniquely identify the text document in the text corpus. Often the generic identifier for document identification elements is "DOCNO" or "DOCID", but any generic identifier may be used (all document identification elements in a text corpus should use the same generic identifier).
The contents of the document identification element are used to associate information in an instance set with a text document, and also to associate information in a reference instance set with information in a hypothesis instance set.
The appendix contains a hypothetical text corpus, consisting of three text documents. The document tags are named "DOC", and the document identifier tags are named "DOCNO". It may be helpful to refer to the examples in the appendix in the following sections.
The appendix also contains hypothetical reference and hypothesis template sets, containing information extracted from the text documents.
The information extracted is contained in a single computer file. The information has a hierarchical structure. From top to bottom, the levels of the hierarchy are as follows:
At the top of the hierarchy, the entire file is a template set. A template set is a sequence of instance sets. If a text document is deemed to contain a reportable story, exactly one instance set should be created for that text document. If the document does not contain a reportable story, an "empty" instance set may be created, but this is not necessary for scoring.
Our example has three text documents in the text corpus. Only one of the documents was deemed to contain a reportable story, and in our example the other two instance sets are "empty".
An instance set is a collection of instances created from one text document. In our example, the last instance set of the template set is:
<TEMPLATE-PRI19980302.2000.2923-1> :=
    DOC_NR: PRI19980302.2000.2923 ##14#35#
    EVENT: <SPORTS_EVENT-PRI19980302.2000.2923-1>
    COMMENT: "No locations for earlier tournaments."

<SPORTS_EVENT-PRI19980302.2000.2923-1> :=
    S_EVENT: "African cup of nation soccer tournament" ##216#255#
           / "the African cup" ##401#416#
           / "the tournament" ##461#475#
    WINNER: "Egypt" ##332#337#
    LOSER: "defending champion [south Africa]" ##295#326#314#326#
    SCORE: "2-0" ##327#330#
    LOCATION: "south Africa" ##314#326#
            / "The host of the tournament" ##449#475#
    DATE: "03/02/1998" ##89#99#
    COMMENT: "location of earlier tournaments unstated"
The instance set in the previous section consists of two instances, one of type TEMPLATE, and one of type SPORTS_EVENT.
Each instance must contain a header and a body. The header consists of an instance pointer, followed by the string ":=" on the same line. For example, this is an instance header:
<SPORTS_EVENT-PRI19980302.2000.2923-1> :=

The body of an instance is a set of slots.
One slot in an instance may serve to mark the instance as optional. For instance, the above SPORTS_EVENT instance could be marked optional with an OBJ_STATUS slot:
<SPORTS_EVENT-PRI19980302.2000.2923-1> :=
    S_EVENT: ...
    WINNER: ...
    LOSER: ...
    SCORE: ...
    LOCATION: ...
    DATE: ...
    OBJ_STATUS: OPTIONAL
    COMMENT: ...
There are eight slots in the above instance, named S_EVENT, WINNER, LOSER, SCORE, LOCATION, DATE, OBJ_STATUS and COMMENT. Each slot consists of a slot name and a slot body. The slot name is a string of letters, numbers, hyphens or underscore characters, followed by a colon. It identifies the slot within the instance. The slot body is a set of fill-alternatives. Hypothesis slots may have only one fill-alternative. Reference slots may have several fill-alternatives. Fill-alternatives in a reference slot are separated by slash characters appearing as the first non-blank character on a line.
A fill-alternative is a set of single fills. Our example has no fill-alternatives consisting of more than one single fill, but such a fill-alternative is possible. For instance, if a SPORTS_EVENT instance was to be created for the 1986 African Cup, then the fill-alternative in the EVENT slot of the top-level TEMPLATE instance would contain two single fills of type pointer.
<TEMPLATE-PRI19980302.2000.2923-1> :=
    DOC_NR: PRI19980302.2000.2923 ##14#35#
    EVENT: <SPORTS_EVENT-PRI19980302.2000.2923-1>
           <SPORTS_EVENT-PRI19980302.2000.2923-2>
    COMMENT: "1986 tournament optional, since no location given"
(The instance <SPORTS_EVENT-PRI19980302.2000.2923-2> in the above EVENT slot would be marked optional with the OBJ_STATUS slot.)
The current task description says that only the EVENT slot of the TEMPLATE instance may have more than one single fill in a fill-alternative.
For the current task, there are two types of single fills: pointer fills and text fills.
Pointer fills refer to instances in a template set. The pointer format is used in both pointer fills and in the pointer part of instance headers. Here is a pointer in our example:
<SPORTS_EVENT-PRI19980302.2000.2923-1>

The entire pointer is enclosed in angle brackets, and consists of three character strings, separated by hyphens. The first string is the instance type, as defined in the task guidelines. The second string is the document identifier. The document identifier should be identical to the non-blank characters of the document identification element in the text document. The third string in a pointer is the instance's one-up number. It is used to identify the instance within the instance set. No two instances of the same type in the same instance set should have the same one-up number.
Here is a text fill from our example:
"defending champion [south Africa]" ##295#326#314#326#Each text fill consists of two parts, the content part and the extent part.
The content part of a text fill is a content string, optionally enclosed in double quotes. A content string is a character string copied from the text document. Content strings in the reference may have pairs of square brackets inserted in them, indicating minimal content strings, which are substrings of the maximal content string. All newline characters in a content string must be converted to space characters when the content string is copied into the text fill. This is the content part of the above text fill:
"defending champion [south Africa]"This content part consists of the maximal content string
"defending champion south Africa"and the minimal content string
The extent part of a text fill is a sequence of pairs of numbers. Both the pairs and the numbers within the pairs are separated by single hash characters. The extent part begins with double hash characters. This is the extent part of the above text fill:
##295#326#314#326#

There should be no spaces in the extent part. Each pair of numbers is called an extent. Extents specify the location of the beginning and ending of a piece of information in a document.
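The extent-part syntax is simple enough to parse mechanically. A minimal sketch (the function name is hypothetical) that turns an extent part into a list of (start, end) pairs, the first being the maximal extent:

```python
def parse_extents(extent_part):
    """Parse an extent part such as "##295#326#314#326#" into a list of
    (start, end) pairs. The first pair is the maximal extent; any further
    pairs (reference fills only) are minimal extents."""
    assert extent_part.startswith("##") and extent_part.endswith("#")
    nums = [int(n) for n in extent_part.strip("#").split("#")]
    assert len(nums) % 2 == 0, "extent offsets come in pairs"
    # pair up consecutive numbers: (nums[0], nums[1]), (nums[2], nums[3]), ...
    return list(zip(nums[0::2], nums[1::2]))
```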
It should be noted that what the numbers within extents actually represent varies. In the inputs to the current scoring procedure they are byte offsets (explained below). During scoring, they are transformed in an "extent normalization" step into units based on the phonetic alignment of the reference and hypothesis texts. In future scoring procedures, extents could possibly refer to the start and end times of a part of an audio recording.
The first integer of each extent (called the start offset) is the character index in the text document of the first character in a (maximal or minimal) content string. The second integer (the end offset) is the character index in the text document of the first character following the fill. Character indices are calculated by counting characters, starting at zero with the "<" character in the document tag (in our example, the document tag is "<DOC>").
One extent (a, b) is said to enclose another extent (c, d) iff

    a <= c <= b

and

    a <= d <= b

One extent (a, b) is said to overlap another extent (c, d) iff

    a <= c <= b

or

    a <= d <= b

or

    c <= a <= d

or

    c <= b <= d
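The two relations translate directly into code. A sketch (function names are ours, not the scoring software's):

```python
def encloses(outer, inner):
    """(a, b) encloses (c, d) iff a <= c <= b and a <= d <= b."""
    a, b = outer
    c, d = inner
    return a <= c <= b and a <= d <= b

def overlaps(e1, e2):
    """(a, b) overlaps (c, d) iff either extent contains an endpoint
    of the other."""
    a, b = e1
    c, d = e2
    return a <= c <= b or a <= d <= b or c <= a <= d or c <= b <= d
```

Note that enclosure is one-directional (the maximal extent encloses the minimal one, not vice versa), while overlap is symmetric.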
The first extent in the extent part of a fill gives the locations of the beginning and ending of the fill's maximal content string. In a reference fill, any extents after the first one give the locations of minimal content strings.
Because this document is concerned more with the mechanics of comparing reference and hypothesis and less with the meaning of the current task, different terms are used to avoid clashes.
The term "instance" refers to a "tuple" created in the extraction process. For the current task, there are both "templette" instances and "template" instances. A templette instance is created for each "event" of the general guidelines. To organize all the templette instances from one text document into a single structure, one template instance is also created. The set of all templette instances from one text document, together with the single "tie-up" template instance, is what we refer to as an instance set.
Some earlier scoring documentation [DOUTHAT_MESSAGE] used the term "multiple fill" to refer to what is called a "fill-alternative" in this document. The earlier term didn't capture the sense of "alternative-ness." In the current task it would be even more confusing, since the valence of all slots but the EVENT slot is one. Therefore, the current term has replaced the old one.
The scoring algorithm takes as input the reference and hypothesis template sets. In addition, if the text corpus used by the hypothesis is different from the text corpus used by the reference, the reference and hypothesis text corpora are also required.
If the hypothesis and reference texts differ, valid spelling differences will be removed by transforming non-normal (but valid) word spellings into normal ones. This will be done by means of a Global Mapping File, developed by NIST for use in their evaluations of automatic speech recognition systems [NIST_HUB4]. A global mapping file specifies a list of words that have a non-normal spelling, and for each word the normalized spelling.
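The effect of applying a global mapping file can be sketched as a word-by-word substitution. This is a simplification (the dictionary below is a hypothetical stand-in for a loaded Global Mapping File, whose actual format is defined by NIST, and the example mapping entry is invented for illustration):

```python
def normalize_spellings(text, mapping):
    """Replace each non-normal (but valid) word spelling with the normal
    spelling listed in `mapping`; words not listed pass through unchanged."""
    return " ".join(mapping.get(word, word) for word in text.split())
```

For example, with a mapping entry such as {"colour": "color"}, the text "the colour of money" would be normalized to "the color of money".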
If the reference and hypothesis text corpora differ, the extents in text fills will not be immediately comparable. If this is the case, the extents will be normalized, by first aligning the two corpora using a dynamic programming algorithm, then recalculating the extents based on the alignment [MITRE_MSCORE, BURGER_NAMED].
To illustrate extent normalization, here is a simplified example of a pair of different text documents from the same source document, with the character extent indices shown below the texts:
REF: <DOC> The h- heart of General Motors </DOC>
HYP: <DOC> A part of General Motors </DOC>
NDX: 0123456789012345678901234567890123456

And here is a pair of text fills from the text documents:

REF: "General Motors" ##22#36#
HYP: "General Motors" ##16#30#

(For brevity, the document identifier has been left out of the texts and the fills.) It can be seen that the extents in the text fills do not agree when based on character count.
If a phonetic alignment program [FISHER_TALD3E, PICONE_AUTO, FISHER_BETTER, FISHER_FURTHER] is used to align the two texts, the resulting alignment would look something like this:
REF: | THE | H- | HEART | of | general | motors |
HYP: | A   |    | PART  | of | general | motors |
NDX: 0     1    2       3    4         5        6

The normalized indices are calculated by counting the vertical bars produced by the alignment, rather than the number of characters. Rewriting the text fills with the normalized indices, we obtain

REF: "General Motors" ##4#5#
HYP: "General Motors" ##4#5#

and can see that the normalized extents do in fact match.
Before mapping the reference and hypothesis instance sets, premodifiers (e.g., the words "and", "a", and "the") will be removed from all fills and the corresponding extents adjusted.
Other substrings, such as certain punctuation marks, may be whited out: the unwanted text will be changed to whitespace, but the extents will not be adjusted. (When fills are compared for content, each whitespace string is changed to a single space character).
It is possible that a fill may begin and end in the same extractable SGML section, but include some non-extractable SGML section, as in the following example:
TEXT: And one final sports note -- today in Anchorage,...
      <ANNOTATION> (voice-over) </ANNOTATION> ...Alaska,
      the ceremonial start of the 26th
FILL: "Anchorage,... <ANNOTATION> (voice-over) </ANNOTATION> ...Alaska"

When this is the case, the non-extractable section will be whited out. The above example would then look like this (depending on what punctuation is removed):
WHITED-OUT FILL: "Anchorage Alaska"
The mapping of hypothesis template sets to reference template sets is the association of structures in the two template set hierarchies [CHINCHOR_FOUR]. Each structure in the hypothesis hierarchy is associated with at most one structure in the reference hierarchy and vice versa.
One structure may only be associated with another structure from the same level in the hierarchy: instance sets with instance sets, instances with instances, slots with slots, etc.
Further, if structure A is mapped to structure B, structure A's child structures may only map to structure B's child structures. For instance, if one instance is mapped to another instance, the first instance's slots may only be mapped to the other instance's slots.
To determine mappings of structures at several levels of the hierarchy, points are used. Points are categorized as correct, incorrect, missing, spurious, or unscored. At the bottom of the structure hierarchy, each mapping of either a single fill to another single fill or of a single fill to nothing results in one or two points. At any level in the structure hierarchy above the single fill, the points resulting from the mapping of one structure to another are determined by combining the points from the structures at the next lower level. For instance, the points obtained from mapping one instance to another are the sum of the points from the mappings of each of the slots of the instances (ignoring unscored slots). The ways points are combined at each level of the mapping are described below.
A simple greedy algorithm is used to map objects at several levels in the hierarchy: the points for every possible pairing of an unmapped reference structure with an unmapped hypothesis structure are computed, the best-scoring pair is mapped, and the process repeats until no further pairs can be formed.
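In outline, the greedy scheme scores every candidate (reference, hypothesis) pair, maps the best-scoring pair, and repeats with the remaining structures. A minimal sketch, with a hypothetical pairwise score function (the actual scorer may differ in details, for instance by declining to map zero-scoring pairs):

```python
def greedy_map(refs, hyps, score):
    """Greedily map reference structures to hypothesis structures.
    `score` is any pairwise scoring function; the same scheme is applied
    at several levels of the structure hierarchy.
    Returns a list of (ref_index, hyp_index) pairs."""
    # score all candidate pairs, best first
    pairs = sorted(((score(r, h), ri, hi)
                    for ri, r in enumerate(refs)
                    for hi, h in enumerate(hyps)),
                   reverse=True)
    mapping, used_r, used_h = [], set(), set()
    for _, ri, hi in pairs:
        # map a pair only if neither side has been mapped already
        if ri not in used_r and hi not in used_h:
            mapping.append((ri, hi))
            used_r.add(ri)
            used_h.add(hi)
    return mapping
```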
At the very top of the structure hierarchy, the mapping of the single reference template set to the single hypothesis template set is trivial. The points from the template set mapping are the sum of the points from each instance set mapping.
At the "instance set" level, a reference instance set is mapped to a hypothesis instance set based only on the document identifier used in the pointers of the instance sets' instance headers.
For example if the instance header lines from the reference template set are:
<TEMPLATE-VOA19980126.2100.1446-1> :=
<BREAD-VOA19980126.2100.1446-1> :=
<CIRCUS-VOA19980126.2100.1446-1> :=
<CIRCUS-VOA19980126.2100.1446-2> :=
<TEMPLATE-VOA19980302.1600.0096-1> :=
<TEMPLATE-VOA19980111.2300.0414-1> :=
<BREAD-VOA19980111.2300.0414-1> :=
<BREAD-VOA19980111.2300.0414-2> :=
<CIRCUS-VOA19980111.2300.0414-1> :=

and those from the hypothesis template set are:

<TEMPLATE-VOA19980126.2100.1446-1> :=
<BREAD-VOA19980126.2100.1446-1> :=
<BREAD-VOA19980126.2100.1446-2> :=
<CIRCUS-VOA19980126.2100.1446-2> :=
<TEMPLATE-VOA19980302.1600.0096-1> :=
<CIRCUS-VOA19980302.1600.0096-1> :=
<TEMPLATE-VOA19980111.2300.0414-1> :=
<CIRCUS-VOA19980111.2300.0414-1> :=
<CIRCUS-VOA19980111.2300.0414-2> :=
<CIRCUS-VOA19980111.2300.0414-3> :=

then at the instance set level, the mappings will be (not surprisingly):

VOA19980126.2100.1446 <---> VOA19980126.2100.1446
VOA19980302.1600.0096 <---> VOA19980302.1600.0096
VOA19980111.2300.0414 <---> VOA19980111.2300.0414

When one instance set is mapped to another, the points for the mapping are just the sum of the points from each instance mapping.
It is possible that the text corpus used to make the hypothesis will not be segmented into text documents. For instance, the hypothesis might be referring to the raw output of an automatic speech recognizer, which consists only of a list of words and SGML timestamp tags.
If this is the case, each hypothesis instance set extent will be defined as a pair (a, b), where a is the smallest offset of all text fills in a hypothesis instance set, and b is the largest offset of all text fills in the instance set.
The reference instance sets will always be based on segmented data. The reference instance set extent will consist of the smallest and largest possible offsets of the text document. When working with unsegmented hypothesis data, offsets will be measured relative to the beginning of the entire text corpus, rather than the beginning of each text document.
Two instance sets overlap if their instance set extents overlap.
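The instance set extent for unsegmented hypothesis data is just the tightest span covering all of the set's text fill offsets. A sketch (the function name is ours):

```python
def instance_set_extent(text_fill_extents):
    """The extent of an instance set: the smallest offset and the largest
    offset over all of its text fills, as a (start, end) pair."""
    starts = [s for s, _ in text_fill_extents]
    ends = [e for _, e in text_fill_extents]
    return (min(starts), max(ends))
```

Two instance set extents computed this way can then be tested for overlap using the overlap relation defined earlier.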
With unsegmented hypothesis texts, a modified form of the general greedy algorithm is used to map instance sets. The modification is the restriction that only instance sets which overlap may be mapped.
It can be shown that when the hypothesis text corpus is segmented the same way as the reference text corpus, and the hypothesis instance sets respect the segmentation, the two algorithms for mapping instance sets give the same results.
At the instance level, a reference instance may be mapped to a hypothesis instance only if the two are of the same type. The points for the mapping are the sums of the points from the mappings of the scored slots in the two instances. Slots which are specified as "unscored" do not contribute to the points of the instance mapping.
The above greedy algorithm is used to map reference instances of a single type to hypothesis instances of the same type.
The slots of a reference instance are mapped to the slots of a hypothesis instance by slot name. The points of a slot mapping are the sum of the points from the slots' fill-alternative mapping.
The one fill-alternative in a hypothesis slot is mapped to whichever of the fill-alternatives in the reference slot gives the best F-measure. Any leftover fill-alternatives in the reference are mapped to nothing and do not contribute any points.
The single fills in a hypothesis fill-alternative are mapped to the single fills in the reference fill-alternative using the general mapping greedy algorithm. The points from a single fill mapping are calculated based on the type of the single fill.
The mapping of one pointer fill to another gives one point. The point is correct if the instance referred to by the reference pointer is mapped to the instance referred to by the hypothesis pointer; otherwise the point is incorrect.
There are two ways to compare text fills. One way is by content, and one is by extent. Mapping one reference text fill to one hypothesis text fill can result in either one or two points. If only contents or only extents are compared, then there is one point per single fill mapping. If both content and extent are compared, then there are two points per mapping.
To compare content, the hypothesis content string is checked against the reference content strings: the content point is correct if the hypothesis content string is contained in the reference fill's maximal content string and contains one of its minimal content strings; otherwise the point is incorrect.
When the reference and hypothesis source documents are recordings of human speech, the corresponding text documents often contain words that should not be taken into account when comparing content. Pause fillers like "uh" and incomplete words like the one in "Glen Buni- Bunting" should be treated as optional content words. If they are in a reference content fill (maximal or minimal) but not in the hypothesis content fill, they should be ignored.
When comparing content for noisy data, the content string will be broken into tokens, some of which may be optional. Then proceed as for clean data, except that strings of tokens will be compared, rather than strings of characters, and some of the reference tokens will be optional.
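A token-level comparison with optional tokens can be sketched as follows. This is an illustrative sketch only: the tokenizer, the optional-word list, and the function name are all assumptions, not part of the specification:

```python
def content_matches(ref_tokens, hyp_tokens, optional):
    """True if the hypothesis token string equals the reference token
    string once reference tokens in `optional` (pause fillers,
    incomplete words) that are absent from the hypothesis are ignored."""
    i = 0
    for tok in hyp_tokens:
        # skip optional reference tokens the hypothesis left out
        while (i < len(ref_tokens) and ref_tokens[i] in optional
               and ref_tokens[i] != tok):
            i += 1
        if i >= len(ref_tokens) or ref_tokens[i] != tok:
            return False
        i += 1
    # any trailing reference tokens must all be optional
    return all(t in optional for t in ref_tokens[i:])
```

With the incomplete word "buni-" marked optional, the reference "glen buni- bunting" matches the hypothesis "glen bunting".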
To compare extents, the hypothesis fill's extents are checked against the reference fill's extents: if the reference fill's maximal extent encloses the hypothesis fill's extents, and the hypothesis fill's extents overlap one of the reference fill's minimal extents, the extent point is correct. Otherwise, it is incorrect.
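The extent check combines the enclose and overlap relations defined earlier. A sketch; note the fallback for a reference fill with no minimal extents is our assumption (the document does not address that case):

```python
def extent_point(ref_extents, hyp_extents):
    """Extent comparison: correct iff the reference maximal extent
    encloses the hypothesis extents and the hypothesis extents overlap
    one of the reference minimal extents. ref_extents[0] is the maximal
    extent; if no minimal extents are listed, the maximal extent is used
    in their place (an assumption)."""
    def encloses(o, i):
        return o[0] <= i[0] <= o[1] and o[0] <= i[1] <= o[1]

    def overlaps(x, y):
        a, b = x
        c, d = y
        return a <= c <= b or a <= d <= b or c <= a <= d or c <= b <= d

    maximal = ref_extents[0]
    minimals = ref_extents[1:] or [maximal]
    if (all(encloses(maximal, h) for h in hyp_extents)
            and any(overlaps(h, m) for h in hyp_extents for m in minimals)):
        return "correct"
    return "incorrect"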
The content and extent points are determined independently for a single fill pair. However, as stated previously, only one single fill from the reference may map to the single fill in the hypothesis. If the system locates the correct extent, but the contents at that location differ, the content of a different alternative which matches may not be mapped. For example, if two date slots were mapped like this:
REFERENCE                    HYPOTHESIS
DATE: thirsty ##10#20#       DATE: thirsty ##99#109#
    / thursday ##99#109#

then either the extent or the content could be counted correct, but not both.
Given a set of points, several values are calculated during the alignment and final scoring. In the formulas below, COR, INC, MIS, and SPU denote the numbers of correct, incorrect, missing, and spurious points.
POS = COR + INC + MIS
ACT = COR + INC + SPU
      COR
REC = ---
      POS

      COR
PRE = ---
      ACT

    (beta^2 + 1.0) * PRE * REC
F = --------------------------
       (beta^2 * PRE) + REC

where beta is the relative weight of precision and recall. When precision and recall are given equal weight, the value for beta is 1. Substituting 1 for beta, and the previous formulas for precision and recall, the above formula simplifies to

     2 * COR
F = ---------
    POS + ACT
The following measures are also calculated from the points:
      MIS
UND = ---
      POS

      SPU
OVG = ---
      ACT

      INC
SUB = ---
      COR

      INC + SPU + MIS
ERR = ---------------------
      COR + INC + SPU + MIS
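The measures above can be computed directly from the four point counts. A sketch (the function is hypothetical; the document does not say how zero denominators are handled, so this sketch returns 0.0 in that case):

```python
def scores(cor, inc, mis, spu):
    """Compute the measures defined above from the point counts
    COR, INC, MIS, and SPU, with beta = 1 for the F-measure."""
    def ratio(n, d):
        return n / d if d else 0.0

    pos = cor + inc + mis
    act = cor + inc + spu
    return {
        "POS": pos,
        "ACT": act,
        "REC": ratio(cor, pos),
        "PRE": ratio(cor, act),
        "F": ratio(2.0 * cor, pos + act),   # simplified form for beta = 1
        "UND": ratio(mis, pos),
        "OVG": ratio(spu, act),
        "SUB": ratio(inc, cor),
        "ERR": ratio(inc + spu + mis, cor + inc + spu + mis),
    }
```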
Here is a hypothetical text corpus for an imaginary sports event templette task:
<DOC>
<DOCNO> ABC19980307.1830.1415 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 03/07/1998 18:53:35.76 </DATE_TIME>
<BODY>
<HEADLINE> SPORTS </HEADLINE>
Byline:JOHN FRANKEL, AARON BROWN
High:MIKE TYSON SUES DON KING FOR MISMANAGEMENT
Spec:SPORTS / CASEY MARTIN / BASEBALL / MIKE TYSON
[USE # 4 OF THIS PREAMBLE]
<TEXT>
And one final sports note -- today in Anchorage,...
<ANNOTATION> (voice-over) </ANNOTATION>
...Alaska, the ceremonial start of the 26th Iditarod dog sled race. The
race officially starts tomorrow when 63 teams take off on their 1,100
mile trek to Nome. First price is worth $50,000.
<ANNOTATION> (on camera) </ANNOTATION>
And I think the whole crew and I are happy we are here in warmer
Austin, Texas. That does it for sports. Aaron?
<TURN>
<ANNOTATION> spkr:AARON_BROWN </ANNOTATION>
John, thank you very much.
</TEXT>
</BODY>
<END_TIME> 03/07/1998 18:54:01.67 </END_TIME>
</DOC>

<DOC>
<DOCNO> PRI19980302.2000.2923 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 03/02/1998 20:48:43.85 </DATE_TIME>
<BODY>
<TEXT>
The followup now to a story we reported last Friday. Egypt has won its
first African cup of nation soccer tournament in 12 years over the
weekend. Beating defending champion south Africa 2-0. Egypt, which
failed to qualify for this Summer's world cup, also won the African cup
title in 1957, 1959, and 1986. The host of the tournament turned in a
stellar performance, placing fourth.
</TEXT>
</BODY>
<END_TIME> 03/02/1998 20:49:16.69 </END_TIME>
</DOC>

<DOC>
<DOCNO> PRI19980317.2000.2025 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 03/17/1998 20:33:45.13 </DATE_TIME>
<BODY>
<TEXT>
In Boston, I'm Lisa mullens. For a couple of years now, Boris Yeltsin's
health has prompted concerns about his ability to govern Russia. Just
last Wednesday Yeltsin offered to prove to reporters that he is in
perfect shape.
<TURN>
Tell me the kind of sport you want me to challenge you in and I'm on my
way to the sports ground. Tell me. Let's go to the swimming pool, to a
tennis court or to a running track. Let's do it.
<TURN>
But today Boris Yeltsin's latest illness forced the postponement of
Thursday's scheduled summit of the presidents of former Soviet
republics. The Russian leader said to have a severe cold and a bad
cough. The president's health isn't as good as he'd like people to
think.
</TEXT>
</BODY>
<END_TIME> 03/17/1998 20:37:31.82 </END_TIME>
</DOC>
Here is a sample reference template set, corresponding to the above text corpus.
<TEMPLATE-ABC19980307.1830.1415-1> :=
    DOC_NR: ABC19980307.1830.1415 ##14#35#
    COMMENT: "Race hasn't started yet"

<TEMPLATE-PRI19980317.2000.2025-1> :=
    DOC_NR: PRI19980317.2000.2025 ##14#35#
    COMMENT: "Challenges only, no event"

<TEMPLATE-PRI19980302.2000.2923-1> :=
    DOC_NR: PRI19980302.2000.2923 ##14#35#
    EVENT: <SPORTS_EVENT-PRI19980302.2000.2923-1>
    COMMENT: "No locations for earlier tournaments."

<SPORTS_EVENT-PRI19980302.2000.2923-1> :=
    S_EVENT: "African cup of nation soccer tournament" ##216#255#
           / "the African cup" ##401#416#
           / "the tournament" ##461#475#
    WINNER: "Egypt" ##332#337#
    LOSER: "defending champion [south Africa]" ##295#326#314#326#
    SCORE: "2-0" ##327#330#
    LOCATION: "south Africa" ##314#326#
            / "The host of the tournament" ##449#475#
    DATE: "03/02/1998" ##89#99#
    COMMENT: "location of earlier tournaments unstated"
Here is a sample hypothesis template set, corresponding to the above text corpus.
<TEMPLATE-ABC19980307.1830.1415-1> :=
    DOC_NR: ABC19980307.1830.1415 ##14#35#

<TEMPLATE-PRI19980317.2000.2025-1> :=
    DOC_NR: PRI19980317.2000.2025 ##14#35#

<TEMPLATE-PRI19980302.2000.2923-1> :=
    DOC_NR: PRI19980302.2000.2923 ##14#35#
    EVENT: <SPORTS_EVENT-PRI19980302.2000.2923-1>

<SPORTS_EVENT-PRI19980302.2000.2923-1> :=
    S_EVENT: "African cup of nation soccer tournament" ##216#255#
    WINNER: "Egypt" ##332#337#
    LOSER: "defending champion south Africa" ##295#326#
    SCORE: "2-0" ##327#330#
    LOCATION: "south Africa" ##314#326#
    DATE: "03/02/1998" ##89#99#