The Message Understanding Conferences (MUCs) are gatherings of researchers in computational linguistics. All participants in the conference develop software systems which perform natural language understanding tasks defined by the conference committee. The systems are evaluated based on how their output compares with the output of human linguists. The MUC scoring software is used in that comparison.
(This manual will briefly describe the MUC tasks from the standpoint of how the scoring software is used. For detailed, up-to-date descriptions of the tasks, please refer to the various task definition documents for the conference.)
All tasks involve reading in a set of documents from one computer file. The documents contain text from periodicals or some other source of natural language, so the input file is called a texts file. The texts file is analyzed, and a set of objects is produced and printed to a single output file. For each document in the input file there are, in general, several objects produced. The format of the objects varies with the task, although the scorer uses the same internal representation of objects for all tasks.
When the scorer is run, it reads in an output file produced by a human, containing keys, and an output file produced by a software system, containing responses. The scorer aligns objects in the key file with objects in the response file. It then calculates various scores based on how well the responses agree with the keys. At present, there are two ways that the scores are calculated. In four of the five tasks, scores are based on counting how many fills (which are substructures of the objects, and are described below) agree for each aligned object. In the coreference task, the objects in the key file are grouped into equivalence classes, as are the objects in the response file. Scores for coreference are based on how well the equivalence classes agree.
Internally, all objects compared by the scorer have the same basic hierarchical structure. We'll start at the bottom of this hierarchy and work our way up to the objects themselves. The figure above shows the terms we will introduce in relation to an object from a fictitious information extraction task.
For scoring, single fills are treated as "atoms." When objects are aligned, the objects' single fills are also aligned, and it is the results of the single fill alignments which are tallied up to get the final score (in all tasks but Coreference). There are three types of single fills:
A multi-fill is a group of single fills.
A slot holds a set of multi-fills. Slots in key objects may hold more than one multi-fill; slots in response objects may hold only one. When a key slot has more than one multi-fill, each key multi-fill is an alternative. The response slot's lone multi-fill is aligned with whichever key multi-fill results in the best score for the alignment. The unaligned multi-fills from the key are then non-committal. A slot also has a name, which distinguishes it from other slots in the top-level object.
A top-level object consists of some identification information and a set of slots. The identification information consists of an id string for the document from which the object was extracted (the document number), the object's type, and a string which distinguishes that object from all other objects of that type in that document. This string is sometimes called the one-up string.
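To make the hierarchy concrete, here is a rough sketch in Python of how these structures might be modeled. This is only an illustration; the class and field names are invented here and are not the scorer's actual internal representation.

from dataclasses import dataclass, field
from typing import List

# Illustrative model only; names are invented for this sketch.
SingleFill = str                       # a string fill, a set fill token, or a pointer

@dataclass
class MultiFill:
    fills: List[SingleFill]            # a group of single fills

@dataclass
class Slot:
    name: str                          # e.g. "ORG_NAME"
    multi_fills: List[MultiFill]       # keys may hold alternatives; responses hold one

@dataclass
class TemplateObject:
    doc_num: str                       # document number, e.g. "9303020074"
    obj_type: str                      # e.g. "ORGANIZATION"
    one_up: str                        # the "one-up" string
    slots: List[Slot] = field(default_factory=list)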
In a template file, the objects produced are in the form of records. An example object from a response file is shown here:
<ORGANIZATION-9303020074-1> :=
    ORG_NAME: "Evergreen Information Technologies Inc."
    ORG_ALIAS: "Evergreen Information Technologies"
               "Evergreen"
               "Evergreen Information"
    ORG_LEADER: <PERSON-9303020074-57>
    ORG_TYPE: COMPANY

Each record consists of a header and a list of slots. The header is an identification string for the object, followed by the token ":=" on the same line. The header's identification string is enclosed in angle brackets, and consists of three pieces of information:
Each slot in the body of the record consists of a slot name, followed by a colon, and the slot's fills. Set fills and string fills may be enclosed in matching single or double quotes. The format of pointer fills is the same as that of the string which identifies an object in its header.
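As an illustration of the header format, a hypothetical helper like the following could pull the three pieces of information out of a header line. The regular expression and function name are assumptions for this sketch (it also assumes document numbers contain no hyphens), not part of the scorer.

import re

# Matches headers like "<ORGANIZATION-9303020074-1> :=" and captures the
# object type, the document number, and the one-up string.
HEADER_RE = re.compile(r'<([A-Z_]+)-([^-\s>]+)-([^>\s]+)>\s*:=')

def parse_header(line):
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    obj_type, doc_num, one_up = m.groups()
    return obj_type, doc_num, one_up

print(parse_header('<ORGANIZATION-9303020074-1> :='))
# ('ORGANIZATION', '9303020074', '1')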
Here is an example of a record from a human-generated key file:
<ORGANIZATION-9303020074-1> :=
    ORG_NAME: "Evergreen Information Technologies Inc."
    ORG_TYPE: /COMPANY
    ORG_ALIAS: "Evergreen Information Technologies"
               "Evergreen"
               "Evergreen Information"
              /"Evergreen"
               "Evergreen Information"
    ORG_LEADER: <PERSON-9303020074-57>
    OBJ_STATUS: OPTIONAL

Key objects differ from response objects in a few respects:
The information extraction task descriptions often include a BNF which describes the different types of objects in the task. The scorer makes some further assumptions about the format of template files which are not specified in the BNFs:
SLOT_NAME: "a slot fill" ##392#404#textsfilenameThis is a pair of pound signs ("##") followed by the "start offset" of the fill, then a single pound sign followed by the "end offset" of the fill, then another single pound sign, followed by the name of the texts file. None of the offset information is used in scoring, but it may be used in later versions of the scorer to highlight portions of the texts file. At present the scorer reads the start offset and end offset, but ignores the name of the texts file. The texts file name should not contain any pound signs.
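A small sketch of how such a suffix might be picked apart, assuming the fill line format shown above (the regular expression and function name are invented for this illustration):

import re

# "##<start>#<end>#<textsfilename>" appended after a fill.
OFFSET_RE = re.compile(r'##(\d+)#(\d+)#(\S*)')

def parse_offsets(fill_line):
    """Return (start, end, textsfile) if the fill carries offset information."""
    m = OFFSET_RE.search(fill_line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)

print(parse_offsets('SLOT_NAME: "a slot fill" ##392#404#textsfilename'))
# (392, 404, 'textsfilename')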
The coreference and named entity tasks involve adding Standard Generalized Markup Language (SGML) to the texts file to create the key and response files.
SGML is a very flexible and powerful language for adding structure to computer documents. The MUC scoring software recognizes a subset of SGML when it scores the coreference and named entity tasks. This discussion is a (very) simplified description of SGML.
An SGML tag is a character string inserted into a text file. Tags usually come in pairs, consisting of an open tag and a close tag, which together enclose a section of the text. For example, here is a piece of text, followed by the same text with some SGML tags added.
Be glad you don't work
On the Bungle-bung bridge,
That they're building
Across Boober Bay at Bum Ridge.

<ADVICE>
Be glad you don't work
On the <STRUCTURE>Bungle-bung bridge</STRUCTURE>,
That they're building
Across <BODY TYPE="WATER">Boober Bay</BODY> at <LOC>Bum Ridge</LOC>.
</ADVICE>
Open tags start with an open angle bracket, followed immediately by the generic identifier for that type of tag. Next comes a sequence of attribute definitions for that type of tag. The end of an open tag is the close angle bracket. Close tags start with an open angle bracket, then a slash, then the same generic identifier as the corresponding open tag. Close tags don't have attributes.
In the above example, the four tag pairs have generic identifiers ADVICE, STRUCTURE, BODY, and LOC. Only the BODY tag has an attribute, named TYPE, with a value of WATER.
In all MUC tasks, the texts file already has some SGML tags. In the coreference and named entity tasks, the annotators and systems add more tags to the texts to create the keys and responses. The scoring software converts the tags (together with the text they enclose) into objects which have the same internal structure as the objects for the information extraction tasks.
For example, here is some text marked up with TIMEX tags, which were part of the MUC-6 named entity task.
<TIMEX TYPE="DATE" ALT="fiscal 1994">the first six months of fiscal 1994</TIMEX>The scorer would convert the text into an object which in a template file would look like this:
<TIMEX-DOCNUM1-1> :=
    TEXT: "the first six months of fiscal 1994"
         /"fiscal 1994"
    TYPE: DATE
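For illustration, the conversion for this one example could look roughly like the sketch below, which treats the ALT attribute as an alternative TEXT fill, as in the object shown above. The regular expressions and function name are invented here and are not the scorer's own code.

import re

TAG_RE = re.compile(r'<TIMEX\s+([^>]*)>(.*?)</TIMEX>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def timex_to_object(sgml):
    """Turn one TIMEX element into TEXT/TYPE fills (sketch only)."""
    m = TAG_RE.search(sgml)
    attrs = dict(ATTR_RE.findall(m.group(1)))
    text_fills = [m.group(2)]
    if 'ALT' in attrs:
        text_fills.append(attrs['ALT'])      # alternative TEXT fill
    return {'TEXT': text_fills, 'TYPE': [attrs.get('TYPE')]}

print(timex_to_object('<TIMEX TYPE="DATE" ALT="fiscal 1994">'
                      'the first six months of fiscal 1994</TIMEX>'))
# {'TEXT': ['the first six months of fiscal 1994', 'fiscal 1994'], 'TYPE': ['DATE']}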
In the coreference and named entity tasks, there are some things to be careful of when you are preparing keys or responses. One is not to delete or insert any characters outside of the SGML tags. Doing so almost always confuses the scoring software and lowers the score. To see if you've changed anything you shouldn't have, you can use the unix "sed" command, or something similar, as in this example with the coreference tags:
unix% sed 's/<COREF[^>]*>//g' rsp | sed 's/<\/COREF[^>]*>//g' >rsp.notags
unix% diff texts rsp.notags
The sed command above removes the COREF tags from the responses file (named rsp), and then compares what's left to the original texts file (named texts). The diff command will then show what part of the original texts file has been changed.
The MUC scoring software prints several reports to show how the key and response compared. There is a score report, which shows only "the numbers." There is also a report summary, which shows in more detail how the key and response objects were aligned. For the coreference task, there is a "partitions" file, which shows how the key and response equivalence classes compared. And there is a "map history" file, which gives a detailed, if not very readable, description of how the objects were aligned.
The "report summary" files show how the fills and objects of the keys and responses align. There are three types of report summary files: one for the coreference task, one for the named entity task, and one for the information extraction tasks.
Document 930620083
COR "Clinton"                              "Clinton"
COR "Clinton"                              "Clinton"
COR "the White House"                      "White House"
COR "The current briefing room"            "The current briefing room"
MIS "allies of the securities exchanges"   ""
MIS "securities"                           ""
MIS "Clinton transition officials"         ""
MIS "government"                           ""
MIS "the committee"                        ""
SPU ""                                     "Kitty Higgins"
SPU ""                                     "an aide"
SPU ""                                     "Michigan"
OPT "the Clinton camp"                     ""
OPT "the government"                       ""
OPT "briefing"                             ""
A coreference report summary shows how the COREF objects were aligned by the scorer. Each line has three fields. The first field is a three-letter abbreviation telling how a pair of objects were aligned. The abbreviations are:
---------------------------------------------------------------------------------------------------------
Document 930620083
TAG     TYPE  TEXT  KEY_TYPE  RSP_TYPE      KEY_TEXT               RSP_TEXT
---------------------------------------------------------------------------------------------------------
ENAMEX  cor   cor   PERSON    PERSON        "Consuela Washington"  "Consuela Washington"
ENAMEX  cor   inc   PERSON    PERSON        "John Dingell"         "Washington"
ENAMEX  cor   inc   PERSON    PERSON        "Carter"               "Tim Wirth"
TIMEX   cor   cor   DATE      DATE          "01/19/93"             "01/19/93"
ENAMEX  mis   mis   PERSON                  "Washington"           ""
ENAMEX  spu   spu             ORGANIZATION  ""                     "Exchange"
ENAMEX  spu   spu             ORGANIZATION  ""                     "Old Executive Office"
The named entity report summary file gives a one-line-per-object-pair description of how the objects were aligned. Each line has seven fields. The first is the generic identifier of the tag which defines the object. The second and third contain three-letter abbreviations for how the key and response objects or fills compared. The abbreviations are:
The fourth and fifth fields are the key and response TYPE fills, if there are any. The sixth and seventh fields are the key and response TEXT fields. If the key contained more than one TEXT fill (through use of the ALT attribute), the one that was aligned with the response fill is the one shown.
If you are interested in seeing all alternatives, you can specify that you want to use the information-extraction-style report summary files. Just include the line
:use_IE_report_summary yes

somewhere in the configuration file.
COR |                 | <PERSON-9301060123-8>              | <PERSON-9301060123-8>
cor | PER_NAME:       | Joe Roth                           | JOE ROTH
cor | PER_ALIAS:      | Roth                               | ROTH
cor | PER_TITLE:      | Mr.                                | MR.
COR |                 | <PERSON-9301060123-2>              | <PERSON-9301060123-2>
inc | PER_NAME:       | Rupert Murdoch                     | MURDOCH
cor | PER_TITLE:      | Mr.                                | MR.
mis | PER_ALIAS:      | Murdoch                            |
SPU |                 |                                    | <PERSON-9301060123-5>
spu | PER_NAME:       |                                    | SMITH BARNEY
SPU |                 |                                    | <PERSON-9301060123-12>
spu | PER_NAME:       |                                    | RUPERT
COR |                 | <ORGANIZATION-9301130133-1>        | <ORGANIZATION-9301130133-1>
cor | ORG_NAME:       | EMI Records Group                  | EMI RECORDS GROUP
cor | ORG_TYPE:       | COMPANY                            | COMPANY
mis | ORG_ALIAS:      | EMI Records                        |
mis | ORG_DESCRIPTOR: | a unit of London's Thorn EMI PLC   |
COR |                 | <ORGANIZATION-9301130133-2>        | <ORGANIZATION-9301130133-2>
uns | COMMENT:        | the alias 'EMI' is here assumed... |
cor | ORG_NAME:       | Thorn EMI PLC                      | THORN EMI PLC.
cor | ORG_TYPE:       | COMPANY                            | COMPANY
mis | ORG_ALIAS:      | EMI                                |
mis | ORG_LOCALE:     | London CITY                        |
mis | ORG_COUNTRY:    | United Kingdom                     |
The information extraction report summary files have four columns. The first column shows the result of the pairing on that line. Uppercase values are for object comparisons, and lowercase values are for single fill comparisons. Possible values are:
The second column shows the name of the slot for the key and response objects on that line (and on the following lines, if there are multiple fills in the slot).
The third and fourth columns show the key and response object records, respectively.
For the coreference task, there is an extra report generated, which shows the COREF objects' equivalence classes, and how they are partitioned by the comparison between keys and responses. Key equivalence classes are surrounded by star characters (*****), and response equivalence classes by equal signs (=====).
Here is a portion of a partition file that gives one key equivalence class from a MUC-6 document.
****************************************************
C    88  116   1  NULL  108  "Washington, an Exchange Ally,"
C   581  609   4     0   39  "Ms. Washington, 44 years old,"
C   741  754   8     4   40  "Ms. Washington"
C   828  830   9     8   43  "her"
C   916  918  12     9   42  "She"
C   961  974  15    12   41  "Ms. Washington"
C  1124 1171  20    15   48  "A graduate of Harvard Law School, Ms. Washington"
C  1257 1259  22    20   49  "She"
M   376  454   0     1       "Consuela Washington, an expert in securities laws,"
****************************************************

Each line containing COREF objects begins with a "C" or an "M", for "correct" or "missing." Correct objects' lines have, in order from left to right,
The "map history" output file is meant primarily for other computer programs to read. It consists of one large Tcl-style list. Each element of this list is itself a list which corresponds to one "document" from the keys and/or responses file. The document lists also contain lists, and this nesting of lists continues on down to the single fill level. Lists in the hierarchy consist of attribute name/attribute value pairs. Attribute names start with a hyphen.
In hierarchy order, the attributes are:
Attribute values are strings, lists, integers, or nonexistent if the attribute's presence alone implies something. At present, the attributes are:
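Since the file is one large Tcl-style list, one way for another program to take it apart is to let a Tcl interpreter do the splitting, as in the Python sketch below. The file name "map_history" is a hypothetical example; use whatever name the scorer was told to write.

import tkinter

tcl = tkinter.Tcl()                      # a Tcl interpreter with no GUI

def split_tcl_list(s):
    """Split one level of a Tcl-style list into its elements."""
    return list(tcl.splitlist(s))

with open('map_history') as f:           # hypothetical file name
    documents = split_tcl_list(f.read()) # one element per document

for doc in documents:
    for item in split_tcl_list(doc):     # attribute names (starting with "-") and values
        print(item)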
The figure below shows one page from a scores file for the MUC-6 scenario template task. There is one page of scores for each document in the task, plus one page for the totals over all documents. Each page is divided into four sections. The first section shows the "text filtering" or "relevance" scores. These have to do with judging whether each document is even relevant to the scenario the NLP system should be looking for. The second section gives the object scores, which show how the keys and responses agree at the object level. The third section shows how well the keys and responses agree at the slot fill level. Only the slot scores determine the final scores, which are the last thing on the page.
The template element and template relation score reports are identical to the scenario template score reports, except that they have no text filtering section.
----------------------------------------------------------------------------
              POS  ACT| COR PAR INC| MIS SPU NON| REC PRE UND OVG SUB ERR
----------------------+------------+------------+------------------------
TEXT FILTERI  100  100|  86   0  14|   0   0   0|  86  86   0   0  14  14
----------------------+------------+------------+------------------------
OBJ SCORES            |            |            |
template        0    0|   0   0   0|   0   0 100|   0   0   0   0   0   0
succession_e  195  197| 131   0   6|  58  60  13|  67  66  30  30   4  49
in_and_out    256  310| 164   0   6|  86 140  20|  64  53  34  45   4  59
organization  110   72|  51   0   1|  58  20  10|  46  71  53  28   2  61
person        130  138|  90   0   5|  35  43   7|  69  65  27  31   5  48
----------------------+------------+------------+------------------------
SLOT SCORES           |            |            |
template              |            |            |
 doc-nr         0    0|   0   0   0|   0   0 100|   0   0   0   0   0   0
 content      195  197| 131   0   6|  58  60  51|  67  66  30  30   4  49
 comment        0    0|   0   0   0|   0   0  15|   0   0   0   0   0   0
succession_e          |            |            |
 success_org  193  128|  69   0  26|  98  33  15|  36  54  51  26  27  69
 post         195  170|  65   0  51|  79  54  35|  33  38  41  32  44  74
 in_and_out   251  191|  55   0  25| 171 111  26|  22  29  68  58  31  85
 vac_reason   195  197|  63   0  74|  58  60  36|  32  32  30  30  54  75
 comment        0    0|   0   0   0|   0   0 273|   0   0   0   0   0   0
in_and_out            |            |            |
 io_person    254  304| 126   0  39|  89 139  22|  50  41  35  46  24  68
 new_status   256  304| 138   0  32|  86 134  20|  54  45  34  44  19  65
 on_the_job   256  310| 107   0  63|  86 140  73|  42  35  34  45  37  73
 other_org    168    5|   3   0   2| 163   0  47|   2  60  97   0  40  98
 rel_oth_org  172    5|   3   0   2| 167   0  34|   2  60  97   0  40  98
 comment        0    0|   0   0   0|   0   0 399|   0   0   0   0   0   0
organization          |            |            |
 name         108   69|  28   0  21|  59  20  10|  26  41  55  29  43  78
 alias         65   42|  12   0   4|  49  26  16|  18  29  75  62  25  87
 descriptor    64    2|   0   0   2|  62   0  50|   0   0  97   0 100 100
 type         110   69|  50   0   1|  59  18  12|  45  72  54  26   2  61
 locale        41    7|   4   0   3|  34   0   8|  10  57  83   0  43  90
 country       41    7|   6   0   1|  34   0   5|  15  86  83   0  14  85
 comment        0    0|   0   0   0|   0   0  15|   0   0   0   0   0   0
person                |            |            |
 name         130  138|  82   0  13|  35  43   7|  63  59  27  31  14  53
 alias         83   79|  56   0   3|  24  20   5|  67  71  29  25   5  46
 title         79   78|  60   0   0|  19  18   5|  76  77  24  23   0  38
 comment        0    0|   0   0   0|   0   0   1|   0   0   0   0   0   0
----------------------+------------+------------+------------------------
ALL SLOTS    2856 2307|1058   0 368|1430 881 1280|  37  46  50  38  26  72

               P&R      2P&R      P&2R
F-MEASURES   40.98     43.78     38.53
Here is a page from a score report for the named entity task:
----------------------------------------------------------------------------
              POS  ACT| COR PAR INC| MIS SPU NON| REC PRE UND OVG SUB ERR
----------------------+------------+------------+------------------------
SUBTASK SCORES        |            |            |
enamex                |            |            |
 organizatio  443  444| 405   0  18|  20  21  18|  91  91   5   5   4  13
 person       373  371| 364   0   2|   7   5   0|  98  98   2   1   1   4
 location     110  122| 109   0   0|   1  13   3|  99  89   1  11   0  11
 other          0    0|   0   0   0|   0   0   0|   0   0   0   0   0   0
timex                 |            |            |
 date         111  112| 107   0   0|   4   5   6|  96  96   4   4   0   8
 time           0    0|   0   0   0|   0   0   0|   0   0   0   0   0   0
 other          0    0|   0   0   0|   0   0   0|   0   0   0   0   0   0
numex                 |            |            |
 money         76   76|  73   0   0|   3   3   0|  96  96   4   4   0   8
 percent       17   25|  17   0   0|   0   8   0| 100  68   0  32   0  32
 other          0    0|   0   0   0|   0   0   0|   0   0   0   0   0   0
----------------------+------------+------------+------------------------
SECT SCORES           |            |            |
 Header       244  256| 233   0   9|   2  14   8|  95  91   1   5   4  10
 Body        2016 2044|1906   0  42|  68  96  95|  95  93   3   5   2  10
----------------------+------------+------------+------------------------
OBJ SCORES            |            |            |
 enamex       926  937| 898   0   0|  28  39  21|  97  96   3   4   0   7
 timex        111  112| 107   0   0|   4   5   6|  96  96   4   4   0   8
 numex         93  101|  90   0   0|   3  11   0|  97  89   3  11   0  13
----------------------+------------+------------+------------------------
SLOT SCORES           |            |            |
enamex                |            |            |
 type         926  937| 878   0  20|  28  39  21|  95  94   3   4   2   9
 text         926  937| 876   0  22|  28  39  21|  95  93   3   4   2   9
 status         0    0|   0   0   0|   0   0  38|   0   0   0   0   0   0
 alt            0    0|   0   0   0|   0   0   0|   0   0   0   0   0   0
timex                 |            |            |
 type         111  112| 107   0   0|   4   5   6|  96  96   4   4   0   8
 text         111  112|  98   0   9|   4   5  11|  88  88   4   4   8  16
 status         0    0|   0   0   0|   0   0   6|   0   0   0   0   0   0
 alt            0    0|   0   0   0|   0   0   0|   0   0   0   0   0   0
numex                 |            |            |
 type          93  101|  90   0   0|   3  11   0|  97  89   3  11   0  13
 text          93  101|  90   0   0|   3  11   0|  97  89   3  11   0  13
 status         0    0|   0   0   0|   0   0   0|   0   0   0   0   0   0
 alt            0    0|   0   0   0|   0   0   0|   0   0   0   0   0   0
----------------------+------------+------------+------------------------
ALL SLOTS    2260 2300|2139   0  51|  70 110 103|  95  93   3   5   2  10

               P&R      2P&R      P&2R
F-MEASURES   93.82     93.32     94.31

The report has several parts:
Here is a coreference task score report:
                 Key  Rsp
Document No.     Cls  Cls       Recall            Precision           f
930620083         23   20    25 /   43  58.1     25 /   40  62.5    60.2
930620057          4    5    11 /   15  73.3     11 /   14  78.6    75.9
930560132         11    6    14 /   20  70.0     14 /   16  87.5    77.8
930380019         18   10    50 /   69  72.5     50 /   59  84.7    78.1
930350079          3    2     8 /   11  72.7      8 /   10  80.0    76.2
930220297         34   39   101 /  157  64.3    101 /  133  75.9    69.7
930220050          2    5     1 /    3  33.3      1 /    6  16.7    22.2
930090013          8    2     7 /   17  41.2      7 /    8  87.5    56.0
930050011         10   13    26 /   35  74.3     26 /   31  83.9    78.8
931290244         12    8    11 /   21  52.4     11 /   16  68.8    59.5
931250227         35   26    37 /   85  43.5     37 /   59  62.7    51.4
931110023          5    9    25 /   36  69.4     25 /   33  75.8    72.5
931090230          4    3     8 /   12  66.7      8 /   10  80.0    72.7
931020207         10   10    25 /   41  61.0     25 /   32  78.1    68.5
930900283         16   16    52 /   75  69.3     52 /   66  78.8    73.8
930860108          7    6    20 /   31  64.5     20 /   23  87.0    74.1
930710271          3    4     5 /    8  62.5      5 /    6  83.3    71.4
940460255          3    7    11 /   17  64.7     11 /   14  78.6    71.0
940430215          4    3     4 /    9  44.4      4 /    5  80.0    57.1
940430078          5    6    15 /   19  78.9     15 /   19  78.9    78.9
940410075         13   14    22 /   34  64.7     22 /   32  68.8    66.7
940370255         41   43    72 /  133  54.1     72 /  115  62.6    58.1
940280231         44   50    65 /  126  51.6     65 /  103  63.1    56.8
940270193         22   23    46 /   75  61.3     46 /   69  66.7    63.9
940260231         14   25   102 /  124  82.3    102 /  121  84.3    83.3
940190235         22   21    61 /   78  78.2     61 /   81  75.3    76.7
940120142         31   35   116 /  165  70.3    116 /  146  79.5    74.6
940090210         18   22    33 /   59  55.9     33 /   59  55.9    55.9
940080212          4    3     5 /    7  71.4      5 /    7  71.4    71.4
940050261          9    5    12 /   21  57.1     12 /   12 100.0    72.7

TOTALS:          435  441   990 / 1546  64.0%   990 / 1345  73.6%   68.5%

There is one line for each document in the corpus. From left to right, the fields of each line are:
The scoring software has three configuration files that you use to specify how the keys and responses are compared. The reason there are three files is partly historical and partly because some of the configuration options are parsed a little differently. In future versions the three files will probably coalesce into one.
You must specify the name of the main configuration file on the command line when you invoke the scorer. The configuration file tells the scorer how to compare the keys and responses. It consists of a list of options. Each option is specified by a colon (":") as the first character of a line, followed immediately (no spaces) by the name of the option. After one or more spaces come the value or values of the option. Values are separated by spaces. Values which themselves contain spaces must be enclosed in single or double quotes. The current options are:
:class_defs "enamex enamex scored 0" "numex numex scored 0" "timex timex scored 0"

The class def strings should be in the order that you want the classes of objects aligned. For the named entity, template element and coreference tasks, this order is unimportant. But for the template relation and scenario template tasks, the pointer fills are judged correct or incorrect based on whether or not the objects they point to are aligned. So the aligning should always start with objects that contain no pointer fills, and proceed to objects whose only pointer fills reference objects without pointer fills, etc. (See the section on how the TR and ST tasks are scored, below.)
:slot_defs "enamex text text scored 4 string"
           "enamex type type scored 4 set"
           "enamex status status unscored 4 set"
           "enamex alt alt unscored 4 string"
           "timex text text scored 4 string"
           "timex type type scored 4 set"
           "timex status status unscored 4 set"
           "timex alt alt unscored 4 string"
           "numex text text scored 4 string"
           "numex type type scored 4 set"
           "numex status status unscored 4 set"
           "numex alt alt unscored 4 string"
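For illustration, option lines of this colon-prefixed, quoted-value form could be collected with a small reader like the one below. This is only a sketch; the real scorer's parsing may differ in details such as how values that continue onto following lines are handled.

import shlex

def read_config(path):
    """Collect option values from lines that begin with ':' (sketch only)."""
    options = {}
    for line in open(path):
        if not line.startswith(':'):
            continue
        parts = shlex.split(line)          # honors single and double quotes
        name = parts[0][1:]                # drop the leading ':'
        options.setdefault(name, []).extend(parts[1:])
    return options

# e.g. options['slot_defs'] would then hold strings like
# "enamex text text scored 4 string"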
The methods for scoring the Template Element, Template Relation, Scenario Template, and Named Entity tasks are very similar. From the standpoint of calculating scores, the template element (TE) task is the basic task of the four. This section explains how TE is scored, and subsequent sections tell how the NE, TR, and ST tasks can be seen as extensions to TE scoring.
Simply put, the final score for the four tasks is found by aligning the key objects with the response objects and then comparing the objects' single fills. Structures are aligned at each level of the object/slot/multi-fill/single-fill hierarchy, but it is the single-fill alignments that are counted to get the score.
The result of aligning one key single fill to one response single fill (or of leaving one key or response single fill unaligned) is called a tally. There are six kinds of tallies:
Given a set of tallies, there are several values calculated in the alignment and final scoring.
POS = COR + INC + PAR + MIS
ACT = COR + INC + PAR + SPU
      COR + (0.5 * PAR)
REC = -----------------
            POS

      COR + (0.5 * PAR)
PRE = -----------------
            ACT
Intuitively, information extraction systems often sacrifice precision for recall, or vice versa. If a system is tuned to "catch everything" (good recall), it often catches more than it should (bad precision). And if it tries to be conservative (good precision), it tends to miss some information (bad recall). When evaluating responses, then, one has to be careful about comparing one response from a system tuned for high recall to another response from a system tuned for high precision. van Rijsbergen's F-measure is used to combine recall and precision measures into one measure. The formula for F is
    (beta^2 + 1.0) * P * R
F = ----------------------
       (beta^2 * P) + R

where beta is the relative weight of precision and recall.
The following measures are also calculated from the tallies, and are in the score report:
      MIS
UND = ---
      POS

      SPU
OVG = ---
      ACT

      INC + (0.5 * PAR)
SUB = -----------------
       COR + INC + PAR

      INC + (0.5 * PAR) + SPU + MIS
ERR = -----------------------------
       COR + INC + PAR + SPU + MIS
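The formulas above translate directly into code. The following sketch (an illustration, not the scorer itself) computes the report measures from a set of tallies; the ALL SLOTS line of the scenario template report above is used as a check.

def measures(cor, par, inc, mis, spu, non, beta=1.0):
    """Compute the score-report measures from the six tallies,
    following the formulas above.  NON does not enter the formulas."""
    pos = cor + inc + par + mis
    act = cor + inc + par + spu
    rec = (cor + 0.5 * par) / pos if pos else 0.0
    pre = (cor + 0.5 * par) / act if act else 0.0
    und = mis / pos if pos else 0.0
    ovg = spu / act if act else 0.0
    sub = (inc + 0.5 * par) / (cor + inc + par) if (cor + inc + par) else 0.0
    err_den = cor + inc + par + spu + mis
    err = (inc + 0.5 * par + spu + mis) / err_den if err_den else 0.0
    f = ((beta ** 2 + 1.0) * pre * rec / (beta ** 2 * pre + rec)
         if (beta ** 2 * pre + rec) else 0.0)
    return {'POS': pos, 'ACT': act, 'REC': rec, 'PRE': pre,
            'UND': und, 'OVG': ovg, 'SUB': sub, 'ERR': err, 'F': f}

# The ALL SLOTS line of the scenario template report shown earlier:
print(measures(cor=1058, par=0, inc=368, mis=1430, spu=881, non=1280))
# POS 2856, ACT 2307, REC 0.37, PRE 0.46, F 0.4098 (the report's 40.98)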
When aligning two multi-fills, the scoring software pairs all single fills of the two multi-fills. For example, if the key multi-fill has three single fills, and the response multi-fill has two single fills, then the scorer creates six pairs of single fills. Each single-fill pair has an F-score associated with it. The scorer sorts these single-fill pairs by F-score in decreasing order. It then proceeds down the sorted list, picking out pairs of single fills for which neither single fill has been chosen yet, and adding them to the final alignment for that pair of multi-fills. Any key or response single fills left over (in our example, there would be a key single fill left) are tallied as missing or spurious.
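A small sketch of this greedy pairing, assuming a caller-supplied comparison function standing in for the single-fill F-score (the function and variable names are invented for this illustration):

from itertools import product

def greedy_align(key_fills, rsp_fills, score):
    """Pair key and response single fills greedily by decreasing score."""
    pairs = sorted(product(range(len(key_fills)), range(len(rsp_fills))),
                   key=lambda p: score(key_fills[p[0]], rsp_fills[p[1]]),
                   reverse=True)
    used_k, used_r, aligned = set(), set(), []
    for k, r in pairs:
        if k in used_k or r in used_r:
            continue                       # each fill may be aligned only once
        aligned.append((key_fills[k], rsp_fills[r]))
        used_k.add(k)
        used_r.add(r)
    leftover_keys = [f for i, f in enumerate(key_fills) if i not in used_k]  # tallied MIS
    leftover_rsps = [f for i, f in enumerate(rsp_fills) if i not in used_r]  # tallied SPU
    return aligned, leftover_keys, leftover_rsps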
A key slot is aligned with a response slot when the two slots have the same name. The lone multi-fill in the response slot is aligned with the multi-fill in the key slot that results in the best multi-fill-to-multi-fill F-score. Any leftover multi-fills in the key slot are unscored, and are tallied as "noncommittal".
Key objects are aligned with response objects of the same object "type" or "class." To choose which objects are paired, the scorer first generates all possible pairs of objects in the class. The F-score for each pair of objects is calculated from the way the objects' single fills align. The weighted F-score is also calculated, by multiplying each slot pair's F-score by the mapping weight of that slot and summing the products. The object pairs are sorted by (unweighted) F-score in decreasing order. Then the scorer proceeds down the sorted list, picking out pairs of objects for which neither object has been chosen yet, and for which the weighted F-score exceeds the threshold for that type of object.
If any objects are left over after this, the scorer looks for any key objects which are marked "optional". The single fills of these objects are tallied as non-committal. If any key objects are left after this, their single-fills are tallied as missing. The single fills of any leftover response objects are tallied as spurious.
When all classes of objects have been aligned, the tallies are summed, and the resulting measures are calculated.
For the TR and ST tasks, the scoring proceeds just as in TE scoring, but the order in which the classes of objects are aligned is important. It is helpful to look at the classes of objects in a TR or ST task as the vertices of a directed graph. If one type of object has a slot containing pointers to another type of object, then the graph has a directed edge from the first class to the pointed-to class:
Since pointer correctness is defined in this way, the directed graph cannot have any directed cycles in it. Further, the scorer has to align the objects so that any pointed-to objects have already been aligned. So in the above figure, the order of mapping could be D-B-C-A or D-C-B-A. Any other order would confuse the scorer.
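Such an order is just a topological ordering of the class graph. The sketch below illustrates the idea with Python's graphlib, under the assumption that the figure's classes are A, B, C, and D, with A pointing to B and C, and B and C pointing to D; it is not the scorer's own code.

from graphlib import TopologicalSorter   # Python 3.9+

# For each class, list the classes its pointer fills reference; the sorter
# treats those pointed-to classes as predecessors, so they come out first.
points_to = {
    'A': {'B', 'C'},
    'B': {'D'},
    'C': {'D'},
}

# Yields pointed-to classes before the classes that point to them:
# D before B and C, and B and C before A.
print(list(TopologicalSorter(points_to).static_order()))
# e.g. ['D', 'B', 'C', 'A']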
The only other difference between the TR and ST tasks and the TE task is the existence of implicitly optional objects in the key. In TR, a "relation" object that points to an optional "template element" object is optional, whether it is marked optional or not. And in ST, an object is implicitly optional if the only pointers pointing to that object are in optional slots, or are in one multi-fill of a slot but not in another multi-fill of the same slot (i.e., there is an alternative multi-fill in the slot that does not point to the object).
The Named Entity task is scored like the Template Element task, except that the objects which are aligned must come from SGML elements in the same position of the original text file. For instance, if in the key the name "Bill Clinton" is tagged in the first paragraph of an article, and in the response "Bill Clinton" is tagged in the tenth paragraph, the objects will not be aligned, even if they would give an F-score of 100%.
The scoring of the Coreference task is very different from that of the other four tasks. Rather than counting single fills, the CO algorithm compares equivalence classes of objects in the key with equivalence classes of objects in the response. For a detailed explanation, see "A Model-Theoretic Coreference Scoring Scheme," by Mark Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman, in the MUC-6 Proceedings.
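To give the flavor of that scheme, here is a small sketch of the recall computation it describes: for each key equivalence class S, count the size of S minus the number of pieces S is broken into by the response classes, and divide the sum over all key classes by the sum of (size of S minus one). Precision swaps the roles of key and response. The function and variable names below are invented for this sketch; consult the paper for the authoritative definition.

def partition_count(S, rsp_classes):
    """Number of pieces the key class S is broken into by the response classes;
    mentions not found in any response class count as singleton pieces."""
    pieces = set()
    for mention in S:
        owner = next((i for i, R in enumerate(rsp_classes) if mention in R), None)
        pieces.add(('class', owner) if owner is not None else ('singleton', mention))
    return len(pieces)

def recall(key_classes, rsp_classes):
    # sum(|S| - |p(S)|) / sum(|S| - 1) over key equivalence classes S
    num = sum(len(S) - partition_count(S, rsp_classes) for S in key_classes)
    den = sum(len(S) - 1 for S in key_classes)
    return num / den if den else 0.0

def precision(key_classes, rsp_classes):
    # precision is recall with the key and response roles swapped
    return recall(rsp_classes, key_classes)

key = [{'a', 'b', 'c', 'd'}]            # one key class of four mentions
rsp = [{'a', 'b'}, {'c', 'd'}]          # the response links them as two classes
print(recall(key, rsp))                 # (4 - 2) / (4 - 1) = 0.667
print(precision(key, rsp))              # all response links are correct: 1.0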