TREC Medical Records Track 2011 Guidelines

The goal of the Medical Records track is to foster research on providing content-based access to the free-text fields of electronic medical records. In this initial year, the track will focus on a task that models the real-world task of finding a population over which comparative effectiveness studies can be done.

The task

The test document collection for the Medical Records track is a set of de-identified medical records made available for research use through the University of Pittsburgh BLULab NLP Repository. Participants must obtain the data set directly from the University of Pittsburgh after first getting a "Letter of Participation" from NIST. Look here for details about how to obtain the collection.

Visits: Each report in the dataset has a report ID, called the "reports_checksum". Most reports are associated with one or more "visits", each identified by a "visitreports_visitid". (A small percentage of the reports have no associated visit because the data linking the record to a visit has been lost.) As of July 20, 2011, this is the only report-to-visit mapping table to be used for TREC 2011. Each line contains a report checksum, a report type code, and the visit ID. Reports not mapped to a visit are not included. Please replace any other version with this one.
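
As an illustration, the mapping table can be loaded into an index from visit ID to report checksums. The sketch below assumes the three fields are separated by whitespace and appear in the order stated above; the actual delimiter in the distributed file may differ. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict

def load_visit_map(lines):
    """Build {visit_id: [report_checksum, ...]} from mapping-table lines.

    Assumes three whitespace-separated fields per line: report checksum,
    report type code, visit ID. The delimiter is an assumption, not a
    documented property of the file.
    """
    visits = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        checksum, _type_code, visit_id = fields
        visits[visit_id].append(checksum)
    return dict(visits)

# Made-up sample rows, for illustration only.
sample = [
    "abc123 RAD E6t97jn7a1sA",
    "def456 DS E6t97jn7a1sA",
    "ghi789 RAD 78CrJwsWXvYq",
]
visit_map = load_visit_map(sample)
```

With an open file, the same function can be called as load_visit_map(open(path)), where path is wherever the mapping table was saved.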

Dropped reports: Please note that these 845 report files have been dropped from the final test data for 2011 as of July 20, 2011. They cannot be used as evidence that a visit meets the need expressed in a topic and should be ignored.

The Medical Records track will use the *visit* as the response unit. That is, your retrieval system must return visitreports_visitid values, and relevance judgments will be based on the visit as a whole. Note that the use of visits as the retrieval unit means those reports that are not associated with any visit are effectively removed from the collection.
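
A system that scores individual reports therefore needs some way to turn report scores into visit scores. The sketch below is one possible approach, not a method prescribed by the track: it takes the maximum report score per visit, skips dropped reports, and ignores reports that have no visit. All names and values are hypothetical.

```python
def visit_scores(report_scores, report_to_visits, dropped=frozenset()):
    """Collapse report-level scores into a ranked list of (visit_id, score).

    report_scores:    {report_checksum: score}
    report_to_visits: {report_checksum: [visit_id, ...]}
    dropped:          checksums of reports removed from the collection
    Uses the maximum report score per visit (one choice among many).
    """
    best = {}
    for checksum, score in report_scores.items():
        if checksum in dropped:
            continue  # dropped reports must not count as evidence
        # Reports with no mapped visit contribute nothing.
        for visit_id in report_to_visits.get(checksum, []):
            if visit_id not in best or score > best[visit_id]:
                best[visit_id] = score
    # Highest score first; ties broken by visit ID for determinism.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up data: r4 is dropped and r5 has no visit.
ranked = visit_scores(
    {"r1": 2.5, "r2": 1.0, "r3": 3.0, "r4": 9.9, "r5": 4.0},
    {"r1": ["v1"], "r2": ["v1", "v2"], "r3": ["v2"], "r4": ["v3"]},
    dropped={"r4"},
)
```

Other aggregation choices, such as summing report scores or concatenating all reports of a visit before indexing, fit the same interface.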

The retrieval task for the track is an ad hoc search task as might be used to identify cohorts for comparative effectiveness research. Topics will specify a particular disease/condition set and a particular treatment/intervention set, and your system should return a list of visits ranked by decreasing likelihood that the visit satisfies the specification. For example, a topic might be "find patients with gastroesophageal reflux disease who had an upper endoscopy".

Topics will be developed by physicians who are also students at Oregon Health & Science University (OHSU). Similar students will also do the relevance judging, and we will call both groups "assessors". Assessors will devise topics using a list of priority areas for comparative effectiveness research issued by the US Institute of Medicine of the National Academies as inspiration.

Topic development assessors have been instructed that we desire topics that exploit information from the text fields; in other words, topics that are not answerable solely by the diagnostic codes contained in the records. However, this does not rule out the possibility that the diagnostic codes might contribute to the fact that an item is a match. We will not intentionally try to create topics for which the diagnostic codes are "gotchas", but as always in TREC, relevance will be in the eyes of the assessor.

NIST and OHSU have produced four sample topics with a few corresponding relevance judgments. *The primary purpose of these example topics is to illustrate the syntactic format of the test topics.* They will also be suggestive of the type of language use that might be expected in the test topics. They are explicitly not guaranteed to be representative of anything else.

The test set of 35 topics will be posted to the Tracks web page on June 15. Results of running your system on those topics (a "run") will be due August 16 for runs to be included in the pooling and manual judging. Runs may also be submitted after August 16 and before September 15. These later runs will not be included in the pooling and manual judging; they will, however, be evaluated using the results of manual judging and marked as non-pooled in the official results.

Your runs may be created completely automatically or with some level of manual intervention. Automatic methods are those in which there is no human intervention at any stage: the system takes the topic statement as input and produces a ranked list of visit IDs as output with no human in the loop. Manual methods are everything else. The manual category encompasses a wide variety of approaches. Restrictions on what is permitted are intentionally few, to accommodate as many experiments as possible. In general, the ranking submitted for a topic is expected to reflect a ranking that your system could actually produce: the result of a single query in your system (granting that the query might be quite complex and the end result of many iterations of query refinement) or the automatic fusion of different queries' results. However, it is permissible to submit a ranking produced in some other way, provided the ranking supports some specific hypothesis that is being tested and the conference paper gives explicit details regarding how the ranking was constructed.

You may not change your system once you have looked at the test set of topics. This precludes any possibility of tweaking the system to benefit test topics. Working on your system after the test topics have been posted but before you fetch them is fine. TREC purposely allows a long time-window between topic release and run submission to accommodate as many participants' schedules as possible and to allow time for manual runs.

Submitting runs

Runs are submitted through an automatic run submission system hosted at NIST. This submission system will perform sanity-checking on the submission file and reject any runs that do not pass the checks. Runs that have been rejected are not counted as submitted runs. NIST will not accept emailed submissions; in particular, runs that are emailed because they do not pass the sanity checking in the submission system will simply be discarded. The script that is used to do the sanity checking will be made available to participants once the submission system is open. You are very strongly encouraged to check for errors yourself prior to submitting a run.

When you submit your run, you will be asked to specify the run's features on the submission form. These features will include at least whether the run is a manual or automatic submission; the judging priority of the run (see below); and a short textual description of the run. Other features may be added and will be announced on the mailing list.

In TREC tradition, a deadline of August 16 officially means runs must be submitted by 11:59pm EDT on August 16. In practice, it means runs must be submitted before NIST personnel disable the submission system on the morning of August 17; this generally means the effective submission deadline is about 8:00am EDT on August 17.

Format of a Submission

The Medical Records track will use the standard TREC submission format for ad hoc runs. A submission consists of a single file that contains retrieval results for all test topics. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.

       20 Q0 E6t97jn7a1sA 1 4238 prise1
       20 Q0 78CrJwsWXvYq 2 4223 prise1
       20 Q0 NoXWN9vdBXTO 3 4207 prise1
       20 Q0 yZZStV5RDJPP 4 4194 prise1
       20 Q0 TwI3ghHE0JEk 5 4189 prise1
          etc.
    where:
       * the first column is the topic id
    
       * the second column is the literal 'Q0'
    
       * the third column is an official visitreports_visitid 
    
       * the fourth column is the rank at which the visit is retrieved

       * the fifth column shows the score (integer or floating
         point) that generated the ranking.  This score MUST be in
         descending (non-increasing) order and is important to include
         so that we can handle tied scores (for a given run) in a
         uniform fashion (the evaluation routines rank documents from
         these scores, not from your ranks).  If you want the precise
         ranking you submit to be evaluated, the SCORES must reflect that
         ranking.
    
       * the sixth column is called the "run tag" and should be a unique
         identifier for your group AND for the method used.  That is, each
         run must have a different tag and that tag should identify
         the group and the method that produced the run.
         Run tags must contain 12 or fewer characters and may not
         contain whitespace or a colon (:).
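
The six-column layout described above can be produced with a few lines of code. The following is a minimal sketch, assuming the results for a topic are held as (visit ID, score) pairs; the function name is hypothetical. It sorts by decreasing score so that the rank column and the score column agree.

```python
def format_run(topic_id, scored_visits, run_tag):
    """Return the run-file lines for one topic.

    scored_visits: list of (visit_id, score) pairs, in any order.
    Lines are ordered by decreasing score, and ranks start at 1.
    """
    if (not run_tag or len(run_tag) > 12 or ":" in run_tag
            or any(ch.isspace() for ch in run_tag)):
        raise ValueError("run tag: 1-12 characters, no whitespace or colon")
    ordered = sorted(scored_visits, key=lambda pair: -pair[1])
    return [
        f"{topic_id} Q0 {visit_id} {rank} {score} {run_tag}"
        for rank, (visit_id, score) in enumerate(ordered, start=1)
    ]

lines = format_run(
    20,
    [("78CrJwsWXvYq", 4223), ("E6t97jn7a1sA", 4238)],
    "prise1",
)
```

Calling format_run once per topic and writing all resulting lines to one file yields a single submission file.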

Each topic must have at least one visit retrieved for it and no more than 1000. Provided you have at least one visit, you may return fewer than 1000 visits for a topic, though note that the standard ad hoc retrieval evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 1000 visits per topic.
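
Before submitting, a run file can be checked locally against the rules stated in this section. The sketch below is not the NIST sanity-checking script and may accept files that the official script rejects; it only covers the constraints described here. The function name and parameters are hypothetical.

```python
from collections import defaultdict

def check_run(lines, expected_topics=(), max_per_topic=1000):
    """Return a list of error messages; an empty list means no errors found."""
    errors = []
    scores_by_topic = defaultdict(list)  # topic -> scores in file order
    tags = set()
    for n, line in enumerate(lines, start=1):
        cols = line.split()
        if len(cols) != 6:
            errors.append(f"line {n}: expected 6 columns, found {len(cols)}")
            continue
        topic, q0, _visit_id, _rank, score, tag = cols
        if q0 != "Q0":
            errors.append(f"line {n}: second column must be Q0")
        if len(tag) > 12 or ":" in tag:
            errors.append(f"line {n}: bad run tag {tag!r}")
        tags.add(tag)
        try:
            value = float(score)
        except ValueError:
            errors.append(f"line {n}: score {score!r} is not a number")
            continue
        scores_by_topic[topic].append(value)
    if len(tags) > 1:
        errors.append("more than one run tag in file")
    for topic in expected_topics:
        if topic not in scores_by_topic:
            errors.append(f"topic {topic}: no visits retrieved")
    for topic, scores in scores_by_topic.items():
        if len(scores) > max_per_topic:
            errors.append(f"topic {topic}: more than {max_per_topic} visits")
        # Each score must be >= the score on the following line.
        if any(a < b for a, b in zip(scores, scores[1:])):
            errors.append(f"topic {topic}: scores are not non-increasing")
    return errors
```

Passing the list of test topic IDs as expected_topics also catches topics for which nothing was returned.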

Judging

Groups may submit up to four runs to the track. Judging will be done on pools created from a subset of the runs. The number of visits per topic per run that are added to the pool ("pool depth") will be determined after submissions are complete such that the final pool sizes are within the bounds that assessors can handle. We are targeting a pool size of roughly 500 visits per topic. Assessors will have access to all of the reports that constitute a visit at the time of judgment. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run(s) to assess arbitrarily. Judgments will be binary: the visit either satisfies the query's specification or it does not, in the opinion of the relevance assessor.
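
To make the notion of pool depth concrete, the sketch below builds a pool for one topic as the union of the top-ranked visits from each pooled run, and searches for the largest depth that keeps the pool within a target size. This only illustrates the idea; the procedure NIST actually uses to set the depth is not specified here, and the function names are hypothetical.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` visits from each run, for one topic.

    runs: list of ranked visit-ID lists (best first), one per pooled run.
    """
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

def largest_depth_within(runs, target_size, max_depth=1000):
    """Largest depth whose pool holds at most `target_size` visits (0 if none)."""
    best = 0
    for depth in range(1, max_depth + 1):
        # Pool size never shrinks as depth grows, so stop at the first overflow.
        if len(build_pool(runs, depth)) > target_size:
            break
        best = depth
    return best

# Two made-up runs for a single topic.
runs = [["a", "b", "c", "d"], ["b", "e", "f", "g"]]
depth = largest_depth_within(runs, target_size=5)
```

Because runs overlap, the pool is usually smaller than depth times the number of runs, which is why the depth can only be fixed once all submissions are in.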

Scoring

NIST will score all submitted runs using the relevance judgments produced by the assessors. The primary measure for the track will be mean average precision, though all of the various trec_eval measures will be reported.
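
For reference, mean average precision with binary judgments can be computed as in the sketch below. It is a simplified illustration and not a replacement for trec_eval: it assumes each ranking contains no duplicate visit IDs, and it averages over every topic in the judgments, counting a topic with no relevant visits as zero. The function names and data are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant visit IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, visit_id in enumerate(ranking, start=1):
        if visit_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant visit
    # Relevant visits that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(run, qrels):
    """Mean of per-topic average precision.

    run:   {topic: [visit_id, ...]} ranked best first
    qrels: {topic: set of relevant visit IDs}
    """
    if not qrels:
        return 0.0
    total = sum(average_precision(run.get(t, []), rel) for t, rel in qrels.items())
    return total / len(qrels)

# Made-up example with two topics.
run = {"101": ["a", "b", "c", "d"], "102": ["z"]}
qrels = {"101": {"a", "c", "x"}, "102": {"z"}}
map_score = mean_average_precision(run, qrels)
```

In this example the first topic finds relevant visits at ranks 1 and 3 out of three relevant visits, and the second topic is answered perfectly.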

Timetable

Documents available:                now
Sample topics available:            June 6, 2011
Test set of topics available:       June 15, 2011
Results due at NIST for pooling:    Aug 16, 2011
Non-pooled results due at NIST:     Sep 14, 2011
Qrels for topics available:         target of Oct 1, 2011
Conference notebook papers due:     late October, 2011
TREC 2011 conference:               November 15-18, 2011

The National Institute of Standards and Technology (NIST) is an agency of the U.S. Commerce Department.

The TREC Conference series is co-sponsored by the NIST Information Technology Laboratory's (ITL) Retrieval Group of the Information Access Division (IAD) and the Intelligence Advanced Research Projects Activity (IARPA).

Contact us at: trec (at) nist.gov

Last updated: Wednesday, 20-Jul-2011 11:24:03 MDT
Date created: Monday, May 23, 2011

For further information contact Ellen Voorhees