TREC-9 Interactive Track Guidelines



Goal 
---- 
The high-level goal of the Interactive Track in TREC-9 remains the
investigation of searching as an interactive task by examining the
process as well as the outcome. To this end an experimental framework
has been designed with the following common features:

    - an interactive search task: question answering
    - 8 questions
    - a minimum of 16 searchers
    - a newspaper/wire document collection to be searched (the same as the Q&A track's)
    - a required set of searcher questionnaires
    - 5 classes of data to be collected at each site and submitted to NIST
    - 1 summary measure to be calculated by NIST for use by participants

The framework will allow groups to estimate the effect of their
experimental manipulation free and clear of the main (additive)
effects of participant and topic, and it will reduce the effect of
interactions.

In TREC-9 the emphasis will be on each group's exploration of
different approaches to supporting the common searcher task and
understanding the reasons for the results they get. No formal
coordination of hypotheses or comparison of systems across sites is
planned for TREC-9, but groups are encouraged to seek out and exploit
synergies. As a first step, groups are strongly encouraged to make the
focus of their planned investigations known to other track
participants as soon as possible, preferably via the track listserv at
trec-int@ohsu.edu. Contact track coordinator Bill Hersh to join.


What's new?
-----------
For TREC-9 the Interactive Track will experiment with 2 question
types that differ from those of previous years, a shorter time for
each question, and a shorter overall session time for each searcher.


Questions
---------
The track considered 4 sorts of questions:

1.  Find any n Xs 
    e.g., Name 3 US Senators on committees regulating the nuclear 
    industry.

2.  Find the largest/latest/... n Xs
    e.g., What is the largest expenditure on a defense item by 
    South Korea?

3.  Find the first or last X
    e.g., Who was the last Republican to pull out of the nomination 
    race to be the candidate of his/her party for US president in 
    1992?

4.  Comparison of 2 specific Xs
    e.g., Do more people graduate with an MBA from Harvard Business 
    School or MIT Sloan?

After some pretesting we ended up with 8 questions, half of type 1
and half of type 4. Here are the questions. (NOTE that this is not the
order in which they will be presented to searchers.)

1.  What are the names of three US national parks where one can find 
    redwoods?

2.  Identify a site with Roman ruins in present-day France.

3.  Name four films in which Orson Welles appeared.

4.  Name three countries that imported Cuban sugar during the period 
    of time covered by the document collection.

5.  Which children's TV program was on the air longer: the original
    Mickey Mouse Club or the original Howdy Doody Show?

6.  Which painting did Edvard Munch complete first: "Vampire" or 
    "Puberty"?

7.  Which was the last dynasty of China: Qing or Ming?  

8.  Is Denmark larger or smaller in population than Norway?


Document Collection
-------------------
We'll use the TREC-9 Q&A track data (all newspaper/wire data on the
TREC disks, about 2.5GB), which includes:

- AP from disks 1-3 
- Wall Street Journal from disks 1-2 
- San Jose Mercury News from disk 3 
- Financial Times from disk 4 
- Los Angeles Times from disk 5 
- FBIS from disk 5

(NOTE: FBIS is included)


Searcher task
-------------
The searcher's task will be to answer each question and identify a
(minimal, please) set of documents which supports the answer, within
a maximum of 5 minutes. Each answer may have multiple parts. The
searcher will be asked for the answer, and how certain they are about
it, both before and after searching. Sites should not submit more
parts to an answer than were requested, since additional parts will
be ignored.


Instructions to be given to searchers
-------------------------------------
The goal of this experiment is to determine how well an information
retrieval system can help you to answer questions you might ask when
searching newswire or newspaper data.  The questions are of one of two
types:

- Find a given number of different answers

  For example: Name 3 hydroelectric projects proposed or under
               construction in the People's Republic of China.

- Choose between two given answers

  For example: Which institution granted more MBAs in 1989 -
               the Harvard Business School or MIT-Sloan?

You will be asked to search on four questions with one system and
four questions with another.  You will have five minutes to
search on each question, so plan your search wisely.  You will
be asked to answer the question and provide a measure of your
certainty of your answer both before and after searching.

You will also be asked to complete several additional questionnaires:

- Before the experiment - computer/searching experience and attitudes
- After each question
- After each block of four questions with the same system
- After the experiment - system comparison and experiment feedback


Searcher questionnaires 
-----------------------
The minimal set comprises the four questionnaires listed in the
searcher instructions above: before the experiment, after each
question, after each block of four questions with the same system,
and after the experiment.


Data to be collected and submitted to NIST (emailed to over@nist.gov)
------------------------------------------
Several sorts of result data will be collected for evaluation/analysis 
(for all questions unless otherwise specified):


   ===>  Due at NIST by 31 August 2000:

        1. sparse-format data   


   ===>  Due at NIST by the time the site's reports for the conference
         notebook are due:

        2. rich-format data

        3. a full narrative description of one interactive session for
           a question to be determined by each site

        4. any further guidance or refinement of the task specification
           given to the searchers

        5. data from the common searcher questionnaires

Sparse-format data will comprise, for each question, the answer (with
possibly multiple parts) as well as the TREC DOCNO for each document
cited in support of the answer. Sparse-format data will be the basis
for an assessment of summary search effectiveness at NIST: basically,
whether the question was answered or not.

Rich-format data for each question will record:

- the searcher's answer to the question before searching begins, in
  case the searcher believes s/he already knows the answer.

- significant events in the course of the interaction and their 
  timing.  

          Rich format data are intended for analytical evaluation by the 
          experimenters.
 
          All significant events and their timing in the course of the 
          interaction should be recorded.  The events listed below are 
          those that seem to be fairly generally applicable to different 
          systems and interactive environments; however, the list may 
          need extending or modifying for specific systems and so should
          be taken as a suggestion rather than a requirement:

          o Intermediate search formulations:  if appropriate to the 
            system, these should be recorded.

          o Documents viewed:  "viewing" is taken to mean the searcher 
            seeing a title or some other brief information about a 
            document; these events should be recorded.

          o Documents seen:  "seeing" is taken to mean the searcher 
            seeing the text of a document, or a substantial section of 
            text; these events should be recorded. 

          o Terms entered by the searcher:  if appropriate to the 
            system, these should be recorded.

          o Terms seen (offered by the system):  if appropriate to the 
            system, these should be recorded.

          o Selection/rejection:  documents or terms selected by the 
            user for any further stage of the search (in addition to the 
            final selection of documents). 
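
As an illustration only (the track does not prescribe a log format),
a site might capture these events as timestamped records. The sketch
below is in Python; the event-type names and file layout are invented
for the example, not part of the guidelines:

    # Illustrative sketch only: the event names below are hypothetical
    # labels for the event types suggested above.
    import csv
    import time

    EVENT_TYPES = {
        "query",         # intermediate search formulation
        "doc_viewed",    # title or other brief document information seen
        "doc_seen",      # full text (or substantial section) seen
        "term_entered",  # term typed by the searcher
        "term_seen",     # term offered by the system
        "select",        # document/term selected for a further stage
        "reject",        # document/term rejected
    }

    def log_event(writer, searcher_id, question_num, event_type, detail=""):
        """Append one timestamped event to a per-session log."""
        assert event_type in EVENT_TYPES
        writer.writerow([time.time(), searcher_id, question_num,
                         event_type, detail])

    # Usage: one log file per search session.
    with open("session_s01.csv", "w", newline="") as f:
        w = csv.writer(f)
        log_event(w, "s01", 3, "query", "orson welles films")
        log_event(w, "s01", 3, "doc_viewed", "hypothetical DOCNO here")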


Format of sparse data to be submitted to NIST
---------------------------------------------

One ASCII file from each site, with one line for each question a
searcher works on, even if no answer is found. Each line contains the
following items, with intervening spaces and semicolons as indicated.
Since semicolons will be used to parse the lines, they can occur only
as indicated below:

  SiteID; SystemID; SearcherID; QuestionNum; ANSWERLIST; DOCNOLIST

Where:

  SiteID - unique across sites

  SystemID - unique within site to each of your IR systems

  SearcherID - unique within site to each of your searchers

  QuestionNum - a digit, the question number in the guidelines

  ANSWERLIST - a list of answer parts separated by commas.
               Answer parts may contain spaces. The number of parts
               will vary with the question. If no answer is found then
               there will be just a space followed by a semicolon.

  DOCNOLIST - a list of TREC DOCNOs as found in the documents, 
              separated by commas.

Sites determine SiteID, SystemID, and SearcherID; these identifiers
are not allowed to contain spaces.
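
To make the format concrete, here is a minimal parsing/validation
sketch in Python. The SiteID, SystemID, SearcherID, and DOCNOs in the
example line are hypothetical:

    # Minimal sketch: parse and sanity-check one sparse-format line.

    def parse_sparse_line(line):
        """Split a sparse-format line into its six fields."""
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 6:
            raise ValueError("expected 6 semicolon-separated fields")
        site, system, searcher, qnum, answers, docnos = fields
        for ident in (site, system, searcher):
            if " " in ident:
                raise ValueError("IDs may not contain spaces")
        if not qnum.isdigit() or not 1 <= int(qnum) <= 8:
            raise ValueError("QuestionNum must be a digit from 1 to 8")
        # An empty ANSWERLIST (no answer found) yields an empty list.
        answer_parts = [a.strip() for a in answers.split(",")] if answers else []
        docno_list = [d.strip() for d in docnos.split(",")] if docnos else []
        return site, system, searcher, int(qnum), answer_parts, docno_list

    example = ("siteX; sys1; s01; 3; Citizen Kane, The Third Man, "
               "Touch of Evil, The Stranger; XX010190-0001, XX020490-0123")
    print(parse_sparse_line(example))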


Evaluation of data submitted to NIST
------------------------------------ 
The assessment procedure will check each question to see whether or
not it is fully answered and whether the answer (each of its parts) is
supported by the document(s) cited. Fully answered and supported
questions will be assigned a 1; otherwise a 0 will be assigned to the
question, i.e., no credit will be given for partially correct and/or
partially supported answers.
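
For orientation only (the assessment itself is done at NIST), the
resulting binary scores can be rolled up into a mean success rate per
system. A minimal sketch, with hypothetical system IDs and scores:

    # Sketch: aggregate binary question scores (1 = fully answered and
    # supported, 0 = otherwise) into a mean success rate per system.
    from collections import defaultdict

    def success_rates(scores):
        """scores: iterable of (system_id, question_num, score) triples."""
        totals = defaultdict(lambda: [0, 0])  # system -> [sum, count]
        for system, _qnum, score in scores:
            totals[system][0] += score
            totals[system][1] += 1
        return {system: s / n for system, (s, n) in totals.items()}

    print(success_rates([("sys1", 1, 1), ("sys1", 2, 0),
                         ("sys2", 1, 1), ("sys2", 2, 1)]))
    # -> {'sys1': 0.5, 'sys2': 1.0}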


Experimental design in general 
------------------------------ 
The design will be a within-subject design like that used for TREC-8
but with different numbers of questions and searchers.

Each user will search on all the questions. Questions will be
presented in a pseudo-random fashion, as last year, with 16 variations
to ensure each question is searched at a different position (1st
through 8th) by each system. This means that one complete round of the
experiment will require 16 subjects. Contact Bill Hersh for allocation
of the sequences.
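
As an aid (not a track requirement), a site can confirm this balance
property for a candidate matrix: across the 16 subjects, every
(system, question, position) triple should occur exactly once. A
minimal sketch in Python, showing only the first two subjects of the
example matrix in the next section:

    # Sketch: verify the Latin-square-like balance of a design matrix.
    # 'design' maps a subject to (system, question-order) blocks; only
    # two subjects are shown here, so a real check needs all 16.
    from collections import Counter

    design = {
        1: [(2, [4, 7, 5, 8]), (1, [1, 3, 2, 6])],
        2: [(1, [3, 5, 7, 1]), (2, [8, 4, 6, 2])],
        # ... subjects 3-16 as in the example matrix
    }

    def position_counts(design):
        """Count (system, question, position 1-8) occurrences."""
        counts = Counter()
        for blocks in design.values():
            pos = 1
            for system, questions in blocks:
                for q in questions:
                    counts[(system, q, pos)] += 1
                    pos += 1
        return counts

    # A balanced 16-subject design yields all 2*8*8 = 128 triples,
    # each exactly once.
    counts = position_counts(design)
    balanced = len(counts) == 128 and all(v == 1 for v in counts.values())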

The searching part of the experiment will also take about one hour.
Each question will take 7 minutes (8 questions x 7 minutes = 56
minutes): 1 minute before, to find out if the searcher already knows
the answer; 5 minutes to find the answer(s) by searching; and 1 minute
after, to answer questions about that specific search.

An example non-searching part of the experimental session would be 
as follows:

Introductory stuff              10 minutes
Tutorials (2 systems)           30 minutes total
Post system questions           10 minutes total (5 for each system)
Exit questions                  10 minutes
(Total non-searching             1 hour)


Experimental design for a site
------------------------------
  1. Example minimal experimental matrix as run:

     Reminder: Don't actually run this one. Contact Bill Hersh
     (hersh@ohsu.edu) to request your own matrix.

               Subject        Block #1             Block #2

                   1      System 2: 4-7-5-8    System 1: 1-3-2-6
                   2      System 1: 3-5-7-1    System 2: 8-4-6-2
                   3      System 1: 1-3-4-6    System 2: 2-8-7-5
                   4      System 1: 5-2-6-3    System 2: 4-7-1-8
                   5      System 2: 7-6-2-4    System 1: 3-5-8-1
                   6      System 2: 8-4-3-2    System 1: 6-1-5-7
                   7      System 1: 6-1-8-7    System 2: 5-2-4-3
                   8      System 2: 2-8-1-5    System 1: 7-6-3-4
                   9      System 1: 4-7-5-8    System 2: 1-3-2-6
                  10      System 2: 3-5-7-1    System 1: 8-4-6-2
                  11      System 2: 1-3-4-6    System 1: 2-8-7-5
                  12      System 2: 5-2-6-3    System 1: 4-7-1-8
                  13      System 1: 7-6-2-4    System 2: 3-5-8-1
                  14      System 1: 8-4-3-2    System 2: 6-1-5-7
                  15      System 2: 6-1-8-7    System 1: 5-2-4-3
                  16      System 1: 2-8-1-5    System 2: 7-6-3-4


  2. Augmentation

     The design for a given site can be augmented in two ways:

       1. Participants can be added in groups of ? using the design
          above.  Additional blocks should be requested from Bill
          Hersh.

       2. Systems can be added by adding additional groups of ? users
          with each new system.  Additional blocks should be requested
          from Bill Hersh.

     Questions cannot be added/subtracted individually for each site. 

     All augmentations other than the two listed above, however interesting, 
     are outside the scope of this design. If sites plan such adjunct 
     experiments, they are encouraged to design them for maximal synergy 
     with the track design.

  3. Analysis

     Analysis is up to each group, but all groups are strongly
     encouraged to take advantage of the experimental design and
     undertake:

        1. exploratory data analysis

           to examine the patterns of correlation, interaction, etc.
           involving the major factors. Some example plots for the TREC-6
           interactive data (recall or precision by searcher or topic)
           are available on the Interactive Track web site under 
           "Interactive Track History".
           
        2. analysis of variance (ANOVA), where appropriate,

           to estimate the separate contributions of searcher, topic and 
           system as a first step in understanding why the results of one 
           search are different from those of another.


Deadlines
---------
All experiments must be done and sparse-format data sent to NIST by
31 August 2000.  Rich-format data must be sent to NIST by the time the
conference notebook papers are due.


For information about these guidelines contact
Paul Over (over@nist.gov) or Bill Hersh (hersh@ohsu.edu).