ZPRISE Poster Session

Lessons Learned in an Informal Usability Study
(Poster presented at SIGIR - July 1997)

Dawn (Hoffman) Tice, dawn.tice@nist.gov
Laura L. Downey, ldowney@vignette.com

NOTE: This poster session is an example of some of the research we are doing at NIST to evaluate interactive information retrieval systems.

Abstract
Introduction
Testing Methodology
Usability Test 1
Usability Test 2
Analysis
Lessons Learned
References

Abstract

This poster examines the challenges involved in conducting an informal usability study based on the introduction of a new information retrieval system to experienced users. We present a summary of activities performed during two iterations of usability testing and describe our analysis methodology. Results of the study include lessons learned about both the users and the testing techniques.

Introduction

The purpose of the study was three-fold: to gain experience in conducting usability testing on information retrieval systems, to specifically examine the usability of the new ZPRISE interface, and to identify problems our users were having with the assigned task (topic development for TREC).

Testing Methodology

The NIST users (assessors) are retired information analysts from the National Security Agency (NSA). Most of the users have been performing the topic development and relevance assessment tasks for NIST for four years within the TREC [Harman 1996] project parameters.
For the usability study, we chose the TREC-5 topic development task. This allowed the actual TREC-5 topic development activity and the usability study to be conducted in parallel. Each user was instructed to compose topics on any subject of interest to them prior to the usability test. They were required to provide the following information per topic: a short title, a short description of the topic, and a narrative that explained what would constitute a relevant document match.
Once the usability test began, users searched a pre-selected database for their topics. During the search they marked documents relevant to the topic and recorded the number of relevant documents found per topic. The users performed the searches using the NIST ZPRISE system, which was installed on networked Sun workstations.
Prior to the actual usability test, users answered questions on their topic development activities. This data was not part of the usability test but was gathered to support ongoing investigations into user search behavior.
Based on traditional usability practices [Nielsen 1993], we chose a three-step process: a tutorial, observations and verbal feedback, and a satisfaction survey.

Usability Test 1

We conducted the first test with the following parameters:

11 users, all familiar with the topic development task
2 users per session, two 4-hour sessions per day
We began the test by giving a 1-hour tutorial to the users explaining the features of the new interface. One trainer demonstrated the new interface while the second trainer recorded the users' comments during the tutorial.
During the second part of the test, we observed the users for 50 minutes while they navigated the system performing their topic development task. We recorded the critical incidents and user comments. When users had trouble, we encouraged them to problem-solve on their own or to consult system help or the written instructions. Users were given an additional 30 minutes to finish their topic development without observers in the room.
For the final portion of the usability test, we administered a user satisfaction survey (30-45 minutes).

Usability Test 2

Based on input from the first usability test, the interface was modified. A second usability test was conducted under the same conditions as Test 1, with two of the original users and two new users.

Analysis

After conducting each iteration of the usability tests, we analyzed the results using several grouping and prioritizing methods. We identified critical incidents, separated in-scope from out-of-scope factors, and prepared estimates for code changes, resulting in a final decision model.
The challenge in analyzing all the collected data was to organize it in order to identify the major system and interface issues. We gathered the following data per user:

tutorial observations
task performance observations including critical incident recordings
user satisfaction survey data
We first combined the tutorial observations and the task performance observations by user. To identify problem trends among users and common problems in specific parts of the interface, we also grouped the combined observations by interface window, then by user.
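The two-level grouping described above (by interface window, then by user) can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the study: the record fields, the user IDs, the window names, and the notes are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical placeholder records: the users, windows, and notes below are
# invented for illustration and do not come from the study data.
observations = [
    {"user": "U1", "window": "window A", "phase": "tutorial", "note": "note 1"},
    {"user": "U2", "window": "window A", "phase": "task", "note": "note 2"},
    {"user": "U1", "window": "window B", "phase": "task", "note": "note 3"},
    {"user": "U1", "window": "window A", "phase": "task", "note": "note 4"},
]


def group_by_window_then_user(records):
    """Group notes by interface window first, then by user within each window."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        grouped[record["window"]][record["user"]].append(record["note"])
    # Convert to plain dicts so the result prints and compares cleanly.
    return {window: dict(by_user) for window, by_user in grouped.items()}


grouped = group_by_window_then_user(observations)
for window, by_user in grouped.items():
    # A window with notes from many users suggests a common problem area.
    print(window, {user: len(notes) for user, notes in by_user.items()})
```

Grouping by window first surfaces problems tied to a specific part of the interface, while the inner grouping by user shows whether a problem is a trend or one person's difficulty.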
For the next step, the observation list was shortened by consolidating like problems and separating problems attributed to training issues. We also identified out-of-scope observations such as problems related to the underlying windowing system and not to the interface itself.
At this point we incorporated the data from the satisfaction survey that was relevant to the identified usability issues in order to combine all the observed and perceived problems. We also created two other lists: a set of positive comments about the system and a list of users' suggestions for future enhancements.
As the final step in organizing the usability problem matrix, we categorized the items into several sub-groups such as problems relating to messages in the interface or information organization. We then assigned high, medium and low priorities to the problems. With the development team, we proposed usability solutions and discussed the cost/benefit of each, resulting in a set of action items and estimates for changes to the interface.
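One way to picture the resulting problem matrix is as a list of records ordered by priority and estimated effort. The sketch below is an assumption about how such a matrix might be represented, not a description of the actual decision model: the `Problem` fields, the effort estimates, and the sample entries are hypothetical, and only the two category names are taken from the text.

```python
from dataclasses import dataclass

# Lower number sorts first, so "high" priority problems lead the list.
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Problem:
    description: str      # hypothetical placeholder text
    category: str         # e.g. "messages" or "information organization"
    priority: str         # "high", "medium", or "low"
    estimate_days: float  # hypothetical effort estimate for the code change


def rank_problems(problems):
    """Order problems by priority, then by cheapest estimated change."""
    return sorted(problems, key=lambda p: (PRIORITY_ORDER[p.priority], p.estimate_days))


problems = [
    Problem("placeholder problem 1", "messages", "medium", 2.0),
    Problem("placeholder problem 2", "information organization", "high", 5.0),
    Problem("placeholder problem 3", "messages", "high", 1.0),
    Problem("placeholder problem 4", "information organization", "low", 0.5),
]

ranked = rank_problems(problems)
for p in ranked:
    print(f"{p.priority:<6} {p.estimate_days:>4} days  [{p.category}] {p.description}")
```

Sorting by priority and then by estimate only produces a starting order for discussion; the cost/benefit judgment itself was made with the development team.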
Changes were made to the interface and the second usability test was conducted. In order to be able to compare and contrast the results from Test 1 and Test 2, we used the same basic analysis technique for Test 2.
During comparative analysis, we were concerned with two major questions. Had we minimized or eliminated the problems identified in Test 1? And did any new or previously unreported problems occur in Test 2, especially ones that may have been introduced by the changes?
To account for the differences in users and in the number of users in each test, we compared the categories of usability issues resulting from both tests rather than the raw numbers.
Usability issues were identified in 19 categories in Test 1 and 14 categories in Test 2. When we compared the categories, we found 11 in common. This translates to the elimination of eight groups of usability issues between Test 1 and Test 2 (19 - 11) and the identification of three new groups in Test 2 (14 - 11).
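The category comparison amounts to simple set arithmetic. The sketch below is illustrative only: the poster reports counts rather than category names, so the labels are hypothetical placeholders chosen to reproduce the reported totals.

```python
# Hypothetical placeholder labels: only the counts come from the study.
common = {f"common-{i:02d}" for i in range(1, 12)}     # 11 categories in both tests
only_test1 = {f"test1-only-{i}" for i in range(1, 9)}  # 8 categories only in Test 1
only_test2 = {f"test2-only-{i}" for i in range(1, 4)}  # 3 categories only in Test 2

test1 = common | only_test1  # 19 categories in total
test2 = common | only_test2  # 14 categories in total


def compare_categories(before, after):
    """Split categories into those shared, eliminated, and newly introduced."""
    return {
        "common": before & after,
        "eliminated": before - after,
        "new": after - before,
    }


result = compare_categories(test1, test2)
summary = {name: len(members) for name, members in result.items()}
print(len(test1), len(test2), summary)
```

Comparing sets of categories rather than raw problem counts keeps the comparison meaningful even though the two tests had different users and different numbers of users.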
We then analyzed the 11 common groups and the three new groups. We found that they could roughly be classified into two major divisions - navigation issues and conceptual issues. Navigation issues included widget co-location, size, placement, and existence. Conceptual issues primarily revolved around the definition and function of relevance feedback including the use and utility of enhanced query terms.
It should be noted that mixed objectives made it difficult to collect and organize the data. During Test 1 we struggled with several classification schemes before deciding on a useful strategy. In the end, the final data organization and analysis were made easier by repeated examination of the data from several different perspectives during Test 1.

Lessons Learned

Our first goal in conducting the informal usability study was to gain experience in usability testing on information retrieval systems. First and foremost, we learned that performing several activities in tandem can lead to confusion between tasks and more difficult analysis of results. The users were performing actual TREC topic development while we were testing the new interface for general usability and also testing this general-use interface on a specific task. Often the lines became blurred.
The second goal of identifying and correcting the problems related to our general use ZPRISE interface was relatively straightforward. During analysis, we identified navigation and conceptual difficulties which were corrected and retested. The poster session will explore these in more detail.
The third goal of identifying problems our assessors were having with the TREC task became the most complex (and interesting) of the three goals. This section will mainly concentrate on lessons learned in that area.
First, as in most usability studies [Koenemann 1995], we identified the typical user issues:

Users brought their biases and experiences with them.
They compared the new interface to the old interface, and these comparisons can reveal a great deal about an interface's usability. We expect such comparisons to continue and view them as a dynamic feedback mechanism for ourselves and the system developers.
Users responded favorably to hands-on training during the actual task, which we found to be the most effective way for them to grasp and retain knowledge of the interface features.
Improvements in one area may cause unanticipated adverse effects. For example, additional features placed a noticeable burden on the performance of our computer systems; response time became slower and user satisfaction declined with it.
We also learned a few TREC-specific and task-specific lessons during the usability study:

Users confused the topic development task and the relevance assessment task because relevance assessment is part of the topic development task, and some of the screens and features of the new interface resembled the old relevance assessment interface.
Users required a detailed explanation of the objectives of the topic development task before task performance began. In particular, users became anxious when they did not find what they considered to be a good number of relevant documents. Consequently, they may change their definition of what constitutes a relevant document in midstream. This can lead to inconsistencies during the topic development task as well as during the relevance assessment task.
The second round of usability testing revealed additional relevance assessment system requirements because the users' most recent task had been relevance assessment.
These observations have led to three recommendations specific to the TREC task:

Design and development of two specialized task-specific interfaces (i.e., topic development and relevance assessment) for the TREC assessors.
Further work on understanding the consistency issues in relevance judgments.
A revised training program that allows hands-on training for the topic development and relevance assessment interfaces. Most importantly, the revised training program must explain the TREC task in more detail. We believe that a better understanding of the objectives of the two tasks will lead to increased accuracy in relevance assessments.
In conclusion, we learned that even an informal usability study produces a substantial amount of useful data. On the basis of this data, we were able to make significant improvements to our general purpose ZPRISE interface. Additionally we were able to identify problems the TREC assessors were having with the TREC task and gain insights into user preferences in interface design and usability issues.

References

D. Harman, editor. The Fourth Text REtrieval Conference (TREC-4). National Institute of Standards and Technology Special Publication 500-236, Gaithersburg, MD 20899, 1996.
Jurgen Koenemann, Richard Quatrain, Colleen Cool, and Nicholas J. Belkin. New Tools and Old Habits: The Interactive Searching Behavior of Expert Online Searchers using INQUERY. In D. K. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 145-177. NIST Special Publication 500-225, April 1995.
Jakob Nielsen. Usability Engineering. Academic Press, Inc., San Diego, CA, 1993.

For more on TREC, see http://trec.nist.gov/

Last updated: Monday, 31-Jul-2000 13:16:47 MDT
Date created: Monday, 31-Jul-00