Information Retrieval Experiment (IRE), edited by Karen Sparck Jones. Chapter: Ineffable concepts in information retrieval, by Nicholas J. Belkin. Butterworth & Company.
it is difficult to make explicit predictions of behaviour or other empirically verifiable phenomena on their basis. For the same reasons, it is very difficult to determine reasonable operational definitions for these variables. To achieve these goals, it is usually necessary to go through a number of subsequent assumptions or hypotheses, each of which is a theoretical construct in its own right. When one finally gets to some phenomenon that is operationally definable or empirically observable, the relationship of that phenomenon to the original theoretical concept is probably very tenuous indeed. With so many intervening constructs and assumptions, it is unclear just what is being tested in the final experiment or investigation. Concepts from both the user-related and text-related groups share this problem, and so, therefore, do those from the group of concepts arising from their relationships.
For example, consider the problem of operationalizing information need. Belkin and Oddy [9] have suggested that an 'anomalous state of knowledge' (ASK) is the basis of any information need, and that information retrieval systems should attempt to use representations of ASKs as the basis for retrieval. They consider an ASK to be a part of an individual's state of knowledge which that person considers to be inadequate (anomalous) in some way.
The first problem that arises in trying to make this concept operational is to decide upon a general schema for representation. On the basis of psychological arguments, the investigators [29] chose structures consisting of concepts and relations among the concepts. Next, one needs to decide upon a means for obtaining the data from which the representation will be constructed. They decided to use 'problem statements'; that is, statements by users about the problem which brought them to an information retrieval system. This decision was supported by Wersig's [7] argument concerning the problematic situation, but the method for eliciting these statements had to be designed from first principles. Finally, a technique for analysing the data and generating the structure is needed. On the basis of some quite speculative argument concerning underlying 'cognitive' structures and their reflection in linguistic structures, and in order to make the problem relatively simple, the general structure chosen was one of associative relations among concepts. The concepts were to be represented by word stems, and the strength of association was to be determined by the degree of co-occurrence of words within specified distances in the text of the problem statement. This entire chain then resulted in a structure which was claimed to be a representation, at some level, of the ASK underlying the person's information need. The representation could be displayed as a graph, with word stems as nodes, associative relations between nodes represented by edges, and the distances between nodes related to the strength of their association. Consider now what lies between the original theoretical construct (the notion of an ASK) and its operational definition.
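To make the operational end of this chain concrete, the following is a minimal sketch of how a structure of this general kind might be computed from a problem statement. It is not the investigators' procedure: the stop list, the crude suffix-stripping stemmer, the window size, the use of raw co-occurrence counts as association strength, and the function names `crude_stem` and `association_graph` are all assumptions introduced here for illustration.

```python
import re
from collections import Counter

# Hypothetical stop list and suffix list: illustrative assumptions only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is", "are"}
SUFFIXES = ("ing", "ies", "es", "ed", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def association_graph(problem_statement, window=2):
    """Map each unordered pair of stems to its co-occurrence count.

    Two stems co-occur when they lie within `window` positions of each
    other, counted over the stop-word-filtered text (an assumption).
    """
    words = re.findall(r"[a-z]+", problem_statement.lower())
    stems = [crude_stem(w) for w in words if w not in STOP_WORDS]
    edges = Counter()
    for i, left in enumerate(stems):
        # Look ahead at the next `window` stems only, so each pair of
        # positions is counted once.
        for right in stems[i + 1 : i + 1 + window]:
            if left != right:
                edges[tuple(sorted((left, right)))] += 1
    return dict(edges)


statement = "Retrieval systems index documents; indexing documents aids retrieval"
graph = association_graph(statement, window=2)
print(graph[("document", "index")])  # 3
```

Each key of the returned dictionary is an edge between two word-stem nodes, and its count plays the role of association strength. Every parameter choice here (what counts as a word, how stems are formed, how far apart two words may be) is itself one of the intervening decisions that separate the representation from the ASK it is claimed to represent.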
There are assumptions and decisions made about what a state of knowledge is, or could be; about how, and even whether, some verbal description of an ASK can be elicited; about the nature of relations between concepts in a state of knowledge; about the relationship between the distance between words in a text and association strength of concepts in a state of knowledge; and many more. These assumptions build one upon the other in an elaborate inference chain, so that the end product, the representation, is only tenuously related, and in very