Persistence

Introduction

The first working version of IRF had no support for persistent storage: nothing produced by the framework/application was saved when the application was shut down and, even while the application was running, all data produced was maintained only as objects in memory. For the framework/application to be at all useful, we needed to enhance this in-memory-only version so that objects (e.g., indexes) could exist in a persistent repository after the application was shut down. We also needed support for persistence while the application was running: we wanted to build a sample application that could index and search hundreds of megabytes, if not several gigabytes, of data, which when represented in the complex object structures of IRF would balloon to several times that amount.

Although we planned to use a file-based persistence scheme for our sample application, we wanted to allow other application developers to use other persistence mechanisms with minimal changes to the framework. In solving this problem we drew significantly on Larman's discussion (pp. 455-486) of a persistence framework. Note too that we did not try to replicate all the functionality of a full object database, only the minimum we thought was needed by a research IR system with emphasis on flexible modeling rather than realistic operational adequacy. For example, we assumed documents could be added but not deleted or changed, and we did not implement commit/rollback.

To isolate most of the code from knowledge of whether an object exists on disk, in memory, or both, and to allow for gradual, on-demand creation, we implemented the virtual proxy pattern. See Gamma, Helm, Johnson, and Vlissides (pp. 207-217) or Buschmann, Meunier, Rohnert, Sommerlad, and Stal (pp. 263-275) for more information on this pattern. Most of the important objects are dealt with indirectly via proxy objects, which act as smart references: they pass method calls to the associated real objects and fetch a real object only on demand, when it is not already present in memory. Complex objects such as those representing documents or indexing features are built using proxies and so can be materialized gradually as needed. Here are the classes of proxies that inherit directly from VirtualProxy. These classes are in turn subclassed.

All proxies inherit from the VirtualProxy class which models the state and behavior common to all proxies: the real object reference, a unique object identifier (Oid) and methods for making an object persistent, for fetching the real object, for making the proxy lightweight (by detaching the real object), etc. Fetching/reading and writing the real object is handled by a broker which can be specific to the class and the storage scheme.
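The state and behavior described above can be illustrated with a minimal sketch. This is not the IRF source: the names Broker, MapBroker, SketchProxy, getReal, makePersistent, and makeLightweight are hypothetical, the Oid is assumed to be a long, and an in-memory map stands in for real storage.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical broker interface: fetches and stores real objects by Oid. */
interface Broker<T> {
    T fetch(long oid);
    void store(long oid, T realObject);
}

/** Toy in-memory broker standing in for a file-based one; counts fetches. */
class MapBroker<T> implements Broker<T> {
    private final Map<Long, T> storage = new HashMap<>();
    int fetchCount = 0;

    public T fetch(long oid) {
        fetchCount++;
        return storage.get(oid);
    }

    public void store(long oid, T realObject) {
        storage.put(oid, realObject);
    }
}

/** Sketch of a virtual proxy: holds an Oid and, optionally, the real object. */
class SketchProxy<T> {
    private final long oid;
    private final Broker<T> broker;
    private T real; // null while the proxy is lightweight

    SketchProxy(long oid, Broker<T> broker) {
        this.oid = oid;
        this.broker = broker;
    }

    long getOid() { return oid; }

    boolean isMaterialized() { return real != null; }

    /** Fetch the real object on demand, only if it is not already in memory. */
    T getReal() {
        if (real == null) {
            real = broker.fetch(oid);
        }
        return real;
    }

    /** Write the real object through the broker so it survives detachment. */
    void makePersistent() {
        if (real != null) {
            broker.store(oid, real);
        }
    }

    /** Detach the real object, leaving only the Oid in memory. */
    void makeLightweight() {
        real = null;
    }
}
```

In this sketch a caller only ever holds the proxy; whether the real object is currently in memory is invisible outside getReal().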

For detailed information about how persistence is supported see the main classes at the top of the hierarchies involved:

More on VirtualProxy

Here are the main methods in VirtualProxy and some notes about when to use them as is or override them.

More on brokers

The brokers are part of a hierarchy which allows each class to be treated differently for each sort of persistence mechanism (e.g., file-based, relational, etc.). Since brokers can be class-specific, they can be tailored to efficiently (de)serialize objects of that class. We implemented only file-based brokers, beginning with ones which used Java's writeObject()/readObject() methods and later replacing those with less generic methods. The extension of the broker hierarchy to use other means of persistence has not been tested and may require some modification. The difficulties we encountered during development of the brokers themselves were small compared to the general problem of how to avoid turning multiple references to one object into multiple copies of that object when the referring objects are deserialized.
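To illustrate the difference between generic and class-specific (de)serialization, here is a small sketch. PersonNameSketch and PersonNameCodec are hypothetical names, not IRF classes, and the field layout is an assumption; the sketch writes to byte arrays rather than files so that it stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical real object; not an actual IRF class. */
class PersonNameSketch implements Serializable {
    final String family;
    final String given;

    PersonNameSketch(String family, String given) {
        this.family = family;
        this.given = given;
    }
}

/** Class-specific (de)serialization, in the spirit of a tailored broker. */
class PersonNameCodec {
    /** Write only the two fields, with no class metadata. */
    static byte[] write(PersonNameSketch p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(p.family);
            out.writeUTF(p.given);
        }
        return bytes.toByteArray();
    }

    /** Read the fields back in the order they were written. */
    static PersonNameSketch read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String family = in.readUTF();
            String given = in.readUTF();
            return new PersonNameSketch(family, given);
        }
    }

    /** Generic Java serialization via writeObject(), for size comparison only. */
    static byte[] writeGeneric(PersonNameSketch p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        return bytes.toByteArray();
    }
}
```

The generic form carries a stream header and a class descriptor with every top-level write, which is one reason a hand-written format can be more compact for many small objects.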

Avoiding duplication of objects

When a proxy is moved to persistent storage (as part of a containing object) only the Oid of the proxy's real object is stored. Multiple references to a single proxy in memory are represented in persistent storage as multiple instances of the single Oid. When the references need to be reconstructed in memory during materialization, a table of in-memory proxies keyed by Oid is maintained so that the situation before dematerialization (multiple references to a single proxy) can be recreated and the alternative (multiple proxies with the same Oid) can be avoided. The table keeps a reference count and removes its entry for a proxy when the reference count reaches 0, thus making the proxy available for collection.
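A reference-counted table of this kind might look like the following sketch. OidProxy, ProxyTable, register, and unregister are hypothetical names chosen for illustration, and the Oid is assumed to be a long; the actual IRF methods may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a proxy; only the Oid matters here. */
class OidProxy {
    final long oid;

    OidProxy(long oid) {
        this.oid = oid;
    }
}

/** Sketch of a table of in-memory proxies keyed by Oid, with reference counts. */
class ProxyTable {
    private static final class Entry {
        final OidProxy proxy;
        int refCount;

        Entry(OidProxy proxy) {
            this.proxy = proxy;
        }
    }

    private final Map<Long, Entry> entries = new HashMap<>();

    /** Return the single proxy for this Oid, creating it on first reference. */
    OidProxy register(long oid) {
        Entry e = entries.computeIfAbsent(oid, k -> new Entry(new OidProxy(k)));
        e.refCount++;
        return e.proxy;
    }

    /** Drop one reference; remove the entry when no references remain. */
    void unregister(long oid) {
        Entry e = entries.get(oid);
        if (e != null && --e.refCount == 0) {
            entries.remove(oid);
        }
    }

    boolean contains(long oid) {
        return entries.containsKey(oid);
    }
}
```

Two containing objects that are deserialized with the same stored Oid both call register() and receive the same proxy instance, so the sharing that existed before dematerialization is restored.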

The following two methods, provided only by VirtualProxy, maintain the table of in-memory proxies:

Finalize methods for real objects which contain references to proxies unregister the references to those proxies from the in-memory proxies table if the Boolean (proxyRefsCounted) indicates the references were registered.

The following methods use the table of in-memory proxies to avoid duplication of proxies. One or the other is used, never both on the same proxy:

Uses of persistent storage

The instances of FileBroker we implemented are responsible for managing persistent versions of most classes of objects (data elements, IR documents, indexes, indexing feature lists, and indexing features). The sample application's "indexDir" parameter controls their location.

Basic persistent data objects produced by the installation test:

     Bytes Date         Filename Use
     ----- ------------ -------- ----------------
       240 Dec 30 08:10 DB.HCI   HciDocs
     18283 Dec 30 08:10 DB.HTML  DeHtmls
       300 Dec 30 08:10 DB.IF    IndexingFeatures
      2280 Dec 30 08:10 DB.IFWS  String feature
     33925 Dec 30 08:10 DB.IFWT  Text feature
      9756 Dec 30 08:10 DB.Indx  Indexes
       612 Dec 30 08:10 DB.Per   PersonNames
      7636 Dec 30 08:10 DB.Str   Strings

But there are other classes, not derived from Broker, that manage persistent storage of themselves and their components.

PersistentDualKeyContainer files

The following files, as produced by the installation test, are created/managed by a PersistentDualKeyContainer, the heart of the current Index/IdxIntern class. (The sample application's "indexDir" parameter controls their location.)

Indexes (two files per index: sBv (sources by value) and vBs (value by source)):

       517 Dec 30 08:10 DBauthorindexsBv0
       517 Dec 30 08:10 DBauthorindexvBs0
     13692 Dec 30 08:10 DBdocAbstractindexsBv0
       517 Dec 30 08:10 DBdocAbstractindexvBs0
      2017 Dec 30 08:10 DBtitleindexsBv0
       517 Dec 30 08:10 DBtitleindexvBs0
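
The two-file layout suggests a container that can be read in either direction. The following in-memory sketch shows the general idea only; it is not the PersistentDualKeyContainer implementation. DualKeySketch and its method names are hypothetical, and it assumes that a "source" is something like a document identifier and a "value" is an indexed feature value.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of a dual-key container: every pair is reachable from either key. */
class DualKeySketch<S extends Comparable<S>, V extends Comparable<V>> {
    // Assumed to correspond to the "sBv" side: value -> sources.
    private final Map<V, Set<S>> sourcesByValue = new HashMap<>();
    // Assumed to correspond to the "vBs" side: source -> values.
    private final Map<S, Set<V>> valuesBySource = new HashMap<>();

    /** Record one (source, value) pair in both directions. */
    void add(S source, V value) {
        sourcesByValue.computeIfAbsent(value, k -> new TreeSet<>()).add(source);
        valuesBySource.computeIfAbsent(source, k -> new TreeSet<>()).add(value);
    }

    /** All sources in which the given value occurs. */
    Set<S> sourcesFor(V value) {
        return sourcesByValue.getOrDefault(value, Collections.emptySet());
    }

    /** All values recorded for the given source. */
    Set<V> valuesFor(S source) {
        return valuesBySource.getOrDefault(source, Collections.emptySet());
    }
}
```

Keeping both directions costs extra storage but lets a lookup start from either a value or a source without scanning the whole index.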

Feature list pools (two files per index):

       260 Dec 30 08:10 DB.pool.autho.SbV
       260 Dec 30 08:10 DB.pool.autho.VbS
     19175 Dec 30 08:10 DB.pool.docAb.SbV
     19175 Dec 30 08:10 DB.pool.docAb.VbS
      1560 Dec 30 08:10 DB.pool.title.SbV
      1560 Dec 30 08:10 DB.pool.title.VbS

File name codes:

       357 Dec 30 08:10 DBfileNames

IRF management files

The following files, as produced by the installation test, contain information used to manage IRF resources. (The sample application's "irmDir" parameter controls their location.)

The following file is the serialized last-used-Oid, saved by the IrfManager:

         4 Dec 30 08:10 DB.Oid
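
A 4-byte file is what one would get by writing a single 32-bit integer. As a sketch under that assumption (the real Oid representation is not stated here), a last-used-Oid counter could be saved and restored as follows; OidCounterSketch and its methods are hypothetical names.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of a persistent Oid counter, assuming Oids are 32-bit integers. */
class OidCounterSketch {
    private int lastUsed;

    OidCounterSketch(int lastUsed) {
        this.lastUsed = lastUsed;
    }

    /** Hand out the next unused Oid. */
    int next() {
        return ++lastUsed;
    }

    int lastUsed() {
        return lastUsed;
    }

    /** Write the last-used Oid as one big-endian int (4 bytes). */
    void save(Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(lastUsed);
        }
    }

    /** Restore a counter from a file written by save(). */
    static OidCounterSketch load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            return new OidCounterSketch(in.readInt());
        }
    }
}
```

Restoring the counter at startup ensures that Oids issued in a new session do not collide with those already stored.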

The following file is created/managed by the HandlesByOid class to store the mapping of handles to object identifiers. If the handles were not all of the same length, there would be an additional index file (DB.HanI) as well.

     54936 Dec 30 08:10 DB.HanD

The following file is the serialized InfoServer, created/managed by the IrfManager. It is used to find all the other persistent objects.

      1031 Dec 30 08:10 DB.Info

National Institute of Standards and Technology
Last updated: Tuesday, 01-Aug-2000 06:34:30 MDT

Date created: Monday, 31-Jul-00
For further information contact Paul Over (over@nist.gov) with
copy to Darrin Dimmick (ddimmick@nist.gov)