Persistence

Introduction

The first working version of IRF had no support for persistent storage: nothing produced by the framework/application was saved when the application was shut down and, even while the application was running, all data produced was maintained only as objects in memory. In order for the framework/application to be at all useful, we needed to enhance this in-memory-only version so that objects (e.g., indexes) could exist in a persistent repository after it was shut down. But we also needed support for persistence while the application was running - we wanted to build a sample application that could index and search hundreds of megabytes if not several gigabytes of data, which when represented in the complex object structures of IRF would balloon to several times that amount.

Although we planned to use a file-based persistence scheme for our sample application, we wanted to allow other application developers to use other sorts of support for persistence with minimal changes to the framework. In solving this problem we drew significantly on Larman's discussion (pp. 455-486) of a persistence framework. Note too that we did not try to replace all the functionality of a full object database - only the minimum we thought was needed by a research IR system with emphasis on flexible modeling rather than realistic operational adequacy. For example, we assumed documents could be added but not deleted or changed, and we did not implement commit/rollback.

To isolate most of the code from knowledge of whether an object exists on disk, in memory, or both, and to allow for gradual, on-demand creation, we implemented the virtual proxy pattern. See Gamma, Helm, Johnson, and Vlissides (pp. 207-217) or Buschmann, Meunier, Rohnert, Sommerlad, and Stal (pp. 263-275) for more information on this pattern. Most of the important objects are dealt with indirectly via proxy objects, which act as smart references: they pass method calls to the associated real objects and fetch the real objects only on demand, when they are not already present in memory. Complex objects such as those representing documents or indexing features are built using proxies and so can be materialized gradually as needed. Here are the classes of proxies that inherit directly from VirtualProxy; these classes are in turn subclassed:

ProxyDocument
ProxyDataElem
ProxyIndex
ProxyFeatureList
ProxyIndexingFeature

All proxies inherit from the VirtualProxy class which models the state and behavior common to all proxies: the real object reference, a unique object identifier (Oid) and methods for making an object persistent, for fetching the real object, for making the proxy lightweight (by detaching the real object), etc. Fetching/reading and writing the real object is handled by a broker which can be specific to the class and the storage scheme.
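
As a rough illustration of the state and behavior just described, here is a minimal sketch of a virtual proxy and its broker. It is not the IRF code: SketchBroker, MapBroker, and SketchProxy are hypothetical names, a plain long stands in for the Oid class, an in-memory map stands in for file storage, and the guard in makeLightweight is a choice made only for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical broker contract: stores and fetches real objects by Oid.
interface SketchBroker {
    long store(Object real);   // returns the Oid assigned to the stored object
    Object fetch(long oid);    // materializes the object stored under an Oid
}

// In-memory stand-in for a file-based broker; used only to keep the sketch runnable.
class MapBroker implements SketchBroker {
    private final Map<Long, Object> storage = new HashMap<>();
    private long lastOid = 0;
    int fetchCount = 0;        // how many times "storage" was read

    public long store(Object real) {
        long oid = ++lastOid;
        storage.put(oid, real);
        return oid;
    }

    public Object fetch(long oid) {
        fetchCount++;
        return storage.get(oid);
    }
}

// Hypothetical proxy holding the real object reference and an Oid.
class SketchProxy {
    private Object realObject;            // null while the proxy is lightweight
    private long oid = -1;                // -1 means "not yet persistent" (sketch convention)
    private final SketchBroker broker;

    SketchProxy(Object realObject, SketchBroker broker) {
        this.realObject = realObject;
        this.broker = broker;
    }

    void makePersistent() {
        if (oid < 0) {                    // invocations beyond the first have no effect
            oid = broker.store(realObject);
        }
    }

    Object getRealObject() {
        if (realObject == null) {         // fetch only when not already in memory
            realObject = broker.fetch(oid);
        }
        return realObject;
    }

    void makeLightweight() {
        // Sketch-only guard: detaching a non-persistent object would lose it.
        if (oid < 0) {
            throw new IllegalStateException("object is not persistent");
        }
        realObject = null;                // real object becomes collectable
    }

    long getOid() { return oid; }
}

class SketchProxyDemo {
    public static void main(String[] args) {
        MapBroker broker = new MapBroker();
        SketchProxy doc = new SketchProxy("title: Persistence", broker);
        doc.makePersistent();
        doc.makeLightweight();
        System.out.println(doc.getRealObject());  // fetched back through the broker
        System.out.println("fetches: " + broker.fetchCount);
    }
}
```

Client code only ever holds the proxy, so it does not need to know whether the real object is currently in memory.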

For detailed information about how persistence is supported see the main classes at the top of the hierarchies involved:

gov.nist.nlpir.irf.proxy.VirtualProxy
gov.nist.nlpir.irf.broker.Broker
gov.nist.nlpir.irf.handle.Handle
gov.nist.nlpir.irf.proxy.Oid

More on VirtualProxy

Here are the main methods in VirtualProxy and some notes about when to use them as is or override them.

public void makePersistent()

This method of a proxy must make the real object directly associated with a proxy persistent and see that the same is done recursively for all members of the real object that are proxies, etc.
The default method in VirtualProxy does not handle member proxies, so every proxy class which contains one or more proxies must override this method to make the recursive call(s) on its contained proxies. The method has to be recursive; otherwise an object could not be reconstructed later, because some of its components would be missing from persistent storage.
Invocations beyond the first for a given proxy have no effect.
When an object with a proxy is made persistent, the broker assigns a handle to the real object and an object identifier (Oid) is stored in the proxy. The reference to the proxy is added to a table of in-memory proxies by Oid. See below for information about how this table is used to avoid duplicate instantiations as the result of materialization.
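
The override rule above can be sketched as follows. This is an illustration under assumptions, not the IRF implementation: BaseProxy, DocumentProxy, and SketchDocument are hypothetical names, a static counter stands in for the broker, and persisting the members before the container is a choice of this sketch (it means the members' Oids exist by the time the container is stored).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base proxy: its default makePersistent ignores member proxies.
class BaseProxy {
    static long lastOid = 0;                              // stand-in for broker-assigned Oids
    static final List<String> log = new ArrayList<>();    // records the order of stores

    protected final Object real;
    private long oid = -1;                                // -1 means "not yet persistent"

    BaseProxy(Object real) { this.real = real; }

    boolean isPersistent() { return oid >= 0; }

    void makePersistent() {
        if (isPersistent()) return;                       // repeat invocations have no effect
        oid = ++lastOid;
        log.add("stored " + real);
    }
}

// Hypothetical real object whose members are themselves proxies.
class SketchDocument {
    final String id;
    final List<BaseProxy> parts = new ArrayList<>();

    SketchDocument(String id) { this.id = id; }

    @Override
    public String toString() { return "document " + id; }
}

// A proxy whose real object contains proxies must override makePersistent.
class DocumentProxy extends BaseProxy {
    DocumentProxy(SketchDocument doc) { super(doc); }

    @Override
    void makePersistent() {
        if (isPersistent()) return;
        // Recursive step: persist every contained proxy first.
        for (BaseProxy part : ((SketchDocument) real).parts) {
            part.makePersistent();
        }
        super.makePersistent();                           // then the container itself
    }
}

class RecursivePersistDemo {
    public static void main(String[] args) {
        SketchDocument doc = new SketchDocument("d1");
        doc.parts.add(new BaseProxy("title"));
        doc.parts.add(new BaseProxy("abstract"));
        new DocumentProxy(doc).makePersistent();
        BaseProxy.log.forEach(System.out::println);
    }
}
```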

public Object getRealObject()

This method of a proxy returns the proxy's real object. If the real object is no longer in memory, this method initiates the series of calls necessary to get it back from persistent storage.

public void makeLightweight()

This method of a proxy cuts the link to the proxy's real object so that the real object is available for garbage collection.

It shouldn't be necessary to design a recursive makeLightweight for every proxy container class, since contained objects with no references to them from outside the container become available for collection as soon as the container is available. But with the current Solaris Java 1.2 VM and garbage collector, it seems possible that making contained proxies lightweight speeds their collection. Choosing which objects should have a recursive makeLightweight method isn't easy. Right now, only Documents and ProxyFeatureLists have one.

public void replaceRealObject()

This method updates the object in persistent storage. Right now, it doesn't check whether the object has changed since it was last saved. It should eventually do so, and should attempt to write the new object in the same place if possible.

More on brokers

The brokers are part of a hierarchy which allows each class to be treated differently for each sort of persistence mechanism (e.g., file-based, relational). Since brokers can be class-specific, they can be tailored to efficiently (de)serialize objects of that class. We implemented only file-based brokers, beginning with ones which used Java's writeObject()/readObject() methods and later replacing those with less generic methods. The extension of the broker hierarchy to use other means of persistence has not been tested and may require some modification. The difficulties we encountered during development of the brokers themselves were small compared to the general problem of how to avoid turning multiple references to one object into multiple copies of that object when the referring objects are deserialized.
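
To make the difference between generic and class-specific (de)serialization concrete, here is a small sketch. All names (ByteBroker, PersonNameBroker, SketchPersonName, GenericBytes) are hypothetical and do not come from the IRF source, and byte arrays stand in for the files a file-based broker would use. The class-specific broker writes only the two fields, while writeObject() also writes a stream header and a class description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical real class handled by a broker.
class SketchPersonName implements Serializable {
    final String first;
    final String last;

    SketchPersonName(String first, String last) {
        this.first = first;
        this.last = last;
    }
}

// Hypothetical broker contract: one implementation per class and storage scheme.
interface ByteBroker<T> {
    byte[] write(T obj);
    T read(byte[] bytes);
}

// Class-specific broker: knows the field layout, so it stores no class metadata.
class PersonNameBroker implements ByteBroker<SketchPersonName> {
    public byte[] write(SketchPersonName n) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(n.first);
            out.writeUTF(n.last);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public SketchPersonName read(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String first = in.readUTF();   // fields are read in the order they were written
            String last = in.readUTF();
            return new SketchPersonName(first, last);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Generic alternative based on Java's built-in object serialization.
class GenericBytes {
    static byte[] write(Serializable obj) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }
}

class BrokerSizeDemo {
    public static void main(String[] args) {
        SketchPersonName name = new SketchPersonName("Ada", "Lovelace");
        System.out.println("class-specific bytes: " + new PersonNameBroker().write(name).length);
        System.out.println("writeObject bytes:    " + GenericBytes.write(name).length);
    }
}
```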

Avoiding duplication of objects

When a proxy is moved to persistent storage (as part of a containing object) only the Oid of the proxy's real object is stored. Multiple references to a single proxy in memory are represented in persistent storage as multiple instances of the single Oid. When the references need to be reconstructed in memory during materialization, a table of in-memory proxies keyed by Oid is maintained so that the situation before dematerialization (multiple references to a single proxy) can be recreated and the alternative (multiple proxies with the same Oid) can be avoided. The table keeps a reference count and removes its entry for a proxy when the reference count reaches 0, thus making the proxy available for collection.

The following two methods, provided only by VirtualProxy, maintain the table of in-memory proxies:

public VirtualProxy addRefToInMemoryProxiesByOid()

When a class makes a reference to a proxy or contains a Java container (e.g., Vector) that makes a reference to a proxy, the class must call this method of the proxy to register the reference. This could happen in various methods such as set methods, add methods, etc. When the references are registered, a Boolean (proxyRefsCounted) is set to true.

If a class only has a local variable referring to a proxy, it shouldn't call this method on the proxy. Examples of classes that do need to call it are subclasses of Document, PersistentDualKeyContainer (for the ProxyFeatureLists), ProxyFeatureList (for the objects it contains), and IndexingFeature.

Constructors do not call addRef... because only references to proxies of persistent objects are registered - since only these objects can be materialized. The constructor cannot know whether the object being constructed will always, sometimes, or never be made persistent. The client controls this decision via the makePersistent method, which registers references to proxies at that point. This method also sets the proxyRefsCounted boolean in the real object (see finalize).

Because the proxyRefsCounted boolean exists in every real class containing proxies and its default value is false, the broker for a real class must set it to true when an object of that class comes back from disk. Otherwise, a real object that came back from disk would not remove the references it had when it is finalized.

public void deleteRefFromInMemoryProxiesByOid()

This method must be called by the same classes that call addRef..., both in their finalize() method and in other methods, e.g., any which may overwrite an existing reference to a proxy. This way, just before disappearing, a container class tells every proxy it contains that it will no longer be referring to the proxy, and the reference count in the table of in-memory proxies is adjusted accordingly.

Finalize methods for real objects which contain references to proxies unregister the references to those proxies from the in-memory proxies table if the Boolean (proxyRefsCounted) indicates the references were registered.
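
The registration protocol described above can be sketched roughly as follows. The names (InMemoryProxies, CountedProxy, SketchContainer) are hypothetical, and the real methods live on VirtualProxy rather than on a separate table class. Two further assumptions: addRef returning the registered instance is a guess at why the real method returns a VirtualProxy, and an explicit release() stands in for finalize() so the behavior is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical proxy reduced to its Oid.
class CountedProxy {
    final long oid;
    CountedProxy(long oid) { this.oid = oid; }
}

// Hypothetical table of in-memory proxies keyed by Oid, with reference counts.
class InMemoryProxies {
    private static final class Entry {
        final CountedProxy proxy;
        int refs;
        Entry(CountedProxy proxy) { this.proxy = proxy; }
    }

    private final Map<Long, Entry> byOid = new HashMap<>();

    // Registers one reference and returns the instance registered for this Oid.
    CountedProxy addRef(CountedProxy proxy) {
        Entry e = byOid.computeIfAbsent(proxy.oid, k -> new Entry(proxy));
        e.refs++;
        return e.proxy;
    }

    // Unregisters one reference; the entry is dropped when the count reaches 0.
    void deleteRef(CountedProxy proxy) {
        Entry e = byOid.get(proxy.oid);
        if (e != null && --e.refs == 0) {
            byOid.remove(proxy.oid);
        }
    }

    int refCount(long oid) {
        Entry e = byOid.get(oid);
        return e == null ? 0 : e.refs;
    }
}

// Hypothetical real object holding a reference to one proxy.
class SketchContainer {
    private final InMemoryProxies table;
    private CountedProxy member;
    private boolean proxyRefsCounted;      // true once references are registered

    SketchContainer(InMemoryProxies table) { this.table = table; }

    // Set method: unregisters an overwritten reference, registers the new one.
    void setMember(CountedProxy p) {
        if (proxyRefsCounted && member != null) table.deleteRef(member);
        member = p;
        if (proxyRefsCounted && p != null) table.addRef(p);
    }

    // References are registered only once the object is made persistent.
    void makePersistent() {
        if (proxyRefsCounted) return;
        if (member != null) table.addRef(member);
        proxyRefsCounted = true;
    }

    // Stand-in for finalize(): unregisters only if references were registered.
    void release() {
        if (proxyRefsCounted && member != null) table.deleteRef(member);
        proxyRefsCounted = false;
    }
}

class RefCountDemo {
    public static void main(String[] args) {
        InMemoryProxies table = new InMemoryProxies();
        CountedProxy shared = new CountedProxy(7L);
        SketchContainer a = new SketchContainer(table);
        SketchContainer b = new SketchContainer(table);
        a.setMember(shared);
        b.setMember(shared);
        a.makePersistent();
        b.makePersistent();
        System.out.println("refs after registering: " + table.refCount(7L));
        a.release();
        b.release();
        System.out.println("refs after release: " + table.refCount(7L));
    }
}
```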

The following methods use the table of in-memory proxies to avoid duplication of proxies. One or the other is used, never both on the same proxy:

public static VirtualProxy getProxyFor(Class proxyClass, long objectIdentifier)

This method is used only by brokers, when materializing proxies contained in the real object they are responsible for. The method returns the proxy of the given class with the given Oid. It returns the already existing proxy if there is one; otherwise it builds the proxy from scratch using the no-argument constructor. It increments the number of references this proxy has, so NO call to addRef... is necessary.

public VirtualProxy getFirstInstance()

Sometimes an object comes back from disk via the usual readObject or defaultReadObject methods. A proxy created this way may duplicate an already existing one and, after creation, must be replaced with the return value of this method. This way, the redundant proxy can be collected and the proxy for a given object will be unique.
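
The two lookup paths just described might look roughly like this. UniqueProxy and UniqueDocProxy are hypothetical names, the static map stands in for the table of in-memory proxies, and a simple counter on the proxy stands in for the table's reference count. The text does not say whether getFirstInstance also adjusts the reference count, so this sketch leaves the count alone on that path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical proxy base class with the two lookup methods.
class UniqueProxy {
    // Stand-in for the table of in-memory proxies keyed by Oid.
    private static final Map<Long, UniqueProxy> IN_MEMORY = new HashMap<>();

    long oid = -1;
    int refs = 0;

    UniqueProxy() { }   // no-argument constructor, needed by getProxyFor

    // Broker path: reuse the proxy registered for this Oid, or build a new one.
    static UniqueProxy getProxyFor(Class<? extends UniqueProxy> proxyClass, long oid) {
        UniqueProxy proxy = IN_MEMORY.get(oid);
        if (proxy == null) {
            try {
                proxy = proxyClass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot instantiate " + proxyClass, e);
            }
            proxy.oid = oid;
            IN_MEMORY.put(oid, proxy);
        }
        proxy.refs++;   // counted here, so the caller makes no separate addRef call
        return proxy;
    }

    // Deserialization path: this proxy already exists and may be a duplicate.
    // Returns the proxy registered first for the same Oid, or this one if none was.
    UniqueProxy getFirstInstance() {
        UniqueProxy first = IN_MEMORY.putIfAbsent(oid, this);
        return first == null ? this : first;
    }
}

// Hypothetical concrete proxy class.
class UniqueDocProxy extends UniqueProxy { }

class UniqueProxyDemo {
    public static void main(String[] args) {
        UniqueProxy a = UniqueProxy.getProxyFor(UniqueDocProxy.class, 42L);
        UniqueProxy b = UniqueProxy.getProxyFor(UniqueDocProxy.class, 42L);
        System.out.println("same proxy from brokers: " + (a == b));

        UniqueProxy duplicate = new UniqueDocProxy();   // as if just deserialized
        duplicate.oid = 42L;
        UniqueProxy kept = duplicate.getFirstInstance();
        System.out.println("duplicate replaced: " + (kept == a));
    }
}
```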

Uses of persistent storage

The instances of FileBroker we implemented are responsible for managing persistent versions of most classes of objects (data elements, IR documents, indexes, indexing feature lists, and indexing features). The sample application's "indexDir" parameter controls their location.

Basic persistent data objects produced by the installation test:

     Bytes Modified     Filename Use
     ----- ------------ -------- ----------------
       240 Dec 30 08:10 DB.HCI   HciDocs
     18283 Dec 30 08:10 DB.HTML  DeHtmls
       300 Dec 30 08:10 DB.IF    IndexingFeatures
      2280 Dec 30 08:10 DB.IFWS   String feature
     33925 Dec 30 08:10 DB.IFWT   Text feature
      9756 Dec 30 08:10 DB.Indx  Indexes
       612 Dec 30 08:10 DB.Per   PersonNames
      7636 Dec 30 08:10 DB.Str   Strings

But there are other classes, not derived from Broker, that manage persistent storage of themselves and their components.

PersistentDualKeyContainer files

The following files, as produced by the installation test, are created/managed by a PersistentDualKeyContainer, the heart of the current Index/IdxIntern class. (The sample application's "indexDir" parameter controls their location.)

Indexes (two files per index: sBv (sources by value) and vBs (values by source)):

       517 Dec 30 08:10 DBauthorindexsBv0
       517 Dec 30 08:10 DBauthorindexvBs0
     13692 Dec 30 08:10 DBdocAbstractindexsBv0
       517 Dec 30 08:10 DBdocAbstractindexvBs0
      2017 Dec 30 08:10 DBtitleindexsBv0
       517 Dec 30 08:10 DBtitleindexvBs0

Feature list pools (two files per index):

       260 Dec 30 08:10 DB.pool.autho.SbV
       260 Dec 30 08:10 DB.pool.autho.VbS
     19175 Dec 30 08:10 DB.pool.docAb.SbV
     19175 Dec 30 08:10 DB.pool.docAb.VbS
      1560 Dec 30 08:10 DB.pool.title.SbV
      1560 Dec 30 08:10 DB.pool.title.VbS

File name codes:

       357 Dec 30 08:10 DBfileNames

IRF management files

The following files, as produced by the installation test, contain information used to manage IRF resources. (The sample application's "irmDir" parameter controls their location.)

The following file is the serialized last-used-Oid, saved by the IrfManager:

         4 Dec 30 08:10 DB.Oid

The following file is created/managed by the HandlesByOid class to store the mapping of handles to object identifiers. If the handles were not all of the same length, there would be an additional index file (DB.HanI) as well:

     54936 Dec 30 08:10 DB.HanD
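
The reason equal-length handles need no separate index file can be illustrated with a small sketch: when every record has the same size, the position of the handle for a given Oid can be computed directly. FixedHandleTable is a hypothetical name, the 8-byte handle size is an assumption made only for this sketch, and a byte buffer stands in for the DB.HanD file.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-length handle table: the handle for Oid n is stored
// at byte offset n * HANDLE_BYTES, so no index is needed to find it.
class FixedHandleTable {
    static final int HANDLE_BYTES = Long.BYTES;   // assumed handle size (sketch only)

    private ByteBuffer data = ByteBuffer.allocate(16 * HANDLE_BYTES);

    void put(int oid, long handle) {
        int offset = oid * HANDLE_BYTES;
        if (offset + HANDLE_BYTES > data.capacity()) {
            grow(offset + HANDLE_BYTES);
        }
        data.putLong(offset, handle);             // absolute write at the computed offset
    }

    long get(int oid) {
        return data.getLong(oid * HANDLE_BYTES);  // absolute read at the computed offset
    }

    // Replaces the buffer with a larger one, keeping the existing bytes.
    private void grow(int needed) {
        ByteBuffer bigger = ByteBuffer.allocate(Math.max(needed, data.capacity() * 2));
        bigger.put(data.array());
        data = bigger;
    }
}

class FixedHandleDemo {
    public static void main(String[] args) {
        FixedHandleTable table = new FixedHandleTable();
        table.put(3, 1234L);
        table.put(40, 99L);                       // beyond the initial capacity
        System.out.println(table.get(3));
        System.out.println(table.get(40));
    }
}
```

With variable-length handles the offset could no longer be computed from the Oid alone, which is where an index of record positions would come in.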

The following file is the serialized InfoServer, created/managed by the IrfManager. It is used to find all the other persistent objects.

      1031 Dec 30 08:10 DB.Info

Last updated: Tuesday, 01-Aug-2000 06:34:30 MDT

Date created: Monday, 31-Jul-00
For further information contact Paul Over (over@nist.gov) with
copy to Darrin Dimmick (ddimmick@nist.gov)