eXtremeDB Prioritized Cache

As explained on the Persistent Database I/O page, for persistent databases the I/O operations (reading from and writing to the persistent media) are the most “expensive” operations in performance terms. To minimize the effect of I/O, eXtremeDB implements a Disk Manager (DM) Cache that interacts with the Operating System’s file system cache. By default, eXtremeDB uses a variant of the CLOCK caching algorithm called Prioritized Cache. It is also possible to choose a Least Recently Used (LRU) algorithm by rebuilding the eXtremeDB runtime with an internal conditional compilation (#ifdef) switch set. (Please contact McObject Support for further details if the LRU algorithm is desired.)

The following sections explain some implementation details to help developers optimize the performance of the eXtremeDB Disk Manager for a specific application’s needs.

Implementation

Least Recently Used (LRU) Algorithm

The LRU policy is based on the principle of locality, which states that program and data references within a process tend to cluster. The LRU replacement policy selects for replacement the page that has not been referenced for the longest time. For years, LRU was considered the optimal online policy. The problem with this approach for small in-memory entities is the cost of implementation: one approach is to tag each page with the time of its last reference, but this must be done on every memory reference, both instruction and data. So the LRU policy performs nearly as well as an optimal policy, but imposes significant overhead.

Typical implementations of the LRU algorithm keep "age bits" for cached pages and track the "least recently used" page based on these bits. eXtremeDB improves on basic LRU cache policies by implementing a technique that allows applications to influence how long certain pages remain in the disk manager cache. The crux of the improvement is a cache priority property added to each page. When the LRU algorithm locates a "victim", instead of immediately releasing the page (removing it from the tail of the doubly-linked (L2) list), the algorithm inspects the page's caching_priority field. If the value is not zero, the caching_priority is decremented and the page is re-linked to the head of the L2 list. A caching priority of zero means the default behavior; a caching priority of 1 means the page will be moved from the head to the tail of the LRU list twice; a caching priority of 2 means three loops through the LRU list, and so on. The higher the priority, the longer the page remains linked to the LRU list (stays in cache).
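
To make the victim-selection loop concrete, here is a minimal C sketch of such a prioritized LRU list. The structure and field names are illustrative only; they do not reflect the actual eXtremeDB internals:

    #include <stddef.h>

    typedef struct page {
        struct page *prev, *next;   /* the two per-page list pointers */
        unsigned     caching_priority;
        /* ... page contents ... */
    } page_t;

    typedef struct {
        page_t *head;               /* most recently used */
        page_t *tail;               /* least recently used (victim candidate) */
    } lru_list_t;

    static void unlink_page(lru_list_t *lru, page_t *p) {
        if (p->prev) p->prev->next = p->next; else lru->head = p->next;
        if (p->next) p->next->prev = p->prev; else lru->tail = p->prev;
        p->prev = p->next = NULL;
    }

    static void link_head(lru_list_t *lru, page_t *p) {
        p->next = lru->head;
        p->prev = NULL;
        if (lru->head) lru->head->prev = p; else lru->tail = p;
        lru->head = p;
    }

    /* Select a page to evict: a page with a non-zero priority is given
       another pass through the list instead of being released. */
    page_t *select_victim(lru_list_t *lru) {
        for (;;) {
            page_t *victim = lru->tail;
            if (victim == NULL) return NULL;  /* cache is empty */
            unlink_page(lru, victim);
            if (victim->caching_priority == 0)
                return victim;                /* evict this page */
            victim->caching_priority--;       /* spend one "life" */
            link_head(lru, victim);           /* re-link at the head */
        }
    }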

However, the main drawbacks of this implementation are the relatively large memory overhead (two pointers per entity) and problems with concurrent access (manipulation of the L2 list requires a global lock). When LRU is used for database pages on persistent storage, the main problem is the flushing of useful data from the cache by a sequential scan. This can be solved by using a more complicated version of the LRU algorithm (for example, splitting the cache into two parts: one for frequently and one for rarely accessed objects) or by changing the sequential scan algorithm to use some kind of cyclic buffer. For these reasons, the default eXtremeDB caching algorithm is the CLOCK algorithm described below.

CLOCK Algorithm

In CLOCK, all page frames are arranged in a circular list that resembles the face of a clock, with a hand that points to the oldest page in the buffer. In the classic algorithm, each page has an associated reference bit that is set whenever the page is referenced. The page replacement policy is invoked on a page miss: the page pointed to by the hand (the oldest page) is inspected, and if its reference bit is set, the bit is cleared and the hand advances to the next oldest page. This process continues until a page with a zero reference bit is found. That page is removed from the buffer and the new page is brought into its place with the reference bit cleared.

The eXtremeDB variant replaces the reference bit with a counter. Every page access ("pin") bumps the counter by one until it reaches its limit (currently set to 5). Every full rotation of the hand decreases the counter by one. When the page is released ("unpin"), the counter is set to max(current_value, caching_priority + 1). Hence no fewer than caching_priority + 1 hand rotations are needed to remove the page from the cache: the higher the caching_priority, the more rounds the CLOCK algorithm must perform before the page is evicted. The caching_priority parameters are set when the database is opened.
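
The following sketch illustrates this prioritized CLOCK policy (pin, unpin, and hand rotation). The names and structure are illustrative, not the eXtremeDB implementation:

    #include <stddef.h>

    #define COUNTER_LIMIT 5   /* current limit on the per-page counter */

    typedef struct {
        unsigned counter;           /* reference counter */
        unsigned caching_priority;  /* set when the database is opened */
        int      pinned;
        /* ... page contents ... */
    } frame_t;

    typedef struct {
        frame_t *frames;
        size_t   n_frames;
        size_t   hand;              /* the clock hand */
    } clock_cache_t;

    /* Each access ("pin") bumps the counter up to the limit. */
    void pin(frame_t *f) {
        f->pinned = 1;
        if (f->counter < COUNTER_LIMIT)
            f->counter++;
    }

    /* On release ("unpin") the counter is raised to at least
       caching_priority + 1, so at least that many hand rotations are
       needed before the page can be evicted. */
    void unpin(frame_t *f) {
        f->pinned = 0;
        if (f->counter < f->caching_priority + 1)
            f->counter = f->caching_priority + 1;
    }

    /* Each visit by the hand decreases the counter; a page is evicted
       only when its counter has dropped to zero. Assumes at least one
       unpinned frame exists. */
    size_t find_victim(clock_cache_t *c) {
        for (;;) {
            frame_t *f    = &c->frames[c->hand];
            size_t   slot = c->hand;
            c->hand = (c->hand + 1) % c->n_frames;
            if (f->pinned)
                continue;           /* pinned pages are never evicted */
            if (f->counter == 0)
                return slot;        /* evict this frame */
            f->counter--;
        }
    }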

In effect, the CLOCK algorithm is better than the LRU implementation described above in three respects: it needs no per-page list pointers, so the memory overhead is smaller; a page reference merely updates the page's counter, so no global lock is required on every access; and the reference counter makes the cache more resistant to being flooded by a sequential scan. It has been observed that this CLOCK algorithm approximates LRU very efficiently with minimal overhead.

Cache Priority

Applications can assign the cache priority for specific database objects. At the time the database is created, the application can assign priorities to indexes, memory allocator bitmap pages and object pages (excluding BLOBs). By default all pages have the same priority (zero), but it is also possible to change the caching priority for a class (object) at runtime. Using the object priority, the relative priorities of specific classes can be adjusted: for example, large, rarely accessed objects can be assigned a lower priority, while small, frequently accessed classes can be assigned a higher priority. A caching priority assigned at runtime is stored in the database and is used until it is explicitly overwritten.
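
The sketch below shows how priorities might be assigned when the database is opened. The mco_db_params_init() call and the page-size and connection fields are standard eXtremeDB parameters; the caching-priority fields and the runtime per-class call shown in comments are hypothetical names used only for illustration (see the C API Cache Management page for the actual identifiers):

    #include "mco.h"

    void set_cache_priorities(mco_db_params_t *params)
    {
        mco_db_params_init(params);           /* standard eXtremeDB call */

        params->mem_page_size      = 256;     /* in-memory page size */
        params->disk_page_size     = 4096;    /* disk manager page size */
        params->db_max_connections = 10;

        /* hypothetical fields: keep index pages in cache longer than
           ordinary object pages (verify the real names in your SDK) */
        /* params->index_caching_priority  = 2; */
        /* params->object_caching_priority = 0; */
    }

    /* hypothetical runtime call to raise the priority of a hot class;
       the new value is stored in the database until overwritten:
       mco_disk_set_class_caching_priority(db, MyClass_code, 3);      */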

Other initialization parameters that can affect overall performance are the sizes specified for the cache and for the maximum disk space for the database. These are explained in the following sections.

Cache Size

The memory address and size for the cache are specified in the memory devices passed to the database open API. The memory can be either shared or local. (It must be shared memory if two or more processes are to share the database.) Generally a larger cache improves application performance, but the frequency of updates to the persistent media (flushing of cache pages) matters even more. How database updates are written to persistent media is determined by the Transaction Commit Policy.
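
For example, a persistent database is typically opened with one memory device for the in-memory part, one for the disk manager cache, and a file device for the database itself. The sketch below follows the device layout used in the eXtremeDB samples; treat the exact enumerator and field names as items to verify against the headers of your SDK version:

    #include "mco.h"
    #include <stdlib.h>
    #include <string.h>

    #define MEM_SIZE   (16 * 1024 * 1024)
    #define CACHE_SIZE (64 * 1024 * 1024)  /* larger cache, fewer I/O ops */

    void build_devices(mco_device_t dev[3])
    {
        /* in-memory portion of the database */
        dev[0].type         = MCO_MEMORY_CONV;
        dev[0].assignment   = MCO_MEMORY_ASSIGN_DATABASE;
        dev[0].size         = MEM_SIZE;
        dev[0].dev.conv.ptr = malloc(MEM_SIZE);

        /* disk manager cache (page pool); use a shared-memory device
           (MCO_MEMORY_NAMED) instead if several processes share the db */
        dev[1].type         = MCO_MEMORY_CONV;
        dev[1].assignment   = MCO_MEMORY_ASSIGN_CACHE;
        dev[1].size         = CACHE_SIZE;
        dev[1].dev.conv.ptr = malloc(CACHE_SIZE);

        /* the persistent database file itself */
        dev[2].type       = MCO_MEMORY_FILE;
        dev[2].assignment = MCO_MEMORY_ASSIGN_PERSISTENT;
        strcpy(dev[2].dev.file.name, "mydb.dbs");
        dev[2].dev.file.flags = 0;             /* default open flags */
    }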

Maximum Database Size

The eXtremeDB runtime uses the value of the database parameter MaxDiskDatabaseSize to allocate the “dirty pages bitmap”. The bitmap is allocated in the cache at the time the cache is created. Because the bitmap holds one bit per page, its size in bytes can be roughly calculated as:

    MaxDiskDatabaseSize / MemPageSize / 8
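
For example, with a 64 GB maximum database size and 4 KB pages (illustrative values), the bitmap occupies 64 GB / 4 KB / 8 = 2 MB of cache memory:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* illustrative values: 64 GB maximum database size, 4 KB pages */
        uint64_t max_disk_database_size = 64ULL << 30;
        uint64_t mem_page_size          = 4096;

        /* one bit per page, hence the division by 8 */
        uint64_t bitmap_bytes = max_disk_database_size / mem_page_size / 8;

        printf("%llu bytes\n", (unsigned long long)bitmap_bytes); /* 2097152 (2 MB) */
        return 0;
    }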

Reserve Page Pool

The eXtremeDB runtime deals with a cache overflow condition by adding a reserve pool of pages to the cache. In “debug” mode (using the debug disk manager library), the cache overflow condition triggers an ASSERT (a fatal error). But in release mode (using the release disk manager library), the disk manager logic assumes that the operation of adding a page to the page pool always succeeds (the return code of the MCO_PIN() operation is never verified). The reserve pool ensures that this assumption always holds: if it is impossible to allocate a page from the normal, unreserved space in the page pool, the disk manager allocates the page (or a number of pages) from the reserve pool. When that happens, the disk manager puts the current transaction into the error state. The error state is then detected, the transaction is rolled back, and the “out-of-memory” condition is returned to the application.

The page pool reservation mechanism facilitates out-of-memory error handling for the cache and ensures that if/when the database runtime runs out of page pool, active transactions can still be committed or rolled back. To accomplish this, the database runtime uses two threshold values: the low threshold, called the yellow zone, and the high threshold, called the red zone. The yellow zone value is calculated as 2/3 of the pool size, but no greater than the red zone value, while the red zone value is calculated based on the number of max_active_pages (32 by default) and the maximum number of connections.

Once the yellow threshold is crossed, the runtime attempts to offload (commit to the media) some pages modified or created by the transaction. Once the red threshold is crossed, the transaction is marked erroneous.
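
The following sketch models this two-threshold logic. The 2/3 factor, the default of 32 max_active_pages, and the red zone's dependence on the maximum number of connections come from the description above; the exact derivation of the red zone value (assumed here to be the pool size minus the reserve those parameters imply) is internal to the disk manager:

    #include <stddef.h>

    #define MAX_ACTIVE_PAGES 32                 /* default */

    typedef struct {
        size_t pool_size;                       /* total pages in the page pool */
        size_t used_pages;                      /* pages currently allocated */
        size_t max_connections;
    } page_pool_t;

    /* assumption: the red threshold sits just below the reserve needed
       for every connection's active pages (pool must exceed the reserve) */
    static size_t red_zone(const page_pool_t *p) {
        return p->pool_size - MAX_ACTIVE_PAGES * p->max_connections;
    }

    static size_t yellow_zone(const page_pool_t *p) {
        size_t yellow = p->pool_size * 2 / 3;   /* 2/3 of the pool size ... */
        size_t red    = red_zone(p);
        return yellow < red ? yellow : red;     /* ... but no greater than red */
    }

    /* Called as a transaction allocates pages from the pool. */
    int check_thresholds(const page_pool_t *p) {
        if (p->used_pages >= red_zone(p))
            return 2;   /* red: mark the transaction erroneous (rollback) */
        if (p->used_pages >= yellow_zone(p))
            return 1;   /* yellow: offload some modified pages to the media */
        return 0;       /* normal operation */
    }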

Connection Cache

In addition to the disk manager cache (often referred to as the page pool), eXtremeDB provides a per-connection cache. The database runtime “pins” a predefined number of pages from the page pool for each connection; this is referred to as the connection cache. When a transaction loads pages into the page pool, and the total number of pages loaded from the media is less than the size of the connection cache, the database runtime ensures that these pages stay in the cache until the transaction is committed or the database connection is closed.

By default the size of the connection cache is four pages. Usually the runtime accesses a page only a few times during a transaction, and the number of pages accessed at any moment is normally quite small, so the default value of 4 pages works fine; in this case the per-connection cache is implemented with a sequential search. However, changing the connection cache size can have a significant impact on performance in some cases: a much larger size can be specified in the hope that an entire working set of pages will fit into the cache, and in such cases a hash table is used to look up cached pages. Note that it is not possible to modify the connection cache size without recompiling the eXtremeDB libraries; if this is required, please contact McObject Technical Support.
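
A sketch of the default lookup strategy follows: with only four entries, a sequential scan over the pinned page numbers is cheaper than maintaining a hash table (which pays off only for a much larger connection cache). Names are illustrative, not the eXtremeDB internals:

    #include <stddef.h>

    #define CONN_CACHE_SIZE 4              /* default; fixed at compile time */

    typedef unsigned long page_no_t;

    typedef struct {
        page_no_t pages[CONN_CACHE_SIZE];  /* pinned page numbers */
        size_t    count;
    } conn_cache_t;

    /* With only 4 entries a linear scan beats hashing. */
    int conn_cache_lookup(const conn_cache_t *cc, page_no_t pno) {
        for (size_t i = 0; i < cc->count; i++)
            if (cc->pages[i] == pno)
                return 1;                  /* page is pinned for this connection */
        return 0;
    }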

The connection cache is enabled by default, but can be disabled, or reset (committing the connection cache pages to the database), at runtime. These APIs are provided to manage a scenario with many connections and long-lasting transactions. In this scenario, the connection cache could cause the page pool to run out of free pages (each new transaction allocates its own connection cache, but long transactions prevent those pages from being released back to the shared page pool). To address this, the connection cache can be turned off or reset frequently. Under normal circumstances, however, the application does not need to control the connection cache.
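
A hypothetical usage outline for this scenario is shown below; the commented function names are placeholders, not actual eXtremeDB identifiers (see the C API Cache Management page for the real calls):

    #include "mco.h"

    void long_running_work(mco_db_h db)
    {
        (void)db;   /* placeholder body */

        /* placeholder name: disable the per-connection cache so its
           pinned pages return to the shared page pool */
        /* mco_disk_connection_cache_disable(db); */

        /* ... long-lasting transactions ... */

        /* placeholder name: or keep the cache enabled but reset it
           periodically, committing its pages to the database */
        /* mco_disk_connection_cache_reset(db); */
    }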

In-memory Page Allocation

Whereas the connection cache manages memory pages for the disk manager cache, the eXtremeDB runtime also uses a per-connection page allocator to reduce synchronization overhead when allocating and deallocating pages. Each connection has a private set of pages that it can manage without synchronization.

The minimum and maximum number of pages held by the per-connection allocator in MVCC mode are determined by database parameters that can be modified through the API. The MVCC transaction manager optimizes access to the shared memory pool by pre-allocating a number of pages at once and assigning them to the connection. The default minimum is 256 pages and the default maximum is 512. The minimum/maximum values represent a tradeoff between accessing a shared resource more frequently and holding extra memory. Changing the defaults can be effective when the application has well-defined object allocation and deallocation patterns.
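
The sketch below models this batching scheme: a connection refills its private list from the shared pool in batches (256 pages by default) and returns the surplus once it holds more than the maximum (512 by default). The structure and names are illustrative, not the eXtremeDB implementation; locking is shown as comments for brevity:

    #include <stddef.h>

    #define MIN_PAGES 256   /* default minimum (batch pre-allocated at once) */
    #define MAX_PAGES 512   /* default maximum held by one connection */

    typedef struct page_hdr { struct page_hdr *next; } page_hdr_t;

    typedef struct {
        page_hdr_t *free_list;   /* shared pool free list, lock-protected */
    } shared_pool_t;

    typedef struct {
        shared_pool_t *pool;
        page_hdr_t    *local;    /* private pages: no synchronization needed */
        size_t         n_local;
    } conn_alloc_t;

    /* Take up to MIN_PAGES pages from the shared pool in one locked batch. */
    static void refill(conn_alloc_t *ca) {
        /* lock(ca->pool) */
        for (size_t i = 0; i < MIN_PAGES && ca->pool->free_list; i++) {
            page_hdr_t *p = ca->pool->free_list;
            ca->pool->free_list = p->next;
            p->next = ca->local;
            ca->local = p;
            ca->n_local++;
        }
        /* unlock(ca->pool) */
    }

    void *conn_alloc_page(conn_alloc_t *ca) {
        if (!ca->local)
            refill(ca);              /* one shared access per batch */
        page_hdr_t *p = ca->local;
        if (!p) return NULL;         /* shared pool exhausted */
        ca->local = p->next;
        ca->n_local--;
        return p;
    }

    void conn_free_page(conn_alloc_t *ca, void *page) {
        page_hdr_t *p = (page_hdr_t *)page;
        p->next = ca->local;
        ca->local = p;
        if (++ca->n_local > MAX_PAGES) {
            /* lock(ca->pool) */
            while (ca->n_local > MIN_PAGES) {   /* shrink back to the minimum */
                page_hdr_t *q = ca->local;
                ca->local = q->next;
                q->next = ca->pool->free_list;
                ca->pool->free_list = q;
                ca->n_local--;
            }
            /* unlock(ca->pool) */
        }
    }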

Native Language APIs

Currently only the C API provides functions to manage caching parameters. Please refer to the C API Cache Management page for implementation details.