Persistent Database I/O

To understand how persistent database read and write operations are performed, it is important to first understand how caching works. Typically, persistent database applications do not perform I/O to the persistent media (hard disk, solid state disk, etc.) directly. Read and write operations do not access the persistent media themselves, but rather go through memory caches that optimize overall performance.

eXtremeDB uses a database cache, or page pool, that can be tuned via a number of configuration options. The page pool holds a certain amount of data for fast access and, when necessary, writes data to or reads data from the persistent media via the operating system's file system. The file system in turn manages its own cache to buffer contents before actually accessing the persistent media.

These two types of cache and how they interact are explained in the following sections.

A note on bypassing the File System Cache

Why do we need file system caching? Using the O_DIRECT flag, it is possible to bypass the file system cache and thus avoid the well-known problem called "double buffering", i.e. the same pages being cached both by the database cache manager and by the operating system's file system cache manager, which leads to extra copies and wasted memory. Some argue that the DBMS should have full control over caching and exclude the operating system from this process.

But the fact is that only the operating system knows how much free memory it has at any given moment, and many empirical tests have shown that attempts to eliminate the operating system's cache actually lead to worse performance.
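For reference, the following generic POSIX sketch (not eXtremeDB code; the file name, block size and transfer size are arbitrary) shows what bypassing the file system cache with O_DIRECT involves. Note that direct I/O also requires block-aligned buffers, offsets and transfer sizes:

/* Generic POSIX illustration (not eXtremeDB-specific) of bypassing the file
 * system cache with O_DIRECT. Because the kernel page cache is no longer in
 * the path, the user buffer, file offset and transfer size must be aligned
 * (typically to the device's logical block size). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile.dat", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    void *buf;
    /* Allocate a block-aligned buffer explicitly; a plain malloc'ed buffer
     * would make the direct write fail with EINVAL on most file systems. */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    memset(buf, 0, 4096);

    /* Each write now goes straight to the device queue, paying the full
     * media latency instead of returning at DRAM speed from the cache. */
    ssize_t n = pwrite(fd, buf, 4096, 0);

    free(buf);
    close(fd);
    return n == 4096 ? 0 : 1;
}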

Huge Page support on Linux systems

Most modern Linux systems allow part of the virtual memory space to be backed by huge pages. The huge pages feature enables the Linux kernel to manage large pages of memory in addition to the standard 4KB (on x86 and x86_64) or 16KB (on IA64) page size. When the system needs to access a virtual memory location, it uses the page tables to translate the virtual address to a physical address. Using huge pages means that the system needs to load fewer such mappings into the Translation Lookaside Buffer (TLB), the on-CPU cache of page table entries that speeds up the translation of virtual addresses to physical addresses. Enabling the huge pages feature allows the kernel to use hugetlb entries in the TLB that point to huge pages. Because each hugetlb entry covers a larger address range, far fewer entries are needed to map the same amount of memory. On systems with more than 16GB of memory running eXtremeDB databases, enabling the huge pages feature can improve database performance. Specifying the MAP_HUGETLB flag, or SHM_HUGETLB for eXtremeDB shared memory databases, in the memory device flags makes it possible to use the huge page feature.
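The following generic Linux sketch (not eXtremeDB code) shows how memory backed by huge pages is requested with these flags. It assumes huge pages have already been reserved by the administrator (for example via /proc/sys/vm/nr_hugepages) and uses the common 2MB x86_64 huge page size:

/* Generic Linux illustration of requesting huge-page-backed memory.
 * Assumes huge pages have been reserved, e.g.:
 *     echo 512 > /proc/sys/vm/nr_hugepages                               */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

#define HUGE_PAGE_SIZE (2 * 1024 * 1024)   /* common x86_64 huge page size */

int main(void)
{
    /* Private (conventional) memory backed by huge pages. */
    void *mem = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    memset(mem, 0, HUGE_PAGE_SIZE);          /* touch the pages */
    munmap(mem, HUGE_PAGE_SIZE);

    /* Shared memory backed by huge pages (e.g. for a shared database cache). */
    int shmid = shmget(IPC_PRIVATE, HUGE_PAGE_SIZE,
                       IPC_CREAT | SHM_HUGETLB | 0600);
    if (shmid < 0) {
        perror("shmget(SHM_HUGETLB)");
        return 1;
    }
    void *shm = shmat(shmid, NULL, 0);
    if (shm != (void *)-1)
        shmdt(shm);
    shmctl(shmid, IPC_RMID, NULL);           /* mark segment for removal */
    return 0;
}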

The File System Cache

The operating system (Linux, macOS, Windows, VxWorks, etc.) allocates a portion of system memory for file system management. The file system works hard to prevent applications from suffering disk I/O latency, for example by using DRAM to buffer writes and to cache and pre-fetch reads. What matters to the database application is the latency of its requests to the file system.

Reads and writes are often served from the file system's main memory cache instead of the disk, as illustrated in the following diagram:


As the wider arrows indicate, I/O throughput is orders of magnitude higher for "cache hits", which are served at RAM access speed, than for "cache misses", which require the operating system to access the persistent media.

Please see the Persistent Media I/O page for further details on how eXtremeDB interacts with the Operating System's file management system.

The eXtremeDB Database Cache

Why do we need a separate database cache? As explained above, the performance of database applications can be significantly impacted by the latency of read and write requests to the file system. When read or write requests result in file system "cache misses", the persistent media access can seriously degrade application performance. The file system alone cannot satisfy the performance requirements of database applications for a number of reasons:

For these reasons, eXtremeDB implements a database cache to minimize the effects of persistent media I/O as illustrated in the following diagram:

The database cache is defined as a memory device which can be conventional RAM or shared memory. Database transactions affect one or more "pages" which are managed by the database cache (or "page pool") manager as illustrated in the following diagram:

Cache Size and Tuning

The memory address and size of the cache are specified in the memory devices passed to the database open API. The memory can be either shared memory or local memory; it must be shared memory if two or more processes are to share the database. Generally, a larger cache improves application performance, but the frequency of updates to persistent media (flushing of cache pages) has an even greater effect on performance. How database updates are written to persistent media is determined by the Transaction Commit Policy.
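As an illustration, the following sketch is loosely modeled on the device-array pattern used in eXtremeDB sample programs. The exact type, field and flag names (mco_device_t, MCO_MEMORY_ASSIGN_CACHE_SEGMENT, the generated mydb_get_dictionary() function for a hypothetical "mydb" schema, and so on) are assumptions here and should be verified against the reference manual for your release:

/* Sketch only: declares conventional memory for the in-memory segment and the
 * database cache, plus file devices for the database and its log, then opens
 * the database. Names and fields follow the typical eXtremeDB sample layout
 * and may differ in your release -- check the reference manual. */
#include <mco.h>
#include <stdlib.h>
#include <string.h>
#include "mydb.h"   /* generated by the schema compiler (hypothetical schema) */

#define DB_SEGMENT_SIZE  (16 * 1024 * 1024)
#define CACHE_SIZE       (64 * 1024 * 1024)

mco_db_h open_persistent_db(const char *db_name)
{
    mco_device_t    dev[4];
    mco_db_params_t db_params;
    mco_db_h        db = 0;

    memset(dev, 0, sizeof(dev));

    /* In-memory segment for runtime data. */
    dev[0].type         = MCO_MEMORY_CONV;
    dev[0].assignment   = MCO_MEMORY_ASSIGN_DB_SEGMENT;
    dev[0].size         = DB_SEGMENT_SIZE;
    dev[0].dev.conv.ptr = malloc(DB_SEGMENT_SIZE);

    /* The database cache (page pool); a shared memory device would be used
     * instead if several processes are to share the database. */
    dev[1].type         = MCO_MEMORY_CONV;
    dev[1].assignment   = MCO_MEMORY_ASSIGN_CACHE_SEGMENT;
    dev[1].size         = CACHE_SIZE;
    dev[1].dev.conv.ptr = malloc(CACHE_SIZE);

    /* Persistent database file and its log file. */
    dev[2].type       = MCO_MEMORY_FILE;
    dev[2].assignment = MCO_MEMORY_ASSIGN_PERSISTENT;
    strcpy(dev[2].dev.file.name, "mydb.dbs");

    dev[3].type       = MCO_MEMORY_FILE;
    dev[3].assignment = MCO_MEMORY_ASSIGN_LOG;
    strcpy(dev[3].dev.file.name, "mydb.log");

    mco_db_params_init(&db_params);
    db_params.db_log_type = REDO_LOG;   /* logging policy, see below */

    if (mco_db_open_dev(db_name, mydb_get_dictionary(), dev, 4, &db_params)
        == MCO_S_OK)
    {
        mco_db_connect(db_name, &db);
    }
    return db;
}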

In general, a cache is too small to hold all the data an application might possibly need, so at some point something must be removed from the cache to make room for new data. The goal is to retain the items that are most likely to be retrieved again soon, which requires a sensible algorithm for selecting what to evict. The topic of cache replacement policies, or page replacement algorithms, is well covered in this Wikipedia article, which is worth reading for detailed descriptions of the various approaches to this problem. By default, eXtremeDB uses a variant of the CLOCK algorithm called Prioritized Cache. Please see the Prioritized Cache page for further implementation details.
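For orientation, here is a minimal generic sketch of the classic CLOCK (second-chance) eviction loop. It is illustrative only; it is not the eXtremeDB Prioritized Cache implementation, which extends this idea with page priorities:

/* Minimal sketch of the classic CLOCK (second-chance) page replacement
 * algorithm. Illustrative only -- not the eXtremeDB Prioritized Cache. */
#include <stddef.h>

#define POOL_PAGES 1024

typedef struct {
    int referenced;            /* set to 1 each time the page is accessed  */
    int dirty;                 /* set when the page has unwritten changes  */
    /* ... page id, frame pointer, etc. ...                                */
} page_frame_t;

static page_frame_t pool[POOL_PAGES];
static size_t clock_hand = 0;

/* Pick a frame to evict: sweep the "clock hand" over the pool, giving every
 * recently referenced page a second chance by clearing its reference bit. */
size_t clock_select_victim(void)
{
    for (;;) {
        page_frame_t *f = &pool[clock_hand];
        if (f->referenced) {
            f->referenced = 0;                 /* second chance */
        } else {
            size_t victim = clock_hand;
            clock_hand = (clock_hand + 1) % POOL_PAGES;
            return victim;                     /* evict this frame; if dirty,
                                                  it must be written out first */
        }
        clock_hand = (clock_hand + 1) % POOL_PAGES;
    }
}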

Normally, when re-opening a persistent database, eXtremeDB populates the cache in the course of normal application operations, so it takes some time to reach peak performance. To address this, the eXtremeDB runtime provides APIs to save and load the cache contents at runtime, so that re-opening a persistent database and preparing it for maximum operational performance can be significantly accelerated.

Database Log Policy

The log policy controls how and when transaction changes are written to persistent storage. eXtremeDB supports three database logging policies.

Note that it is important not to confuse the use of eXtremeDB Transaction Logging, an entirely independent group of APIs, with the normal eXtremeDB runtime logging for persistent databases. Please see the Transaction Logging page for further details.

The policies are described below, but the basic strategy, benefits and disadvantages of these policies can be summarized as follows:

Write Ahead Logging (REDO_LOG policy)

WAL's central concept is that changes to the data must be written only after those changes have been logged.

Benefits:

Significantly reduced number of disk writes (only at commit);

The log file is written sequentially, and so the cost of syncing the log is much less than the cost of flushing the data pages.

Disadvantages:

Cache size limit - can run out of memory when there are many uncommitted updates;

The transaction size is limited to the size of the page pool (cache).

Immediate Modifications Logging (UNDO_LOG policy)

The log file contains entries that allow the current transaction’s updates to be undone.

Benefits:

The algorithm never runs out of memory and provides easy and efficient recovery.

Disadvantages:

All updates must be flushed to the database file when committed to persistent media.

All writes go to the database file and are random, and thus slower than writes to the log file, which are sequential.

A "flush" is performed not only at database commit but also, almost always, when dirty pages are discarded from the page pool.

No logging policy (NO_LOG policy)

If this option is selected, transaction logging is turned off and a log file is not created.

Benefits:

Updates can be significantly faster.

Disadvantages:

The application will not be able to recover the database in the event of a crash.

Transaction rollback is not available (when the MURSIW transaction manager is used).

 

Write Ahead Logging (REDO_LOG policy)

Write Ahead Logging (WAL) is a standard approach to transaction logging. (Please refer to this web page for a general description.)

Briefly, WAL's central concept is that changes to the data must be written only after those changes have been logged - that is, when the log records have been flushed to permanent storage. When a page (data or index) is updated, it is "pinned" in the page pool (cache) and guaranteed to never get swapped out during the transaction (the "no steal" policy). Upon transaction commit, all updated pages are first written into the log, which is then committed (flushed) to the permanent storage. Only then are the updated pages written to the database (but they are not flushed). If during the commit the log size becomes larger than the specified threshold, a checkpoint is created: all updated pages are written to disk, the updates are flushed to the permanent storage and the log is truncated.

The obvious benefit of the REDO_LOG policy is a significantly reduced number of disk writes, since only the log file needs to be flushed to disk at the time of transaction commit. Furthermore, the log file is written sequentially, and so the cost of syncing the log is much less than the cost of flushing the data pages. The disadvantage of using WAL is that the algorithm can run out of memory when there are many uncommitted updates: every time a page is made "dirty" (anything on it is changed), it must remain in cache, so the transaction size is limited to the size of the page pool (cache).
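The following simplified sketch (generic C, not eXtremeDB internals; the page structure, helper and file descriptors are hypothetical) captures the ordering rule at commit time: append to the log, flush the log, and only then write the data pages:

/* Simplified, generic illustration of the write-ahead rule described above.
 * Hypothetical helpers and layout -- not eXtremeDB internals. */
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

typedef struct {
    char data[PAGE_SIZE];
    long page_no;
    int  dirty;
} page_t;

int wal_commit(int log_fd, int db_fd, page_t *pinned[], int n)
{
    /* 1. Append every page updated by the transaction to the log. */
    for (int i = 0; i < n; i++)
        if (write(log_fd, pinned[i]->data, PAGE_SIZE) != PAGE_SIZE)
            return -1;

    /* 2. Force the log to stable storage: this is the only flush needed at
     *    commit time, and it is sequential I/O. */
    if (fsync(log_fd) != 0)
        return -1;

    /* 3. Only now write the pages to the database file (random I/O), without
     *    flushing; a later checkpoint flushes the data file and truncates
     *    the log. */
    for (int i = 0; i < n; i++) {
        off_t off = (off_t)pinned[i]->page_no * PAGE_SIZE;
        if (pwrite(db_fd, pinned[i]->data, PAGE_SIZE, off) != PAGE_SIZE)
            return -1;
        pinned[i]->dirty = 0;     /* page may now be unpinned / swapped out */
    }
    return 0;
}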

The following diagram illustrates the WAL approach:

Immediate Modifications Logging (UNDO_LOG policy)

When the UNDO_LOG strategy is used, the log file contains entries that allow the current transaction's updates to be undone. Briefly, the eXtremeDB implementation of this approach is as follows: during an update, the runtime marks the containing page as "dirty", flags it in the bitmap of modified pages, and writes the original page to the log file. Regardless of the number of times an individual page is changed during the transaction, the original image of the page is written to the log file only once. When the transaction is committed, all modified pages are written and flushed to the database file and then the log file is truncated. The recovery and rollback procedures read all saved pages from the log file, restoring the original images of the pages and clearing the "dirty" bit for each page.

The advantages of using Undo Logging are that the algorithm never runs out of memory and provides easy and efficient recovery. The disadvantages are that all updates must be flushed to the database file when committed to persistent media. Writes to the database file are usually random and thus slower than writes to the log file, which are sequential. Further, when a page is updated, the runtime must ensure that the updated version is not stored to the persistent media before the original version has been saved in the log. So a "flush" is performed not only when the database commits, but almost always when dirty pages are discarded from the page pool.
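A simplified sketch of this ordering (generic C; the structures and helpers are hypothetical, not eXtremeDB internals) looks like this:

/* Simplified, generic illustration of undo logging as described above. */
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

typedef struct {
    char data[PAGE_SIZE];
    long page_no;
    int  dirty;       /* modified during the current transaction           */
    int  logged;      /* original image already saved to the undo log      */
} page_t;

/* Called before the first modification of a page in a transaction: save the
 * original image to the undo log (once per page, no matter how many times
 * the page is changed) and force it to stable storage before the updated
 * page can ever reach the database file. */
int undo_log_page(int log_fd, page_t *pg)
{
    if (pg->logged)
        return 0;
    if (write(log_fd, pg->data, PAGE_SIZE) != PAGE_SIZE)
        return -1;
    if (fsync(log_fd) != 0)
        return -1;
    pg->logged = 1;
    return 0;
}

/* Commit: every dirty page is written and flushed to the database file
 * (random I/O), after which the undo log can be truncated. */
int undo_commit(int db_fd, int log_fd, page_t *dirty[], int n)
{
    for (int i = 0; i < n; i++) {
        off_t off = (off_t)dirty[i]->page_no * PAGE_SIZE;
        if (pwrite(db_fd, dirty[i]->data, PAGE_SIZE, off) != PAGE_SIZE)
            return -1;
        dirty[i]->dirty = dirty[i]->logged = 0;
    }
    if (fsync(db_fd) != 0)
        return -1;
    return ftruncate(log_fd, 0);  /* transaction is durable; discard undo log */
}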

The following diagram illustrates the Undo Logging approach:

The Non-transactional Mode (NO_LOG Option)

If this option is selected, log file updating is turned off and a log file is not created. This significantly increases update performance, but the application will not be able to recover the database in the event of a crash, and transaction rollback is also not available. This mode can be useful when the application needs to quickly populate the database file, but it is not recommended under any other circumstances.

 

Choosing the Log file type

The logging policy is selected by specifying the log file type: UNDO_LOG, REDO_LOG or NO_LOG. The following are some guidelines for choosing the proper logging strategy depending on application characteristics:

Choose UNDO_LOG when the application:

Choose REDO_LOG when the application:

Choose NO_LOG if data safety is of no concern:

For further details please see the Setting the Log File Type page.

Database Transactions

The ACID (Atomicity, Consistency, Isolation, Durability) model is one of the oldest and most important concepts of database theory. It sets forth four goals that every database management system must strive to achieve: atomicity, consistency, isolation and durability. No database that fails to meet any of these four goals can be considered reliable:

Atomicity states that database modifications must follow an “all or nothing” rule. Each transaction is said to be “atomic.” If one part of the transaction fails, the entire transaction fails. It is critical that the database management system maintain the atomic nature of transactions in spite of any DBMS, operating system or hardware failure.

Consistency states that only valid data will be written to the database. If, for some reason, a transaction is executed that violates the database’s consistency rules, the entire transaction will be rolled back and the database will be restored to a state consistent with those rules. On the other hand, if a transaction successfully executes, it will take the database from one state that is consistent with the rules to another state that is also consistent with the rules.

Isolation requires that multiple transactions occurring at the same time not impact each other’s execution. For example, if Joe issues a transaction against a database at the same time that Mary issues a different transaction, both transactions should operate on the database in an isolated manner. The database should either perform Joe’s entire transaction before executing Mary’s or vice-versa. This prevents Joe’s transaction from reading intermediate data produced as a side effect of part of Mary’s transaction that will not eventually be committed to the database. Note that the isolation property does not ensure which transaction will execute first, merely that they will not interfere with each other.

Durability ensures that any transaction committed to the database will not be lost. Durability is ensured through the use of database backups and transaction logs that facilitate the restoration of committed transactions in spite of any subsequent software or hardware failures.

Transaction Commit Policies

A transaction is said to be durable if, upon return of control to the application after a transaction commit, the transaction data can be recovered in the event of a failure of the application or the system (assuming that the media on which the database and/or transaction log resides is not compromised or corrupted).

In order to guarantee the durability of transactions, database systems must force updates to be written through the database cache and the file system cache onto the physical media (be it solid state or spinning media). This flushing of the file system buffers is an expensive operation (in terms of performance), but it is the only way to guarantee the durability property.
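The following generic POSIX sketch (not eXtremeDB code) shows why: write() alone only copies the data into the file system cache, and it is the explicit flush that pays the media latency. (On macOS, fcntl(fd, F_FULLFSYNC) is needed to force a true write-through to the media.)

/* Generic POSIX illustration of why durable commits are expensive: write()
 * only copies data into the file system cache; fsync() is what forces it
 * through to the physical media, and that is the slow part. */
#include <stddef.h>
#include <unistd.h>

int durable_append(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    /* Without this call the data may still be lost in an OS crash or power
     * failure, even though write() already returned success. */
    return fsync(fd);
}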

Why might an application relax the durability property of transactions? Durable transactions are "expensive". Some applications do not require durable transactions, for example: commodity/equity trading systems whose data can be re-loaded from other sources, and applications that implement bulk load operations that can be repeated if necessary.

eXtremeDB offers four transaction commit policies. The commit policy is initially set when opening the database but can later be modified during program execution.
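For example, an application performing a repeatable bulk load might temporarily relax the policy and restore the default afterwards. The sketch below assumes the runtime function for changing the policy is mco_disk_transaction_policy(); the call name and signature are assumptions and should be verified against the reference manual for your release:

/* Hedged sketch: switching the commit policy at runtime. The MCO_COMMIT_*
 * constants are the policies described in this documentation; the function
 * name and signature below are assumptions -- check the reference manual. */
#include <mco.h>

void bulk_load(mco_db_h db)
{
    /* Relax durability for a repeatable bulk-load phase. */
    mco_disk_transaction_policy(db, MCO_COMMIT_BUFFERED);

    /* ... perform many insert transactions here ... */

    /* Restore the default, fully durable policy afterwards. */
    mco_disk_transaction_policy(db, MCO_COMMIT_SYNC_FLUSH);
}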

(Note that the transaction commit policy affects persistent databases only).

The transaction commit policies are described below, but the basic strategy, benefits and disadvantages of these policies can be summarized as follows:

Buffered Synchronization (MCO_COMMIT_BUFFERED policy)

Indicates that the database cache is not flushed to disk upon transaction commit.

Benefits:

Significantly reduces the number of I/O operations.

Application failure will not cause database corruption.

The application can explicitly force the cache to be flushed to the media (i.e. periodically, at the application's discretion, it can persist all data to disk and truncate the log file).

Disadvantages:

In case of application failure, the database cache is destroyed and buffered transactions (changes made by all transactions committed after the policy was set) will be lost.

No Synchronization (MCO_COMMIT_NO_SYNC policy)

Indicates that the database runtime does not explicitly synchronize the file system buffers with the file system media. It is up to the file system to determine when the data is actually written to the media.

Benefits:

This mode provides some performance advantages over the full synchronization mode.

Disadvantages:

Risks losing transaction data in the event of a system crash (while committed transactions are still in the file system cache). The database and log files could be left in an inconsistent state causing the database content to be corrupted.

Delayed Synchronization (MCO_COMMIT_DELAYED policy)

Indicates that modified pages are kept in the page pool, similar to the MCO_COMMIT_BUFFERED policy. When the total size of all kept pages reaches a threshold, or the number of delayed commits exceeds a specified limit, all delayed transactions are committed to the persistent storage at once.

Benefits:

This mode has performance advantages over the full synchronization (MCO_COMMIT_SYNC_FLUSH) mode.

No risk of corrupting the database.

Disadvantages:

In the case of a system failure, multiple transactions (those still in the cache) are lost.

Full Synchronization (MCO_COMMIT_SYNC_FLUSH policy)

Indicates that a database commit flushes the cache and synchronizes the file system buffers for both the database and log files. (Note that this is the default transaction commit policy.)

Benefits:

This policy provides durable transactions.

Disadvantages:

None, except that it is the slowest policy: every commit incurs physical I/O.

 

Buffered Synchronization (MCO_COMMIT_BUFFERED policy)

This policy indicates that the database cache is not flushed to disk upon transaction commit, but only when the application calls the cache flush API. Pages that were made dirty by the current transaction are left in the database cache; this applies to both the database and the log file pages. This policy significantly reduces the number of I/O operations: the runtime only writes dirty pages to disk during normal swapping. In the case of an application failure, the database cache is destroyed and all changes made by buffered transactions (committed after the policy was set) will be lost. The database can be corrupted in the case of an abnormal application termination when using the NO_LOG logging policy.

Note that an application failure will not cause database corruption, provided that the hardware and the operating system are working properly, as the log file is written to persistent storage during the cache flush operation thereby guaranteeing consistency of database files. But, because not every commit will cause a write to persistent media, the buffered commits will be lost. The database is restored to a consistent state from the log file when the application is restarted.

The following diagram illustrates the Buffered Synchronization approach:

Note that the dashed arrows indicate that data in the cache (buffered transactions) are not automatically flushed to the persistent media. Persistent updates occur when the application calls the cache flush API.

No Synchronization (MCO_COMMIT_NO_SYNC policy)

This policy indicates that the database runtime does not explicitly synchronize the file system buffers with the file system media. Upon transaction commit, all changes made by the transaction are transferred from the application space to the operating system space and the log file is truncated (when using the UNDO_LOG logging policy only). It is up to the file system to determine when the data is actually written to the media. This mode provides some performance advantages over the full synchronization mode but also risks losing transaction data in the event of a system crash (while committed transactions are still in the file system cache).

Note that in the case of a failure the database and log files could be left in an inconsistent state causing the database content to be corrupted.

With MCO_COMMIT_NO_SYNC mode, the application can explicitly force the cache to be flushed to the media. When combined with the MCO_COMMIT_NO_SYNC transaction policy, the cache flush API allows periodically (at the application’s discretion) persisting all the data to disk and truncating the log file.

The following diagram illustrates the No Synchronization approach:

Note that the solid arrows indicate that data in the database cache is flushed to the file system cache on commit, and the dashed arrows indicate that data in the file system cache (buffered transactions) is flushed to the persistent media under file system control.

Delayed Synchronization (MCO_COMMIT_DELAYED policy)

This commit policy is only available if the logging policy is set to REDO_LOG. Similar to the MCO_COMMIT_BUFFERED mode, in this mode the transaction is not committed to persistent media immediately upon the database commit. Instead, the database cache manager keeps the transaction in memory until a specified threshold is reached. There are two threshold criteria: 1) the amount of used cache space (one third of the page pool by default) and 2) the number of delayed transactions (turned off by default). When the total size of all kept pages reaches the used-space threshold, or the number of delayed commits exceeds the transaction-count limit, all delayed transactions are committed to the persistent storage at once.

In the case of an application failure, some of the last transactions can be lost because it cannot be known when the commit to persistent media actually happened and which transactions were actually written.
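The following generic sketch (not the eXtremeDB implementation; the state structure and flush helper are hypothetical) illustrates the batching logic described above:

/* Generic sketch of delayed-commit batching: committed transactions
 * accumulate in the page pool and are flushed together once a used-space or
 * transaction-count threshold is crossed. Illustrative only. */
#include <stddef.h>

typedef struct {
    size_t   pool_size;        /* total page pool size in bytes             */
    size_t   kept_bytes;       /* bytes held by delayed (unflushed) commits  */
    unsigned delayed_commits;  /* number of commits since the last flush     */
    unsigned max_delayed;      /* 0 = transaction-count criterion disabled   */
} delayed_state_t;

extern void flush_all_delayed(void);   /* write + fsync everything at once */

void on_commit(delayed_state_t *s, size_t tx_bytes)
{
    s->kept_bytes += tx_bytes;
    s->delayed_commits++;

    int over_space = s->kept_bytes >= s->pool_size / 3;       /* default: 1/3 */
    int over_count = s->max_delayed && s->delayed_commits >= s->max_delayed;

    if (over_space || over_count) {
        flush_all_delayed();
        s->kept_bytes = 0;
        s->delayed_commits = 0;
    }
}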

The following diagram illustrates the Delayed Synchronization approach:

Note that the dashed arrows indicate that data in the cache (buffered transactions) are not automatically flushed to the persistent media. Persistent updates occur when the specified threshold is reached.

Full Synchronization (MCO_COMMIT_SYNC_FLUSH policy)

This policy indicates that a database commit flushes the cache, synchronizes the file system buffers for both database and log files and truncates the log file (when using the UNDO_LOG logging policy only). This policy provides durable transactions. The database can be corrupted only if the physical media where the database and log files are located is damaged.

The following diagram illustrates the Full Synchronization approach:

Note that the solid arrows indicate that data in the database cache is flushed to file system cache and to the persistent media automatically when transactions are committed.