In-memory Database Questions & Answers
From the in-memory database system experts at McObject
In-memory database systems (IMDSs) are a growing subset of database management system (DBMS) software. In-memory databases emerged in response to new application goals, system requirements, and operating environments. Below, we answer some common IMDS questions. Want more technical answers? Start with our online documentation.
What is an in-memory database system?
An in-memory database system is a database management system that stores data entirely in main memory. This contrasts with traditional (on-disk) database systems, which are designed for data storage on persistent media. Because working with data in memory is much faster than writing to and reading from a file system, IMDSs can perform applications’ data management functions orders of magnitude faster. Because their design is typically simpler than that of on-disk databases, IMDSs can also impose significantly lower memory and CPU requirements.
If avoiding disk I/O is the goal, why not achieve that through database caching?
Caching is the process whereby on-disk databases keep frequently accessed records in memory, for faster access. However, caching only speeds up retrieval of information, or “database reads.” Any database write – that is, an update to a record or creation of a new record – must still be written through the cache, to disk. So, the performance benefit only applies to a subset of database tasks. In addition, managing the cache is itself a process that requires substantial memory and CPU resources, so even a “cache hit” underperforms an in-memory database.
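To make the read/write asymmetry concrete, here is a minimal sketch of a write-through cache. It is an illustration only: the class name WriteThroughCache, its counters, and the plain dictionary standing in for the on-disk database are all hypothetical, and a real DBMS cache is far more elaborate.

```python
class WriteThroughCache:
    """Toy write-through cache in front of a slow backing store (hypothetical)."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # stands in for the on-disk database
        self.cache = {}
        self.disk_reads = 0
        self.disk_writes = 0

    def read(self, key):
        if key in self.cache:            # cache hit: no disk access needed
            return self.cache[key]
        self.disk_reads += 1             # cache miss: fetch from "disk"
        value = self.backing_store[key]
        self.cache[key] = value
        return value

    def write(self, key, value):
        self.cache[key] = value
        self.backing_store[key] = value  # every write still reaches "disk"
        self.disk_writes += 1


disk = {"a": 1}
db = WriteThroughCache(disk)
db.read("a")       # miss: one disk read
db.read("a")       # hit: served from memory
db.write("a", 2)   # disk write
db.write("a", 3)   # disk write
print(db.disk_reads, db.disk_writes)  # prints: 1 2
```

The repeated read is absorbed by the cache, but both writes reach the backing store, which is the behavior described above.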
If an in-memory database system boosts performance by holding all records in memory, can’t I get the same result by creating a RAM disk and deploying a traditional database there?
As a makeshift solution, placing the entire on-disk database on a RAM disk will speed up both database reads and writes. However, the database system is still hard-wired for disk storage, and processes in the database to facilitate disk storage, such as caching and file I/O, will continue to operate, even though they are now irrelevant.
In addition, data in an on-disk database system must be transferred to numerous locations as it is used. Figure 1 shows the hand-offs required for an application to read a piece of data from an on-disk database, modify it and write that record back to the database. These steps require time and CPU cycles, and cannot be avoided in a traditional database, even when it runs on a RAM disk. Still more copies and transfers are required if transaction logging is active.
Figure 1. Data transfer in an on-disk database system
In contrast, an in-memory database system entails a single data transfer from the in-memory database to the application’s working data. Elimination of multiple data transfers streamlines processing. Removing multiple copies of data reduces memory consumption, and the simplified processing makes for greater reliability and minimizes CPU demands.
Can you quantify the performance difference between the three approaches described above – using on-disk, on-disk deployed on a RAM-disk, and in-memory database systems?
In a published benchmark, McObject compared the same application’s performance using an embedded on-disk database system, using an embedded in-memory database, and using the embedded on-disk database deployed on a RAM-disk. Moving the on-disk database to a RAM drive resulted in read accesses that were almost 4x faster, and database updates that were more than 3x faster.
Moving this same benchmark test to a true in-memory database system, however, provided much more dramatic performance gains: the in-memory database outperformed the RAM-disk database by 4x for database reads and turned in a startling 420x improvement for database writes.
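As a rough back-of-the-envelope illustration, the two sets of factors can be multiplied to estimate how the in-memory database compares with the original on-disk configuration. This is a sketch under stated assumptions: it rounds “almost 4x” to 4 and “more than 3x” to 3, and the compounded figures are derived here, not taken from the benchmark report.

```python
# Rounded speedup factors quoted in the text (approximations, not raw data).
ram_disk_vs_on_disk = {"read": 4, "write": 3}
imds_vs_ram_disk = {"read": 4, "write": 420}

# Compound the two steps: on-disk -> RAM disk -> in-memory database.
imds_vs_on_disk = {
    op: ram_disk_vs_on_disk[op] * imds_vs_ram_disk[op]
    for op in ram_disk_vs_on_disk
}
print(imds_vs_on_disk)  # prints: {'read': 16, 'write': 1260}
```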
What else distinguishes an in-memory database from a “traditional” (on-disk) database management system (DBMS)?
The optimization objectives of an on-disk database system are diametrically opposed to those of an in-memory database system. With an on-disk database system, the primary burden on performance is file I/O. Therefore, an on-disk database system seeks to reduce that I/O, and it will trade off memory consumption and CPU cycles to do so. This includes using extra memory for a cache, and CPU cycles to maintain the cache.
On-disk DBMSs also keep a lot of redundant data around. For example, duplicate data is kept in index structures, to enable the on-disk database system to fetch data from the index, rather than “spending” an I/O navigating from the index to the associated table row. Disk space is cheap, so designers of on-disk database systems proceed with the assumption that storage space is virtually limitless and that using extra storage space in exchange for reducing file I/O is a good tradeoff.
In stark contrast, an in-memory database system carries no file I/O burden. From the start, its design can be more streamlined, with the optimization goals of reducing memory consumption and CPU cycles. Though memory has declined in price, developers rightly treat it as more precious—and because memory equals storage space for an in-memory database system, IMDSs should be (and McObject’s eXtremeDB in-memory embedded database is) designed to get the most out of memory. An in-memory database is chosen explicitly for its performance advantage, so a secondary design goal is always to eliminate unnecessary CPU cycles.
Isn’t the database just lost if there’s a system crash?
It needn’t be. Most in-memory database systems offer features for adding persistence, or the ability to survive disruption of their hardware or software environment.
One important tool is transaction logging, in which periodic snapshots of the in-memory database (called “savepoints”) are written to non-volatile media. If the system fails and must be restarted, the database either “rolls back” to the last completed transaction, or “rolls forward” to complete any transaction that was in progress when the system went down (depending on the particular IMDS’s implementation of transaction logging).
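The recovery idea can be sketched in a few lines. The example below is a simplified illustration, not any particular IMDS’s implementation: the LoggedStore class and its JSON-lines log format are hypothetical, and it shows only the “roll back to the last completed transaction” behavior by replaying committed changes from a log file.

```python
import json
import os
import tempfile


class LoggedStore:
    """Toy in-memory key-value store with an append-only transaction log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Recovery: re-apply every committed transaction found in the log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                self.data.update(json.loads(line))

    def commit(self, changes):
        # Write to non-volatile media first, then apply the change in memory.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(changes) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data.update(changes)


log_path = os.path.join(tempfile.mkdtemp(), "txn.log")
store = LoggedStore(log_path)
store.commit({"sensor_1": 20.5})
store.commit({"sensor_1": 21.0, "sensor_2": 7})

store.data["sensor_3"] = 99   # uncommitted change, never logged
del store                     # simulate a crash: memory contents are gone

recovered = LoggedStore(log_path)
print(recovered.data)  # prints: {'sensor_1': 21.0, 'sensor_2': 7}
```

After the simulated crash, the restarted store contains both committed transactions and none of the uncommitted work.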
In-memory database systems can also gain durability by maintaining one or more copies of the database. In this solution – called database replication for high availability – fail-over procedures allow the system to continue using a standby database. The “master” and “replica” databases can be maintained by multiple processes or threads within the same hardware instance. They can also reside on two or more boards in a chassis with a high-speed bus for communication, run on separate computers on a LAN, or exist in other configurations.
Non-volatile RAM or NVRAM provides another means of in-memory database persistence. One type of NVRAM, called battery-RAM, is backed up by a battery so that even if a device is turned off or loses its power source, the memory content—including the database—remains. The latest generations replace the battery with a super-capacitor for the same effect. Newer types of NVRAM, including ferroelectric RAM (FeRAM), magnetoresistive RAM (MRAM) and phase change RAM (PRAM) are designed to maintain information when power is turned off, and offer similar persistence options.
For more on this topic, we invite you to read the white paper, Database Persistence, Without The Performance Penalty – Benchmarking McObject’s In-Memory Database System With AgigA Tech’s Non-Volatile DIMM Technology
Finally, new hybrid database system technology adds the ability to apply disk-based storage selectively, within the broader context of an in-memory database. For example, with McObject’s hybrid eXtremeDB, a notation in the database design or “schema” causes certain record types to be written to disk, while all others are managed entirely in memory. On-disk functions such as cache management are applied only to those records stored on disk, minimizing these activities’ performance impact and CPU demands.
What kinds of applications typically employ an in-memory database?
In-memory databases are most commonly used in applications that demand very fast data access, storage and manipulation, and in systems that don’t typically have a disk but nevertheless must manage appreciable quantities of data.
An important use for in-memory database systems is in real-time embedded systems. IMDSs running on real-time operating systems (RTOSs) provide the responsiveness needed in applications including IP network routing, telecom switching, and industrial control. IMDSs manage music databases in MP3 players and handle program data in set-top boxes. In-memory databases’ typically small memory and CPU footprint make them ideal because most embedded systems are highly resource-constrained.
Non-embedded applications requiring exceptional performance are an important growth area for in-memory database systems. For example, algorithmic trading and other applications for financial markets use IMDSs to provide instant manipulation of data, in order to identify and leverage market opportunities. Some multi-user Web applications – such as e-commerce and social networking sites – use in-memory databases to cache portions of their back-end on-disk database systems. These enterprise-scale applications sometimes require very large in-memory data stores, and this need is met by 64-bit IMDS editions.
Is an in-memory database the same as an “embedded database”?
How scalable is an in-memory database system? My application manages terabytes of data – is it practical to hold this much in an in-memory database?
IMDS technology scales well beyond the terabyte size range. McObject’s benchmark report, In-Memory Database Systems (IMDSs): Pushing Past the Terabyte-Plus Boundary, detailed this scalability with a 64-bit in-memory database system deployed on a 160-core SGI Altix 4700 server running SUSE Linux Enterprise Server version 9 from Novell. The database grew to 1.17 terabytes and 15.54 billion rows, with no apparent limits on it scaling further.
Performance remained consistent as the database size grew into the hundreds of gigabytes and exceeded a terabyte, suggesting nearly linear scalability. For a simple SELECT against the fully populated database, the IMDS (McObject’s eXtremeDB) processed 87.78 million query transactions per second using its native application programming interface (API) and 28.14 million transactions per second using SQL. To put these results in perspective, consider that the lingua franca for discussing query performance is transactions per minute.
A single logical eXtremeDB in-memory database can be sharded, in other words horizontally partitioned, into multiple physical databases. These shards, or partitions, can be distributed across multiple servers, thus overcoming the memory limitations of any single server.
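For readers unfamiliar with sharding, the following minimal sketch shows the general technique of routing each key to one of several partitions by hash. It is not eXtremeDB’s API: the ShardedStore class is hypothetical, and plain dictionaries in one process stand in for physical databases on separate servers.

```python
import zlib


class ShardedStore:
    """Toy logical store split across several in-memory shards by key hash."""

    def __init__(self, shard_count):
        # Each dict stands in for one physical database (possibly on its own server).
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash ensures the same key always maps to the same shard.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)


store = ShardedStore(shard_count=4)
for i in range(1000):
    store.put(f"row-{i}", i)

print(store.get("row-42"))                # prints: 42
print(sum(len(s) for s in store.shards))  # prints: 1000
```

The application sees one logical store, while no single shard has to hold all of the rows.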
Doesn’t it take a long time to populate an in-memory database?
Populating a very large in-memory database system can be much faster than populating an on-disk DBMS. During such “data ingest,” on-disk database systems use caching to enhance performance. But eventually, memory buffers fill up, and the system writes the data to the file system (logical I/O). Eventually, the file system buffers also fill up, and data must be written to the hard disk (physical I/O). Physical I/O is usually measured in milliseconds, and its performance burden is much greater than that of logical I/O (which is usually measured in microseconds). Physical I/O may be required by an on-disk DBMS for other reasons, for example, to guarantee transactional integrity.
Consider what happens when populating an on-disk database, as the total amount of stored data increases:
First, as the database grows, the tree indexes used to organize data grow deeper, and the average number of steps into the tree, to reach the storage location, expands. Each step imposes a logical disk I/O.
Second, assuming that the cache size stays the same, the percent of the database that is cached is smaller. Therefore, it is more likely that any logical disk I/O causes a more-burdensome physical I/O.
Third, as the database grows, it consumes more physical space on the disk platter, and the average time to move the head from position to position is greater. When the head travels further, physical I/O takes longer, further degrading performance.
In contrast, in-memory database ingest performance is roughly linear as database size increases.
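The first point, that tree indexes grow deeper as the database grows, can be illustrated with simple arithmetic. The sketch below assumes an idealized, fully packed tree and a hypothetical fanout of 100 keys per node; real index structures and fanouts vary.

```python
def tree_levels(row_count, fanout):
    """Smallest number of levels such that fanout**levels >= row_count."""
    levels, capacity = 0, 1
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


for rows in (10**6, 10**9, 10**12):
    print(rows, tree_levels(rows, fanout=100))
# prints:
# 1000000 3
# 1000000000 5
# 1000000000000 6
```

Each extra level is another step into the tree, and for an on-disk database each step is at least a logical I/O.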
Isn’t an in-memory database only really usable on a single computer system, whereas an on-disk database can be shared by any number of computers on a network?
An in-memory database system can be either an “embedded database” or a “client/server” database system. Client/server database systems are inherently multi-user, but embedded in-memory databases can also be shared by multiple threads/processes/users. First, the database can be created in shared memory, with the database system providing a mechanism to control concurrent access. Also, embedded databases can (and eXtremeDB does) provide a set of interfaces that allow processes that execute on network nodes remote from the database node, to read from and write to the database. Finally, database replication can be exploited to copy the in-memory database to the node(s) where processes are located, so that those processes can query a local database and eliminate network traffic and latency.
Read more about how eXtremeDB can be deployed flexibly as both an embedded and a client/server database.
What’s different/better about an in-memory database versus STL or Boost collections, or even just creating my own memory-mapped file(s)?
The question is the same as asking why these alternatives are not viable replacements for Oracle, MS SQL Server, DB2, and other on-disk databases. Any database system goes far beyond giving you a set of interfaces to manage collections, lists, etc. This typically includes support for ACID (atomic, consistent, isolated and durable) transactions, multi-user access, a high level data definition language, one or more programming interfaces (including industry-standard SQL), triggers/event notifications, and more.
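A small example shows what transactional atomicity adds over a plain collection. This sketch uses SQLite’s in-memory mode from the Python standard library purely as a convenient stand-in for “a database system”; it is not eXtremeDB, and the account table and simulated failure are hypothetical.

```python
import sqlite3

# Plain collection: a failure halfway through leaves a partial update behind.
balances = {"alice": 100, "bob": 50}
try:
    balances["alice"] -= 80
    raise RuntimeError("simulated failure before crediting bob")
except RuntimeError:
    pass
print(balances)  # prints: {'alice': 20, 'bob': 50}

# Database transaction: the same failure rolls the whole unit of work back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE account SET balance = balance - 80 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated failure before crediting bob")
except RuntimeError:
    pass
rows = conn.execute(
    "SELECT name, balance FROM account ORDER BY name").fetchall()
print(rows)  # prints: [('alice', 100), ('bob', 50)]
```

With the bare dictionary, the application itself would have to detect and undo the half-finished update; the database does this as part of its transaction support.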
Won’t an in-memory database require huge amounts of memory because database systems are large?
Equating “database management system” with “big” is justified, generally speaking. Even some embedded DBMSs are megabytes in code size. This is true largely because traditional on-disk databases – including some that have now been adapted for use in memory, and are pitched as IMDSs – were not written with the goal of minimizing code size (or CPU cycles). As described above, on-disk database systems’ overriding design goal is amelioration of disk I/O.
In contrast, a database system designed from first principles for in-memory use can be much smaller, requiring less than 200K of memory, compared with many hundreds of kilobytes up to many megabytes for other database architectures. This reduction in code size results from:
- Elimination of on-disk database capabilities that become irrelevant for in-memory use, such as all processes surrounding caching and file I/O
- Elimination of many features that are unnecessary in the types of application that use in-memory databases. An IP router does not need separate client and server software modules to manage routing data. And a persistent Web cache doesn’t need user access rights or stored procedures
- Hundreds of other development decisions that are guided by the design philosophy that memory equals storage space, so efficient use of that memory is paramount.
Learn about the many ways eXtremeDB reduces memory usage in embedded systems.
McObject is made up of a dedicated group of database management specialists, and we have the answers to in-memory database questions.
We have worked hard to make eXtremeDB the best in-memory database management system on the market. We invite you to review additional research, and evaluate our software free for 60 days.