eXtremeDB for Time Series Analysis


What is Time Series Data?

Time series data is usually system-generated data that arrives in time (chronological) order. The world abounds with examples. In the financial markets, trade-and-quote (TAQ) data is inherently time series. In the Internet of Things (IoT), sensors capture and stream measurements.

The time interval between data points can be regular or irregular. For TAQ data, the interval is irregular, since it depends on trading activity, whether from humans placing orders or from algorithms reacting to market buy/sell signals.

Most often, time series data is append-only and immutable: each new data point is a new entry in the time series and, once recorded, never changes. This is not always the case, however. For example, when merging multiple financial market data feeds to produce a single golden copy, it can be necessary to insert new elements, update elements or delete elements.

How Does Time Series Data Affect Database Design?

Traditional database management systems bring rows of data into the CPU cache for processing. But time series data is naturally columnar and is handled more efficiently by a column-based layout. eXtremeDB for HPC stores time series data with a columnar layout, and “normal” data with a conventional row-based layout. The result is higher performance for time series analytics, because the database system makes the best use of CPU cache speed and avoids costly (in performance terms) fetches from storage and main memory.
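
To make the cache argument concrete, the following sketch (ordinary Python, not eXtremeDB code; no eXtremeDB API is assumed) contrasts a row-oriented layout with a column-oriented layout for the same trade data:

# Illustrative sketch only: the same trades held row-by-row versus column-by-column.
# Scanning a single field in the columnar form walks one contiguous array, so the
# memory bus and CPU cache are not filled with fields the query never reads.

from array import array

# Row-oriented: each trade is one record; averaging 'price' drags every other
# field of every row through the cache as well.
rows = [
    {"timestamp": 930, "symbol": "XYZ", "price": 101.25, "volume": 300},
    {"timestamp": 931, "symbol": "XYZ", "price": 101.30, "volume": 150},
    {"timestamp": 933, "symbol": "XYZ", "price": 101.27, "volume": 500},
]
avg_from_rows = sum(r["price"] for r in rows) / len(rows)

# Column-oriented: each field is its own contiguous array; averaging 'price'
# reads only the price column.
prices = array("d", [101.25, 101.30, 101.27])
avg_from_columns = sum(prices) / len(prices)

assert abs(avg_from_rows - avg_from_columns) < 1e-9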

Time Series Analysis

Time series analysis is a specialized sub-domain of data management with its own rules. Applications that crunch time series data (including market data) benefit from a DBMS tailored to the task.

eXtremeDB delivers record-setting time series analysis performance through two capabilities: a columnar data layout and the pipelining of vector-based statistical functions.

Columnar Data Layout

Traditional DBMSs bring rows of data from a database table into the CPU cache for processing, but a time series, such as a security’s price over time, typically occupies a single column within a table. Much of the rest of the row is irrelevant for time series analysis, making it inefficient to fetch entire rows. For example, to compute the volume-weighted average price (VWAP) for an equity over a certain time frame, we need the timestamp, trade price and volume of shares traded for each trade (execution) within that time frame. We don’t need any other columns.
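
The calculation itself is straightforward: VWAP is the sum of price times volume over the window, divided by the sum of volume. The sketch below (plain Python lists standing in for database columns; this is not eXtremeDB’s API) shows the three-column computation:

# Hedged sketch of the VWAP query described above, over three parallel columns.

def vwap(timestamps, prices, volumes, t_start, t_end):
    """Volume-weighted average price over the window [t_start, t_end]."""
    pv_sum = 0.0  # running sum of price * volume inside the window
    v_sum = 0     # running sum of volume inside the window
    for t, p, v in zip(timestamps, prices, volumes):
        if t_start <= t <= t_end:
            pv_sum += p * v
            v_sum += v
    return pv_sum / v_sum if v_sum else None

# Illustrative trade columns for one equity (timestamps as minutes since midnight).
ts  = [930, 931, 933, 935]
px  = [101.25, 101.30, 101.27, 101.40]
vol = [300, 150, 500, 200]

print(vwap(ts, px, vol, 930, 933))  # VWAP for the 9:30-9:33 window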

With its sequence data type, eXtremeDB implements a columnar data layout. Data is transferred column-by-column (not row-by-row) from storage into RAM (the database cache), so no I/O is wasted transferring irrelevant data. Data likewise moves from RAM into the CPU cache column-by-column, so the cache is not “flooded” with unneeded data. See the diagram below, under “Hybrid Data Layout”.
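
The I/O side of this can be pictured with a toy example. The sketch below is illustrative only (plain Python files rather than eXtremeDB’s storage engine): because each column is persisted separately, the VWAP query reads just the three columns it needs and never touches the rest.

# Toy columnar store: one file per column, so unneeded columns cost no I/O.

import os
import pickle
import tempfile

columns = {
    "timestamp": [930, 931, 933],
    "price":     [101.25, 101.30, 101.27],
    "volume":    [300, 150, 500],
    "exchange":  ["N", "Q", "N"],   # never read by the VWAP query
    "condition": ["@", "@", "F"],   # never read by the VWAP query
}

storage = tempfile.mkdtemp()
for name, values in columns.items():
    with open(os.path.join(storage, name + ".col"), "wb") as f:
        pickle.dump(values, f)

def load_column(name):
    """Fetch a single column from storage."""
    with open(os.path.join(storage, name + ".col"), "rb") as f:
        return pickle.load(f)

# Only timestamp, price and volume are loaded; exchange and condition stay on disk.
ts, px, vol = (load_column(c) for c in ("timestamp", "price", "volume"))
window = [(p, v) for t, p, v in zip(ts, px, vol) if 930 <= t <= 933]
print(sum(p * v for p, v in window) / sum(v for _, v in window))  # VWAP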


Pipelining with eXtremeDB

Throughput between main memory and the CPU cache is two to four times lower than the rate at which the CPU can process data. Traditional DBMSs cross this bottleneck frequently, with the CPU typically handing off the results of each step of a multi-step calculation to temporary tables in main memory.

In contrast, pipelining vector-based statistical functions with eXtremeDB for HPC avoids these handoffs by keeping interim results in the CPU cache.
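
As a loose analogy (written in Python, and not eXtremeDB’s actual execution engine), the sketch below chains vector operations into a pipeline of small tiles, so no stage ever materializes a full-length temporary result:

# Each stage is a generator that hands small tiles to the next stage, the way a
# pipelined engine keeps interim results hot instead of parking them in
# temporary tables in main memory.

def tiles(column, tile_size=4):
    """Stage 1: stream a column as small, cache-sized tiles."""
    for i in range(0, len(column), tile_size):
        yield column[i:i + tile_size]

def scale(tile_stream, factor):
    """Stage 2: a vector operation applied tile by tile."""
    for tile in tile_stream:
        yield [x * factor for x in tile]

def running_sum(tile_stream):
    """Stage 3: aggregate across tiles without building the full vector."""
    total = 0.0
    for tile in tile_stream:
        total += sum(tile)
    return total

prices = [101.25, 101.30, 101.27, 101.40, 101.35, 101.28, 101.31, 101.33]

# The three stages form one pipeline; nothing between them is materialized.
print(running_sum(scale(tiles(prices), factor=100)))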

Record-Setting Time Series Analysis

The Securities Technology Analysis Center’s STAC-M3 benchmark is an industry-standard test of time series analysis performance, with queries provided by trading firms. Using the features described above, eXtremeDB has repeatedly delivered record-setting results in these independently audited STAC-M3 benchmark tests.

Hybrid Data Layout

While a columnar data layout accelerates time series analysis, a conventional row-based layout is often faster for data that is not sequential, including for relational database ‘join’ operations. eXtremeDB implements a row-based layout for all data types other than sequences. Row-based and columnar layouts can be combined in hybrid database designs to optimize performance when managing mixed data.
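
One way to picture the hybrid idea, sketched in plain Python rather than eXtremeDB’s schema language (the class and field names below are hypothetical): per-instrument scalar fields behave like a row, while the time series fields hang off that row as parallel columns.

# Hypothetical hybrid record: scalar "row" fields plus columnar sequence fields.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Instrument:
    # Row-like scalar fields, read and joined as a unit.
    symbol: str
    exchange: str
    lot_size: int
    # Column-like sequence fields, scanned independently for analytics.
    timestamps: List[int] = field(default_factory=list)
    prices: List[float] = field(default_factory=list)
    volumes: List[int] = field(default_factory=list)

xyz = Instrument("XYZ", "NYSE", 100)
for t, p, v in [(930, 101.25, 300), (931, 101.30, 150), (933, 101.27, 500)]:
    xyz.timestamps.append(t)
    xyz.prices.append(p)
    xyz.volumes.append(v)

# A lookup uses the row-like fields; an analytic scan touches only the columns.
print(xyz.symbol, xyz.exchange, xyz.lot_size)
print(sum(p * v for p, v in zip(xyz.prices, xyz.volumes)) / sum(xyz.volumes))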