Pipeline for Ultra Low Latency
Pipelining slashes latency by eliminating transfers between the CPU and main memory.
Read a white paper about Pipelining
Review the Pipelining datasheet
Watch a video about Pipelining
What is pipelining, and how does it reduce latency?
Pipelining is the programming technique in eXtremeDB that accelerates processing by combining the database system’s vector-based statistical functions into assembly lines of processing for market data, with the output of one function becoming input for the next. Calculations are pipelined in order to keep data within CPU cache during its transformation by multiple functions. Without pipelining, interim results from each function would be transferred back and forth between CPU cache and main memory, imposing significant latency due to the relatively lower-bandwidth front side bus (FSB) or quick path interconnect (QPI) between the two.
Row-by-row vs. Row and/or column
Relational databases typically store data in tables consisting of rows and columns, and transfer data row-by-row between storage, main memory and CPU cache, as in Figure 1. With a columnar approach, as in Figure 2, tables’ columns are the basic unit for constructing the database pages used for database input/output.
eXtremeDB for HPC implements columnar data layout for fields of type ‘sequence’. Sequences can be combined to form a time series, ideal for working with tick streams, historical quotes and other sequential data. This technology supports database designs that combine row-based and column-based layouts, in order to best leverage the CPU cache speed.
In multiple audited STAC-M3 benchmark tests, eXtremeDB has broken earlier records for the best (lowest) mean response times and for lowest standard deviation (highest predictability/lowest “jitter”) of test results. Review our results.
Throughput between main memory and CPU cache is 3x to 4x slower than the CPU can process data. Traditional DBMSs traverse this bottleneck frequently (twice per function), with the CPU handing off results of each step of a multi-step calculation to temporary tables in main memory.
In contrast, pipelining vector-based statistical functions with eXtremeDB for HPC avoids these hand offs, keeping interim results in CPU cache to reduce database latency.
Pipelining for High Performance
Here’s how pipelining works. Let’s say the application needs to calculate 5-day and 21-day moving averages for a stock, and detect the points where the faster moving average (5-day) crosses over or under the slower one (21-day).
This is accomplished below in SQL, using eXtremeDB’s vector-based statistical functions as expressions in a SELECT statement:
seq_window_agg_avg(Close, 21)), 1))
WHERE symbol = ‘IBM’;
- Two invocations of ‘seq_window_agg_avg’ execute over the closing price sequence to obtain 5-day and 21-day moving averages;
- The function ‘seq_sub’ subtracts 21- from 5-day moving averages;
- The result “feeds” a fourth function, ‘seq_cross’, to identify where the 5- and 21-day moving averages cross;
- Finally, the function ‘seq_map’ maps the crossovers to the original ‘Close’ sequence, returning closing prices where the moving averages crossed.
Columnar handling results in faster performance because only the column(s) of interest (closing prices, in our example) are brought from the database into CPU cache at the start of the operation. In contrast, conventional row-wise handling would bring database pages consisting of entire rows with all their columns into CPU cache.
Even more significantly, this approach eliminates the need to create, populate and query temporary tables outside CPU cache (i.e. in main memory), as would be required by other database systems and vector-based programming languages to manage interim results of processing (for example, no table is needed to contain 5-day moving averages, 21-day moving averages, etc.).
McObject’s technology achieves this through tile-based processing of vector elements. In the SQL example above, one tile of input data is processed by each invocation of ‘seq_window_agg_avg ()’, which each produce one tile of transformed data that is passed to ‘seq_sub ()’ which, in turn, produces a tile and passes it as input to seq_cross (), and so on until the last function, seq_map () has exhausted its input. Transferring tiles between functions occurs entirely within CPU cache – data is not “materialized” back into main memory until fully transformed tiles emerge at the end of the pipeline.
Temporary Tables Impose Overhead
In contrast, to accomplish the task discussed above using a traditional SQL DBMS, the developer first creates three temporary tables:
CREATE TEMP TABLE mavg ( 5day float, 21day float );
CREATE TEMP TABLE sub ( delta float );
CREATE TEMP TABLE crosses ( Price float );
The next step calculates 5-day and 21 moving averages and populates the temporary table ‘mavg’ with the results (this code assumes that the database system supports user-defined functions):
INSERT INTO mavg SELECT MovingAvg( Close, 5 ), MovingAvg ( Close, 21 )
WHERE symbol = ‘IBM’;
The next step populates the temp table ‘sub’ with the result of subtracting the two moving averages:
INSERT INTO sub SELECT 5day – 21day FROM mavg;
From here, “normal” SQL can go no further. Code in a language such as C/C++ is needed to get crossover positions.
Clearly, using the traditional SQL DBMS requires a lot more code. But more importantly, all temporary (transient) results used in processing have to be created as output that is external to CPU cache, in temporary tables. This results in multiple trips “across the transom” to fetch data into CPU cache, and then to write it back into the temporary table. This occurs for each temporary table.
Without pipelining, the overhead due to traffic across the QPI or FSB, which is several times slower than CPU cache, is enormous. The delay generated by these transfers is multiplied by the number of pages required to move a given set of data back and forth, and the number of functions in the algorithm. In other words, processing by the SQL DBMS isn’t just 3-4 times slower than the approach using pipelining. It is 3-4 times slower times the number of pages that have to cross the QPI/FSB, times the number of functions. In contrast, pipelining reduces the number of transfers for interim results to zero.