The streaming benchmark is based on stock-exchange aggregation. Each message, representing one trade, is published to Kafka, and a simple windowed aggregation that counts the number of trades per ticker symbol is then performed using various data processing frameworks.
The latency is measured as the delay between the earliest time at which we could have received a result for the window an event belongs to, and the time the result was actually received.
For example, suppose we want the result for events that happened between 12:00:00.000 and 12:00:01.000. In theory, the earliest time we can get an aggregated result is 12:00:01.000. However, this does not account for the fact that an event happening at 12:00:00.999 might not reach the system immediately, and there can also be some out-of-orderness due to partitioning (Kafka only guarantees ordering within a partition).
As a solution, we allow for some delay so that all the events can reach the system. If the delay is configured as one second, we should wait until we receive an event with timestamp 12:00:02 before computing the window for events that happened between 12:00:00-12:00:01. Based on this, the earliest time we can get a result is the moment we receive an event with timestamp 12:00:02.000. If the system's actual output happens at 12:00:02.100, then we define the latency as 100 ms. This latency includes all of the following:
- Time for the message to be published to the broker
- Time for the message to be consumed by the framework
- Time for the message to be processed
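The latency definition above can be sketched in a few lines; the function and variable names here are assumptions for illustration, not taken from the benchmark's code:

```python
from datetime import datetime, timedelta

def benchmark_latency(earliest_result_time: datetime,
                      actual_output_time: datetime) -> timedelta:
    """Latency = actual output time minus the earliest time a result could
    have been produced, i.e. the arrival of the first event whose timestamp
    reaches window_end + allowed delay."""
    return actual_output_time - earliest_result_time

# Event with timestamp 12:00:02.000 arrives, allowing the 12:00:00-12:00:01
# window to be computed; the framework emits its result at 12:00:02.100.
earliest = datetime(2024, 1, 1, 12, 0, 2)
actual = datetime(2024, 1, 1, 12, 0, 2, 100_000)
print(benchmark_latency(earliest, actual))  # 0:00:00.100000, i.e. 100 ms
```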
Each framework is expected to output tuples to one or more files in the following format:
(WINDOW_TIME, TICKER, COUNT, CALCULATION_TIME, LATENCY)
WINDOW_TIME is defined as the end time of a window. For example, for a window of events between 12:00:00 and 12:00:01, the WINDOW_TIME value would be 12:00:01. Latency can then be calculated as the difference between CALCULATION_TIME and WINDOW_TIME.
A sample output line could look as follows (with illustrative values):
(12:00:01, AAPL, 42, 12:00:02.100, 1100)
The first value is the window close timestamp (WINDOW_TIME), which indicates what time period the result is for. The second value is the stock ticker, and the third is the count for that ticker within that window. The fourth value (CALCULATION_TIME) represents when the processing for that window was completed, and the last value (LATENCY) is simply CALCULATION_TIME – WINDOW_TIME.
If the allowed latency is 1000 ms, then this number should also be subtracted from LATENCY to find the real latency of the processing framework.
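This subtraction is straightforward arithmetic; a minimal sketch assuming millisecond timestamps (all names here are hypothetical):

```python
ALLOWED_LATENCY_MS = 1000  # the configured allowed delay (assumed 1 s)

def real_latency_ms(calculation_time_ms: int, window_time_ms: int,
                    allowed_ms: int = ALLOWED_LATENCY_MS) -> int:
    """LATENCY = CALCULATION_TIME - WINDOW_TIME; the framework's real
    latency excludes the configured allowed delay."""
    return (calculation_time_ms - window_time_ms) - allowed_ms

# Window closing at 12:00:01.000 (43_201_000 ms into the day), result
# emitted at 12:00:02.100 -> reported LATENCY 1100 ms, real latency 100 ms.
print(real_latency_ms(calculation_time_ms=43_202_100,
                      window_time_ms=43_201_000))  # 100
```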
The following windowing combinations are tested:
- 1-second tumbling window
- 10-second sliding window with a 100-millisecond slide
- 60-second sliding window with a 1-second slide
The allowed out-of-orderness is 1 second.
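For illustration, the windows a given event falls into under these combinations can be sketched as follows, assuming end-exclusive windows aligned to the epoch (the function name and this alignment are assumptions, not taken from the benchmark's code):

```python
def window_ends(event_ts_ms: int, size_ms: int, slide_ms: int) -> list:
    """Return the end timestamps (ms) of every window containing the event.
    A tumbling window is the special case slide == size."""
    # Earliest window end strictly after the event timestamp.
    first_end = event_ts_ms - (event_ts_ms % slide_ms) + slide_ms
    # Latest window end whose start is still <= the event timestamp.
    last_end = first_end + size_ms - slide_ms
    return list(range(first_end, last_end + 1, slide_ms))

print(window_ends(500, 1000, 1000))   # 1-second tumbling -> [1000]
print(len(window_ends(500, 10_000, 100)))  # 10 s by 100 ms -> 100 windows
```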
The output files as above are parsed by a simple log parser written in Python to calculate the average latencies.
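A minimal sketch of such a parser, assuming the tuple format shown above (the regex and function name are illustrative, not the benchmark's actual parser):

```python
import re
from pathlib import Path

# Matches lines like: (12:00:01, AAPL, 42, 12:00:02.100, 1100)
LINE_RE = re.compile(
    r"\((?P<window>[^,]+),\s*(?P<ticker>[^,]+),\s*(?P<count>\d+),"
    r"\s*(?P<calc>[^,]+),\s*(?P<latency>\d+)\)"
)

def average_latency(paths) -> float:
    """Parse output tuples from the given files and return the mean of the
    LATENCY field (in milliseconds)."""
    latencies = []
    for path in paths:
        for line in Path(path).read_text().splitlines():
            match = LINE_RE.search(line)
            if match:
                latencies.append(int(match.group("latency")))
    return sum(latencies) / len(latencies) if latencies else float("nan")
```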
All source code is available here: big-data-benchmark