An open source streaming library for data intensive applications
It’s our third birthday today, and to celebrate we decided to open source a project we’ve been working on since 2nd March 2020.
After three years in development, we’re pleased to release Quix Streams, a new library designed for building real-time applications that process high volumes of telemetry data when developers need a quick response and guaranteed reliability at scale.
Quix Streams is a lightweight library like Kafka Streams, has next-gen blob-backed state management like Flink, and Python-native support like Faust. It's written in C# and designed to be easily extended to other programming languages with an included interop generator project. Quix Streams is JVM free, does not require separate clusters or orchestrators, and includes enhanced features for working with time-series and binary telemetry data at the edge, on-prem or in the cloud.
Available to everyone for free under the Apache 2.0 licence. Quix Streams is aimed at people who work with time-series telemetry data streams—from software, data and ML engineers to data scientists and mechanical engineers.
Why we’re open sourcing Quix Streams
We’re excited by the potential of real-time telemetry data streams to create a new wave of next-gen applications at the intersection of the physical and digital worlds.
However we know this won’t happen unless we free developers from the drudgery of event streaming on the JVM and the hard choices they have to make between simplicity, power and reliability.
We want to provide the data community with a tool that is both powerful and easy to use. Thereby giving more developers the freedom to build previously unimaginable applications. Our hope is that open sourcing Quix Streams will encourage the wider data community to help make it even better.
What makes Quix Streams unique
Quix is a little different from other stream processing libraries and Kafka clients for a few key reasons:
- It treats time as a “first-class citizen"
The data in each message is given a timestamp that is automatically set as the primary key. This simplifies operations that involve time, such as the order of messages, buffering etc. Quix Streams also includes special handling for time-series data like window operations where the true timestamp from the source is used rather than the consumer clock.
- It provides a "stream context" which let you group messages by source
When dealing with topic partitions, it can be difficult to ensure that the right data is sent to the right partition. To solve this issue, we created the concept of a Stream context. This ensures that a stream binds all of its messages to one partition and creates an envelope for all different types of data in your topic. This means you can always be sure that the messages from each stream are ordered correctly and ensures that the integrity of the data is retained, even if there are outages or disruptions.
- It has built-in support for time-series data, event data, and large binary blobs
When appending data to a message, you would normally create a Python object and serialize it into bytes. In Quix Streams, the process is simpler. Instead, you append column values to rows and will merge the data into DataFrames and send them over the wire. This is not just more efficient for CPU and bandwidth but also easier to process further in the pipeline.
What you can do with Quix Streams
To learn more about what you can do with this library, check out the core features with simplified code samples to demonstrate how they work:
Use stream contexts for horizontal scaling
Stream contexts allow you to bundle data from one data source into the same scope with supplementary metadata—thus enabling workloads to be horizontally scaled with multiple replicas.
- In the following sample, the create_stream function is used to create a stream called bus-123AAAV which gets assigned to one particular consumer and will receive messages in the correct order:
Produce time-series data without worrying about serialization or deserialization
Quix Streams serializes and deserializes time-series data using different codecs and optimizations to minimize payloads in order to increase throughput and reduce latency.
- The following example shows data being appended to as stream with the add_value method:
Leverage built-in buffers to optimize processing operations for windows of time-series data
If you’re sending data at high frequency, processing each message can be costly. The library provides built-in time-series buffers for producing and consuming, allowing several configurations for balancing between latency and cost.
- For example, you can configure the library to release a packet from the buffer whenever 100 items of timestamped data are collected or when a certain number of milliseconds in data have elapsed (note that this is using time in the data, not the consumer clock).
buffer.packet_size = 100
buffer.time_span_in_milliseconds = 100
- You can then read from the buffer and process it with the on_read function.
Use Pandas DataFrames to produce data more efficiently
Time-series parameters are emitted at the same time, so they share one timestamp. Handling this data independently is wasteful. The library uses a tabular system that can work for instance with Pandas DataFrames natively. Each row has a timestamp and user-defined tags as indexes
Produce and consume different types of mixed data
This library allows you to produce and consume different types of mixed data in the same timestamp, like numbers, strings or binary data.
- For example, you can produce both time-series data and large binary blobs together.
Often, you’ll want to combine time series data with binary data. In the following example, we combine bus's onboard camera with telemetry from its ECU unit so we can analyze the onboard camera feed with context.
- You can also produce events that include payloads:
For example, you might need to listen for changes in time-series or binary streams and produce an event (such as "speed limit exceeded"). These might require some kind of document to send along with the event message (e.g. transaction invoices, or a speeding ticket with photographic proof). Here's an example for a speeding camera:
Leverage built-in stateful processing for greater resiliency
Quix Streams includes an easy-to-use state store combining blob storage and Kubernetes persistence volumes that ensures quick recovery from any outages or disruptions. To use it, you can create an instance of LocalFileStorage or use one of our helper classes to manage the state such as InMemoryStorage.
Here's an example of a stateful operation sum for a selected column in data.
Take advantage of performance and usability enhancements
The library also includes a number of other enhancements that are designed to simplify the process of managing configuration and performance when interacting with Kafka:
- No schema registry required: Quix Streams doesn't need a schema registry to send different sets of types or parameters, this is handled internally by the protocol. This means that you can send more than one schema per topic.
- Message splitting: Quix Streams automatically handles large messages on the producer side, splitting them up if required. You no longer need to worry about Kafka message limits. On the consumer side, those messages are automatically merged back.
- Message Broker configuration: Many configuration settings are needed to use Kafka at its best, and the ideal configuration takes time. Quix Streams takes care of Kafka configuration by default but also supports custom configurations.
- Checkpointing: Quix Streams supports manual or automatic checkpointing when you consume data from a Kafka Topic. This provides the ability to inform the Message Broker that you have already processed messages up to one point.
- Horizontal scaling: Quix Streams handles horizontal scaling using the streaming context feature. You can scale the processing services, from one replica to many and back to one, and the library ensures that the data load is always shared between your replicas reliably.
For a detailed overview of features, see our library documentation.
This is the first iteration of Quix Streams, and we’re already working on the next release. The main highlight is a new feature called "streaming data frames" that simplifies stateful stream processing for users coming from a batch processing environment. It eliminates the need for users to manage state in memory, update rolling windows, deal with checkpointing and state persistence, and manage state recovery after a service unexpectedly restarts. By introducing a familiar interface to Pandas DataFrames, we hope to make stream processing even more accessible to data professionals who are new to streaming data.
The following example shows how you would perform rolling window calculation on a streaming data frame:
Note that this is exactly how you would do the same calculation on static data in Jupyter notebook—so will be easy to learn for those of you who are used to batch processing.
There's also no need to grapple with the complexity of stateful processing on streaming data—this will all be managed by the library. Moreover, although it will still feel like Pandas, it will use binary tables under the hood—which adds a significant performance boost compared to traditional Pandas DataFrames.
To find out when the next version is ready, make sure you watch the Quix Streams GitHub repo.
We also want our roadmap to be shaped by feedback and contributions from the wider data community:
- If you find a bug or want to request an enhancement, feel free to log a GitHub issue.
- If you have questions, need help, or simply want to find out more about the library, try posting a message in our open Slack community “The Stream” or check out our documentation.
- If you want to improve the library, see the contribution guidelines.