What is stream processing?

An overview of stream processing: core concepts, use cases enabled, what challenges stream processing presents, and what the future looks like as AI starts playing a bigger role in how we process and analyze streaming data

Tun Shwe

VP Data

Stream processing fundamentals

Until recently, constrained computing power and concerns about complexity meant that we had to take snapshots of the real world and pass them along for batch processing. But that led to a disconnect between the real-time data of events taking place and how we could use it in our applications. With stream processing, it's as if we've switched from taking still photos to filming a live, high-definition broadcast. It’s a shift that equips us to build software that uses streaming data to predict and adapt to the world around us in real-time.

That’s because stream processing handles data as it arrives and delivers the results in a continuous output stream. Crucially, data processing takes place at the speed of the stream, meaning that we can now respond to events as they happen, rather than waiting for batch processing to catch up. It’s why our rideshare can reroute around traffic in real-time and our bank calls us to query a suspected fraudulent transaction even before we’ve noticed the money is missing. In effect, stream processing fulfills the promises of big data technology, dealing with a high throughput of data and putting it at the heart of how we understand and interact with the world.

In this article, we're going to examine how stream processing enables those use cases, what challenges stream processing presents, and what the future might look like, especially as AI starts playing a bigger role in how we process and analyze streaming data.

Overview of stream processing

We can unpack what stream processing is by starting with a broad overview of the flow of data:

Data source: All stream processing starts with at least one data source. That could be anything that continuously generates new data points. Think of a sensor on an IoT device, a log file generated by a microservice, or live social media feeds.
Message broker: An application ingests data from the data source and writes/produces them to a message broker such as Apache Kafka or RabbitMQ. This serves as a central source for distributing data to multiple readers/consumers in parallel.
Processing engine: In the middle is a stream processing engine such as Apache Storm, Spark, Flink, or Quix. This tool consumes the data stream, does some computation, and then passes it onto the next stage. Depending on how complex the process is, that could be another stream processor or a final output stream for delivery to another destination.
Exit the system via a data sink: Once we’ve done all we need to with the data, we write it to a destination, which is usually an external tool such as a database, or a visualization tool.

Let’s illustrate it with an example. Consider a utility company's billing system. The company offers customers an agile tariff that dynamically adjusts based on demand, wholesale pricing, and other factors. To determine the fluctuating retail price, the utility needs to continuously ingest data from multiple sources, process it in real time, and send the updated retail price to the billing system. Without real-time processing, such a dynamic tariff would be impossible.

Stream processing example: utility company's billing system

In a batch processing model, usage data is stored in legacy databases for later processing against a predefined tariff. That restricts the utility company's ability to capitalize on arbitrage opportunities and deprives customers of the chance to take advantage of lower prices when they become available.

If you’ve spent any time with just-in-time (JIT) or lean manufacturing, you might notice that stream processing bears a striking resemblance but with the principles applied to big data technology. In JIT manufacturing, components are delivered and assembled precisely when they are needed in the production line, eliminating waste and inefficiency. Similarly, in stream processing, data flows continuously into the system, and the processing work occurs at the precise moment it's required without introducing delays. This real-time approach minimizes data processing waste, ensures resources are utilized efficiently, and enables immediate responses to emerging insights and events.

Stream processing core concepts

Now we’ve got an overview of the basics, let’s go deeper into the core concepts of real-time data processing systems, including:

Data sources and formats
Data ingestion
Event time vs processing time
Stream processing architectures

Data sources and data formats

You need at least one data source, otherwise known as a data producer, to output data for you to process. There are a few ways we can think about these data sources but two are particularly useful. The most obvious, perhaps, is the type of source, such as:

IoT sensors: like our gas meter example above or a tyre pressure monitor on a combine harvester.
Log files: from a microservice, a web server, or elsewhere.
Stock and commodity data: often used for predictive analytics.
Sports results: for streaming to fans and for use in setting the odds in betting markets.
Location data: whether it’s a rideshare application, food delivery, or something more exotic such as wildlife tracking.
Traffic monitoring data: the flow rate of traffic through road systems, railways, harbours and airspace.

We can infer things like the volume and type of data just by knowing the source of an event stream. And that brings us onto the second, and perhaps more useful, way to think about data sources; the characteristics that affect how we process them:

Volume: how much data arrives in a particular time period.
Velocity: the speed at which the data arrives.
Variety and quality: for example, uniform data in a known format, such as location data from a GPS enabled device, or something more messy, like a social media stream.
Complexity: this will determine the stream processing logic needed for each data source.

Once we know the shape of our data sources, we can think about how to get those data streams into our stream processing engine.

Data ingestion

Getting the event streams from the data sources into our stream processing engine is the first step. It presents us with three challenges, as compared to batch processing:

Handling the volume of data: we need to make sure there’s enough capacity in our system to ingest all of the inbound data, even as that throughput fluctuates over time.
Minimizing latency: real-time analysis of the inbound data is at the heart of stream processing, so we need to ensure minimal latency, for example by introducing parallel computation.
Maintaining data integrity: can we trust the quality of the inbound data flows or do we need to add steps, and latency, for data cleanup?

Event time vs processing time

Minimal latency is a fundamental requirement of any real-time processing system. Stream processing systems live or die by the length of time between when an event occurs and when the system processes it. To measure the difference, we need to know:

Event time: The time at which the event occurred. For example, a gas meter reading taken at 15:01:32 has an event time of 15:01:32.
Processing time: The time at which our stream processor completed its work on that event. For example, that could be 15:01:49, giving us a delta of 17 seconds between event time and processing time.

But latency isn’t the only reason why this is important. Knowing when a data event actually occurred enables us to infer causality, analyze trends, and make predictions.

Minimal latency is a fundamental requirement of any real-time processing system

Related to this is windowing. Windowing in stream processing is where we group a sequence of events over a particular time frame or some other factor, such as the number of events. Collating events into manageable bounded chunks makes it easier to analyze and process the data stream. By applying windowing, we can observe patterns, trends, and anomalies over these defined intervals, making it easier to draw meaningful insights and make timely decisions based on the data's temporal characteristics.

Stateful stream processing

A subset of stream processing, stateful stream processing maintains a memory of past events. This memory, or "state", allows the processing system to track changes over time or across different data points. It also enables the system to perform complex calculations that depend on the sequence or accumulation of data points. This is in contrast to stateless stream processing, where each piece of data is processed independently, with no memory of previous events.

Check out this blog post to learn more about stateful stream processing, its challenges, key concepts, and common use cases.

Stream processing architectures

Stream processing is a complex problem to solve and that complexity has led to differing approaches to how data is ingested, processed, and managed in real-time. Those approaches make different trade-offs around latency, the complexity of the system, and similar factors.

The two principal data stream processing architectures are:

Lambda architecture
Kappa architecture.

It’s worth noting that lambda architecture has no connection to AWS Lambda.

Let’s compare them.

Factor	Lambda	Kappa
Method	Mixes batch processing, for historical data analysis, with real-time processing	Processes data in real-time
Latency	Batch processing introduces latency	Lower latency
Complexity	Harder to implement	Easier to implement
Scalability	Harder to scale as there are two systems (batch and stream) that must work in concert	Straightforward to scale on per-stream basis

The kappa architecture is best suited to use cases where real-time processing is the priority, such as analyzing real-time payments stream data for fraud detection. Lambda’s mix of batch processing and real-time stream processing opens the possibility of analyzing historical data alongside live streaming data. For example, the lambda architecture could be useful for a customer recommendation engine in ecommerce, combining past behavior with current clickstream data.

Stream processing engines

Everything we’ve seen so far points to stream processing being somewhat complex. That’s why there’s a whole ecosystem of specialized tools that ingest, process, and direct data streams.

Stream processing engine stages

Although they take somewhat different approaches to the problem, which we’ll look at in a moment, each stream processing engine breaks the problem into four or so stages:

Data ingestion: Data streams from sensors, logs, and other data producers enter the system via a message transport.
Data processing: Perhaps the most important stage of the stream processing engine, this is where the system draws out insights from the data and determines what actions should happen next. Examples of data processing that happens here include:
- Filtering: Choosing which data from the input stream to keep. This could be based on hashtags in a social media feed, for example.
- Aggregation: Calculations such as summing and averaging data points. For example, finding the average temperature from a range of thermometer readings.
- Enrichment: Adding data from other sources. For example, adding live exchange rate data whilst processing a customer’s card payments abroad.
- Complex event processing (CEP): Looking to identify relationships and patterns and then take decisions, such as rerouting a rideshare car due to increased traffic on the original route.
State management: Throughout the process, the engine maintains the state of each individual stream, for example the maximum observed value in a time window for that stream, and some engines will also maintain a view of the global system state.
Data output: When the processing is complete, the engine outputs the data into a message queue, lakehouse, warehouse, or a similar form of persistent storage.

Evaluating stream processing engines

With each processing engine taking a different approach to the problem, what is it that separates Apache Flink from Spark, Kafka, Quix, and others?

To help evaluate them, we need to consider the criteria that separates them, such as:

Latency: How much latency does the engine introduce and is that acceptable for the use case you have in mind?
State management: The level of granularity. Does it maintain global state or only per-stream state? Does it persist state or is it ephemeral? Can it deal with stateful operations encompassing many records?
Fault tolerance and reliability: How well does the engine recover from failures? What is the scope for data loss in the event of a failure?
Scalability: What scaling model, if any, does the engine offer? What is the approach to data consistency vs availability across the cluster?
Developer experience: Is it easy to configure and integrate into your application architecture? Is there a strong developer community that can help and offer professional services?
Ecosystem: Does it have code samples plugins and connectors to enable the data integration you need?
Business model: Do the licensing and support costs make financial sense? Or if you’re going for a purely open source offering, is the community strong enough to ensure it remains active for the lifetime of your project?

Using these factors, you can select the stream processing engine that best fits your needs.

So, Apache Kafka might be best suited if you need very high throughput and the ability to scale, with the flexibility of passing the actual processing off to other tools. Apache Flink, though, is better suited where low latency is a priority and you want to manage stateful processing.

And then there are fully managed services, such as our own Quix Cloud that brings microservices together, with Kafka and infrastructure as code.

Here’s how some of the most popular stream processing engines compare:

Stream processing use cases

Stream processing has had a profound impact on how we build applications and the expectations of end users. It’s reasonable to say that its ability to extract real-time insights from continuous data streams has revolutionized software development and user experiences.

For us as application developers, stream processing has enabled us to solve problems in ways that were previously impossible or impractical. Let’s take a look at some of the use cases that are particularly well suited to stream processing:

Real-time analytics: Using stream processing tools, e-commerce companies personalize product recommendations by analyzing how customers behave in real-time.
Fraud detection: Monitoring an individual’s real-time financial transactions and comparing them to historical data for that customer enables banks and card providers to identify and verify payments that appear to be out of character.
IoT and sensor data management: Fleets of IoT devices generate continuous and varied streams of data, which feed into stream processing engines for real-time decision making across diverse industries such as maritime and agriculture.
Network traffic monitoring: Continuous monitoring of network traffic enables rapid detection and response to security threats, performance bottlenecks, and other issues in network behavior.

Quix: a fully managed streaming engine

It’s almost hard to imagine a software landscape without stream processing. But provisioning and managing a streaming application architecture means diverting your engineering attention away from building the unique value delivered by your product.

That’s why we created Quix; to make Python stream processing simple, for ML and AI. We first built Quix Streams, an open source Python library for building containerized stream processing applications with Apache Kafka. We then wrapped it up with Quix Cloud, which provides fully managed containers, Kafka and observability tools to run your applications in production. With Quix, you can focus entirely on building serverless event streaming applications instead of dealing with the headache of managing the underlying infrastructure. To learn more, check out the Quix docs.

Last updated:

Apr 30, 2024

Share this article: