The challenges of processing data from devices with limited connectivity and how to solve them
Need to process data from frequently disconnected devices? You're better off using an event streaming platform paired with a powerful stream processing engine. Here's why.
Real-time processing is becoming more important in the IoT industry because engineers use sensor data to detect when a device or machine is about to fail, and can act to mitigate these failures. But real-time processing is challenging: messages often arrive out of sequence, making it difficult to maintain accurate time-series analysis and generate reliable alerts. This article explains why an event streaming platform paired with a stream processing engine is the best way to handle these challenges, and why message queues are unsuitable for this task in the long run.
The problem of late-arriving data
In any data pipeline that integrates multiple data sources, you’re going to get data that arrives late and out of order (“late” here means there's a lag between when the data is created and when it arrives in the pipeline). However, this problem is especially acute in the IoT world, where you’re collecting data from thousands of devices out in the field that are only intermittently online.
This issue complicates time-sensitive processing tasks, such as:
Detecting thresholds and sending alerts
When processing sensor data, you often need to determine whether a measurement exceeds a threshold and whether an alert should be sent. However, you usually don’t want to send an alert if the threshold is only crossed for a split second; the threshold should be crossed repeatedly over a longer time period, such as a minute, before warranting an alert. You also want to send alerts as close as possible to the time the threshold was crossed. This requires messages to be processed on time and in the correct order.
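To make this concrete, here's a minimal sketch of such an alerting rule written with Kafka Streams (which we'll come back to later in this article). The topic names, the 90-degree threshold, and the five-crossings-per-minute rule are all illustrative assumptions rather than a production recipe:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class ThresholdAlerts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "threshold-alerts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("temperature_readings",
                Consumed.with(Serdes.String(), Serdes.Double()))
            .filter((machineId, temp) -> temp > 90.0)   // keep only threshold crossings
            .groupByKey()
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
            .count()                                    // crossings per machine per minute
            .toStream()
            // A single blip isn't enough; require sustained crossings within the window.
            .filter((window, crossings) -> crossings >= 5)
            .map((window, crossings) -> KeyValue.pair(window.key(),
                "ALERT: " + crossings + " crossings in one minute"))
            .to("temperature_alerts", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Note that because the windowed count emits an update on every new crossing, a real application would also deduplicate repeat alerts for the same window.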
Checking for complex patterns over a window of time
Anomaly detection models are trained to evaluate inputs that cover a specific length of time, such as 30 minutes or an hour. A data pipeline must provide the model with the required input, but again, if messages arrive late and out of order, the ML model doesn’t have the whole picture and can’t produce an accurate output.
Event streaming platforms are better at processing data from devices with limited connectivity
Event streaming platforms such as Apache Kafka excel at this kind of challenge. Systems like Kafka can seem more complicated than a message queue, but they make things simpler in the long run.
Here’s why:
They retain a historical log of messages
A Kafka topic is a durable, immutable message log. Kafka never deletes messages unless they fall outside the configured retention time or size limit. Kafka itself doesn't automatically put messages in the correct order (as with message queues, messages in a topic are ordered in the sequence in which they arrive). However, unlike with message queues, a consumer application can read the message history from any point in time and reorder the messages in a local buffer before performing time-sensitive calculations. You can’t easily do that with message queues—once a message has been consumed, it’s gone.
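For instance, here's a minimal sketch of a plain Kafka consumer that rewinds a topic to a point in time and re-sorts the fetched records in a local buffer before processing them. The topic name, partition, and one-hour rewind are illustrative assumptions:

```java
import java.time.Duration;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class ReplayAndReorder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "reorder-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("sensor_readings", 0);
            consumer.assign(List.of(tp));

            // Rewind to the offset closest to one hour ago; a classic message
            // queue can't do this, since consumed messages are gone for good.
            long oneHourAgo = System.currentTimeMillis() - 3_600_000L;
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                consumer.offsetsForTimes(Map.of(tp, oneHourAgo));
            consumer.seek(tp, offsets.get(tp).offset()); // null-check omitted for brevity

            // Buffer and re-sort by timestamp before any time-sensitive work.
            PriorityQueue<ConsumerRecord<String, String>> buffer =
                new PriorityQueue<>(Comparator.comparingLong(ConsumerRecord::timestamp));
            consumer.poll(Duration.ofSeconds(5)).forEach(buffer::add);
            while (!buffer.isEmpty()) {
                ConsumerRecord<String, String> record = buffer.poll();
                // Stand-in for real processing, now in timestamp order:
                System.out.println(record.timestamp() + " " + record.value());
            }
        }
    }
}
```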
They support parallel processing
Parallel processing means that late-arriving data doesn’t slow down your whole data pipeline. Unlike simple message queues, Kafka follows the “pub/sub” messaging model, which means that multiple consumers can read from the same topic (which is a log rather than a queue). Not only that, Kafka’s data partitioning allows consumers to read from different “partitions” of a topic and process different ranges of data in parallel. You could achieve something similar with multiple message queues, but this takes a lot more manual work and makes your data pipeline a lot more complicated than it needs to be.
They support data partitioning
As mentioned above, Kafka lets you partition data in a topic by any type of key, such as “machine ID” or “site ID”, which facilitates parallel processing.
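On the producer side, this only requires setting the site ID as the message key: Kafka hashes the key to choose a partition, so all of a site's readings land in the same partition and stay in arrival order there. A minimal sketch, with an assumed topic name and payloads:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SiteProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The site ID is the message key, so every reading from the same
            // site is routed to the same partition of the topic.
            producer.send(new ProducerRecord<>("sensor_readings", "site-123", "{\"temp\": 71.3}"));
            producer.send(new ProducerRecord<>("sensor_readings", "site-456", "{\"temp\": 68.9}"));
        }
    }
}
```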
Suppose that you have the following scenario:
- You have a site with multiple machines that generate sensor data—let’s call it “site 123”. The site’s data is assigned to “partition A” with “consumer A” in charge of data processing.
- Likewise, you have a second site, “site 456”, which also generates machine data. This data is assigned to partition B with its own “consumer B”.
Now, suppose that data transmission from “site 123” has been fine, but “site 456” has had constant connectivity issues the whole week.
It's finally back online and is sending a backlog of data to be processed. Consumer B needs time to catch up with the backlog of data from “site 456”, but this won't affect data processing for the first site (“site 123”). Consumer A (which is processing data from “site 123”) runs as a completely separate process with its own system resources.
Using event streaming together with a stream processing engine
Neither an event streaming platform nor a message queue can reorder your messages for you. You need a consumer application to do that. And to manage ordering correctly, an application needs to handle different time semantics.
Two senses of time
Most pipelines that process sensor data work with two types of timestamps:
- Event time represents when the sensor reading was actually taken (for example, when a vehicle's temperature sensor recorded a specific reading). This is the most important timestamp for analysis and alerting. It's contained within the message payload, and its format can vary between producing applications.
- Ingestion time marks when the message first reaches your cloud infrastructure. This might be hours or days after the event time if the device was offline. It’s metadata that is added by the messaging system itself.
To process sensor data accurately, a processing application needs to use the event time (the ingestion time is mostly useful for monitoring and debugging). This will allow it to order messages properly and perform other time-sensitive tasks such as updating previous calculations when late data arrives.
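In Kafka Streams (covered next), this is exactly what a custom TimestampExtractor is for. Here's a minimal sketch that reads the event time from a JSON payload; the event_time field name (epoch milliseconds) and the string-encoded JSON value are assumptions, so adapt them to whatever your producing applications actually write:

```java
import com.fasterxml.jackson.databind.ObjectMapper; // Jackson, for JSON parsing
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Pulls the event time out of the message payload so that windowing is driven
// by when the reading was taken, not by when it reached the cluster.
public class SensorTimestampExtractor implements TimestampExtractor {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        try {
            // Assumes the value is a JSON string with an "event_time" field
            // holding epoch milliseconds.
            return MAPPER.readTree((String) record.value()).get("event_time").asLong();
        } catch (Exception e) {
            // Fall back to the timestamp Kafka recorded if the payload is malformed.
            return record.timestamp();
        }
    }
}
// Register it in your application config with:
// props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
//     SensorTimestampExtractor.class);
```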
How stream processing engines solve the ordering problem
Kafka Streams is Kafka’s native stream processing library and a popular choice for solving the ordering problem. It can handle different time semantics and, like most stream processors, it supports a “grace period” that lets you wait for late-arriving messages before finalizing a processing task.
For example, if you know that messages can often arrive up to 15 minutes late, you can set a grace period of 15 minutes for a windowed calculation (assuming you’re OK with delaying the results by 15 minutes).
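In Kafka Streams (using the 3.x API), that combination of window size and grace period maps onto a single window definition, along these lines:

```java
import java.time.Duration;
import org.apache.kafka.streams.kstream.TimeWindows;

// A 30-minute window that stays open to late-arriving records for another
// 15 minutes; results based on it only become final once the grace period ends.
TimeWindows windows =
    TimeWindows.ofSizeAndGrace(Duration.ofMinutes(30), Duration.ofMinutes(15));
```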
An example scenario
Let’s say you’re doing anomaly detection on a “hopping” window of 30 minutes of temperature data for a set of machines. Your ML model needs temperature averages broken down by minute, so you have a Kafka Streams application that calculates these required “features” and writes them to a “temp_features” topic.
However, between 2pm and 2:30pm, machine #127 has some connectivity issues, goes offline for a while, and buffers the data locally.
Then, it comes back online and sends a burst of accumulated sensor data at 2:31. Yet these late-arriving readings are from 2-2:30pm: one minute too late! The processing window for 2-2:30pm has already closed.
Since you have a grace period of 15 minutes, that data can still be incorporated into the processing window. The Kafka Streams application buffers the data in state (by default it uses RocksDB as a local database). When the late data arrives, Kafka Streams seamlessly integrates it into the existing data, sorts it by the event timestamp, and revises its calculations. After the 15-minute grace period is over (at 2:45pm), Kafka Streams releases its calculation results for the 2-2:30pm time window and produces a message to the downstream “temp_features” topic.
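Pulling these pieces together, a sketch of that feature-computing application might look like the following. It reuses the hypothetical SensorTimestampExtractor from earlier, tracks the running sum and count as a plain “sum,count” string to keep the sketch dependency-free, and holds results back until each window's grace period has passed. (The article's scenario feeds a 30-minute hopping window downstream; this sketch covers just the per-minute averages written to the “temp_features” topic.)

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;

public class TempFeatures {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "temp-features");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Use the event time from the payload (see the extractor sketch above).
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
            SensorTimestampExtractor.class);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("temperature_readings",
                Consumed.with(Serdes.String(), Serdes.Double()))
            .groupByKey()
            // Per-minute windows; each stays open to late data for 15 minutes.
            .windowedBy(TimeWindows.ofSizeAndGrace(
                Duration.ofMinutes(1), Duration.ofMinutes(15)))
            // Track "sum,count" as a plain string to avoid a custom serde.
            .aggregate(
                () -> "0.0,0",
                (machineId, temp, agg) -> {
                    String[] parts = agg.split(",");
                    return (Double.parseDouble(parts[0]) + temp) + ","
                        + (Long.parseLong(parts[1]) + 1);
                },
                Materialized.with(Serdes.String(), Serdes.String()))
            // Hold results back until the grace period has passed, then emit once.
            .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
            .toStream()
            .map((window, agg) -> {
                String[] parts = agg.split(",");
                double avg = Double.parseDouble(parts[0]) / Long.parseLong(parts[1]);
                return KeyValue.pair(
                    window.key() + "@" + window.window().startTime(),
                    String.valueOf(avg));
            })
            .to("temp_features", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```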
Again, if you were using a regular message queue, you would have to write your own custom stateful processing logic. It’s much easier to use a library that provides a stateful processing API out of the box.
Other messaging and processing options
It’s worth noting that Kafka Streams isn’t the only stream processing tool that can solve the ordering problem. Apache Flink is another popular option, and you don’t necessarily have to use it with Kafka (even though they’re often used together). Flink is very powerful but also notoriously complex, so it's only worth considering if you need its advanced processing capabilities.
Another caveat is that Flink and Kafka Streams are both JVM-based tools, which aren’t easy for the “casual programmer” to pick up. Casual programmers are people who specialize in another discipline, such as biology or mechanical engineering, and write code to solve problems in their own domain. For casual programmers, Python has long been the first choice due to its simplicity.
Fortunately, there are now several libraries for doing real-time processing in Python. The oldest is Faust, and new contenders include Bytewax, Glassflow and Quix Streams.
Likewise, you also don’t have to use Kafka as your event streaming platform. Kafka-compatible platforms such as Redpanda and Warpstream provide the same set of features as Kafka itself and can be cheaper to operate. There’s also AWS Kinesis, which doesn’t support the Kafka API but offers similar features. You can use it in combination with AWS Managed Flink, but again, this can get complicated (and expensive!).
Use event streaming to process data from devices with limited connectivity
When choosing a stack to process sensor data, avoid using message queues like AWS SQS or ActiveMQ. Those tools are designed for reliable asynchronous communication between applications, particularly for decoupling services and handling background tasks/jobs. But they’re not designed for high-throughput processing of ordered event data and stateful windowed calculations—both common requirements for sensor data processing.
The ideal stack should comprise an event streaming platform and a complementary stream processing engine.
The event streaming platform should allow you to retain and “replay” data in topics (or topic equivalents). It should also support parallel processing using a pub/sub model. This will enable you to leverage a complementary stream processing engine that can consume data from different partitions.
The stream processing engine should support different time semantics with configurable event times. It should also allow you to perform stateful processing with windowed calculations that include a configurable grace period.
Here, we’ve mostly talked about using Apache Kafka with Kafka Streams, but as mentioned before, you can use other event streaming platforms with other stream processing engines, including a range of easy-to-use Python libraries. Note that we haven’t talked about the infrastructure required to run your processing applications, since that’s really a different subject.
Nevertheless, if you’re in the IoT industry, you probably want something simple that doesn’t take ages to set up. Quix Cloud is a stream processing and Container as a Service (CaaS) platform that lets you host and run Kafka Streams and Quix Streams applications with minimal cost and a short learning curve. It’s a great starting point if you want to experiment with stream processing and gradually migrate your data pipeline from legacy infrastructure.
Mike Rosam is Co-Founder and CEO at Quix, where he works at the intersection of business and technology to pioneer the world's first streaming data development platform. He was previously Head of Innovation at McLaren Applied, where he led the data analytics product line. Mike has a degree in Mechanical Engineering and an MBA from Imperial College London.