I first heard about Apache Flink in 2016 when my day-to-day job was building batch-based data pipelines for demand forecasting using Apache Spark. A real-time stream processing requirement had emerged with data being ingested into Apache Kafka. Flink's cluster architecture was similar to Spark's and it seemed ideal for streaming use cases. However, this was a few months before it would conveniently be made available in Amazon EMR 5.1.0 so this meant our team would have to build and manage its infrastructure. For mainly this reason of operational complexity, the decision was made to process the data using Kafka Streams and Spark Structured Streaming. I would learn that this was a common decision-making pattern for teams at a time when Kafka was king.
Fast forwarding to today, Flink has become a mature battle-tested solution that has proven it can handle stateful stream processing at scale. We are also at the advent of great choice for Flink as a fully managed service so pairing Kafka with Flink has become a reality for teams these days. With their distinct capabilities, these two frameworks can either join forces or operate independently to cater to an organization's data volumes and deployment strategies.
In this article, we delve into the nuances of Kafka and Flink, carefully examining their unique features and limitations. As the data landscape evolves, understanding the adaptability and potential of these streaming frameworks becomes crucial. So let's uncover the inner workings of these open source streaming titans.
If you’re here because you’re planning to build an event-driven application, I recommend the “Guide to the Event-Driven, Event Streaming Stack,” which talks about all the components of EDA and walks you through a reference use case and decision tree to help you understand where each component fits in.
Key principles of Apache Kafka and Apache Flink
Kafka and Flink are designed to address a wide range of streaming challenges for companies. However, they specialize in optimizing distinct aspects of streaming data and cater to slightly different engineering skill sets.
Understanding Apache Kafka
Originally developed in 2011 at LinkedIn for real-time data processing, Kafka has since evolved from a messaging system into a full-fledged streaming platform, supporting both computing and storage (which launched many a debate on whether it could replace a database). Kafka Streams is one of the four key APIs of Apache Kafka, joining the ranks of the Producer API, Consumer API, and Connector API. Notable benefits of Kafka Streams include:
- No manual cluster building required
- Automatic failover capabilities
- Impressive scalability features
- Seamless integration with other technology stacks
As a comprehensive stream processing engine, Kafka Streams powers a diverse range of applications, such as microservices and reactive/event-driven applications.
Understanding Apache Flink
Flink's versatility lies in its support for both streaming and batch processing. In Flink, streams can be either unbounded (stream processing) or bounded (batch processing). It's important to note that, unlike Apache Spark, Flink's foundation and default runtime prioritize streaming over batch processing.
Flink also excels in stateful computing, enabling the processing of events based on event data over time at in-memory speed. This allows for more advanced streaming operations, such as joins and aggregations. When used in conjunction with Apache Kafka, Kafka effectively serves as a storage layer.
Comparing architectures: Apache Kafka vs Apache Flink
Architecture design is a key difference between Kafka and Flink. Kafka Streams is designed to simplify cluster management, while Flink is built on a controller/worker, cluster-based paradigm.
Apache Kafka Architecture
The Kafka Streams API consists of three main elements: producers, consumers, and brokers (Figure 1). Producers generate data and send it to brokers, while consumers read the data ingested by the brokers.
Brokers run on a Kafka cluster, and producers/consumers are entirely decoupled from the system. Each broker stores the actual data sent by producers in topics, collections of messages belonging to the same group/category. These topics can be divided into multiple partitions for optimization. Partitioning data offers benefits like fault tolerance, scalability, and parallelism. Additionally, each broker may only contain parts of the partitions of a topic, distributing the rest across other brokers. This approach helps balance the workload between brokers. To improve reliability, the Kafka cluster can be configured to have replicas for different topics, limiting downtime if a broker becomes unavailable.
Apache Flink Architecture
Flink follows a controller-worker paradigm composed of Job Managers and Task Managers, along with a Kappa architecture (input data enters a single processor stream, and batch processing is handled as an edge case).
When new code is written and run using Flink, an execution graph is created and submitted to the Job Manager. The Job Manager orchestrates the instructions received and sends them to different task managers for execution (Figure 2). Task Managers consist of Task Slots allocated with a fixed amount of resources for processing.
For example, if a task manager has three task slots, a third of its controlled RAM would be assigned to each slot. Additionally, as Task Managers are a JVM (Java Virtual Machine) process, multiple slots can share the same JVM and exchange data sets and data structures to limit operations overhead. In this way, the number of Task Slots defines the number of tasks a Task Manager can perform in parallel. If optimization is needed, Task Slots definitions can be managed by the ResourceManager in the Job Manager.
Finally, the Dispatcher in the Job Master can provide a REST interface to create new Job Managers and submit Flink applications.
With this multi-level approach, Flink can process large amounts of data in near real-time and provide fault tolerance (allowing operations to restart after failure with limited data loss).
Some key architectural limitations of Kafka Streams and Flink:
- Using Flink and Kafka together on separate clusters can enable connections to non-Kafka sources and sinks. However, this approach may lead to messy integrations if overused.
- Flink requires cluster infrastructure management, while Kafka Streams has no setup requirements and is mainly used within microservices. As a result, Flink may require support from an infrastructure team, unlike Kafka Streams.
Code demonstration and comparison: Apache Kafka vs Apache Flink
When choosing a framework, usability and support availability are crucial factors to consider. Both Kafka Streams and Flink provide extensive support for Java and Scala users, with Flink additionally offering SQL and Python APIs.
Kafka Streams (Kafka’s native stream processing library) can be used with Python in a roundabout way, via the ksqlDB REST API. Otherwise, its full capabilities are not really available to Python users. As more Data Science and ML teams are moving to Python, this can be an essential differentiating factor. However, Flink and Kafka are still mainly based on Java and Scala and the documentation for other languages isn’t quite on par with the Java documentation.
To compare the two systems thoroughly, it’s worthwhile to look at some code samples. Note that the following examples both use Kafka as a data source and sink, because Flink doesn’t have any way of durably storing data—it’s purely a processing framework. Conversely, Kafka can store data and can do some native stream processing (using Kafka Streams), but this is not easily available to Python engineers.
Using Kafka without Flink
The following example uses the kafka-python library which is a simple library for consuming and producing data. It consumes data from a Kafka topic, filters out values below a certain threshold, computes a moving average over a window of values, and produces the results to another Kafka topic.
- Note that an external library is needed to transform the data (Pandas). You can’t do this with kafka-python alone.
- Also, temporary data is stored in an array (data = ) which is not fault tolerant. The state will get lost if the process is interrupted.
In the previous example, it’s actually Pandas doing all the stream processing while Kafka is used to source the raw data and store the processing results. This is in contrast to Flink which can do the processing natively.
Using Kafka together with Flink
The following example uses the PyFlink library. It leverages Flink’s native Kafka consumer and producer classes to source and store data. It also uses Flink's native APIs to again do the same calculation as before (filter out values below a certain threshold, compute a moving average over a window of values), then produce the results to another Kafka topic.
Note that in Flink, you don't need to manually manage the state like in the Kafka-only example. Flink's windowing API abstracts the state management for you. When you define a window operation on a DataStream, Flink creates and manages the state internally.
Comparing Flink’s API with the Kafka Streams API
Although there’s not a lot of Python support for Kafka Streams, it would be remiss not to provide a general comparison of its stream processing capabilities vs Flink’s.
Both Flink and Kafka Streams are powerful and flexible tools for real-time data processing. They offer distinct features and abstractions that cater to specific use cases.
- Flink's API provides a rich set of built-in operators and extensive support for advanced stream processing features, such as event time processing, complex event processing, and flexible windowing. Additionally, Flink's state management and checkpointing mechanisms enable efficient handling of large-scale stateful applications with strong consistency guarantees.
- On the other hand, Kafka Streams is designed as a lightweight library tightly integrated with Kafka. It focuses on simplicity and ease of use while still offering core stream processing capabilities, such as stateful transformations and windowed operations. However, Kafka Streams might not provide the same level of advanced features and flexibility as Flink.
The following table illustrates the key differences and similarities between Flink's and Kafka Streams' API:
Table 1: Flink vs Kafka Streams Comparison
In summary, Apache Flink and Kafka Streams offer distinct approaches to stream processing and distinguish themselves in the following ways:
- Kafka Streams only supports unbounded streams, while Flink caters to both unbounded and bounded streams.
- With the introduction of exactly-once semantics (EOS), streaming services like Flink and Kafka can process transaction data without the risk of duplication or loss. However, it's crucial to always consider strict SLAs (service level agreements) and potential scaling limitations.
- Flink holds an advantage over Kafka in its support for ANSI SQL and its Python API, PyFlink, which facilitates seamless integration with ETL and ML workflows.
- Both platforms deliver low latency and high throughput, with Flink generally achieving lower latency due to its event-driven processing. Additionally, they both benefit from extensive open source communities and commercial support.
The choice between Flink and Kafka Streams ultimately depends on your specific requirements and use cases. Flink offers more advanced features and flexibility, while Kafka Streams provides a more lightweight solution, boasting tight integration with Kafka.
Apache Flink is a powerful companion to Apache Kafka
The relationship between Apache Kafka and Flink is not one of rivalry, but rather of synergy. While Kafka serves as a distributed streaming platform, Flink operates as a stream processing framework. Both can be employed to achieve similar objectives, either through native Kafka products or by combining Flink with Kafka. It is worth noting that Flink can be thought of as an augmented version of Kafka Streams, albeit with a higher cost of ownership due to its cluster infrastructure. As we’ve said before, “Handling streaming data is not for the faint of heart or thin of wallet”.
Both Kafka and Flink are designed to scale effectively. Kafka's architecture emphasizes fault tolerance and high throughput, enabling it to operate on a cluster of servers that can be readily expanded to boost data processing capacity. Flink, on the other hand, focuses on scaling the computation needed to analyze and manipulate data streams in real-time. Therefore, when running Flink alongside Kafka, be prepared to manage multiple clusters to accommodate their respective scaling requirements.
For Python teams, choosing between Kafka with Kafka Streams and Flink with Kafka will depend on how sophisticated your processing needs are. Since Kafka Streams is not available in Python, you would need to rely on the Kafka Consumer and Producer APIs along with external Python libraries to implement stream processing logic (or use the ksqlDB REST API). The advantage of this approach is that you can integrate the stream processing code directly into your application, eliminating the need for a separate cluster dedicated to stream processing. While this method works well for simpler use cases, it may prove to be more cumbersome and less performant compared to dedicated stream processing frameworks like Flink.
Conversely, if your team requires more advanced stream processing features, such as complex event processing, windowing, or stateful processing, combining Flink with Kafka may be a more suitable option. PyFlink, Flink's Python API, enables Python teams to harness Flink's robust stream processing capabilities while ensuring seamless integration with Kafka.
As always with these comparisons, you have to weigh processing power and performance against complexity and infrastructure costs. Or you could have your cake and eat it too with Quix, a fully managed cloud platform that has its own stream processing library: Quix Streams. As we’ve shown in our Apache Flink vs Quix comparison, you get the same power and performance as Flink without the headache of running a separate Flink cluster. If your team has the necessary Java and Scala chops to get the best out of Kafka and/or Flink, then it makes more sense to manage everything yourselves. But if you’re a lean Python team that would prefer to be more autonomous then it’s worth considering Quix as a third option.
What’s a Rich Text element?
The rich text element allows you to create and format headings, paragraphs, blockquotes, images, and video all in one place instead of having to add and format them individually. Just double-click and easily create content.
Static and dynamic content editing
A rich text element can be used with static or dynamic content. For static content, just drop it into any page and begin editing. For dynamic content, add a rich text field to any collection and then connect a rich text element to that field in the settings panel. Voila!
How to customize formatting for each rich text
Headings, paragraphs, blockquotes, figures, images, and figure captions can all be styled after a class is added to the rich text element using the "When inside of" nested selector system.
Tun Shwe is the VP of Data at Quix, where he leads data strategy and developer relations. He is focused on helping companies imagine and execute their strategic data vision with stream processing at the forefront. He was previously a Head of Data and a Data Engineer at high growth startups and has spent his career leading teams in developing analytics platforms and data-intensive applications. In his spare time, Tun goes surfing, plays guitar and tends to his analogue cameras.