Apache Kafka vs Apache Flink: friends or rivals?

Explore the unique features and limitations of Apache Kafka and Apache Flink and learn how these open source streaming titans can either join forces or operate independently.

Tun Shwe

VP Data

The 4 Pillars of a Successful AI Strategy

Foundational strategies that leading companies use to overcome common obstacles and achieve sustained AI success.

Get the guide

Guide to the Event-Driven, Event Streaming Stack

Practical insights into event-driven technologies for developers and software architects.

Get the guide

Introduction

I ‌first heard about Apache Flink in 2016 when my day-to-day job was building batch-based data pipelines for demand forecasting using Apache Spark. A real-time stream processing requirement had emerged with data being ingested into Apache Kafka. Flink's cluster architecture was similar to Spark's and it seemed ideal for streaming use cases. However, this was a few months before it would conveniently be made available in Amazon EMR 5.1.0 so this meant our team would have to build and manage its infrastructure. For mainly this reason of operational complexity, the decision was made to process the data using Kafka Streams and Spark Structured Streaming. I would learn that this was a common decision-making pattern for teams at a time when Kafka was king.

Fast forwarding to today, Flink has become a mature battle-tested solution that has proven it can handle stateful stream processing at scale. We are also at the advent of great choice for Flink as a fully managed service so pairing Kafka with Flink has become a reality for teams these days. With their distinct capabilities, these two frameworks can either join forces or operate independently to cater to an organization's data volumes and deployment strategies.

In this article, we delve into the nuances of Kafka and Flink, carefully examining their unique features and limitations. As the data landscape evolves, understanding the adaptability and potential of these streaming frameworks becomes crucial. So let's uncover the inner workings of these open source streaming titans.

If you’re here because you’re planning to build an event-driven application, I recommend the “Guide to the Event-Driven, Event Streaming Stack,” which talks about all the components of EDA and walks you through a reference use case and decision tree to help you understand where each component fits in.

Key principles of Apache Kafka and Apache Flink

Kafka and Flink are designed to address a wide range of streaming challenges for companies. However, they specialize in optimizing distinct aspects of streaming data and cater to slightly different engineering skill sets.

Understanding Apache Kafka

Originally developed in 2011 at LinkedIn for real-time data processing, Kafka has since evolved from a messaging system into a full-fledged streaming platform, supporting both computing and storage (which launched many a debate on whether it could replace a database). Kafka Streams is one of the four key APIs of Apache Kafka, joining the ranks of the Producer API, Consumer API, and Connector API. Notable benefits of Kafka Streams include:

No manual cluster building required
Automatic failover capabilities
Impressive scalability features
Seamless integration with other technology stacks

As a comprehensive stream processing engine, Kafka Streams powers a diverse range of applications, such as microservices and reactive/event-driven applications.

Understanding Apache Flink

Flink's versatility lies in its support for both streaming and batch processing. In Flink, streams can be either unbounded (stream processing) or bounded (batch processing). It's important to note that, unlike Apache Spark, Flink's foundation and default runtime prioritize streaming over batch processing.

Flink also excels in stateful computing, enabling the processing of events based on event data over time at in-memory speed. This allows for more advanced streaming operations, such as joins and aggregations. When used in conjunction with Apache Kafka, Kafka effectively serves as a storage layer.

Comparing architectures: Apache Kafka vs Apache Flink

Architecture design is a key difference between Kafka and Flink. Kafka Streams is designed to simplify cluster management, while Flink is built on a controller/worker, cluster-based paradigm.

Apache Kafka Architecture

The Kafka Streams API consists of three main elements: producers, consumers, and brokers (Figure 1). Producers generate data and send it to brokers, while consumers read the data ingested by the brokers.

Kafka cluster scheme. — *Figure 1: Kafka Streaming Architecture*

Brokers run on a Kafka cluster, and producers/consumers are entirely decoupled from the system. Each broker stores the actual data sent by producers in topics, collections of messages belonging to the same group/category. These topics can be divided into multiple partitions for optimization. Partitioning data offers benefits like fault tolerance, scalability, and parallelism. Additionally, each broker may only contain parts of the partitions of a topic, distributing the rest across other brokers. This approach helps balance the workload between brokers. To improve reliability, the Kafka cluster can be configured to have replicas for different topics, limiting downtime if a broker becomes unavailable.

Apache Flink Architecture

Flink follows a controller-worker paradigm composed of Job Managers and Task Managers, along with a Kappa architecture (input data enters a single processor stream, and batch processing is handled as an edge case).

When new code is written and run using Flink, an execution graph is created and submitted to the Job Manager. The Job Manager orchestrates the instructions received and sends them to different task managers for execution (Figure 2). Task Managers consist of Task Slots allocated with a fixed amount of resources for processing.

For example, if a task manager has three task slots, a third of its controlled RAM would be assigned to each slot. Additionally, as Task Managers are a JVM (Java Virtual Machine) process, multiple slots can share the same JVM and exchange data sets and data structures to limit operations overhead. In this way, the number of Task Slots defines the number of tasks a Task Manager can perform in parallel. If optimization is needed, Task Slots definitions can be managed by the ResourceManager in the Job Manager.

Finally, the Dispatcher in the Job Master can provide a REST interface to create new Job Managers and submit Flink applications.

Flink architecture. — Figure 2: Flink Architecture

With this multi-level approach, Flink can process large amounts of data in near real-time and provide fault tolerance (allowing operations to restart after failure with limited data loss).

Architecture limitations

Some key architectural limitations of Kafka Streams and Flink:

Using Flink and Kafka together on separate clusters can enable connections to non-Kafka sources and sinks. However, this approach may lead to messy integrations if overused.
Flink requires cluster infrastructure management, while Kafka Streams has no setup requirements and is mainly used within microservices. As a result, Flink may require support from an infrastructure team, unlike Kafka Streams.

Code demonstration and comparison: Apache Kafka vs Apache Flink

When choosing a framework, usability and support availability are crucial factors to consider. Both Kafka Streams and Flink provide extensive support for Java and Scala users, with Flink additionally offering SQL and Python APIs.

Kafka Streams (Kafka’s native stream processing library) can be used with Python in a roundabout way, via the ksqlDB REST API. Otherwise, its full capabilities are not really available to Python users. As more Data Science and ML teams are moving to Python, this can be an essential differentiating factor. However, Flink and Kafka are still mainly based on Java and Scala and the documentation for other languages isn’t quite on par with the Java documentation.

To compare the two systems thoroughly, it’s worthwhile to look at some code samples. Note that the following examples both use Kafka as a data source and sink, because Flink doesn’t have any way of durably storing data—it’s purely a processing framework. Conversely, Kafka can store data and can do some native stream processing (using Kafka Streams), but this is not easily available to Python engineers.

Using Kafka without Flink

The following example uses the kafka-python library which is a simple library for consuming and producing data. It consumes data from a Kafka topic, filters out values below a certain threshold, computes a moving average over a window of values, and produces the results to another Kafka topic.

Note that an external library is needed to transform the data (Pandas). You can’t do this with kafka-python alone.
Also, temporary data is stored in an array (data = []) which is not fault tolerant. The state will get lost if the process is interrupted.

 <script> # Kafka Consumer with Pandas for Moving Average Calculation
import pandas as pd
from kafka import KafkaConsumer, KafkaProducer

threshold = 50

# Create a Kafka Consumer to get the raw data
consumer = KafkaConsumer(
    "input-topic",
    bootstrap_servers=["localhost:9092"],
    api_version=(0, 10),
    value_deserializer=lambda x: int(x.decode("utf-8")),
)

# Create a Kafka Producer to send the processing results
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    api_version=(0, 10, 2),
    value_serializer=lambda x: str(x).encode("utf-8"),
)

# Data processing and moving average calculation
window_size = 3
data = []
for message in consumer:
    value = message.value
    if value >= threshold:
        data.append(value)
        if len(data) >= window_size:
            moving_average = pd.Series(data).rolling(window=window_size).mean().iloc[-1]
            producer.send("output-topic", moving_average)
            data.pop(0)
</script>

In the previous example, it’s actually Pandas doing all the stream processing while Kafka is used to source the raw data and store the processing results. This is in contrast to Flink which can do the processing natively.

Using Kafka together with Flink

The following example uses the PyFlink library. It leverages Flink’s native Kafka consumer and producer classes to source and store data. It also uses Flink's native APIs to again do the same calculation as before (filter out values below a certain threshold, compute a moving average over a window of values), then produce the results to another Kafka topic.

from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer, FlinkKafkaProducer
from pyflink.datastream.window import TumblingProcessingTimeWindows

threshold = 50

def example():
    # 1. create a StreamExecutionEnvironment
    env = StreamExecutionEnvironment.get_execution_environment()

    # Create a Kafka Consumer to get the raw data
    kafka_consumer = FlinkKafkaConsumer(
        "input-topic",
        SimpleStringSchema(),
        {"bootstrap.servers": "localhost:9092", "group.id": "flink"}
    )

    # Create a Kafka Producer to send the processing results
    kafka_producer = FlinkKafkaProducer(
        "output-topic",
        SimpleStringSchema(),
        {"bootstrap.servers": "localhost:9092"}
    )

    # 4. Create a Source DataStream
    ds = env.add_source(kafka_consumer)

    # 5. Define the execution logic of the
    # data processing and moving average calculation
    ds = ds.map(lambda x: int(x), output_type=Types.INT()).filter(lambda x: x >= threshold)
    ds = ds.window_all(TumblingProcessingTimeWindows.of("3 seconds"))
    ds = ds.reduce(lambda a, b: a + b).map(lambda x: x / 3, output_type=Types.FLOAT())

    # 6. emit result
    ds.add_sink(kafka_producer)

    # 7. execute the job
    env.execute("flink_kafka_moving_average_example")

example()

Note that in Flink, you don't need to manually manage the state like in the Kafka-only example. Flink's windowing API abstracts the state management for you. When you define a window operation on a DataStream, Flink creates and manages the state internally.

Comparing Flink’s API with the Kafka Streams API

Although there’s not a lot of Python support for Kafka Streams, it would be remiss not to provide a general comparison of its stream processing capabilities vs Flink’s.

Both Flink and Kafka Streams are powerful and flexible tools for real-time data processing. They offer distinct features and abstractions that cater to specific use cases.

Flink's API provides a rich set of built-in operators and extensive support for advanced stream processing features, such as event time processing, complex event processing, and flexible windowing. Additionally, Flink's state management and checkpointing mechanisms enable efficient handling of large-scale stateful applications with strong consistency guarantees.
On the other hand, Kafka Streams is designed as a lightweight library tightly integrated with Kafka. It focuses on simplicity and ease of use while still offering core stream processing capabilities, such as stateful transformations and windowed operations. However, Kafka Streams might not provide the same level of advanced features and flexibility as Flink.

The following table illustrates the key differences and similarities between Flink's and Kafka Streams' API:

| **Attribute** | **Apache Flink** | **"Kafka Streams ** | |------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------| | **(native Java API)"** | | **Processing Model** | "- Event-driven | | **- Processes data streams as events"** | "- Record-at-a-time processing | | **- Processes data streams as key-value records"** | | **Windowing & Aggregations** | "- Rich support for windowing (tumbling, sliding, session, global) | | **- Flexible windowing based on event time, processing time, and ingestion time"** | "- Support for windowing (tumbling, sliding, session) | | **- Windowing based on event time and processing time"** | | **State Management** | "- Stateful processing with built-in state backends (e.g., RocksDB) | | **- Checkpointing and savepoints for fault tolerance and recovery"** | "- Stateful processing with state stores (e.g., RocksDB) | | **- State is backed up through changelog topics in Kafka"** | | **Time Semantics** | - Strong support for event-time processing and handling out-of-order events | - Support for event-time processing but with limited handling of out-of-order events | | **API Complexity & Flexibility** | "- Provides DataStream API, Table API, and SQL for different levels of abstraction | | **- More complex but flexible API"** | "- Provides a single, high-level Streams DSL API | | **- More straightforward but less flexible API"** | | **Join Operations** | "- Supports inner, outer, and interval joins | | **- More flexible join options"** | "- Supports inner and outer joins | | **- Limited join options compared to Flink"** | | **Connectors and Integrations** | "- Rich set of connectors for various data stores and messaging systems | | **- Can be used with multiple data sources and sinks"** | "- Connectors provided through Kafka Connect | | **- Primarily designed to work with Kafka as the data source and sink"** | | **Deployment and Scaling** | "- Deployable on standalone clusters, YARN, Mesos, Kubernetes, and other environments | | **- Scaling based on the number of TaskManagers and task slots"** | "- Runs as a library within your application, scaling through Kafka Streams instances | | **- Scaling based on the number of Kafka partitions and application instances"** | | **Latency** | - Low latency due to event-driven processing | - Low latency due to record-at-a-time processing, but generally higher than Flink | | **Throughput** | "- High throughput | | **- Optimized for stream processing"** | "- High throughput | | **- Optimized for stream processing"** | | **Community and Support** | "- Large open source community and commercial support from Ververica | | **- Extensive documentation and resources"** | "- Large open source community and commercial support from Confluent | | **- Extensive documentation and resources"** | | **Ownership** | Infrastructure and BI teams | Business application owners |

Table 1: Flink vs Kafka Streams Comparison

In summary, Apache Flink and Kafka Streams offer distinct approaches to stream processing and distinguish themselves in the following ways:

Kafka Streams only supports unbounded streams, while Flink caters to both unbounded and bounded streams.
With the introduction of exactly-once semantics (EOS), streaming services like Flink and Kafka can process transaction data without the risk of duplication or loss. However, it's crucial to always consider strict SLAs (service level agreements) and potential scaling limitations.
Flink holds an advantage over Kafka in its support for ANSI SQL and its Python API, PyFlink, which facilitates seamless integration with ETL and ML workflows.
Both platforms deliver low latency and high throughput, with Flink generally achieving lower latency due to its event-driven processing. Additionally, they both benefit from extensive open source communities and commercial support.

The choice between Flink and Kafka Streams ultimately depends on your specific requirements and use cases. Flink offers more advanced features and flexibility, while Kafka Streams provides a more lightweight solution, boasting tight integration with Kafka.

Apache Flink is a powerful companion to Apache Kafka

The relationship between Apache Kafka and Flink is not one of rivalry, but rather of synergy. While Kafka serves as a distributed streaming platform, Flink operates as a stream processing framework. Both can be employed to achieve similar objectives, either through native Kafka products or by combining Flink with Kafka. It is worth noting that Flink can be thought of as an augmented version of Kafka Streams, albeit with a higher cost of ownership due to its cluster infrastructure. As we’ve said before, “Handling streaming data is not for the faint of heart or thin of wallet”.

Both Kafka and Flink are designed to scale effectively. Kafka's architecture emphasizes fault tolerance and high throughput, enabling it to operate on a cluster of servers that can be readily expanded to boost data processing capacity. Flink, on the other hand, focuses on scaling the computation needed to analyze and manipulate data streams in real-time. Therefore, when running Flink alongside Kafka, be prepared to manage multiple clusters to accommodate their respective scaling requirements.

For Python teams, choosing between Kafka with Kafka Streams and Flink with Kafka will depend on how sophisticated your processing needs are. Since Kafka Streams is not available in Python, you would need to rely on the Kafka Consumer and Producer APIs along with external Python libraries to implement stream processing logic (or use the ksqlDB REST API). The advantage of this approach is that you can integrate the stream processing code directly into your application, eliminating the need for a separate cluster dedicated to stream processing. While this method works well for simpler use cases, it may prove to be more cumbersome and less performant compared to dedicated stream processing frameworks like Flink.

Conversely, if your team requires more advanced stream processing features, such as complex event processing, windowing, or stateful processing, combining Flink with Kafka may be a more suitable option. PyFlink, Flink's Python API, enables Python teams to harness Flink's robust stream processing capabilities while ensuring seamless integration with Kafka.

Conclusion

As always with these comparisons, you have to weigh processing power and performance against complexity and infrastructure costs. Or you could have your cake and eat it too with Quix, a fully managed cloud platform that has its own stream processing library: Quix Streams. As we’ve shown in our Apache Flink vs Quix comparison, you get the same power and performance as Flink without the headache of running a separate Flink cluster. If your team has the necessary Java and Scala chops to get the best out of Kafka and/or Flink, then it makes more sense to manage everything yourselves. But if you’re a lean Python team that would prefer to be more autonomous then it’s worth considering Quix as a third option.

Share this article:

Words by

Tun Shwe

VP Data

Tun Shwe is the VP of Data at Quix, where he leads data strategy and developer relations. He is focused on helping companies imagine and execute their strategic data vision with stream processing at the forefront. He was previously a Head of Data and a Data Engineer at high growth startups and has spent his career leading teams in developing analytics platforms and data-intensive applications. In his spare time, Tun goes surfing, plays guitar and tends to his analogue cameras.