Skip to content

Quix Lake overview

Quix Lake is the persistence layer of Quix Cloud. It captures Kafka topic data into your own cloud storage (Amazon S3, Azure Blob, Google Cloud Storage, or MinIO) in open formats — so you keep full control of the bytes and can reuse them across replay, analytics, and ad-hoc query workloads.

Quix Lake ships two complementary storage options, each with its own managed sink:

Option Writes Best for
Data Lake Raw Kafka messages as Avro segments + Parquet index High-fidelity Replay, audit/compliance, append-only forensic record
Lakehouse Apache Iceberg tables (Parquet + snapshots) registered in a Catalog SQL analytics, dashboards, time-series queries

You can run either, both, or fan a single topic out to both — they share the same blob storage connection but are independent connectors with independent consumer groups.

Choosing between them

Pick based on what you want to do with the data after it lands:

  • You need Replay to push historical data back into Kafka, byte-for-byte, with original timestamps, partitions, offsets, headers, and gaps preserved.
  • You need a forensic record of every message — audit, regulatory, or incident-replay use cases.
  • Your payloads are arbitrary bytes (binary, mixed schemas, encrypted) and you don't want the platform to interpret them.
  • You want to point external tools at the raw Avro and do your own indexing.

Data Lake docs →

  • You want to query topic data with SQL without standing up your own warehouse.
  • You have structured payloads (the Lakehouse sink auto-discovers schema from Kafka messages).
  • You need time-series queries (range scans, aggregations) at interactive speed.
  • You're building dashboards, reports, or analytics features that consume historical data.

Lakehouse docs →

  • You want Replay fidelity for ops/incident response and SQL access for analytics over the same topic.
  • You're starting with Data Lake (replay) today and plan to layer analytics on top later.

Run a Data Lake Sink and a Lakehouse Sink deployment side-by-side with different consumer groups on the same topic. Storage cost roughly doubles for that topic.

Two distinct sinks

The two storage options have separate connector applications — they don't share an image, configuration surface, or library:

Data Lake Sink Lakehouse Sink
Output Raw Kafka messages as Avro + Parquet index Apache Iceberg tables (Parquet + snapshots)
Schema-aware? No — bytes pass through Yes — schema auto-discovered from messages
Iceberg support No Yes
Primary consumer Replay, external tools SQL via the Lakehouse Query engine

Both share these traits:

  • Consume from a single Kafka topic per deployment (run more than one for multiple topics).
  • Use the cluster's blob storage connection — you don't paste credentials into the sink.
  • Are managed services — Quix hosts and updates them; you provide configuration only.
  • Honor your cloud's IAM, encryption, and retention controls.

Architecture at a glance

flowchart TB
    topic([Kafka topic])
    dlSink["DataLake.Sink<br/>Avro + Parquet index"]
    lhSink["Lakehouse sink<br/>Iceberg tables"]
    bucket[("Your blob storage<br/>S3 / GCS / Azure / MinIO")]
    dlUi["Data Lake UI / API<br/>Replay → Kafka"]
    lhQuery["Lakehouse Query + UI<br/>SQL → dashboards"]
    external["External tools<br/>Spark, Trino, DuckDB, …"]

    topic --> dlSink
    topic --> lhSink
    dlSink --> bucket
    lhSink --> bucket
    bucket --> dlUi
    bucket --> lhQuery
    bucket --> external

Prerequisites

Both options require a blob storage connection configured for the cluster. See Blob storage connections. The Lakehouse is then provisioned on top of that connection — see the Lakehouse overview for what gets set up.

Where to next