Data Lake overview

Data Lake is the replay-first option in Quix Lake. It captures Kafka topic data into your blob storage (Amazon S3, Azure Blob Storage, Google Cloud Storage, or MinIO) as raw Avro segments alongside lightweight Parquet index files, so every message is preserved byte-for-byte and can be discovered without scanning the raw data.

If you're looking for SQL-queryable, columnar storage instead, see Lakehouse. Not sure which to pick? Read Choosing between them in the Quix Lake overview.

What you get

  • Portable — open Avro and Parquet, readable by DuckDB, Spark, Trino, Athena, BigQuery, and friends
  • Faithful — Kafka messages persisted exactly as they arrived: timestamps, headers, partitions, offsets, idle gaps
  • Discoverable — Parquet index summarizes every segment so the UI and API can list and filter without scanning Avro
  • Replayable — push any persisted dataset back into Kafka with original order and timing preserved or simulated
  • Yours — data lives in your bucket; you control IAM, keys, encryption, retention, and audit

Prerequisites

A blob storage connection must be configured for the cluster.

Storage layout

Data is written to your bucket in a predictable, Hive-style layout for easy discovery and external tooling.

Raw Avro:
<bucket>/<workspaceId>/Raw/Topic=csv-data/Key=B/Start=2025-08-21/
  ts_1755776884034_1755776886089_part_0_off_331135_331334.avro.snappy

Parquet index and custom metadata:
<bucket>/<workspaceId>/Metadata/Topic=data-source-json/Key=6/
  index_raw_0_129879.parquet
  metadata_<...>.parquet

See Open format for the full layout and schemas.
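
Because the layout is predictable, external tools can read basic facts about a segment straight from its path. The following is a minimal Python sketch, not an official parser. It assumes, based only on the single example path above, that the file name encodes `ts_<first timestamp ms>_<last timestamp ms>_part_<partition>_off_<first offset>_<last offset>`; the `SegmentInfo` type and `parse_segment_path` function are hypothetical names introduced for illustration. Check Open format for the authoritative naming rules.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# ASSUMPTION: this pattern is inferred from the one example path in this page,
# not taken from a published specification.
SEGMENT_RE = re.compile(
    r"Topic=(?P<topic>[^/]+)/Key=(?P<key>[^/]+)/Start=(?P<start>\d{4}-\d{2}-\d{2})/"
    r"ts_(?P<ts_from>\d+)_(?P<ts_to>\d+)"
    r"_part_(?P<partition>\d+)"
    r"_off_(?P<off_from>\d+)_(?P<off_to>\d+)\.avro\.snappy$"
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class SegmentInfo:
    topic: str
    key: str
    start_date: str
    first_ts: datetime
    last_ts: datetime
    partition: int
    first_offset: int
    last_offset: int


def parse_segment_path(path: str) -> SegmentInfo:
    """Extract topic, key, time range, partition and offsets from a segment path."""
    match = SEGMENT_RE.search(path)
    if match is None:
        raise ValueError(f"not a recognised segment path: {path}")

    def from_millis(value: str) -> datetime:
        # Timestamps in the example look like Unix epoch milliseconds (assumed).
        return EPOCH + timedelta(milliseconds=int(value))

    return SegmentInfo(
        topic=match["topic"],
        key=match["key"],
        start_date=match["start"],
        first_ts=from_millis(match["ts_from"]),
        last_ts=from_millis(match["ts_to"]),
        partition=int(match["partition"]),
        first_offset=int(match["off_from"]),
        last_offset=int(match["off_to"]),
    )


info = parse_segment_path(
    "<bucket>/<workspaceId>/Raw/Topic=csv-data/Key=B/Start=2025-08-21/"
    "ts_1755776884034_1755776886089_part_0_off_331135_331334.avro.snappy"
)
print(info.topic, info.key, info.partition, info.first_offset, info.last_offset)
```

Under this reading, the first timestamp in the example falls on the same UTC date as the `Start=` folder, which is consistent with the assumed epoch-millisecond interpretation.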

What you can do

  • Explore datasets with the Data Lake UI or API
  • Replay persisted datasets back into Kafka with full fidelity — see Replay
  • Search and filter by time ranges, topics, keys, and custom metadata
  • Query externally using DuckDB, Spark, Trino, Athena, or BigQuery over the raw Avro and Parquet
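
As one illustration of external querying, the sketch below builds a DuckDB SQL statement that reads the Parquet index files for a single topic. It only constructs the query string, so it runs without DuckDB installed. The bucket and workspace names are placeholders, `index_query` is a hypothetical helper, and the `index_raw_*.parquet` glob is inferred from the example file name in Storage layout. The index columns are not listed here, so the query selects everything rather than guessing names.

```python
def index_query(root: str, topic: str) -> str:
    """Build a DuckDB query over the Parquet index files of one topic.

    `root` is the `<bucket>/<workspaceId>` prefix, e.g. an s3:// URL.
    """
    # ASSUMPTION: index files match index_raw_*.parquet under Metadata/Topic=/Key=.
    glob = f"{root.rstrip('/')}/Metadata/Topic={topic}/Key=*/index_raw_*.parquet"
    # Double any single quotes so the glob is a valid SQL string literal.
    escaped = glob.replace("'", "''")
    # hive_partitioning exposes the Topic=/Key= folder names as columns.
    return f"SELECT * FROM read_parquet('{escaped}', hive_partitioning = true)"


# Placeholder bucket and workspace; substitute your own values.
query = index_query("s3://my-bucket/my-workspace", "data-source-json")
print(query)
```

With the `duckdb` Python package installed and storage credentials configured, the resulting string could be passed to `duckdb.sql(query)` to inspect the index.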

Cross-environment access

With the right permissions, you can browse datasets written by other environments using the Environment switcher in the Data Lake UI.

How it works

  1. Ingest — the Data Lake Sink writes raw Kafka messages to Avro segments in your storage.
  2. Index — Parquet index files summarize time, partition, offsets, and sizes for each segment.
  3. Discover — the UI and APIs read the index to list and filter quickly, never scanning Avro for catalog operations.
  4. Replay — any discovered dataset can be streamed back to Kafka with original order and timing preserved or simulated.
  5. Use — build pipelines that mix historical data with live streams, or run queries directly over the open files.
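
To make the Replay step concrete, here is a small sketch of how message timestamps could drive replay pacing. It is not the Quix implementation. It assumes that "preserved" timing means waiting for the original gap between consecutive messages and that "simulated" timing can be modelled as scaling those gaps by a speed factor; `replay_delays` and `speed` are hypothetical names.

```python
from typing import Iterable, List


def replay_delays(timestamps_ms: Iterable[int], speed: float = 1.0) -> List[float]:
    """Return the delay in seconds to wait before sending each message.

    speed=1.0 keeps the original gaps (including idle gaps);
    speed=2.0 replays twice as fast, speed=0.5 at half speed.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")

    delays: List[float] = []
    previous = None
    for ts in timestamps_ms:
        if previous is None:
            # The first message is sent immediately.
            delays.append(0.0)
        else:
            # Clamp negative gaps (out-of-order timestamps) to zero so the
            # original message order is kept without waiting a negative time.
            gap_ms = max(ts - previous, 0)
            delays.append(gap_ms / 1000 / speed)
        previous = ts
    return delays


original = replay_delays([1000, 1500, 3500])
double_speed = replay_delays([1000, 1500, 3500], speed=2.0)
print(original)
print(double_speed)
```

A replay loop would sleep for each delay and then produce the corresponding message, so order follows the stored sequence while pacing follows the chosen speed.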

Operational behavior

  • Soft deletion — catalog deletions move items to Trash for a short retention window before permanent removal, with restore and delete-forever actions.
  • Security — you control IAM, keys, encryption, retention, and audit in your own cloud account.
