Data Lake overview

Data Lake is the replay-first option in Quix Lake. It captures Kafka topic data into your blob storage (Amazon S3, Azure Blob Storage, Google Cloud Storage, or MinIO) as raw Avro segments alongside lightweight Parquet index files, so every message is preserved byte-for-byte and can be discovered without scanning the raw data.

If you're looking for SQL-queryable, columnar storage instead, see Lakehouse. Not sure which to pick? Read Choosing between them in the Quix Lake overview.

What you get

Portable — open Avro and Parquet, readable by DuckDB, Spark, Trino, Athena, BigQuery, and friends
Faithful — Kafka messages persisted exactly as they arrived: timestamps, headers, partitions, offsets, idle gaps
Discoverable — Parquet index summarizes every segment so the UI and API can list and filter without scanning Avro
Replayable — push any persisted dataset back into Kafka with original order and timing preserved or simulated
Yours — data lives in your bucket; you control IAM, keys, encryption, retention, and audit

Prerequisites

A blob storage connection must be configured for the cluster.

Storage layout

Data is written to your bucket in a predictable, Hive-style layout for easy discovery and external tooling.

Raw Avro:
<bucket>/<workspaceId>/Raw/Topic=csv-data/Key=B/Start=2025-08-21/
  ts_1755776884034_1755776886089_part_0_off_331135_331334.avro.snappy

Parquet index and custom metadata:
<bucket>/<workspaceId>/Metadata/Topic=data-source-json/Key=6/
  index_raw_0_129879.parquet
  metadata_<...>.parquet

See Open format for the full layout and schemas.
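
Because the layout is predictable, external tooling can recover partition values and segment ranges from object keys alone. The following is a minimal Python sketch. It assumes the raw file name follows the pattern `ts_<startMs>_<endMs>_part_<partition>_off_<firstOffset>_<lastOffset>.avro.snappy`, which is inferred from the single example above and not confirmed by a schema. The helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed pattern, inferred from the example file name in this section:
# ts_<startMs>_<endMs>_part_<partition>_off_<firstOffset>_<lastOffset>.avro.snappy
SEGMENT_RE = re.compile(
    r"^ts_(?P<start_ms>\d+)_(?P<end_ms>\d+)"
    r"_part_(?P<partition>\d+)"
    r"_off_(?P<first_offset>\d+)_(?P<last_offset>\d+)\.avro\.snappy$"
)


def parse_hive_path(path: str) -> dict[str, str]:
    """Collect Name=value directory components from a Hive-style path."""
    parts = {}
    for component in path.strip("/").split("/"):
        name, sep, value = component.partition("=")
        if sep:  # skip components such as the bucket or "Raw"
            parts[name] = value
    return parts


def parse_segment_name(filename: str) -> dict[str, int]:
    """Split a raw segment file name into its numeric fields."""
    match = SEGMENT_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized segment name: {filename}")
    return {name: int(value) for name, value in match.groupdict().items()}


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

Applied to the example above, `parse_hive_path` yields `Topic`, `Key`, and `Start` entries, and the first timestamp in the file name converts to a UTC date that matches the `Start=2025-08-21` directory.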

What you can do

Explore datasets with the Data Lake UI or API
Replay persisted datasets back into Kafka with full fidelity — see Replay
Search and filter by time ranges, topics, keys, and custom metadata
Query externally using DuckDB, Spark, Trino, Athena, or BigQuery over the raw Avro and Parquet files
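
As an illustration of external querying, the sketch below builds a DuckDB SQL statement over the Parquet index files. It is written under several assumptions: the bucket is reachable through an `s3://` URL, the index files keep the `index_raw_` prefix shown in the storage layout example, and the bucket and workspace names are placeholders. The index column names are not listed on this page, so the query selects every column and does not guess at them.

```python
def index_glob(bucket: str, workspace_id: str, topic: str = "*") -> str:
    """Glob matching Parquet index files under the Metadata prefix (assumed layout)."""
    return (
        f"s3://{bucket}/{workspace_id}/Metadata/"
        f"Topic={topic}/Key=*/index_raw_*.parquet"
    )


def build_index_query(glob: str, limit: int = 10) -> str:
    """SQL for DuckDB's read_parquet; hive_partitioning exposes Topic and Key as columns."""
    escaped = glob.replace("'", "''")  # escape single quotes for the SQL string literal
    return (
        "SELECT * "
        f"FROM read_parquet('{escaped}', hive_partitioning = true) "
        f"LIMIT {int(limit)}"
    )


# Hypothetical bucket and workspace names, for illustration only.
query = build_index_query(index_glob("my-bucket", "my-workspace", topic="csv-data"))
```

With the `duckdb` Python package installed and S3 credentials configured, the resulting string could be passed to `duckdb.sql(query)`. The same glob approach should carry over to other engines that understand Hive-style partition directories.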

Cross-environment access

With the right permissions, you can browse datasets written by other environments using the Environment switcher in the Data Lake UI.

How it works

Ingest — the Data Lake Sink writes raw Kafka messages to Avro segments in your storage.
Index — Parquet index files summarize time, partition, offsets, and sizes for each segment.
Discover — the UI and APIs read the index to list and filter quickly, never scanning Avro for catalog operations.
Replay — any discovered dataset can be streamed back to Kafka with original order and timing preserved or simulated.
Use — build pipelines that mix historical data with live streams, or run queries directly over the open files.
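
The Discover and Replay steps above can be sketched in a few lines of Python. This is an illustrative model, not the Quix implementation: `SegmentSummary` is a hypothetical stand-in for an index row (its field names are assumptions based on the "time, partition, offsets, and sizes" description), and "simulated" timing is modeled as a simple speed multiplier, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentSummary:
    """Hypothetical index row; field names are illustrative, not the real schema."""
    path: str
    start_ms: int
    end_ms: int
    partition: int
    first_offset: int
    last_offset: int
    size_bytes: int


def discover(index: list[SegmentSummary], from_ms: int, to_ms: int) -> list[SegmentSummary]:
    """Select segments overlapping [from_ms, to_ms] using only index rows."""
    hits = [s for s in index if s.start_ms <= to_ms and s.end_ms >= from_ms]
    # Order by partition, then offset, so each partition is read in offset order.
    return sorted(hits, key=lambda s: (s.partition, s.first_offset))


def replay_delays(timestamps_ms: list[int], speed: float = 1.0) -> list[float]:
    """Seconds to wait before each message; speed > 1 compresses the gaps."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not timestamps_ms:
        return []
    delays = [0.0]  # the first message is sent immediately
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        delays.append(max(cur - prev, 0) / 1000 / speed)
    return delays
```

The point of the split is that `discover` touches only the small summary rows, so listing and filtering never require opening an Avro segment. The raw files are read only once a replay or query actually needs the messages.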

Operational behavior

Soft deletion — catalog deletions move items to Trash for a short retention window before permanent removal, with restore and delete-forever actions.
Security — you control IAM, keys, encryption, retention, and audit in your own cloud account. Within Quix, each team only sees its own data by default. See the Storage Access Gateway.