Continuously updating a vector store
A three-step pipeline that you can use to ingest embeddings into a vector database as new content is published. When new content arrives, an event is emitted to Kafka with the text of the content as its payload. A consumer process listens for new content and passes it to an embedding model, which turns the text into vectors. The resulting vectors are published to a downstream Kafka topic, where any vector database can consume and ingest them at its own pace.
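For illustration, a new-content event on the incoming topic might carry a payload shaped like the one below. This is a minimal sketch, not the template's actual schema; the field names are assumptions, and the downstream embedding step only needs the text field.

```python
# Hypothetical payload for a new-content event; the template's actual
# schema may differ.
new_content_event = {
    "id": "doc-42",  # stable identifier for the piece of content
    "text": "Full text of the newly published content goes here.",
}
```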
Main project components
CSV Producer Jobs
Two jobs that show you how to incrementally produce data to Kafka using Quix Streams. These jobs simulate Change Data Capture (CDC), where embeddings are generated for content as soon as it's entered into a database.
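Here is a minimal sketch of such a producer using the Quix Streams Python client. The broker address, topic name, input file, and key column are all assumptions rather than the template's actual configuration; the one-second sleep simply makes rows trickle in one at a time, the way a CDC feed would.

```python
import csv
import time

from quixstreams import Application

# Assumed broker address and topic name; adjust to your deployment.
app = Application(broker_address="localhost:9092")
docs_topic = app.topic(name="raw-documents", value_serializer="json")

with app.get_producer() as producer:
    with open("documents.csv") as f:  # hypothetical input file
        for row in csv.DictReader(f):
            # Serialize the row with the topic's JSON serializer.
            message = docs_topic.serialize(key=row["id"], value=row)
            producer.produce(
                topic=docs_topic.name,
                key=message.key,
                value=message.value,
            )
            time.sleep(1)  # trickle rows in one at a time to mimic CDC
```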
Create embeddings
A worker service that uses Sentence Transformers to generate embeddings for any incoming documents it detects in the "raw documents" Kafka topic.
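A sketch of what this worker could look like, assuming the quixstreams and sentence-transformers packages, topics named "raw-documents" and "embeddings", and the all-MiniLM-L6-v2 model; the template's actual service may use different names and a different model.

```python
from quixstreams import Application
from sentence_transformers import SentenceTransformer

# Assumed broker address, consumer group, and topic names.
app = Application(broker_address="localhost:9092", consumer_group="embedding-worker")
raw_topic = app.topic("raw-documents", value_deserializer="json")
emb_topic = app.topic("embeddings", value_serializer="json")

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

sdf = app.dataframe(raw_topic)
# Attach an "embedding" field computed from each document's text.
sdf["embedding"] = sdf["text"].apply(lambda text: model.encode(text).tolist())
sdf = sdf.to_topic(emb_topic)

app.run()
```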
Ingest embeddings into Qdrant
A consumer service that reads from the embedding topic and uses the Qdrant client library to write to a vector database in Qdrant Cloud.
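A sketch of the ingestion side using the qdrant-client library. The cluster URL, API key, collection name, and topic name below are placeholders, and the vector size must match the embedding model used upstream (384 for all-MiniLM-L6-v2).

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from quixstreams import Application

# Placeholder Qdrant Cloud credentials; use your own cluster URL and API key.
client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_API_KEY")

COLLECTION = "documents"
if not client.collection_exists(COLLECTION):
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

app = Application(broker_address="localhost:9092", consumer_group="qdrant-ingest")
emb_topic = app.topic("embeddings", value_deserializer="json")

sdf = app.dataframe(emb_topic)

def write_to_qdrant(row: dict) -> None:
    # Upsert each embedded document as a single point, keeping the text
    # in the payload so search results are human-readable.
    client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(
                id=str(uuid.uuid4()),
                vector=row["embedding"],
                payload={"text": row["text"]},
            )
        ],
    )

# update() runs the callback for its side effect on every record.
sdf = sdf.update(write_to_qdrant)
app.run()
```

Because this consumer tracks its own Kafka offsets, Qdrant ingestion can lag behind embedding generation without losing data, which is what lets the database consume vectors at its own pace.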
Streamlit similarity search UI
A basic user interface that you can use to search the Qdrant vector database for semantically similar matches.
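A minimal Streamlit sketch, assuming the same placeholder collection name and embedding model as the ingestion service above:

```python
import streamlit as st
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Must match the collection and model used by the ingestion service.
client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_API_KEY")
model = SentenceTransformer("all-MiniLM-L6-v2")

st.title("Similarity search")
query = st.text_input("Enter a search phrase")

if query:
    # Embed the query and fetch the closest stored documents.
    hits = client.search(
        collection_name="documents",
        query_vector=model.encode(query).tolist(),
        limit=5,
    )
    for hit in hits:
        st.write(f"Score {hit.score:.3f}: {hit.payload['text']}")
```

Save this as app.py and launch it with streamlit run app.py.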
Technologies used
- Quix Streams
- Apache Kafka
- Sentence Transformers
- Qdrant (Qdrant Cloud)
- Streamlit
Using this template
This project could be easily adapted for use cases such as:
- Retrieval Augmented Generation (RAG)
- Product searches for ecommerce
- Recommendation systems