December 10, 2024 | Industry insights | Words by Mike Rosam

Rethinking “Build vs Buy” for Data Pipelines

“Build vs buy” is outdated — most companies need tools that provide the flexibility of a build with the convenience of a buy. It’s time for a middle ground.


When it comes to data pipelines, we already know what the best tech stacks look like. They’re reflected in the complex-looking architecture diagrams that appear in engineering blogs from companies like Netflix and Uber, and they power data pipelines that are custom-built by large engineering teams to address complex internal requirements.

Smaller-scale companies often have complex requirements too, but they don’t have the engineering resources to invest. They’re stuck with off-the-shelf solutions that can’t solve the really hard problems — the so-called “last mile” towards the ideal stack.

The best solutions are often open source (Apache Airflow, Spark, Kafka, and so on), but for many teams they’re too complex to set up and maintain, and their roadmaps can be unpredictable. Managing any one of these tools is doable, but data pipelines typically combine several of them. The real challenge is selecting the right components and making them work together.

Thus, the traditional "build vs buy" decision often falls short for small to mid-sized teams. 

"Build" diagram source | "Buy" diagram source

The hidden costs of buying

The “modern data stack” is a bit like the IKEA of the data world: easy to set up and cheap on the surface. An example stack could be Fivetran for integration, dbt Cloud for transformation, and Snowflake for storage. Buying pre-built solutions gets you up and running fast, saves you money, and frees your in-house team to focus on impactful analytics work. It’s easy to see the appeal—these tools are well-designed, easy to use, and they solve many of the common problems that data teams face.

The downside is that the cost of the modern data stack tends to grow quickly as your data volumes increase. These tools often seem affordable at first, but as you gather more data and your needs become more complex, the usage-based pricing can lead to sticker shock. Additionally, there's the issue of vendor lock-in. You may want to customize workflows or move to a different infrastructure setup, but find that the tools you’re using don’t make this easy.

What data teams really need

Data teams need a better balance between ease of use on one end of the spectrum, and control and flexibility on the other. 

For example, if you were to ask data engineers about their core requirements, you would often get variants of the following three answers:

  1. “It should be easy for me to spin up compute resources”
  2. “It should be flexible enough to let me integrate any data sources or sinks”
  3. “It should be flexible enough to let me add any processing logic and workflow”

Historically, you could only get these capabilities by building everything yourself. Infrastructure management alone requires significant expertise, and creating a future-proof system that enables rapid iteration simply isn't feasible for most teams.

Quix attempts to address these requirements by serving as a “pipeline builder” platform that separates infrastructure concerns from data processing logic. 

Here’s how:

  1. Simple infrastructure: Data engineers can provision compute resources in a few clicks, and there’s no labyrinthine permissions system to deal with because the pipeline is entirely contained in one environment. The Quix platform handles the complex underlying infrastructure (Kafka, Kubernetes, Docker) so engineers don’t need to worry about fine-grained configuration.
  2. Easy data integration: The Quix Streams Python SDK includes a modular source and sink API that makes it easy to integrate a new data source. There are also plenty of existing open source connectors based on the same API that engineers or data scientists can use as a reference.
  3. Flexible workflow and processing: In Quix, Python is a first-class citizen because it is much more flexible than SQL. Many of our customers need to run complex statistical calculations on sensor data that aren’t easy to express as SQL queries. Python also lets you build any workflow or business logic you want, such as calling external APIs or writing files. A minimal sketch of what this looks like in code follows this list.
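
To make points 2 and 3 concrete, here’s a minimal sketch of a Quix Streams application that consumes a Kafka topic, applies ordinary Python logic to each record, and publishes the refined records to another topic. The broker address, topic names, fields, and threshold are illustrative placeholders rather than anything prescribed by Quix, and the exact API surface can vary between Quix Streams releases.

```python
from quixstreams import Application

# Connect to a Kafka broker (the address is a placeholder).
app = Application(broker_address="localhost:9092", consumer_group="sensor-enrichment")

# Input and output topics -- the names are illustrative.
raw_topic = app.topic("raw-sensor-data", value_deserializer="json")
clean_topic = app.topic("clean-sensor-data", value_serializer="json")

# Build a streaming DataFrame over the input topic.
sdf = app.dataframe(raw_topic)

# Drop malformed readings before doing any maths on them.
sdf = sdf.filter(lambda row: row.get("temperature_c") is not None)

def enrich(row: dict) -> dict:
    # Any Python is allowed here: statistics, calls to external APIs, file writes, etc.
    row["temperature_f"] = row["temperature_c"] * 9 / 5 + 32
    row["alert"] = row["temperature_c"] > 90  # hypothetical alerting threshold
    return row

sdf = sdf.apply(enrich)

# Publish the refined records downstream.
sdf = sdf.to_topic(clean_topic)

if __name__ == "__main__":
    app.run()
```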

Maintaining control over infrastructure without the burden of maintenance 

The Quix BYOC Enterprise Edition is a good example of how you can get the control of a build with the convenience of a buy. It runs on your own cluster (e.g., GKE, AWS EKS, Azure AKS, or on-premises), so you retain infrastructure control while Quix handles networking, scaling, and other operational aspects.

Quix includes an Apache Kafka installation for testing real-time processing pipelines, providing a simpler approach than running these applications directly on Kubernetes. Real-time processing can be challenging, but with Quix, you’re getting an out-of-the-box experience that eases much of the operational complexity.

An illustration of how Quix can run in your VPC while integrating with external sources

Enabling the “full stack data scientist”

This approach also eases the burden on data scientists, who have often been expected to grapple with infrastructure when engineers are too busy. Chip Huyen (author of “Designing Machine Learning Systems”) discussed this in her 2021 article, “Why data scientists shouldn’t need to know Kubernetes.” 

A few years ago, many organizations rushed to hire data scientists without proper data infrastructure in place, leading to a shortage of data engineers and the dubious rise of the "Full-Stack" Data Scientist. 

In theory, this role could do it all—from data acquisition and preprocessing to model deployment and monitoring. In practice, it often left data scientists overwhelmed and wrestling with infrastructure tasks that diverted them from their core strengths. 

This problem still lingers today, as many companies struggle to balance the needs of their data teams with available resources. Better tooling is key to solving this—tooling that allows data scientists some high-level control without the need for deep infrastructure expertise. 

This is especially important for building real-time processing applications, which have traditionally been handled by software engineers. Real-time processing moves your data processing upstream, creating a more efficient pipeline. Data scientists, with their domain expertise, are best placed to build this processing logic, but they shouldn’t need to know the intricate details of managing Kubernetes or tuning Kafka clusters. Nevertheless, to develop real-time processing algorithms, they need to understand how many system resources those algorithms will consume. Quix makes this easier for data scientists because they can create scratchpads and deploy their code to test environments before going to production.

Shifting spend to the middle ground

When it comes to data pipelines, there’s no reason to either craft your own artisanal tools or rely entirely on IKEA-like SaaS solutions. There are other options along the build vs buy continuum. 

The modern data stack works if you’re a small startup with low data volumes, but as you scale, costs climb steeply. Processing massive volumes of raw data in dbt Cloud or Snowflake is pricey, and the performance trade-offs become more apparent. You don’t have to abandon these tools entirely, but in the long run, it’s more cost-effective to shift your processing left, towards “build.”

In this case, building is more like refactoring, since much of the processing logic is already encoded in your batch tools. You’re converting SQL queries or Spark jobs into small real-time applications that continuously refine raw data into high-quality, usable information. This reduces costs by storing and querying refined data instead of raw, unfiltered data. Raw data can still be archived in object storage for research use with tools like Apache Iceberg and Trino, but operational pipelines should use real-time services that fit seamlessly into a modern backend. If you deploy managed services using BYOC, you’ll also save on data transfer costs by keeping your pipeline enclosed within your VPC.
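
As a hedged illustration of this kind of refactor, the sketch below takes a batch-style aggregation (say, a per-device average temperature over one-minute buckets, the sort of thing you might express as a GROUP BY in SQL or a Spark job) and restates it as a continuously running tumbling-window aggregation with Quix Streams. The topic names and fields are placeholders, and the windowing API shown reflects recent Quix Streams releases, so treat it as a sketch rather than a drop-in implementation.

```python
from datetime import timedelta

from quixstreams import Application

# Batch equivalent (roughly):
#   SELECT device_id, AVG(temperature_c)
#   FROM readings
#   GROUP BY device_id, one_minute_bucket(event_time);
# Here the same average is maintained continuously as events arrive.

app = Application(broker_address="localhost:9092", consumer_group="rolling-averages")

readings = app.topic("raw-sensor-data", value_deserializer="json")
averages = app.topic("device-average-temperature", value_serializer="json")

sdf = app.dataframe(readings)

# Messages are assumed to be keyed by device_id, so window state is kept per device.
# Reduce each record to the numeric value we want to aggregate.
sdf = sdf.apply(lambda row: row["temperature_c"])

# One-minute tumbling windows; .final() emits a result when each window closes.
sdf = sdf.tumbling_window(duration_ms=timedelta(minutes=1)).mean().final()

# Each emitted record carries the window start/end timestamps and the aggregated value.
sdf = sdf.to_topic(averages)

if __name__ == "__main__":
    app.run()
```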

The build vs buy decision isn’t binary. The best solution is likely a combination of both, depending on your data journey. The key is to find a balance that provides control, cost-efficiency, and flexibility—while freeing your data scientists from unnecessary infrastructure burdens. Instead of being bogged down by operational complexity, they can focus on what they do best—building models, extracting insights, and driving value from your data. By selecting the right combination of managed services, BYOC models, and flexible pipeline-building tools, you can create an infrastructure that scales effectively, supports real-time processing, and doesn’t break the bank. In the end, it’s about recognizing that, as your data needs grow, so too should the sophistication and adaptability of your data infrastructure. Luckily, there are managed tools like Quix that can make this process less expensive. 

