November 1, 2024
|
Industry insights

Shifting Left: Discover What's Possible When You Process Data Closer to the Source

Learn how 'shifting left' in data engineering improves data quality by processing it closer to the source, following Netflix's example and modern best practices



You might have heard of 'shifting left' in software engineering—where testing happens earlier in the development life cycle. The same idea is making waves in data engineering. Originally, the concept of shifting left in data engineering was about treating data as a first-class product, applying a similar discipline to data as to software. The technical shift, which involves acting on data closer to its source, followed as a natural extension of this philosophy. For example, applying data validation directly at the point of customer interaction ensures any incorrect or incomplete data is filtered out immediately. This leads to higher quality, faster insights, and new opportunities that can transform your business.

By validating and refining data closer to its source, you improve its quality before it flows downstream, delivering a better data product. For instance, applying data validation rules right at the point of ingestion ensures that erroneous or incomplete data is filtered out or imputed early, resulting in higher quality datasets for analytics and machine learning. This proactive approach helps reduce errors and inconsistencies that can accumulate through traditional multi-stage processes, ensuring that data consumers—from analysts to data scientists—work with better data from the outset.
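As a rough sketch, validation at the point of ingestion can be as simple as a per-event check that rejects malformed records and imputes missing optional fields before anything flows downstream. The field rules and defaults below are hypothetical; in practice they would come from your data contracts or schema registry.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical rules for a customer-interaction event.
REQUIRED_FIELDS = {"user_id": str, "event_type": str, "timestamp": str}
DEFAULTS = {"country": "unknown"}  # fields we impute rather than reject


def validate_event(event: dict) -> Optional[dict]:
    """Return a cleaned event, or None if it fails validation."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(event.get(field), expected_type):
            return None  # reject: missing or wrongly typed required field
    try:
        # Normalize timestamps to UTC ISO-8601.
        ts = datetime.fromisoformat(event["timestamp"])
    except ValueError:
        return None  # reject: unparseable timestamp
    cleaned = {**DEFAULTS, **event}
    cleaned["timestamp"] = ts.astimezone(timezone.utc).isoformat()
    return cleaned
```

A function like this would run inside the service that produces the data—at the point of customer interaction—so that bad records never reach the analytics plane at all.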

A Case Study: Netflix's Shift-Left Approach

Take Netflix as an example. Netflix processes over 450 billion unique events daily from more than 100 million active users in 190 countries. It shifted from traditional batch ETL to stream processing with Apache Flink, significantly improving efficiency and cutting the time from data generation to decision. Its previous batch jobs took up to eight hours to complete; by shifting left and using stream processing, Netflix drastically reduced this latency. The result was real-time insights that powered a more personalized homepage with optimized content recommendations. By processing data as soon as it was generated, Netflix could adapt its homepage layout based on real-time user interactions, presenting viewers with the content they were most likely to watch within the same session and enhancing overall engagement and satisfaction.

By acting on this data immediately, they achieved significant business wins. This included reduced storage costs (since raw data doesn’t need to be stored for batch processing), improved infrastructure efficiency, and faster, more effective decision-making that directly impacted user satisfaction.

Limitations of Traditional Data Architecture

Netflix's approach also highlighted the limitations of the traditional 'multi-hop' medallion architecture for data lakes and lakehouses. In the medallion architecture, raw data is classified as 'bronze' and then progressively refined through computationally expensive stages to reach 'silver' and ultimately 'gold' quality. However, the multi-hop implementation of the medallion architecture is time consuming and resource intensive, particularly between the bronze and silver layers. Shifting left allows teams to produce ‘silver’ and ‘gold’ quality data closer to the source, cutting down on the iterative refinement cycles typical of batch processes. By working on data before it even reaches the data lake, Netflix ensured higher quality and consistency from the start, reducing both latency and operational complexity, and ultimately improving the overall reliability of their data.

Shifting Left Isn’t Just for Tech Giants

But what if you're not Netflix? If you're a traditional company—say, a car rental company or an indie gaming studio—you might not have the technical sophistication or resources of a tech giant. Fortunately, the technical ecosystem for streaming data and event-driven architecture has matured significantly, making it easier to adopt similar approaches without needing a Netflix-sized engineering team. Managed versions of tools like Apache Kafka and Flink—such as those offered by Confluent, Ververica, and AWS—are now widely accessible. These managed platforms simplify implementation for teams without deep technical expertise, making it easier to leverage real-time data processing capabilities, even for organizations that aren't traditionally software-centric. Managed infrastructure lowers the barriers to adoption, and higher-level APIs simplify complex workflows. Removing this operational burden allows your teams to focus on deriving value from data.

Imagine monitoring player behavior in games in real time, optimizing car maintenance schedules based on live data, or delivering personalized customer experiences without the delays associated with traditional data processing. Thanks to the maturity and accessibility of these managed streaming tools, even organizations outside of the bleeding-edge tech world can now achieve these benefits.

Shifting Data Processing to the Operational Plane

To fully understand the value of shifting left, it's important to understand the difference between what data practitioners call the 'operational plane' and the 'analytics plane.' The operational plane refers to the systems that are core to your business operations—where data is generated in real time, such as during transactions or while tracking user interactions. The analytics plane, on the other hand, is where data is ultimately stored and analyzed to derive insights and inform decision-making. Traditionally, data would be processed in batch jobs and only reach the analytics plane after introducing delays through scheduling. This is perfectly fine for historical reporting but also results in outdated insights and missed opportunities.

Shifting left means moving much of this data processing to the operational plane, ensuring that initial transformations and validation happen right when the data is generated. This approach reduces latency and ensures a cleaner, more consistent flow of information from the start, which leads to better decision-making and more responsive business operations. By reducing the dependency on large, monolithic batch jobs, you not only improve agility but also gain the ability to react to issues as they occur, significantly enhancing operational resilience.

Replicating 'Bronze' and 'Silver' in the Operational Plane

In a shift-left architecture, you can replicate the traditional bronze and silver stages of the medallion architecture using Kafka topics and real-time processing. Tools like Apache Flink, or managed services like AWS Kinesis and Quix, enable continuous data cleaning and refinement, ensuring the data reaches the analytics plane already in 'silver' state, ready to be transformed into gold by the analytics team.
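To make this concrete, here is a minimal, illustrative sketch of the per-record refinement that turns 'bronze' events into 'silver' ones as they stream past. The field names, units, and deduplication logic are assumptions for the example; in a real pipeline this function would consume from a bronze Kafka topic and produce to a silver one (for instance via a Quix Streams or Flink job) rather than iterate over a list.

```python
from typing import Iterable, Iterator


def refine(raw_events: Iterable[dict]) -> Iterator[dict]:
    """Turn 'bronze' records into 'silver' ones in a single streaming pass:
    drop malformed events, deduplicate, and standardize units."""
    seen_ids = set()  # in production this would be windowed state, not unbounded
    for event in raw_events:
        event_id = event.get("id")
        if event_id is None or event_id in seen_ids:
            continue  # skip malformed or duplicate records
        seen_ids.add(event_id)
        yield {
            "id": event_id,
            "user_id": event.get("user_id", "unknown"),
            # standardize units: milliseconds -> seconds
            "watch_time_s": event.get("watch_time_ms", 0) / 1000,
        }


# Stand-in for a bronze Kafka topic; note the duplicate and malformed records.
bronze = [
    {"id": 1, "user_id": "u1", "watch_time_ms": 90000},
    {"id": 1, "user_id": "u1", "watch_time_ms": 90000},  # duplicate
    {"user_id": "u2"},                                   # malformed: no id
]
silver = list(refine(bronze))
```

The point is that the refinement work the medallion architecture would do in scheduled bronze-to-silver batch jobs happens continuously here, record by record, before the data ever lands in the lakehouse.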

This connection to the analytics plane is made even simpler by open table formats such as Apache Iceberg. Netflix initially developed Apache Iceberg in 2018 to address the internal challenges they faced with managing large-scale datasets in data lakes, but they soon open-sourced it. Today, it's widely used for managing datasets across cloud environments, particularly for large-scale analytics and reconciling historical data with real-time data. Because it’s a vendor-neutral open format, Apache Iceberg has helped to foster increased collaboration between teams. It works with multiple query engines and data processing frameworks. This allows different teams to all work with the same dataset while using their preferred tools (such as Apache Spark or Presto).

Recently, Quix has also released an Iceberg connector that allows you to ingest, pre-process and load high volumes of data into your cloud lakehouse with minimal effort. This simplifies the process of materializing Kafka streams into Iceberg tables, ensuring teams always have access to clean, up-to-date data.

New Possibilities with Shifting Left

This shift-left approach to data processing opens up a wide range of new possibilities, offering a chance to rethink how data is handled at every stage. Instead of waiting for overnight batch jobs, machine learning models can be trained on the freshest data, while real-time auditing, faster error correction, and personalized marketing become achievable. The immediate availability of high-quality data allows organizations to be more agile, making better decisions as soon as a customer's needs change. Imagine a car manufacturer monitoring sensor data from vehicles in real time to predict battery lifetime, or a games company adjusting a game based on live feedback from players. Shifting left gives you the power to be both proactive and reactive, transforming the way your business makes decisions.

The Foundation of Shifting Left: Treating Data as a Product

To fully embrace shifting left, teams need to treat data as a first-class product, applying the same discipline to data as they do to software development. This includes practices like versioning, rigorous testing, governance, and continuous monitoring to ensure data quality and reliability. Originally, the concept of shifting left was rooted in this organizational principle—ensuring data is treated with the same rigor and structure as code. The technical processes we discussed, such as validating data at the source, emerged from this foundation.

Data contracts—formal agreements that define the structure, behavior, and governance of data—are also essential. They keep data producers and consumers aligned, preventing downstream issues. Schema registries play a crucial role here: they help enforce these contracts by ensuring data consistency across systems, which in turn makes it more efficient to produce gold-standard data. Aligning data producers and consumers through these contracts fosters better collaboration, reduces friction, and ultimately leads to more reliable and actionable data. Treating data as a first-class product is the foundation of successful shift-left practices, enabling proactive decision-making and efficient data handling.
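As a hedged illustration of what enforcing a data contract on the producer side can look like, here is a toy check run before a record is published. The contract fields and rules are invented for the example; a real setup would express the contract as an Avro or JSON Schema managed by a schema registry rather than a Python dict.

```python
# A minimal, illustrative data contract: field names, types, and whether
# nulls are allowed. These specific fields are hypothetical.
CONTRACT = {
    "order_id": {"type": int, "nullable": False},
    "amount": {"type": float, "nullable": False},
    "coupon": {"type": str, "nullable": True},
}


def conforms(record: dict, contract: dict = CONTRACT) -> bool:
    """Check a producer's record against the agreed contract before it is
    published, so consumers never see contract-breaking data."""
    for field, rule in contract.items():
        if field not in record:
            return False  # required field missing entirely
        value = record[field]
        if value is None:
            if not rule["nullable"]:
                return False  # null where the contract forbids it
        elif not isinstance(value, rule["type"]):
            return False  # wrong type
    # Reject unexpected fields too: schema evolution should go through
    # the contract, not around it.
    return set(record) <= set(contract)
```

Rejecting (or quarantining) non-conforming records at the producer is exactly the shift-left move: the contract is enforced where the data is created, not discovered broken in the warehouse weeks later.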

Conclusion

Shifting left enables you to realize new, innovative use cases by moving data processing into the operational plane. Imagine being able to detect potential issues before they escalate, offering personalized recommendations as customers interact with your service, or optimizing resource allocation in real time. These capabilities help you become more agile and responsive to your customers' needs.

However, this shift requires collaboration across teams. Software engineers, data engineers, and data scientists must work closely together, breaking down the silos that traditionally separate them. This cultural shift is essential for fostering an environment where data is treated with the same importance as code (and it's a shift that we’ll cover in more detail in the next chapter). When data is prioritized, teams can innovate faster, respond to challenges more effectively, and ultimately deliver better products and services. 

Shifting left is about making your pipeline architecture more efficient and reducing waste by acting on data closer to its source. This enables exciting new opportunities that can drive value for your business. It empowers every part of your organization to work with real-time, accurate information, making you more responsive and better equipped to make informed decisions. Are you ready to shift left and start realizing the full potential of your data?

