November 14, 2024
Industry insights

How to Empower Data Teams for Effective Machine Learning Projects

Learn how to boost the success rates of ML projects by empowering data teams through a shift-left approach to collaboration and governance.

If you want your machine learning (ML) projects to succeed, your software engineering and data teams can’t stay in their own bubbles. Too often, they misunderstand each other, creating a gap that drags progress down. Fixing this means rethinking how these teams work together.

The "shift-left" approach, which we covered in a previous blog article, is a smart way to do this, bringing data governance and quality control into the mix early on. 

It’s about giving your data teams the power to move faster and make real contributions.

When you start exploring the technical changes required to shift left, you’ll often notice that they also start to transform team dynamics, foster collaboration, and embed data governance throughout your organizational workflow.

Breaking Down Team Silos

Large companies love their departments and hierarchies—they bring order and stability. But in data science and software engineering, these walls do more harm than good. Clearly delineated departments help to keep teams focused and organized, yet they also can block the flow of ideas and slow down innovation. The collaboration required to shift left can lead to cross-departmental squads forming organically. This is to be encouraged rather than hindered.

Take GoCardless as an example. As Andrew Jones writes in “Driving Data Quality with Data Contracts”, GoCardless used to have a centralized data engineering team, which quickly became a bottleneck. The data engineering team was solely responsible for ingesting data into the data warehouse. Other teams had to submit requests and wait for the data engineering team to prioritize and deliver on those requests. This restricted the accessibility of data and limited the value the organization could derive from it.

So, they tried a different strategy: Decentralized Ownership. GoCardless shifted responsibility for data and related resources to the data generators—such as software engineers. They’re the ones who create services that output data as a result of performing a task, such as placing an order or processing a payment. Decentralized ownership granted engineers autonomy and control over their datasets. Each team could define their schemas, choose their tools, and manage their resources, eliminating the need to go through a central team for every change. Product teams took control of their own data and made it accessible across the organization through well-defined interfaces. These changes required these teams to speak to one another and agree on what “good” data should look like.

This shift aligns with what Adrian Brudaru (Co-Founder & CDO of dlthub) calls "Shift-Left Data Democracy" (SLDD), which means integrating data governance right from the start, instead of waiting until things go wrong.

To establish a data democracy, it’s essential to build cross-functional teams consisting of data scientists, engineers, analysts, and domain experts who work closely together. You’ll also need to reassess the rules for who can access what data. Being more flexible with data access helps teams to work together efficiently while still keeping your data secure. 

Likewise, governance frameworks need to be set up in a way that ensures compliance without stifling innovation. Promoting data literacy across the organization will empower every team member to contribute meaningfully to data-driven initiatives, ensuring that data becomes a collaborative asset rather than a restricted resource.

Brudaru also emphasizes the importance of embedded analysts—people who act as the link between data teams and the specific needs of different departments. They make sure data insights are actually useful and relevant to the business.

Automating Infrastructure Provisioning

One big problem data teams face is that they can’t work independently. They’re often stuck waiting for engineers to give them access to systems and data they need. 

Why? 

Because setting up the infrastructure is complex, and engineers don’t want to let data scientists do this themselves—for fear of having to “babysit” them later on.

That’s why more companies are automating infrastructure provisioning, so data teams can move ahead on their own.

Infrastructure automation can cover a very broad range of measures, but here are a few examples of what companies are doing to simplify the setup process:

  • Using Infrastructure as Code (IaC) tools like Pulumi or Terraform to manage infrastructure through code—such as data warehouses and ETL servers.
  • Packaging data applications with containerization (think Docker) and using Kubernetes (or tools that sit on top of Kubernetes) to deploy them across environments without a hassle.
  • Building self-service platforms where data teams can set up environments on their own, without bugging IT.
  • Setting up automated data pipelines using batch tools like Apache Airflow or Prefect, or real-time data tools like Quix or Confluent, so data moves smoothly without constant manual fixes.
  • Implementing automatic quality checks to ensure the data is always reliable.
  • Using standardized templates for common infrastructure needs so teams can get things up and running faster.
  • Adopting GitOps practices to keep everything versioned and peer-reviewed.
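
To make the "automatic quality checks" item above a little more concrete, here is a minimal sketch of a batch-level check that a pipeline could run before loading data. The names (`QualityReport`, `check_batch`) and the thresholds are hypothetical and chosen only for illustration; a real setup would more likely rely on a dedicated data quality tool.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    row_count: int
    null_rate: float        # share of rows missing at least one required field
    duplicate_rate: float   # share of rows whose key value was already seen

    def passed(self, max_null_rate: float = 0.01, max_duplicate_rate: float = 0.0) -> bool:
        # Thresholds are illustrative assumptions, not recommended values.
        return self.null_rate <= max_null_rate and self.duplicate_rate <= max_duplicate_rate


def check_batch(rows: list[dict], required: list[str], key: str) -> QualityReport:
    """Compute simple batch-level quality metrics before the data is loaded."""
    if not rows:
        return QualityReport(0, 0.0, 0.0)

    # A row counts as incomplete if any required field is absent or None.
    missing = sum(1 for r in rows if any(r.get(f) is None for f in required))

    # Count rows that repeat a key value seen earlier in the batch.
    seen = set()
    duplicates = 0
    for r in rows:
        k = r.get(key)
        if k in seen:
            duplicates += 1
        seen.add(k)

    return QualityReport(len(rows), missing / len(rows), duplicates / len(rows))


batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},   # incomplete row
    {"order_id": 2, "amount": 5.0},    # duplicate key
    {"order_id": 3, "amount": 7.5},
]
report = check_batch(batch, required=["order_id", "amount"], key="order_id")
print(report, "passed:", report.passed())
```

A pipeline step could refuse to publish the batch when `report.passed()` is false, which moves the check to the point where the data is produced rather than where it is consumed.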

You don’t need your data scientists to become infrastructure experts. The goal is to build tools that make their lives easier so they can focus on the data itself, not the technology required to manage it.

Incidentally, the Quix platform itself is one example. It uses YAML for IaC, integrates Kubernetes, and gives teams an intuitive interface to manage resources without needing to know the technical details.

Leveraging Data APIs to Foster Innovation

Robust, well-designed data APIs are an essential part of team enablement because they enable seamless data access across different teams. This reduces bottlenecks and ensures that everyone can do their analysis without the complexities of manual data retrieval or inefficient handovers.

For example, when GoCardless implemented data contracts, they treated them like APIs for data. This approach let teams easily find and use data, making it much easier to build on each other’s work.
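
As a rough illustration of what "a contract as an API for data" can look like, the sketch below describes a dataset's fields, owner, and version in code and validates records against that description. It is not GoCardless's implementation; `DataContract`, `Field`, and the `payments` example are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    dtype: type
    required: bool = True


@dataclass(frozen=True)
class DataContract:
    name: str
    version: str
    owner: str                 # the team that generates the data
    fields: tuple[Field, ...]

    def violations(self, record: dict) -> list[str]:
        """Return a list of human-readable problems; empty means the record conforms."""
        problems = []
        for f in self.fields:
            value = record.get(f.name)
            if value is None:
                if f.required:
                    problems.append(f"missing required field: {f.name}")
            elif not isinstance(value, f.dtype):
                problems.append(
                    f"{f.name}: expected {f.dtype.__name__}, got {type(value).__name__}"
                )
        return problems


# Hypothetical contract published by the team that owns the payments service.
payments_v1 = DataContract(
    name="payments",
    version="1.0.0",
    owner="payments-team",
    fields=(
        Field("payment_id", str),
        Field("amount_minor", int),
        Field("currency", str),
        Field("reference", str, required=False),
    ),
)

ok = payments_v1.violations({"payment_id": "p-1", "amount_minor": 1200, "currency": "GBP"})
bad = payments_v1.violations({"payment_id": "p-2", "amount_minor": "12.00"})
print(ok)
print(bad)
```

Because the contract is plain data with an owner and a version, it can be reviewed, versioned, and discovered in the same way as a service interface.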

In some sense, this approach is similar to Jeff Bezos’ 2002 mandate at Amazon, where every team had to expose data and functionality through service interfaces. That mandate is widely credited with laying the groundwork for Amazon Web Services (AWS). A similar strategy for data can fuel innovation and experimentation by giving teams the tools they need.

To adopt this API-oriented approach, you’ll need to rethink how you work on software projects in general.

For example, when building a new service, your engineers will need to think about how the resulting data can be useful to others and expose it through APIs from day one.

This will require a bit more discipline and standardization across your different engineering teams. 

They’ll need to:

  • Use consistent data formats and query parameters, so different teams can use the same APIs without a learning curve.
  • Make sure data APIs are scalable and handle large datasets efficiently, with features like pagination and filtering.
  • Implement governance and access controls to keep things compliant with privacy regulations.
  • Provide detailed documentation on the data—schemas, update frequencies, and limitations—so teams know what they’re working with.
  • Offer APIs at different levels of granularity, from raw data to aggregated metrics.
  • Make sure data APIs are versioned, so updates don’t break existing integrations.
  • Consider adding real-time streaming APIs to support time-sensitive data alongside the request-based ones.
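
Several of the points above (consistent query parameters, pagination, filtering, and versioning) can be sketched in a few lines. The example below is framework-free and assumes a simple offset-style cursor; `query_orders`, `Page`, and the `v1` label are hypothetical names, and a real API would sit behind an HTTP layer with access controls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    api_version: str              # lets clients detect breaking changes
    items: list[dict]
    next_cursor: Optional[int]    # None when there are no more results


def query_orders(
    rows: list[dict],
    *,
    status: Optional[str] = None,
    cursor: int = 0,
    limit: int = 2,
) -> Page:
    """Apply the filter first, then return a single page of the matching rows."""
    matching = [r for r in rows if status is None or r["status"] == status]
    window = matching[cursor:cursor + limit]
    has_more = cursor + limit < len(matching)
    return Page("v1", window, cursor + limit if has_more else None)


# Orders 1, 3, and 5 are "paid"; orders 2 and 4 are "pending".
ORDERS = [{"id": i, "status": "paid" if i % 2 else "pending"} for i in range(1, 6)]

first = query_orders(ORDERS, status="paid")
second = query_orders(ORDERS, status="paid", cursor=first.next_cursor)
print([o["id"] for o in first.items], first.next_cursor)
print([o["id"] for o in second.items], second.next_cursor)
```

Keeping the same parameter names (`status`, `cursor`, `limit`) across every data API is what removes the learning curve for consuming teams.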

Once the number of data APIs starts to grow, you should build an internal catalog of available APIs, making it easy for teams to discover new insights by combining different data sources.

By following these guidelines, you’re effectively helping to realize a “shift-left data democracy”, which builds governance into the process early on, so teams can innovate faster without worrying about compliance. By using consistent, well-structured data APIs, they can experiment freely, knowing the data is reliable and meets all standards. This speeds up development and keeps everything running smoothly.

Operationalizing Data Science, ML & AI into Products

Many companies pour resources into data science, machine learning (ML), and artificial intelligence (AI) prototypes, but those investments only pay off when you convert these experiments into real product features. 

All too often, companies never make it that far. You might have heard the sobering statistic that only about 15% of machine learning projects ever make it into production. This is because most companies have trouble operationalizing their ML models. But what does operationalize actually mean in practice?

It’s about making sure ML-based systems can handle live data, perform well over time, and fit right into your existing business systems. And it’s more than automating model deployment. Operationalization is the full process of getting ML models into production, commonly known as “MLOps”.

MLOps is about taking DevOps principles and applying them to machine learning. It covers everything from tracking experiments to registering models and setting up automated retraining pipelines. Adopting MLOps practices helps you move faster and ensure that your AI systems stay reliable once they’re in production.

MLOps includes the following sub-disciplines:

Monitoring

ML models aren’t like regular software: they can get worse over time even when the code doesn’t change. A model that worked yesterday might not be as accurate today because of concept drift (the relationship between inputs and the target changes) or data drift (the distribution of the inputs changes). That’s why you need to set up robust monitoring systems. Tools like Prometheus (for metrics) and Grafana (for visualization) can help you keep an eye on model health and alert you when performance starts to drop, so you can retrain or recalibrate before the degradation affects users.
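
One common way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature is distributed in live traffic against a training-time baseline. The sketch below is a plain-Python version under simple assumptions (equal-width bins taken from the baseline, a small floor to avoid taking the log of zero). The function name and the sample data are illustrative, and the resulting value is the kind of number you might export as a metric and alert on.

```python
import math


def population_stability_index(
    expected: list[float], actual: list[float], bins: int = 10
) -> float:
    """PSI = sum over bins of (actual_share - expected_share) * ln(actual_share / expected_share)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 if the baseline is constant

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)   # clamp values outside the baseline range
            counts[idx] += 1
        # Floor each share so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]          # feature values seen in training
shifted = [0.5 + i / 200 for i in range(100)]     # live values squeezed into the upper half

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
print(psi_same, psi_shifted)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right alert threshold depends on the feature and the model, so treat that figure as a starting point rather than a standard.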

Version Control

You probably already use version control for your code, but managing versions of datasets and models is a different beast. Tools like DVC (Data Version Control) and MLflow can help you keep track of your models and experiments. This is crucial for debugging issues or meeting compliance standards, especially when things break in production.
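
The core idea behind versioning data is to identify a dataset by its content rather than by its filename, and to store that identifier next to the model that was trained on it. The sketch below shows the idea with the standard library only. It is not how DVC or MLflow work internally, and `fingerprint` and `run_record` are hypothetical names used for illustration.

```python
import hashlib
import json


def fingerprint(rows: list[dict]) -> str:
    """Return a short, deterministic content hash for a small in-memory dataset."""
    # Sorting keys makes the hash independent of dictionary key order.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


train_v1 = [{"x": 1.0, "y": 0}, {"x": 2.0, "y": 1}]
train_v1_reordered_keys = [{"y": 0, "x": 1.0}, {"y": 1, "x": 2.0}]
train_v2 = train_v1 + [{"x": 3.0, "y": 1}]

# A record like this, stored with each trained model, ties results to exact inputs.
run_record = {
    "model": "churn-classifier",        # hypothetical model name
    "params": {"max_depth": 4},
    "dataset": fingerprint(train_v1),
}

print(run_record)
print(fingerprint(train_v1) == fingerprint(train_v1_reordered_keys))  # same content
print(fingerprint(train_v1) == fingerprint(train_v2))                 # content changed
```

When a model misbehaves in production, a record like this lets you answer "which data was this trained on?" without guessing.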

Scalability and Performance

Serving models at scale involves optimization techniques like model compression, quantization, or hardware acceleration (GPUs or TPUs) to keep latency and cost under control as traffic grows. Using serverless deployment can also help you manage costs while ensuring your system can scale as needed.
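
To show what quantization means in practice, here is a minimal sketch of symmetric 8-bit quantization on a handful of weights: each float is stored as a small integer plus one shared scale factor. It is a toy version for intuition only. Production frameworks use more elaborate schemes (per-channel scales, calibration data), and the function names here are illustrative.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric linear quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The trade-off is a small, bounded loss of precision in exchange for weights that need a quarter of the memory of 32-bit floats, which is often what makes a model cheap enough to serve at scale.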

By operationalizing these aspects effectively, you streamline the path from development to deployment. By reducing technical obstacles, you allow teams to iterate faster, enabling them to innovate and deliver impactful solutions without unnecessary delays. This means new ML models can transition smoothly from experiments to reliable, scalable product features.

Conclusion

If you want your data teams to succeed, you can’t just give them new tools and hope for the best. You have to deal with the deeper issues—who’s responsible for what, how teams work together, and the culture that drives them. 

As we’ve seen in the GoCardless case study, adopting data contracts did more than streamline processes; it changed how people worked, made roles clear, and freed the teams to actually do their jobs well.

The shift-left approach helps with this transformation because it brings governance in from the start. When you build governance early, you skip the chaos and expense of trying to fix things after everything’s already gone wrong.

If you encourage cross-functional collaboration, automate what can be automated, and give data practitioners the tools they need to become more autonomous, you’ll increase the velocity of data-driven insights. 

The real question is: How quickly can you increase this velocity before your competitors do?

This challenge isn’t about keeping up—it’s about creating an environment where your data teams can push boundaries and outpace other data teams. This is the real payoff of the shift-left approach. 
