Bridging the gap between data scientists and engineers in machine learning workflows

Moving code from prototype to production can be tricky—especially for data scientists. There are many challenges in deploying code that needs to calculate features for ML models in real-time. I look at potential solutions to ease the friction.

Mike Rosam

CEO & Co-Founder

Illustration of two people in the desert.

Introduction

In a companion article, The drawbacks of ksqlDB in machine learning workflows, I talked about the impedance gap in machine learning workflows—the term “impedance” here, refers to the technical barriers in getting an ML model from prototype to production. And ksqlDB was expected to help bridge this gap. Yet, these expectations were not met because ksqlDB suffers from several limitations that hinder its effectiveness for data scientists and ML engineers. I pointed out that ksqlDB’s underlying technology, Kafka Streams, also has performance limitations when operating at scale which is why Apache Kafka is often paired with Apache Flink instead. Indeed Confluent themselves recently announced their own managed Flink offering, which is likely intended as a more powerful complement to their managed Kafka solution.

Unfortunately, Kafka and Flink don’t solve the impedance gap either, since this dynamic duo is also firmly anchored in the Java ecosystem. Python is the lingua franca in the data science world so it makes sense to offer them a pure Python solution (which ksqlDB isn’t). Of course, we need more than this to ease the friction that causes this gap. So what else to we need? The short answer is more tooling to help data scientists contribute to parts of the machine learning workflow.

To properly understand what I’m talking about, let’s take a closer look at the details of this workflow.

A good reference is the following diagram from ml-ops.org (by INNOQ), which gives an overview of the end-to-end machine learning workflow:

A diagram showing the process of machine learning engineering.

Note that this a very batch-oriented view of the ML workflow which is still fairly typical. As Fennel.ai explain in their excellent piece “Challenges of Building Realtime Machine Learning Pipelines” most companies start with an all-batch ML pipeline before incorporating more real-time processes into their workflow.

The data pipeline

A diagram showing the process of data pipeline.

This pipeline includes all the steps leading up to the creation of the machine learning model, such as data collection, ingestion, preprocessing, and exploration. Data engineers usually help data scientists ingest the data, but in smaller companies, data scientists are often left to figure this out on their own. After the data has been ingested, data scientists work on preprocessing it and conducting Exploratory Data Analysis (EDA). Then, they work on feature engineering tasks such as identifying features that could improve the model.

The friction of code translation

For businesses where real-time data is crucial (think food delivery, ride sharing, stock trading, and so on). Apache Kafka is often used to ingest real-time data into a database so that it can be queried and passed to an ML model for training. This means that data from Kafka needs to be ingested into a format that’s easier for data scientists to work with. Raw Kafka messages are typically serialized in formats like Avro, JSON, or Protobuf, but data scientists are used to working with tabular formats such as CSVs and pandas DataFrames. That’s why some kind of stream processing tool such as Flink or Spark is used to ingest the data for further processing.

Tools such as ksqlDB attempted to reduce the reliance on connector code and allow data scientists to query a Kafka topic directly and create intermediate transformations with a simple SQL syntax. However, ksqlDB simply didn’t have the flexibility of a framework like pandas (which couldn’t be run on a ksqlDB cluster). Instead, custom transformations had to be implemented in Java as UDFs or standalone Kafka Streams applications.

A similar problem exists for Flink and Spark. Though they have Python APIs, they still use the JVM for their underlying transformations. You cannot use pandas transformations on a Flink cluster without implementing some kind of UDF. Flink has its own transformations but translation work is required to convert a pandas transformation to the Flink or Spark syntax. This translation process can slow down the overall data analysis and model development process, leading to fewer insights and production deployments.

Possible Solutions

A natural question to ask would be “why do data scientists need to translate their pandas transformations into another syntax?” or “why can’t you run pandas transformations in production?”. For real-time streaming use cases this isn't practical because pandas isn't designed for high-throughput, distributed data processing. You would quickly run into performance bottlenecks. That’s why most companies use scalable, fault-tolerant stream processing frameworks like Flink or Kafka Streams

To remove the friction of code translation, there needs to be a stream processing framework with a similar capabilities to Flink or Spark, but one that will let you use pandas or other data science libraries in a scalable way.

The scalability problem has already been solved for batch processing. For example, Modin is an open-source library that aims to speed up and scale pandas operations on large datasets by using distributed computing with Dask or Ray as the backend. This in turn, eliminates the need to translate prototype code into Spark syntax when processing large datasets.

All data scientists need then, is a similar solution for applying feature transformations to high-volume data streams in a scalable way. There are open source libraries like Streamz and Faust which have gone some way to solving this problem, but the long-term support and stability of these projects is unclear since they’re now maintained by volunteers. They also don’t reliably solve all of the complex problems that come with parallel processing and horizontal scaling. This is why, for the time being, Kafka is still usually married to Flink.

The machine learning pipeline

A diagram showing the process of machine learning pipeline.

Here, data scientists focus on feature engineering, feature selection, model training, evaluation, and tuning. The primary goal of model development is to create a robust and accurate model that can be deployed in a production environment.

It starts with model training, where machine learning algorithms are applied to the training data, ‌often in a Notebook UI environment (such as Jupyter, Google Colab, or Amazon Sagemaker) using Python frameworks like TensorFlow, ScikitLearn, PyTorch and so on. Following this, the model is assessed and tested to verify its alignment with the initial original business goals. Finally, the model usually needs to be converted into a format suitable for serving (such as Pickle, ONYX or PMML) and integrated into a service.

The friction of performance vs accuracy

Here, data scientists focus on selecting the appropriate algorithms and feature engineering to build accurate and useful models, while engineers prioritize the scalability, performance, and maintainability of the models in production environments. These priorities often come into conflict when one side has to make a tradeoff.

For example, data scientists might choose an algorithm like Support Vector Machines (SVM) because it provides high accuracy and handles high-dimensional data. Yet, engineers might prefer a different algorithm or approach (such as a simpler linear model or a decision tree) which may be more efficient and easier to scale, but might not offer the same level of accuracy as the SVM chosen by data scientists. Usually, the performance consideration wins because an application needs to meet the requirements of end users who expect real-time or near real-time feedback.

On top of that, there are many transformations that are straightforward to perform on static data but much harder in a real-time context. This makes it challenging to adapt the transformation for a production environment. For example, in a batch context, it’s straightforward to calculate the average purchase price for a customer over the past four weeks. It’s more challenging if you want to calculate the average purchase price while incorporating the data from the current session in real time. You need to access the historical data from an online feature store and combine it with the data from the session in real time which requires significant refactoring.

Engineers also need to integrate the model into the existing production system, which could be a web application, a mobile app, or an internal tool. To do this, they might use different languages, frameworks, or tools than those used by data scientists. For instance, they could be working with Java, .NET, or Node.js for the backend infrastructure.

To avoid this complexity, models are often deployed as standalone services using platforms such as Amazon Sagemaker or Google’s Vertex AI. In this setup, an application uses an API to get a prediction from the model, following a request and response pattern. Yet this approach has some drawbacks such as increased latency, limited scalability and the potential for bottlenecks.

Possible Solutions

When building their prototypes, data scientists need to work in an environment that mirrors the production environment. That way, they can actually see how their models and feature transformations behave on real time data. There also needs to be a back-end platform that makes it easier to integrate an ML model into an event-driven application without requiring the model to run behind a REST API.

Confluent envisaged Apache Kafka as a tool that would help simplify the deployment and monitoring of ML models by providing a consistent interface for serving predictions, tracking model performance, and updating models as needed. In this decoupled architecture, a service running an ML model could subscribe to a topic, run its predictions, and write those predictions to another topic. Dependent services no longer have to wait for an API request to be completed before they can continue.

However, given the current tools, creating a service and connecting it to Apache Kafka is beyond the capabilities of most data scientists, even for testing. There needs to be more tooling that simplifies the process of deploying a service that ingests data from Kafka. This is where having a more integrated platform would come in handy. For an example of well integrated products, take Google Cloud. Their Cloud Functions product is tightly integrated with Google Pub/Sub. Another example is AWS where AWS Lambda functions can be connected to Kinesis streams.

Examples of user interfaces that integrate tightly with message brokers.

Their user interface makes it easy even for novice users to set up a serverless function and have it react to data from a Pub/Sub topic. There needs to be a system with a similar level of usability for Kafka, where the services and the broker are hosted in one environment and can be tightly integrated.

Another promising challenger is Bytewax, which primarily a deployment and management tool for Python functions (though they are continuously adding stream processing features too). Bytewax focuses on providing a platform for deploying, scaling, and monitoring Python functions as services, making it easier to create and manage data processing pipelines. However, you still have to install it and configure it in the cloud environment of your choice which requires some Kubernetes and Docker knowledge. So not exactly a self-serve solution for data scientists.

The software code pipeline

A diagram showing the process of software code pipeline.

Once the model has been finalized, it must be integrated into the application code and deployed. The same process applies to the features on which it was trained. ML models often need the most up-to-date values for a feature (such as traffic conditions or weather). These must be continuously calculated in real-time by dedicated processes which provide new features to the model. Following that, the model's performance is continuously monitored and logged so that further improvements can be identified.

The friction of code integration

Data-driven pipelines are applications just like the rest of a back-end architecture (like the front-end). Software engineers have an established set of best practices for any one who contributes to an application needs to adhere to these best practices. But now that data teams are contributing directly to software applications (rather than back-office systems), there’s a culture clash.

Data scientists especially are often unfamiliar with software engineering best practices or the specific requirements of the application, leading to difficulties when integrating their models into the system. For example, data scientists and engineers may not use version control systems or if they do, it’s often not the same version control system or workflow as the engineers, which leads to difficulties when tracking changes and collaborating on the codebase. Lastly data scientists may not be familiar with deployment techniques or monitoring tools used by engineers, which can lead to challenges when deploying and maintaining models in production environments. Conversely, software engineers might not be familiar with the specific requirements of machine learning models, which can lead to issues during integration and deployment.

Possible Solutions

Again, the answer is tightly linked to the solutions provided in the previous two sections.

If prototype code did not require translation into another language before it could be integrated into the application, then data scientists and engineers could collaborate on the same code repository.
If the logic for consuming data from a Kafka topic was easy enough for data scientists to code themselves, they could test how their transformations and machine learning models performed on production workloads.
If ‌deployment tools were made more accessible, they could also test how their code runs in a test environment that more closely resembles the production environment.

But data scientists still need to be more integrated into the software development workflow. The challenge is that these practices are often alien to those who don’t write software day in and day out. Nevertheless it is possible to learn them.

One parallel example is in the technical writing industry. As the docs as code approach became more popular, technical writers found themselves collaborating with software engineers on the same repositories. This forced writers to learn the principles of git, continuous integration and deployment. This practice was aided by a wide range of tools that made it easier to work with Git, such as Visual Studio Code, Github Desktop, and Sourcetree. There are even tools that let you save a file without you being aware that you’re making a Git commit (such as Sparkleshare), not that this is a great practice for versioning code, but you get the idea. It’s now arguably a lot easier than learn software management principles as a non-developer.

However, there is still large amount fragmentation in toolchain. This problem is less acute for software engineers because, in a sense, the terminal window is their unified interface. Version control, builds and deployments are all managed from the command line. Non-developers, such as data scientists don’t have the benefit of a unified user interface to commit, deploy and monitor their specific code contributions.

All of these tools are especially important when trying to understand multi-step feature transformations that a ML model depends on in production. Confluent recognized this and released Stream Designer which was designed to complement kqsqlDB, creating a more usable layer of abstraction for building pipelines on top of Apache Kafka.

The catch is that it only works with ksqlDB queries—which have their own limitations as we've discussed. It's likely however, that you'll soon be able to use Stream Designer with Confluent's new managed Flink solution.

A fragmented patchwork of solutions

At this point, it’s useful to do a recap of all the solutions we’ve discussed a speculate how they could be used together. The ultimate goal being ‌to data scientists more autonomy on optimizing their algorithms and models for production

Data ingestion — suffers from code translation friction.

Solution: Let data scientists ‌code feature transformations in pure Python (i.e. not pySpark or PyFlink) using libraries that work just as well for prototyping as they do in production. This would require less code tuning, and it would be easier for data scientists to learn how to tune the code themselves.
Tools: Current pure Python stream processing libraries are Faust, Streamz, maybe Bytewax.

Machine learning pipeline — suffers from performance friction.

Solution: data scientists should be able to simulate production environments using tools that abstract away some of the complexity involved in deploying applications. This would enable them to test the performance of their own algorithms and models before handing them over to engineers.
Tools: Point-and-click user interfaces that allow you to deploy ‌serverless functions and easily connect them to message brokers and other services without having to worry about IAM roles or Docker configurations. Partial solutions include Google Cloud Functions coupled with Google Pub/Sub, AWS Lambda with Amazon Kinesis, or Confluent ksqlDB with Stream Designer. Another option is Bytewax hosted in any cloud environment (although it doesn’t appear to have its own point-and-click UI). However, all of these are still very complex solutions.

Software pipeline — suffers from integration friction.

Solution: help data scientists learn standard software development practices by allowing them to contribute to the application codebase. Consolidate the application codebase by using the same language and libraries for prototype and production code. Give data scientists simpler tools to work with Git or other version control systems.
Tools: Simplified integrated development environments that allow data scientists to write, run, manage and deploy code from the same place. Visual Studio Code is a popular option, but requires you to set up a local environment first. GitHub codespaces is built on top of Visual Studio Code enables you to write, run, test, debug, and commit code without the need to configure your local development environment or install dependencies on your local machine.

Thus, there are tools that can help solve ‌the friction at each stage of the machine learning workflow, but they don’t solve the overall problem of fragmentation in the data stack.

Designing a more unified solution

What would benefit data scientists and engineers alike is a more unified platform that includes all the features that I’ve outlined above, such as:

A pure Python stream processing library that has feature parity with Apache Flink (required for scalability and fault tolerance).
A user-friendly cloud IDE that lets data scientists:
— write and test stream processing and ML projects without having to set up a local environment for each one.
— easily commit changes to a Git repo and roll back to previous versions.
— connect to a Kafka broker and choose which topics they want to use as input or output.
— deploy projects as services without being confronted with too many configuration options.
A managed cloud environment with visual tools that let data scientists and engineers:
— visualize and edit data pipelines and inspect the data flowing through each connection.
— allocate more resources to specific processes if necessary.
— run ML models that are integrated into pipeline processes rather than behind a REST API in a separate system.

The closest system to fulfilling this vision so far is Confluent’s ksqlDB. It has an online IDE for SQL queries, a visual tool for designing pipelines, and an easy way to deploy feature transformations. You can even integrate an ML model into a UDF and deploy it to a ksqlDB cluster. It fails however, to ease the friction of code translation (transformations often need porting into Java) and the friction of performance (it doesn’t have all the scalability features of Apache Flink).

Given the lack of a feasible unified solution, a few colleagues and I decided to build one ourselves. Back when we were still working at McLaren, we needed to create physics models that would predict Formula One race outcomes based on vehicle sensor data. Rather than using a cloud environment, we were hosting and managing Apache Kafka ourselves. This made the development process even more complicated.

Despite this, we found that we could go into production faster when we were able to stay within the Python ecosystem and simplify the number of disparate technologies we used to train, deploy and run ML models in production. This ultimately led to us creating our own client library for our C# and Python developers. We also built our own cloud enviroment with it’s own UI which went a long way to giving us the unified experience that we wanted.

It was hard work, but more that worth it. It showed me how much progress you can make when you dismantle the impedance gap between data teams and engineering teams. If this can happen on a larger scale, it could make the industry for AI-driven software a lot more productive. By working in a more unified manner, data scientists and software engineers can combine their specific skill sets to create more sophisticated data-driven applications. Certainly, many points of friction can’t be solved with technology alone. There’s a whole other organizational discussion to be had. But I believe that new tools can be the lubricant that allows teams to mesh together like well oiled gears. If technical writers and engineers can do it, data scientists certainly can too!

For ‌more details on our vision, here’s some further reading:

Last updated:

Apr 11, 2024

Share this article: