Dark data: how companies are capturing data and failing to use it

“Dark data” is the data being collected and stored without a plan for future use. It’s often dumped into data lakes with little to no thought for how it could actually be used in the future — that’s someone else’s problem.

Mike Rosam

CEO & Co-Founder


Companies collect an enormous amount of data. Some of it is collected for specific purposes, such as analytics, compliance and performance measurement. But a surprising amount is collected without a specific future use in mind.

This data is collected on the premise that it has value. Someone might need it, so it’s dumped into data lakes without context, creating a false sense of security that the data is available. In reality, without context, that data will be hard, if not impossible, to use.

I came face-to-face with this phenomenon when I worked as a mechanical engineer in the auto industry. Dealerships collect a ton of data when cars come in for repairs. When a particular part starts showing up for repairs frequently, they turn to the mechanical engineers to figure out what’s causing the problem.

Of course, we’d want to see the data, but there was no easy way for engineers to access the data from the dealerships or to use it once we had the data files. After all, we were mechanical engineers without the software or skills to work with raw data files.

This type of disconnect between the data a company collects and the people who want to use it limits the value of data and results in a lot of data going unused or “dark.” In this post, I’ll share some tips on improving data collection to make it more accessible to the people who need it.

More than half of an organization’s data is “dark data”

Collecting data and failing to use it is a common occurrence. Gartner coined the term “dark data” to describe information that organizations collect, process and store, but fail to use for other purposes.

Dark data makes up an estimated 55-75% of an organization’s data, according to the State of Dark Data report. In many cases, people don’t even know what data is being collected. This can include data collected for compliance purposes, data generated by digital processes and a growing amount of data generated by IoT sensors and devices.

What is the potential value of data that is currently underused?

Just as the diagnostic data dealerships collect would have been useful to my mechanical engineering team, much of this dark data has potential value. The healthcare industry provides some good examples of how data that’s already being collected can be put to greater use.

Flatiron Health took on the challenge of analyzing data that was collected and stored in disconnected systems, including electronic health records, medical devices, laboratory testing and insurance claims. Using machine learning, it turns that data into insights that have helped expand cancer treatments, improve clinical trials and identify racial disparities in treatment.

This shows how historical data can yield valuable insights, but it’s just the tip of the iceberg.

To maximize the value of your data, it needs to be accessible to people across the organization. People who are experts in their respective domains understand how data can help them work more efficiently, make smarter business decisions, and develop new products and features. Enabling them to access data on their own accelerates their ability to solve problems and develop new products.

Recognizing that doctors needed more access to data to make the best decisions quickly, the University of Chicago Medicine (UCM) worked with TIBCO to connect data silos and implement streaming analytics. Not only did this help get information to people faster, but it also enabled them to develop a system that alerts staff in real time when a patient is at high risk of cardiac arrest. The system, which uses streaming analytics and a predictive algorithm developed by one of UCM’s researchers, reduced the number of cardiac arrests in the hospital by an estimated 15-20%.

That’s the power of making the data you’re already collecting accessible to people with the expertise to use it. Whether it’s a mechanical engineer trying to improve parts performance, an operations manager trying to increase efficiency or a doctor trying to save patients, access to relevant data helps people build better solutions, faster.

How can companies make data more accessible so it doesn’t go dark?

The key to preventing data from going dark is to consider how people will find and make sense of the data you’re collecting, instead of just dumping it into a data lake. This can be done by improving data ingestion, developing data skills across the organization and increasing access to data.

Better manage the volume of data being collected

Data is being generated at an unprecedented rate. Traditional batch processing simply can’t keep up with our need to process and use data faster. This problem only gets worse with time, as more data streams in faster only to lose value sitting in a data lake or warehouse.

The solution is to go upstream to clean and process your data before it lands in storage. Known as stream processing, this method of data ingestion enables companies to process data before it flows into a warehouse so it’s analytics-ready and can quickly and efficiently be shared with any downstream system.

Gartner estimates that one-third of companies use stream processing, also called stream data integration, to ingest and store data for later use or analysis. Simply cleaning data and appending business context before storing it can make it easier for end users to work with.
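To make "cleaning data and appending business context before storing it" a little more concrete, here is a minimal sketch in plain Python. It is not Quix code and not a full streaming pipeline. The record fields, the SENSOR_CONTEXT lookup table and the clean, enrich and process functions are all hypothetical names chosen for illustration, and a real system would read from a message broker rather than a list.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical lookup table: maps a sensor ID to the business context an
# analyst would need later. In practice this might come from an asset registry.
SENSOR_CONTEXT = {
    "s-001": {"site": "plant-a", "unit": "celsius", "asset": "coolant-pump"},
    "s-002": {"site": "plant-b", "unit": "celsius", "asset": "gearbox"},
}


def clean(record: dict) -> Optional[dict]:
    """Return a normalized copy of the record, or None if it is unusable."""
    sensor_id = str(record.get("sensor_id", "")).strip().lower()
    if not sensor_id:
        return None
    try:
        value = float(record.get("value"))
    except (TypeError, ValueError):
        # Missing or non-numeric readings are dropped in this sketch.
        return None
    return {"sensor_id": sensor_id, "value": value, "ts": record.get("ts")}


def enrich(record: dict) -> dict:
    """Append business context so the stored row is understandable on its own."""
    context = SENSOR_CONTEXT.get(record["sensor_id"], {})
    return {**record, **context, "known_sensor": bool(context)}


def process(stream: Iterable[dict]) -> Iterator[dict]:
    """Clean and enrich records one at a time, before they reach storage."""
    for raw in stream:
        cleaned = clean(raw)
        if cleaned is None:
            continue
        yield enrich(cleaned)


if __name__ == "__main__":
    incoming = [
        {"sensor_id": " S-001 ", "value": "21.5", "ts": "2024-01-01T00:00:00Z"},
        {"sensor_id": "s-002", "value": "n/a", "ts": "2024-01-01T00:00:01Z"},
    ]
    for row in process(incoming):
        print(row)  # in a real pipeline this would be written to the warehouse
```

The point of the sketch is the ordering: each record is normalized and given its context as it arrives, so whatever lands in storage already says which site, asset and unit it belongs to.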

See how Quix makes the benefits of stream processing available to teams of all sizes.

Increase employees’ ability to work with data

In the State of Dark Data report, 76% of respondents said training more current employees in data science and analytics would help solve their data challenges and 75% said that using software to enable less technical employees to work with data without a data expert would help.

Levi Strauss & Co. has already helped 100-plus employees gain data skills through machine learning bootcamps where employees with a variety of backgrounds learn the Python programming language, coding, statistics and more. The employees then apply these new skills to their roles in product development, planning, marketing, HR, commercial operations, finance, sales and direct-to-consumer teams.

One notable thing about these bootcamps is that participants learn Python, a popular coding language that is relatively easy for people to learn. This is something we recognized when we built Quix, and it is one of the features that differentiates Quix from other data solutions.

To make stream processing and the data it transforms more accessible, we first built a reliable streaming infrastructure so companies don’t have to set up and manage these complex technologies on their own. Then we made it accessible to anyone with basic knowledge of Python, reducing the skill barrier to accessing and using data, even real-time data.

If you know Python, you can try Quix now.

Make data more accessible

A rich set of data from business processes, IoT and other sources is being collected, but it sits behind an IT or data team. Making this data accessible and usable to people outside these teams often comes down to trust.

IT and data teams need reassurance that less technical users won’t break delicate production environments. End users need to know that the data they’re accessing is accurate and complete. If the data is hard to understand or unreliable, people are likely to shy away from using it.

This is often the case in companies where data solutions have grown organically. They start to look like a plate of spaghetti, making it hard to tell where data is coming from or how it’s been processed.

Quix avoids this tangled mess by providing a single layer in which business logic resides. This provides more transparency and consistency in how data is collected, processed and stored.

Our stream processing solution provides the resilience and reliability companies need to ensure that every piece of data is captured and processed every time, which builds trust with end users. What’s more, we provide sandbox environments that enable less technical employees to work with data while protecting the production environment.

With Quix you get well-ordered data flowing into a warehouse, the ability to work with data in Python and secure sandbox environments that enable more people to self-serve information for reporting and machine learning. It’s our way of increasing access to data and chipping away at all that data currently going dark.



Mike Rosam is Co-Founder and CEO at Quix, where he works at the intersection of business and technology to pioneer the world's first streaming data development platform. He was previously Head of Innovation at McLaren Applied, where he led the data analytics product line. Mike has a degree in Mechanical Engineering and an MBA from Imperial College London.