
5 Apr, 2022 | Strategy

Dark data: how companies are capturing data and failing to use it

“Dark data” is the data being collected and stored without a plan for future use. It’s often dumped into data lakes with little to no thought for how it could actually be used in the future — that’s someone else’s problem.

Words by
Mike Rosam, CEO & Co-Founder

Companies collect an enormous amount of data. Some of it is collected for specific purposes — analytics, compliance, and performance measurement. But a surprising amount is collected without a specific future use in mind.

This data is collected on the premise that it has value. Someone might need it, so it’s dumped into data lakes without context, creating a false sense of security that the data is available. In reality, without context, that data will be hard, if not impossible, to use.

I came face-to-face with this phenomenon when I worked as a mechanical engineer in the auto industry. Dealerships collect a ton of data when cars come in for repairs. When a particular part starts showing up frequently, they turn to the mechanical engineers to figure out what’s causing the problem.

Of course, we’d want to see the data, but there was no easy way for engineers to access it from the dealerships or to use it once we had the files. After all, we were mechanical engineers without the software or skills to work with raw data files.

This type of disconnect between the data a company collects and the people who want to use it limits the value of that data and results in a lot of it going unused, or “dark.” In this post, I’ll share some tips on improving data collection to make it more accessible to the people who need it.

More than half of an organization’s data is “dark data”

Collecting data and failing to use it is a common occurrence. Gartner coined the term “dark data” to describe information that organizations collect, process and store, but fail to use for other purposes.

Dark data makes up an estimated 55-75% of an organization’s data, according to the State of Dark Data report. In many cases, people don’t even know what data is being collected. This can include data collected for compliance purposes, data generated by digital processes and a growing amount of data generated by IoT sensors and devices.
What is the potential value of data that is currently underused?

Just as the diagnostic data dealerships collect would have been useful to my mechanical engineering team, much of this dark data has potential value. The healthcare industry provides some good examples of how data that’s already being collected can be put to greater use.

Flatiron Health took on the challenge of analyzing data that was collected and stored in disconnected systems, such as electronic health records, medical devices, laboratory tests, and insurance claims. Using machine learning, the company turns that data into insights that have helped expand cancer treatments, improve clinical trials and identify racial disparities in treatment.

This shows how historical data can yield valuable insights, but it’s just the tip of the iceberg.

To maximize the value of your data, it needs to be accessible to people across the organization. People who are experts in their respective domains understand how data can help them work more efficiently, make smarter business decisions, and develop new products and features. Enabling them to access data on their own accelerates their ability to solve problems and develop new products.

Recognizing that doctors needed more access to data to make the best decisions quickly, the University of Chicago Medicine (UCM) worked with TIBCO to connect data silos and implement streaming analytics. Not only did this help get information to people faster, but it also enabled them to develop a system that alerts staff in real time when a patient is at high risk of cardiac arrest. The system, which uses streaming analytics and a predictive algorithm developed by one of UCM’s researchers, reduced the number of cardiac arrests in the hospital by an estimated 15-20%.

That’s the power of making the data you’re already collecting accessible to people with the expertise to use it. Whether it’s a mechanical engineer trying to improve parts performance, an operations manager trying to increase efficiency or a doctor trying to save patients, access to relevant data helps people build better solutions, faster.
How can companies make data more accessible so it doesn’t go dark?
The key to preventing data from going dark is to consider how people will find and make sense of the data you’re collecting, instead of just dumping it into a data lake. This can be done by improving data ingestion, developing data skills across the organization and increasing access to data.

Better manage the volume of data being collected

Data is being generated at an unprecedented rate. Traditional batch processing simply can’t keep up with our need to process and use data faster. The problem only gets worse with time, as more data streams in ever faster, only to lose value sitting in a data lake or warehouse.

The solution is to go upstream and clean and process your data before it lands in storage. Known as stream processing, this method of data ingestion enables companies to process data before it flows into a warehouse, so it’s analytics-ready and can quickly and efficiently be shared with any downstream system.

Gartner estimates that one-third of companies use stream processing, also called stream data integration, to ingest and store data for later use or analysis. Simply cleaning data and appending business context before storing it can make it easier for end users to work with.
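
To make this concrete, here’s a minimal sketch in plain Python of the kind of in-stream cleaning and enrichment described above. The event fields, the device lookup table and the helper function are invented for illustration; a production pipeline would consume records from a broker such as Kafka and forward them to a warehouse rather than printing them.

```python
import json
from datetime import datetime, timezone

# Hypothetical lookup table that supplies business context for each device.
DEVICE_METADATA = {
    "sensor-42": {"site": "Plant A", "line": "Assembly 3"},
}

def clean_and_enrich(raw_event: str):
    """Validate one raw event and append business context before storage.

    Returns the cleaned record, or None if it should be dropped.
    """
    try:
        event = json.loads(raw_event)
    except json.JSONDecodeError:
        return None  # drop malformed records instead of storing them dark

    if "device_id" not in event or "temperature_c" not in event:
        return None  # drop records missing required fields

    # Normalize types and attach the context analysts would otherwise lack.
    event["temperature_c"] = float(event["temperature_c"])
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    event.update(DEVICE_METADATA.get(event["device_id"], {}))
    return event

# Simulated stream; in production these records would arrive continuously.
raw_stream = [
    '{"device_id": "sensor-42", "temperature_c": "71.3"}',
    "not valid json",
]

for record in raw_stream:
    cleaned = clean_and_enrich(record)
    if cleaned is not None:
        print(cleaned)  # in practice: write to the warehouse or downstream topic
```

Because the validation and enrichment happen before the data lands in storage, every record in the warehouse arrives analytics-ready, with its business context already attached.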

See how Quix makes the benefits of stream processing available to teams of all sizes.

Increase employees’ ability to work with data

In the State of Dark Data report, 76% of respondents said that training more of their current employees in data science and analytics would help solve their data challenges, and 75% said that software enabling less technical employees to work with data without a data expert would help.

Levi Strauss & Co. has already helped 100-plus employees gain data skills through machine learning boot camps in which employees from a variety of backgrounds learn the Python programming language, coding, statistics and more. The employees then apply these new skills to their roles in product development, planning, marketing, HR, commercial operations, finance, sales and direct-to-consumer teams.

One notable thing about these boot camps is that participants learn Python, a popular coding language that is relatively easy to pick up. That accessibility is something we recognized when we built Quix, and it’s one of the features that differentiates Quix from other data solutions.

To make stream processing and the data it transforms more accessible, we first built a reliable streaming infrastructure so companies don’t have to set up and manage these complex technologies on their own. Then we made that infrastructure usable by anyone with a basic knowledge of Python, lowering the skill barrier to accessing and using data, even real-time data.
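
For a sense of how low that barrier can be, here’s the sort of transformation a domain expert with only basic Python might apply to a live stream: a rolling average that smooths noisy sensor readings as they arrive. This is a generic, standalone sketch, not Quix’s actual API.

```python
from collections import deque

def make_rolling_average(window_size: int):
    """Return a stateful function that smooths a stream of readings."""
    window = deque(maxlen=window_size)  # keeps only the last N values

    def update(value: float) -> float:
        window.append(value)
        return sum(window) / len(window)

    return update

smooth = make_rolling_average(window_size=3)
for reading in [10.0, 12.0, 50.0, 11.0]:  # raw values with a noise spike
    print(f"raw={reading:>5} smoothed={smooth(reading):.1f}")
```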

If you know Python, you can try Quix now.

Make data more accessible

Companies collect a rich set of data from business processes, IoT and other sources, but it often sits behind an IT or data team. Making this data accessible and usable to people outside those teams often comes down to trust.

IT and data teams need reassurance that less technical users won’t break delicate production environments. End users need to know that the data they’re accessing is accurate and complete. If the data is hard to understand or unreliable, people are likely to shy away from using it.

This is often the case in companies where data solutions have grown organically. Those solutions start to look like a plate of spaghetti, making it hard to tell where data is coming from or how it’s been processed.

Quix avoids this tangled mess by providing a single layer in which business logic resides, bringing more transparency and consistency to how data is collected, processed and stored.

Our stream processing solution provides the resilience and reliability companies need to ensure that every piece of data is captured and processed every time, which builds trust with end users. What’s more, we provide sandbox environments that enable less technical employees to work with data while protecting the production environment.

With Quix you get well-ordered data flowing into a warehouse, the ability to work with data in Python and secure sandbox environments that enable more people to self-serve information for reporting and machine learning. It’s our way of increasing access to data and chipping away at all that data currently going dark.


Talk to a technical expert about your use case if you’re considering using stream processing in your business.

Book a demo

Mike Rosam is Co-Founder and CEO at Quix, where he works at the intersection of business and technology to pioneer the world's first streaming data development platform. He was previously Head of Innovation at McLaren Applied, where he led the data analytics product line. Mike has a degree in Mechanical Engineering and an MBA from Imperial College London.


