Deep vs wide data: choosing the right schema for R&D engineering workflows
Building data pipelines for R&D requires making important architectural decisions before committing engineering resources. What factors affect those decisions?

When you're building data pipelines for R&D teams working on complex electro-mechanical systems, one of the first architectural decisions you'll face is how to structure your data schema. Should you store sensor measurements individually as they arrive (deep data) or aggregate them into denormalised records (wide data)? This choice impacts everything from storage efficiency to query performance, and it's particularly crucial in R&D environments where test rig data volumes can reach terabytes per testing campaign.
The decision between deep and wide data schemas isn't just academic. It directly affects how your R&D teams can access, analyse, and collaborate on critical testing data. Whether you're processing high-frequency measurements from battery testing rigs, drone flight telemetry, or automotive engine diagnostics, understanding these schema patterns will help you build more effective data infrastructure.
Understanding deep vs wide data in R&D contexts
Deep data schema stores each measurement as a separate record with its own timestamp, sensor identifier, and value. Imagine a battery testing rig collecting voltage, current, and temperature readings every second. In a deep schema, each measurement becomes its own row:
time      sensor_id    value
10:00:00  voltage      12.6V
10:00:00  current      2.3A
10:00:00  temperature  23.1°C
10:00:01  voltage      12.4V
10:00:01  current      2.5A
10:00:01  temperature  23.2°C
Wide data schema aggregates measurements from the same time interval into a single record with multiple fields:
time      voltage  current  temperature
10:00:00  12.6V    2.3A     23.1°C
10:00:01  12.4V    2.5A     23.2°C
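The two shapes above are mechanical transformations of each other. As a minimal sketch (using pandas, which the article doesn't mandate — any DataFrame library works the same way), the same battery readings can be pivoted from deep to wide and melted back:

```python
import pandas as pd

# Deep schema: one row per (time, sensor) measurement
deep = pd.DataFrame({
    "time": ["10:00:00", "10:00:00", "10:00:00",
             "10:00:01", "10:00:01", "10:00:01"],
    "sensor_id": ["voltage", "current", "temperature"] * 2,
    "value": [12.6, 2.3, 23.1, 12.4, 2.5, 23.2],
})

# Wide schema: pivot so each sensor becomes its own column
wide = deep.pivot(index="time", columns="sensor_id", values="value")
print(wide.loc["10:00:00", "voltage"])  # 12.6

# And back: melt the wide frame into deep records
back = wide.reset_index().melt(
    id_vars="time", var_name="sensor_id", value_name="value")
```

Note that `pivot` assumes at most one measurement per (time, sensor) pair; real pipelines with duplicates or frequency mismatches need `pivot_table` with an explicit aggregation choice, as discussed later.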
This distinction becomes particularly important when you're dealing with R&D test scenarios where different sensors operate at different frequencies, or when you need to correlate measurements across multiple systems.
Deep data: optimised for sensor-centric workflows
Deep data schema aligns naturally with how most R&D instrumentation systems work. Test equipment typically streams measurements as individual events, making deep schema the path of least resistance for data ingestion.
Easier data ingestion: Each sensor measurement arrives independently and gets stored immediately without waiting for other sensors to synchronize. This is particularly valuable in rocket engine testing where critical sensors might operate at different frequencies. Some measure pressure every millisecond while others track temperature every second.
Robust handling of late data: In distributed R&D environments, network issues or sensor synchronisation problems can cause measurements to arrive out of order. Deep schema handles this gracefully since each measurement is independent and can be indexed correctly by timestamp regardless of arrival order.
Flexible sensor metadata: You can attach specific metadata to individual sensors without affecting others. For example, in automotive testing, you might need to track the calibration date for each sensor, or flag certain temperature readings as being from sensors with known drift characteristics.
Natural fit for time series databases: Most modern time series databases like InfluxDB are optimised for deep data patterns, making storage and retrieval efficient for sensor-heavy R&D applications.
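To see why deep schema shrugs off out-of-order arrival, here's a minimal sketch using SQLite (a stand-in for a real time series store; the table and index names are illustrative): each measurement is an independent row, and a timestamp index restores logical order at query time regardless of when rows landed.

```python
import sqlite3

# Each measurement is an independent row; arrival order doesn't matter
# because queries sort on the indexed timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (time TEXT, sensor_id TEXT, value REAL)")
conn.execute("CREATE INDEX idx_time ON measurements(time)")

# Simulate out-of-order arrival: the 10:00:01 voltage reading lands first
events = [
    ("10:00:01", "voltage", 12.4),
    ("10:00:00", "voltage", 12.6),
    ("10:00:00", "temperature", 23.1),
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", events)

# Querying by timestamp yields the correct logical order
rows = conn.execute(
    "SELECT time, value FROM measurements "
    "WHERE sensor_id = 'voltage' ORDER BY time").fetchall()
print(rows)  # [('10:00:00', 12.6), ('10:00:01', 12.4)]
```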
However, deep data comes with trade-offs that become apparent when your R&D teams need to perform cross-sensor analysis or visualisation.
Wide data: built for analytics and correlation
Wide data schema excels when R&D teams need to analyse relationships between different measurements or create comprehensive dashboards showing system behaviour.
Efficient cross-parameter analysis: When drone engineers need to correlate motor temperature with battery voltage and flight altitude, wide schema makes these queries straightforward. All related measurements share the same timestamp, eliminating complex joins.
Streamlined visualisation: Creating multi-parameter charts in tools like Grafana becomes much simpler when all measurements from a time interval exist in a single record. This is particularly valuable for HVAC system testing where you need to simultaneously track temperature, pressure, and airflow across multiple zones.
Reduced storage overhead: By eliminating repeated timestamp and metadata fields, wide schema can reduce storage requirements by 20-30% in high-frequency measurement scenarios. For R&D programs generating terabytes of test data, this translates to significant cost savings.
Better batch processing performance: When running analytics across large datasets such as analysing six months of battery degradation data, wide schema reduces the number of records that need to be processed, improving query performance.
The downside is that wide schema requires more complex data pipeline logic to handle the aggregation and synchronisation of measurements from different sensors.
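The drone example above illustrates the payoff. With a wide frame (hypothetical column names, sketched in pandas), correlating motor temperature against battery voltage is a one-liner, because all three parameters already share a timestamp per row:

```python
import pandas as pd

# Wide records: every sensor shares one timestamp per row,
# so cross-parameter analysis needs no joins.
wide = pd.DataFrame({
    "time": ["10:00:00", "10:00:01", "10:00:02", "10:00:03"],
    "motor_temp": [61.0, 63.5, 66.2, 69.0],        # °C
    "battery_voltage": [12.6, 12.4, 12.1, 11.8],   # V
    "altitude": [10.0, 25.0, 40.0, 55.0],          # m
})

# Correlate motor temperature with battery voltage in one expression
corr = wide["motor_temp"].corr(wide["battery_voltage"])
print(corr)  # strongly negative: voltage sags as the motor heats up
```

In deep schema, the same question would require self-joining the measurement table on timestamp once per sensor before any correlation could run.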
Technical challenges in R&D data transformation
Converting between deep and wide schemas presents several technical challenges specific to R&D environments:
Frequency mismatch handling: Different sensors often operate at different frequencies. High-speed pressure sensors might sample at 1kHz while temperature sensors update every second. Converting to wide schema requires downsampling decisions. Do you use the last value, mean, or maximum within each interval?
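The downsampling decision can be made explicit in the pipeline. A sketch of the 1kHz-pressure / 1Hz-temperature case (synthetic data, pandas resampling; the column names are illustrative) that keeps both the mean and the maximum of the fast signal:

```python
import numpy as np
import pandas as pd

# Hypothetical setup: pressure sampled at 1 kHz, temperature at 1 Hz
idx_fast = pd.date_range("2024-01-01 10:00:00", periods=2000, freq="ms")
pressure = pd.Series(
    np.random.default_rng(0).normal(101.3, 0.5, 2000), index=idx_fast)

idx_slow = pd.date_range("2024-01-01 10:00:00", periods=2, freq="s")
temperature = pd.Series([23.1, 23.2], index=idx_slow)

# The downsampling choice is explicit per sensor: mean and max for the
# fast pressure signal, last observed value for the slow temperature
wide = pd.DataFrame({
    "pressure_mean": pressure.resample("1s").mean(),
    "pressure_max": pressure.resample("1s").max(),
    "temperature": temperature.resample("1s").last(),
})
print(wide)
```

Which aggregate is right depends on the analysis: a max preserves pressure spikes a mean would smooth away, which matters when you're hunting transients.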
Grace period management: Wide schema requires defining how long to wait for late-arriving sensor data before finalising a record. In rocket testing, this might be 5 seconds, but in long-duration battery testing, you might wait 30 seconds to ensure all measurements are captured.
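The mechanics of a grace period are simple enough to sketch in plain Python (this is an illustrative toy, not a production windowing engine — stream processors implement the same idea with watermarks): buffer each interval's measurements, and only finalise an interval once the clock has moved past it by the grace period.

```python
from collections import defaultdict

GRACE_SECONDS = 5  # how long to wait for stragglers before finalising

buffers = defaultdict(dict)   # interval start -> {sensor: value}
finalised = {}                # intervals closed to further updates

def ingest(event_time, sensor, value, now):
    """Buffer a measurement; finalise intervals older than the grace period."""
    buffers[event_time][sensor] = value
    for t in sorted(buffers):
        if now - t > GRACE_SECONDS:
            finalised[t] = buffers.pop(t)

# A late temperature reading for interval 0 arrives at now=3,
# still inside the grace window, so it joins the buffered record
ingest(0, "voltage", 12.6, now=1)
ingest(0, "temperature", 23.1, now=3)
ingest(1, "voltage", 12.4, now=6)  # now=6 pushes interval 0 past the grace period

print(finalised)  # {0: {'voltage': 12.6, 'temperature': 23.1}}
```

Any measurement arriving for interval 0 after finalisation would need a separate correction path, which is exactly the complexity wide schema buys you into.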
Data cleaning and validation: R&D sensor data often contains outliers, dropouts, or calibration errors. The transformation process needs to handle these consistently across all sensors, which is more complex in wide schema where you're aggregating multiple potentially problematic measurements.
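As a small pandas sketch of what "handling these consistently" can mean (the 50-unit threshold is a made-up validity limit; real pipelines would use per-sensor calibration ranges): flag out-of-range values before aggregation, then interpolate dropouts so the wide record isn't poisoned by a single bad sensor.

```python
import numpy as np
import pandas as pd

# Hypothetical raw temperature column with a dropout (NaN) and an outlier
raw = pd.Series([23.1, 23.2, np.nan, 23.3, 99.9, 23.4])

# Mask out-of-range values first (assumed validity limit of 50 here;
# in practice this comes from per-sensor calibration metadata)
cleaned = raw.where(raw < 50)      # 99.9 becomes NaN
cleaned = cleaned.interpolate()    # fill dropouts linearly
print(cleaned.tolist())
```

Doing this per sensor before the deep-to-wide conversion keeps the cleaning rules attached to the sensor they belong to, rather than entangled in the aggregation step.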
Dynamic sensor configurations: R&D test setups frequently change between test campaigns. New sensors get added, old ones removed, or measurement frequencies adjusted. Your schema transformation logic needs to accommodate these changes without breaking existing analytics workflows.
Impact on R&D productivity and decision-making
The choice between deep and wide data schemas directly affects how quickly R&D teams can move from data collection to insights.
Query performance implications: Deep schema can slow down complex analytical queries, particularly when correlating measurements across dozens of sensors. A typical engine test might generate 50GB of data per hour across 200+ sensors. Wide schema reduces query times by 40-60% for multi-parameter analysis, directly impacting how quickly engineers can iterate on design decisions.
Dashboard responsiveness: R&D teams rely on real-time dashboards during testing campaigns. Wide schema enables faster dashboard updates, particularly important during critical tests where engineers need immediate feedback on system performance.
Historical data access: When investigating failures or validating design changes, engineers often need to query months of historical data. Wide schema's reduced record count makes these queries more responsive, enabling faster root cause analysis.
Collaboration workflows: Centralised data platforms benefit from wide schema's analytics advantages, making it easier for distributed R&D teams to share insights and collaborate on design decisions.
Practical solutions and best practices
Successfully implementing either schema requires addressing the specific challenges of R&D environments:
Hybrid approaches: Many successful R&D data platforms use both schemas strategically. Store raw sensor data in deep schema for flexibility and fault tolerance, then create wide schema views for analytics and visualisation. This approach provides the best of both worlds while managing the complexity through automation.
Automated schema conversion: Build pipeline components that can convert between schemas based on query requirements. For example, automatically aggregate deep data into wide schema for dashboard queries while maintaining the original deep data for detailed analysis.
Sensor grouping strategies: Group related sensors into logical units that share similar frequencies and analysis requirements. Battery testing might group cell-level measurements separately from pack-level measurements, allowing different schema approaches for different analysis needs.
Metadata management: Implement robust metadata tracking to handle sensor configuration changes over time. This is particularly important in R&D environments where test setups evolve frequently between campaigns.
Data retention policies: Define clear retention policies for each schema type. You might keep deep data for 90 days for detailed analysis, then aggregate to wide schema for long-term trending and compliance reporting.
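A rollup of that kind is straightforward to express. This sketch (pandas, with invented sensor names and a compressed timeline so it runs in-memory) splits deep data at a cutoff and aggregates the older slice into 1-minute wide averages for long-term trending:

```python
import numpy as np
import pandas as pd

# Synthetic deep data: two sensors interleaved at 1 Hz
deep = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=240, freq="s"),
    "sensor_id": ["voltage", "current"] * 120,
    "value": np.linspace(12.0, 13.0, 240),
})

# Everything older than the cutoff gets rolled up; the rest stays raw
cutoff = pd.Timestamp("2024-01-01 00:02:00")
old = deep[deep["time"] < cutoff]
recent = deep[deep["time"] >= cutoff]

# Aggregate the old slice into wide 1-minute averages
rollup = old.pivot_table(
    index=pd.Grouper(key="time", freq="1min"),
    columns="sensor_id", values="value", aggfunc="mean")
print(len(rollup), len(recent))  # 2 minute-level rows replace 120 raw rows
```

In production the cutoff would be 90 days rather than 2 minutes, and the rollup would write to a separate long-term table, but the shape of the job is the same.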
Conclusion
Choosing between deep and wide data schemas in R&D environments isn't about finding the "correct" answer. It's about understanding your specific requirements and trade-offs. Deep schema offers flexibility and robust data ingestion, making it ideal for sensor-heavy environments with complex measurement patterns. Wide schema provides superior analytics performance and storage efficiency, particularly valuable for cross-parameter analysis and dashboard applications.
The most successful R&D data platforms often employ hybrid approaches, using deep schema for data ingestion and fault tolerance while creating wide schema views for analytics and collaboration. This strategy addresses the fundamental challenge of R&D environments: balancing the need for complete, accurate data capture with the requirement for fast, flexible analysis.
As R&D teams increasingly rely on data-driven decision making, investing in the right data infrastructure becomes critical for maintaining competitive advantage. Whether you choose deep, wide, or hybrid approaches, the key is building systems that can evolve with your R&D requirements while enabling the fast iteration cycles that modern product development demands.
Ready to explore how modern data platforms can streamline your R&D workflows? Consider evaluating solutions that can handle both schema types and provide the flexibility to adapt as your requirements evolve.

Check out the repo
Our Python client library is open source, and brings DataFrames and the Python ecosystem to stream processing.

Interested in Quix Cloud?
Take a look around and explore the features of our platform.
