AI · Engineering · Data Quality · 2025-01-15

Why AI in Engineering Must Start With Data Quality

Everyone wants AI to "predict" and "optimise". But in most engineering environments, the real battle is much more basic: getting clean, complete, and consistent data.

When people talk about AI in engineering, the conversation usually jumps straight to prediction, optimisation, or digital twins. Those are genuinely exciting destinations. But they often skip an uncomfortable reality: in most organisations, the underlying data is not close to ready.

Incomplete equipment records, inconsistent tag naming across systems, legacy formats that predate modern software, and manual copy-paste workflows that introduce silent errors at every step — these are not edge cases. They are the normal operating environment for most large engineering projects and industrial facilities.

Even the best models cannot fix data that is fundamentally unreliable. Garbage in, garbage out has never been more relevant than in the context of AI applied to engineering data.

**The gap between what people think they have and what they actually have**

One of the most consistent surprises I encounter when working with engineering data is how confident organisations are about their data quality before we start, and how different the reality looks once we examine it closely.

A common scenario: a company has maintained an instrument list for 15 years across multiple project phases. It has been through three EPC contractors, two system migrations, and several departmental reorganisations. Each phase added columns, changed naming conventions, and introduced new validation rules — without cleaning up the data that predated them.

The result is a spreadsheet that looks complete. It has all the rows and columns. But when you check whether the tag names are consistent, whether the engineering units match the expected range, whether the instrument type codes still correspond to the current classification standard — the picture changes quickly.
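To make the checks above concrete, here is a minimal sketch in Python. The tag pattern, type codes, column names, and unit ranges are all illustrative assumptions, not any real classification standard:

```python
import re

# Hypothetical rules for a legacy instrument list; the naming pattern,
# type codes, and unit ranges below are illustrative assumptions.
TAG_PATTERN = re.compile(r"^[A-Z]{2,4}-\d{3,5}[A-Z]?$")   # e.g. "PT-1042A"
VALID_TYPE_CODES = {"PT", "TT", "FT", "LT"}                # assumed current standard
UNIT_RANGES = {"barg": (0.0, 400.0), "degC": (-50.0, 650.0)}

def check_instrument(row):
    """Return a list of issues found in one instrument record (a dict)."""
    issues = []
    tag = row.get("tag", "")
    if not TAG_PATTERN.match(tag):
        issues.append(f"{tag!r}: tag does not match naming convention")
    if row.get("type_code") not in VALID_TYPE_CODES:
        issues.append(f"{tag!r}: unknown type code {row.get('type_code')!r}")
    unit = row.get("unit")
    rng = UNIT_RANGES.get(unit)
    if rng is None:
        issues.append(f"{tag!r}: unrecognised unit {unit!r}")
    elif not (rng[0] <= row.get("value", float("nan")) <= rng[1]):
        issues.append(f"{tag!r}: value outside expected range for {unit}")
    return issues
```

Run against a spreadsheet export, a checker like this typically turns a list that "looks complete" into a long report of naming, unit, and classification inconsistencies.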

This is the hidden cost that AI projects expose. Not because AI is fragile, but because it makes inconsistencies visible in ways that manual workflows never did.

**Why validation must be treated as a first-class feature**

The conventional approach to data quality is a one-time clean-up before a major system migration or project kick-off. A team spends weeks or months reconciling records, and then the project moves forward. Within six months, the drift begins again.

The only way to maintain engineering data quality at scale is to make validation continuous and embedded in the workflow. Not a phase. Not a gate. A persistent background process that catches issues as they are introduced — not years later when they have propagated through downstream systems.

This means building validation into every point where data is created or modified. It means treating a missing mandatory field or a conflicting attribute as something the system flags immediately, not something an analyst discovers during a quarterly review.

It also means thinking about validation in layers. Mandatory field checks are the baseline — fast, cheap, and unambiguous. Range and format validations add the next layer. Relational checks — where the system verifies that values are consistent with each other across related records — are more complex but significantly more valuable. And AI-assisted anomaly detection adds the final layer, catching the subtle patterns that rigid rules cannot reach.
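The first three layers can be sketched as a simple pipeline that runs checks in order of cost. The field names and rules here are illustrative assumptions, and the AI-assisted anomaly layer is omitted:

```python
# Minimal sketch of layered validation; schema and rules are assumptions.

def mandatory_fields(rec, required=("tag", "type_code", "unit")):
    """Layer 1: fast, cheap, unambiguous baseline checks."""
    return [f"missing field: {f}" for f in required if not rec.get(f)]

def range_and_format(rec):
    """Layer 2: range and format validations on individual values."""
    issues = []
    v = rec.get("calibrated_range_max")
    if v is not None and v <= 0:
        issues.append("calibrated_range_max must be positive")
    return issues

def relational(rec, loops):
    """Layer 3: cross-record consistency, e.g. an instrument's signal
    range should match the output range recorded on its loop (assumed rule)."""
    issues = []
    loop = loops.get(rec.get("loop_id"))
    if loop and rec.get("signal_range") != loop.get("output_range"):
        issues.append("signal range conflicts with loop controller output")
    return issues

def validate(rec, loops):
    """Run the layers in order of cost; stop early if the baseline fails."""
    issues = mandatory_fields(rec)
    if issues:
        return issues  # no point range-checking an incomplete record
    issues += range_and_format(rec)
    issues += relational(rec, loops)
    return issues
```

Hooking `validate` into every create and update path is what turns validation from a one-time clean-up phase into a persistent background process.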

**The relationship between data quality and AI trust**

There is a direct connection between the reliability of the underlying data and how much engineers are willing to trust AI recommendations built on top of it.

If an AI validation tool flags a conflict between two attributes, and the engineer knows that the source data is often inconsistent, they will dismiss the flag as noise. Probably wrong. Not worth investigating. The tool trains users to ignore it.

If the same tool operates on data that has been consistently maintained, the same flag carries real weight. It is worth stopping for. It is probably indicating something real.

This means that data quality work is not just a technical prerequisite for AI. It is the foundation of user trust in AI. Organisations that invest in clean data before building AI on top of it end up with AI tools that actually get used. Those that skip the foundation find that their expensive models generate expensive noise that gets dismissed.

**A practical starting point**

For engineering organisations beginning this journey, I consistently recommend the same starting point: pick the most critical dataset — the one where errors cause the most downstream problems — and build continuous validation for it first. Make it visible. Make it actionable. Make the metrics public.
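"Make the metrics public" can be as simple as aggregating validation results into a short summary that anyone can see. A minimal sketch, assuming each result is a record identifier paired with its list of issue strings:

```python
from collections import Counter

def quality_metrics(results):
    """Summarise validation results.

    results: list of (record_id, list_of_issue_strings) pairs,
    e.g. the output of running a validator over the whole dataset.
    """
    total = len(results)
    clean = sum(1 for _, issues in results if not issues)
    # Group issues by their leading label, e.g. "missing field".
    by_kind = Counter(issue.split(":")[0] for _, issues in results for issue in issues)
    return {
        "records": total,
        "clean_pct": round(100 * clean / total, 1) if total else 100.0,
        "top_issues": by_kind.most_common(3),
    }
```

Publishing numbers like `clean_pct` over time is what makes the validation effort visible and actionable rather than an invisible background task.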

Once teams see that validation catches real errors that would otherwise have been found late, at greater expense, and by hand, the argument for expanding the approach makes itself.

AI in engineering should not begin with a model. It should begin with a quiet, systematic commitment to trusting the data we already have — and building the infrastructure to keep it trustworthy over time.