
Data quality before models: a practical checklist
- 14 May 2026
A capable model on dirty data does not produce insight — it produces confident nonsense at scale. Before you spend budget on retrieval, summarisation, or workflow automation, run the same hygiene pass we use with clients: prove the underlying records are unique, attributable, and safe to combine.
Identity and duplication
Define what makes a customer, account, or asset “one row” in each system. List the merge rules that apply when two records collide (email match, ABN, address normalisation). If humans routinely create duplicates because the CRM is painful to use, fix that first: automation will only replicate the pain faster.
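As a rough illustration of what a collision rule can look like in practice, here is a minimal Python sketch that groups records sharing a normalised email or ABN. The record shape (`id`, `email`, `abn`) and the function names are assumptions made for this example, not a schema from any particular CRM. The ABN check only confirms that 11 digits are present; it does not validate the checksum, and address normalisation is left out.

```python
import re
from collections import defaultdict


def normalise_email(email):
    """Lower-case and trim an email so trivial variants compare equal."""
    return email.strip().lower() if email else None


def normalise_abn(abn):
    """Keep digits only; return None unless exactly 11 digits remain.

    Assumption: format check only, no checksum validation.
    """
    digits = re.sub(r"\D", "", abn or "")
    return digits if len(digits) == 11 else None


def find_duplicate_groups(records):
    """Return {(rule, key): [record ids]} for keys shared by 2+ records."""
    by_key = defaultdict(set)
    for rec in records:
        email = normalise_email(rec.get("email"))
        abn = normalise_abn(rec.get("abn"))
        if email:
            by_key[("email", email)].add(rec["id"])
        if abn:
            by_key[("abn", abn)].add(rec["id"])
    return {key: sorted(ids) for key, ids in by_key.items() if len(ids) > 1}
```

The output is a list of candidate groups for a human or a merge job to review. It reports which rule fired for each group, which makes it easier to see whether one rule is producing most of the collisions.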
Field ownership and meaning
For each field you plan to read or write, document who owns updates, which values are allowed, and what “empty” means (unknown versus not applicable). Ambiguous enums and free-text blobs in critical paths are where models produce plausible-sounding but wrong output.
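One way to keep that documentation honest is to make it machine-checkable. The sketch below is a hypothetical field dictionary in Python: `FieldSpec`, `check_value`, and the example owners and values are all invented for illustration. It records an owner, an optional set of allowed values, and what an empty value means for each field.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN = "unknown"
NOT_APPLICABLE = "not_applicable"


@dataclass(frozen=True)
class FieldSpec:
    name: str
    owner: str                            # team accountable for updates
    allowed: Optional[frozenset] = None   # None means free text
    empty_means: str = UNKNOWN            # how to read a blank value


def check_value(spec, value):
    """Return 'ok', 'invalid', or the spec's documented meaning of empty."""
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return spec.empty_means
    if spec.allowed is not None and value not in spec.allowed:
        return "invalid"
    return "ok"


# Hypothetical entries; real owners and enums come from your own systems.
STATUS = FieldSpec("status", "Sales Ops", frozenset({"lead", "active", "churned"}))
FAX = FieldSpec("fax", "Support", None, NOT_APPLICABLE)
```

With this in place, `check_value(STATUS, "Active?")` reports `"invalid"` instead of letting an off-enum value flow into a prompt, and a blank `fax` is read as not applicable, not as missing data.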
PII and retention boundaries
Separate identifiers from sensitive narrative fields. Decide on a minimum retention period for logs that capture model inputs and outputs, especially where health, finance, or HR text appears. If legal has not signed off on processing, pause before wiring up production mailboxes or attachments.
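A minimal sketch of both ideas, assuming records are plain dictionaries and each log entry carries a timezone-aware `logged_at` timestamp. The field names and the retention period are placeholders; the real values are a decision for your legal and governance owners, not something code can supply.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical field classifications for illustration only.
IDENTIFIER_FIELDS = {"customer_id", "email"}
SENSITIVE_FIELDS = {"case_notes", "medical_summary"}


def split_record(record):
    """Split a record into identifiers, sensitive narrative, and the rest."""
    identifiers = {k: v for k, v in record.items() if k in IDENTIFIER_FIELDS}
    sensitive = {k: v for k, v in record.items() if k in SENSITIVE_FIELDS}
    other = {
        k: v for k, v in record.items()
        if k not in IDENTIFIER_FIELDS and k not in SENSITIVE_FIELDS
    }
    return identifiers, sensitive, other


def expired_logs(log_entries, retention_days, now=None):
    """Return log entries older than the agreed retention period."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [entry for entry in log_entries if entry["logged_at"] < cutoff]
```

Keeping the split explicit means the sensitive part can be stored, logged, or withheld from a model call under different rules from the identifiers it belongs to.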
Golden scenarios
- Ten real cases your team agrees are “good” outcomes after enrichment or routing.
- Ten edge cases you expect to fail — missing attachments, multi-party deals, refunds mid-flight.
- A regression pack you can replay after every change to connectors or prompts.
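A regression pack of this kind can be as simple as a list of named inputs with agreed outcomes and a loop that replays them. The sketch below assumes each scenario is a dictionary with `name`, `input`, and `expected` keys. The `route` function is a toy stand-in for whatever connector or prompt chain you are actually testing.

```python
def replay(scenarios, run):
    """Run every scenario through `run`; return those whose outcome differs."""
    failures = []
    for scenario in scenarios:
        try:
            actual = run(scenario["input"])
        except Exception as exc:
            # Record the failure type so expected failures can be asserted too.
            actual = f"error: {type(exc).__name__}"
        if actual != scenario["expected"]:
            failures.append({
                "name": scenario["name"],
                "expected": scenario["expected"],
                "actual": actual,
            })
    return failures


def route(ticket):
    """Toy router used only to demonstrate the harness (hypothetical)."""
    if not ticket.get("attachment"):
        raise ValueError("missing attachment")
    return "billing" if "refund" in ticket["subject"].lower() else "general"


SCENARIOS = [
    {"name": "refund with attachment",
     "input": {"subject": "Refund request", "attachment": "inv.pdf"},
     "expected": "billing"},
    {"name": "missing attachment",
     "input": {"subject": "Refund request"},
     "expected": "error: ValueError"},
]
```

Calling `replay(SCENARIOS, route)` returns an empty list when every case behaves as agreed. Edge cases that are expected to fail are written down with their expected error, so a change that silently "fixes" or worsens them still shows up.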
Models interpolate; operations extrapolate. Only one of those should happen on unvetted data.
Measurement that precedes “accuracy”
Track coverage: what percentage of target records have the fields your automation needs? Track timeliness: how stale is each feed? Those two charts tell you whether a pilot is data-limited before you debate temperature settings.
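Both numbers are cheap to compute. The following sketch assumes records are dictionaries and that each feed exposes a timezone-aware last-updated timestamp. The function names and the rule that `None` or an empty string counts as missing are assumptions for this example.

```python
from datetime import datetime, timezone


def coverage(records, required_fields):
    """Percentage of records in which every required field is populated."""
    if not records:
        return 0.0
    complete = sum(
        all(rec.get(field) not in (None, "") for field in required_fields)
        for rec in records
    )
    return 100.0 * complete / len(records)


def staleness_hours(last_updated, now=None):
    """Hours elapsed since a feed was last updated."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated).total_seconds() / 3600
```

For example, if three of four target records have both `email` and `abn` filled in, `coverage(records, ["email", "abn"])` gives 75.0. Plotting that figure per field and per feed over time is usually enough to show whether a pilot is limited by its data.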
Where we plug in
We pair data assessment with delivery so pilots do not outrun governance. Start from Knowledge base & internal Q&A or workflow integrations, and see Work for how programmes are phased.