Data quality before models: a practical checklist

What is data quality before models?

Data quality before models means proving your CRM, document, and ticket data are unique, attributable, and safe to combine before you spend on retrieval, summarisation, or workflow automation. The goal is trustworthy records and clear field meaning, so pilots measure business outcomes rather than stalling on debates about model settings.

Who this guide is for

This guide is for data leads, finance and ops owners, and sponsors wiring Australian SME systems into AI layers. Use it with data cadence so hygiene keeps pace with delivery demos.

Identity and duplication

Define what makes a customer, account, or asset “one row” in each system. List the merge rules that apply when two records collide (email match, ABN, address normalisation). If humans routinely create duplicates because the CRM is painful to use, fix intake first. See our CRM deduplication guide for a focused pass before automation spend.
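As a rough illustration of merge rules, the sketch below groups records that share a normalised email or ABN. The field names (`id`, `email`, `abn`) and the sample records are assumptions for this example, not a schema from any particular CRM, and it only flags candidates for review; it does not merge anything or validate the ABN checksum.

```python
import re
from collections import defaultdict

def normalise_email(email):
    """Lowercase and trim; returns None for blank values."""
    if not email:
        return None
    cleaned = email.strip().lower()
    return cleaned or None

def normalise_abn(abn):
    """Keep digits only. An ABN has 11 digits; anything else returns None."""
    if not abn:
        return None
    digits = re.sub(r"\D", "", abn)
    return digits if len(digits) == 11 else None

def candidate_duplicates(records):
    """Group record ids that share a normalised email or ABN."""
    by_key = defaultdict(list)
    for rec in records:
        email = normalise_email(rec.get("email"))
        abn = normalise_abn(rec.get("abn"))
        if email:
            by_key[("email", email)].append(rec["id"])
        if abn:
            by_key[("abn", abn)].append(rec["id"])
    # Only keys shared by more than one record are duplicate candidates.
    return {key: ids for key, ids in by_key.items() if len(ids) > 1}

# Illustrative records only.
records = [
    {"id": 1, "email": "Pat@Example.com ", "abn": "51 824 753 556"},
    {"id": 2, "email": "pat@example.com", "abn": None},
    {"id": 3, "email": "lee@example.com", "abn": "51824753556"},
]
dupes = candidate_duplicates(records)
```

Here records 1 and 2 collide on email, and records 1 and 3 collide on ABN. A human or a documented rule still decides which record survives.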

Field ownership and meaning

For each field you plan to read or write, document who owns updates, which values are allowed, and what “empty” means (unknown versus not applicable). Ambiguous enums and free-text blobs in critical paths are where models produce plausible but wrong output.
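One way to make that documentation checkable is a small field specification. This is a minimal sketch: `FieldSpec`, `check_value`, and the `account_status` example are hypothetical names for illustration, not part of any specific system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSpec:
    name: str
    owner: str                 # team or role responsible for updates
    allowed_values: frozenset  # the documented enum
    empty_means: str           # "unknown" or "not_applicable"

def check_value(spec, value):
    """Return a problem description, or None when the value is acceptable."""
    if value is None or value == "":
        # Empty is permitted; its meaning is recorded in spec.empty_means.
        return None
    if value not in spec.allowed_values:
        return f"{spec.name}: {value!r} is not an allowed value (owner: {spec.owner})"
    return None

# Hypothetical field definition.
status_spec = FieldSpec(
    name="account_status",
    owner="sales-ops",
    allowed_values=frozenset({"active", "paused", "closed"}),
    empty_means="unknown",
)

ok = check_value(status_spec, "active")
problem = check_value(status_spec, "Actve")
```

Running a check like this over existing records shows how many values fall outside the documented enum before any automation reads them.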

PII and retention boundaries

Separate identifiers from sensitive narrative fields. Decide the minimum retention period for logs that capture model inputs and outputs, especially where health, finance, or HR text appears. If legal has not signed off on processing, pause before wiring production mailboxes or attachments. See PII boundaries for document retrieval.
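The separation and retention ideas can be sketched as follows. The field names and the 30-day figure are placeholders chosen for the example; the real lists and periods come from your own legal and privacy review.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical field names; replace with the fields in your own systems.
IDENTIFIER_FIELDS = {"customer_id", "email"}
SENSITIVE_NARRATIVE_FIELDS = {"case_notes"}

def split_record(record):
    """Split a record into identifiers, sensitive narrative, and everything else."""
    identifiers, narrative, other = {}, {}, {}
    for key, value in record.items():
        if key in IDENTIFIER_FIELDS:
            identifiers[key] = value
        elif key in SENSITIVE_NARRATIVE_FIELDS:
            narrative[key] = value
        else:
            other[key] = value
    return identifiers, narrative, other

def past_retention(logged_at, now, retention_days):
    """True when a log entry is older than the agreed retention period."""
    return now - logged_at > timedelta(days=retention_days)

record = {
    "customer_id": "C-17",
    "email": "pat@example.com",
    "case_notes": "Free-text notes.",
    "status": "open",
}
identifiers, narrative, other = split_record(record)

# Placeholder dates and a placeholder 30-day period.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
logged_at = datetime(2024, 5, 1, tzinfo=timezone.utc)
expired = past_retention(logged_at, now, retention_days=30)
```

Keeping the three groups apart makes it easier to send only the non-sensitive part to a model, and to apply different retention rules to each.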

Golden scenarios

Agree on three sets of cases with your team:
Ten real cases your team agrees are good outcomes after enrichment or routing.
Ten edge cases you expect to fail, such as missing attachments or refunds mid-flight.
A regression pack you replay after every connector or prompt change.
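A regression pack can be as simple as a list of inputs with agreed expected outcomes, replayed through whatever step you changed. The sketch below assumes a hypothetical `route_ticket` function standing in for your real routing or enrichment step; the cases are invented for illustration.

```python
def route_ticket(ticket):
    """Stand-in router (hypothetical): refunds go to finance, the rest to support."""
    return "finance" if "refund" in ticket["subject"].lower() else "support"

# Illustrative cases; a real pack would hold the agreed golden and edge cases.
REGRESSION_PACK = [
    {"name": "refund request", "input": {"subject": "Refund for order"}, "expected": "finance"},
    {"name": "login problem", "input": {"subject": "Cannot log in"}, "expected": "support"},
]

def replay(cases, step):
    """Run every case through `step` and collect the ones that no longer match."""
    failures = []
    for case in cases:
        actual = step(case["input"])
        if actual != case["expected"]:
            failures.append(
                {"name": case["name"], "expected": case["expected"], "actual": actual}
            )
    return failures

failures = replay(REGRESSION_PACK, route_ticket)
```

An empty `failures` list means the change did not alter any agreed outcome; anything else names the cases to review before release.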

Measure coverage and timeliness on real records before you tune model parameters.

Measurement that precedes model tuning

Track coverage: what percentage of target records have the fields your automation needs? Track timeliness: how stale is each feed? Those two charts show whether the pilot is data-limited before you change prompts or models.
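Both measures are simple to compute. A minimal sketch, assuming records are dictionaries and each feed exposes a last-updated timestamp; the field names and sample values are illustrative only.

```python
from datetime import datetime, timezone

def coverage(records, required_fields):
    """Percentage of records where every required field is populated."""
    if not records:
        return 0.0
    complete = sum(
        1
        for rec in records
        if all(rec.get(field) not in (None, "") for field in required_fields)
    )
    return 100.0 * complete / len(records)

def staleness_hours(last_updated, now):
    """Hours since the feed was last updated."""
    return (now - last_updated).total_seconds() / 3600

# Illustrative records: one is missing the email the automation needs.
records = [
    {"email": "a@example.com", "abn": "51824753556"},
    {"email": "b@example.com", "abn": "51824753556"},
    {"email": "", "abn": "51824753556"},
    {"email": "d@example.com", "abn": "51824753556"},
]
pct = coverage(records, ["email", "abn"])

now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
last_updated = datetime(2024, 6, 30, 6, 0, tzinfo=timezone.utc)
hours_stale = staleness_hours(last_updated, now)
```

In this sample, coverage is 75% and the feed is six hours stale. Charting both per feed over time shows whether a pilot is limited by data rather than by prompts or models.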

How Yarli applies this checklist

We pair data assessment with delivery and review cadence. See Knowledge base & internal Q&A, workflow integrations, and Work.

Published by Yarli Data, Sydney. Australia-wide delivery for operational Data and AI pilots.