Data preparation

27 August 2025

Some good pointers on data cleansing, which also double as a checklist of things to think about at the start of a data project, at least on the data engineering side.

  1. Always Start With Understanding the Data
  • Know your source – Where is the data coming from? Internal database? Third-party API? Manual upload?
  • Understand its purpose – Why was this data collected? Who will use it?
  • Check data freshness – Is it near real-time, daily batch, or static historical data?
  • Understand scale – How many rows/records are we talking about? This will affect tooling and approach
  • Look for key business drivers – Which columns/fields matter most to the business use case?
  • Get context from stakeholders – Talk to PMs, analysts, or SMEs before you write a single cleaning script
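A first pass at "understanding the data" can be as simple as a quick profile of scale, freshness, and completeness to bring to stakeholders. A minimal sketch with pandas (the column names and data here are illustrative, not from the source):

```python
import pandas as pd

# Illustrative raw extract; in practice this comes from the source system
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "signup_date": ["2025-01-05", "2025-02-11", None, "2025-03-02"],
    "plan": ["basic", "pro", "basic", "pro"],
})

# Scale: how many rows/records are we talking about?
row_count = len(df)

# Freshness proxy: most recent timestamp present in the data
latest_signup = pd.to_datetime(df["signup_date"]).max()

# Null profile per column, a good conversation starter with SMEs
null_rates = df.isna().mean()
```

Even this much tells you whether you are dealing with a daily batch or stale history, and which fields stakeholders should weigh in on first.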
  2. Define the Schema Early
  • Choose correct data types – int, float, string, timestamp, boolean; pick precisely
  • Apply constraints – Not Null, Unique, Primary Key, Foreign Key where applicable
  • Default values – Decide defaults for missing fields only if business rules allow
  • Precision and scale – For decimals, know how many digits you need after the point
  • Data size control – Avoid VARCHAR(255) everywhere; optimize for reality
  • Match schema to usage – The schema should reflect how data will be queried and processed
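One way to pin the schema down early is to declare expected types in code and cast against them, so mismatches fail loudly on day one. A sketch in plain pandas (table and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical declared schema: column -> dtype chosen precisely
expected_schema = {
    "order_id": "int64",
    "amount": "float64",
    "currency": "string",
    "created_at": "datetime64[ns]",
}

raw = pd.DataFrame({
    "order_id": ["1", "2", "3"],          # arrived as strings
    "amount": ["9.99", "20.00", "5.50"],
    "currency": ["USD", "USD", "EUR"],
    "created_at": ["2025-08-01", "2025-08-02", "2025-08-03"],
})

# Cast each column to its declared type; bad values raise immediately
typed = raw.copy()
typed["order_id"] = typed["order_id"].astype("int64")
typed["amount"] = typed["amount"].astype("float64")
typed["currency"] = typed["currency"].astype("string")
typed["created_at"] = pd.to_datetime(typed["created_at"])

# Enforce NOT NULL / UNIQUE-style constraints on the primary key
assert typed["order_id"].notna().all()
assert typed["order_id"].is_unique
```

In a real warehouse the same intent lives in DDL constraints; the point is that the types and constraints are decided before the data flows, not after.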
  3. Validate Data Types & Formats
  • Check for mismatches – Numeric columns containing strings (e.g., “N/A” in an int column)
  • Enforce formats – Dates should follow a standard (ISO 8601 or company format)
  • Parse complex types – JSON fields, arrays, and structs; ensure they’re valid and parsable
  • Catch unit mismatches – e.g., “km” vs “miles” in the same column
  • Test with sampling – Randomly check 100–200 rows to spot patterns
  • Automate validation – Use scripts or frameworks like Great Expectations to enforce rules
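The "N/A in an int column" and date-format checks above can be sketched in a few lines of pandas, where coercing bad values to NaN/NaT makes mismatches countable (the sample values are illustrative):

```python
import pandas as pd

# A column that should be numeric but contains "N/A"
qty = pd.Series(["10", "3", "N/A", "7"], name="quantity")

# Coerce non-numeric values to NaN so mismatches are easy to surface
parsed = pd.to_numeric(qty, errors="coerce")
bad_rows = qty[parsed.isna()]

# Enforce one date standard: strict parsing turns non-conforming
# values into NaT instead of silently guessing the format
dates = pd.Series(["2025-08-27", "27/08/2025"])
iso = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
nonconforming = int(iso.isna().sum())
```

A framework like Great Expectations wraps the same idea in declarative, reusable rules, but the coerce-and-count pattern is the core of it.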
  4. Handle Missing Data Properly
  • Identify NULLs – Know exactly which fields are missing and how often
  • Decide on action – Drop, fill, or leave missing based on use case
  • Use domain logic – If birth_date is missing, you can’t just fill with 01-01-1900 without reason
  • Flag critical NULLs – Create alerts when important fields (like customer_id) are missing
  • Avoid blind imputation – Filling missing values can distort downstream analysis
  • Document decisions – Always record why you handled missing data a certain way
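The decisions above, counting NULLs, flagging critical ones, and filling only where business rules allow, might look like this sketch (column names and the "unknown" marker are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C1", None, "C3", "C4"],
    "birth_date": [None, "1990-05-01", None, "1985-11-20"],
    "country": ["US", "US", None, "DE"],
})

# Know exactly which fields are missing and how often
null_counts = df.isna().sum()

# Flag critical NULLs: customer_id must never be missing
critical_missing = int(df["customer_id"].isna().sum())

# Domain-driven decision: leave birth_date missing (no defensible
# default exists), but mark country with an explicit "unknown"
# rather than inventing a fake value
cleaned = df.copy()
cleaned["country"] = cleaned["country"].fillna("unknown")
```

Note that birth_date is deliberately left as-is; blind imputation there would distort any age-based analysis downstream.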
  5. Detect & Remove Duplicates
  • Define uniqueness – What makes a record unique? Single field or combination?
  • Check both exact and fuzzy duplicates – E.g., same customer name with slightly different spelling
  • Decide on deduplication strategy – Keep first, keep last, or merge data
  • Flag potential duplicates – Don’t delete blindly; verify with business teams
  • Use hashing – Create hash keys for comparison when datasets are large
  • Automate dedup checks – Schedule scripts to run regularly for recurring data sources
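A minimal sketch of exact dedup plus hash keys for large-scale comparison (names and the uniqueness key are made up; the fuzzy case, same email with different name spellings, is only flagged, not auto-merged):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({
    "customer": ["Jon Smith", "Jon Smith", "John Smith"],
    "email": ["jon@example.com", "jon@example.com", "jon@example.com"],
})

# Exact duplicates on the chosen uniqueness key
exact_dupes = int(df.duplicated(subset=["customer", "email"]).sum())

# Keep-first deduplication strategy
deduped = df.drop_duplicates(subset=["customer", "email"], keep="first")

# Hash keys make comparisons cheap when datasets are large
df["row_hash"] = (
    df["customer"].str.lower() + "|" + df["email"].str.lower()
).map(lambda s: hashlib.sha256(s.encode()).hexdigest())

# Same email, different names: fuzzy-duplicate candidates for review
fuzzy_candidates = deduped.groupby("email").size()
```

"Jon Smith" vs "John Smith" survives the exact dedup on purpose; that pair goes to the business team, not to a delete statement.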
  6. Standardize Units & Formats
  • Pick one unit – Convert all weights, distances, currencies to a single standard
  • Use consistent casing – Lowercase or uppercase text fields consistently
  • Normalize whitespace – Trim spaces before/after values
  • Standardize date formats – Pick one and stick to it (e.g., YYYY-MM-DD)
  • Apply consistent naming – “US”, “USA”, and “United States” should be unified
  • Automate standardization – Use mapping tables or scripts to enforce rules
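Whitespace, casing, a mapping table for variant names, and a single unit can all be enforced in one short pass; a sketch (the mapping table and conversion target are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["  USA", "US ", "United States", "DE"],
    "weight_lb": [150.0, 200.0, 175.0, 180.0],
})

# Normalize whitespace and casing first, so the mapping table is small
df["country"] = df["country"].str.strip().str.upper()

# Mapping table unifies "US", "USA", and "United States"
country_map = {"USA": "US", "UNITED STATES": "US", "US": "US", "DE": "DE"}
df["country"] = df["country"].map(country_map)

# Pick one unit: convert pounds to kilograms everywhere
df["weight_kg"] = (df["weight_lb"] * 0.45359237).round(2)
```

Keeping the mapping in a table (rather than scattered if/else logic) is what makes the standardization automatable and auditable.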
  7. Identify & Fix Outliers
  • Define valid ranges – Age can’t be 250, salary can’t be -5000
  • Use statistical methods – Z-score, IQR to detect anomalies
  • Validate with domain experts – Sometimes “outliers” are actually correct
  • Flag for review – Never delete unusual values without business sign-off
  • Track frequency – Recurrent outliers may mean an upstream issue
  • Document everything – If you remove or alter data, log it
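Both detection styles mentioned above, hard domain ranges and statistical fences, can be run side by side; a sketch using an IQR fence on an illustrative age column:

```python
import pandas as pd

ages = pd.Series([25, 31, 42, 38, 250, 29, -3])

# Rule-based range check from domain knowledge: age can't be 250 or negative
range_violations = ages[(ages < 0) | (ages > 120)]

# Statistical check: 1.5x IQR fences
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = ages[(ages < lower) | (ages > upper)]

# Flag for review rather than deleting; sign-off comes from the business
flagged = ages.isin(range_violations)
```

When the two methods disagree, that is usually the interesting case to take to a domain expert, since some "outliers" turn out to be correct.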
  8. Enforce Referential Integrity
  • Check foreign keys – Ensure referenced IDs exist in parent tables
  • Avoid orphan records – A transaction without a valid customer should be flagged
  • Validate join keys – Ensure keys are consistent in type and format across tables
  • Cascade rules – Decide how deletes/updates propagate across related tables
  • Regular audits – Schedule scripts to check integrity at intervals
  • Collaborate with upstream teams – Fix issues at the source if possible
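The orphan-record check is essentially an anti-join: keep the child rows whose key finds no match in the parent table. A pandas sketch (table and key names are illustrative):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "txn_id": [10, 11, 12],
    "customer_id": [1, 2, 99],  # 99 has no parent record
})

# Left join with an indicator column; "left_only" rows are orphans
merged = transactions.merge(
    customers, on="customer_id", how="left", indicator=True
)
orphans = merged[merged["_merge"] == "left_only"]
```

Scheduled as a regular audit, this surfaces orphans for flagging; the durable fix is still upstream, with proper foreign-key constraints at the source.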
  9. Track Data Lineage
  • Know the source – Where each field came from
  • Document transformations – Every cleaning step should be traceable
  • Map dependencies – Know which downstream systems use your cleaned data
  • Use versioning – Keep snapshots of data and schemas over time
  • Audit regularly – Compare current and historical data for unexpected changes
  • Adopt a lineage tool – Like OpenLineage or DataHub for visibility
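Even before adopting a tool like OpenLineage or DataHub, "every cleaning step should be traceable" can start as a hand-rolled log of transformations; a toy sketch (table names and notes are invented for illustration):

```python
# Minimal hand-rolled lineage log; real setups would emit these events
# to a lineage tool instead of an in-memory list
lineage = []

def log_step(step, source, target, note):
    """Record one transformation so every cleaning step stays traceable."""
    lineage.append(
        {"step": step, "source": source, "target": target, "note": note}
    )

log_step("extract", "crm.orders", "staging.orders", "daily batch pull")
log_step("clean", "staging.orders", "clean.orders", "orphan rows flagged")
```

The value is in the discipline: every dataset can answer "where did this field come from, and what happened to it on the way?"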
  10. Build Automated Data Quality Checks
  • Threshold alerts – E.g., alert if NULL rate exceeds 5%
  • Schema drift detection – Notify if schema changes unexpectedly
  • Volume checks – Alert if record count deviates drastically
  • Value checks – Alert on unexpected category values
  • Freshness checks – Alert if data hasn’t been updated in X hours
  • Integrate into pipelines – Make quality checks a default pipeline step
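Several of these checks reduce to a metric plus a threshold; a sketch of the NULL-rate, volume, and value checks as simple boolean alerts (thresholds and categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, None, 3, 4, None, 6, 7, 8, 9, 10]})

# Threshold alert: fire if the NULL rate exceeds 5%
null_rate = df["customer_id"].isna().mean()
null_alert = bool(null_rate > 0.05)

# Volume check: alert if the record count deviates drastically
# from an expected baseline (here, yesterday's count of 10)
expected_rows, tolerance = 10, 0.5
volume_alert = abs(len(df) - expected_rows) / expected_rows > tolerance

# Value check: only known category values are allowed
allowed = {"basic", "pro"}
plans = pd.Series(["basic", "pro", "trial"])
unexpected = sorted(set(plans) - allowed)
```

Wired into the pipeline as a default step, each alert either blocks the run or pages someone, instead of letting bad data flow silently downstream.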
  11. Work Closely With Domain Experts
  • Clarify ambiguous fields – Ask if “status” means order status or payment status
  • Verify assumptions – Don’t assume “active” means the same to every team
  • Review sample records together – See if values match expectations
  • Learn business rules – Data cleaning without business rules is guesswork
  • Get sign-offs – Before making irreversible changes
  • Educate non-technical teams – Help them understand why cleaning matters
  12. Optimize for Performance
  • Filter early – Remove irrelevant data as soon as possible.
  • Use efficient joins – Avoid cartesian products.
  • Index where needed – Speeds up lookups during cleaning.
  • Batch processing – For large datasets, break into chunks.
  • Push down operations – Do cleaning in the database when possible.
  • Test scalability – Will your cleaning process handle 10x more data?
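"Filter early" and "batch processing" combine naturally: stream the source in chunks and discard irrelevant rows before they ever accumulate in memory. A sketch using pandas chunked reading (the CSV content is a stand-in for a large file):

```python
import pandas as pd
from io import StringIO

# Stand-in for a large file on disk
csv = StringIO("id,region,amount\n1,EU,10\n2,US,20\n3,EU,30\n4,US,40\n")

# Stream in chunks, filtering early so only relevant rows are kept
chunks = []
for chunk in pd.read_csv(csv, chunksize=2):
    chunks.append(chunk[chunk["region"] == "EU"])  # filter early
eu = pd.concat(chunks, ignore_index=True)
```

Pushing the same filter down into the database (a WHERE clause before extraction) is better still; the chunked pattern is the fallback when you must process files locally.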
  13. Build for Repeatability
  • Script everything – Avoid manual cleaning in Excel.
  • Use parameterized scripts – Makes them reusable across datasets.
  • Store scripts in version control – Git it or regret it.
  • Automate schedules – So cleaning runs on time without human intervention.
  • Document every step – Future you will thank present you.
  • Create modular functions – For common cleaning tasks like trimming text or removing duplicates.
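The "modular, parameterized functions" point might look like this in practice: small composable cleaners, each reusable across datasets, all living in version control (function names and defaults are illustrative):

```python
import pandas as pd

def clean_text(series: pd.Series, casing: str = "lower") -> pd.Series:
    """Reusable, parameterized cleaner: trim whitespace, normalize case."""
    out = series.astype("string").str.strip()
    return out.str.upper() if casing == "upper" else out.str.lower()

def drop_exact_dupes(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Reusable dedup step, parameterized by the uniqueness key."""
    return df.drop_duplicates(subset=keys, keep="first")

df = pd.DataFrame({"city": ["  Berlin", "berlin ", "Paris"]})
df["city"] = clean_text(df["city"])
df = drop_exact_dupes(df, ["city"])
```

Because the steps are plain functions with parameters, the same pipeline can be scheduled, tested, and pointed at a new dataset without any manual Excel surgery.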
  14. Monitor Post-Cleaning Quality
  • Re-run checks after cleaning – Did the changes actually improve quality?
  • Validate downstream outputs – Reports/dashboards should reflect expected changes.
  • Watch for regressions – Sometimes cleaning breaks something else.
  • Track KPIs – e.g., error rate, completeness score.
  • Share results with stakeholders – Show the value of your work.
  • Iterate – Data cleaning is never truly “done.”
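"Did the changes actually improve quality?" can be answered with a before/after KPI; a sketch of a simple completeness score (the data and the choice of KPI are illustrative):

```python
import pandas as pd

before = pd.DataFrame({"email": [None, "a@x.com", None, "b@x.com"]})
after = pd.DataFrame({"email": ["unknown", "a@x.com", "unknown", "b@x.com"]})

def completeness(df: pd.DataFrame) -> float:
    """Share of non-null cells: one simple post-cleaning quality KPI."""
    return float(df.notna().mean().mean())

# Re-run the same check after cleaning and compare
score_before = completeness(before)
score_after = completeness(after)
improved = score_after > score_before
```

Tracking a handful of such KPIs over time is also how you catch regressions, when a cleaning change quietly breaks something else.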
  15. Keep a Data Quality Mindset Always
  • Don’t chase perfection – Focus on what matters most for the use case.
  • Think prevention – Fix issues at the source, not just downstream.
  • Stay curious – Ask why an issue exists, not just how to fix it.
  • Balance speed vs accuracy – Sometimes “good enough” is enough to move forward.
  • Document lessons learned – Every dataset teaches you something.
  • Be the advocate – If you don’t care about data quality, no one else will.

Source: Shubham Srivastava, Amazon Data Engineer