Data preparation

27 August 2025

Some good pointers on data cleansing, which also double as a checklist of things to think about at the start of a data project, at least on the data engineering side.

  1. Always Start With Understanding the Data
  • Know your source – Where is the data coming from? Internal database? Third-party API? Manual upload?
  • Understand its purpose – Why was this data collected? Who will use it?
  • Check data freshness – Is it near real-time, daily batch, or static historical data?
  • Understand scale – How many rows/records are we talking about? This will affect tooling and approach
  • Look for key business drivers – Which columns/fields matter most to the business use case?
  • Get context from stakeholders – Talk to PMs, analysts, or SMEs before you write a single cleaning script
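A first pass at "understanding the data" can be as simple as a quick profile of scale, freshness, and completeness to bring to stakeholders. A minimal sketch with pandas (the column names and data here are illustrative, not from the source):

```python
import pandas as pd

# Illustrative raw extract; in practice this comes from the source system
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "signup_date": ["2025-01-05", "2025-02-11", None, "2025-03-02"],
    "plan": ["basic", "pro", "basic", "pro"],
})

# Scale: how many rows/records are we talking about?
row_count = len(df)

# Freshness proxy: most recent timestamp present in the data
latest_signup = pd.to_datetime(df["signup_date"]).max()

# Null profile per column, a good conversation starter with SMEs
null_rates = df.isna().mean()
```

Even this much tells you whether you are dealing with a daily batch or stale history, and which fields stakeholders should weigh in on first.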
  2. Define the Schema Early
  • Choose correct data types – int, float, string, timestamp, boolean; pick precisely
  • Apply constraints – Not Null, Unique, Primary Key, Foreign Key where applicable
  • Default values – Decide defaults for missing fields only if business rules allow
  • Precision and scale – For decimals, know how many digits you need after the point
  • Data size control – Avoid VARCHAR(255) everywhere; optimize for reality
  • Match schema to usage – The schema should reflect how data will be queried and processed
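One way to pin the schema down early is to declare expected types in code and cast against them, so mismatches fail loudly on day one. A sketch in plain pandas (table and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical declared schema: column -> dtype chosen precisely
expected_schema = {
    "order_id": "int64",
    "amount": "float64",
    "currency": "string",
    "created_at": "datetime64[ns]",
}

raw = pd.DataFrame({
    "order_id": ["1", "2", "3"],          # arrived as strings
    "amount": ["9.99", "20.00", "5.50"],
    "currency": ["USD", "USD", "EUR"],
    "created_at": ["2025-08-01", "2025-08-02", "2025-08-03"],
})

# Cast each column to its declared type; bad values raise immediately
typed = raw.copy()
typed["order_id"] = typed["order_id"].astype("int64")
typed["amount"] = typed["amount"].astype("float64")
typed["currency"] = typed["currency"].astype("string")
typed["created_at"] = pd.to_datetime(typed["created_at"])

# Enforce NOT NULL / UNIQUE-style constraints on the primary key
assert typed["order_id"].notna().all()
assert typed["order_id"].is_unique
```

In a real warehouse the same intent lives in DDL constraints; the point is that the types and constraints are decided before the data flows, not after.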
  3. Validate Data Types & Formats
  • Check for mismatches – Numeric columns containing strings (e.g., “N/A” in an int column)
  • Enforce formats – Dates should follow a standard (ISO 8601 or company format)
  • Parse complex types – JSON fields, arrays, and structs; ensure they’re valid and parsable
  • Catch unit mismatches – e.g., “km” vs “miles” in the same column
  • Test with sampling – Randomly check 100–200 rows to spot patterns
  • Automate validation – Use scripts or frameworks like Great Expectations to enforce rules
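The "N/A in an int column" and date-format checks above can be sketched in a few lines of pandas, where coercing bad values to NaN/NaT makes mismatches countable (the sample values are illustrative):

```python
import pandas as pd

# A column that should be numeric but contains "N/A"
qty = pd.Series(["10", "3", "N/A", "7"], name="quantity")

# Coerce non-numeric values to NaN so mismatches are easy to surface
parsed = pd.to_numeric(qty, errors="coerce")
bad_rows = qty[parsed.isna()]

# Enforce one date standard: strict parsing turns non-conforming
# values into NaT instead of silently guessing the format
dates = pd.Series(["2025-08-27", "27/08/2025"])
iso = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
nonconforming = int(iso.isna().sum())
```

A framework like Great Expectations wraps the same idea in declarative, reusable rules, but the coerce-and-count pattern is the core of it.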
  4. Handle Missing Data Properly
  • Identify NULLs – Know exactly which fields are missing and how often
  • Decide on action – Drop, fill, or leave missing based on use case
  • Use domain logic – If birth_date is missing, you can’t just fill with 01-01-1900 without reason
  • Flag critical NULLs – Create alerts when important fields (like customer_id) are missing
  • Avoid blind imputation – Filling missing values can distort downstream analysis
  • Document decisions – Always record why you handled missing data a certain way
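The decisions above, counting NULLs, flagging critical ones, and filling only where business rules allow, might look like this sketch (column names and the "unknown" marker are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C1", None, "C3", "C4"],
    "birth_date": [None, "1990-05-01", None, "1985-11-20"],
    "country": ["US", "US", None, "DE"],
})

# Know exactly which fields are missing and how often
null_counts = df.isna().sum()

# Flag critical NULLs: customer_id must never be missing
critical_missing = int(df["customer_id"].isna().sum())

# Domain-driven decision: leave birth_date missing (no defensible
# default exists), but mark country with an explicit "unknown"
# rather than inventing a fake value
cleaned = df.copy()
cleaned["country"] = cleaned["country"].fillna("unknown")
```

Note that birth_date is deliberately left as-is; blind imputation there would distort any age-based analysis downstream.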
  5. Detect & Remove Duplicates
  • Define uniqueness – What makes a record unique? Single field or combination?
  • Check both exact and fuzzy duplicates – E.g., same customer name with slightly different spelling
  • Decide on deduplication strategy – Keep first, keep last, or merge data
  • Flag potential duplicates – Don’t delete blindly; verify with business teams
  • Use hashing – Create hash keys for comparison when datasets are large
  • Automate dedup checks – Schedule scripts to run regularly for recurring data sources
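A minimal sketch of exact dedup plus hash keys for large-scale comparison (names and the uniqueness key are made up; the fuzzy case, same email with different name spellings, is only flagged, not auto-merged):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({
    "customer": ["Jon Smith", "Jon Smith", "John Smith"],
    "email": ["jon@example.com", "jon@example.com", "jon@example.com"],
})

# Exact duplicates on the chosen uniqueness key
exact_dupes = int(df.duplicated(subset=["customer", "email"]).sum())

# Keep-first deduplication strategy
deduped = df.drop_duplicates(subset=["customer", "email"], keep="first")

# Hash keys make comparisons cheap when datasets are large
df["row_hash"] = (
    df["customer"].str.lower() + "|" + df["email"].str.lower()
).map(lambda s: hashlib.sha256(s.encode()).hexdigest())

# Same email, different names: fuzzy-duplicate candidates for review
fuzzy_candidates = deduped.groupby("email").size()
```

"Jon Smith" vs "John Smith" survives the exact dedup on purpose; that pair goes to the business team, not to a delete statement.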
  6. Standardize Units & Formats
  • Pick one unit – Convert all weights, distances, currencies to a single standard
  • Use consistent casing – Lowercase or uppercase text fields consistently
  • Normalize whitespace – Trim spaces before/after values
  • Standardize date formats – Pick one and stick to it (e.g., YYYY-MM-DD)
  • Apply consistent naming – “US”, “USA”, and “United States” should be unified
  • Automate standardization – Use mapping tables or scripts to enforce rules
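Whitespace, casing, a mapping table for variant names, and a single unit can all be enforced in one short pass; a sketch (the mapping table and conversion target are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["  USA", "US ", "United States", "DE"],
    "weight_lb": [150.0, 200.0, 175.0, 180.0],
})

# Normalize whitespace and casing first, so the mapping table is small
df["country"] = df["country"].str.strip().str.upper()

# Mapping table unifies "US", "USA", and "United States"
country_map = {"USA": "US", "UNITED STATES": "US", "US": "US", "DE": "DE"}
df["country"] = df["country"].map(country_map)

# Pick one unit: convert pounds to kilograms everywhere
df["weight_kg"] = (df["weight_lb"] * 0.45359237).round(2)
```

Keeping the mapping in a table (rather than scattered if/else logic) is what makes the standardization automatable and auditable.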
  7. Identify & Fix Outliers
  • Define valid ranges – Age can’t be 250, salary can’t be -5000
  • Use statistical methods – Z-score, IQR to detect anomalies
  • Validate with domain experts – Sometimes “outliers” are actually correct
  • Flag for review – Never delete unusual values without business sign-off
  • Track frequency – Recurrent outliers may mean an upstream issue
  • Document everything – If you remove or alter data, log it
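Both detection styles mentioned above, hard domain ranges and statistical fences, can be run side by side; a sketch using an IQR fence on an illustrative age column:

```python
import pandas as pd

ages = pd.Series([25, 31, 42, 38, 250, 29, -3])

# Rule-based range check from domain knowledge: age can't be 250 or negative
range_violations = ages[(ages < 0) | (ages > 120)]

# Statistical check: 1.5x IQR fences
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = ages[(ages < lower) | (ages > upper)]

# Flag for review rather than deleting; sign-off comes from the business
flagged = ages.isin(range_violations)
```

When the two methods disagree, that is usually the interesting case to take to a domain expert, since some "outliers" turn out to be correct.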
  8. Enforce Referential Integrity
  • Check foreign keys – Ensure referenced IDs exist in parent tables
  • Avoid orphan records – A transaction without a valid customer should be flagged
  • Validate join keys – Ensure keys are consistent in type and format across tables
  • Cascade rules – Decide how deletes/updates propagate across related tables
  • Regular audits – Schedule scripts to check integrity at intervals
  • Collaborate with upstream teams – Fix issues at the source if possible
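The orphan-record check is essentially an anti-join: keep the child rows whose key finds no match in the parent table. A pandas sketch (table and key names are illustrative):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "txn_id": [10, 11, 12],
    "customer_id": [1, 2, 99],  # 99 has no parent record
})

# Left join with an indicator column; "left_only" rows are orphans
merged = transactions.merge(
    customers, on="customer_id", how="left", indicator=True
)
orphans = merged[merged["_merge"] == "left_only"]
```

Scheduled as a regular audit, this surfaces orphans for flagging; the durable fix is still upstream, with proper foreign-key constraints at the source.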
  9. Track Data Lineage
  • Know the source – Where each field came from
  • Document transformations – Every cleaning step should be traceable
  • Map dependencies – Know which downstream systems use your cleaned data
  • Use versioning – Keep snapshots of data and schemas over time
  • Audit regularly – Compare current and historical data for unexpected changes
  • Adopt a lineage tool – Like OpenLineage or DataHub for visibility
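Even before adopting a tool like OpenLineage or DataHub, "every cleaning step should be traceable" can start as a hand-rolled log of transformations; a toy sketch (table names and notes are invented for illustration):

```python
# Minimal hand-rolled lineage log; real setups would emit these events
# to a lineage tool instead of an in-memory list
lineage = []

def log_step(step, source, target, note):
    """Record one transformation so every cleaning step stays traceable."""
    lineage.append(
        {"step": step, "source": source, "target": target, "note": note}
    )

log_step("extract", "crm.orders", "staging.orders", "daily batch pull")
log_step("clean", "staging.orders", "clean.orders", "orphan rows flagged")
```

The value is in the discipline: every dataset can answer "where did this field come from, and what happened to it on the way?"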
  10. Build Automated Data Quality Checks
  • Threshold alerts – E.g., alert if NULL rate exceeds 5%
  • Schema drift detection – Notify if schema changes unexpectedly
  • Volume checks – Alert if record count deviates drastically
  • Value checks – Alert on unexpected category values
  • Freshness checks – Alert if data hasn’t been updated in X hours
  • Integrate into pipelines – Make quality checks a default pipeline step
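Several of these checks reduce to a metric plus a threshold; a sketch of the NULL-rate, volume, and value checks as simple boolean alerts (thresholds and categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, None, 3, 4, None, 6, 7, 8, 9, 10]})

# Threshold alert: fire if the NULL rate exceeds 5%
null_rate = df["customer_id"].isna().mean()
null_alert = bool(null_rate > 0.05)

# Volume check: alert if the record count deviates drastically
# from an expected baseline (here, yesterday's count of 10)
expected_rows, tolerance = 10, 0.5
volume_alert = abs(len(df) - expected_rows) / expected_rows > tolerance

# Value check: only known category values are allowed
allowed = {"basic", "pro"}
plans = pd.Series(["basic", "pro", "trial"])
unexpected = sorted(set(plans) - allowed)
```

Wired into the pipeline as a default step, each alert either blocks the run or pages someone, instead of letting bad data flow silently downstream.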
  11. Work Closely With Domain Experts
  • Clarify ambiguous fields – Ask if “status” means order status or payment status
  • Verify assumptions – Don’t assume “active” means the same to every team
  • Review sample records together – See if values match expectations
  • Learn business rules – Data cleaning without business rules is guesswork
  • Get sign-offs – Before making irreversible changes
  • Educate non-technical teams – Help them understand why cleaning matters
  12. Optimize for Performance
  • Filter early – Remove irrelevant data as soon as possible.
  • Use efficient joins – Avoid cartesian products.
  • Index where needed – Speeds up lookups during cleaning.
  • Batch processing – For large datasets, break into chunks.
  • Push down operations – Do cleaning in the database when possible.
  • Test scalability – Will your cleaning process handle 10x more data?
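"Filter early" and "batch processing" combine naturally: stream the source in chunks and discard irrelevant rows before they ever accumulate in memory. A sketch using pandas chunked reading (the CSV content is a stand-in for a large file):

```python
import pandas as pd
from io import StringIO

# Stand-in for a large file on disk
csv = StringIO("id,region,amount\n1,EU,10\n2,US,20\n3,EU,30\n4,US,40\n")

# Stream in chunks, filtering early so only relevant rows are kept
chunks = []
for chunk in pd.read_csv(csv, chunksize=2):
    chunks.append(chunk[chunk["region"] == "EU"])  # filter early
eu = pd.concat(chunks, ignore_index=True)
```

Pushing the same filter down into the database (a WHERE clause before extraction) is better still; the chunked pattern is the fallback when you must process files locally.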
  13. Build for Repeatability
  • Script everything – Avoid manual cleaning in Excel.
  • Use parameterized scripts – Makes them reusable across datasets.
  • Store scripts in version control – Git it or regret it.
  • Automate schedules – So cleaning runs on time without human intervention.
  • Document every step – Future you will thank present you.
  • Create modular functions – For common cleaning tasks like trimming text or removing duplicates.
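The "modular, parameterized functions" point might look like this in practice: small composable cleaners, each reusable across datasets, all living in version control (function names and defaults are illustrative):

```python
import pandas as pd

def clean_text(series: pd.Series, casing: str = "lower") -> pd.Series:
    """Reusable, parameterized cleaner: trim whitespace, normalize case."""
    out = series.astype("string").str.strip()
    return out.str.upper() if casing == "upper" else out.str.lower()

def drop_exact_dupes(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Reusable dedup step, parameterized by the uniqueness key."""
    return df.drop_duplicates(subset=keys, keep="first")

df = pd.DataFrame({"city": ["  Berlin", "berlin ", "Paris"]})
df["city"] = clean_text(df["city"])
df = drop_exact_dupes(df, ["city"])
```

Because the steps are plain functions with parameters, the same pipeline can be scheduled, tested, and pointed at a new dataset without any manual Excel surgery.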
  14. Monitor Post-Cleaning Quality
  • Re-run checks after cleaning – Did the changes actually improve quality?
  • Validate downstream outputs – Reports/dashboards should reflect expected changes.
  • Watch for regressions – Sometimes cleaning breaks something else.
  • Track KPIs – e.g., error rate, completeness score.
  • Share results with stakeholders – Show the value of your work.
  • Iterate – Data cleaning is never truly “done.”
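"Did the changes actually improve quality?" can be answered with a before/after KPI; a sketch of a simple completeness score (the data and the choice of KPI are illustrative):

```python
import pandas as pd

before = pd.DataFrame({"email": [None, "a@x.com", None, "b@x.com"]})
after = pd.DataFrame({"email": ["unknown", "a@x.com", "unknown", "b@x.com"]})

def completeness(df: pd.DataFrame) -> float:
    """Share of non-null cells: one simple post-cleaning quality KPI."""
    return float(df.notna().mean().mean())

# Re-run the same check after cleaning and compare
score_before = completeness(before)
score_after = completeness(after)
improved = score_after > score_before
```

Tracking a handful of such KPIs over time is also how you catch regressions, when a cleaning change quietly breaks something else.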
  15. Keep a Data Quality Mindset Always
  • Don’t chase perfection – Focus on what matters most for the use case.
  • Think prevention – Fix issues at the source, not just downstream.
  • Stay curious – Ask why an issue exists, not just how to fix it.
  • Balance speed vs accuracy – Sometimes “good enough” is enough to move forward.
  • Document lessons learned – Every dataset teaches you something.
  • Be the advocate – If you don’t care about data quality, no one else will.

Source: Shubham Srivastava, Amazon Data Engineer