Some good pointers on data cleansing, which also double as a good checklist of things to think about at the start of a data project, at least on the data engineering side.
- Always Start With Understanding the Data
- Know your source – Where is the data coming from? Internal database? Third-party API? Manual upload?
- Understand its purpose – Why was this data collected? Who will use it?
- Check data freshness – Is it near real-time, daily batch, or static historical data?
- Understand scale – How many rows/records are we talking about? This will affect tooling and approach
- Look for key business drivers – Which columns/fields matter most to the business use case?
- Get context from stakeholders – Talk to PMs, analysts, or SMEs before you write a single cleaning script
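Before writing any cleaning script, the scale and missingness questions above can be answered with a quick first-pass profile. A minimal pandas sketch (the sample frame and column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample standing in for a real source extract.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", None],
})

# First-pass profile: type, missingness, and cardinality per column.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_count": df.isna().sum(),
    "null_pct": (df.isna().mean() * 100).round(1),
    "distinct": df.nunique(dropna=True),
})
print(f"rows={len(df)}")
print(profile)
```

A profile like this is cheap to run on a sample and usually surfaces the first round of questions to take back to stakeholders.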
- Define the Schema Early
- Choose correct data types – Pick precisely: int, float, string, timestamp, boolean
- Apply constraints – Not Null, Unique, Primary Key, Foreign Key where applicable
- Default values – Decide defaults for missing fields only if business rules allow
- Precision and scale – For decimals, know how many digits you need after the point
- Data size control – Avoid VARCHAR(255) everywhere; optimize for reality
- Match schema to usage – The schema should reflect how data will be queried and processed
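One way to make the schema explicit early, sketched in pandas with an illustrative orders table (column names and types are assumptions, not a prescription):

```python
import pandas as pd

# Hypothetical target schema for an orders table.
SCHEMA = {
    "order_id": "int64",
    "amount": "float64",   # move to SQL DECIMAL with explicit precision/scale later
    "status": "string",
    "created_at": "datetime64[ns]",
}

raw = pd.DataFrame({
    "order_id": ["1", "2"],
    "amount": ["19.99", "5.00"],
    "status": ["shipped", "pending"],
    "created_at": ["2024-03-01", "2024-03-02"],
})

# Enforce the declared types instead of accepting whatever the loader inferred.
typed = raw.astype({"order_id": "int64", "amount": "float64", "status": "string"})
typed["created_at"] = pd.to_datetime(typed["created_at"])
```

Declaring the schema as data (the `SCHEMA` dict) also makes it easy to diff against what actually arrives, which pays off later for drift detection.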
- Validate Data Types & Formats
- Check for mismatches – Numeric columns containing strings (e.g., “N/A” in an int column)
- Enforce formats – Dates should follow a standard (ISO 8601 or company format)
- Parse complex types – JSON fields, arrays, structs; ensure they're valid and parsable
- Catch unit mismatches – e.g., “km” vs “miles” in the same column
- Test with sampling – Randomly check 100–200 rows to spot patterns
- Automate validation – Use scripts or frameworks like Great Expectations to enforce rules
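The "N/A in an int column" case above can be caught mechanically; a small sketch using pandas coercion as a stand-in for a fuller framework like Great Expectations:

```python
import pandas as pd

# "N/A" hiding in a column that should be numeric — a common mismatch.
s = pd.Series(["42", "N/A", "17"])

# Coerce: anything non-numeric becomes NaN, so the bad rows fall out.
nums = pd.to_numeric(s, errors="coerce")
bad_rows = s[nums.isna()]
print(bad_rows)
```

The same coerce-then-inspect pattern works for dates via `pd.to_datetime(..., errors="coerce")`.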
- Handle Missing Data Properly
- Identify NULLs – Know exactly which fields are missing and how often
- Decide on action – Drop, fill, or leave missing based on use case
- Use domain logic – If birth_date is missing, you can’t just fill with 01-01-1900 without reason
- Flag critical NULLs – Create alerts when important fields (like customer_id) are missing
- Avoid blind imputation – Filling missing values can distort downstream analysis
- Document decisions – Always record why you handled missing data a certain way
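The flag-don't-fill approach above might look like this in practice (the frame and the "customer_id is critical" rule are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, None, 103],
    "birth_date": [None, "1990-06-01", None],
})

# Flag critical NULLs loudly instead of silently filling them.
missing_ids = df["customer_id"].isna()
if missing_ids.any():
    print(f"ALERT: {missing_ids.sum()} rows missing customer_id")

# Leave non-critical gaps as NaN, but record the decision for the audit trail.
decisions = {"birth_date": "left missing; no defensible business default"}
```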
- Detect & Remove Duplicates
- Define uniqueness – What makes a record unique? Single field or combination?
- Check both exact and fuzzy duplicates – E.g., same customer name with slightly different spelling
- Decide on deduplication strategy – Keep first, keep last, or merge data
- Flag potential duplicates – Don’t delete blindly; verify with business teams
- Use hashing – Create hash keys for comparison when datasets are large
- Automate dedup checks – Schedule scripts to run regularly for recurring data sources
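The hashing idea above can be sketched as follows: build a key from the fields that define uniqueness, then deduplicate on the hash (fields and keep-first policy are illustrative choices):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Ray"],
    "email": ["ann@x.com", "ann@x.com", "bob@x.com"],
})

# Hash the fields that define uniqueness; cheap to store and compare at scale.
def row_hash(row):
    key = "|".join(str(v) for v in row)
    return hashlib.sha256(key.encode()).hexdigest()

df["dedup_key"] = df[["name", "email"]].apply(row_hash, axis=1)
deduped = df.drop_duplicates(subset="dedup_key", keep="first")
```

Fuzzy duplicates (misspelled names) need a different tool, e.g. string-similarity matching, and should be flagged rather than auto-merged.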
- Standardize Units & Formats
- Pick one unit – Convert all weights, distances, currencies to a single standard
- Use consistent casing – Lowercase or uppercase text fields consistently
- Normalize whitespace – Trim spaces before/after values
- Standardize date formats – Pick one and stick to it (e.g., YYYY-MM-DD)
- Apply consistent naming – “US”, “USA”, and “United States” should be unified
- Automate standardization – Use mapping tables or scripts to enforce rules
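Whitespace, casing, and the "US/USA/United States" problem above can all be handled in one mapping-table pass; a minimal sketch (the mapping is an illustrative stub):

```python
import pandas as pd

# Mapping table: every known synonym points at one canonical value.
COUNTRY_MAP = {"us": "US", "usa": "US", "united states": "US"}

s = pd.Series(["  USA", "us", "United States "])
normalized = (
    s.str.strip()        # normalize whitespace
     .str.lower()        # consistent casing before lookup
     .map(COUNTRY_MAP)   # unify synonyms via the mapping table
)
```

Unmapped values come out as NaN, which conveniently doubles as a report of synonyms the mapping table doesn't cover yet.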
- Identify & Fix Outliers
- Define valid ranges – Age can’t be 250, salary can’t be -5000
- Use statistical methods – Z-score, IQR to detect anomalies
- Validate with domain experts – Sometimes “outliers” are actually correct
- Flag for review – Never delete unusual values without business sign-off
- Track frequency – Recurrent outliers may mean an upstream issue
- Document everything – If you remove or alter data, log it
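The IQR method mentioned above, sketched on a toy age column (the 1.5×IQR fence is the conventional default, not a universal rule):

```python
import pandas as pd

ages = pd.Series([25, 31, 28, 45, 250])  # 250 is clearly invalid for an age

q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

# Flag for review rather than deleting — some "outliers" are correct.
flagged = ages[outlier_mask]
```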
- Enforce Referential Integrity
- Check foreign keys – Ensure referenced IDs exist in parent tables
- Avoid orphan records – A transaction without a valid customer should be flagged
- Validate join keys – Ensure keys are consistent in type and format across tables
- Cascade rules – Decide how deletes/updates propagate across related tables
- Regular audits – Schedule scripts to check integrity at intervals
- Collaborate with upstream teams – Fix issues at the source if possible
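The orphan-record check above is an anti-join: keep the child rows whose key has no match in the parent table. A minimal sketch with made-up customer and transaction frames:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame({"txn_id": [10, 11], "customer_id": [2, 99]})

# Anti-join: transactions whose customer_id has no parent row.
orphans = transactions[~transactions["customer_id"].isin(customers["customer_id"])]
```

In SQL the same check is a `LEFT JOIN ... WHERE parent.id IS NULL`; either way, orphans should be flagged upstream, not silently dropped.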
- Track Data Lineage
- Know the source – Where each field came from
- Document transformations – Every cleaning step should be traceable
- Map dependencies – Know which downstream systems use your cleaned data
- Use versioning – Keep snapshots of data and schemas over time
- Audit regularly – Compare current and historical data for unexpected changes
- Adopt a lineage tool – Like OpenLineage or DataHub for visibility
- Build Automated Data Quality Checks
- Threshold alerts – E.g., alert if NULL rate exceeds 5%
- Schema drift detection – Notify if schema changes unexpectedly
- Volume checks – Alert if record count deviates drastically
- Value checks – Alert on unexpected category values
- Freshness checks – Alert if data hasn’t been updated in X hours
- Integrate into pipelines – Make quality checks a default pipeline step
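A threshold alert like the 5% NULL-rate example above can be a few lines inside the pipeline (the frame and threshold are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", None, None, "d@x.com"]})

NULL_RATE_THRESHOLD = 0.05  # 5%, per the example above

failures = []
null_rate = df["email"].isna().mean()
if null_rate > NULL_RATE_THRESHOLD:
    failures.append(f"email null rate {null_rate:.0%} exceeds threshold")

# A non-empty failures list would fail the pipeline step or page someone.
```

The same pattern extends to volume, freshness, and category-value checks: compute a metric, compare to a threshold, collect failures.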
- Work Closely With Domain Experts
- Clarify ambiguous fields – Ask if “status” means order status or payment status
- Verify assumptions – Don’t assume “active” means the same to every team
- Review sample records together – See if values match expectations
- Learn business rules – Data cleaning without business rules is guesswork
- Get sign-offs – Before making irreversible changes
- Educate non-technical teams – Help them understand why cleaning matters
- Optimize for Performance
- Filter early – Remove irrelevant data as soon as possible.
- Use efficient joins – Avoid cartesian products.
- Index where needed – Speeds up lookups during cleaning.
- Batch processing – For large datasets, break into chunks.
- Push down operations – Do cleaning in the database when possible.
- Test scalability – Will your cleaning process handle 10x more data?
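Filter-early and batch processing combine naturally: do the cheap filtering inside each chunk so the full dataset never sits in memory. A sketch (the `orders.csv` path and `amount` column are hypothetical):

```python
import pandas as pd

def clean_chunk(chunk):
    # Filter early: drop irrelevant rows before any heavier work.
    return chunk[chunk["amount"] > 0]

# Hypothetical large file, processed in chunks instead of one load:
# for chunk in pd.read_csv("orders.csv", chunksize=100_000):
#     out = clean_chunk(chunk)
#     ...  # write each cleaned chunk onward

# Demonstrated on an in-memory frame:
df = pd.DataFrame({"amount": [10.0, -5.0, 3.0]})
cleaned = clean_chunk(df)
```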
- Build for Repeatability
- Script everything – Avoid manual cleaning in Excel.
- Use parameterized scripts – Makes them reusable across datasets.
- Store scripts in version control – Git it or regret it.
- Automate schedules – So cleaning runs on time without human intervention.
- Document every step – Future you will thank present you.
- Create modular functions – For common cleaning tasks like trimming text or removing duplicates.
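The modular-functions idea above in miniature: small, composable steps that can be reused and tested independently (function names and the `name` column are illustrative):

```python
import pandas as pd

# Small, composable cleaning steps — reusable across datasets.
def trim_text(df, cols):
    for c in cols:
        df[c] = df[c].str.strip()
    return df

def drop_exact_dupes(df):
    return df.drop_duplicates()

def clean(df):
    return drop_exact_dupes(trim_text(df, ["name"]))

df = pd.DataFrame({"name": [" Ann ", "Ann", "Bob"]})
result = clean(df)
```

Because each step is a plain function, the same pipeline can be parameterized, scheduled, and version-controlled as the checklist suggests.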
- Monitor Post-Cleaning Quality
- Re-run checks after cleaning – Did the changes actually improve quality?
- Validate downstream outputs – Reports/dashboards should reflect expected changes.
- Watch for regressions – Sometimes cleaning breaks something else.
- Track KPIs – e.g., error rate, completeness score.
- Share results with stakeholders – Show the value of your work.
- Iterate – Data cleaning is never truly “done.”
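A completeness score is one of the simplest KPIs to track before and after cleaning; a sketch with made-up before/after frames:

```python
import pandas as pd

before = pd.DataFrame({"email": ["a@x.com", None, "bad", None]})
after = pd.DataFrame({"email": ["a@x.com", "c@x.com", "bad", None]})

def completeness(df, col):
    # Share of non-null values in a column, 0.0–1.0.
    return round(df[col].notna().mean(), 2)

kpis = {
    "completeness_before": completeness(before, "email"),
    "completeness_after": completeness(after, "email"),
}
```

Numbers like these make the "share results with stakeholders" step concrete: the cleaning either moved the KPI or it didn't.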
- Keep a Data Quality Mindset Always
- Don’t chase perfection – Focus on what matters most for the use case.
- Think prevention – Fix issues at the source, not just downstream.
- Stay curious – Ask why an issue exists, not just how to fix it.
- Balance speed vs accuracy – Sometimes “good enough” is enough to move forward.
- Document lessons learned – Every dataset teaches you something.
- Be the advocate – If you don’t care about data quality, no one else will.
Source: Shubham Srivastava, Amazon Data Engineer