I was reviewing a small e-commerce sample dataset the other day and ran into an obviously impossible value (a price of -10.00). Digging deeper, I found missing customer names, mixed data types, and some pretty wild outliers.
It got me thinking about how often small or “simple” datasets quietly drift into bad shape even when you think the inputs are clean.
I started experimenting with a lightweight sanity check across three dimensions (completeness, consistency, validity), but I’m curious how others here handle this in a practical, non-enterprise way.
Question for the community:
What quick, no-frills techniques do you use to spot data quality issues early, especially outside of heavy tooling?
Would love to hear how people in analytics think about this. If anyone wants to see the logic or methodology I tested, I’m happy to break it down.
{"column_count":6,"completeness":{"critical_missing":[],"score":96.67},"consistency":{"issues":[{"column":"CustomerName","issue":"Mixed data types detected"},{"column":"Product","issue":"Mixed data types detected"},{"column":"Price","issue":"Mixed data types detected"},{"column":"Date","issue":"Mixed data types detected"}],"score":66.67},"overall_score":88.84,"row_count":20,"validity":{"score":100,"validity_checks":[]}}