What Is Dirty Data?
Last week, I wrote about “normal” data - and how it can be a challenge making sure that your data is normal!
Fixing non-normal data is one of the challenges we face when helping customers migrate their database into a new tool - but other pain point for many customers is what we call “dirty data” - and - there aren’t a lot of tools available to help automate cleaning up dirty data - here’s why:
Dirty Data stems from inaccurate, incomplete, or redundant data, rather than data that is in the wrong format.
Sometimes, dirty data masquerades as non-normal data - for instance, if you don’t have strong coding or strong policies in place, you’ll end up with inaccurate or redundant data, like this:
FName LName Street City State Zip
Patrick Shaw 403 23rd Ave S Seattle WA 98144
Pat Shaw 403 23rd Ave S Seattle WA 98144
Shaw 403 23rd Ave S Seattle WA 98144
In this example - it can be challenging figuring out if each of those contact records is the same person - and over time - it gets worse. Perhaps I have two bank accounts, and one has a PO Box and the other my street address. Then you end up with the same first and last name - but a different address. You can see that this gets complicated - especially if I make a donation to your agency! But this is redundant, or non-normal data - and dirty data is different. Here are some examples of dirty data:
- A date field that has various information in it:
- 03/31/2007
- March
- 2nd Quarter
- Not sure
- Number fields that contain a combination of numbers and text:
- 3
- At least three
- Not sure
- Variations and misspellings of names:
- WA
- Wash.
- Washington
- Phone fields that contain more than one phone number or text:
- (206) 286-8880 and (206) 957-8880
- (206) 286-8880-Home
- Addresses that aren’t split into consistent fields for Street, City, State, Zip
You can see the challenges posed by dirty data - and how it can be hard to automate a fix - it usually involved Excel, auto filtering, search and replace, and so on.
Here are some tips for preventing dirty data:
- Make sure that you look up names in your database BEFORE creating a new record!
- Make sure that you can have an “alias” or a “nickname” field, so you can keep track of both a legal or official name as well as a nickname.
- Use Data Validation in your database. For instance, you can make both first and last name required fields - so that you won’t end up with a record without both.
- Create a “Quick Tips” guide for using your database, and spell out how you’d like people to enter data.
(Special thanks to the database experts at NPower Seattle for their help with this post, particularly Evan Callahan and V. Waters).

Annad Sethupathy wrote:
Hi Patrick,
This is a great article. We plan to forward to this info to some of our clients considering database migration projects!
Thanks for putting this together!
Anand
Posted on 02-Jul-07 at 7:28 am | Permalink