Clean data is critical for data analysis and data mining activities. Organizations of all sizes often struggle with the ability to get and maintain clean data.
To better understand how you can make sure your data becomes and stays clean, we will first discuss what ‘clean’ data really means.
What Is ‘Clean’ Data?
If you have clean data, it means that the data currently stored in your systems is free of irrelevant, inaccurate, incomplete, and corrupt records.
For example, let’s say you operate a clothing goods store, and in exchange for a customer rewards card, you collect customers’ addresses.
This data helps you analyze your customers and run more effective, more accurate targeted marketing campaigns.
Continuing with this example, let’s assume that the cashiers need to manually key in every address field when a customer signs up for the rewards card. Unfortunately, you have one disgruntled, part-time cashier who has decided she doesn’t want to make the effort, so whenever a customer opts into the rewards card, she keys in a generic address: 123 Alpha Lane, OH, 12345.
This is a corrupt record, and it will negatively impact any data analysis you try to do. If she just mistyped part of the address, it would be an inaccurate record, and if she left out part of the address it would be an incomplete record.
In any of the above examples, every address input incorrectly is ‘bad data’ and means your data is no longer clean.
Why Is Clean Data So Important?
If your data is not clean, any information you try to glean from it is unreliable, and it can lead to bad decision-making.
To better demonstrate this, let’s continue with the above example.
You’ve decided you want to run a promotion where you will send out coupons in the mail.
In order to target consumers most likely to shop at your store, you want to pull customer addresses so that you can send your coupons to them and their close neighbors.
Upon ‘mining’ your data, you find out that 30% of your customers live at 123 Alpha Lane, OH, 12345.
You can tell right away that this is impossible. This means you now need to exclude 30% of your customer data, potentially missing out on profitable marketing opportunities.
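As a rough illustration of how a check like this might be automated, the Python sketch below flags any address that accounts for an implausibly large share of your records. The function name, the sample addresses, and the 20% threshold are all assumptions made for the example; a sensible threshold for a real customer file would be far lower.

```python
from collections import Counter

def suspicious_addresses(addresses, max_share=0.2):
    """Return {address: share} for addresses that exceed max_share of all records."""
    # Normalize case and spacing so trivial differences do not hide repeats.
    normalized = [" ".join(a.lower().split()) for a in addresses]
    total = len(normalized)
    if total == 0:
        return {}
    counts = Counter(normalized)
    return {addr: n / total for addr, n in counts.items() if n / total > max_share}

# Made-up sample: 3 of 10 records share the generic address from the example.
records = ["123 Alpha Lane, OH, 12345"] * 3 + [
    "9 Birch Road, OH, 43004",
    "41 Cedar Street, OH, 43016",
    "77 Dogwood Court, OH, 43201",
    "5 Elm Avenue, OH, 43004",
    "18 Fir Place, OH, 43016",
    "260 Grove Drive, OH, 43201",
    "34 Hazel Way, OH, 43004",
]

flagged = suspicious_addresses(records)
for address, share in flagged.items():
    print(f"{address}: {share:.0%} of records")
```

A report like this will not tell you what the correct addresses are, but it does tell you which records to exclude or investigate before you spend money on a mailing.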
An even worse scenario is that you don’t realize the data isn’t clean, and you spend a huge portion of your marketing budget sending coupons out to random people who are not in your target demographic.
Your marketing campaign won’t drive as much store traffic, and without recognizing that you had bad data, you may never understand why.
The Problem with Duplicate Customer Accounts
Even if you store validated, mailable postal addresses in your customer database, you can still have unreliable data in the database that could result in misleading data analysis insights. How is that possible?
Let’s assume an online store has a customer who likes to buy two or three times a year. She does not have a login ID. Every time she makes a purchase and spends $50, the online store sets her up with a new customer account number. Internally, she presents as three newly acquired customers, each with an average order value (AOV) of $50, yet she really should be viewed as a retained multi-buyer with year-to-date (YTD) sales of $150.
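To make the arithmetic concrete, here is a small Python sketch of the same shopper seen two ways: once per account number, and once as a single person. The account numbers and the shopper label are made up for illustration, and in a real database the shopper label is exactly the piece you do not yet have.

```python
# Three $50 orders from the same shopper, each booked under a new account number.
# Account numbers and the shopper label are hypothetical values.
orders = [
    {"account": "A-1001", "shopper": "S-1", "amount": 50.0},
    {"account": "A-2417", "shopper": "S-1", "amount": 50.0},
    {"account": "A-3920", "shopper": "S-1", "amount": 50.0},
]

def summarize(orders, key):
    """Group orders by the given key and compute order count, total sales, and AOV."""
    summary = {}
    for order in orders:
        row = summary.setdefault(order[key], {"orders": 0, "sales": 0.0})
        row["orders"] += 1
        row["sales"] += order["amount"]
    for row in summary.values():
        row["aov"] = row["sales"] / row["orders"]
    return summary

by_account = summarize(orders, "account")  # how the database sees her
by_shopper = summarize(orders, "shopper")  # how she should be seen

print(len(by_account), "customers, each with sales of", by_account["A-1001"]["sales"])
print(len(by_shopper), "customer with sales of", by_shopper["S-1"]["sales"])
```

The AOV is $50 in both views. What changes is the customer count and the sales credited to each customer, and those are the figures that retention and scoring decisions depend on.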
Let’s assume 20% of the customer accounts in the database are duplicates. Many of the repeat customers are mistakenly classified as new acquisitions. Because the database has not been deduped, the marketing department unknowingly makes strategic customer contact decisions based on flawed data insights. Marketing underestimates its customer retention rate and believes it is growing its active customer count.
To make matters worse, they score their customer file for email and catalog marketing campaigns. Unfortunately, the high duplication rate results in many of their top customers being classified as marginal customers and now they are sent fewer emails and catalogs. Sales are not being maximized because of a missed opportunity.
How to Get Clean Data
So how do we clean our data?
The most efficient way to get clean data is to have standardized automated controls in place around the entry of data.
For example, if your system refused to allow fake zip codes and blank fields, it would reduce corrupt and incomplete data right from the start. If your system only requires you to type in the street address and zip, and everything else auto-populates, it would cut down on incorrect and incomplete data.
The more rules you have restricting what values can be entered in the fields, the more likely you are to prevent bad data.
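As a minimal sketch of what such entry rules might look like, the Python function below rejects blank fields and ZIP codes that are malformed or missing from a reference list. The field names and the tiny `KNOWN_ZIPS` set are assumptions made for the example; a real system would check against a full postal reference file or an address verification service.

```python
import re

REQUIRED_FIELDS = ("street", "city", "state", "zip")
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # 5-digit or ZIP+4 format

# Hypothetical stand-in for a real postal reference table.
KNOWN_ZIPS = {"43004", "43016", "43201"}

def validate_address(record):
    """Return a list of problems found in the record; an empty list means it passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            errors.append(f"{field} is blank")
    zip_code = str(record.get("zip") or "").strip()
    if zip_code:
        if not ZIP_PATTERN.match(zip_code):
            errors.append("zip is not in 5-digit or ZIP+4 format")
        elif zip_code[:5] not in KNOWN_ZIPS:
            errors.append("zip is not in the reference list")
    return errors

# The cashier's generic entry fails two rules at once.
generic = {"street": "123 Alpha Lane", "city": "", "state": "OH", "zip": "12345"}
print(validate_address(generic))
```

Returning every problem at once, rather than stopping at the first, lets the entry screen tell the cashier everything that needs fixing in a single pass.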
However, keep in mind that the larger and more complex your business is, the harder it will be to do this, as you are more likely to have unique exceptions you want to allow when necessary.
Another method of getting clean data is relying on well-trained, motivated staff who pay attention to details.
After data entry, process the addresses using data hygiene software. Next, run a dedupe to identify duplicate customer records. Assign these duplicate customer records a common Dupe Group ID. You can now use the Dupe Group ID to aggregate sales and other customer marketing data for these duplicate records.
Customer metrics will be much more insightful, and the small investment in keeping your data clean will pay dividends now and in the future.
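The Dupe Group ID idea can be sketched in a few lines of Python. This example only matches records whose name, street, and ZIP are identical after normalizing case and spacing, which is a deliberate simplification: dedicated dedupe software typically also handles misspellings, nicknames, and address variants. All field names and sample records are hypothetical.

```python
from collections import defaultdict

def match_key(record):
    """Build a simple match key: name, street, and ZIP with case and spacing normalized."""
    parts = (record["name"], record["street"], record["zip"])
    return tuple(" ".join(p.lower().split()) for p in parts)

def assign_dupe_groups(records):
    """Return copies of the records with a shared dupe_group_id for matching keys."""
    group_ids = {}
    grouped = []
    for record in records:
        key = match_key(record)
        if key not in group_ids:
            group_ids[key] = len(group_ids) + 1
        grouped.append({**record, "dupe_group_id": group_ids[key]})
    return grouped

def sales_by_group(records):
    """Aggregate sales across all accounts that share a Dupe Group ID."""
    totals = defaultdict(float)
    for record in records:
        totals[record["dupe_group_id"]] += record["sales"]
    return dict(totals)

accounts = [
    {"account": "A-1001", "name": "Jane Doe", "street": "9 Birch Road", "zip": "43004", "sales": 50.0},
    {"account": "A-2417", "name": "JANE DOE", "street": "9 birch road", "zip": "43004", "sales": 50.0},
    {"account": "A-3920", "name": "Jane  Doe", "street": "9 Birch Road", "zip": "43004", "sales": 50.0},
    {"account": "A-4055", "name": "John Smith", "street": "41 Cedar Street", "zip": "43016", "sales": 80.0},
]

grouped = assign_dupe_groups(accounts)
totals = sales_by_group(grouped)
print(totals)
```

Because the original account numbers are kept alongside the Dupe Group ID, nothing is lost: you can still report by account when you need to, and roll up by group for customer-level metrics.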
How to Keep Data Clean
Even the best systems and processes will not be foolproof. To ensure your data stays clean, you will need controls and governance in place to check the data on a regular basis.
This can be as simple as someone eyeballing reports, or as complex as investing in expensive data profiling software.
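Somewhere between those two extremes sits a short script that you run on a schedule. The Python sketch below reports what share of records has each field filled in, which is one basic profiling check among many. The function name, field names, and sample records are assumptions for illustration.

```python
def completeness_report(records, fields):
    """Return {field: share of records where the field is filled in}."""
    total = len(records)
    report = {}
    for field in fields:
        filled = sum(1 for r in records if str(r.get(field) or "").strip())
        report[field] = filled / total if total else 0.0
    return report

# Made-up sample: one of four records is missing its ZIP code.
customers = [
    {"street": "9 Birch Road", "zip": "43004"},
    {"street": "41 Cedar Street", "zip": "43016"},
    {"street": "77 Dogwood Court", "zip": ""},
    {"street": "5 Elm Avenue", "zip": "43004"},
]

report = completeness_report(customers, ["street", "zip"])
for field, share in report.items():
    print(f"{field}: {share:.0%} complete")
```

Tracking these percentages from one run to the next makes a sudden drop easy to spot, which is often the first sign that something has gone wrong at the point of entry.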
Customer addresses should be run through address hygiene software and deduped either weekly or monthly. If you mail catalogs or direct mail pieces, make sure you use National Change Of Address (NCOA) on a regular basis to update customer moves and stay compliant with the postal service.
If you need help managing your customer records, give us a call. Hansel Group Marketing can clean and dedupe customer data, manage catalog direct mail campaigns, and provide customer marketing data insights.