8 Steps to Clean Your Data
Data cleaning is the process of removing or correcting inaccurate information in a data set; that could mean missing values, misspelled words, or unwanted prefixes, to name a few issues.
Data cleaning is not the most exciting and glamorous of processes. It can be rather tedious and time-consuming, but you know what, I don’t mind it. I’m a bit of a neat freak, so I get great satisfaction from cleaning and correctly arranging things.
Some data sets are incredibly icky and require a bit of brain juice to figure out the best way to clean them. It can involve manoeuvres like isolating a subset of the data, cleaning it, and returning it to the main set. These data sets are good workouts to stretch your brain. A great exercise to ward off Alzheimer’s!
While many may find cleaning data dull, it is vital to your downstream process. Any analysis using uncleaned and subpar data will likely be flawed even if you follow every step of the data analytics process to a T; as the saying goes, garbage in, garbage out (GIGO).
So, are we ready for the 8 steps to cleaning data?
There are many approaches one can take to cleaning data. What I have here are high-level activities.
Right, let’s roll up our sleeves and get started!
Step 1: Remove Irrelevant Data
Review your data and identify what data is relevant and irrelevant to your analysis. Discard any records that are not relevant to your downstream need. For instance, if I were analysing the effectiveness of a promotional email campaign on a newly launched course on Productivity Hacks for Writers, I would remove all the email records not associated with this campaign from the dataset.
Removal of duplicate data is also part of this step. Duplicates are common when we stitch multiple datasets together, scrape data online, or receive data from third-party sources.
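As a minimal sketch of this step in pandas (the campaign name and email records here are made up for illustration):

```python
import pandas as pd

# Hypothetical email records from several campaigns.
emails = pd.DataFrame({
    "campaign": ["productivity_hacks", "spring_sale",
                 "productivity_hacks", "productivity_hacks"],
    "recipient": ["a@example.com", "b@example.com",
                  "c@example.com", "a@example.com"],
    "opened": [True, False, True, True],
})

# Keep only the records relevant to the campaign under analysis...
relevant = emails[emails["campaign"] == "productivity_hacks"]

# ...then drop exact duplicate rows, which are common after stitching datasets.
cleaned = relevant.drop_duplicates()
```

Here the last row duplicates the first, so `cleaned` ends up with two records, both from the campaign we care about.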
Step 2: Fix Structural Errors
Structural errors include misspellings, inconsistent naming conventions, improper capitalisation, leading and trailing spaces, and extra white space between words. Variations of the same thing, such as Seven-Eleven, 7-11, seven eleven and 711, will appear as separate categories in your data set.
Keep an eye out for underscores, dashes, and other rogue punctuation!
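A quick sketch of fixing these errors in pandas, using the Seven-Eleven variants from above (the canonical name chosen is illustrative):

```python
import pandas as pd

# Store names with the structural errors described above.
stores = pd.Series([" Seven-Eleven", "7-11 ", "seven eleven", "711", "7-11"])

# Trim stray spaces, collapse extra white space, and lowercase everything.
tidy = (stores.str.strip()
              .str.replace(r"\s+", " ", regex=True)
              .str.lower())

# Map the known variants onto one canonical name (mapping is illustrative).
canonical = tidy.replace({
    "seven-eleven": "7-eleven",
    "seven eleven": "7-eleven",
    "7-11": "7-eleven",
    "711": "7-eleven",
})
```

After this, all five values collapse into a single category instead of five.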
Step 3: Standardise the Data
This step is closely related to the previous one. Here, you want to check that every value in a column follows the same rule. For example, all the values in the same column should be in lowercase or in proper case, not a mix of both. It also means that all numerical data should use the same unit of measurement; having some values in km and some in miles causes problems.
Dates are another thing you should be mindful of. The United States puts the month before the day; in Australia, we put the day before the month.
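A minimal sketch of both standardisation problems in pandas, with made-up trip data:

```python
import pandas as pd

# Hypothetical trips recorded in mixed units.
trips = pd.DataFrame({
    "distance": [5.0, 3.1, 12.0],
    "unit": ["km", "miles", "km"],
})

# Convert everything to kilometres (1 mile = 1.60934 km).
miles = trips["unit"] == "miles"
trips.loc[miles, "distance"] = trips.loc[miles, "distance"] * 1.60934
trips["unit"] = "km"

# Parse day-first dates explicitly, so 03/04/2024 means 3 April, not 4 March.
aus_dates = pd.to_datetime(pd.Series(["03/04/2024", "15/08/2024"]),
                           dayfirst=True)
```

Being explicit with `dayfirst=True` matters: left to guess, a parser will often assume the US convention and silently swap your days and months.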
Step 4: Remove Unwanted Outliers
Outliers are data points that deviate significantly from the rest of the dataset. They can skew the results of your analysis, so it is vital to approach removing them with caution. Just because an outlier exists doesn’t mean it shouldn’t be considered in your analysis; it depends on the analysis you’re running and the effect of removing or retaining the outlier on the result. If the outlier is obviously due to incorrect data entry, remove it.
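One common rule of thumb (not the only one) is to flag points more than 1.5 times the interquartile range beyond the quartiles. A sketch with made-up order values, including one obvious data-entry error:

```python
import pandas as pd

# Hypothetical order values; 9999 looks like a data-entry error.
orders = pd.Series([120, 95, 130, 110, 9999, 105, 125])

# Flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = orders.quantile(0.25), orders.quantile(0.75)
iqr = q3 - q1
within_bounds = orders.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

kept = orders[within_bounds]  # 9999 is dropped; the other six values survive
```

Remember the caution above: a rule like this only flags candidates, and it’s still your call whether each one belongs in the analysis.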
Step 5: Correct Contradictory Data
Contradictory errors occur when fields within the same record are inconsistent with each other. For example, a student’s grade is recorded as D when they scored 85 out of 100 on the exam.
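A small sketch of flagging such contradictions in pandas; the student names and grading rule here are invented for illustration:

```python
import pandas as pd

# Hypothetical exam results; a score of 85 with grade "D" is contradictory.
results = pd.DataFrame({
    "student": ["Amy", "Ben", "Cho"],
    "score": [85, 42, 91],
    "grade": ["D", "D", "A"],
})

# Illustrative rule: a score of 80+ should earn at least a B.
contradiction = (results["score"] >= 80) & results["grade"].isin(["C", "D", "F"])
flagged = results[contradiction]  # records that need investigation
```

Flagging rather than auto-correcting is deliberate: a contradiction tells you one of the fields is wrong, but not which one.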
Step 6: Fix Data Type Error
At this stage, your data set should look pretty good. However, you still need to check the data types. Make sure that all the number columns have a numerical data type, text columns have a string type, and dates are stored as date objects. Note that some fields look numerical but should be assigned a text type, such as postcodes: you’re not going to do any mathematical calculations on a postcode, and storing it as a number would strip a leading zero.
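A minimal sketch in pandas, assuming a hypothetical export where every column arrived as text:

```python
import pandas as pd

# Everything arrived as strings from a hypothetical CSV export.
raw = pd.DataFrame({
    "amount": ["19.99", "5.00"],
    "order_date": ["2024-03-01", "2024-03-02"],
    "postcode": ["0800", "3000"],  # the leading zero matters!
})

typed = raw.assign(
    amount=pd.to_numeric(raw["amount"]),           # numbers as numeric
    order_date=pd.to_datetime(raw["order_date"]),  # dates as date objects
    postcode=raw["postcode"].astype("string"),     # postcode stays text
)
```

Keeping `postcode` as text preserves Darwin’s `0800`, which would become `800` if cast to a number.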
Step 7: Handle Missing Data
There are three common approaches to handling missing data:
- Remove the entries associated with the missing data. This means you will lose other important info.
- Impute the missing data based on other similar data. You’ll reinforce existing observations which could be wrong.
- Flag the data as missing. You do this by entering a “missing” value in the empty field, or “0” if it’s a numerical field.
Your approach will depend on the analysis you’re running and how you plan to preprocess your data. Also, removing a small volume of entries with missing data is unlikely to have a material impact if you have a massive dataset.
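The three approaches above can be sketched in pandas like so, with made-up survey data:

```python
import pandas as pd

# Hypothetical survey responses with gaps.
survey = pd.DataFrame({
    "age": [25, None, 40, 31],
    "city": ["Sydney", "Perth", None, "Hobart"],
})

# 1) Remove entries with missing data (you lose the other info in those rows).
dropped = survey.dropna()

# 2) Impute from similar data, e.g. fill missing ages with the median age.
imputed = survey.assign(age=survey["age"].fillna(survey["age"].median()))

# 3) Flag the gap explicitly with a "missing" value.
flagged = survey.assign(city=survey["city"].fillna("missing"))
```

Each option trades something away, as noted above, so match the choice to the analysis you’re running.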
Step 8: Validate the Data
The final step is to check that the data is ready for analysis. If there are still errors (usually there are), you’ll need to return to earlier steps and fix them.
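Validation can be as simple as a handful of spot checks that fail loudly when something slipped through. A sketch, with illustrative data and rules:

```python
import pandas as pd

# Hypothetical cleaned dataset ready for final checks.
df = pd.DataFrame({
    "amount": [19.99, 5.00],
    "postcode": ["0800", "3000"],
})

# Simple validation assertions before handing the data off to analysis.
assert df["amount"].notna().all(), "no missing amounts"
assert (df["amount"] > 0).all(), "amounts are positive"
assert df["postcode"].str.len().eq(4).all(), "postcodes are 4 characters"
```

If any assertion fails, that’s your cue to return to the earlier steps and fix the offending records.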
These are the eight steps. Fortunately, we don’t have to do the cleaning manually. Can you imagine just how painful that would be? We have plenty of tools available to help clean data. At work I use Microsoft Excel, Power Query (which is part of Excel), and Tableau Prep. You can also do the cleaning in Python, a programming language. Free, open-source tools are available too, such as OpenRefine, previously known as Google Refine. It’s a powerful tool; check it out and play around with it to see if you like it.
I would say that Excel is likely your first port of call if you’re starting in analytics. Microsoft Support has a page on Top Ten Ways to Clean Your Data. The page sets out the common data cleaning tasks, and against each of these tasks are instructions on Excel features that can be used to handle the task. It’s a great reference. I highly recommend that you bookmark it.