What is data cleaning?

Data cleaning (also called data cleansing) prepares data for analysis by finding and dealing with problematic data points within a data set.

Data cleaning can involve fixing or removing incomplete data, cross-checking data against a validated data set, standardizing inconsistent data and more.

The overall objective of data cleaning is to improve data quality, consistency and utility before the data is loaded into a database or data warehouse in a structured format.

How can you clean data?

Start by running basic sanity checks to detect and remove duplicate records.
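A duplicate check with pandas might look like the following minimal sketch; the DataFrame and its column names are illustrative assumptions, not part of any particular data set.

```python
import pandas as pd

# Hypothetical records: the second and third rows are exact duplicates.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})

# Count exact duplicate rows, then keep only the first occurrence of each.
n_dupes = df.duplicated().sum()
deduped = df.drop_duplicates().reset_index(drop=True)
```

`duplicated()` flags repeats of earlier rows, so `n_dupes` here is 1 and `deduped` keeps three unique rows.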

Different datasets call for different cleaning processes. Satellite images require the removal of cloud cover, while location data requires filtering out invalid latitude/longitude pairs based on accuracy and similar parameters.

In general, however, a standard data set should go through the steps outlined below.

Next, fix structural errors such as stray white space, typos and inconsistent capitalization, and standardize the data so that each value follows the same format.
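These structural fixes can be sketched with pandas string methods; the `country` column and its variant spellings are invented for illustration.

```python
import pandas as pd

# Hypothetical column with whitespace, casing and spelling inconsistencies.
df = pd.DataFrame({"country": ["  USA", "usa ", "U.S.A.", "Canada"]})

# Strip whitespace, lowercase everything, then map known variants
# onto a single canonical spelling.
cleaned = (df["country"]
           .str.strip()
           .str.lower()
           .replace({"u.s.a.": "usa"}))
```

After this pass, all three spellings of "USA" collapse into one standard value, which makes grouping and counting reliable.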

Lastly, deal with missing data and filter out unwanted outliers.
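Handling missing values could look like this minimal sketch, assuming hypothetical `age` and `city` columns; filling with the median and dropping unusable rows are just two common strategies among several.

```python
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "age": [25, None, 40, None],
    "city": ["NY", "LA", None, "SF"],
})

# Fill numeric gaps with the column median...
df["age"] = df["age"].fillna(df["age"].median())

# ...and drop rows that are still missing a required field.
df = df.dropna(subset=["city"]).reset_index(drop=True)
```

The median of the known ages (25 and 40) is 32.5, so both missing ages become 32.5, and the one row without a city is dropped.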

Numerical data can be cleaned with statistical methods based on measures such as the mean, standard deviation or range. Libraries such as pandas for Python and dplyr for R make it straightforward to manipulate and clean data.
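As one example of such a statistical method, the sketch below keeps only values within two standard deviations of the mean; the sample series is invented, and the two-standard-deviation cutoff is a common rule of thumb rather than a universal threshold.

```python
import pandas as pd

# Hypothetical measurements: 300 is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 300])

# Keep only values within 2 standard deviations of the mean.
mean, std = s.mean(), s.std()
filtered = s[(s - mean).abs() <= 2 * std]
```

Note that an extreme outlier inflates both the mean and the standard deviation, so for heavily skewed data a range-based (interquartile) filter is often more robust.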


See also

  1. The ultimate guide to effective data cleaning [Ebook]
  2. Data cleaning, management and tagging: The best practices
  3. Why it’s important to standardize your data
  4. When should you delete outliers from a data set?