What is dirty data? What are its types?

Identify dirty data and improve your model's efficiency

As we all know, good data plays a crucial role in data analysis. By a commonly cited estimate, data scientists spend about 80% of their time cleaning data, which leaves only about 20% for analyzing the data and drawing insights from it. Better data-driven decisions are possible when clean data is available.

So, what really is dirty data, and what are its types?

Dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve.

Data is said to be dirty if it is:

  • Incomplete: Values are missing because of interrupted transactions, human error, or information that was unavailable when the data was collected.

  • Incorrect: False or wrong values have been entered.

  • Irrelevant: The information does not belong where it is, for example a value entered in the wrong column.

Types of dirty data:

  1. Duplicate data

  2. Outdated data

  3. Incomplete data

  4. Incorrect/inaccurate data

  5. Inconsistent data

1) Duplicate data-

Duplicate data is any record that shows up more than once. Repeated data can lead to skewed metrics and analyses, inaccurate counts, wrong predictions, and confusion during data retrieval.

Causes of duplicate data include data migration, manual data entry (human error), and batch data imports.

[Figure: Duplicate data example]
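
To make the idea of duplicates concrete, here is a minimal sketch in plain Python. The customer records and the field names (`name`, `email`) are made up for illustration, and treating the lowercased, trimmed email as the identity key is an assumption; your own dataset may need a different key. If you work with pandas, `DataFrame.drop_duplicates` covers the exact-match case.

```python
# Hypothetical customer records; the field names are assumptions for illustration.
records = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "Ben Lee", "email": "ben@example.com"},
    {"name": "asha rao ", "email": "ASHA@example.com"},  # same person, different casing
]


def dedupe(rows, key_fields):
    """Keep the first row seen for each normalized key; return (unique, duplicates)."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        # Normalize case and surrounding whitespace so near-identical entries match.
        key = tuple(str(row[field]).strip().lower() for field in key_fields)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates


unique_rows, duplicate_rows = dedupe(records, ["email"])
print(len(unique_rows), "unique,", len(duplicate_rows), "duplicate")
```

Returning the duplicates instead of silently dropping them lets you review what was removed before trusting the cleaned data.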

2) Outdated data-

Outdated data is data that is old and should be replaced with newer, more accurate information. Using outdated data can lead to inaccurate insights, less efficient decision-making, and poor analytics.

Data becomes outdated when people change roles or companies, or when software and systems become obsolete.
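
One simple way to spot outdated records is to compare a "last updated" date against a cutoff. The sketch below assumes each record carries a `last_updated` field and that two years (730 days) is the staleness threshold; both are illustrative choices, not rules.

```python
from datetime import date, timedelta

# Hypothetical contact records; the last_updated field is an assumption.
contacts = [
    {"name": "Asha Rao", "last_updated": date(2024, 5, 1)},
    {"name": "Ben Lee", "last_updated": date(2019, 1, 15)},
]


def find_stale(rows, today, max_age_days=730):
    """Return the rows whose last_updated date is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [row for row in rows if row["last_updated"] < cutoff]


# A fixed "today" keeps the example reproducible.
stale = find_stale(contacts, today=date(2024, 12, 31))
print([row["name"] for row in stale])
```

Flagged records are candidates for review or refresh. Whether an old record is actually wrong still depends on the domain.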

3) Incomplete data-

Data that is missing important fields is termed incomplete data. A dataset with missing or null values affects analytics in a major way. It can also lead to decreased productivity, inaccurate insights, or an inability to complete essential services.

Incomplete data is usually caused by improper data collection or incorrect data entry.
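
A quick first check for incomplete data is to count the missing values per field. In this sketch the sample rows and the set of strings treated as "missing" (`""`, `"N/A"`, `"null"`) are assumptions; real datasets often use their own placeholders.

```python
# Strings treated as "missing" here are an assumption; adjust them to your dataset.
MISSING_STRINGS = {"", "N/A", "null"}

# Hypothetical rows for illustration.
rows = [
    {"name": "Asha Rao", "phone": "555-0101", "city": "Pune"},
    {"name": "Ben Lee", "phone": "", "city": None},
    {"name": "Chen Wu", "phone": "N/A", "city": "Austin"},
]


def missing_counts(table):
    """Count, per field, how many rows hold None or a placeholder string."""
    counts = {}
    for row in table:
        for field, value in row.items():
            is_missing = value is None or (
                isinstance(value, str) and value.strip() in MISSING_STRINGS
            )
            counts[field] = counts.get(field, 0) + int(is_missing)
    return counts


missing = missing_counts(rows)
print(missing)
```

Fields with a high missing count are the ones to investigate first, whether you later drop them, fill them in, or fix the collection process.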

4) Incorrect/ inaccurate data-

Data that has no missing values but is still wrong is known as incorrect or inaccurate data. Using it leads to inaccurate insights and to decisions based on bad information, which can result in revenue loss.

Inaccurate data is usually caused by human error during data entry, by fake information, or by mock data left in the dataset.
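
Some incorrect values can be caught with simple validation rules. The sketch below uses two assumed rules: an age must fall between 0 and 120, and an email must roughly look like `user@domain.tld`. The regular expression is deliberately loose and is not a full email validator. Rules like these only catch implausible values; a plausible but wrong value, such as a mistyped age of 43 instead of 34, will still pass.

```python
import re

# A deliberately loose pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate(row):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if not (0 <= row["age"] <= 120):  # assumed plausible age range
        problems.append("age out of range")
    if not EMAIL_PATTERN.match(row["email"]):
        problems.append("malformed email")
    return problems


# Hypothetical records for illustration.
people = [
    {"name": "Asha Rao", "age": 34, "email": "asha@example.com"},
    {"name": "Ben Lee", "age": 340, "email": "ben@example"},
]

report = {person["name"]: validate(person) for person in people}
print(report)
```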

5) Inconsistent data-

Inconsistent data is data that uses different formats to represent the same thing. The most common example is dates: some countries use dd/mm/yyyy, whereas others prefer mm/dd/yyyy. Conflicting data points lead to confusion or an inability to classify or segment customers.

Inconsistent data is usually caused by data being stored incorrectly or by errors introduced during data transfer.
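
The date example shows why inconsistent formats are risky: the same text can mean two different days. A minimal sketch of one fix is to convert every date to a single format (ISO `yyyy-mm-dd` here) using the known format of each source. The helper name `to_iso` is made up for this example, and it assumes you know which format each source uses, because the format cannot be guessed reliably from the text alone.

```python
from datetime import datetime


def to_iso(date_text, source_format):
    """Parse date_text using source_format and return it as yyyy-mm-dd."""
    return datetime.strptime(date_text, source_format).date().isoformat()


# The same text read under the two conventions gives two different dates.
day_first = to_iso("03/04/2023", "%d/%m/%Y")    # read as 3 April 2023
month_first = to_iso("03/04/2023", "%m/%d/%Y")  # read as 4 March 2023
print(day_first, month_first)
```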

Thank you for reading.

I hope you gained some useful information from this post.

Have a nice day! 😁
