Introduction to Data Cleaning


Data cleaning is also known as data scrubbing. The data cleaning process detects and removes errors and inconsistencies, improving the quality of the data.

Data quality problems arise from misspellings during data entry, missing values, and other invalid data.


Reasons for “Dirty” Data –

  • Dummy Values
  • Absence of data
  • Multipurpose fields
  • Cryptic data
  • Contradicting data
  • Inappropriate use of address lines
  • Violation of business rules
  • Reused primary keys
  • Non-unique identifiers
  • Data integration problems

Why is Data Cleaning or Cleansing required?

  • Source system data is not clean; it contains errors and inconsistencies.
  • Specialized tools are available that can be used for cleaning the data.
  • Some of the leading data cleansing vendors include Validity (Integrity), Harte-Hanks (Trillium) and Firstlogic.

Steps in Data Cleaning or Cleansing –


(1) Parsing –

  • Parsing is a process in which individual data elements are located and identified in the source systems, and these elements are then isolated in the target files.
  • For example, parsing a name into first name, middle name and last name.
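A minimal sketch of the name-parsing example, assuming full names follow a simple "First [Middle] Last" layout (real parsers also handle prefixes, suffixes and compound surnames):

```python
def parse_name(full_name):
    """Split a full name into first, middle and last components."""
    parts = full_name.split()
    if len(parts) == 2:
        # No middle name present.
        return {"first": parts[0], "middle": "", "last": parts[1]}
    # Everything between the first and last tokens is treated as the middle name.
    return {"first": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}

print(parse_name("John Fitzgerald Kennedy"))
print(parse_name("Ada Lovelace"))
```

Each isolated element can then be loaded into its own target field.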

(2) Correcting –

  • This is the next phase after parsing.
  • In this phase, individual data elements are corrected using data algorithms and secondary data sources.
  • For example, replacing a vanity address and adding a zip code in the address attribute.
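A sketch of the zip-code example: the lookup table below stands in for a secondary data source (the city names and codes are made-up illustrations):

```python
# Hypothetical secondary data source mapping city names to zip codes.
ZIP_LOOKUP = {"springfield": "62701", "shelbyville": "62565"}

def correct_address(record):
    """Fill in a missing zip code from a secondary lookup table."""
    rec = dict(record)  # work on a copy, leave the source record intact
    if not rec.get("zip"):
        rec["zip"] = ZIP_LOOKUP.get(rec["city"].lower(), "")
    return rec

print(correct_address({"city": "Springfield", "zip": ""}))
```

In practice the secondary source would be a postal reference database rather than an in-memory dictionary.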

(3) Standardizing –

  • In the standardizing process, conversion routines are used to transform data into a consistent format using both standard and custom business rules.
  • For example, adding a prename, replacing a nickname and using a preferred street name.
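The nickname and street-name rules can be sketched as simple conversion tables; the entries below are illustrative examples, not a real standard:

```python
# Illustrative business rules: nickname expansion and street abbreviations.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}
STREET_ABBREV = {"st": "Street", "ave": "Avenue", "rd": "Road"}

def standardize(record):
    """Apply conversion routines to bring a record into a consistent format."""
    rec = dict(record)
    # Replace a nickname with the preferred full name.
    rec["first"] = NICKNAMES.get(rec["first"].lower(), rec["first"]).title()
    # Expand street abbreviations to the preferred street name.
    words = [STREET_ABBREV.get(w.lower().rstrip("."), w) for w in rec["street"].split()]
    rec["street"] = " ".join(words)
    return rec

print(standardize({"first": "bob", "street": "42 Main St."}))
```

Standardized fields make the later matching step far more reliable, since "St." and "Street" no longer look like different values.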

(4) Matching –

  • The matching process eliminates duplicates by searching for and matching records using the parsed, corrected and standardized data.
  • For example, identifying similar names and addresses.
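One simple way to flag similar names and addresses is a string-similarity score over the standardized fields; the 0.85 threshold below is an arbitrary example value that would be tuned per dataset:

```python
from difflib import SequenceMatcher

def is_match(a, b, threshold=0.85):
    """Return True if two records look like duplicates of each other."""
    key_a = f'{a["name"]} {a["street"]}'.lower()
    key_b = f'{b["name"]} {b["street"]}'.lower()
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold

r1 = {"name": "Robert Smith", "street": "42 Main Street"}
r2 = {"name": "Robert Smyth", "street": "42 Main Street"}  # likely the same person
print(is_match(r1, r2))
```

Production matching tools use more sophisticated techniques (phonetic encodings, blocking, probabilistic record linkage), but the idea is the same: compare cleaned fields, not raw input.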

(5) Consolidating –

  • Consolidating merges matched records into a single representation by analyzing and identifying the relationships between them.
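A sketch of one common consolidation policy, merging a group of matched records into a single "survivor" by keeping the first non-empty value seen for each field (real tools support richer survivorship rules, such as most-recent or most-trusted source):

```python
def consolidate(records):
    """Merge matched records into one representation, field by field."""
    merged = {}
    for rec in records:
        for field, value in rec.items():
            # Keep the first non-empty value encountered for each field.
            if not merged.get(field):
                merged[field] = value
    return merged

dupes = [
    {"name": "Robert Smith", "phone": "", "email": "rob@example.com"},
    {"name": "Robert Smith", "phone": "555-0101", "email": ""},
]
print(consolidate(dupes))
```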

(6) Data Cleansing must deal with many types of possible errors –

  • Data can contain many kinds of errors, such as missing or incorrect values, even within a single source.
  • When more than one source is involved, there is also the possibility of inconsistent and conflicting data.

(7) Data Staging –

  • Data staging is an interim step between data extraction and the remaining steps.
  • Data is accumulated from asynchronous sources using processes such as native interfaces, flat files and FTP sessions.
  • After a predefined interval, the data is loaded into the warehouse following the transformation process.
  • No end-user access is available to the staging file.
  • An operational data store may be used for data staging.

Missing Values –


This involves searching for empty fields where values should occur. Data preprocessing is one of the most important stages in data mining.

Real-world data is often incomplete, noisy or inconsistent. Such data can be corrected during preprocessing by filling in missing values, smoothing out noise and correcting inconsistencies.

There are several techniques for dealing with missing data; the choice depends on the problem domain and the goal of the data mining process.

The following are different ways to handle missing values in databases:

  1. Ignore the data row.
  2. Fill the missing values manually.
  3. Use a global constant to fill in for missing values.
  4. Use attribute mean.
  5. Use the attribute mean for all samples belonging to the same class.
  6. Use an inference-based algorithm, such as regression or decision-tree induction, to predict the most probable value.
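Strategies 4 and 5 above can be sketched on a small table; the class labels and income values below are made-up illustrations, and `statistics` is part of the Python standard library:

```python
from statistics import mean

rows = [
    {"class": "A", "income": 50.0},
    {"class": "A", "income": None},   # the missing value to fill
    {"class": "B", "income": 80.0},
    {"class": "B", "income": 90.0},
]

def fill_with_attribute_mean(rows):
    """Strategy 4: replace None with the mean over all known values."""
    overall = mean(r["income"] for r in rows if r["income"] is not None)
    return [dict(r, income=overall if r["income"] is None else r["income"]) for r in rows]

def fill_with_class_mean(rows):
    """Strategy 5: replace None with the mean within the row's own class."""
    out = []
    for r in rows:
        if r["income"] is None:
            known = [s["income"] for s in rows
                     if s["class"] == r["class"] and s["income"] is not None]
            out.append(dict(r, income=mean(known)))
        else:
            out.append(dict(r))
    return out

print(fill_with_attribute_mean(rows)[1]["income"])  # overall mean of known values
print(fill_with_class_mean(rows)[1]["income"])      # mean of class A only
```

The class-conditional fill (strategy 5) usually gives a better estimate than the global mean when the attribute varies strongly across classes.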