Data Quality Issues and Current Approaches to Data Cleaning Process In Data Warehousing


GRD Journals- Global Research and Development Journal for Engineering | Volume 1 | Issue 10 | September 2016 ISSN: 2455-5703

Jaya Bajpai, MBA-IT Student, SICSR, affiliated to Symbiosis International University (SIU), Pune, Maharashtra, India

Pravin S. Metkewar, Associate Professor, SICSR, affiliated to Symbiosis International University (SIU), Pune, Maharashtra, India

Abstract
This paper discusses the data quality problems that are addressed during the data cleaning phase. Data cleaning is one of the important processes in ETL (extract, transform, load) and is especially required when integrating heterogeneous data sources. These problems should be addressed together with schema-related data transformations. Finally, we discuss the current tools that support data cleaning.

Keywords- Data cleaning, Data warehousing, Data cleaning tools, ETL, Data quality

I. INTRODUCTION
Data cleaning, also called data scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality. Data quality problems are present in single data collections, such as files and databases. They can arise from misspellings during data entry, missing information, or other invalid data. The need for data cleaning increases significantly when multiple data sources must be integrated, e.g. in data warehouses, federated database systems, or global web-based information systems. This is because the sources often contain redundant data in different representations. The main aim is to provide access to accurate and consistent data, which requires consolidating the different data representations and eliminating duplicate data.
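The consolidation of redundant records that arrive in different representations can be illustrated with a minimal sketch. It is not taken from the paper: the field names (`name`, `phone`), the helper names `normalize` and `consolidate`, and the sample records are all hypothetical, and exact matching on a normalized key is only one simple way to detect duplicates.

```python
import re

def normalize(record):
    """Build a comparison key that ignores case, spacing and phone punctuation."""
    name = " ".join(record["name"].split()).casefold()
    phone = re.sub(r"\D", "", record["phone"])  # keep digits only
    return (name, phone)

def consolidate(*sources):
    """Merge several sources, keeping the first record seen for each key."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(normalize(record), record)
    return list(merged.values())

# Hypothetical sample data: the same customer written two different ways.
source_a = [{"name": "Mohan Kumar", "phone": "020-1234567"}]
source_b = [
    {"name": " mohan  kumar", "phone": "(020) 1234567"},
    {"name": "Hari", "phone": "020-7654321"},
]

customers = consolidate(source_a, source_b)
```

Here `customers` holds two records, because both spellings of the first customer map to the same key. Real integration scenarios usually need approximate matching as well, which this sketch does not attempt.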

II. DATA CLEANING PROBLEM
This section classifies the major data quality problems to be solved by data cleaning and data transformation. Data transformations are needed to support any changes in the structure, representation, or content of data. These transformations become necessary in many situations, e.g. to deal with schema evolution, to migrate a legacy system to a new information system, or when multiple data sources are to be integrated.

We roughly distinguish between single-source and multi-source problems, and between schema-related and instance-related problems. Schema-level problems are also reflected in the instances. They can be addressed at the schema level by improved schema design, schema translation, and schema integration. Instance-level problems refer to errors and inconsistencies in the actual data content that are not visible at the schema level. They are the primary focus of data cleaning. Single-source problems also occur in multi-source cases.

A. Issues Caused due to Single Source of Data
In a single data source, problems arise at two levels: the schema level and the instance level.

1) Problems at Schema Level
Various single-source problems occur at the schema level. They are caused by a lack of appropriate model-specific or application-specific integrity constraints. Sometimes the data model is limited or the schema is poorly designed. Sometimes only a few integrity constraints are defined in order to limit the overhead of integrity control.

2) Problems at Instance Level
Problems that arise at the instance level, i.e. errors and inconsistencies such as misspellings, cannot be prevented at the schema level.

Table 1: Single Source Problems

Scope | Problem | Dirty Data | Remark/Reason
Attribute | Illegal values | bday: 23.14.90 | Values are outside of domain range
Record | Violation of attribute dependencies | age=10; bday=20-08-1950 | age = (current date - birth date) should hold
Record type | Uniqueness violation | cust 1="mohan", adhar no: 124; cust 2="hari", adhar no: 124 | Uniqueness of adhar no should hold
Source | Referential integrity violation | cust name=Sumit, dept=409 | Referenced dept 409 not defined
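The four problem types in Table 1 can be detected with simple checks once the data has been loaded. The following is a minimal sketch under stated assumptions, not a method from the paper: the record layout, the function names (`parse_bday`, `age_on`, `find_problems`), the two accepted date formats, the department number 101, and the reference date are all hypothetical and chosen only to mirror the table's examples.

```python
from datetime import date, datetime

# Assumed date formats, matching the two notations used in Table 1.
DATE_FORMATS = ("%d.%m.%y", "%d-%m-%Y")

def parse_bday(text):
    """Return a date, or None if the value is outside the legal domain."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def age_on(bday, today):
    """Age in completed years on the given day."""
    years = today.year - bday.year
    if (today.month, today.day) < (bday.month, bday.day):
        years -= 1
    return years

def find_problems(customers, departments, today):
    """Return (customer name, problem type) pairs for the checks in Table 1."""
    problems = []
    seen_adhar = set()
    for c in customers:
        bday = parse_bday(c["bday"])
        if bday is None:
            problems.append((c["name"], "illegal value"))
        elif c["age"] is not None and c["age"] != age_on(bday, today):
            problems.append((c["name"], "attribute dependency violation"))
        if c["adhar_no"] in seen_adhar:
            problems.append((c["name"], "uniqueness violation"))
        seen_adhar.add(c["adhar_no"])
        if c["dept"] not in departments:
            problems.append((c["name"], "referential integrity violation"))
    return problems

# Hypothetical records modeled on the dirty data in Table 1.
customers = [
    {"name": "mohan", "adhar_no": 124, "bday": "23.14.90", "age": None, "dept": 101},
    {"name": "hari", "adhar_no": 124, "bday": "20-08-1950", "age": 10, "dept": 101},
    {"name": "Sumit", "adhar_no": 125, "bday": "01-01-1990", "age": 26, "dept": 409},
]
problems = find_problems(customers, departments={101}, today=date(2016, 9, 1))
```

In a database these checks would normally be expressed as integrity constraints (domain checks, unique keys, foreign keys) so that the dirty data is rejected at entry. The sketch instead reports violations after the fact, which is the situation data cleaning has to deal with.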

All rights reserved by www.grdjournals.com


