Hybrid Technique for Data Cleaning

IJCA Proceedings on National Conference on Role of Engineers in National Building
© 2014 by IJCA Journal
NCRENB
Year of Publication: 2014
Authors:
Ashwini M. Save
Seema Kolkur

Ashwini M Save and Seema Kolkur. Article: Hybrid Technique for Data Cleaning. IJCA Proceedings on National Conference on Role of Engineers in National Building NCRENB:4-8, June 2014. Full text available. BibTeX

@article{key:article,
	author = {Ashwini M. Save and Seema Kolkur},
	title = {Article: Hybrid Technique for Data Cleaning},
	journal = {IJCA Proceedings on National Conference on Role of Engineers in National Building},
	year = {2014},
	volume = {NCRENB},
	pages = {4-8},
	month = {June},
	note = {Full text available}
}

Abstract

A data warehouse contains a large volume of data, and data quality is an important issue in data warehousing projects. Many business decision processes are based on the data entered in the data warehouse. Hence, improving data quality is necessary to obtain accurate data. Data may include text errors, quantitative errors, or duplicate records. There are several ways to remove such errors and inconsistencies from the data. Data cleaning is the process of detecting and correcting inaccurate data. Algorithms such as the Improved PNRS algorithm, the Quantitative algorithm, and the Transitive algorithm are used for data cleaning. This paper attempts to clean the data in a data warehouse by combining different data cleaning approaches. Text data is cleaned by the Improved PNRS algorithm, quantitative data is cleaned by special rules (the Enhanced technique), and duplicate records are removed by the Transitive closure algorithm. Applying these algorithms one after another to a data set increases the accuracy of the data set.
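
The three-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the details of the Improved PNRS algorithm or the Enhanced technique, so text correction is approximated here by nearest-dictionary matching on edit distance, quantitative cleaning by a single range rule, and duplicate removal by a union-find transitive closure over pairwise matches. All function names, record fields (`name`, `city`, `age`), thresholds, and the matching rule are assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def clean_text(value, dictionary, max_dist=2):
    """Stage 1 (stand-in for Improved PNRS): snap a value to the closest
    dictionary entry if it is within max_dist edits, else leave it unchanged."""
    best = min(dictionary, key=lambda w: edit_distance(value.lower(), w.lower()))
    if edit_distance(value.lower(), best.lower()) <= max_dist:
        return best
    return value


def clean_quantity(value, low, high):
    """Stage 2 (stand-in for the rule-based technique): a value outside
    [low, high] is treated as missing and replaced by None."""
    if value is not None and low <= value <= high:
        return value
    return None


def duplicate_groups(records, is_match):
    """Stage 3: group record indices by the transitive closure of is_match,
    so that A~B and B~C put A, B, and C in one group even if A and C
    do not match directly."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def clean(records, cities):
    """Apply the three stages one after another and keep the first
    record of each duplicate group."""
    cleaned = [
        {"name": r["name"],
         "city": clean_text(r["city"], cities),
         "age": clean_quantity(r["age"], 0, 120)}
        for r in records
    ]

    def is_match(a, b):
        # Hypothetical rule: same city and names at most one edit apart.
        return (a["city"] == b["city"]
                and edit_distance(a["name"].lower(), b["name"].lower()) <= 1)

    return [cleaned[group[0]] for group in duplicate_groups(cleaned, is_match)]
```

Running the stages in this order matters in the sketch: correcting the text fields first lets records that differ only by a misspelling be recognised as duplicates in the final stage.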
