Streamlining Data Quality: A Conceptual Model and Excel-based Tool for Email Duplicate Removal

Authors

  • P.G. Naik, Professor, Department of Computer Studies, Chhatrapati Shahu Institute of Business Education and Research, Kolhapur, Maharashtra, India
  • R.S. Kamath, Associate Professor, Department of Computer Studies, Chhatrapati Shahu Institute of Business Education and Research, Kolhapur, Maharashtra, India
  • S.S. Jamsandekar, Assistant Professor, Department of Computer Studies, Chhatrapati Shahu Institute of Business Education and Research, Kolhapur, Maharashtra, India
  • G.R. Naik, Professor, Department of Computer Studies, KIT’s College of Engineering, Kolhapur, India

Keywords

Data Cleaning, Data Preprocessing, Data Quality, Duplicate Removal, Email Dataset, Microsoft Excel Visual Basic for Applications (VBA)

Abstract

Reliable analysis and decision making depend on data quality. This paper presents a conceptual model and a practical tool for removing duplicate entries from datasets containing email IDs. The proposed mathematical model for duplicate removal is implemented in Microsoft Excel using Visual Basic for Applications (VBA). The Excel interface is extended with user-friendly macros organized under a newly added 'Manage Emails' tab, which groups tasks into three categories: data cleaning, duplicate removal, and report generation. The research focuses on the post-acquisition phase and assumes that the email datasets are already available; data acquisition falls beyond the paper's scope and is left for future exploration. The outcome is a 'Unique Emails' sheet that holds the pre-processed emails after duplicate removal. The work contributes to data quality enhancement by providing a practical solution for removing duplicate email entries, so that organizations and analysts can work with cleaner and more reliable data during pre-processing and analysis.
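The three task groups named in the abstract (data cleaning, duplicate removal, report generation) can be illustrated with a minimal sketch. It is written in Python rather than VBA and is not the authors' implementation or their mathematical model. The function names, the validation pattern, and the choice to compare addresses case-insensitively are all assumptions made for illustration.

```python
import re

# Assumption: a deliberately simple validity check (one "@", no whitespace,
# at least one dot in the domain). The paper's actual cleaning rules may differ.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean(raw):
    """Data cleaning: trim surrounding whitespace and lowercase the address.

    Lowercasing treats addresses as case-insensitive, which is an assumption.
    """
    return raw.strip().lower()


def deduplicate(raw_emails):
    """Return (unique, report) for a list of raw email strings.

    unique keeps the first occurrence of each cleaned address in input order;
    report holds summary counts for the report-generation step.
    """
    seen = set()
    unique = []
    invalid = 0
    duplicates = 0
    for raw in raw_emails:
        email = clean(raw)
        if not EMAIL_PATTERN.match(email):
            invalid += 1      # rejected during cleaning
            continue
        if email in seen:
            duplicates += 1   # already present in the unique list
            continue
        seen.add(email)
        unique.append(email)
    report = {
        "total": len(raw_emails),
        "invalid": invalid,
        "duplicates": duplicates,
        "unique": len(unique),
    }
    return unique, report
```

In this sketch the returned list plays the role of the 'Unique Emails' sheet, and the dictionary stands in for the generated report. Using a set for membership checks keeps the pass over the data linear in the number of rows.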

Published

2023-12-20