Data Preprocessing Methods for Machine Learning: An Empirical Comparison

P. Yasodha

2025 · DOI: 10.36948/ijfmr.2025.v07i03.48569
International Journal For Multidisciplinary Research · Cited by 0

TLDR

A systematic comparison of prominent data preprocessing methods across multiple real-world datasets and machine learning algorithms reveals that while certain methods like standardization and one-hot encoding generally improve performance, their effectiveness is dataset- and algorithm-dependent.

Abstract

The accuracy and efficiency of machine learning (ML) algorithms depend heavily on the quality and structure of the input data. Data preprocessing is a crucial step in the ML pipeline that transforms raw data into a clean, structured format suitable for modeling. Despite the diversity of preprocessing techniques, such as normalization, standardization, missing-value imputation, categorical encoding, and feature selection, there remains a lack of comprehensive empirical evaluation of their comparative effectiveness. This paper presents a systematic comparison of prominent data preprocessing methods across multiple real-world datasets and machine learning algorithms. Using a controlled experimental setup, we analyze the influence of different preprocessing techniques on model performance metrics such as accuracy, precision, recall, F1-score, and training time. The study reveals that while certain methods like standardization and one-hot encoding generally improve performance, their effectiveness is dataset- and algorithm-dependent. The findings highlight the importance of tailoring preprocessing strategies to specific use cases and provide guidelines for selecting optimal preprocessing combinations for different ML contexts.
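The kind of controlled comparison the abstract describes can be sketched as follows: hold the dataset and model fixed, vary only the preprocessing pipeline, and compare cross-validated accuracy. This is a minimal illustrative sketch using scikit-learn; the specific dataset, model, and scalers chosen here are assumptions for demonstration, not the paper's actual experimental design.

```python
# Illustrative sketch: compare preprocessing pipelines on a fixed
# dataset/model pair. Dataset and model choices are hypothetical,
# not taken from the paper.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same estimator throughout; only the preprocessing step varies.
pipelines = {
    "raw": make_pipeline(LogisticRegression(max_iter=5000)),
    "standardized": make_pipeline(StandardScaler(),
                                  LogisticRegression(max_iter=5000)),
    "min-max": make_pipeline(MinMaxScaler(),
                             LogisticRegression(max_iter=5000)),
}

# 5-fold cross-validated accuracy per preprocessing variant.
results = {}
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name:12s} mean accuracy = {results[name]:.3f}")
```

Extending this grid with more datasets, models, and metrics (precision, recall, F1, training time) yields the kind of dataset- and algorithm-dependent comparison the study reports.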
