Balancing Quality, Quantity, and Imputation Strategies

[Submitted on 24 Dec 2024 (v1), last revised 21 May 2025 (this version, v2)]

View a PDF of the paper titled Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies, by Qi Liu and Wanjing Ma

Abstract: Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects through two experimental setups: supervised learning with NLP tasks (NLP-SL) and reinforcement learning for signal optimization (Signal-RL). We analyze the relationship between data corruption levels and model performance, evaluate the effectiveness of data imputation methods, and assess the utility of enlarging datasets to address data corruption.
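To make the two corruption types concrete, here is a minimal Python sketch of how missing (dropped) and noisy entries could be injected into a list of numeric features, followed by a simple mean imputation. It is an illustration only, not the experimental setup of the paper: the helper names `corrupt_missing`, `corrupt_noisy`, and `impute_mean`, the Gaussian noise model, and the mean-fill imputation are all assumptions.

```python
import random


def corrupt_missing(values, ratio, rng):
    """Replace a fraction `ratio` of the entries with None (missing data)."""
    n_corrupt = round(ratio * len(values))
    idx = set(rng.sample(range(len(values)), n_corrupt))
    return [None if i in idx else v for i, v in enumerate(values)]


def corrupt_noisy(values, ratio, sigma, rng):
    """Add zero-mean Gaussian noise (std `sigma`) to a fraction `ratio` of the entries."""
    n_corrupt = round(ratio * len(values))
    idx = set(rng.sample(range(len(values)), n_corrupt))
    return [v + rng.gauss(0.0, sigma) if i in idx else v
            for i, v in enumerate(values)]


def impute_mean(values):
    """Fill missing entries with the mean of the observed ones.

    Assumes at least one entry is observed.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
clean = [float(i) for i in range(10)]
missing = corrupt_missing(clean, 0.3, rng)   # 3 of 10 entries dropped
noisy = corrupt_noisy(clean, 0.3, 1.0, rng)  # 3 of 10 entries perturbed
imputed = impute_mean(missing)               # gaps filled, possibly inaccurately
```

The imputed list has no gaps, but each filled value is only an estimate, so imputation turns missing entries into potentially noisy ones.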

Our results show that model performance under data corruption follows a diminishing-returns curve, modeled by the exponential function. Missing data, while detrimental, is less harmful than noisy data, which causes severe performance degradation and training instability, particularly in sequential decision-making tasks such as Signal-RL. Imputation strategies involve a trade-off: they recover missing information but may introduce noise. Their effectiveness depends on imputation accuracy and the corruption ratio. We identify distinct regions in the imputation advantage heatmap, including an “imputation advantageous corner” and an “imputation disadvantageous edge”, and classify tasks as “noise-sensitive” or “noise-insensitive” based on their decision boundaries.
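As a rough illustration of a diminishing-returns relationship between corruption and performance, the sketch below evaluates one possible saturating-exponential form, score(p) = a * (1 - exp(-b * (1 - p))), where p is the corruption ratio. The abstract does not state the parameterization, so the functional form, the function name `performance`, and the values a = 1.0 and b = 3.0 are assumptions chosen only to show the shape.

```python
import math


def performance(p, a=1.0, b=3.0):
    """Hypothetical score at corruption ratio p in [0, 1].

    a is the assumed performance ceiling and b the assumed rate; neither
    value comes from the paper.
    """
    return a * (1.0 - math.exp(-b * (1.0 - p)))


ratios = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
scores = [performance(p) for p in ratios]

# Performance lost for each additional 0.2 of corruption.
drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]

for p, s in zip(ratios, scores):
    print(f"corruption={p:.1f}  score={s:.3f}")
```

Under this assumed form, the first clean samples contribute the most: the score falls slowly at low corruption and increasingly fast as the clean fraction approaches zero.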

Furthermore, we find that increasing dataset size mitigates but cannot fully overcome the effects of data corruption. The marginal utility of additional data diminishes as corruption increases. An empirical rule emerges: approximately 30% of the data is critical for determining performance, while the remaining 70% has minimal impact.

These findings provide actionable insights into data preprocessing, imputation strategies, and data collection practices, guiding the development of robust machine learning systems in noisy environments.