ByteDance Introduces QuaDMix: A Unified AI Framework for Data Quality and Diversity in LLM Pretraining


The pretraining efficiency and generalization of large language models (LLMs) are strongly influenced by the quality and diversity of the underlying training corpus. Traditional data curation pipelines often treat quality and diversity as separate objectives, applying quality filtering first and domain balancing afterward. This sequential optimization overlooks the complex interdependencies between the two factors. High-quality datasets frequently exhibit domain biases, while diversified datasets may compromise quality. Under a fixed training budget, both dimensions need to be optimized simultaneously to maximize model performance. However, defining and jointly optimizing quality and diversity remains non-trivial.

ByteDance Presents QuaDMix

ByteDance presents QuaDMix, a unified data selection framework that systematically balances quality and diversity during LLM pretraining. QuaDMix evaluates each data sample against multiple quality criteria and a domain classification, and determines its sampling probability through a parameterized function. The framework uses proxy model experiments together with a LightGBM-based regressor to predict downstream performance, which allows efficient parameter optimization without exhaustive large-scale training. Experiments show that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks compared to methods that optimize quality and diversity separately, which supports the effectiveness of the joint approach.

QuaDMix operates in three main stages: feature extraction, quality aggregation, and quality-diversity aware sampling. First, each document is annotated with a domain label and multiple quality scores. These scores are normalized and merged using domain-specific parameters to compute an aggregated quality score. Documents are then sampled according to a sigmoid-based function that prioritizes higher-quality samples while maintaining domain balance through parameterized controls.
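The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the weighted-average merge, the sigmoid form of the sampling curve, and the per-domain `weights`, `threshold`, and `steepness` parameters are all assumptions made for the example.

```python
import math
import random


def merged_quality(scores, weights):
    """Weighted average of quality scores that are already normalized to [0, 1]."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def sampling_probability(quality, threshold, steepness):
    """Assumed sigmoid curve: near 1 well above the threshold, near 0 well below."""
    return 1.0 / (1.0 + math.exp(-steepness * (quality - threshold)))


def select_documents(docs, domain_params, seed=0):
    """Keep each document with a probability set by its domain's parameters."""
    rng = random.Random(seed)
    kept = []
    for doc in docs:
        params = domain_params[doc["domain"]]  # hypothetical per-domain settings
        quality = merged_quality(doc["scores"], params["weights"])
        prob = sampling_probability(quality, params["threshold"], params["steepness"])
        if rng.random() < prob:
            kept.append(doc["id"])
    return kept
```

Because every domain carries its own threshold and steepness, lowering the threshold for an under-represented domain lets more of its documents through, which is one way a single function can trade quality against domain balance.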

Optimization is performed by training thousands of proxy models across different parameter settings. A regression model trained on these proxy experiments predicts performance outcomes, which makes it possible to identify optimal sampling configurations. This method enables a structured exploration of a high-dimensional parameter space and aligns data selection more closely with the intended downstream tasks.
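The proxy-plus-regressor loop can be illustrated with a toy example. This sketch makes several assumptions: `toy_proxy_score` is a hypothetical stand-in for training and evaluating one small proxy model, the two-dimensional parameter space is invented for illustration, and a simple k-nearest-neighbour average is swapped in for the LightGBM regressor so that the example needs only the standard library.

```python
import random


def knn_predict(train, query, k=3):
    """Surrogate model: average the scores of the k closest training settings."""
    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))

    nearest = sorted(train, key=squared_distance)[:k]
    return sum(score for _, score in nearest) / len(nearest)


def toy_proxy_score(params):
    """Hypothetical stand-in for one proxy training run and its evaluation."""
    threshold, steepness = params
    return 1.0 - (threshold - 0.6) ** 2 - 0.1 * (steepness - 0.5) ** 2


def search_best_params(n_proxy_runs=200, n_candidates=2000, seed=0):
    rng = random.Random(seed)

    def sample_params():
        return (rng.random(), rng.random())

    # Expensive step: one small proxy run per sampled parameter setting.
    train = []
    for _ in range(n_proxy_runs):
        params = sample_params()
        train.append((params, toy_proxy_score(params)))

    # Cheap step: rank many untried settings using only the surrogate.
    candidates = [sample_params() for _ in range(n_candidates)]
    return max(candidates, key=lambda c: knn_predict(train, c))
```

The design point is the cost asymmetry: proxy runs are limited in number because each one requires training, while the surrogate can score far more candidate settings than could ever be trained directly.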

QuaDMix provides several advantages:

Unified optimization of data quality and domain diversity.
Adaptability to task-specific requirements through the choice of proxy evaluation targets.
Computational efficiency by avoiding exhaustive retraining of full-scale models.
Consistent downstream performance improvements without increasing compute budgets.

Experimental Results and Insights

Validation experiments were conducted on the RefinedWeb dataset, training 530M-parameter models from scratch. QuaDMix was compared against several baselines, including Random Selection, FineWeb-Edu, AskLLM, DCLM, DSIR, and RegMix. QuaDMix consistently outperformed these methods, achieving an average score of 39.5% across nine diverse benchmarks.

Key observations include:

Joint optimization strategies consistently outperform isolated quality-focused or diversity-focused methods.
Proxy model performance correlates strongly with large-scale model outcomes, which validates the proxy-based approach.
Data mixtures optimized for specific downstream tasks further improve performance on those tasks.
Merging multiple quality criteria reduces inherent biases and improves overall model robustness.
Expanding token diversity beyond a certain threshold yields diminishing returns, which underlines the importance of curated quality over sheer quantity.

Conclusion

QuaDMix offers a principled approach to data selection for LLM pretraining, addressing the long-standing challenge of optimizing data quality and diversity simultaneously. By combining quality aggregation and domain-aware sampling in a unified framework and relying on proxy-based optimization, QuaDMix establishes a scalable methodology for improving LLM pretraining efficiency. While there is room for future improvement, such as refining the parameter space and increasing the fidelity of the proxy models, QuaDMix represents a significant step toward more systematic and effective data curation strategies for large-scale model development.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has more than 2 million monthly views, which reflects its popularity among readers.