GneissWeb: Preparing High Quality Data for LLMs at Scale

View a PDF of the paper titled GneissWeb: Preparing High Quality Data for LLMs at Scale, by Hajar Emami Gohari and 31 other authors
Abstract: The quantity and quality of data play a vital role in determining the performance of large language models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize to a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models.
In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe consists of exact sub-string deduplication and an ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens).
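The abstract only names the two stages of the recipe; the actual algorithms, sharding scheme, and filter ensemble are detailed in the full paper. Purely as an illustration of how such a two-stage pipeline fits together, here is a minimal Python sketch. The span length, quality cutoff, and scorer functions are hypothetical placeholders, and this sketch drops whole documents that share an exact span, whereas production exact sub-string deduplication (e.g., suffix-array based) typically removes only the duplicated spans:

```python
from hashlib import sha256

# Hypothetical knobs -- the abstract does not specify real values.
SPAN = 50                 # character window for the exact-substring check
QUALITY_THRESHOLD = 0.5   # ensemble score cutoff

def span_hashes(text: str, span: int = SPAN) -> set[bytes]:
    """Hash consecutive fixed-length character spans of a document."""
    if len(text) < span:
        return {sha256(text.encode()).digest()}
    return {sha256(text[i:i + span].encode()).digest()
            for i in range(0, len(text) - span + 1, span)}

def dedup_and_filter(docs, quality_scorers):
    """Drop documents sharing an exact span with earlier ones, then keep
    only documents whose average ensemble quality score clears the cutoff."""
    seen: set[bytes] = set()
    for doc in docs:
        hashes = span_hashes(doc)
        if hashes & seen:  # an exact substring was already observed
            continue
        seen |= hashes
        score = sum(s(doc) for s in quality_scorers) / len(quality_scorers)
        if score >= QUALITY_THRESHOLD:
            yield doc

# Toy usage: a single length-based scorer standing in for the real ensemble.
docs = ["some web page text " * 60, "some web page text " * 60, "short"]
clean = list(dedup_and_filter(docs, [lambda d: min(len(d) / 1000, 1.0)]))
```

At the 10-trillion-token scale the paper targets, such a pipeline would be sharded across many files so that deduplication state and filtering run in parallel rather than through a single in-memory set as above.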
We show that models trained using the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for evaluating pre-training datasets. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still maintain a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.
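The comparison metric is simply the arithmetic mean of per-benchmark scores for otherwise identical models trained on each dataset. A small sketch of that bookkeeping follows; the per-task numbers are made-up placeholders (chosen only so the gap lands near the reported 2.73 points), and the paper's actual benchmark list and scores are in the full text:

```python
def average_score(scores: dict[str, float]) -> float:
    """Mean score across a benchmark suite (zero- and few-shot mixed)."""
    return sum(scores.values()) / len(scores)

# Placeholder per-task scores -- the abstract reports only aggregate deltas.
gneissweb_scores = {"benchmark_a": 61.2, "benchmark_b": 47.8, "benchmark_c": 70.4}
fineweb_scores   = {"benchmark_a": 58.9, "benchmark_b": 45.1, "benchmark_c": 67.2}

delta = average_score(gneissweb_scores) - average_score(fineweb_scores)
print(f"average-score gap: {delta:.2f} points")  # -> 2.73
```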
Submission history
From: Bishwaranjan Bhattacharjee
[v1] Wed, 19 Feb 2025 00:14:29 UTC (5,613 KB)
[v2] Tue, 29 Jul 2025 21:19:06 UTC (2,396 KB)