GneissWeb: Preparing High Quality Data for LLMs at Scale

View a PDF of the paper titled GneissWeb: Preparing High Quality Data for LLMs at Scale, by Hajar Emami Gohari and 31 other authors
Abstract: The quantity and quality of data play a vital role in determining the performance of large language models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize to a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models.
In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe consists of exact sub-string deduplication and an ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens).
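The abstract only names the two stages of the recipe; the actual algorithms, sharding scheme, and filter ensemble are detailed in the full paper. Purely as an illustration of how such a two-stage pipeline fits together, here is a minimal Python sketch. The span length, quality cutoff, and scorer functions are hypothetical placeholders, and this sketch drops whole documents that share an exact span, whereas production exact sub-string deduplication (e.g., suffix-array based) typically removes only the duplicated spans:

```python
from hashlib import sha256

# Hypothetical knobs -- the abstract does not specify real values.
SPAN = 50                 # character window for the exact-substring check
QUALITY_THRESHOLD = 0.5   # ensemble score cutoff

def span_hashes(text: str, span: int = SPAN) -> set[bytes]:
    """Hash consecutive fixed-length character spans of a document."""
    if len(text) < span:
        return {sha256(text.encode()).digest()}
    return {sha256(text[i:i + span].encode()).digest()
            for i in range(0, len(text) - span + 1, span)}

def dedup_and_filter(docs, quality_scorers):
    """Drop documents sharing an exact span with earlier ones, then keep
    only documents whose average ensemble quality score clears the cutoff."""
    seen: set[bytes] = set()
    for doc in docs:
        hashes = span_hashes(doc)
        if hashes & seen:  # an exact substring was already observed
            continue
        seen |= hashes
        score = sum(s(doc) for s in quality_scorers) / len(quality_scorers)
        if score >= QUALITY_THRESHOLD:
            yield doc

# Toy usage: a single length-based scorer standing in for the real ensemble.
docs = ["some web page text " * 60, "some web page text " * 60, "short"]
clean = list(dedup_and_filter(docs, [lambda d: min(len(d) / 1000, 1.0)]))
```

At the 10-trillion-token scale the paper targets, such a pipeline would be sharded across many files so that deduplication state and filtering run in parallel rather than through a single in-memory set as above.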
We show that models trained using the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for evaluating pre-training datasets. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still maintain a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.
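The comparison metric is simply the arithmetic mean of per-benchmark scores for otherwise identical models trained on each dataset. A small sketch of that bookkeeping follows; the per-task numbers are made-up placeholders (chosen only so the gap lands near the reported 2.73 points), and the paper's actual benchmark list and scores are in the full text:

```python
def average_score(scores: dict[str, float]) -> float:
    """Mean score across a benchmark suite (zero- and few-shot mixed)."""
    return sum(scores.values()) / len(scores)

# Placeholder per-task scores -- the abstract reports only aggregate deltas.
gneissweb_scores = {"benchmark_a": 61.2, "benchmark_b": 47.8, "benchmark_c": 70.4}
fineweb_scores   = {"benchmark_a": 58.9, "benchmark_b": 45.1, "benchmark_c": 67.2}

delta = average_score(gneissweb_scores) - average_score(fineweb_scores)
print(f"average-score gap: {delta:.2f} points")  # -> 2.73
```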
Submission history
From: Bishwaranjan Bhattacharjee
[v1] Wed, 19 Feb 2025 00:14:29 UTC (5,613 KB)
[v2] Tue, 29 Jul 2025 21:19:06 UTC (2,396 KB)