Meet Yambda: The World’s Largest Event Dataset to Accelerate Recommender Systems

Yandex recently made a major contribution to the recommender systems community by releasing Yambda, the world's largest publicly available dataset for recommender system research and development. The dataset is designed to bridge the gap between academic research and industry-scale applications, offering nearly 5 billion anonymized user interaction events from Yandex Music, one of the company's flagship streaming services with more than 28 million monthly users.
Why It Matters: Addressing a Critical Data Gap in Recommender Systems
Recommender systems power the personalized experiences of many digital services today, from e-commerce and social networks to streaming platforms. These systems rely heavily on large volumes of behavioral data, such as clicks, likes, and listens, to infer user preferences and deliver tailored content.
However, the field of recommender systems has lagged behind other areas of AI, such as natural language processing, largely because of the scarcity of large, publicly accessible datasets. Unlike large language models (LLMs), which learn from publicly available text sources, recommender systems need sensitive behavioral data, which is commercially valuable and hard to anonymize. As a result, companies have traditionally guarded this data closely, limiting researchers' access to real-world datasets.
Existing datasets such as Spotify's Million Playlist Dataset, the Netflix Prize data, and Criteo's click logs are either too small, lack temporal detail, or are too poorly documented for developing production-grade recommenders. Yandex's release of Yambda addresses these challenges by providing a large-scale, high-quality dataset with a rich set of features and anonymization safeguards.
What Yambda Contains: Scale, Richness, and Privacy
The Yambda dataset comprises 4.79 billion anonymized user interactions collected over 10 months. These events come from roughly 1 million users interacting with nearly 9.4 million tracks on Yandex Music. The dataset includes:
- User interactions: Both implicit feedback (listens) and explicit feedback (likes, dislikes, and removals of likes and dislikes).
- Anonymized audio embeddings: Vector representations of tracks derived from convolutional neural networks, enabling models to leverage audio content similarity.
- Organic interaction flags: The "is_organic" flag indicates whether users discovered a track on their own or through recommendations, which facilitates behavioral analysis.
- Precise timestamps: Each event is timestamped to preserve chronological order, which is crucial for modeling sequential user behavior.
All user and track identifiers are replaced with numeric identifiers to comply with privacy standards, ensuring that no personally identifiable information is exposed.
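To make the shape of a single interaction concrete, here is a minimal sketch of an event record in plain Python. Only "is_organic" is named in this article; the other field names (uid, item_id, timestamp, event_type), the event-type strings, and the helper functions are hypothetical and chosen for illustration, not taken from the published schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    # All field names except is_organic are assumptions for illustration.
    uid: int          # anonymized numeric user identifier
    item_id: int      # anonymized numeric track identifier
    timestamp: int    # event time; the unit is an assumption here
    event_type: str   # hypothetical labels, e.g. "listen", "like", "dislike"
    is_organic: bool  # True if the user found the track without a recommendation


def organic_share(events: List[Event]) -> float:
    """Fraction of events that users reached on their own."""
    if not events:
        return 0.0
    return sum(e.is_organic for e in events) / len(events)


def in_time_order(events: List[Event]) -> List[Event]:
    """Sort events chronologically, as sequential models require."""
    return sorted(events, key=lambda e: e.timestamp)


events = [
    Event(uid=1, item_id=10, timestamp=300, event_type="listen", is_organic=True),
    Event(uid=1, item_id=11, timestamp=100, event_type="like", is_organic=False),
    Event(uid=2, item_id=10, timestamp=200, event_type="listen", is_organic=True),
    Event(uid=2, item_id=12, timestamp=400, event_type="dislike", is_organic=False),
]
```

Calling organic_share(events) on this toy list gives the fraction of self-discovered interactions, and in_time_order(events) restores chronological order before any sequence modeling.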
The dataset is provided in Apache Parquet format, which is optimized for big data processing frameworks such as Apache Spark and Hadoop and is also compatible with analytical libraries like Pandas and Polars. This makes Yambda accessible to researchers and developers working in a variety of environments.
Evaluation Method: Global Temporal Split
A key innovation in the Yambda dataset is the adoption of a Global Temporal Split (GTS) evaluation strategy. In typical recommender system research, the Leave-One-Out method is widely used: it holds out each user's most recent interaction for testing. However, this approach disrupts the temporal continuity of user interactions, creating unrealistic training conditions.
GTS, by contrast, splits the data by timestamp while preserving the entire sequence of events. This approach mirrors real-world recommendation scenarios because it prevents any future data from leaking into training and allows models to be tested on truly unseen, later interactions.
This temporally aware evaluation is essential for benchmarking algorithms under realistic constraints and understanding their practical effectiveness.
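The difference between the two strategies can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Yandex's evaluation code: events are hypothetical (user, item, timestamp) tuples, and the cutoff value is arbitrary.

```python
def global_temporal_split(events, cutoff):
    """Everything before the cutoff is training data; everything after is test."""
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test


def leave_one_out_split(events):
    """Hold out each user's most recent event, regardless of global time."""
    last = {}
    for e in events:
        user = e[0]
        if user not in last or e[2] > last[user][2]:
            last[user] = e
    held_out = set(last.values())
    train = [e for e in events if e not in held_out]
    test = [e for e in events if e in held_out]
    return train, test


# Hypothetical (user, item, timestamp) events.
events = [
    ("u1", "a", 1),
    ("u1", "b", 9),
    ("u1", "e", 10),
    ("u2", "c", 3),
    ("u2", "d", 4),
]

gts_train, gts_test = global_temporal_split(events, cutoff=5)
loo_train, loo_test = leave_one_out_split(events)
```

With the global split, every training event precedes every test event. With Leave-One-Out, the training set keeps u1's event at time 9 while u2's held-out event happened at time 4, so the model is trained on data from the "future" of that test case.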
Baseline Models and Metrics Included
To support benchmarking and accelerate innovation, Yandex provides baseline recommender models implemented on the dataset, including:
- MostPop: A popularity-based model that recommends the most popular items.
- DecayPop: A popularity model with time decay.
- ItemKNN: A neighborhood-based collaborative filtering method.
- iALS: Implicit Alternating Least Squares, a matrix factorization method.
- BPR: Bayesian Personalized Ranking, a pairwise ranking method.
- SANSA and SASRec: Sequence-aware models that leverage self-attention mechanisms.
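The two simplest baselines in the list can be sketched as follows. This is a minimal illustration, not the released implementation: the article only says DecayPop applies time decay, so the exponential half-life weighting, the function names, and the (user, item, timestamp) tuple layout are all assumptions.

```python
from collections import Counter, defaultdict


def most_pop(events, k):
    """Recommend the k items with the most interactions overall."""
    counts = Counter(item for _, item, _ in events)
    return [item for item, _ in counts.most_common(k)]


def decay_pop(events, now, half_life, k):
    """Popularity where each interaction's weight halves every `half_life`
    time units (exponential decay is an assumption, not the source's formula)."""
    scores = defaultdict(float)
    for _, item, ts in events:
        scores[item] += 0.5 ** ((now - ts) / half_life)
    return sorted(scores, key=lambda item: scores[item], reverse=True)[:k]


# Hypothetical (user, item, timestamp) events: "old" has more plays,
# but all of them happened long ago.
events = [
    ("u1", "old", 0),
    ("u2", "old", 0),
    ("u3", "old", 0),
    ("u1", "new", 100),
    ("u2", "new", 100),
]

top_overall = most_pop(events, k=1)
top_recent = decay_pop(events, now=100, half_life=10, k=1)
```

On this toy data MostPop picks "old" because it has three plays, while the decayed variant picks "new" because the older plays have lost almost all of their weight.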
These baselines are evaluated using standard recommendation metrics such as:
- NDCG@k (Normalized Discounted Cumulative Gain): Measures ranking quality, rewarding relevant items placed near the top of the list.
- Recall@k: The fraction of all relevant items that appear among the top k recommendations.
- Coverage@k: Indicates the diversity of recommendations across the catalog.
Providing these benchmarks helps researchers gauge the performance of new algorithms against established methods.
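For readers who want to see how these metrics are computed, here is a small sketch using common textbook definitions with binary relevance. Exact definitions vary between papers (for example, some normalize Recall@k by min(k, number of relevant items)), so treat the formulas and function names below as assumptions rather than the benchmark's official code.

```python
import math


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: hits are discounted by log2 of their position."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(all_recommendations, catalog_size, k):
    """Share of the catalog that appears in at least one user's top-k list."""
    seen = set()
    for recs in all_recommendations:
        seen.update(recs[:k])
    return len(seen) / catalog_size


recommended = ["a", "b", "c"]
relevant = {"a", "c"}

recall = recall_at_k(recommended, relevant, k=3)
ndcg = ndcg_at_k(recommended, relevant, k=3)
coverage = coverage_at_k([["a", "b"], ["b", "c"]], catalog_size=10, k=2)
```

In this example both relevant items are retrieved, so recall is 1.0, but NDCG falls below 1.0 because the second relevant item sits at position 3 instead of position 2.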
Broad Applicability Beyond Music Streaming
While the dataset originates from a music streaming service, its value extends well beyond that domain. The interaction types, user behavior dynamics, and large scale make Yambda a universal benchmark for recommender systems across sectors such as e-commerce, video platforms, and social networks. Algorithms validated on this dataset can be generalized or adapted to other recommendation tasks.
Benefits for Different Stakeholders
- Academia: Enables rigorous testing of new theories and algorithms at industrial scale.
- Startups and SMBs: Provides a resource comparable to what tech giants possess, leveling the playing field and accelerating the development of advanced recommendation engines.
- End users: Benefit indirectly from smarter recommendations that improve content discovery, reduce search time, and increase engagement.
My Wave: Yandex's Recommender System
Yandex Music uses a proprietary recommender system called My Wave, which incorporates deep neural networks and AI to personalize music suggestions. My Wave analyzes thousands of factors, including:
- User interaction sequences and listening history.
- Customizable preferences such as mood and language.
- Real-time music analysis of spectrograms, rhythm, vocal tone, frequency ranges, and genres.
This system adapts dynamically to individual tastes by identifying similarities in audio and preferences, illustrating the kind of complex recommendation pipeline that benefits from large-scale datasets such as Yambda.
Ensuring Privacy and Ethical Use
The release of Yambda underscores the importance of privacy in recommender system research. Yandex anonymizes all data using numeric identifiers and omits personally identifiable information. The dataset contains only interaction signals, without revealing exact user identities or sensitive attributes.
This balance between openness and privacy enables robust research while protecting individual user data, a critical consideration for the ethical advancement of AI technologies.
Access and Versions
Yandex offers the Yambda dataset in three sizes to accommodate different research and computational capacities:
- Full version: ~5 billion events.
- Medium version: ~500 million events.
- Small version: ~50 million events.
All versions are available via Hugging Face, a popular platform for hosting datasets and machine learning models, which allows easy integration into research workflows.
Conclusion
Yandex's release of the Yambda dataset marks a pivotal moment in recommender system research. By providing an unprecedented scale of anonymized interaction data paired with temporally aware evaluation and baselines, it sets a new standard for benchmarking and accelerating innovation. Researchers, startups, and enterprises alike can explore and develop recommender systems that better reflect real-world usage and deliver enhanced personalization.
As recommender systems continue to shape countless online experiences, datasets like Yambda play a fundamental role in pushing the boundaries of what AI-powered personalization can achieve.
Check out the Yambda dataset on Hugging Face.
Note: Thanks to the Yandex team for the thought leadership/resources for this article. The Yandex team has supported this content/article.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has more than 2 million monthly views, which shows its popularity among readers.
2025-06-02 07:31:00