
Beyond generic benchmarks: How YourBench lets enterprises evaluate AI models against actual data




Every new AI model release inevitably comes with charts touting how it outperformed its competitors on this benchmark test or that evaluation metric.

However, those benchmarks often test for general capabilities. For organizations that want to use models and LLM-based agents, it is harder to assess how well the agent or model actually understands their specific needs.

Model repository Hugging Face has launched YourBench, an open-source tool that lets developers and enterprises create their own benchmarks to test model performance against their internal data.

Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced YourBench on X. The feature offers “custom benchmarking and synthetic data generation from any of your documents. It’s a big step toward improving how model evaluations work.”

He added that Hugging Face knows that “for many use cases, what really matters is how well a model performs your specific task.”

Creating custom evaluations

Hugging Face said YourBench works by reproducing benchmark subsets “using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative performance rankings.”

Organizations need to pre-process their documents before YourBench can work. This involves three stages (a rough sketch follows the list):

  • Document ingestion to “normalize” file formats.
  • Semantic chunking to break documents down to fit context window limits and focus the model’s attention.
  • Document summarization.
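
YourBench’s own configuration and APIs are not shown here, but a minimal Python sketch of that three-stage flow might look like the following, with a hypothetical `llm` callable standing in for whichever model the user chooses:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List

@dataclass
class Chunk:
    doc_id: str
    text: str

def ingest(path: Path) -> str:
    """Normalize a source file into plain text.
    Only plain-text files are handled here; a real pipeline would also
    convert PDFs, HTML, Word documents and other formats."""
    return path.read_text(encoding="utf-8", errors="ignore")

def chunk_document(doc_id: str, text: str, max_chars: int = 4000) -> List[Chunk]:
    """Split a document into pieces small enough for a model's context window.
    True semantic chunking would split on topic boundaries rather than
    fixed character counts."""
    return [Chunk(doc_id, text[i:i + max_chars]) for i in range(0, len(text), max_chars)]

def summarize(text: str, llm: Callable[[str], str]) -> str:
    """Ask the user's chosen LLM for a short summary of the document."""
    return llm(f"Summarize the following document in a few sentences:\n\n{text[:8000]}")
```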

After that comes question-and-answer generation, which creates test questions from the information in the documents. This is where the user brings in their chosen LLM to find out which model best answers those questions.
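
As a rough illustration rather than YourBench’s actual interface, generating question/answer pairs from those chunks with a user-supplied model could look like this, where `llm` is again a placeholder callable:

```python
def generate_qa_pairs(chunks, llm, questions_per_chunk: int = 3):
    """Draft question/answer pairs grounded in each chunk using the caller's chosen LLM."""
    qa_pairs = []
    for piece in chunks:
        prompt = (
            f"Read the passage below and write {questions_per_chunk} factual questions "
            "with their answers, one pair per line in the form 'Q: ... | A: ...'.\n\n"
            f"{piece.text}"
        )
        for line in llm(prompt).splitlines():
            if "|" in line:
                question, answer = line.split("|", 1)
                qa_pairs.append({
                    "question": question.removeprefix("Q:").strip(),
                    "answer": answer.removeprefix("A:").strip(),
                    "source": piece.doc_id,
                })
    return qa_pairs
```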

Hugging Face tested YourBench with DeepSeek V3 and R1, Alibaba’s Qwen models including the Qwen QwQ reasoning model, Mistral Large 2411, Mistral Small 3.1, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, Claude 3.7 Sonnet and Claude 3.5 Haiku.

Shashidhar said Hugging Face also ran a cost analysis on the models and found that Qwen and Gemini 2.0 Flash “produce huge value for very low costs.”
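
The sub-$15 figure cited above ultimately comes down to token volumes multiplied by per-token prices, so a back-of-the-envelope estimate for a benchmark-generation run can be sketched as follows; the chunk counts and prices are illustrative placeholders, not real rates:

```python
def estimate_run_cost(num_chunks: int,
                      input_tokens_per_chunk: int,
                      output_tokens_per_chunk: int,
                      usd_per_million_input: float,
                      usd_per_million_output: float) -> float:
    """Rough inference cost for one question-generation pass over all chunks."""
    input_tokens = num_chunks * input_tokens_per_chunk
    output_tokens = num_chunks * output_tokens_per_chunk
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000

# Example: 500 chunks of ~2,000 input tokens producing ~300 output tokens each,
# priced at hypothetical rates of $0.50 / $1.50 per million tokens.
print(f"Estimated cost: ${estimate_run_cost(500, 2000, 300, 0.50, 1.50):.2f}")
```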

Compute limitations

However, creating custom LLM benchmarks from an organization’s documents comes at a price. YourBench needs a lot of compute power to run. On X, Shashidhar said the company is “adding capacity” as fast as it could.

Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat has reached out to Hugging Face about its compute usage.

Benchmarking isn’t perfect

Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not capture how the models will work day to day.

Some have even expressed skepticism that benchmark tests reveal models’ limitations and can lead to false conclusions about their safety and performance. One study also warned that benchmarking agents could be “misleading.”

However, enterprises cannot avoid evaluating models now that there are so many choices on the market, and technology leaders must justify the rising cost of using AI models. This has led to a variety of methods for testing model performance and reliability.

Google DeepMind introduced FACTS Grounding, which tests a model’s ability to generate factually accurate responses based on information from documents. Some Yale and Tsinghua University researchers developed self-invoking code benchmarks to guide enterprises on which coding LLMs work for them.



