Measuring perception in AI models

A new benchmark for evaluating multimodal systems based on real-world video, audio and text

From the Turing test to ImageNet, benchmarks have played an instrumental role in shaping artificial intelligence (AI) by helping to define research goals and allowing researchers to measure progress towards those goals. Incredible breakthroughs in the past ten years, such as AlexNet in computer vision and AlphaFold in protein folding, have been closely linked to the use of benchmark datasets, which allow researchers to rank model design and training choices and iterate to improve their models. As we work towards the goal of building artificial general intelligence (AGI), developing robust and effective benchmarks that probe the capabilities of AI models is as important as developing the models themselves.

Perception – the process of experiencing the world through the senses – is an important part of intelligence. Building agents with human-level perceptual understanding of the world is a central but challenging task, one of increasing importance in robotics, self-driving cars, personal assistants, medical imaging, and more. So today, we are introducing the Perception Test, a multimodal benchmark that uses real-world videos to help evaluate the perception capabilities of a model.

Developing a perception benchmark

Many perception-related benchmarks are currently in use across AI research, such as Kinetics for video action recognition, AudioSet for audio event classification, MOT for object tracking, and VQA for image question-answering. These benchmarks have enabled amazing progress in how AI model architectures and training methods are built and developed, but each targets only a restricted aspect of perception: image benchmarks exclude temporal aspects; visual question-answering tends to focus on high-level semantic scene understanding; object tracking tasks generally capture the lower-level appearance of individual objects, like colour or texture. And very few benchmarks define tasks over both audio and visual modalities.

Multimodal models, such as Perceiver, Flamingo, or BEiT-3, aim to be more general models of perception. But their evaluations have been based on multiple specialised datasets because no dedicated benchmark was available. This process is slow and expensive, and provides incomplete coverage of general perception abilities like memory, making it difficult for researchers to compare methods.

To address many of these issues, we created a dataset of purposefully designed videos of real-world activities, labelled according to six different types of tasks (a sketch of how such samples might be represented in code follows the list):

  1. Object tracking: a box is provided around an object early in the video; the model must return a full track throughout the whole video (including through occlusions).
  2. Point tracking: a point is selected early in the video; the model must track the point throughout the video (also through occlusions).
  3. Temporal action localisation: the model must temporally localise and classify a predefined set of actions.
  4. Temporal sound localisation: the model must temporally localise and classify a predefined set of sounds.
  5. Multiple-choice video question-answering: textual questions about the video, each with three choices from which to select the answer.
  6. Grounded video question-answering: textual questions about the video; the model needs to return one or more object tracks.
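
To make these task formats concrete, here is a minimal sketch of how a single sample and its labels might be represented. This is purely illustrative: the field names, `BoxTrack`, and `PerceptionSample` are hypothetical and not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BoxTrack:
    """A bounding-box track for one object across annotated frames."""
    object_id: str
    # One (frame_index, x_min, y_min, x_max, y_max) entry per frame.
    boxes: list = field(default_factory=list)


@dataclass
class PerceptionSample:
    """One benchmark example: a video/audio clip plus a task specification."""
    video_path: str                    # real-world clip, ~23 s on average
    audio_path: str                    # accompanying audio track
    task_type: str                     # e.g. "object_tracking", "mc_video_qa"
    question: Optional[str] = None     # set for the question-answering tasks
    options: Optional[list] = None     # three choices for multiple-choice QA
    box_tracks: list = field(default_factory=list)  # tracking / grounded QA labels
```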

Inspired by the way children's perception is assessed in developmental psychology, as well as by synthetic datasets like CATER and CLEVRER, we designed 37 video scripts, each with different variations to ensure a balanced dataset. Each variation was filmed by at least a dozen crowd-sourced participants (similar to previous work on Charades and Something-Something), with a total of more than 100 participants, resulting in 11,609 videos, averaging 23 seconds long.

The videos show simple games or daily activities, which allow us to define tasks that require the following skills to solve:

  • Knowledge of semantics: testing aspects like task completion, recognition of objects, actions, or sounds.
  • Understanding of physics: collisions, motion, occlusions, spatial relations.
  • Temporal reasoning or memory: temporal ordering of events, counting over time, detecting changes in a scene.
  • Abstraction abilities: shape matching, same/different notions, pattern detection.

Crowd-sourced participants labelled the videos with spatial and temporal annotations (bounding-box tracks around objects, point tracks, action segments, sound segments). Our research team designed the questions for each script type for the multiple-choice and grounded video question-answering tasks to ensure good diversity of skills tested, for example, questions that probe the ability to reason counterfactually or to provide explanations for a given situation. The corresponding answers for each video were again provided by crowd-sourced participants.

Evaluating multimodal systems with the Perception Test

We assume that models have been pre-trained on external datasets and tasks. The Perception Test includes a small fine-tuning set (20%) that model creators can optionally use to convey the nature of the tasks to the models. The remaining data (80%) consists of a public validation split and a held-out test split where performance can only be evaluated via our evaluation server.
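
As a rough illustration of this protocol, the sketch below encodes the split usage described above; the split names and the `allowed_for_training` helper are assumptions for illustration, not the benchmark's API.

```python
# Hypothetical split names; only the 20% fine-tuning fraction is stated above,
# so the validation/test proportions are left unspecified here.
SPLITS = {
    "fine_tuning": "20% of the data; optional, conveys the nature of the tasks",
    "validation": "public labels, part of the remaining 80%",
    "test": "held-out; scored only via the evaluation server",
}


def allowed_for_training(split: str) -> bool:
    """Under the intended protocol, only the fine-tuning split may be used
    to adapt a pre-trained model to the task formats."""
    return split == "fine_tuning"
```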

Here we show a diagram of the evaluation setup: the inputs are a video and audio sequence, plus a task specification. The task can be in high-level text form for visual question-answering, or as low-level input, such as the coordinates of an object's bounding box for the object tracking task.
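
A model interface matching this setup might look like the following sketch; the `PerceptionModel` protocol and its argument names are hypothetical, and simply mirror the inputs (video, audio, task specification) and the two kinds of outputs (text answers or box coordinates) described above.

```python
from typing import Any, Protocol, Sequence


class PerceptionModel(Protocol):
    """Illustrative interface: video + audio + task spec in, prediction out."""

    def predict(
        self,
        video_frames: Sequence[Any],      # decoded RGB frames
        audio_waveform: Sequence[float],  # audio samples
        task_spec: dict,                  # e.g. {"type": "mc_video_qa",
                                          #       "question": "...", "options": [...]}
    ) -> dict:
        # Returns e.g. {"answer_index": 1} for multiple-choice QA, or
        # {"boxes": [(frame, x0, y0, x1, y1), ...]} for object tracking.
        ...
```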

The evaluation results are detailed across several dimensions, and we measure abilities across the six computational tasks. For the visual question-answering tasks, we also provide a mapping of questions across the types of situations shown in the videos and the types of reasoning required to answer the questions, for a more detailed analysis (see our paper for more details). An ideal model would maximise scores across all radar plots and all dimensions. This is a detailed assessment of a model's skills, allowing us to narrow down areas of improvement.
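
As one concrete example of per-task scoring, the sketch below computes a mean intersection-over-union for a predicted box track against the ground truth. This is a plausible metric for the tracking tasks, not necessarily the one the benchmark uses; see the paper for the actual metrics.

```python
def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_mean_iou(pred, gt):
    """Mean IoU over frames annotated in both tracks.

    `pred` and `gt` map frame index -> (x0, y0, x1, y1).
    """
    common = [f for f in pred if f in gt]
    if not common:
        return 0.0
    return sum(box_iou(pred[f], gt[f]) for f in common) / len(common)
```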

Ensuring the diversity of participants and scenes shown in the videos was critical when developing the benchmark. To do this, we selected participants from different countries, of different ethnicities and genders, and aimed to have a diverse representation within each type of video script.

Learn more about the Perception Test

The Perception Test benchmark is publicly available here, and further details are available in our paper. A leaderboard and a challenge server will be available soon too.

On October 23, 2022, we are hosting a workshop about general perception models at the European Conference on Computer Vision in Tel Aviv (ECCV 2022), where we will discuss our approach, and how to design and evaluate general perception models, with other leading experts in the field.

We hope the Perception Test will inspire and guide further research towards general perception models. Going forward, we hope to collaborate with the multimodal research community to introduce additional annotations, tasks, metrics, or even new languages to the benchmark.

Get in touch by emailing test@google.com if you're interested in contributing!
