Gemma Scope: helping the safety community shed light on the inner workings of language models

Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.
To create an artificial intelligence (AI) language model, researchers build a system that learns from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research field focused on deciphering these inner workings. Researchers in this field use sparse autoencoders as a kind of “microscope” that lets them look inside a language model and get a better sense of how it works.
Today, we are announcing Gemma Scope, a new collection of tools to help researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a set of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 2B and 9B. We are also open sourcing Mishax, a tool we built that enabled much of the interpretability work behind Gemma Scope.
We hope today's release enables more ambitious interpretability research. Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents such as deception or manipulation.
Try our interactive Gemma Scope demo, courtesy of Neuronpedia.
Interpreting what happens inside a language model
When you ask a language model a question, it turns your text input into a series of “activations”. These activations map the relationships between the words you have entered, helping the model make connections between different words, which it uses to write an answer.
As the model processes the text input, activations at different layers in its neural network represent multiple increasingly advanced concepts, known as “features”.
For example, a model's early layers might learn to recall facts such as that Michael Jordan plays basketball, while later layers may recognize more complex concepts such as the factuality of a text.
A stylised representation of using a sparse autoencoder to interpret a model's activations as it recalls the fact that the City of Light is Paris. We see that French-related concepts are present, while unrelated ones are not.
However, interpretability researchers face a key problem: the model's activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that features in a neural network's activations would line up with individual neurons, i.e. nodes of information. Unfortunately, in practice, neurons are active for many unrelated features. This means there is no obvious way to tell which features are part of a given activation.
This is where sparse autoencoders come in.
A given activation will only be a mixture of a small number of features, even though the language model is likely capable of detecting millions or even billions of them – that is, the model uses features sparsely. For example, a language model will consider relativity when responding to a question about Einstein, and consider eggs when writing about omelettes, but probably will not consider relativity when writing about omelettes.
Sparse autoencoders exploit this fact to discover a set of possible features and break each activation down into a small number of them. Researchers hope that the best way for a sparse autoencoder to accomplish this task is to find the actual underlying features that the language model uses.
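To make the idea concrete, here is a minimal sketch of a sparse autoencoder's encode/decode data flow in plain NumPy. This is not the Gemma Scope implementation: the dimensions, parameter names (W_enc, W_dec, b_enc, b_dec) and random weights are illustrative stand-ins for trained values, and a real SAE is trained with a reconstruction loss plus a sparsity penalty so that only a handful of features are active for any given activation.

```python
import numpy as np

# Minimal sketch of a sparse autoencoder's data flow (not the Gemma Scope
# code). Shapes, names and random weights are illustrative stand-ins for
# trained parameters.
rng = np.random.default_rng(0)

d_model = 16      # width of the model activation being decomposed
n_features = 64   # number of candidate features the SAE can represent

W_enc = rng.normal(size=(d_model, n_features)) * 0.1
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model)) * 0.1
b_dec = np.zeros(d_model)

def encode(activation: np.ndarray) -> np.ndarray:
    # Project the activation onto candidate features; the ReLU zeroes out
    # features with negative pre-activations. In a trained SAE, a sparsity
    # penalty pushes most of the remaining entries to zero as well.
    return np.maximum(activation @ W_enc + b_enc, 0.0)

def decode(features: np.ndarray) -> np.ndarray:
    # Rebuild the activation as a weighted sum of feature directions
    # (the rows of W_dec).
    return features @ W_dec + b_dec

activation = rng.normal(size=d_model)   # one stand-in model activation
features = encode(activation)           # feature activations (sparse once trained)
reconstruction = decode(features)       # approximation of the original activation

print("active features:", int(np.count_nonzero(features)))
print("reconstruction error:", float(np.linalg.norm(activation - reconstruction)))
```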
Crucially, at no point in this process do we – the researchers – tell the sparse autoencoder which features to look for. As a result, we are able to discover rich structures that we did not predict. However, because we do not immediately know the meaning of the discovered features, we look for meaningful patterns in examples of text where the sparse autoencoder says the feature “fires”.
Here is an example in which the tokens where the feature fires are highlighted in gradients of blue according to their strength:
Example activations for a feature found by our sparse autoencoders. Each bubble is a token (a word or word fragment), and the varying blue colour shows how strongly the feature is present. In this case, the feature appears to be related to idioms.
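As a rough illustration of that workflow, the sketch below sorts tokens by how strongly one feature fires on them, which is essentially how such visualisations are assembled. The tokens and activation values are fabricated for the example; in practice the SAE is run over a large corpus and the top-activating snippets are inspected.

```python
from typing import List, Tuple

def top_firing_tokens(
    tokens: List[str],
    feature_activations: List[float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k tokens on which this feature fires most strongly."""
    pairs = sorted(
        zip(tokens, feature_activations), key=lambda pair: pair[1], reverse=True
    )
    return pairs[:k]

# Fabricated per-token activations for a single feature, chosen so the
# hypothetical feature looks idiom-related, as in the figure above.
tokens = ["it", "was", "raining", "cats", "and", "dogs", "outside"]
activations = [0.0, 0.0, 0.4, 2.1, 1.8, 2.3, 0.1]

for token, strength in top_firing_tokens(tokens, activations, k=3):
    print(f"{token!r} fires at {strength:.2f}")
```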
What makes Gemma Scope unique
Prior research with sparse autoencoders has mainly focused on investigating the inner workings of tiny models or a single layer in larger models. But more ambitious interpretability research involves decoding layered, complex algorithms in larger models.
We trained sparse autoencoders at every layer and sublayer output of Gemma 2 2B and 9B to build Gemma Scope, producing more than 400 sparse autoencoders with more than 30 million learned features in total (though many features likely overlap). This tool will enable researchers to study how features evolve throughout the model, and how they interact and compose to make more complex features.
Gemma Scope is also trained with our new, state-of-the-art JumpReLU SAE architecture. The original sparse autoencoder architecture struggled to balance the twin goals of detecting which features are present and estimating their strength. The JumpReLU architecture makes it easier to strike this balance appropriately, significantly reducing error.
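Concretely, a JumpReLU replaces the plain ReLU with a per-feature learned threshold: a feature is only considered present if its pre-activation clears the threshold, and when it does, its value passes through unchanged. The following sketch shows that behaviour with made-up thresholds and pre-activations; the trained Gemma Scope SAEs learn these thresholds from data.

```python
import numpy as np

def jumprelu(pre_activations: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # Zero out any feature whose pre-activation does not clear its learned
    # threshold theta; above the threshold the raw value passes through,
    # so detecting a feature and estimating its strength are decoupled.
    return np.where(pre_activations > theta, pre_activations, 0.0)

# Made-up values: four features, each with its own threshold.
pre_acts = np.array([0.05, 0.4, -0.2, 1.3])
theta = np.array([0.10, 0.10, 0.10, 0.50])

print(jumprelu(pre_acts, theta))  # [0.  0.4 0.  1.3]
# A plain ReLU would also have kept the weak 0.05 activation.
```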
Training so many sparse autoencoders was a significant engineering challenge, requiring a lot of computing power. We used about 15% of the training compute of Gemma 2 9B (excluding the compute used to generate distillation labels), saved about 20 pebibytes (PiB) of activations to disk (roughly a million copies of English Wikipedia), and produced hundreds of billions of sparse autoencoder parameters in total.
Pushing the field forward
In releasing Gemma Scope, we hope to make Gemma 2 the best model family for open mechanistic interpretability research and to accelerate the community's work in this field.
So far, the interpretability community has made great progress in understanding small models with sparse autoencoders and in developing relevant techniques, such as causal interventions, automatic circuit analysis, feature interpretation, and evaluating sparse autoencoders. With Gemma Scope, we hope to see the community scale these techniques to modern models, analyse more complex capabilities such as chain of thought, and find real-world applications of interpretability, such as tackling problems like hallucinations and jailbreaks that only arise with larger models.
Acknowledgements
Gemma Scope was a collective effort by Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nic Sonnerat, Vikrant Varma, Janos Kramar and Neel Nanda, advised by Rohin Shah and Anca Dragan. We would like to thank Johnny Lin, Joseph Bloom and Curt Tigges at Neuronpedia for their help with the interactive demo. We are grateful for the help and contributions of Phoebe Kirk, Andrew Forbes, Arielle Bier, Aliya Ahmad, Yotam Doron, Tris Warkentin, Ludovic Peran, Kat Black, Anand Rao, Meg Risdal, Samuel Albanie, Dave Orr, Matt Miller, Tobi Ijitoye, Alex Tomala, Javier Ferrando, Oscar Obeso, Kathleen Kenealy, Joe Fernandez, Omar Sanseviero and Glenn Cameron.