Train a Model Faster with torch.compile and Gradient Accumulation

Training a language model using a deep transformer architecture takes a long time. However, there are techniques you can use to speed up training. In this article you will learn about:

  • Use torch.compile() to speed up the model
  • Use gradient accumulation to train a model with a larger effective batch size

Let’s get started!

Train a Model Faster with torch.compile and Gradient Accumulation
Photo by François Guenon. Some rights reserved.

Overview

This article is divided into two parts; they are:

  • Use torch.compile()
  • Gradient accumulation

Using torch.compile

When you write and run your model code in PyTorch, the code is executed in eager mode. This means the code is executed line by line and the results are kept in memory. This is natural for Python because it is an interpreted language. You can tell this is the case because when you make a mistake in your code, you won’t see the error until that line of code actually runs.

Running the model in eager mode is slow. Starting with PyTorch 2.0, you can use torch.compile() to compile a model for better performance. This creates a new, optimized model object. It is not the same object as the nn.Module you created it from, but it shares the same tensors with the original model. You can use this compiled model for forward and backward passes and optimizer updates as usual.

Building a model and compiling it into a computational graph is how TensorFlow 1.0 was designed to work. This makes debugging harder, since the model you run may no longer correspond line by line to the code you wrote. Therefore, you should only compile your model after you have run it in eager mode and confirmed it is error-free.

Not all models can be compiled. However, if your model supports compilation, you will immediately benefit from the speedup. To compile a model, all you have to do is replace the model object right before you are ready to use it.
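Below is a minimal sketch of what that looks like; a tiny MLP stands in for an actual transformer model.

import torch
import torch.nn as nn

# Build and verify the model in eager mode first (a tiny MLP as a stand-in)
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

# Replace the model object with its compiled counterpart right before use
model = torch.compile(model)

# The compiled model is used exactly like the original
x = torch.randn(8, 64)
output = model(x)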

Do not load the model weights after compilation. The compiled model is an object that shares the same weights as the original model, and during compilation the computational graph is generated by referring to the weight tensors of the original model. If you load weights after compilation, the model may not work as expected.
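In other words, restore any checkpoint into the eager-mode model first and compile afterwards. A sketch of that ordering, with a placeholder checkpoint path:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

# Load the saved weights into the eager-mode model first...
state_dict = torch.load("checkpoint.pth", map_location="cpu")  # placeholder path
model.load_state_dict(state_dict)

# ...then compile, so the graph is built against the restored weight tensors
model = torch.compile(model)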

Likewise, to save the compiled model, you should fall back to the original model’s state dict.
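A sketch of how that can be done, continuing from the snippet above and using a placeholder filename:

# A compiled model keeps the original module under the _orig_mod attribute;
# getattr() falls back to the model itself when that attribute is absent
original = getattr(model, "_orig_mod", model)
torch.save(original.state_dict(), "model.pth")  # placeholder filename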

The original model can be accessed from the compiled model via model._orig_mod. In the code above, we use getattr(model, "_orig_mod", model) to get the original model if it exists, or fall back to model itself if it does not. This line works with both compiled and uncompiled models.

Gradient accumulation

When you train a model, you typically spend two to three times as long on the backward pass as on the forward pass. This is because the backward pass is more computationally intensive and uses more memory.

One easy trick to speed up training is to run fewer backward passes. This can be achieved by increasing the batch size: with the same number of data samples, a larger batch size means fewer batches to process.

However, a larger batch size requires more memory. In a memory-constrained environment, you can simulate a larger batch size by running multiple forward passes and accumulating the gradients. This is called gradient accumulation.

It’s easier to explain this idea with code:
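The sketch below is self-contained so it can be run as-is: a tiny MLP, synthetic data, and an AdamW optimizer stand in for the actual Llama model and dataset, and accumulate_steps is set to an assumed value of 4.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and synthetic data, just to make the example runnable
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))
dataloader = DataLoader(dataset, batch_size=8)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

accumulate_steps = 4  # number of small batches to accumulate per optimizer update

model.train()
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    # Forward pass; divide the loss so the accumulated gradient matches what
    # a single batch of size (batch_size * accumulate_steps) would produce
    logits = model(inputs)
    loss = loss_fn(logits, targets) / accumulate_steps

    # backward() adds the new gradients on top of whatever is already stored
    loss.backward()

    # Update the parameters only once every accumulate_steps iterations
    if (step + 1) % accumulate_steps == 0:
        optimizer.step()
        optimizer.zero_grad()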

The training loop above mirrors the one from the previous article on training a Llama model on your local GPU.

Normally, when you run a forward pass, you compute the loss. Then you call loss.backward() to propagate the loss gradient to the model parameters. In PyTorch, the backward() method is cumulative, which means gradients are added to whatever is already stored. Therefore, you need to call optimizer.zero_grad() explicitly to clear the gradients before running the backward pass.
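A toy illustration of this accumulation behavior (not part of the original training code):

import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()
print(w.grad)  # tensor(3.)

# A second backward pass without zeroing: the gradients add up
(w * 5).backward()
print(w.grad)  # tensor(8.), i.e. 3 + 5, not 5

# Clearing the gradient first, as optimizer.zero_grad() does for each parameter
w.grad = None
(w * 5).backward()
print(w.grad)  # tensor(5.)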

In the code above, you intentionally do not call optimizer.zero_grad() on every iteration. Instead, you backpropagate the loss divided by accumulate_steps. In this way, the gradients are scaled down but accumulated over accumulate_steps iterations. Once every accumulate_steps iterations, you run the optimizer to adjust the model parameters.

This approach produces results similar to using a larger batch size. However, since you are running fewer optimizer updates, the learning rate schedule should be adjusted accordingly. This means you need to configure the scheduler with a different number of steps.
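For example, with a cosine annealing schedule (the scheduler type and the counts here are assumptions, reusing the optimizer and dataloader from the sketch above):

import torch

num_epochs = 3                       # placeholder value
batches_per_epoch = len(dataloader)  # 32 batches in the sketch above
accumulate_steps = 4

# The scheduler advances once per optimizer update, not once per batch,
# so its horizon is the number of batches divided by accumulate_steps
total_updates = (num_epochs * batches_per_epoch) // accumulate_steps
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_updates)

# Inside the training loop, call scheduler.step() right after optimizer.step()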

Further reading

Here are some materials you might find interesting:

Summary

In this article, you learned that torch.compile() can help you speed up a model by compiling its computational graph. You also learned that gradient accumulation is a technique for training with a larger effective batch size by accumulating gradients from multiple small batches. Since you run fewer optimizer updates this way, you save time on the optimizer steps and parameter updates.
