A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text
In this coding implementation, we build a Regression Language Model (RLM), a model that predicts continuous numerical values directly from text sequences. Instead of classifying or generating text, we focus on training a transformer-based architecture that learns quantitative relationships hidden in natural-language descriptions. We start by generating synthetic text-to-number data, tokenizing it efficiently, and then training a lightweight transformer encoder to map linguistic signals to real-valued targets. By the end, we not only understand how an RLM can be built from scratch but also visualize its learning behavior and test it on unseen examples. Check out the full codes here.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re
torch.manual_seed(42)
np.random.seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("🚀 Regression Language Model (RLM) Tutorial")
print("=" * 60)
We begin by importing the core libraries, PyTorch, NumPy, and Matplotlib, to build and visualize our Regression Language Model. We set random seeds and select the compute device to ensure reproducibility and a consistent setup, so the tutorial produces the same results on every run. Check out the full codes here.
def generate_synthetic_data(n_samples=2000):
    """Generate synthetic text-to-number regression data"""
    templates = [
        ("The temperature is {} degrees", lambda x: x),
        ("I rate this {} out of ten", lambda x: x),
        ("The price is {} dollars", lambda x: x),
        ("Confidence level: {}", lambda x: x / 100),
        ("Speed of {} kilometers per hour", lambda x: x / 10),
        ("{} percent complete", lambda x: x / 100),
        ("Scored {} points in the game", lambda x: x / 10),
        ("The distance is {} meters", lambda x: x),
    ]
    data = []
    for _ in range(n_samples):
        template, transform = templates[np.random.randint(len(templates))]
        value = np.random.uniform(0, 100)
        text = template.format(round(value, 1))
        target = transform(value)
        data.append((text, target))
    return data
We create a synthetic dataset that pairs natural-language sentences with corresponding numerical values. Using a variety of templates, such as temperatures, ratings, and percentages, we ensure the model learns several different text-to-number mappings. This controlled setting lets us simulate realistic regression tasks without relying on external data. Check out the full codes here.
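As a quick sanity check (an optional addition, not part of the original script), we can print a few generated pairs to confirm that each template transforms the raw value into its target as expected:

sample_data = generate_synthetic_data(3)
for text, target in sample_data:
    # Each sample is a (sentence, continuous target) pair, e.g. ("The price is 42.3 dollars", 42.3)
    print(f"Text: {text!r} -> Target: {target:.3f}")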
class SimpleTokenizer:
    def __init__(self):
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.vocab_size = 2
    def fit(self, texts):
        """Build vocabulary from texts"""
        words = []
        for text in texts:
            words.extend(re.findall(r'\w+|[^\w\s]', text.lower()))
        word_counts = Counter(words)
        for word, _ in word_counts.most_common():
            if word not in self.word2idx:
                self.word2idx[word] = self.vocab_size
                self.idx2word[self.vocab_size] = word
                self.vocab_size += 1
    def encode(self, text, max_len=20):
        """Convert text to token indices"""
        words = re.findall(r'\w+|[^\w\s]', text.lower())
        indices = [self.word2idx.get(w, 1) for w in words]
        if len(indices) < max_len:
            indices += [0] * (max_len - len(indices))
        else:
            indices = indices[:max_len]
        return indices
We design a simple tokenizer that converts raw text into numerical tokens the model can process. It builds a vocabulary from all unique words, maps each word to an index, and handles unknown words and padding automatically. This step guarantees that our textual inputs become consistent, model-readable sequences for training. Check out the full codes here.
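For illustration, here is a minimal sketch (not in the original script) of the tokenizer in action, assuming the <PAD> index is 0 and the <UNK> index is 1 as defined above:

demo_tokenizer = SimpleTokenizer()
demo_tokenizer.fit(["The temperature is 25.5 degrees"])
# Known words map to their learned indices; unseen words fall back to the <UNK> index (1),
# and short sentences are padded with 0 up to max_len.
print(demo_tokenizer.encode("The temperature is unknownword degrees", max_len=10))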
class RLMDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=20):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        text, target = self.data[idx]
        tokens = self.tokenizer.encode(text, self.max_len)
        return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)
class RegressionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2,
                 dropout=0.1, max_len=20):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(64, 1)
        self.max_len = max_len
    def forward(self, x):
        batch_size, seq_len = x.shape
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        token_embed = self.token_embedding(x)
        pos_embed = self.position_embedding(positions)
        embeddings = token_embed + pos_embed
        padding_mask = (x == 0)
        encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
        mask_expanded = (~padding_mask).unsqueeze(-1).float()
        summed = (encoded * mask_expanded).sum(dim=1)
        pooled = summed / mask_expanded.sum(dim=1)
        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.dropout(x)
        output = self.fc2(x)
        return output
We wrap our text-target pairs in a PyTorch Dataset, where we tokenize each sentence and return tensors ready for batching. We then build the transformer-based RLM: token and positional embeddings flow through a multi-layer transformer encoder, we mean-pool the non-padded tokens, and feed the result into a small MLP head for regression. In effect, the encoder learns the numerical semantics of the language, while the head maps them to a single continuous value. Check out the full codes here.
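Before training, a small shape check (our own addition, not in the original tutorial) helps confirm the architecture wires together: a batch of token IDs of shape (batch, max_len) should come out as predictions of shape (batch, 1):

demo_model = RegressionLanguageModel(vocab_size=50)
dummy_tokens = torch.randint(0, 50, (4, 20))  # 4 sentences, 20 token IDs each
with torch.no_grad():
    print(demo_model(dummy_tokens).shape)     # expected: torch.Size([4, 1])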
def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    print(f"\n📊 Training on {device}")
    print("-" * 60)
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for tokens, targets in train_loader:
            tokens, targets = tokens.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(tokens)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for tokens, targets in val_loader:
                tokens, targets = tokens.to(device), targets.to(device)
                outputs = model(tokens)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
    return train_losses, val_losses
We train the model using Adam and an MSE loss, on the GPU when available, iterating over mini-batches to compute gradients and update the weights. We switch to evaluation mode to validate at the end of each epoch, track the training and validation losses, and print progress so we can watch the learning dynamics in real time. Check out the full codes here.
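Because MSE is reported in squared units, it can also help to look at an error in the target's own scale. The helper below is our own addition (not part of the original code) and computes the mean absolute error over any DataLoader:

def evaluate_mae(model, loader):
    """Mean absolute error of the model's predictions over a DataLoader."""
    model.eval()
    abs_err, n = 0.0, 0
    with torch.no_grad():
        for tokens, targets in loader:
            tokens, targets = tokens.to(device), targets.to(device)
            preds = model(tokens)
            abs_err += (preds - targets).abs().sum().item()
            n += targets.numel()
    return abs_err / n

After train_rlm finishes, calling evaluate_mae(model, val_loader) gives a single, easy-to-read error figure for the validation set.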
print("\n📝 Generating synthetic data...")
data = generate_synthetic_data(2000)
split_idx = int(0.8 * len(data))
train_data, val_data = data[:split_idx], data[split_idx:]
print(f"Train samples: {len(train_data)}, Val samples: {len(val_data)}")
print("\n🔤 Building tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.fit([text for text, _ in train_data])
print(f"Vocabulary size: {tokenizer.vocab_size}")
train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
print("\n🏗️ Building Regression Language Model...")
model = RegressionLanguageModel(vocab_size=tokenizer.vocab_size).to(device)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
train_losses, val_losses = train_rlm(model, train_loader, val_loader)
plt.figure(figsize=(10, 4))
plt.plot(train_losses, label="Train Loss", linewidth=2)
plt.plot(val_losses, label="Val Loss", linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('RLM Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print("\n🎯 Testing Predictions:")
print("-" * 60)
test_examples = [
"The temperature is 25.5 degrees",
"I rate this 8.0 out of ten",
"The price is 45.0 dollars",
"75.0 percent complete"
]
model.eval()
with torch.no_grad():
    for text in test_examples:
        tokens = torch.tensor([tokenizer.encode(text)]).to(device)
        prediction = model(tokens).item()
        print(f"Input: {text}")
        print(f"Predicted value: {prediction:.4f}\n")
print("✅ RLM Tutorial Complete!")
We generate and split the synthetic data, fit our tokenizer, wrap everything in PyTorch datasets and DataLoaders, and build the transformer-based RLM. We train the model, plot the loss curves to verify learning, and then run a few natural-language test prompts to see the predicted continuous values. With that, we complete the RLM pipeline from end to end.
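To reuse the trained model outside this script, a small convenience wrapper (a sketch we add here, not part of the original tutorial) bundles tokenization, device placement, and inference into one call; the filename used for saving is hypothetical:

def predict_value(text, model, tokenizer, max_len=20):
    """Predict a continuous value for a single input sentence."""
    model.eval()
    tokens = torch.tensor([tokenizer.encode(text, max_len)]).to(device)
    with torch.no_grad():
        return model(tokens).item()

print(predict_value("The distance is 60.0 meters", model, tokenizer))
torch.save(model.state_dict(), "rlm_model.pt")  # hypothetical filename for persisting the weights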
In conclusion, we have designed, trained, and evaluated a Regression Language Model capable of predicting continuous values from text inputs. We saw how combining positional embeddings, a transformer encoder, and a simple regression head lets the model capture the numerical semantics embedded in language. By generating synthetic data, visualizing training progress, and testing predictions, we demonstrated how RLMs bridge the gap between language understanding and numerical reasoning.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of the AI media platform Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.



