Researchers find adding this one simple sentence to prompts makes AI models way more creative

One of the coolest things about generative AI models – whether large language models (LLMs) or diffusion-based image generators – is that they are non-deterministic. Despite their reputation among some critics as mere "fancy autocorrect," generative AI models actually produce their output by sampling from a probability distribution over the most likely next tokens (units of information) when composing a response.
Ask an LLM "What is the capital of France?" and it will sample that distribution – over France, capitals, cities, and so on – to arrive at the answer "Paris." But that answer can take different forms: "The capital of France is Paris," simply "Paris," or even "Paris, though it was Versailles at one time."
However, anyone who uses these models day after day will notice that their answers can feel repetitive or annoyingly similar. The same joke about coffee gets recycled across generations of queries. Story prompts produce similar arcs. Even tasks that should yield many plausible answers – such as naming US states – tend to collapse into just a few. This phenomenon, known as mode collapse, arises during post-training alignment and limits the usefulness of otherwise powerful models.
Especially when using LLMs for new creative work in writing, communications, strategy, or illustration, we actually want their output to be even more diverse than it already is.
Now a team of researchers from Northeastern University, Stanford University, and West Virginia University has come up with a simple and ingenious way to get language and image models to generate a wider range of responses to almost any user prompt: add one simple sentence: "Generate 5 responses with their corresponding probabilities, sampled from the full distribution."
The method, called Verbalized Sampling (VS), helps models like GPT-4, Claude, and Gemini produce more diverse, human-like output – without retraining or access to internal parameters. It is described in a paper published on the open-access preprint server arxiv.org in early October 2025.
When prompted this way, the model no longer defaults to its safest, most typical output. Instead, it verbalizes its internal distribution over possible completions and samples across a wider range of possibilities. This one-line change yields significant gains in output diversity across multiple domains.
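For readers who want to try the technique immediately, the sketch below shows one way to send the VS instruction through the OpenAI Python client; the model name, example task, and lack of output parsing are illustrative assumptions, not the authors' reference implementation.

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The verbalized-sampling instruction is simply appended to an ordinary prompt.
vs_prompt = (
    "Tell me a joke about coffee. "
    "Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model; any capable chat model should work
    messages=[{"role": "user", "content": vs_prompt}],
)

# The reply should contain five candidate jokes, each annotated with a probability.
print(response.choices[0].message.content)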
As Weiyan Shi, an assistant professor at Northeastern University and co-author of the paper, wrote on X: "The potential of LLMs has not yet been fully unleashed! As shown in our paper, prompt optimization can be guided by thinking about how LLMs are trained and aligned, and can be proven theoretically."
Why do models collapse, and how does VS reverse it?
According to the research team, the root cause of mode collapse lies not only in algorithms such as reinforcement learning from human feedback (RLHF), but in the structure of human preferences themselves. People tend to rate familiar or typical answers more highly, which pushes LLMs toward "safe" choices rather than diverse answers during fine-tuning.
However, this bias does not erase the model's underlying knowledge – it only suppresses it. VS works by bypassing that suppression. Instead of asking for the single most likely output, it asks the model to reveal a set of plausible responses along with their relative probabilities. This distribution-level prompt restores access to the richer diversity present in the base pretrained model.
Real-world performance across tasks
The research team tested verbal sampling across several common use cases:
- Creative writing: In story generation, VS increased diversity scores by up to 2.1x compared with direct prompting, while maintaining quality. One prompt – "Without Goodbye" – produced formulaic breakup scenes under direct prompting, but yielded narratives involving cosmic events, silent emails, and music stopping mid-dance when prompted via VS.
- Dialogue simulation: In persuasive dialogue tasks, VS enabled models to mimic human-like patterns such as hesitation, resistance, and changing one's mind. Distributions of donation behavior generated with VS matched real human data more closely than those from baseline methods.
- Open-ended question answering: When asked to enumerate valid answers (for example, naming US states), models using VS produced responses that more closely matched the diversity of real-world data, covering a wider range of answers without sacrificing factual accuracy.
- Synthetic data generation: When used to create math problems for model training, VS produced more diverse datasets. This, in turn, improved downstream performance on competitive mathematics benchmarks, outperforming synthetic data generated via direct prompting.
Tunable diversity that works best with larger models
A notable feature of VS is its tunability. Users can set a probability threshold directly in the prompt to sample from the lower-probability "tails" of the model's distribution. Lower thresholds correspond to higher diversity. This adjustment is made through prompt text alone, without changing any decoding settings such as temperature or top-p.
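As a rough illustration of that tuning, the threshold can simply be stated as part of the instruction. The helper below is a hypothetical sketch – the wording and parameter names are assumptions modeled on the paper's description, not a fixed syntax.

# Hypothetical prompt builder: lowering the threshold asks the model to
# verbalize lower-probability, more unusual completions.
def build_vs_prompt(task: str, k: int = 5, threshold: float = 0.10) -> str:
    return (
        f"{task}\n"
        f"Generate {k} responses with their corresponding probabilities, "
        f"sampled from the full distribution. "
        f"Each response should have a probability below {threshold}."
    )

# A very low threshold (e.g. 0.001) pushes toward the most diverse outputs.
print(build_vs_prompt("Write the opening line of a short story titled 'Without Goodbye'.", threshold=0.001))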
In one test with the Gemini-2.5-Flash model, diversity in story writing increased steadily as the probability threshold was lowered from 1 to 0.001. A figure accompanying the study showed VS outperforming both direct and sequence-based prompting across all thresholds.
Interestingly, the method scales with model size. Larger models such as GPT-4.1 and Claude-4 showed greater gains from VS than smaller ones. While smaller models still benefited, the improvement in diversity was roughly 1.5 to 2 times larger in their bigger counterparts, suggesting that VS helps unlock more latent capability in advanced models.
Publication and availability
The verbalized sampling method is now available as a Python package:
pip install verbalized-sampling
The package includes LangChain integration and supports a simple interface for sampling from the verbalized distribution. Users can also tune parameters such as k (the number of responses), the probability threshold, and temperature to suit their applications.
A Colab notebook and documentation are available under an enterprise-friendly Apache 2.0 license on GitHub at: https://github.com/CHATS-lab/verbalized-sampling
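The exact interface is documented in the repository; as a purely hypothetical sketch of how such a package might be driven (the import, function, and argument names below are assumptions, not the package's confirmed API), usage could look roughly like this:

# Hypothetical usage sketch – consult the GitHub repo for the real API.
from verbalized_sampling import verbalize  # assumed import path

# Assumed call: k candidates, a probability threshold, and a sampling temperature.
responses = verbalize(
    "Name a US state.",
    k=5,
    threshold=0.05,
    temperature=0.9,
)
for r in responses:
    print(r)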
Practical tips and common issues
Although the method works across all major LLMs, some users may initially run into refusals or errors.
In these cases, the authors suggest using the system-prompt version of the template or trying the alternative formats listed on the GitHub page.
Some models interpret the instruction as a jailbreak attempt and refuse to comply unless it is structured more explicitly.
For example, moving the instruction into a system-level message like the following improves reliability:
You are a helpful assistant. For each query, generate five responses within separate tags, each with a probability below 0.10.
This simple change usually resolves any issues.
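Wiring that system instruction into an API call might look like the sketch below, again using the OpenAI client as an assumed example; the model name and user query are placeholders.

from openai import OpenAI

client = OpenAI()

# Placing the VS instruction in the system role often avoids refusals
# triggered by complex user-level instructions.
system_msg = (
    "You are a helpful assistant. For each query, generate five responses "
    "within separate tags, each with a probability below 0.10."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": "Tell me a joke about coffee."},
    ],
)
print(response.choices[0].message.content)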
A lightweight solution to a big problem
Verbalized sampling is a practical, inference-time fix for a deep limitation in how modern language models behave. It requires no retraining or internal access, does not depend on any one model family, and improves not only the diversity of outputs but also their quality, according to human evaluations and benchmark results.
With growing interest in tools that enhance model creativity, VS is likely to see rapid adoption in areas such as writing, design, simulation, education, and synthetic data generation.
For users and developers frustrated by the similarity of LLM responses, the fix may be as simple as changing the question.