
OpenAI Introduces GPT-5.2: A Long-Context Workhorse for Agents, Coding, and Knowledge Work

OpenAI just introduced GPT-5.2, its most advanced frontier model for professional work and long-running agents, and is rolling it out via ChatGPT and the Application Programming Interface (API).

GPT-5.2 is a family of three models. In ChatGPT, users see GPT-5.2 Instant, Thinking, and Pro. In the API, the corresponding models are gpt-5.2-chat-latest, gpt-5.2, and gpt-5.2-pro. Instant is aimed at everyday help and learning, Thinking targets complex multi-step and multi-agent work, and Pro devotes more compute to difficult technical and analytical tasks.

Benchmark profile, from GDPval to SWE-bench

GPT-5.2 Thinking is positioned as the main backbone for real-world cognitive work. On GDPval, an assessment of well-defined cognitive tasks across 44 occupations in 9 large industries, it outperforms or ties top industry professionals in 70.9 percent of comparisons, while producing output at more than 11 times the speed and less than 1 percent of the experts' estimated cost. For engineering teams, this means the model can produce polished deliverables such as presentations, spreadsheets, timelines, and graphs from structured instructions.

In an internal benchmark of entry-level investment banking spreadsheet modeling tasks, average scores increased from 59.1 percent with GPT-5.1 to 68.4 percent with GPT-5.2 Thinking and 71.7 percent with GPT-5.2 Pro. The tasks include three-statement models and leveraged buyout templates with constraints on formatting and citations, representative of structured enterprise workflows.

In software engineering, GPT-5.2 Thinking reaches 55.6 percent on SWE-Bench Pro and 80.0 percent on SWE-bench Verified. SWE-Bench Pro evaluates patch generation at the repository level across multiple languages, while SWE-bench Verified focuses on Python.
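These benchmarks share one evaluation shape: apply a model-generated patch to a repository, then run the repository's tests. The toy sketch below mirrors that loop in memory with a hypothetical line-level patch format (real harnesses apply unified diffs with `git apply` and invoke the project's test suite):

```python
def apply_patch(source_lines, patch):
    """Apply a minimal line-level patch: a list of
    (line_index, old_line, new_line) edits. This is a toy stand-in for
    applying a model-generated unified diff to a checkout."""
    patched = list(source_lines)
    for idx, old, new in patch:
        if patched[idx] != old:  # context check, like a failed hunk
            raise ValueError(f"context mismatch at line {idx}")
        patched[idx] = new
    return patched

def run_tests(module_source, checks):
    """Exec the patched module and run callable checks against its
    namespace, mimicking the 'apply patch, run tests' evaluation loop."""
    namespace = {}
    exec("\n".join(module_source), namespace)
    return all(check(namespace) for check in checks)

# A buggy module, a model-proposed fix, and a test the fix must pass.
buggy = ["def add(a, b):", "    return a - b"]
fix = [(1, "    return a - b", "    return a + b")]
patched = apply_patch(buggy, fix)
print(run_tests(patched, [lambda ns: ns["add"](2, 3) == 5]))  # True
```

A solved instance is one where the patch applies cleanly and all previously failing tests pass; the benchmark score is the fraction of instances solved.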

Long context and agentic workflows

Long context is a primary design goal. GPT-5.2 Thinking sets a new state of the art on OpenAI MRCRv2, a benchmark that inserts multiple identical "needle" queries into long dialogue "haystacks" and measures whether the model can reproduce the correct answer. It is the first model reported to reach close to 100 percent accuracy on a 4-needle MRCR variant at up to 256K tokens.
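The structure of such a test is easy to sketch. The snippet below builds a simplified MRCR-style haystack and scores retrieval with an oracle scanner; it uses distinct keys for clarity, whereas the real benchmark uses identical queries and asks for the i-th occurrence, with the model answering from the full context:

```python
import random

def build_haystack(needles, filler_turns=200, seed=0):
    """Hide 'needle' facts among filler turns at random positions,
    loosely following the MRCR recipe."""
    rng = random.Random(seed)
    turns = [f"Note {i}: nothing important here." for i in range(filler_turns)]
    positions = sorted(rng.sample(range(filler_turns), len(needles)))
    for pos, (key, value) in zip(positions, needles):
        turns.insert(pos, f"Remember: the code for {key} is {value}.")
    return turns

def retrieve(turns, key):
    """Oracle retriever standing in for the model under test."""
    prefix = f"Remember: the code for {key} is "
    for turn in reversed(turns):
        if turn.startswith(prefix):
            return turn[len(prefix):].rstrip(".")
    return None

def accuracy(turns, needles):
    # Fraction of needles whose value is reproduced correctly.
    return sum(retrieve(turns, k) == v for k, v in needles) / len(needles)

needles = [("alpha", "1137"), ("beta", "2284"),
           ("gamma", "3391"), ("delta", "4408")]
turns = build_haystack(needles)
print(accuracy(turns, needles))  # 1.0
```

Scaling `filler_turns` until the dialogue approaches the context limit is what makes the benchmark a long-context stress test rather than a simple lookup.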

For workloads that go beyond this context, GPT-5.2 Thinking integrates with the Responses API /compact endpoint, which compresses context to extend the effective window for tool-heavy and long-running tasks. This matters if you are building agents that call tools repeatedly over many steps and need to preserve state beyond the initial token limit.
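OpenAI has not published the endpoint's internals in this article, but a client-side analog of compaction is straightforward: when the estimated token count exceeds a budget, fold older turns into one summary message and keep the recent turns verbatim. The sketch below is a minimal illustration under that assumption; `summarize` stands in for a model call, and the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer:

```python
def estimate_tokens(text):
    # Crude heuristic of ~4 characters per token (assumption).
    return max(1, len(text) // 4)

def compact(history, budget, keep_recent=4, summarize=None):
    """Fold older turns into one summary message once the estimated
    token count exceeds `budget`, keeping recent turns intact."""
    total = sum(estimate_tokens(m["content"]) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    # Placeholder summarizer; a real agent would call the model here.
    summarize = summarize or (lambda msgs: "; ".join(m["content"][:30] for m in msgs))
    summary = {"role": "system",
               "content": "Summary of earlier turns: " + summarize(older)}
    return [summary] + recent

history = [{"role": "user", "content": f"step {i}: " + "tool output " * 20}
           for i in range(20)]
compacted = compact(history, budget=200)
print(len(history), len(compacted))  # 20 5
```

An agent loop would call this between tool invocations so that state survives past the initial token limit at the cost of lossy early history.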

On tool use, GPT-5.2 Thinking reaches 98.7 percent on Tau2-bench Telecom, a multi-turn customer-support benchmark where the model must coordinate tool calls across a realistic workflow. Official examples from the OpenAI release post show scenarios such as a traveler with a delayed flight, a missed connection, a lost bag, and a medical seating requirement, where GPT-5.2 handles rebooking, special-assistance seating, and compensation in the required sequence while GPT-5.1 leaves steps incomplete.
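What makes such scenarios hard is ordering: each tool call depends on the effects of the previous one. The sketch below encodes that dependency chain with hypothetical tools (the names and ordering rules are illustrative, not OpenAI's actual tool schema), so skipping or reordering a step fails loudly, which is roughly what the benchmark penalizes:

```python
# Hypothetical tools for the delayed-flight scenario.
def rebook_flight(state):
    state["itinerary"] = "next available connection"
    return "rebooked"

def assign_assistance_seat(state):
    if "itinerary" not in state:
        raise RuntimeError("must rebook before assigning a seat")
    state["seat"] = "bulkhead (medical assistance)"
    return "seat assigned"

def file_compensation(state):
    if "seat" not in state:
        raise RuntimeError("must finish seating before compensation")
    state["claim"] = "delay compensation filed"
    return "claim filed"

WORKFLOW = [rebook_flight, assign_assistance_seat, file_compensation]

def run_agent():
    """Execute every tool in the required order, threading shared state
    through the calls, as a tool-using agent would over several turns."""
    state = {}
    results = [tool(state) for tool in WORKFLOW]
    return state, results

state, results = run_agent()
print(results)  # ['rebooked', 'seat assigned', 'claim filed']
```

A model that "leaves steps incomplete" corresponds here to stopping before `file_compensation`, leaving the claim unfiled even though the rebooking succeeded.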

Vision, science and mathematics

Vision quality also improves. GPT-5.2 Thinking roughly halves error rates on chart-reasoning and UI-understanding benchmarks such as CharXiv Reasoning and ScreenSpot Pro when the Python tool is enabled. The model shows better spatial grounding of images: for example, when labeling motherboard components with rough bounding boxes, GPT-5.2 identifies more regions with tighter placement than GPT-5.1.

For scientific workloads, GPT-5.2 Pro scores 93.2 percent and GPT-5.2 Thinking 92.4 percent on GPQA Diamond, and GPT-5.2 Thinking solves 40.3 percent of FrontierMath Level 1 to Level 3 problems with Python tools enabled. These benchmarks cover graduate-level physics, chemistry, biology, and specialized mathematics, and OpenAI highlights early use in which GPT-5.2 Pro contributed to a proof in statistical learning theory under human verification.

Comparison table

GPT-5.1
Positioning: leading model for agentic coding and task execution with configurable reasoning effort
Context window / max output: 400,000-token context, 128,000-token max output
Knowledge cutoff: 2024-09-30
Notable benchmarks: SWE-Bench Pro 50.8 percent, SWE-bench Verified 76.3 percent, ARC-AGI-1 72.8 percent, ARC-AGI-2 17.6 percent

GPT-5.2 Thinking
Positioning: new flagship model for coding and agentic tasks across industries and for long-running agents
Context window / max output: 400,000-token context, 128,000-token max output
Knowledge cutoff: 2025-08-31
Notable benchmarks: GDPval wins or ties with industry professionals in 70.9 percent of comparisons, SWE-Bench Pro 55.6 percent, SWE-bench Verified 80.0 percent, ARC-AGI-1 86.2 percent, ARC-AGI-2 52.9 percent

GPT-5.2 Pro
Positioning: higher-compute version of GPT-5.2 for the hardest reasoning and scientific workloads, producing smarter and more accurate responses
Context window / max output: 400,000-token context, 128,000-token max output
Knowledge cutoff: 2025-08-31
Notable benchmarks: GPQA Diamond 93.2 percent vs. GPT-5.2 Thinking 92.4 percent and GPT-5.1 Thinking 88.1 percent, ARC-AGI-1 90.5 percent, ARC-AGI-2 54.2 percent

Key takeaways

  1. GPT-5.2 Thinking is the new default working model: It replaces GPT-5.1 Thinking as the main model for coding, cognitive work, and agents, keeping the same 400K-token context and 128K max output but with significantly higher benchmark performance across GDPval, SWE-bench, ARC-AGI, and scientific QA.
  2. Large gains over GPT-5.1 at the same scale: On key benchmarks, GPT-5.2 Thinking goes from 50.8 percent to 55.6 percent on SWE-Bench Pro, from 76.3 percent to 80.0 percent on SWE-bench Verified, from 72.8 percent to 86.2 percent on ARC-AGI-1, and from 17.6 percent to 52.9 percent on ARC-AGI-2, with comparable token limits.
  3. GPT-5.2 Pro targets frontier reasoning and science: GPT-5.2 Pro is a higher-compute variant that mainly improves hard reasoning and scientific tasks, for example reaching 93.2 percent on GPQA Diamond versus 92.4 percent for GPT-5.2 Thinking and 88.1 percent for GPT-5.1 Thinking, along with higher ARC-AGI scores.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of AI for social good. His most recent endeavor is the launch of the AI media platform Marktechpost, which features in-depth coverage of machine learning and deep learning news that is technically sound yet accessible to a broad audience. The platform draws more than 2 million views per month.


2025-12-11 20:04:00
