
DeepSeek Just Released a 3B OCR Model: A 3B VLM Designed for High-Performance OCR and Structured Document Conversion

DeepSeek-AI has released DeepSeek-OCR, a 3B-parameter vision-language model (VLM) for optical character recognition (OCR) and structured document conversion that compresses long text into a small set of vision tokens and then decodes those tokens with a language model. The idea is simple: an image of the page carries a compact optical representation of the text, which shortens the sequence the decoder has to process. The research team reports about 97% decoding accuracy when the number of text tokens is within 10 times the number of vision tokens on the Fox benchmark, and still-useful behavior even at around 20 times compression. It also reports competitive results on OmniDocBench with far fewer tokens than popular baselines.

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

Architecture, what’s actually new?

DeepSeek-OCR-3B has two components: a vision encoder called DeepEncoder and a Mixture-of-Experts (MoE) decoder called DeepSeek3B-MoE-A570M. The encoder is designed for high-resolution inputs with low activation cost and a small number of output tokens. It uses a SAM-based window-attention stage for local perception, a two-layer convolutional compressor for 16× token downsampling, and a CLIP-based dense global-attention stage for aggregating visual knowledge. This design keeps activation memory under control at high resolution while keeping the vision-token count low. The decoder is a 3B-parameter MoE model (DeepSeek3B-MoE-A570M) with about 570 million active parameters per token.
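To see where the per-mode token counts come from, here is a quick back-of-the-envelope check. It is a sketch that assumes the SAM stage patchifies the page into 16×16-pixel patches (the patch size is our assumption, not quoted from the report); the resulting counts line up with the mode budgets listed in the next section.

```python
# Back-of-the-envelope DeepEncoder token math: raw 16x16 patches, then the
# 16x convolutional downsampling described above. Patch size is an assumption.

def vision_tokens(side_px: int, patch: int = 16, downsample: int = 16) -> int:
    patches_per_side = side_px // patch
    return patches_per_side ** 2 // downsample

for side in (512, 640, 1024, 1280):
    print(f"{side}x{side}: {vision_tokens(side)} vision tokens")
# 512x512 -> 64, 640x640 -> 100, 1024x1024 -> 256, 1280x1280 -> 400,
# matching the Tiny / Small / Base / Large budgets in the next section.
```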


Multiple resolution modes, designed specifically for token budgets

DeepEncoder supports native modes and dynamic modes. The native modes are Tiny with 64 tokens at 512×512 pixels, Small with 100 tokens at 640×640, Base with 256 tokens at 1024×1024, and Large with 400 tokens at 1280×1280. The dynamic modes, called Gundam and Gundam-Master, combine tiled local views with a global view. Gundam produces n×100 plus 256 tokens, or n×256 plus 400 tokens, with n in the range 2 to 9. For padded modes, the research team gives a formula for valid tokens, which is lower than the raw token count and depends on the aspect ratio. These modes let AI developers and researchers match token budgets to page complexity.
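The dynamic-mode budgets can be computed directly from the formulas above. The helper below is a minimal sketch (the function and argument names are ours, not from the repository) that returns the vision-token count for a tiled configuration.

```python
# Vision-token budget for the dynamic (Gundam) modes: n local tiles plus one
# global view, using the two formulas from the report: n*100 + 256 or n*256 + 400.

def gundam_tokens(n_tiles: int, large_tiles: bool = False) -> int:
    if not 2 <= n_tiles <= 9:
        raise ValueError("the report uses n in the range 2..9")
    per_tile, global_view = (256, 400) if large_tiles else (100, 256)
    return n_tiles * per_tile + global_view

print(gundam_tokens(4))                    # 656 tokens  (4*100 + 256)
print(gundam_tokens(4, large_tiles=True))  # 1424 tokens (4*256 + 400)
```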


Compression results, what the numbers say

The Fox benchmark measures accuracy as exact text match after decoding. With 100 vision tokens, pages containing 600 to 700 text tokens reach 98.5% accuracy at 6.7× compression. Pages containing 900 to 1,000 text tokens reach 96.8% accuracy at 9.7× compression. With 64 vision tokens, accuracy drops as compression increases, for example 59.1% at about 19.7× for 1,200 to 1,300 text tokens. These values come directly from Table 2 of the technical report.
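As a quick sanity check, the compression ratios follow directly from the token counts. The snippet below recomputes approximate ratios from the midpoints of the quoted text-token ranges; the accuracy figures are the reported ones, and the small gap versus the paper's 6.7×, 9.7×, and 19.7× comes from using range midpoints rather than exact per-page counts.

```python
# Compression ratio = text tokens / vision tokens, using the figures quoted above.
cases = [
    # (text-token range, vision tokens, reported accuracy)
    ((600, 700),   100, 0.985),
    ((900, 1000),  100, 0.968),
    ((1200, 1300),  64, 0.591),
]
for (lo, hi), vision, accuracy in cases:
    ratio = (lo + hi) / 2 / vision
    print(f"~{ratio:.1f}x compression -> {accuracy:.1%} reported accuracy")
```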


On OmniDocBench, the summary results show that DeepSeek-OCR outperforms GOT-OCR 2.0 while using only 100 vision tokens per page, and that with fewer than 800 vision tokens it outperforms MinerU 2.0, which uses more than 6,000 vision tokens per page on average. The benchmark reports overall performance as edit distance.


Training details that matter

The research team describes a two-stage training pipeline. It first trains DeepEncoder with next-token prediction on OCR 1.0 and OCR 2.0 data plus 100 million LAION samples, then trains the full system with pipeline parallelism across 4 partitions. For hardware, the run used 20 nodes, each with 8 A100 40G GPUs, and used AdamW. The team reports training throughput of 90 billion tokens per day on text-only data and 70 billion tokens per day on multimodal data. In production, they report the ability to process more than 200,000 pages per day on a single A100 40G node.
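For a sense of scale, the cluster-level throughput figures translate into per-GPU rates as follows; this is a rough estimate derived from the numbers above (20 nodes × 8 GPUs), not a figure stated in the report.

```python
# Rough per-GPU training throughput implied by the reported cluster numbers.
gpus = 20 * 8  # 20 nodes x 8 A100-40G GPUs
for label, tokens_per_day in [("text-only", 90e9), ("multimodal", 70e9)]:
    per_gpu = tokens_per_day / gpus / 1e6
    print(f"{label}: ~{per_gpu:.0f}M tokens per GPU per day")
# text-only: ~562M, multimodal: ~438M tokens per GPU per day
```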

How to evaluate it in a practical stack

If your target documents are typical reports or books, start with the Small mode at 100 tokens, and only move up if the edit distance is unacceptable. If your pages contain dense, small fonts or very high token counts, use the Gundam mode, since it combines global and local fields of view under an explicit token budget. If your workload includes charts, chemical structures, or geometric figures, review the qualitative deep-parsing section, which shows conversions to HTML tables, SMILES, and structured geometry, and then design outputs that are easy to validate.
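For a first pass, the model can be loaded through Transformers with remote code enabled, as documented on the Hugging Face model card. The sketch below shows the loading step; the inference helper and its argument names at the end are assumptions for illustration, so check the model card and GitHub README for the exact API before relying on it.

```python
# Minimal sketch for trying DeepSeek-OCR via Hugging Face Transformers.
# Tested environment per the model card: Python 3.12.9, CUDA 11.8, torch 2.6.0,
# transformers 4.46.3, tokenizers 0.20.3, flash-attn 2.7.3.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,      # DeepEncoder + MoE decoder ship as remote code
    torch_dtype=torch.bfloat16,
).eval().cuda()

# The repository exposes its own inference helper through the remote code; the
# call below is illustrative only -- the method and argument names are
# assumptions, so consult the model card / README for the exact signature.
result = model.infer(
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",
    image_file="page.png",       # hypothetical input page
)
print(result)
```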


Key takeaways

  1. DeepSeek-OCR targets token efficiency through optical context compression, with near-lossless decoding at about 10× compression and roughly 60 percent accuracy at about 20× compression.
  2. The Hugging Face release exposes explicit token budgets: Tiny uses 64 tokens at 512×512, Small uses 100 tokens at 640×640, Base uses 256 tokens at 1024×1024, Large uses 400 tokens at 1280×1280, and Gundam composes n local views at 640×640 plus one global view at 1024×1024.
  3. The system architecture pairs a DeepEncoder that compresses pages into vision tokens with a DeepSeek3B-MoE decoder that has about 570 million active parameters, as the research team describes in the technical report.
  4. The Hugging Face model card documents a tested setup for immediate use: Python 3.12.9, CUDA 11.8, PyTorch 2.6.0, Transformers 4.46.3, Tokenizers 0.20.3, and Flash Attention 2.7.3.

DeepSeek-OCR is a practical step forward for document AI. It treats the page as a compact optical carrier that shortens the decoder sequence without discarding most of the information. The model card and technical report describe 97 percent decoding accuracy at about 10× compression on the Fox benchmark, which is the main claim to verify in real workloads. The released model is a 3B MoE decoder with a DeepEncoder front end, packaged for Transformers with tested versions of PyTorch 2.6.0, CUDA 11.8, and Flash Attention 2.7.3, which lowers setup cost for engineers. The repository ships a single safetensors checkpoint of about 6.67 GB, which fits common GPUs. Overall, DeepSeek-OCR operationalizes optical context compression with a 3B MoE decoder, reports 97% decoding accuracy at 10× compression on Fox, provides clear token-budget modes, and includes a tested Transformers setup; the throughput claim is worth validating in your own pipeline.


Check out the technical paper, the model on Hugging Face, and the GitHub repo for tutorials, code, and notebooks.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of AI for social good. His most recent endeavor is the launch of the AI media platform, Marktechpost, which features in-depth coverage of machine learning and deep learning news that is technically sound and easy for a broad audience to understand. The platform has more than 2 million views per month, which shows its popularity among readers.


2025-10-20 23:50:00
