The GLM-4.6V series has officially debuted in two versions, 106B and 9B, targeting high-performance cloud and low-latency local scenarios respectively. This article analyzes how its native Function Calling capability breaks the boundary between "seeing" and "doing", and examines its practical applications in long-document understanding, frontend code generation, and interleaved image-text creation. Detailed benchmark data and deployment resources are also included.
A New Milestone for Vision Models: More Than Just “Understanding”
The field of Artificial Intelligence rarely stands still. Just as we got used to language models being eloquent, multimodal AI has raised the bar again. The release of GLM-4.6V sends an interesting signal: models are no longer content to "look at pictures and talk"; they are starting to "look at pictures and do things".
The GLM-4.6V series comprises two versions: the foundational GLM-4.6V (106B), designed for cloud and high-performance computing clusters, and the lightweight GLM-4.6V-Flash (9B), optimized for local deployment and low-latency applications. Both models were trained with a context window extended to 128K tokens, so the amount of information they can process in a single pass is substantial.
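A quick back-of-the-envelope calculation makes the cloud-versus-local split concrete. The parameter counts come from the announcement; the bytes-per-parameter figures are the standard values for each precision, not something the announcement states, and activations and KV cache are ignored:

```python
# Rough VRAM needed just to hold the model weights (activations and KV cache
# for the 128K context would add on top of this).

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30

for name, params in [("GLM-4.6V (106B)", 106), ("GLM-4.6V-Flash (9B)", 9)]:
    for precision, bpp in [("FP16", 2.0), ("INT4", 0.5)]:
        print(f"{name} @ {precision}: ~{weight_memory_gb(params, bpp):.0f} GiB")
```

Even heavily quantized, the 106B model stays firmly in multi-GPU territory, while the 9B Flash weights fit on a single consumer GPU, which is exactly the split the two variants are designed around.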
This is not just a stacking of parameters. The core breakthrough of this update lies in the integration of “Native Function Calling”. This might sound a bit technical, but simply put, it turns AI from an observer who only comments into an executor who can get hands-on to solve problems.
Bridging Perception and Action: Native Vision-Driven Tool Use
In the past, multimodal models typically had to convert the images they saw into text descriptions first, and then call tools based on that text. This conversion step often lost detail or introduced outright misunderstandings.
GLM-4.6V takes a different path. It introduces Native Multimodal Function Calling. This means that images, screenshots, or document pages can be directly used as input parameters for tools, without the need for text conversion. Imagine throwing a screenshot of a complex report to the model; it doesn’t need to “translate” it into text first, but directly “looks” at the image to call search tools or calculation tools, and the final output (whether it’s a chart or a rendered page) can also be directly integrated into the reasoning chain.
This truly realizes a closed loop from “visual perception” to “understanding”, and then to “execution”. For developers, this provides a more unified technical foundation for building AI Agents capable of handling real business scenarios.
Mixed Image-Text Creation: Organizing Content Like a Human
Content creators might be particularly interested in the Interleaved Image-Text Content Generation feature.
In the past, when we asked AI to write articles with pictures, it was usually done separately: write the text first, then find the pictures. But GLM-4.6V can handle multimodal contexts including documents, user inputs, and images retrieved by tools. During the content generation process, it actively calls search and retrieval tools to collect and filter additional text and visual materials.
The final result is coherent content with both text and images, tailored to the task. It’s like an experienced editor who knows how to insert supporting images at key points in the text, rather than stiffly piecing materials together.
The Nemesis of Long Documents and Complex Charts
When dealing with business documents, the biggest headache is often those PDFs or scans with complex formats. GLM-4.6V possesses Multimodal Document Understanding capabilities and can handle multi-document or long-document inputs of up to 128K tokens.
This has a huge advantage: it understands formatted pages directly as images. That is to say, it can understand text, layout, charts, tables, and images simultaneously. This avoids the problem of lost layout structures or misaligned tables when traditional OCR (Optical Character Recognition) technology converts everything into plain text first. For professionals who need to analyze a large number of financial reports or technical manuals, this can save a lot of proofreading time.
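A sketch of the resulting workflow, under the assumption of an OpenAI-style message format: each rendered page goes in as an image, so layout, tables, and charts reach the model intact. Rendering a PDF to page images would be done with a library such as pypdfium2 or pdf2image; dummy bytes stand in for real pages here:

```python
import base64

# Sketch: feed a multi-page document to the model as one image per page, so
# layout, tables, and charts survive intact (no OCR-to-plain-text step).
# Page images would come from a PDF renderer; here they are dummy bytes.

def pages_to_content(page_pngs: list[bytes], question: str) -> list[dict]:
    parts = [
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,"
                              + base64.b64encode(png).decode("ascii")}}
        for png in page_pngs
    ]
    parts.append({"type": "text", "text": question})
    return parts

content = pages_to_content([b"page1", b"page2"],
                           "Summarize the revenue table on page 2.")
```

With a 128K-token budget, many pages fit in one request, which is what makes multi-document and long-document analysis practical without a separate OCR pipeline.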
The Frontend Engineer’s AI Assistant: From Screenshot to Code
For web developers, Frontend Replication & Visual Editing is a very practical feature.
You only need to give the model a UI screenshot, and it can restore the corresponding HTML and CSS code at the pixel level. It visually detects layouts, components, and styles to generate clean code. Even more impressive is that it supports natural language-driven modifications. If you feel a button color is wrong or the layout is too crowded, just give instructions like talking to a designer, and the model will perform iterative visual modifications.
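The screenshot-then-refine loop can be sketched as a growing chat history. This is an assumed message layout in the common chat-completions style, with helper names of our own, not an official GLM-4.6V client:

```python
# Sketch of the iterative visual-editing loop: start from a UI screenshot, then
# refine with plain-language instructions. Message format mirrors the common
# OpenAI-style chat convention; the helper functions are illustrative.

def initial_turn(screenshot_b64: str) -> list[dict]:
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text",
             "text": "Reproduce this UI as a single HTML file with embedded CSS."},
        ],
    }]

def revise(history: list[dict], generated_html: str, instruction: str) -> list[dict]:
    # Keep the model's last answer in context, then ask for a targeted change.
    return history + [
        {"role": "assistant", "content": generated_html},
        {"role": "user", "content": instruction},
    ]

msgs = initial_turn("iVBORw0...")
msgs = revise(msgs, "<html>...</html>", "Make the primary button larger and blue.")
```

Each `revise` call carries the previous HTML forward, so instructions like "the layout is too crowded" modify the existing code rather than regenerating from scratch.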
Performance Evaluation: The Showdown Between 106B and 9B
With all these features, how does the model actually perform? Let's look at the benchmark numbers. The table below shows the scores of both GLM-4.6V versions across a range of benchmarks.
It is worth noting that although the Flash version (9B) has fewer parameters, the gap with the 106B version in many tasks is not large, which shows its extremely high cost-performance ratio, making it very suitable for resource-constrained local deployments.
GLM-4.6V Series Benchmark Results
| Benchmarks | GLM-4.6V (106B) | GLM-4.6V-Flash (9B) | GLM-4.5V (106B) | Qwen3-VL-8B | Kimi-VL-A3B |
|---|---|---|---|---|---|
| **General VQA** | | | | | |
| MMBench V1.1 | 88.8 | 86.9 | 88.2 | 84.3 | 84.4 |
| MMBench V1.1 (CN) | 88.2 | 85.9 | 88.3 | 83.3 | 80.7 |
| MMStar | 75.9 | 74.7 | 75.3 | 75.3 | 70.4 |
| BLINK (Val) | 65.5 | 65.5 | 65.3 | 64.7 | 53.5 |
| MUIRBENCH | 77.1 | 75.7 | 75.3 | 76.8 | 63.8 |
| **Multimodal Reasoning** | | | | | |
| MMMU (Val) | 76.0 | 71.1 | 75.4 | 74.1 | 64.0 |
| MMMU_Pro | 66.0 | 60.6 | 65.2 | 60.4 | 46.3 |
| VideoMMMU | 74.7 | 70.1 | 72.4 | 72.8 | 65.2 |
| MathVista | 85.2 | 82.7 | 84.6 | 81.4 | 80.1 |
| AI2D | 88.8 | 89.2 | 88.1 | 84.9 | 81.9 |
| **Multimodal Agentic** | | | | | |
| Design2Code | 88.6 | 69.8 | 82.2 | 56.6 | 38.8 |
| Flame-React-Eval | 86.3 | 78.8 | 82.5 | 56.3 | 36.3 |
| OSWorld | 37.2 | 21.1 | 35.8 | 33.9 | 8.2 |
| AndroidWorld | 57.0 | 42.7 | 57.0 | 50.0 | - |
| WebVoyager | 81.0 | 71.8 | 84.4 | 47.7 | - |
| **OCR & Chart** | | | | | |
| OCRBench | 86.5 | 84.7 | 86.5 | 81.9 | 86.9 |
| ChartQAPro | 65.5 | 62.6 | 64.0 | 58.4 | 23.7 |
| **Spatial & Grounding** | | | | | |
| RefCOCO-avg (val) | 88.6 | 85.6 | 91.3 | 89.3 | 33.6 |
| Ref-L4-test | 88.9 | 87.7 | 89.5 | 88.6 | 51.3 |
The data shows that GLM-4.6V's gains are most pronounced on the multimodal agentic and reasoning benchmarks, which lines up with its strengthened "action execution" capabilities.
How to Get and Deploy
For developers who want to try this model, you can now download and use the GLM-4.6V series models via Hugging Face.
If you prefer local deployment and usually run models with llama.cpp, one caveat: llama.cpp support for GLM-4.5V/4.6V is currently still a draft (Draft PR). The community is actively pushing it forward, but it may not be stable yet; if you want to track progress or help with testing, follow GitHub Pull Request #16600.

In practice, this means that running the new models smoothly on local hardware may take a little longer, or require some comfort working with unmerged code.
Frequently Asked Questions (FAQ)
To help everyone understand the features of GLM-4.6V more quickly, here are a few core Q&As:
Q1: What is the main difference between GLM-4.6V and GLM-4.6V-Flash? GLM-4.6V (106B) is a flagship model designed for cloud and high-performance clusters, suitable for handling the most complex reasoning and multimodal tasks; while GLM-4.6V-Flash (9B) is a lightweight version optimized for local deployment and low-latency scenarios. Although it has fewer parameters, it still shows strength close to the flagship model in many benchmarks.
Q2: What is “Native Multimodal Function Calling”? This means that the model can directly accept images (such as screenshots, documents) as input parameters for tools without first converting the images to text. This allows the model to “operate by seeing” more accurately, such as directly calling a search tool based on an error screenshot, greatly improving the execution efficiency of AI Agents in real business scenarios.
Q3: Can I use it to write web code? Yes. GLM-4.6V has frontend replication and visual editing capabilities. You can upload a UI screenshot, and the model will generate the corresponding HTML and CSS code. You can even use natural language commands (such as “make the button bigger”) to let the model modify the code until you are satisfied.
Q4: Can I run GLM-4.6V on a local device now?
Theoretically yes, especially the 9B Flash version which is very suitable for local operation. However, support for this series in the mainstream local inference framework llama.cpp is currently still in development (Draft stage), so ordinary users may need to wait for official support to be merged to get the smoothest experience.


