GLM-5.2 Beat US AI Models at Finding Security Bugs: What the Result Really Shows
China's GLM-5.2, the new open-weight model from Zhipu AI (now branded Z.ai), beat leading US models at one specific security task: it scored a 39 percent F1 on finding a common class of vulnerability and edged out Anthropic's Claude Code at roughly one-sixth the cost, according to security firm Semgrep. That result is real, but the headline is narrower than it sounds. Here is exactly what GLM-5.2 did, what it beat, and why cybersecurity researchers are both impressed and worried.
What is GLM-5.2?
GLM-5.2 is Z.ai's flagship open-weight large language model, released to its GLM Coding Plan subscribers on June 13, 2026, with the open weights and release notes following three days later. It is a Mixture-of-Experts model with roughly 750 billion total parameters but only about 40 billion active per token, and it extends the usable context window to one million tokens, as VentureBeat reported.
The detail that matters most for this story is the license. Z.ai published the weights under a permissive MIT license with no regional restrictions, meaning anyone can download the model, run it on their own hardware, fine-tune it, and inspect it. "Open weight" is not the same as fully open source, since the training data and full pipeline are not released, but it does mean GLM-5.2 can be used entirely offline, outside any provider's control.
The benchmark result, in plain terms
The widely shared "beat US models at security" claim traces back to a Semgrep evaluation focused on one vulnerability class: Insecure Direct Object Reference, or IDOR. An IDOR bug is a missing authorization check that lets one user access another user's data, for example by changing an ID number in a request. It is a reasoning-heavy task because the model has to understand who is allowed to do what, not just spot a dangerous function.
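To make the bug class concrete, here is a minimal Python sketch of an IDOR and its fix. Everything in it (the `INVOICES` store, the two `get_invoice_*` functions, the user names) is hypothetical and not drawn from Semgrep's dataset; it only illustrates the missing-authorization pattern described above.

```python
# Hypothetical in-memory data store: invoice ID -> record with an owner.
INVOICES = {
    101: {"owner": "alice", "total": 40},
    102: {"owner": "bob", "total": 75},
}


def get_invoice_vulnerable(current_user: str, invoice_id: int) -> dict:
    """IDOR: returns any invoice by ID and never checks who is asking."""
    return INVOICES[invoice_id]


def get_invoice_fixed(current_user: str, invoice_id: int) -> dict:
    """Same lookup, plus the ownership check the vulnerable version lacks."""
    invoice = INVOICES.get(invoice_id)
    # Treat "missing" and "not yours" identically so IDs cannot be probed.
    if invoice is None or invoice["owner"] != current_user:
        raise PermissionError("invoice not found or not accessible")
    return invoice


# Alice changes the ID in her request from 101 to 102.
leaked = get_invoice_vulnerable("alice", 102)  # succeeds: Bob's data leaks

try:
    get_invoice_fixed("alice", 102)
    blocked = False
except PermissionError:
    blocked = True  # the fixed handler refuses the same request
```

Neither function calls anything that looks dangerous on its own. The only difference is the ownership comparison, so a reviewer (human or model) has to reason about who should be allowed to see which record.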
On that task, run against real open-source applications, GLM-5.2 scored a 39 percent F1 using a bare prompt and beat the Claude Code agent. As Semgrep put it, "an open-weight model running a bare prompt outperformed a frontier coding agent on a reasoning-heavy security task." It did so at about $0.17 per vulnerability found, which the authors described as roughly one-sixth the cost of comparable frontier models.
Here is how the models stacked up on Semgrep's IDOR test. Note that the top two scores belong to Semgrep's own multimodal scanner, not to a general-purpose coding agent:
| System / model | IDOR detection F1 |
|---|---|
| Semgrep multimodal (powered by GPT-5.5) | 61% |
| Semgrep multimodal (powered by Opus 4.8) | 53% |
| GLM-5.2 (bare prompt) | 39% |
| Claude Code (Opus 4.6) | 37% |
| Claude Code (Opus 4.7/4.8) | 28% |
| GPT-5.5 (Codex harness) | 20% |
The honest framing is this: GLM-5.2 beat the general-purpose US coding agents on IDOR detection, but it did not beat a purpose-built security scanner. Semgrep's own tool, using GPT-5.5 under the hood, still came out on top.
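For readers unfamiliar with the metric, F1 is the harmonic mean of precision (the share of reported bugs that are real) and recall (the share of real bugs that were reported). The sketch below computes it from raw detection counts. The counts are invented for illustration and are not Semgrep's data.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = harmonic mean of precision and recall, from raw detection counts."""
    if true_pos == 0:
        return 0.0  # no correct findings, so the score is zero by convention
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Invented example: 10 real IDORs found, 20 false alarms, 10 real IDORs missed.
score = f1_score(true_pos=10, false_pos=20, false_neg=10)
print(f"F1 = {score:.0%}")  # prints: F1 = 40%
```

Because the harmonic mean is pulled toward the lower of its two inputs, a system cannot reach a high F1 by flagging everything (high recall, low precision) or by flagging almost nothing (high precision, low recall).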
What GLM-5.2 actually beat, and where it did not
The result is genuine but bounded, and Semgrep is unusually clear about the limits. "This is one task, one dataset, one run," the authors wrote, adding that "it might well be the case that for IDOR detection GLM-5.2 really is better than Claude, but for SSRF detection the tables turn. We don't know this yet." In other words, a single vulnerability category does not settle the broader question of which model is better at security.
A second, independent evaluation puts the win in context. Security analytics firm Graphistry tested GLM-5.2 on CyBT-CTF, a capture-the-flag style benchmark whose tasks and answers are kept hidden from model makers to prevent training contamination. There, GLM-5.2 did not dominate. It matched Anthropic's Opus 4.7 and 4.8 on solve rate and beat Claude Sonnet 4.5, with GLM-5.2 solving 28 of 59 tasks versus Sonnet's 23 of 59. Graphistry called GLM-5.2 the first open-weight model it would recommend for a frontier-grade cybersecurity experience, which is a strong endorsement but a different claim from beating every US model outright.
Tellingly, Graphistry also noted that on contaminated public benchmarks both Opus and Sonnet score much higher, yet that advantage shrinks on the hidden CyBT-CTF set. The takeaway is that GLM-5.2 reaches rough parity with frontier US models on uncontaminated security tasks, rather than blowing past them.
Why finding security bugs with AI matters
Vulnerability finding is one of the highest-stakes uses of AI in security, because it cuts both ways. The same model that helps a defender audit code for missing authorization checks can help an attacker find the same hole first. IDOR and similar logic bugs are notoriously hard to catch with traditional pattern-matching scanners, since they depend on intent and context, which is precisely the kind of reasoning large models are getting better at.
It is worth keeping expectations grounded, though. Academic work like the CyberGym benchmark, which tests AI agents on discovering and reproducing real vulnerabilities in projects such as curl, OpenSSL, and FFmpeg, finds that autonomous vulnerability discovery remains substantially constrained for current models. Frontier agents show real but limited capability on genuine, end-to-end bug hunting. A 39 percent F1 on a single curated task is a meaningful signal, not a solved problem.
If you work with these models day to day, the practical thread connecting all of this is prompt and configuration quality: Semgrep's result came from a bare prompt rather than an elaborate agent harness. Free utilities like the AI tools on RunFreeTools can help you draft and refine the prompts you feed any model, which matters because small wording changes can shift security results.
What it means for the US-China AI race
The strategic story is less about one F1 score and more about who can run the model. GLM-5.2 arrived just as the Trump administration moved to block foreign nationals from accessing Anthropic's most advanced models, which, as Trending Topics noted, makes GLM-5.2 the most capable openly licensed model available to many developers outside the US. An open-weight Chinese model reaching parity with frontier US systems on hidden security benchmarks is a notable shift in a field the US has led.
That openness is exactly what alarms defenders. Because the weights can be downloaded and run locally, an attacker can strip safety guardrails and operate without any provider seeing the activity. Security firms quoted by Digital Today warned about this directly. GuidePoint Security's Jason Baker said jailbreak methods for using GLM-5.2 in offensive work were already circulating on Russian-language hacker forums, and one CTO cited in the piece warned the model could help automate lateral movement and exploit chaining after a breach. The dual-use nature of a capable, unrestricted, downloadable model is the real headline for cybersecurity teams.
The caveats you should not skip
Three caveats keep this result honest. First, scope: the clean "beat Claude" win is on one vulnerability type, IDOR, and Semgrep itself says other categories may flip. Second, the comparison set: GLM-5.2 beat general coding agents, not Semgrep's dedicated security scanner, which still led the table.
Third, and most contested, is provenance. Graphistry's analysis suggested GLM-5.2 may be a distillation of GPT-5.5 and Opus 4.8, citing unusually high agreement between the models. Where OpenAI and Anthropic models share a Cohen's Kappa of about 0.63, GLM-5.2's agreement with them rose to roughly 0.80 and 0.76. That is circumstantial evidence, not proof, and confirming distillation would require a technical audit. Notably, Z.ai was absent from the list of Chinese labs named in Anthropic's earlier industrial-scale distillation report, so the accusation remains unresolved.
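Cohen's Kappa measures how often two raters agree beyond what chance alone would produce: 1.0 is perfect agreement and 0 is chance-level agreement. The following is a minimal sketch of the calculation for two models labeling the same items. The labels are made up, and this is not Graphistry's code or data.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts from two models on the same ten findings.
model_a = ["vuln", "vuln", "safe", "safe", "vuln",
           "safe", "safe", "vuln", "safe", "safe"]
model_b = ["vuln", "vuln", "safe", "safe", "safe",
           "safe", "safe", "vuln", "safe", "vuln"]
kappa = cohens_kappa(model_a, model_b)
print(round(kappa, 2))  # prints: 0.58
```

In this toy case the models agree on 8 of 10 items, but chance alone would account for 52 percent agreement, so the kappa is well below 0.8. That correction is why a rise from about 0.63 to roughly 0.80 is notable, while still falling short of proof.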
Bottom line
GLM-5.2 did beat leading US models at finding security bugs, but only in a specific, well-documented sense: it scored 39 percent F1 on IDOR detection and edged out Claude Code at about one-sixth the cost, per Semgrep, and it matched frontier US models on the hidden CyBT-CTF benchmark, per Graphistry. What it does not prove is that GLM-5.2 is broadly the best security model, since a single task is not the whole field, a purpose-built scanner still beat it, and open questions about distillation linger. The more durable story is that a freely downloadable Chinese model has reached frontier-grade cybersecurity capability, which is genuinely useful for defenders and genuinely worrying in the wrong hands. As of late June 2026, that combination, not the F1 number itself, is what makes GLM-5.2 a turning point worth watching.
Frequently asked questions
Did GLM-5.2 really beat US AI models at finding security bugs?
Yes, but narrowly. In Semgrep's evaluation, GLM-5.2 scored 39 percent F1 on detecting IDOR vulnerabilities, beating the Claude Code agent at about one-sixth the cost. It did not beat Semgrep's own purpose-built security scanner, and the result covers one vulnerability type, not security overall.
What is GLM-5.2?
GLM-5.2 is the flagship open-weight large language model from Z.ai (formerly Zhipu AI), released to GLM Coding Plan subscribers on June 13, 2026, with the open weights following three days later. It is a Mixture-of-Experts model with roughly 750 billion total parameters, about 40 billion active per token, a one-million-token context window, and weights published under a permissive MIT license.
Which security benchmark did GLM-5.2 win?
It won on Semgrep's IDOR (Insecure Direct Object Reference) detection test, scoring 39 percent F1 versus 28 to 37 percent for Claude Code. On the separate hidden CyBT-CTF benchmark tested by Graphistry, GLM-5.2 matched Anthropic's Opus models rather than beating them.
Is GLM-5.2 better than Claude and GPT at security?
Not conclusively. GLM-5.2 beat general-purpose Claude and GPT coding agents on IDOR detection, but Semgrep's scanner powered by GPT-5.5 still led the table. Semgrep cautions that results may flip for other vulnerability classes such as SSRF.
Why are security researchers worried about an open-weight model?
Because the weights can be downloaded and run locally, anyone can fine-tune the model or strip its safety guardrails and use it offline without a provider seeing the activity. Security firms have warned this lowers the bar for attackers, with jailbreak methods already circulating on hacker forums.
Is GLM-5.2 a distillation of US models?
It is unproven. Graphistry's analysis flagged unusually high answer agreement between GLM-5.2 and the US models, with Cohen's Kappa rising to roughly 0.80 and 0.76. That is circumstantial evidence only, and Z.ai was not named in Anthropic's earlier distillation report.
Can AI models now find security bugs on their own?
Not fully. Academic benchmarks like CyberGym show that autonomous, end-to-end vulnerability discovery remains substantially constrained for current frontier models. A high score on one curated task is a meaningful signal, not a solved problem.
What is an IDOR vulnerability?
An Insecure Direct Object Reference is a missing authorization check that lets one user access another user's data, often by changing an ID value in a request. It is hard for traditional scanners to catch because detecting it requires reasoning about who is allowed to do what.