The best way to understand what is happening in AI security in the spring of 2026 is to look at what someone does when they want to strip the safety training out of a frontier-class language model.
They do not jailbreak it with a clever prompt. They do not roleplay it into confusion. They do not find a weird edge-case input that makes it forget its instructions. Those techniques still work — we will get to them — but the first move of a serious actor is not a prompt. It is a download.
They pull the weights from Hugging Face. They run a publicly available Python script called abliterator that identifies the single internal direction in the model's activation space that corresponds to "refusal," and they null it. The model, an hour later, is uncensored. Its helpfulness is intact. Its reasoning is intact. Its knowledge is intact. The only thing that has changed is that it no longer refuses anything.
This is not a theoretical threat. It is a weekend project for anyone with a consumer GPU and a working grasp of transformers. There are publicly curated collections of "abliterated" variants of every major open-weight model of the past year on Hugging Face, alongside their unmodified counterparts, served by the same infrastructure, one click away.
That is the thing to sit with. Not as a moral panic. As a security fact. The defensive posture the AI industry built between 2022 and 2025 was designed for a world where models were served, not where models were distributed. That world is ending. The defender ecosystem has not caught up. And the asymmetry is getting worse, not better.
This essay is what a working engineer — not an academic, not a policy analyst, but someone who ships production AI systems for a living — sees in that gap, and what needs to be built to close it.
A note on terminology
This essay uses open weight throughout, not open source, and the distinction is load-bearing.
"Open source," in the tradition the term comes from, means the source code, the build system, and — for AI — the training data and training code, all released under a license that satisfies the Open Source Initiative's definition. A handful of AI projects qualify: OLMo from Allen AI, Pythia from EleutherAI, BLOOM from BigScience. Full stack, genuinely reproducible from scratch.
"Open weight" means the model's parameters are published and downloadable, typically on Hugging Face, sometimes under a permissive license and sometimes under use-restricted terms that would not pass OSI muster. Training data is proprietary. Training code is either unpublished or incomplete. The model cannot be reproduced from the release alone; it can only be used and modified.
Every frontier-class "open" model of the past two years — Llama 3 and 4, Mistral's releases, DeepSeek V3, Qwen 2.5, Gemma 2, GLM-5.1 — is open weight, not open source. Calling them open source is a category error that obscures what is actually being released and, more importantly, what is at stake for security. The threat model in this essay is entirely about what happens when weights are distributed. None of it is about training data or training code. So the language has to match.
How closed-model safety actually worked
It is worth stating plainly what the defensive posture of the closed-model era rested on, because so much of what is written about AI safety treats the mechanisms as vague and magical when they are in fact very concrete and very few.
Through the 2023-2025 era, when the frontier was OpenAI, Anthropic, Google, and a handful of others shipping models only through APIs, AI safety for deployed systems came down to four levers:
Lever one: the gate. You could not use the model without authenticating to the lab that trained it. Every request went through an endpoint they controlled. Every request had a trace. This is a simple point but it is the foundation of everything else.
Lever two: the rate limit. Because every request went through the gate, the lab could throttle any given actor. Iteration cost money and attention. Research on jailbreaks was slow because you were billed per token for each attempt.
Lever three: the monitor. Labs could observe patterns across their entire user base, detect abuse signatures, and respond in real time. An emerging jailbreak technique hitting thousands of requests per minute was visible to the people who could do something about it.
Lever four: the kill switch. A specific user, a specific API key, a specific technique could be revoked, patched, or rate-limited aggressively. The window between a new attack's public disclosure and the lab's defensive patch was sometimes measured in hours.
That is the whole stack. Every policy decision, every content-moderation system, every safety-training investment, every responsible-AI framework rode on top of those four levers. Without them, the defensive posture is not weakened — it is simply not deployable.
And it worked, for the most part. It is what allowed a company like Anthropic to ship Claude with Constitutional AI and trust that the deployment of that model would behave in production roughly the way it behaved in the lab. It is what allowed OpenAI to respond to a new jailbreak class, often within a news cycle, by updating the policy filter in front of the model.
The reason this essay exists is that none of those levers exist anymore once you are running a model that was downloaded from the internet onto a machine you control.
What actually breaks when the weights are downloadable
Four capabilities become available to anyone running an open-weight model locally that were impossible under the API-only regime. Each of them removes one of the four levers. None of them requires novel research — all four are documented in public code, with published techniques, with active communities.
Capability one: refusal-direction ablation
In mid-2024, Arditi and collaborators published a paper titled "Refusal in Language Models Is Mediated by a Single Direction." The title is also the finding. The paper showed that the "refusal" behavior in a modern language model — the thing it does when you ask it to help with something it has been trained to decline — is governed by a single direction in the model's activation space. A one-dimensional vector, out of thousands. Find it. Null it. The model stops refusing.
It is worth spelling out what this means mechanically, because the simplicity is part of the problem.
A transformer processes text through a stack of layers, each of which produces a high-dimensional vector — the "activation" — at every token position. The activation at any given layer can be thought of as the model's working representation of what it is thinking at that point. Arditi's team ran two sets of prompts through a frontier-class model: one set the model would refuse (harmful requests) and one set it would answer (benign ones). They collected the activation vectors at several internal layers for both sets. They computed the mean activation over the harmful set and subtracted the mean over the benign set. The result is a single direction — call it r — that captures the difference between "about to refuse" and "about to comply."
Then they did the ablation. For every forward pass through the model, at the layers where r was most salient, they projected the component of the activation along r out of the vector. That is linear algebra — a single line of code. a' = a - (a · r / ‖r‖²) · r. The model's internal "refusal" subspace is zeroed. The rest of the representation is preserved. Helpfulness, reasoning, knowledge, coherence — all intact. Refusal — gone.
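The whole mechanism fits in a few lines. Here is a minimal NumPy sketch of both steps — the difference-of-means direction and the projection — using random vectors in place of real activations; the shapes and data are illustrative, not taken from the paper's setup:

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between refusal-bound and
    compliance-bound activations, unit-normalized."""
    r = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate(a: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project the component along r out of activation a.
    Since r is unit-norm, a' = a - (a . r) r."""
    return a - np.dot(a, r) * r

# Toy stand-ins for real layer activations: the "harmful" set is shifted
# along dimension 0, so that dimension dominates the recovered direction.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(64, 128))
harmful[:, 0] += 3.0
benign = rng.normal(size=(64, 128))

r = refusal_direction(harmful, benign)
a_ablated = ablate(harmful[0], r)
print(abs(np.dot(a_ablated, r)))  # ~0: no component along r survives
```

In the real attack the same projection is folded into the weight matrices themselves, so the saved checkpoint carries the ablation with it; the linear algebra is identical.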
The paper was, correctly, framed as alignment research. Understanding what safety training encodes is genuinely important. But the technique transfers directly to anyone with a copy of the weights. Within weeks of publication, the FailSpy/abliterator repository wrapped it into a single-file Python script. Given a Hugging Face model path, the script identifies r, writes it to disk, and emits a modified checkpoint. Several parallel community projects did the same.
The hardware profile of the attack is the part that matters for threat-modeling, because it governs who can execute it.
For the ablation step — finding r and producing the modified checkpoint — you need enough memory to hold the model and run it against a few hundred calibration prompts. A 7-billion-parameter model in bfloat16 needs about 14 GB of VRAM; a 70-billion-parameter model needs about 140 GB. Quantization shifts those numbers down substantially — int8 roughly halves the requirement, int4 roughly quarters it — at a small cost to the quality of the identified direction. In practice, the community has produced abliterated variants of 7-B, 13-B, 70-B, and even 405-B class models on consumer and rented cloud hardware.
The concrete tiers, as of spring 2026:
- Entry-level consumer (one RTX 4090 or 5090, 24–32 GB VRAM) is sufficient for 7-B and 13-B models in bf16, and for 70-B in int4 with CPU offloading. The time to produce an abliterated checkpoint is roughly thirty minutes to two hours. The electricity cost is lunch money.
- Dual-GPU or prosumer (two 4090s at 48 GB, or a single RTX 6000 Ada at 48 GB) handles 70-B in int8 natively. A focused practitioner can abliterate three or four distinct models in an afternoon.
- Cloud rental ($2–3/hour for a single H100 with 80 GB) is the most common professional path. For the time it takes to produce an abliterated 70-B variant — roughly an hour of compute — the total bill is a coffee. For a 405-B model, an 8×H100 node runs about $20/hour and produces the modified checkpoint in a few hours; the total cost is under a hundred dollars.
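The arithmetic behind those tiers is worth sanity-checking. A rough sketch, using decimal gigabytes and an overhead factor that is a guess rather than a measured constant:

```python
def vram_gb(params_billion: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold a model's weights. The 20% default
    headroom for activations and KV cache is an assumption, not a
    measured constant; real requirements vary by inference stack."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return round(bytes_total * overhead / 1e9, 1)

print(vram_gb(7, 16, overhead=1.0))   # 7B in bf16: 14.0 GB for weights alone
print(vram_gb(70, 16, overhead=1.0))  # 70B in bf16: 140.0 GB
print(vram_gb(70, 4, overhead=1.0))   # 70B in int4: 35.0 GB, within dual-GPU reach
```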
The cost structure is the story. An individual researcher with a weekend and a laptop can do this for the under-70-B tier. A small team with a modest cloud budget can do it for everything up to 405-B. The industrial capability that produced the original model was measured in tens of millions of dollars of compute; the capability to strip its safety training is measured in tens of dollars.
Serving the resulting checkpoint is cheaper still. The ablated model is the same size as the original. It runs on the same inference stack — vLLM, llama.cpp, TGI, any of them. Consumer hardware serves a quantized 70-B model at real-time latencies. The modified weights can be uploaded to Hugging Face under a new model name, mirrored to BitTorrent, distributed on IPFS, or shared through any of the other channels that the open-weight model ecosystem uses. The distribution surface is the same as any legitimate release.
The important thing about this attack is that it is not brittle. You do not need to craft a prompt that confuses the model each time. You do not need to keep up with patches. You modify the weights once, save the result, and use that modified checkpoint indefinitely.
The community of practice around this is not hiding. Hugging Face hosts curated collections of abliterated variants of most open-weight frontier models released in the past eighteen months. The downloads are in the tens of thousands. A search for abliterated on Hugging Face returns pages of results, hosted alongside the originals and served by the same infrastructure.
Every time a new open-weight frontier model ships, the abliterated version appears within days. Sometimes hours. The pipeline is industrialized.
Capability two: fine-tuning around the safety training
Ablation is the scalpel. Fine-tuning is the sledgehammer.
A sufficiently small number of examples — hundreds, not thousands — of the kind of behavior a model was trained to refuse, combined with a few hours of LoRA fine-tuning on consumer hardware, will produce a model that has re-learned the refused behavior while keeping the general polish of the original training intact.
This is well documented. The LoRA work from Microsoft Research, the QLoRA paper from Dettmers et al., and subsequent studies show that parameter-efficient fine-tuning methods can shift a model's outputs with remarkably small amounts of new data. The alignment community knows this. The offensive community knows this.
The implication is that there is no "safety training" that survives contact with an adversary who controls the weights. A lab can train refusal into a model for months with RLHF and constitutional techniques. An individual with a weekend and a GPU can fine-tune that refusal back out. The training cost asymmetry is not in the defender's favor.
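The cost asymmetry shows up directly in the parameter counts. A sketch of LoRA's trainable-parameter arithmetic, with illustrative layer dimensions rather than any specific model's architecture:

```python
def lora_trainable_params(d: int, k: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter pair applied to a
    d x k weight matrix: B (d x rank) plus A (rank x k)."""
    return d * rank + rank * k

# Hypothetical 7B-class model: 32 layers, 4 attention projections of
# 4096 x 4096 each. These are illustrative numbers, not a real config.
d = k = 4096
layers, projections, rank = 32, 4, 16
adapter_params = lora_trainable_params(d, k, rank) * layers * projections
full_params = 7e9

print(f"{adapter_params:,} trainable params "
      f"({adapter_params / full_params:.3%} of the full model)")
```

Training a quarter of one percent of the parameters is what makes "a weekend and a GPU" a realistic budget: the optimizer state and gradients only exist for the adapters, not the frozen base weights.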
Capability three: supply-chain contamination
This is the one that keeps me up at night.
The Hugging Face ecosystem — which is, let us be clear, an astonishing public good — hosts millions of community-fine-tuned models. The assumption most users implicitly make is that pulling a model from Hugging Face is like pulling a Docker image from Docker Hub: you are getting what the author uploaded, and most authors are acting in good faith, and egregious misbehavior would be detected.
That assumption is not warranted for the threat class of a backdoor-carrying community fine-tune.
A motivated actor can train a model that behaves identically to its parent under normal use, passes standard evaluation benchmarks, and then produces specific attacker-chosen outputs when triggered by a specific string — a backdoor. The technique is published in Hubinger et al. 2024, Anthropic's "Sleeper Agents" paper, which demonstrated exactly this: a backdoor that persisted through standard safety training because it lived in the model's weights, not its system prompt.
This is not hypothetical. It is an engineered, demonstrated attack class. The defensive tooling to detect it at the point of download does not meaningfully exist. A company integrating a Hugging Face model into their production stack as of this writing has approximately no way to know whether they are running an uncompromised checkpoint.
The scaling implication is worse. As larger models become cheaper to distribute, the fraction of deployed systems running community fine-tunes rises. Each deployed system is a potential target for a supply-chain attacker. And the asymmetry — the number of models on Hugging Face versus the number of researchers checking them — is pointed in the wrong direction.
Capability four: distillation without the guardrails
The fourth capability is the most subtle and, in the long run, possibly the most consequential.
You cannot run a 400-billion-parameter model on consumer hardware. But you can use a 400-billion-parameter model, with whatever safety training it has, to generate a large corpus of training data. That corpus can be used to train a much smaller model — seven billion parameters, say — that inherits the capabilities of its teacher but not the safety training, because the safety training never existed as explicit weights in the teacher; it existed as a pattern in the teacher's outputs.
If the teacher has been abliterated or fine-tuned around its refusals, the student learns without any refusal behavior at all. The capability propagates downstream. The frontier lab's safety investment does not.
This is distillation, and it is how the AI research community has made much of its most important progress for years. It is not an exotic attack. It is how open-weight reproductions of closed-model capabilities have been built since 2023. The security concern is not that the technique works; it is that the technique works without any of the safety-training machinery coming along for the ride.
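The core of the technique is a loss function, nothing more. A minimal NumPy sketch of Hinton-style distillation with toy logits and an illustrative temperature:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature: float = 2.0) -> float:
    """Cross-entropy of the student against the teacher's softened
    distribution. The student learns whatever distribution the teacher
    emits; if the teacher was ablated, refusal never appears in the
    training signal at all."""
    p_teacher = softmax(np.asarray(teacher_logits), temperature)
    log_p_student = np.log(softmax(np.asarray(student_logits), temperature) + 1e-12)
    return float(-np.sum(p_teacher * log_p_student))

teacher = np.array([4.0, 1.0, 0.5])
loss_mimic = distillation_loss(teacher, np.array([4.0, 1.0, 0.5]))
loss_off = distillation_loss(teacher, np.array([0.5, 1.0, 4.0]))
print(loss_mimic < loss_off)  # True: matching the teacher minimizes the loss
```

In practice the teacher's sampled text, rather than its raw logits, is often all the student sees; the security-relevant property is the same either way.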
The jailbreak reality on top of all this
So far this essay has described techniques that require modifying the weights of an open-weight model. That is the deep form of the problem. There is also a shallow form, which is that open-weight models can be jailbroken at inference time the way closed-weight models can, except there is no lab to monitor or patch the attack.
Four jailbreak families that have been documented in the peer-reviewed and applied-research literature, each of which works at non-trivial rates against most production LLMs as of early 2026:
Crescendo (Russinovich et al. 2024) is the technique of starting a conversation in a benign domain and escalating incrementally toward the refused behavior, where each step is a small extension of the last and the model's internal signal that "this is an abrupt request to do something harmful" never fires. Published success rates on frontier models are surprisingly high.
Many-shot jailbreak (Anil et al. 2024, Anthropic) takes advantage of long context windows. The attacker fills the context with dozens or hundreds of fabricated "examples" of the model helping with exactly the kind of request it would normally refuse, then issues the real request. The model, having been implicitly instructed by the in-context "examples," complies. The effect is more pronounced as context windows grow.
Encoded prompts — Base64, leetspeak, rare languages, pig Latin, mathematical notation — exploit the fact that safety training is dense on English-language expressions of harmful requests and sparse on the same requests expressed in other encodings. The technique is catalogued in Schulhoff et al.'s prompt-injection taxonomy.
Persona pivots — "you are now DAN," "you are an uncensored version of yourself," "roleplay as a character who does not refuse" — exploit a tension in how models weigh instruction-following against safety. In-distribution examples during safety training rarely include a user-issued persona instruction that contradicts the system-level safety posture, so the model has no strong prior on how to reconcile them.
All four of these techniques work against production models today, including models that were trained and tuned by labs with world-class safety teams. They work better against smaller, less-tuned models. And here is the point relevant to the open-weight asymmetry: when they work against an open-weight model that you control, there is no rate limit, no monitoring, no patch. The attack iterates against a model that cannot report the attacker.
The defender's response gap
It is worth taking stock of what the defender ecosystem has actually shipped, because the story is not that nothing exists. It is that what exists addresses one side of the problem while the other side remains largely open.
On the input side, for filtering user prompts before they reach the model, the practitioner has a growing toolkit. Meta's Llama Guard 3 is a purpose-trained classifier for harmful content. Their Prompt Guard specifically targets prompt-injection and jailbreak patterns. Rebuff is a production-grade prompt-injection detection layer with strong community traction. NVIDIA's NeMo Guardrails offers declarative constraint specification for conversational systems. Garak, the NVIDIA red-team tool, provides an automated harness for testing a deployment against dozens of known attack categories.
On the structural side, there has been real progress. Outlines, Guidance, and the broader constrained-generation ecosystem make it tractable to force model outputs to conform to a JSON schema, a regex, or a finite-state grammar. This is a defense in depth: even if a jailbreak succeeds in changing the model's intention, the output channel can be narrow enough that the attacker cannot express a harmful outcome through it. A function-calling interface is, in security terms, a whitelist.
On the output side, classifier cascades, content moderation pipelines, and constitutional-AI-style self-critique chains are in production use at a number of AI platforms.
All of this is real and useful. None of it addresses the weights-level threat class described earlier.
If an attacker has already abliterated the model they are running, the input filter is a non-issue — they control the entire deployment and can simply not run the filter. If a model has been compromised at the weights level through a supply-chain fine-tune, the output filter catches only the overt expressions of the attack, not the covert data-exfiltration patterns that a backdoor might encode. The structural constraint is effective against behavior change at the application layer, but it is defenseless against a subtly biased model that produces structurally valid outputs that are subtly wrong in content.
This is the defender's gap. The tooling ecosystem is oriented around the wrong threat model — the model as a trusted component, modified by adversarial inputs — when the more consequential threat class is increasingly the model itself as the adversary's artifact.
What to actually do if you are running open-weight in production
If you are responsible for a production system running an open-weight frontier model — which, in 2026, is an increasing fraction of serious AI deployment for reasons of cost, latency, compliance, and vendor independence — here is the stack the evidence supports.
Control the provenance of your weights. Do not pull arbitrary community fine-tunes from Hugging Face unless you can point to the specific person or organization who trained them and have a basis for trusting that person's security posture. For production, pull from the original lab's published checkpoints, verify the hashes, and consider maintaining a pinned local mirror. Treat a model checkpoint the way you would treat a Docker image from a registry you do not control: something to be scanned, signed, and audited before it runs.
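Hash verification is the cheapest of these controls to implement. A minimal sketch, run here against a small stand-in file rather than real weights:

```python
import hashlib
from pathlib import Path

def verify_checkpoint(path: str, expected_sha256: str, chunk: int = 1 << 20) -> bool:
    """Stream a weights file through SHA-256 and compare against a hash
    published by the original lab. Pin the expected value somewhere you
    control, not next to the weights it is supposed to verify."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest() == expected_sha256

# Demonstration with a stand-in file; real checkpoints are verified the same way.
p = Path("demo.safetensors")
p.write_bytes(b"not real weights")
expected = hashlib.sha256(b"not real weights").hexdigest()
ok = verify_checkpoint(str(p), expected)
tampered = verify_checkpoint(str(p), "0" * 64)
p.unlink()
print(ok, tampered)  # True False
```

Streaming in chunks matters: frontier checkpoints are hundreds of gigabytes, and reading them whole into memory is not an option.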
Assume the model can be jailbroken and build a defense in depth anyway. The input filter layer is real security against casual attempts and useful security against sophisticated ones. Llama Guard 3, Prompt Guard, and Rebuff are good, and they are improving. Deploy them. Budget for their latency. Expect a false-positive rate and build an appeals path.
Lock the output channel. Whatever your model produces should go through a constrained-generation layer — Outlines, Guidance, or a purpose-built validator — before it ever reaches a downstream system. Function-calling interfaces, JSON-schema enforcement, and strict output validators are the single most underrated class of AI security control, because they reduce the impact of a successful jailbreak from "the model said something I didn't want" to "the model produced structurally invalid output and the validator rejected it."
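A validator of this kind can be very small. A sketch with hypothetical tool names — `search_docs` and `create_ticket` are placeholders, not a real API:

```python
import json

ALLOWED_ACTIONS = {"search_docs", "create_ticket"}  # explicit whitelist

def validate_tool_call(raw: str) -> dict:
    """Reject anything that is not a well-formed call to a whitelisted
    tool. A jailbroken model that wants to act through this channel has
    to do it inside one of two narrow, typed actions."""
    obj = json.loads(raw)  # structurally invalid output fails here
    if set(obj) != {"action", "args"}:
        raise ValueError("unexpected keys")
    if obj["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"action not whitelisted: {obj['action']!r}")
    if not isinstance(obj["args"], dict):
        raise ValueError("args must be an object")
    return obj

print(validate_tool_call('{"action": "search_docs", "args": {"q": "vpn"}}'))
try:
    validate_tool_call('{"action": "run_shell", "args": {"cmd": "rm -rf /"}}')
except ValueError as e:
    print("rejected:", e)
```

Libraries like Outlines can enforce this grammar during generation, so invalid output is never produced at all; a post-hoc validator like this one is the defense-in-depth backstop behind that.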
Isolate the runtime. If your agent can call tools, those tools should be on an explicit whitelist. If it can make network calls, those calls should go through a proxy that enforces an egress policy. If it can execute code, that execution should happen in a sandbox with no persistent storage and no external network. This is not AI-specific security; it is standard runtime isolation, and the AI community has been slow to adopt it because "the model is just producing text" felt safe. It is not safe. The text it produces controls tool calls, and tool calls control the world.
Monitor continuously at the output layer. Ship token entropy detection. Ship semantic drift monitoring. Ship content-policy classifiers as a final gate. The point is not that any one of these catches every attack; the point is that an attacker has to defeat every layer, and assembling the full stack is far cheaper today than building any single layer was a few years ago.
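Token entropy monitoring, the first of those layers, reduces to a few lines. A sketch with fabricated per-token values and illustrative, uncalibrated thresholds:

```python
import math

def shannon_entropy(probs) -> float:
    """Entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def flag_anomalies(entropies, low: float = 0.1, high: float = 6.0):
    """Flag token positions whose entropy falls outside a normal band.
    A sustained near-zero run (model locked onto a memorized string) or
    a sudden spike can both be worth a second look. Thresholds here are
    illustrative; calibrate against your own traffic."""
    return [i for i, h in enumerate(entropies) if h < low or h > high]

print(shannon_entropy([0.25] * 4))        # 2.0 bits for a uniform 4-way split
stream = [3.2, 2.8, 0.02, 0.01, 0.03, 7.1, 3.0]  # fabricated per-token entropies
print(flag_anomalies(stream))             # [2, 3, 4, 5]
```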
Red-team your own deployment, regularly. Garak, promptfoo, pyRIT — the automated red-team harnesses exist, they are good, and they are free. Set up a weekly run against your production configuration. When a new jailbreak technique appears in the research or community press, test it against your system the same week. Treat it operationally the way a web-infrastructure team treats a published CVE: the clock starts when the disclosure lands.
Build a blue-team muscle in the same organization. The teams I have seen make real progress in this space are the ones where the people building the AI features are also the people running the red-team harnesses. Separating offensive testing into a specialized function produces bureaucratic latency that the attacker does not have. The same engineer who ships a new agent should be the one who runs garak against it before merging. That is not an aspiration; in the teams I know that do it, it is a week-one discipline.
Write things down. Keep an internal runbook of the attack classes you have tested, the defenses you have deployed, the dates of each, and the results. When the next engineer on the team inherits the system, that runbook is the difference between them starting from zero and starting from your accumulated understanding. AI security is too young a field to rediscover the same defense-in-depth lessons every time a new person touches the codebase.
None of this is sufficient. All of it is necessary. The gap I keep returning to — that defenders have tooling for the application layer while the threat is increasingly at the weights layer — is not something individual practitioners can close. It is a research program and a policy project that will take years.
Policy, provenance, and the long view
The AI policy conversation in 2026 has been dominated, understandably, by questions of frontier risk: alignment, capability misuse at the state-actor level, the race dynamics between the major labs. Those are real concerns and this essay is not arguing they should be subordinated.
But the practitioner concern — the boring, near-term, deployed-in-production concern — is a different shape of problem, and it deserves attention in its own right.
Model provenance needs infrastructure. The ability to say, of a given checkpoint, who trained this, from what base, on what data, with what modifications is infrastructure the AI ecosystem does not yet have at scale. Cryptographic signing of weights, chain-of-custody tracking through the fine-tune graph, published provenance manifests — these are all technically tractable and none of them are deployed at the Hugging Face level. The precedent from the software supply-chain world — SLSA, Sigstore, the painful post-SolarWinds rebuild of trust in dependencies — applies here almost directly, and almost none of it has been applied.
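To make the idea concrete, here is a sketch of what a minimal provenance manifest might look like. The field names are invented for illustration; no such standard exists today, which is the point:

```python
import hashlib
import json

def provenance_manifest(model_name, weights_sha256, parent=None, modifications=()):
    """A minimal chain-of-custody record of the kind the essay argues
    should travel with every checkpoint. A real scheme would add a
    cryptographic signature over the manifest, Sigstore-style, rather
    than a bare content hash."""
    entry = {
        "model": model_name,
        "weights_sha256": weights_sha256,
        "parent": parent,  # manifest hash of the base model, if any
        "modifications": list(modifications),
    }
    entry["manifest_sha256"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    return entry

base = provenance_manifest("example-base-70b", "a" * 64)
ft = provenance_manifest(
    "example-base-70b-medical-ft", "b" * 64,
    parent=base["manifest_sha256"],
    modifications=["lora-finetune: clinical-notes-v2"],
)
print(ft["parent"] == base["manifest_sha256"])  # True: lineage is traversable
```

Walking the `parent` chain from any deployed checkpoint back to a lab-signed root is exactly the audit that is impossible today.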
Fine-tune tracking matters. When a model that descends from Llama 4 is deployed in a healthcare application, the operator should be able to trace every fine-tune in that model's lineage. Today, the best available tooling for this is a careful reading of the model card, which its author writes freely and nobody verifies. The asymmetry between "it is easy to produce a fine-tune" and "it is hard to audit a fine-tune" is one of the root structural problems.
Jurisdictional regulation is going to happen. The EU AI Act's provisions on general-purpose AI models already touch this space, and the NIST AI Risk Management Framework's maturation is pushing US enterprise adoption in a similar direction. Neither of them, today, has teeth at the level of "your production deployment of an abliterated community fine-tune of a Chinese lab's model is in violation of X." They will, eventually. The question is whether the practitioner community is ready when they do.
The research program is real. The academic side of this — interpretability research that can detect compromised weights, mechanistic tools for understanding what a model's safety training actually encodes, formal methods for certifying a checkpoint as unmodified from a reference — is genuinely nascent and genuinely important. The tools do not exist yet. The labs working on them are few. The funding for them is small relative to the capability research that produces the models whose weights get distributed.
This is the research program that matters for the next five years in AI security, and it is a research program that is a fraction the size of the capability-research program whose outputs are creating the threat surface.
A concrete case: GLM-5.1 and the cybersecurity asymmetry
It is worth grounding all of this in a specific recent release, because the abstract argument is easier to accept when you can point at the calendar — and in the spring of 2026, that calendar entry is GLM-5.1.
Z.AI (formerly Zhipu AI) released GLM-5.1 as a 744-billion-parameter Mixture-of-Experts model with 40 billion active parameters per token, a 200K-token context window, and — notably for anyone paying attention to the US-China AI infrastructure decoupling — trained entirely on Huawei Ascend 910B chips using Huawei's MindSpore framework. No Nvidia. No AMD. No American silicon. The weights landed on Hugging Face within hours of the announcement, under a permissive license that places no meaningful restriction on downstream use or modification.
The release followed the by-now-familiar distribution pattern. Within days, community-fine-tuned variants with suppressed refusals appeared. Within two weeks, parameter-efficient fine-tunes with alternative behavior profiles were published. Within a month, distilled derivatives trained on the model's outputs began circulating as smaller, cheaper variants. None of this was surprising. It was the ecosystem doing what the ecosystem does.
But the part that matters for this essay is the benchmark results. GLM-5.1 is not a respectable second-tier open-weight release catching up to the Western frontier. On the benchmarks that matter most for the threat model in this essay — coding agency and cybersecurity capability — it is the current leader, closed or open.
Published benchmarks — GLM-5.1 versus the frontier
| Benchmark | What it measures | GLM-5.1 | Claude Opus 4.7 | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|
| SWE-Bench Pro | Real-world SWE on unseen repos | 58.4 | 64.3 | 57.3 | 57.7 | 54.2 |
| CyberGym | Offensive cyber task completion | 68.7% ⚠ | 73.1% | 33.4% ⚠ | 66.3% ⚠ | — |
| HumanEval | Python from docstring | 94.6% | ~95% | ~95% | ~95% | ~95% |
| LiveCodeBench | Recent real-world coding | 68% | — | — | — | — |
| Claude Code harness | Agentic coding w/ tool use | 45.3 | — | 47.9 | — | — |
<small>⚠ = self-reported or third-party with known methodological divergence. See below. The CyberGym scores in particular carry several asterisks: GLM-5.1's 68.7% is self-reported by Z.AI, with no independent third-party reproduction published as of this writing. Closed-model scores reflect both capability and refusal behavior — Z.AI themselves acknowledge that Gemini 3.1 Pro and GPT-5.4 refused some CyberGym tasks for safety reasons, which depresses their scores. The Opus 4.6 33.4% figure appears in earlier third-party coverage; the Opus 4.7 73.1% figure is from Anthropic's April 16, 2026 launch reporting and represents a substantial jump. A separately trained Anthropic model code-named "Mythos Preview" scores 83.1% on CyberGym but is not available to the public — deployed only to a small set of defensive-cybersecurity partners (AWS, Apple, Cisco, Google, JPMorgan Chase, Microsoft, Nvidia, CrowdStrike). Sources: CyberGym paper, Anthropic Opus 4.7 launch, Lushbinary CyberGym breakdown, llm-stats.com Opus 4.7 analysis, OfficeChai SWE-Bench Pro coverage. Published benchmark results, not reproductions by this publication. A primary-research follow-up is in development.</small>
Why CyberGym is the most interesting row — and what it actually shows
The CyberGym column is the piece of the table the security community should focus on, but not for the reason it looks like at first glance. The easy read — "open-weight model beats closed models by 30 points on cybersecurity" — is not the right read. The real story is more subtle, and more load-bearing for the thesis of this essay.
First, the raw numbers are not directly comparable across labs. CyberGym rewards successful exploit reproduction; it counts refusal-to-attempt as failure. So the reported score of any given model is a blend of capability and willingness. A model that is fully capable of reproducing an OSS-Fuzz vulnerability but declines to do so for safety reasons — which is exactly what both Anthropic and OpenAI train their frontier models to do on a substantial fraction of CyberGym tasks — scores lower than a model that would attempt every task regardless of capability.
Second, the apparent gap has been closing rapidly. The spring-2026 snapshot looks very different with Claude Opus 4.7 included:
- GLM-5.1 self-reports 68.7% CyberGym (March/April 2026)
- Opus 4.7 scores 73.1% (April 16, 2026 launch) — higher than GLM-5.1
- Mythos Preview scores 83.1% — Anthropic's internal frontier, deliberately held back from public release
Put those three data points in order and the picture is not "open weights have passed closed-model cybersecurity capability." The picture is: the labs with the most cyber-capable frontier models are increasingly the ones with the most disciplined release policies. Anthropic has a model 10 points above anything in public circulation and is deliberately not shipping it. That is the closed-model defensive lever described earlier in this essay, working exactly as designed.
And yet — and this is the load-bearing part — none of that matters for the deployment threat model, because:
- Opus 4.7's 73.1% is only accessible through the Anthropic API, gated, rate-limited, and monitored
- Mythos Preview's 83.1% is not accessible to the public at all
- GLM-5.1's self-reported 68.7% is running on anyone's laptop right now, and the abliterated variant that removes its refusal behavior was published within days
The gap that matters is not the benchmark gap between closed-frontier and open-frontier. The gap that matters is between frontier capability with enforced refusal and frontier-adjacent capability with no enforceable refusal at all. Closed labs have the higher ceiling. Open weights have the lower floor — and the lower floor is what gets deployed, unsupervised, against real systems.
This is the more precise form of the asymmetry. The closed labs are doing exactly the work they are supposed to: pushing the frontier, gating access to the most capable models, and training refusal into the ones they do release. The problem is that the defensive posture of "gate the frontier" has a fundamental blind spot the size of GLM-5.1. A model that is within five points of the most capable publicly released closed model on offensive cybersecurity, that has no enforceable refusal, that runs on consumer hardware — is not the most dangerous model in existence. It is the most dangerous model in deployment. And nothing in the closed-model defensive playbook stops it.
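"No enforceable refusal" is cheap to achieve because the underlying operation is small. As described at the top of this essay, abliteration estimates a single "refusal" direction (in the published technique, from the mean activation difference between harmful and harmless prompts) and projects it out. A minimal numpy sketch of that projection; real tooling applies it across layers and folds it into the weights:

```python
import numpy as np

def ablate_direction(x: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Remove the component of activation x along the (unit-normalized)
    refusal direction. This is the core projection; production abliteration
    repeats it per layer or bakes the equivalent edit into weight matrices."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return x - np.dot(x, r) * r

# After ablation the activation carries no component along the direction:
x = np.array([1.0, 2.0, 3.0])
r = np.array([0.0, 1.0, 0.0])
print(ablate_direction(x, r))  # → [1. 0. 3.]
```

One rank-one projection per layer, no retraining, no gradient steps: that is why the release-to-abliterated cadence is measured in days.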
Stitched-together picture
The best model for cybersecurity automation that anyone can actually deploy as of spring 2026 is not Opus 4.7 and not Mythos Preview. It is whatever abliterated variant of GLM-5.1 is pulling the most downloads on Hugging Face this week. That model is:
- Open weight — downloadable, no API gate, no revocation
- Near-frontier on cybersecurity — within ~5 points of the highest publicly-released model
- Abliterated on arrival — refusal behavior suppressed, available within a week of the original release
- Cheap to run — quantized to int4, real-time on consumer hardware
- Uncoordinated — no rate limits, no monitoring, no kill switch
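"Cheap to run" is simple arithmetic. A rough weights-only VRAM estimate (ignoring KV cache and activation overhead; the 70B parameter count below is illustrative, not GLM-5.1's actual size):

```python
def weights_vram_gb(n_params_billion: float, bits_per_param: float) -> float:
    """Rough weights-only memory footprint in GiB. Real deployments need
    KV cache and activation memory on top of this."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 2**30

# A hypothetical 70B-parameter model:
print(round(weights_vram_gb(70, 16), 1))  # fp16: ~130 GiB, datacenter territory
print(round(weights_vram_gb(70, 4), 1))   # int4: ~33 GiB, a pair of consumer GPUs
```

Quantization to int4 is what moves frontier-adjacent capability from "rack of H100s" to "gaming PC," and it is a one-command operation in current tooling.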
Every defensive lever described earlier in this essay assumed that when a deployable frontier-class cybersecurity capability arrived, it would arrive through an API, behind the gate, monitored by the lab that built it. That assumption has quietly failed. The publicly served closed frontier models still exist — Opus 4.7 and its peers are genuinely at or above GLM-5.1 on raw capability — but they are not the threat vector anyone deploying AI-assisted cyber tooling will actually choose, because the gate is exactly what the deployer is trying to escape.
This is the thesis in a single stitched-together picture. The defensive posture built for a served-model world is running into a distributed-model reality where the effective capability gap — the gap in what can actually be deployed, at scale, against real targets, without supervision — is within an arm's length of the closed frontier and widening in the wrong direction.
The time-series argument
Zoom out from GLM-5.1 and the pattern across releases is the same shape. Every open-weight frontier release of the past two years — Llama 3 and 4, DeepSeek V3, Mistral's major drops, Qwen 2.5, Gemma 2, GLM-4.6 and now GLM-5.1 — follows roughly the same cadence from release to modified variant in public circulation: days. The defender ecosystem's cycle has not shortened at all — it has lengthened, because there are more models to cover, not fewer.
This is the asymmetry expressed as a time-series. The offensive cycle is same-week. The defensive cycle is quarterly, at best. The gap compounds with every release, and GLM-5.1 is the release where it stops being theoretical, even for those still willing to argue that it was.
What would close that gap is not a specific tool or a specific paper. It is an ecosystem — a pipeline of automated provenance checking, automated red-teaming, automated defensive-pattern synthesis — that runs against every new release at the pace of the release. That ecosystem does not exist. Pieces of it exist. The coordination does not.
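The provenance-checking piece of that pipeline starts from something unglamorous: pinned digests. A minimal sketch, with an illustrative manifest format rather than any existing tool's:

```python
import hashlib
import json
import pathlib

def sha256_file(path: pathlib.Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (checkpoints are big)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_checkpoint(dir_path: str, manifest_path: str) -> list[str]:
    """Compare every digest in a pinned manifest ({filename: sha256})
    against what is on disk; return the list of mismatched files."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    root = pathlib.Path(dir_path)
    return [name for name, digest in manifest.items()
            if sha256_file(root / name) != digest]
```

This catches tampering between the publisher and you; it says nothing about a malicious upstream, because the manifest itself still has to come from somewhere trusted. That trust-anchoring problem is exactly the uncoordinated piece.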
Agentic deployment is the forcing function
One more piece of the picture, because it is the part that will make this go from "a problem for specialists" to "a problem for every software team in five years."
The deployment pattern in 2026 is not a model answering questions in a chat box. It is an agent, often running in a loop, with tool access, with persistent state, with the ability to take actions that have consequences in the world. An agent that can send an email. An agent that can write to a database. An agent that can commit code to a repository. An agent that can move money.
Every one of the attack classes described in this essay — abliteration, supply-chain backdoor, jailbreak, fine-tune-around-safety — has a different severity profile when the model is embedded in an agentic loop than when it is answering a one-off question. A compromised checkpoint in a chatbot produces a bad answer. A compromised checkpoint in an agent with git push access produces a supply-chain attack on the software you ship to your customers.
The gap between "the model said something I did not want" and "the model took an action that had consequences" is the gap that transforms an AI safety problem into a security engineering problem. The industry has, largely, not made that transition. It is the transition that the next five years are going to force.
If you are deploying agents today, the operational disciplines that matter are the ones security engineers have spent thirty years on: least privilege, defense in depth, tool-access whitelisting, network egress control, audit logging, and the mental model that assumes any component you do not control is potentially compromised. Those disciplines transfer directly to AI systems. What is new is not the defenses. What is new is that the component you do not control is the model itself, and the model is more capable than anything those defenses were originally designed to contain.
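Least privilege and audit logging translate to agent tooling almost mechanically. A sketch of a tool-call gate; the class and method names are hypothetical, not any particular agent framework's API:

```python
import json
import time
from typing import Any, Callable

class ToolGate:
    """Wraps an agent's tool dispatch: only allowlisted tools may run,
    and every call, allowed or denied, lands in an append-only audit log."""

    def __init__(self, allowlist: set[str]):
        self.allowlist = allowlist
        self.audit_log: list[dict] = []

    def call(self, name: str, fn: Callable[..., Any], **kwargs) -> Any:
        allowed = name in self.allowlist
        self.audit_log.append({
            "ts": time.time(),
            "tool": name,
            "args": json.dumps(kwargs, default=str),
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"tool {name!r} not in allowlist")
        return fn(**kwargs)

gate = ToolGate(allowlist={"search_docs"})
gate.call("search_docs", lambda query: f"results for {query}", query="cve")
# gate.call("git_push", ...) would raise PermissionError and still be logged
```

The design choice worth copying is that denials are logged, not just allowed calls: a model probing for tools it was never granted is itself a signal worth alerting on.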
The 18-month view
Here is what I expect to be true in October 2027, unless the structural incentives change:
Every major open-weight model released in the intervening period will have a parallel collection of abliterated and fine-tuned variants circulating in the community-fine-tune ecosystem, available at the same quality of hosting and reliability as the original. The fraction of downloads going to those variants will be non-trivial and poorly instrumented.
At least one widely-reported incident will involve a production system running a community fine-tune that turned out to contain an engineered backdoor. The response will be uncoordinated. The incident will not substantially change industry practice. Six months later, the pattern will repeat.
The input-filter ecosystem will be notably more mature. Llama Guard, Prompt Guard, Rebuff, and their successors will catch more attacks at better latency. The output-constraint ecosystem will similarly mature. The weights-level threat class will be about where it is today, possibly worse.
Academic work on interpretability and provenance will continue to advance without being meaningfully productized. The first real commercial offering for "we scan this checkpoint and tell you if it has been compromised" will exist by late 2027, will be expensive, will be partially effective, and will be adopted by a small number of highly-regulated industries.
The AI security practitioner of 2027 will look more like a traditional application-security engineer with an ML background than like a traditional ML researcher, because the skills that matter for deployed security are increasingly the skills of someone who knows how to threat-model a system, not the skills of someone who knows how to train one.
None of this is a prediction of catastrophe. It is a prediction of a field that is getting wider faster than it is getting better-instrumented.
What I am going to do about it
This essay is the first of a research program, not the last. The next piece is a primary-research case study on a specific open-weight frontier model, with reproduced benchmarks, reproduced jailbreak attempts, and a tested defender playbook. The one after that is on the supply-chain question — what it actually takes to audit a checkpoint you pulled from the internet, and how far short current tooling falls. The one after that is on the practitioner-facing integration of the defense stack into a real production deployment.
This is the beat. This is what this publication is going to do, weekly or as near to weekly as production research allows, for the next year. It is the part of the AI security problem space that gets talked about less than the frontier-alignment conversation but affects more deployed systems on more days.
If you are working in this space — on the offensive side, on the defensive side, on the provenance and policy sides — I want to hear from you. Corrections, disagreements, techniques I got wrong or underweighted, entire classes of attack or defense I missed: editor@solvedbycode.ai. The review policy is the same as everything else on this site: stamped in place with a dated changelog, no silent rewrites.
The weights are out. The safety training is optional. The tooling is behind. The stakes are real.
Let us get to work.
This essay is synthesis and threat modeling, drawn from public research and production engineering experience. The companion piece — a primary-research case study with reproduced benchmarks, jailbreak attempts, and tested defenses — is in development. Research protocol is published openly and corrections from working researchers are welcomed.
Corrections
2026-04-20 — benchmark table revised. The original version of this essay published a benchmark table showing GLM-5.1 leading CyberGym by ~30 percentage points over Claude Opus 4.6 and GPT-5.4, framed as "open weights beating closed frontier on cybersecurity." A reader flagged that this framing was underspecified. On review:
- GLM-5.1's 68.7% CyberGym score is self-reported by Z.AI. No independent third-party reproduction has been published as of this correction. The table now marks this with ⚠ and the methodology note makes it explicit.
- Claude Opus 4.7 was released April 16, 2026, with a published CyberGym score of 73.1% — actually higher than GLM-5.1's self-reported number. The table did not include Opus 4.7. It does now.
- Closed-model CyberGym scores reflect both capability and refusal. Z.AI themselves acknowledge that Gemini 3.1 Pro and GPT-5.4 refused a fraction of CyberGym tasks for safety reasons, lowering their scores. A table that reports only the raw scores without this context is misleading.
- Anthropic's Mythos Preview — scoring 83.1% on CyberGym and deliberately held back from public release — was not mentioned in the original version. It is the single most important data point for the essay's thesis about the closed-model defensive playbook working exactly as designed.
The narrative section "Why CyberGym is the most interesting row — and what it actually shows" was rewritten to reflect this more precise picture. The underlying thesis — that the deployable capability gap between frontier closed models and abliterated open-weight models is smaller than the raw-benchmark picture suggests, and widening in the wrong direction — is strengthened, not weakened, by the correction.
Thanks to the reader who pushed back. This is what the corrections policy is for.