Mozilla invited Anthropic’s large language model, Claude, to a two-week bug-hunting sprint inside Firefox’s 25-million-line codebase. The model responded by filing 22 vulnerability reports—14 rated high-severity, 7 medium, 1 low—without a single coffee break.
The Numbers
- Total bugs: 22
- High-severity: 14 (CVSS ≥ 7.0)
- Remote-code-execution: 4
- UXSS (Universal Cross-Site Scripting): 3
- Sandbox escapes: 2
- Average time from prompt to PoC: 3.6 hours
How Claude Did It
Mozilla gave the model read-only access to mozilla-central, a trimmed-down Docker build environment, and a fuzzing harness adapted from domino-fast. Instead of generic “find bugs” prompts, engineers used a chain-of-thought template that asked Claude to:
- Identify attack surface (IPC, WebIDL, JIT stubs).
- Generate targeted input grammars.
- Cross-reference findings with existing Bugzilla entries to avoid duplicates.
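The three-step template above can be sketched as a simple prompt chain. Everything below is hypothetical: `askModel` is a stub standing in for a real model API call, and the prompt wording and stage names are illustrative, not Mozilla's actual template.

```javascript
// Hypothetical sketch of the three-stage prompt chain described above.
// `askModel` is a stub; a real pipeline would call an LLM API here.
function askModel(prompt) {
  // Echo the stage so the chain's structure is visible when run locally.
  return `model output for: ${prompt}`;
}

function huntBugs(componentSource) {
  // Stage 1: enumerate the attack surface of the target component.
  const surface = askModel(
    `List the attack surface (IPC, WebIDL, JIT stubs) in:\n${componentSource}`
  );
  // Stage 2: turn that surface into a targeted input grammar.
  const grammar = askModel(
    `Generate a targeted input grammar for this attack surface:\n${surface}`
  );
  // Stage 3: fuzz, then deduplicate candidates against the bug tracker.
  const report = askModel(
    `Fuzz with the grammar, then cross-reference candidate findings ` +
      `against Bugzilla to drop duplicates:\n${grammar}`
  );
  return { surface, grammar, report };
}
```

The point of chaining rather than one "find bugs" prompt is that each stage's output narrows the next stage's search space, which is what the article credits for the low duplicate rate.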
The model produced standalone regression tests for 20 of the 22 bugs—something most human bounty hunters leave for Mozilla devs to write.
Sample Bug: CVE-2024-4237
- Class: Use-after-free in StructuredCloneHolder::Read.
- Trigger: A ServiceWorker posting a nested SharedArrayBuffer after a cross-origin navigation.
- Impact: UXSS on Android Nightly builds.
- Claude's PoC: 42 lines of JavaScript, no heap-grooming gymnastics required.
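For readers unfamiliar with the bug class: the structured-clone algorithm is what serializes and deserializes objects crossing a `postMessage` boundary, and `StructuredCloneHolder::Read` sits on the deserializing end. The snippet below is NOT the CVE PoC; it is a minimal, safe round trip showing the mechanism, with a plain ArrayBuffer standing in for the SharedArrayBuffer the real trigger uses, and no ServiceWorker or cross-origin navigation involved.

```javascript
// Illustrative only: a structured-clone round trip of the kind the
// deserializer performs when reading a posted message. Not the CVE PoC.
const payload = { nested: { buf: new ArrayBuffer(16) } };

// structuredClone() runs the same serialize/deserialize path as postMessage.
const clone = structuredClone(payload);

// A plain ArrayBuffer is deep-copied, so the clone owns distinct memory.
// (A SharedArrayBuffer, by contrast, is shared rather than copied, which
// is what makes lifetime bookkeeping on the read side delicate.)
console.log(clone.nested.buf !== payload.nested.buf); // true
console.log(clone.nested.buf.byteLength); // 16
```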
Marketing Reality Check
Anthropic’s press release calls Claude a “critical ally in securing the open web.” Translation: it’s still a probabilistic autocomplete that hallucinates; Mozilla’s static-analysis pipeline discarded roughly 60% of Claude’s raw reports as false positives before a human ever saw them. Useful? Absolutely. Magic? Nope.
Firefox’s Response
Patches landed in Nightly within 72 hours for 12 of the 14 high-severity bugs. Two remain embargoed until the next ESR merge. Mozilla paid Anthropic a fixed consulting fee—no per-bug bounties—so the dollar-per-vuln ratio remains undisclosed.
The Benchmark Question
Can we score LLM bug hunting like we score SPECint? Not yet. Mozilla plans to open-source the evaluation harness later this quarter; until then, take any “LLMs out-hack humans” headline with a 3.6-gram grain of salt.
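Until that harness ships, any scoring scheme is guesswork. Here is one hedged sketch, with invented severity weights and field names that reflect nothing Mozilla has announced: count severity-weighted findings that were confirmed and non-duplicate, then normalize against a perfect run so the score lands in [0, 1].

```javascript
// Hypothetical scoring sketch for an LLM bug-hunting run. The weights
// and report fields are invented for illustration only.
function scoreRun(reports) {
  if (reports.length === 0) return 0;
  const weight = { high: 3, medium: 2, low: 1 };
  let earned = 0;
  for (const r of reports) {
    // Only confirmed, non-duplicate findings earn credit.
    if (r.confirmed && !r.duplicate) earned += weight[r.severity] ?? 0;
  }
  // Normalize by what a perfect run (every report confirmed, unique,
  // and high-severity) would earn, so hallucinated reports drag the
  // score down instead of being free.
  return earned / (reports.length * weight.high);
}
```

Under this toy scheme, a run that files two high-severity reports with only one confirmed scores 0.5, which is exactly the kind of denominator-sensitivity a real benchmark would have to argue about.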
Bottom Line
Twenty-two real bugs squashed before they hit stable release is a win for Firefox users. It’s also a reminder that large language models are just another tool—sharp, occasionally unpredictable, and only as good as the engineer wielding them.