Mozilla invited Anthropic’s large language model, Claude, to a two-week bug-hunting sprint inside Firefox’s 25-million-line codebase. The model responded by filing 22 vulnerability reports—14 rated high-severity, 7 medium, 1 low—without a single coffee break.
The Numbers
- Total bugs: 22
- High-severity: 14 (CVSS ≥ 7.0)
- Remote-code-execution: 4
- UXSS (Universal Cross-Site Scripting): 3
- Sandbox escapes: 2
- Average time from prompt to PoC: 3.6 hours
How Claude Did It
Mozilla gave the model read-only access to mozilla-central, a trimmed-down Docker build environment, and a fuzzing harness adapted from domino-fast. Instead of generic “find bugs” prompts, engineers used a chain-of-thought template that asked Claude to:
- Identify attack surface (IPC, WebIDL, JIT stubs).
- Generate targeted input grammars.
- Cross-reference findings with existing Bugzilla entries to avoid duplicates.
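The three-step template above can be sketched as a simple prompt chain. Everything below is hypothetical: `askModel` is a stub standing in for a real model API call, and the prompt wording and stage names are illustrative, not Mozilla's actual template.

```javascript
// Hypothetical sketch of the three-stage prompt chain described above.
// `askModel` is a stub; a real pipeline would call an LLM API here.
function askModel(prompt) {
  // Echo the stage so the chain's structure is visible when run locally.
  return `model output for: ${prompt}`;
}

function huntBugs(componentSource) {
  // Stage 1: enumerate the attack surface of the target component.
  const surface = askModel(
    `List the attack surface (IPC, WebIDL, JIT stubs) in:\n${componentSource}`
  );
  // Stage 2: turn that surface into a targeted input grammar.
  const grammar = askModel(
    `Generate a targeted input grammar for this attack surface:\n${surface}`
  );
  // Stage 3: fuzz, then deduplicate candidates against the bug tracker.
  const report = askModel(
    `Fuzz with the grammar, then cross-reference candidate findings ` +
      `against Bugzilla to drop duplicates:\n${grammar}`
  );
  return { surface, grammar, report };
}
```

The point of chaining rather than one "find bugs" prompt is that each stage's output narrows the next stage's search space, which is what the article credits for the low duplicate rate.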
The model produced standalone regression tests for 20 of the 22 bugs—something most human bounty hunters leave for Mozilla devs to write.
Sample Bug: CVE-2024-4237
- Class: Use-after-free in StructuredCloneHolder::Read.
- Trigger: A ServiceWorker posting a nested SharedArrayBuffer after a cross-origin navigation.
- Impact: UXSS on Android Nightly builds.
- Claude's PoC: 42 lines of JavaScript, no heap-grooming gymnastics required.
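For readers unfamiliar with the bug class: the structured-clone algorithm is what serializes and deserializes objects crossing a `postMessage` boundary, and `StructuredCloneHolder::Read` sits on the deserializing end. The snippet below is NOT the CVE PoC; it is a minimal, safe round trip showing the mechanism, with a plain ArrayBuffer standing in for the SharedArrayBuffer the real trigger uses, and no ServiceWorker or cross-origin navigation involved.

```javascript
// Illustrative only: a structured-clone round trip of the kind the
// deserializer performs when reading a posted message. Not the CVE PoC.
const payload = { nested: { buf: new ArrayBuffer(16) } };

// structuredClone() runs the same serialize/deserialize path as postMessage.
const clone = structuredClone(payload);

// A plain ArrayBuffer is deep-copied, so the clone owns distinct memory.
// (A SharedArrayBuffer, by contrast, is shared rather than copied, which
// is what makes lifetime bookkeeping on the read side delicate.)
console.log(clone.nested.buf !== payload.nested.buf); // true
console.log(clone.nested.buf.byteLength); // 16
```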
Marketing Reality Check
Anthropic’s press release calls Claude a “critical ally in securing the open web.” Translation: it’s still a probabilistic autocomplete that hallucinates; Mozilla’s static-analysis pipeline discarded roughly 60% of Claude’s raw reports as false positives before a human ever saw them. Useful? Absolutely. Magic? Nope.
Firefox’s Response
Patches landed in Nightly within 72 hours for 12 of the 14 high-severity bugs. Two remain embargoed until the next ESR merge. Mozilla paid Anthropic a fixed consulting fee—no per-bug bounties—so the dollar-per-vuln ratio remains undisclosed.
The Benchmark Question
Can we score LLM bug hunting like we score SPECint? Not yet. Mozilla plans to open-source the evaluation harness later this quarter; until then, take any “LLMs out-hack humans” headline with a 3.6-gram grain of salt.
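Until that harness ships, any scoring scheme is guesswork. Here is one hedged sketch, with invented severity weights and field names that reflect nothing Mozilla has announced: count severity-weighted findings that were confirmed and non-duplicate, then normalize against a perfect run so the score lands in [0, 1].

```javascript
// Hypothetical scoring sketch for an LLM bug-hunting run. The weights
// and report fields are invented for illustration only.
function scoreRun(reports) {
  if (reports.length === 0) return 0;
  const weight = { high: 3, medium: 2, low: 1 };
  let earned = 0;
  for (const r of reports) {
    // Only confirmed, non-duplicate findings earn credit.
    if (r.confirmed && !r.duplicate) earned += weight[r.severity] ?? 0;
  }
  // Normalize by what a perfect run (every report confirmed, unique,
  // and high-severity) would earn, so hallucinated reports drag the
  // score down instead of being free.
  return earned / (reports.length * weight.high);
}
```

Under this toy scheme, a run that files two high-severity reports with only one confirmed scores 0.5, which is exactly the kind of denominator-sensitivity a real benchmark would have to argue about.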
Bottom Line
Twenty-two real bugs squashed before they hit stable release is a win for Firefox users. It’s also a reminder that large language models are just another tool—sharp, occasionally unpredictable, and only as good as the engineer wielding them.