University of Florida researchers just served up a reality check for anyone who thinks the deepfake-detection war is over. Their verdict: AI crushes humans on still photos—97 % accuracy versus coin-flip randomness—but the moment those synthetic faces start talking or blinking, people claw back the advantage. Benchmarks first, marketing fluff never.
The Setup
- 3,000+ deepfake portraits generated with StyleGAN2 and its progeny
- Same identities rendered in 1-second video clips (eye blinks, lip twitches, micro-expressions)
- Four off-the-shelf detection models: Xception, EfficientNet-B4, MesoNet, and the Facebook-built “Deepfake Detection Challenge” champ
- 350 human volunteers recruited via Prolific, balanced for age, gender, and prior exposure to deepfakes
Results in One Table
| Medium | Best AI Accuracy | Human Accuracy | p-value |
|---|---|---|---|
| Still images | 97.1 % (EfficientNet-B4) | 48.9 % | <0.001 |
| 1-second video | 72.4 % (Xception) | 78.6 % | 0.02 |
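For a rough sense of where those p-values come from, a two-proportion z-test on the accuracy gaps can be sketched in a few lines. The trial counts below (1,000 judgments per side) are assumptions for illustration, not figures from the study:

```python
from math import sqrt

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Assumed sample sizes (not reported in the table above): 1,000 trials each.
z_stills = two_proportion_z(0.971, 1000, 0.489, 1000)  # AI vs. human on stills
z_video = two_proportion_z(0.786, 1000, 0.724, 1000)   # human vs. AI on video
print(round(z_stills, 1), round(z_video, 1))
```

The stills gap yields a z statistic in the twenties (vanishingly small p), while the video gap lands around 3, which matches the table's pattern of a decisive stills result and a narrower but still significant video result.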
Why AI Dominates Stills
- Pixel-perfect artifacts: GANs leave high-frequency fingerprints—checkerboard patterns in teeth, asymmetric glints in corneas—that convolutional nets gobble up.
- No temporal noise: A single frame doesn’t wobble; the model can stare as long as it likes.
- Training abundance: Millions of labeled fake faces are free for the scraping. Humans, meanwhile, rarely stare at 4K synthetic eyeballs for sport.
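The checkerboard fingerprint mentioned above can be caricatured as a one-line correlation test: project a patch onto an alternating-sign pattern and see how much energy survives. The toy patches and threshold here are illustrative assumptions, not the paper's detector:

```python
def checkerboard_score(img):
    """Correlate a 2-D grayscale patch (list of lists of floats) with a
    pixel-level checkerboard; a high score hints at upsampling artifacts."""
    h, w = len(img), len(img[0])
    total = sum(img[y][x] * (1 if (x + y) % 2 == 0 else -1)
                for y in range(h) for x in range(w))
    return abs(total) / (h * w)

# Toy patches: a smooth gradient vs. a patch with an alternating artifact.
smooth = [[(x + y) / 16 for x in range(8)] for y in range(8)]
artifact = [[0.9 if (x + y) % 2 == 0 else 0.1 for x in range(8)] for y in range(8)]
print(checkerboard_score(smooth), checkerboard_score(artifact))
```

The gradient patch scores near zero while the artifact patch scores high, which is the same signal, writ small, that a convolutional net learns to exploit across millions of images.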
Why Humans Dominate Video
- Biological motion priors: We’ve evolved to detect wonky blinks (too fast, too slow, no lid lag) and lip-sync failures that models still misclassify as compression noise.
- Contextual coherence: A necklace chain that floats two pixels off the collar for a single frame screams “render glitch” to us; the AI writes it off as one more frame of noise.
- The uncanny valley emotion detector: Subtle lack of micro-wrinkles around the eyes when someone smiles is creepy to us, merely a low-variance texture to a CNN.
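The blink-cadence prior in the first bullet is simple enough to write down as a rule. The thresholds below are rough physiological ranges chosen for illustration (blinks of ~100–400 ms, recurring every ~2–10 s), not values from the study:

```python
def blink_looks_human(durations_ms, intervals_ms):
    """Heuristic check that blink timing falls in rough human ranges.
    Thresholds are illustrative assumptions, not study parameters."""
    if not durations_ms:
        return False  # a clip with zero blinks is itself suspicious
    ok_duration = all(100 <= d <= 400 for d in durations_ms)
    ok_interval = all(2000 <= i <= 10000 for i in intervals_ms)
    return ok_duration and ok_interval

print(blink_looks_human([150, 220], [3500]))  # plausible cadence -> True
print(blink_looks_human([40, 40], [500]))     # machine-gun blinks -> False
```

Humans apply this check for free; a frame-by-frame CNN has to learn it from temporal data it was often never trained on.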
The Caveats
- Dataset scope: All fakes were StyleGAN derivatives. Swap in diffusion-model faces and the AI lead could shrink.
- Video length: Clips were capped at one second. Give the networks 5-second sequences with audio and the leaderboard might flip again.
- Human expertise: Participants weren’t trained. A one-minute tutorial on blink cadence pushed human accuracy to 85 % in a follow-up pilot—good luck getting that past an IRB at scale.
Take-away for Builders
If your threat model is passport-style photo fraud, plug in EfficientNet and call it a day. If you’re moderating TikTok, keep a human in the loop—or at least fine-tune on temporal datasets that include real-world compression artifacts, not just pristine GAN output.
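One way to wire in that human-in-the-loop: gate on detector confidence and only escalate the uncertain middle. The threshold values here are placeholders to be tuned on a validation set, not recommendations from the study:

```python
def route(fake_probability: float, hi: float = 0.9, lo: float = 0.1) -> str:
    """Route a detector's fake-probability score to a moderation action.
    Thresholds hi/lo are assumed placeholders; tune them on held-out data."""
    if fake_probability >= hi:
        return "auto-remove"    # confident fake: act without review
    if fake_probability <= lo:
        return "auto-pass"      # confident real: no action needed
    return "human-review"       # uncertain band: escalate to a person

print(route(0.97), route(0.50), route(0.02))
```

This keeps human reviewers focused on exactly the band where, per the results above, they outperform the models: ambiguous moving footage.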
TL;DR
Still fakes: bet on silicon. Moving fakes: bet on squishy neurons—at least until the next training run.