University of Florida researchers just served up a reality check for anyone who thinks the deepfake-detection war is over. Their verdict: AI crushes humans on still photos—97 % accuracy versus coin-flip randomness—but the moment those synthetic faces start talking or blinking, people claw back the advantage. Benchmarks first, marketing fluff never.
The Setup
- 3,000+ deepfake portraits generated with StyleGAN2 and its progeny
- Same identities rendered in 1-second video clips (eye blinks, lip twitches, micro-expressions)
- Four off-the-shelf detection models: Xception, EfficientNet-B4, MesoNet, and the Facebook-built “Deepfake Detection Challenge” champ
- 350 human volunteers recruited via Prolific, balanced for age, gender, and prior exposure to deepfakes
Results in One Table
| Medium | Best AI Accuracy | Human Accuracy | p-value |
|---|---|---|---|
| Still images | 97.1 % (EfficientNet-B4) | 48.9 % | <0.001 |
| 1-second video | 72.4 % (Xception) | 78.6 % | 0.02 |
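For a rough sense of where those p-values come from, a two-proportion z-test on the accuracy gaps can be sketched in a few lines. The trial counts below (1,000 judgments per side) are assumptions for illustration, not figures from the study:

```python
from math import sqrt

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Assumed sample sizes (not reported in the table above): 1,000 trials each.
z_stills = two_proportion_z(0.971, 1000, 0.489, 1000)  # AI vs. human on stills
z_video = two_proportion_z(0.786, 1000, 0.724, 1000)   # human vs. AI on video
print(round(z_stills, 1), round(z_video, 1))
```

The stills gap yields a z statistic in the twenties (vanishingly small p), while the video gap lands around 3, which matches the table's pattern of a decisive stills result and a narrower but still significant video result.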
Why AI Dominates Stills
- Pixel-perfect artifacts: GANs leave high-frequency fingerprints—checkerboard patterns in teeth, asymmetric glints in corneas—that convolutional nets gobble up.
- No temporal noise: A single frame doesn’t wobble; the model can stare as long as it likes.
- Training abundance: Millions of labeled fake faces are free for the scraping. Humans, meanwhile, rarely stare at 4K synthetic eyeballs for sport.
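The checkerboard fingerprint mentioned above can be caricatured as a one-line correlation test: project a patch onto an alternating-sign pattern and see how much energy survives. The toy patches and threshold here are illustrative assumptions, not the paper's detector:

```python
def checkerboard_score(img):
    """Correlate a 2-D grayscale patch (list of lists of floats) with a
    pixel-level checkerboard; a high score hints at upsampling artifacts."""
    h, w = len(img), len(img[0])
    total = sum(img[y][x] * (1 if (x + y) % 2 == 0 else -1)
                for y in range(h) for x in range(w))
    return abs(total) / (h * w)

# Toy patches: a smooth gradient vs. a patch with an alternating artifact.
smooth = [[(x + y) / 16 for x in range(8)] for y in range(8)]
artifact = [[0.9 if (x + y) % 2 == 0 else 0.1 for x in range(8)] for y in range(8)]
print(checkerboard_score(smooth), checkerboard_score(artifact))
```

The gradient patch scores near zero while the artifact patch scores high, which is the same signal, writ small, that a convolutional net learns to exploit across millions of images.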
Why Humans Dominate Video
- Biological motion priors: We’ve evolved to detect wonky blinks (too fast, too slow, no lid lag) and lip-sync failures that models still misclassify as compression noise.
- Contextual coherence: A necklace chain that floats two pixels off the collar for a single frame screams “render glitch” to us; the AI writes it off as one more frame of noise.
- The uncanny valley emotion detector: Subtle lack of micro-wrinkles around the eyes when someone smiles is creepy to us, merely a low-variance texture to a CNN.
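The blink-cadence prior in the first bullet is simple enough to write down as a rule. The thresholds below are rough physiological ranges chosen for illustration (blinks of ~100–400 ms, recurring every ~2–10 s), not values from the study:

```python
def blink_looks_human(durations_ms, intervals_ms):
    """Heuristic check that blink timing falls in rough human ranges.
    Thresholds are illustrative assumptions, not study parameters."""
    if not durations_ms:
        return False  # a clip with zero blinks is itself suspicious
    ok_duration = all(100 <= d <= 400 for d in durations_ms)
    ok_interval = all(2000 <= i <= 10000 for i in intervals_ms)
    return ok_duration and ok_interval

print(blink_looks_human([150, 220], [3500]))  # plausible cadence -> True
print(blink_looks_human([40, 40], [500]))     # machine-gun blinks -> False
```

Humans apply this check for free; a frame-by-frame CNN has to learn it from temporal data it was often never trained on.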
The Caveats
- Dataset scope: All fakes were StyleGAN derivatives. Swap in diffusion-model faces and the AI lead could shrink.
- Video length: Clips were capped at one second. Give the networks 5-second sequences with audio and the leaderboard might flip again.
- Human expertise: Participants weren’t trained. A one-minute tutorial on blink cadence pushed human accuracy to 85 % in a follow-up pilot—good luck getting that past an IRB at scale.
Take-away for Builders
If your threat model is passport-style photo fraud, plug in EfficientNet and call it a day. If you’re moderating TikTok, keep a human in the loop—or at least fine-tune on temporal datasets that include real-world compression artifacts, not just pristine GAN output.
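One way to wire in that human-in-the-loop: gate on detector confidence and only escalate the uncertain middle. The threshold values here are placeholders to be tuned on a validation set, not recommendations from the study:

```python
def route(fake_probability: float, hi: float = 0.9, lo: float = 0.1) -> str:
    """Route a detector's fake-probability score to a moderation action.
    Thresholds hi/lo are assumed placeholders; tune them on held-out data."""
    if fake_probability >= hi:
        return "auto-remove"    # confident fake: act without review
    if fake_probability <= lo:
        return "auto-pass"      # confident real: no action needed
    return "human-review"       # uncertain band: escalate to a person

print(route(0.97), route(0.50), route(0.02))
```

This keeps human reviewers focused on exactly the band where, per the results above, they outperform the models: ambiguous moving footage.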
TL;DR
Still fakes: bet on silicon. Moving fakes: bet on squishy neurons—at least until the next training run.