Still Images: AI Wins. Add Motion: Humans Take the Deepfake Crown
University of Florida researchers just served up a reality check for anyone who thinks the deepfake-detection war is over. Their verdict: AI crushes humans on still photos (97% accuracy versus coin-flip randomness), but the moment those synthetic faces start talking or blinking, people claw back the advantage. Benchmarks first, marketing fluff never.

The Setup

  • 3,000+ deepfake portraits generated with StyleGAN2 and its progeny
  • Same identities rendered in 1-second video clips (eye blinks, lip twitches, micro-expressions)
  • Four off-the-shelf detection models: Xception, EfficientNet-B4, MesoNet, and the Facebook-built “Deepfake Detection Challenge” champ
  • 350 human volunteers recruited via Prolific, balanced for age, gender, and prior exposure to deepfakes

Results in One Table

| Medium       | Best AI Accuracy        | Human Accuracy | p-value |
|--------------|-------------------------|----------------|---------|
| Still images | 97.1% (EfficientNet-B4) | 48.9%          | <0.001  |
| 1-s video    | 72.4% (Xception)        | 78.6%          | 0.02    |
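For intuition on where p-values like these come from, here is a two-proportion z-test on the video-condition accuracies. The sample sizes (`n = 1000` trials per condition) are an illustrative assumption, not the study's actual counts, so the resulting p-value will not match the paper's 0.02 exactly.

```python
# Hypothetical significance check: two-proportion z-test for
# H0: AI accuracy == human accuracy on 1-second clips.
# Sample sizes are assumed for illustration only.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    """Return (z statistic, two-sided p-value) for H0: p1 == p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Human 78.6% vs. Xception 72.4%, assumed 1000 trials each
z, p = two_proportion_z(0.786, 1000, 0.724, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With larger assumed samples the same 6.2-point gap becomes more significant, which is why the reported p-value depends heavily on trial counts the table doesn't show.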

Why AI Dominates Stills

  1. Pixel-perfect artifacts: GANs leave high-frequency fingerprints—checkerboard patterns in teeth, asymmetric glints in corneas—that convolutional nets gobble up.
  2. No temporal noise: A single frame doesn’t wobble; the model can stare as long as it likes.
  3. Training abundance: Millions of labeled fake faces are free for the scraping. Humans, meanwhile, rarely stare at 4K synthetic eyeballs for sport.
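The "high-frequency fingerprint" cue in point 1 can be sketched numerically: GAN upsampling tends to leave excess energy in the high end of an image's spectrum. The snippet below measures the share of FFT power outside a central low-frequency disc; the 0.25 radius fraction and the random-noise stand-in image are illustrative assumptions, not the detectors' actual features.

```python
# Minimal sketch of the spectral-artifact idea: compute the fraction
# of 2D-FFT energy outside a low-frequency disc. Real detectors learn
# such cues implicitly; this is a hand-rolled illustration.
import numpy as np

def high_freq_ratio(gray_img, radius_frac=0.25):
    """Fraction of spectral power outside a central low-frequency disc."""
    f = np.fft.fftshift(np.fft.fft2(gray_img))
    power = np.abs(f) ** 2
    h, w = gray_img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    low = dist <= radius_frac * min(h, w) / 2
    return power[~low].sum() / power.sum()

rng = np.random.default_rng(0)
img = rng.random((64, 64))  # stand-in for a grayscale face crop
print(f"high-frequency energy share: {high_freq_ratio(img):.3f}")
```

A genuine pipeline would compare this statistic between camera photos and GAN output; checkerboard upsampling artifacts show up as a bump in exactly this high-frequency band.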

Why Humans Dominate Video

  1. Biological motion priors: We’ve evolved to detect wonky blinks (too fast, too slow, no lid lag) and lip-sync failures that models still misclassify as compression noise.
  2. Contextual coherence: A necklace chain that floats two pixels off the collar for a single frame screams “render glitch” to us; the AI treats it as a stray adversarial gradient.
  3. The uncanny valley emotion detector: Subtle lack of micro-wrinkles around the eyes when someone smiles is creepy to us, merely a low-variance texture to a CNN.
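The blink-cadence prior in point 1 reduces to a timing check that is easy to caricature in code. Given a per-frame eye-openness signal (say, an eye aspect ratio from a landmark detector), flag closed-eye runs that are implausibly short or long. The thresholds (under 2 or over 15 frames at 30 fps) and the toy signal are assumptions for illustration, not the study's protocol.

```python
# Toy blink-cadence check: find runs of closed-eye frames and flag
# durations outside a plausible human range. Thresholds are
# illustrative assumptions, not validated physiology.
def suspicious_blinks(openness, closed_thresh=0.2, min_frames=2, max_frames=15):
    """Return durations (in frames) of closed-eye runs outside human range."""
    runs, current = [], 0
    for value in openness:
        if value < closed_thresh:
            current += 1          # eye closed this frame
        elif current:
            runs.append(current)  # blink just ended
            current = 0
    if current:
        runs.append(current)      # signal ended mid-blink
    return [d for d in runs if d < min_frames or d > max_frames]

# 1-frame blink (too fast), then a normal 6-frame blink
signal = [0.3] * 10 + [0.1] * 1 + [0.3] * 10 + [0.1] * 6 + [0.3] * 5
print(suspicious_blinks(signal))
```

Humans run this kind of check effortlessly; the article's point is that current CNN detectors often write the same anomaly off as compression noise.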

The Caveats

  • Dataset scope: All fakes were StyleGAN derivatives. Swap in diffusion-model faces and the AI lead could shrink.
  • Video length: Clips were capped at one second. Give the networks 5-second sequences with audio and the leaderboard might flip again.
  • Human expertise: Participants weren’t trained. A one-minute tutorial on blink cadence pushed human accuracy to 85 % in a follow-up pilot—good luck getting that past an IRB at scale.

Take-away for Builders

If your threat model is passport-style photo fraud, plug in EfficientNet and call it a day. If you’re moderating TikTok, keep a human in the loop—or at least fine-tune on temporal datasets that include real-world compression artifacts, not just pristine GAN output.

TL;DR

Still fakes: bet on silicon. Moving fakes: bet on squishy neurons—at least until the next training run.
