Veo 3.1 just dropped. Did Google just take the AI Video crown from Sora 2?
The AI Video Heavyweights: Sora 2 vs. Veo 3.1
For a minute, it looked like OpenAI had locked up the crown. Sora 2’s launch in October set a new standard for realism and synchronized audio, seemingly leaving competitors in the dust.
Then, Google threw its counter-punch.
Meet Veo 3.1, the update that just dropped and aims to snatch the crown right back. I’ve been digging into the new model, and this isn’t just a minor update; it’s a direct challenge to Sora’s dominance.
The two top dogs in AI video are now Sora 2 and Veo 3.1. So, how do they really stack up in a head-to-head test? Let’s break down the new features.
Meet the contenders
Veo 3.1 is Google’s direct upgrade to Veo 3, built to enhance creative control, realism, and (most importantly) audio-visual sync.
Sora 2 is OpenAI’s second-gen model, prized for its realism, physics, and ability to generate video with synchronized audio from text, images, or even other video prompts.
Head to Head
The baseline for both models is already incredibly high.
Both Veo 3.1 and Sora 2 can handle Multi-Shot Generation (letting you specify cuts and new shots at timestamps within a single prompt) and Synchronized Audio (generating sound effects, music, and dialogue that match the action). Seedance 1.0 also competes on multi-shot, but only Veo and Sora have mastered synced dialogue.
This is where things get interesting. Veo 3.1 introduces three new features that are designed to give creators the granular control we’ve been asking for.
1. The Game-Changer: Ingredients to Video
This is Veo’s biggest new weapon. You can now provide multiple reference images (characters, specific items, outfits, props), and Veo 3.1 will integrate them all into a single, cohesive, fully formed scene with sound.
This allows for unprecedented style and character consistency. You’re no longer just rolling the dice on the model’s interpretation; you’re directing it.
Sora 2: With its focus on social media, it puts a different spin on this feature: you can upload a video and prompt changes to it.
Other Models: Kling 1.6 has a similar “ingredient” or “material” feature, but Veo’s integration with synchronized audio puts it in a class of its own.
2. Narrative Control: First and Last Frame
This is a powerful tool for narrative structure. You can give Veo 3.1 the first frame and the last frame of your desired shot, and the model will generate the entire video in between, creating a seamless and often epic transition. It gives you precise control over your narrative’s start and end points.
Sora 2: Lacks this specific “A-to-B” functionality. You can provide an initial image to start a video, but you can’t define the end point in the same generation.
Other Models: Most video tools can do this A-to-B image-to-video generation, but Veo’s ability to lock in both ends and sync dialogue makes it a significant directorial tool.
3. Workflow Power: The Extend Feature
Veo 3.1 now has a native feature to seamlessly expand generated clips beyond the initial 8 seconds, creating continuous shots of a minute or more. Each new segment is generated based on the final second of the previous clip, ensuring strong consistency for both the background and any characters in the scene.
Sora 2: The core model doesn’t technically have this as a named feature. However, the platforms it’s integrated with (like Freepik and Higgsfield) almost certainly provide their own extension and outpainting workflows, and it does generate longer base clips of up to 12 seconds (which helps with multi-shot prompts). Veo 3.1’s advantage is having this as a native model capability, which often leads to better consistency.
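To make the mechanism concrete, here’s a minimal sketch of that chaining logic in plain Python. This is not Veo’s API: `generate_segment` is a hypothetical stand-in for the model call, and clips are modeled as simple lists of frame labels at 24 fps. The point is just how each new segment is seeded from the final second of the previous one, then appended without the overlap.

```python
FPS = 24  # assume 24 frames per second for this sketch

def generate_segment(seed_frames, length_s=8):
    """Stand-in for the model call: returns a clip of length_s seconds
    that begins with the seed frames, so the transition stays continuous."""
    new_frames = [f"new_{i}" for i in range(length_s * FPS - len(seed_frames))]
    return list(seed_frames) + new_frames

def extend(clip, times=1, length_s=8):
    """Chain segments: each one is conditioned on the final second of the
    clip so far, then appended minus the overlapping seed frames."""
    for _ in range(times):
        seed = clip[-FPS:]                  # final second of the current clip
        segment = generate_segment(seed, length_s)
        clip = clip + segment[len(seed):]   # drop the overlap when appending
    return clip

base = [f"frame_{i}" for i in range(8 * FPS)]  # initial 8-second clip
longer = extend(base, times=6)                 # chain six more segments
print(len(longer) / FPS)                       # prints 50.0 (seconds)
```

Each extension adds 7 net seconds here (an 8-second segment minus the 1-second seed), which is how short base clips compound into continuous shots of a minute or more.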
The final verdict: Who’s winning?
Synchronized dialogue is the current ceiling for AI video, and for a brief moment, Sora 2 was the only model to truly nail it.
With Veo 3.1, Google has not only matched Sora 2 on audio quality but has arguably pulled ahead in what matters most to creators: control. The Ingredients and First/Last Frame features are practical, powerful tools for directors looking to execute a specific vision rather than just discover a happy accident.
Sora 2 is a great model, but Google’s focus on a professional workflow makes Veo 3.1 the new model to beat.
The race is on, and it’s getting faster.
I dunno. They both do a pretty crappy job following instructions in my experience. Just a few minutes ago I tried two different prompts for the same scene in Gemini video. It was supposed to be a basset hound chasing a rabbit. The first one had the rabbit chasing the dog, and the dog's ears kept morphing between dog ears and rabbit ears. The second video just created two dogs.