Google Veo 3's killer? Not quite. But Wan 2.5 turns up the heat
I put Alibaba's new text-to-video model to the test. Its secret weapon? Perfectly synchronized audio and lip-sync that could change the game.
For what feels like an eternity in AI time, creating AI video with convincing, synchronized audio has been the final frontier. We’ve seen incredible visuals, but getting characters to speak naturally, without a separate, clunky lip-syncing step, has remained a major hurdle.
Google’s Veo 3 has been the undisputed king in this area, but a powerful new challenger has entered the ring.
Meet Wan 2.5, an advanced AI video model from Alibaba’s Wan AI team. While it handles text-to-video and image-to-video like other models, its true power lies in its native audio generation. It doesn’t just create a silent film; it generates synchronized sound effects, voices, music, and ambient noise directly with the video.
But does it deliver? I decided to put it to the test, using the Higgsfield platform to access Wan 2.5. For my experiments, I started with a static image from Midjourney and then gave Wan 2.5 a prompt to bring it to life with motion and sound.
Let’s dive in.
Test 1: The Cyber Viking
First, I wanted to see how it handled a detailed, futuristic character with a specific line of dialogue.
Source Image Prompt (Midjourney):
Photography, Dope concept of cyber Viking, optical illusion, oni aesthetics, spacepunk, long exposure volumetric leaks, Future surreal vibes --chaos 10 --ar 3:4 --exp 25 --sref 2670961653 --p --sw 300 --stylize 1000 --weird 10
Video & Audio Prompt (Wan 2.5):
Dramatic, realistic close-up of a man wearing intricate cybernetic glasses fused with metallic horns curling from his head, his skin smooth and detailed. He slowly smiles with deep confidence, revealing chromed teeth that glint under the soft, dramatic lighting. As the camera pulls back, the man leans slightly forward and speaks with a calm, compelling voice, “Want some Love?”
The Verdict: The result was impressive. The lip sync on “Want some Love?” was flawless, and the generated voice was calm and compelling, just as requested. The overall motion was minimal, but then the prompt didn’t ask for much action (my fault). The core task, bringing a character to life with convincing speech, was a clear success.
Test 2: The Anime Heroine
Next, I wanted to try a more dynamic scene with a declarative statement and a camera cut, both staples of cinematic trailers.
Source Image Prompt (Midjourney):
Hyper realistic, full body Genshin Impact female character, crafted with Unreal Engine 5 rendering, boasting photorealistic material accuracy, ray traced lighting, and cinematic post processing for lifelike depth. Exquisitely rendered hair... --ar 16:9 --quality 2 --raw
Video & Audio Prompt (Wan 2.5):
Female hero declares with eerie composure, “Your reign has come to an end.” As she speaks, the camera focuses on a surprised old king in his throne.
The Verdict: Once again, Wan 2.5 nailed the essentials. The voice acting had the “eerie composure” I prompted for, and the lip sync was perfect. Even more impressively, it successfully interpreted the camera change, cutting to a dynamic shot of the surprised king on his throne. This shows a deeper understanding of cinematic language beyond just animating a single subject.
Why Wan 2.5 matters
My belief is that by 2026, integrated audio will be a standard feature in all major AI video tools, and we’ll look back on manual lip-syncing as a primitive step. Wan 2.5 is accelerating that future, and its importance can’t be overstated.
How to try Wan 2.5 yourself
Ready to experiment? You can access Wan 2.5 through a growing number of third-party platforms, many of which offer free trials or credits to get you started: Higgsfield, Pollo AI, Fal.ai, and Krea.ai.
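If you’d rather script your experiments than click through a web UI, Fal.ai also exposes its models through a Python client. Here’s a minimal sketch of what an image-to-video call might look like; the endpoint id, parameter names, and response shape are assumptions based on fal’s usual conventions, so check the Wan 2.5 model page for the exact schema.

```python
# Minimal sketch: calling Wan 2.5 through fal.ai's Python client.
# Setup: pip install fal-client, then export FAL_KEY=<your API key>.
import fal_client

# Upload the Midjourney still so the API can reference it by URL.
image_url = fal_client.upload_file("cyber_viking.png")

result = fal_client.subscribe(
    "fal-ai/wan-25-preview/image-to-video",  # assumed endpoint id
    arguments={
        "image_url": image_url,  # assumed parameter name
        "prompt": (
            "Dramatic, realistic close-up of a cyber Viking. He slowly "
            "smiles, leans forward, and speaks with a calm, compelling "
            'voice: "Want some Love?"'
        ),
    },
)

# Assumed response shape: a dict containing a hosted video URL.
print(result["video"]["url"])
```

The same pattern should carry over to pure text-to-video: swap in the text-to-video endpoint and drop the image_url argument.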