Pixverse 5.6 now speaks: The new multimodal feature challenges Veo 3.1 and Sora 2
Pixverse now has integrated audio and lip-syncing. Here is a first look at the results.
The landscape of AI video keeps shifting from “silent films” to full-scale cinematic production.
With the release of Pixverse 5.6, the startup introduces a feature that has long been a holy grail for creators: the ability to generate synchronized sound and realistic lip-sync directly from the initial text prompt.
With this move, Pixverse is positioning itself as a realistic competitor to heavyweights like Google’s Veo 3.1 and OpenAI’s Sora 2.
Great. But is it good?
Glad you asked. To see if it lives up to the hype, I put it to a quick test.
For the challenge, I’m using Midjourney-generated character references. I’m currently in the mood for Saiyan culture, so bear with me.
Here is a breakdown of two distinct scenarios testing emotional range, camera movement, and environmental sound.
Test 1: Dialogue and 360 View
In this first test, I wanted to see how the model handled a “cool” character archetype contrasted with a chaotic environment.
Midjourney Prompt:
Photography, Female Saiyan with Saiyan hairstyle, character concept, cinematic, --chaos 10 --ar 3:4 --exp 10 --stylize 1000
Using this image as a reference, I then used this prompt:
The Pixverse Prompt: Static camera, woman with saiyan hairstyle says with ironic calm “Things are going to get weird. Be ready”. The camera looks away 360, to the scene around her, there are people running scared screaming “Fire!”, cars and buildings on fire, skyscrapers collapsing.
This is the result:
The result: The “ironic calm” requested in the prompt translated well into the vocal delivery. It wasn’t just a generic female voice, but one that carried the weight of the character’s personality. The lip-syncing was tight, matching the cadence of the warning perfectly.
What stood out most was the environmental transition. As the camera panned away from the character, the soundscape shifted from the focused dialogue to the diegetic noise of the city. The screams of “Fire!” were not clearly audible, but a low-frequency rumble of collapsing buildings was generated in sync with the visual destruction, creating a cool cinematic effect.
Let’s try another one.
Test 2: Dialogue and Zoom Out
For the second test, I tested the model’s ability to handle complex physics, specifically flight, combined with a dramatic camera move and strong emotion.
Midjourney Prompt:
Photography, Female Saiyan with Saiyan hairstyle, character concept, focused, dramatic background, cinematic, --chaos 10 --ar 16:9 --exp 10 --stylize 1000
The Pixverse Prompt: Slow zoom out while a saiyan woman says with controlled fury “Show me what you can do! I’m ready”. She has Saiyan hairstyle. The camera zooms out to reveal her full body, standing suspended in mid-air, wind moving her clothes, arms to her side, fists ready, feet pointed to the ground, below her cars and buildings are on fire.
The Result: As the camera zooms out, the audio maintains the character’s voice as the primary focus, while the roar of the fires below begins to swell. The “controlled fury” was reflected in a slightly raspy, intense vocal tone.
Visually, the model managed the “suspended in mid-air” physics remarkably well. AI video often struggles with floating characters, which can look like they are simply “pasted” on a background, but Pixverse 5.6 does a good job integrating the movement of the wind in her clothes with the light from the fires below. The sync between the dialogue and the widening field of view felt intentional. On the downside, the final frames have a cartoonish look that didn’t go away even after I upscaled the clip.
What do you think?
Pixverse 5.6’s native audio held up well in both tests, with convincing vocal delivery and tight lip-sync, and it is a good reason for creators to give this tool a try.
The competition is heating up, and for creators, this means the tools are finally catching up to our imaginations. One by one, models are adding native audio, and it’s becoming a baseline requirement.
That’s all for today. Have fun, touch grass, and enjoy coffee.
Cheers.





