Wan 2.5 + Audio
Video generation, by Alibaba (Tongyi Lab)

Uncensored image-to-video with audio

Audio | 1080p | 5s/10s | Start Frame | Filters: Minimal, Standard
Official Alibaba (Tongyi Lab) page

About Wan 2.5 + Audio

Wan 2.5 + Audio is Alibaba Tongyi Lab's natively multimodal video generation model, launched in September 2025. Unlike previous Wan versions that produced silent video requiring separate audio post-production, Wan 2.5 generates synchronized audio and video in a single pass. This includes dialogue with automatic lip-sync, ambient sound effects matched to on-screen action, background music, and multi-person vocal tracks.

The model produces HD 1080p video at 24 frames per second in 5-second or 10-second durations. Its native audio-visual synchronization means generated content has properly timed sound effects, footsteps, environmental ambience, and music that match the visual scene without manual alignment. This makes it the first Wan model suitable for producing complete, ready-to-publish video content without any audio editing workflow.

Wan 2.5 supports multiple input modes including text-to-video, image-to-video, and audio-guided generation. The multi-shot coherence system maintains character appearance, voice consistency, and scene lighting across sequential clips, enabling episodic content creation. On ZenCreator, Wan 2.5 + Audio runs with minimal content filters, making it a strong choice for creators who need unrestricted image-to-video conversion with built-in audio for social media, marketing content, and storytelling projects.

Technical Specifications

Architecture: Natively multimodal diffusion transformer (audio + video)
Max Resolution: 1080p
Frame Rate: 24 fps
Max Duration: 10 seconds (5-second or 10-second clips)
Audio: Native sync (dialogue, SFX, music, ambient)
Input Modes: Text-to-Video, Image-to-Video, Audio-guided
Keyframe Control: Start frame
Release Date: September 2025
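
As a quick illustration of what these figures imply, the sketch below computes how many frames a clip contains at the listed frame rate. The `ClipSpec` class and its field names are hypothetical and are not part of any Wan or ZenCreator API. Only the values (24 fps, 1080p, 5-second or 10-second clips) come from the specifications above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical container for the output settings listed above."""
    fps: int = 24                          # frame rate from the spec list
    height: int = 1080                     # 1080p output
    allowed_durations_s: tuple = (5, 10)   # the two clip lengths offered

    def frame_count(self, duration_s: int) -> int:
        """Return the number of video frames in a clip of the given length."""
        if duration_s not in self.allowed_durations_s:
            raise ValueError(
                f"duration must be one of {self.allowed_durations_s}, got {duration_s}"
            )
        return self.fps * duration_s


spec = ClipSpec()
for duration in spec.allowed_durations_s:
    print(f"{duration}s clip -> {spec.frame_count(duration)} frames")
```

Under these assumptions a 5-second clip has 120 frames and a 10-second clip has 240, which is the span over which the generated audio has to stay aligned with the picture.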

Best Use Cases

1. Creating complete video-with-audio content in a single generation step
2. Social media videos with synchronized sound effects and music
3. Product demo videos with narration and ambient audio
4. Storytelling clips with dialogue and lip-synced character speech
5. Marketing content that needs background music matched to visuals
6. Episodic video series with consistent character voice and appearance

Frequently Asked Questions

How does Wan 2.5 generate audio with video?

Wan 2.5 is natively multimodal, generating audio and video simultaneously in a single inference pass. It produces dialogue with lip-sync, ambient sounds, background music, and sound effects that are automatically aligned to on-screen motion and scene changes.

Do I need to add audio separately after generating a Wan 2.5 video?

No. Wan 2.5 + Audio produces complete video-with-audio output. Dialogue, sound effects, ambient audio, and music are all generated and synchronized automatically during the same generation step.

What is the difference between Wan 2.5 + Audio and Wan 2.6 + Audio?

Wan 2.5 was the first Wan model with native audio and supports up to 10-second clips at 1080p. Wan 2.6 extends the maximum duration to 15 seconds and improves motion quality, temporal consistency, and text rendering within videos.

Can Wan 2.5 maintain character consistency across multiple clips?

Yes. Wan 2.5 features multi-shot coherence that maintains character appearance, voice consistency, and scene lighting across sequential clip generations, making it suitable for episodic content.

Try Wan 2.5 + Audio

Available on ZenCreator — no setup, no API keys.

Wan 2.5 + Audio is developed by Alibaba (Tongyi Lab). ZenCreator provides access to Wan 2.5 + Audio through its platform.