VIDEO GENERATION by Alibaba (Tongyi Lab)

Wan 2.5 + Audio

Uncensored image-to-video with audio

Audio | 1080p | 5s/10s | Start Frame | Filters: Minimal, Standard

About Wan 2.5 + Audio

Wan 2.5 + Audio is Alibaba Tongyi Lab's natively multimodal video generation model, launched in September 2025. Unlike previous Wan versions that produced silent video requiring separate audio post-production, Wan 2.5 generates synchronized audio and video in a single pass. This includes dialogue with automatic lip-sync, ambient sound effects matched to on-screen action, background music, and multi-person vocal tracks.

The model produces 1080p HD video at 24 frames per second in 5-second or 10-second clips. Because audio and video are generated together, sound effects, footsteps, environmental ambience, and music are timed to the visual scene without manual alignment. This makes it the first Wan model that can produce complete video with audio without a separate audio editing step.
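As a quick illustration of what those numbers mean in practice, the sketch below works out how many video frames each clip length contains at the stated frame rate. The function name `frame_count` is made up for this example and is not part of any Wan or ZenCreator tooling.

```python
FPS = 24                # frame rate stated on this page
DURATIONS_S = (5, 10)   # clip lengths stated on this page


def frame_count(duration_s: int, fps: int = FPS) -> int:
    """Return the number of video frames in a clip of the given length."""
    return duration_s * fps


for duration in DURATIONS_S:
    print(f"{duration}s clip -> {frame_count(duration)} frames")
# 5s clip -> 120 frames
# 10s clip -> 240 frames
```

So a 10-second clip is 240 frames that the generated audio track has to stay aligned with.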

Wan 2.5 supports multiple input modes including text-to-video, image-to-video, and audio-guided generation. The multi-shot coherence system maintains character appearance, voice consistency, and scene lighting across sequential clips, enabling episodic content creation. On ZenCreator, Wan 2.5 + Audio runs with minimal content filters, making it a strong choice for creators who need unrestricted image-to-video conversion with built-in audio for social media, marketing content, and storytelling projects.

Technical Specifications

Architecture: Natively multimodal diffusion transformer (audio + video)

Max Resolution: 1080p

Frame Rate: 24 fps

Max Duration: 5-10 seconds

Audio: Native sync (dialogue, SFX, music, ambient)

Input Modes: Text-to-Video, Image-to-Video, Audio-guided

Keyframe Control: Start frame

Release Date: September 2025
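For anyone planning a batch of clips, the specifications above can be written down as a small checklist in code. This is only a sketch: `GenerationSettings` and its fields are hypothetical names invented for illustration, not a ZenCreator or Alibaba API, and the rule that image-to-video needs a start frame is an assumption based on the "Start frame" keyframe control listed above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values taken from the specification list on this page.
ALLOWED_DURATIONS_S = (5, 10)
ALLOWED_MODES = ("text-to-video", "image-to-video", "audio-guided")


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical settings record; not an official API object."""
    mode: str
    duration_s: int
    start_frame: Optional[str] = None  # path to a start-frame image, if any

    def problems(self) -> List[str]:
        """List every way these settings fall outside the published specs."""
        issues = []
        if self.mode not in ALLOWED_MODES:
            issues.append(f"unsupported input mode: {self.mode}")
        if self.duration_s not in ALLOWED_DURATIONS_S:
            issues.append(f"unsupported duration: {self.duration_s}s")
        # Assumption: image-to-video is driven by a start-frame image.
        if self.mode == "image-to-video" and self.start_frame is None:
            issues.append("image-to-video needs a start frame")
        return issues


ok = GenerationSettings(mode="text-to-video", duration_s=10)
bad = GenerationSettings(mode="image-to-video", duration_s=15)
print(ok.problems())   # []
print(bad.problems())  # two issues: duration and missing start frame
```

Returning a list of problems rather than raising on the first one makes it easier to review a whole plan of clips at once.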

Best Use Cases

Creating complete video-with-audio content in a single generation step

Social media videos with synchronized sound effects and music

Product demo videos with narration and ambient audio

Storytelling clips with dialogue and lip-synced character speech

Marketing content that needs background music matched to visuals

Episodic video series with consistent character voice and appearance

Available In

Video Generator

Frequently Asked Questions

How does Wan 2.5 generate audio with video?

Wan 2.5 is natively multimodal, generating audio and video simultaneously in a single inference pass. It produces dialogue with lip-sync, ambient sounds, background music, and sound effects that are automatically aligned to on-screen motion and scene changes.

Do I need to add audio separately after generating a Wan 2.5 video?

No. Wan 2.5 + Audio produces complete video-with-audio output. Dialogue, sound effects, ambient audio, and music are all generated and synchronized automatically during the same generation step.

What is the difference between Wan 2.5 + Audio and Wan 2.6 + Audio?

Wan 2.5 was the first Wan model with native audio and supports up to 10-second clips at 1080p. Wan 2.6 extends the maximum duration to 15 seconds and improves motion quality, temporal consistency, and text rendering within videos.

Can Wan 2.5 maintain character consistency across multiple clips?

Yes. Wan 2.5 features multi-shot coherence that maintains character appearance, voice consistency, and scene lighting across sequential clip generations, making it suitable for episodic content.

Related Guide

Read the full Wan 2.5 + Audio guide on AI University

Try Wan 2.5 + Audio

Available on ZenCreator — no setup, no API keys.

Wan 2.5 + Audio is developed by Alibaba (Tongyi Lab). ZenCreator provides access to Wan 2.5 + Audio through its platform.