The multimodal model accepts text, image, audio and video inputs and generates clips of up to 15 seconds with improved motion stability and creative control. It is now rolling out to paid users in seven countries, while safety guardrails limit the use of real faces and intellectual property.