ViMax is an agentic video generation framework aimed at long narratives, multiple shots, and visual consistency. Its goal is to combine the roles of director, screenwriter, producer, and video generator in one multi-agent video creation system. It is not a single video generation model. Instead, it automates the orchestration of script understanding, storyboard design, reference image selection, character/environment consistency checks, image/video generation, and final assembly around the Idea2Video, Novel2Video, Script2Video, and AutoCameo workflows. For pre-sales, it fits solution discussions about taking AI video from short clips to complete stories, marketing films, or novel adaptations. At present it is better suited to PoC and research validation, and should not be committed directly to a production-grade content platform.

1. Project Overview

| Dimension | Information |
| --- | --- |
| Project | HKUDS/ViMax |
| Official Title | ViMax: Agentic Video Generation |
| Official Description | Director, Screenwriter, Producer, and Video Generator All-in-One |
| Paper | arXiv:2606.07649, ViMax: Agentic Video Generation |
| Core Issues | Existing video generation tools can only generate short clips and lack narrative structure, character consistency, scene continuity, and audio-visual coordination |
| Core Capabilities | Multi-agent long video generation, RAG-based long script design, storyboard/shot planning, reference image selection, consistency checks, parallel multi-shot generation |
| Main Workflows | Idea2Video, Novel2Video, Script2Video, AutoCameo |
| Interactive Mode | Agent Loop TUI |
| Environment | Python 3.12, managed with uv; Linux/Windows |
| Main Dependencies | LangChain, OpenAI SDK, Google GenAI, FAISS, MoviePy, OpenCV, PySceneDetect, Pillow, etc. |
| License | MIT |
| Information Checked On | 2026-06-30 |

The paper's abstract states the problem clearly: long video generation requires systematic narrative planning and visual consistency, whereas current short-clip generation methods usually produce isolated sequences and lack a mechanism for character/environment consistency across scenes. ViMax uses multi-agent collaboration to coordinate narrative decisions, visual continuity, and production quality.

2. Project Diagrams and Demos

2.1 Project Main Diagram

[Image: ViMax]

2.2 Official Video Demos in the README

The README embeds multiple videos hosted on github.com/user-attachments, covering clips on different subjects generated from scratch, such as underwater scenes, animals, a cameo in a sky city, and a cat cameo. Obsidian does not always render remote video reliably, so it is recommended to open the GitHub README page and play the videos there during a demonstration:

-ViMax README - Video Demos

-YouTube Channel

2.3 Multi-agent Video Generation Pipeline

The schema in README is in the form of an HTML table, and the core can be organized as follows:

flowchart TD
    A["Input Layer<br/>Idea / Script / Novel / Prompt / Reference Images / Style / Config"] --> B["Central Orchestration<br/>Agent Scheduling / Stage Transitions / Resource Management / Retry"]
    B --> C["Script Understanding<br/>Character and Environment Extraction / Scene Boundaries / Style Intent"]
    B --> D["Scene and Shot Planning<br/>Storyboard / Shot List / Key Frames / Beats"]
    C --> E["Visual Asset Planning<br/>Reference Image Selection / Style Guidance / Prompt Conditioning"]
    D --> E
    E --> F["Asset Indexing<br/>Frames & Refs Catalog / Embeddings / Retrieval"]
    E --> G["Consistency and Continuity<br/>Character Tracking / Environment Tracking / Ref Matching / Temporal Coherence"]
    F --> H["Visual Synthesis and Assembly<br/>Image Generation / Best Frame Selection / First-Last Frame to Video / Timeline Assembly"]
    G --> H
    H --> I["Output Layer<br/>Frames / Clips / Final Videos / Logs / Working Directory Artifacts"]
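
The stage dependencies in the diagram above can be written down as a small graph and sorted into a valid execution order. This is only an illustrative sketch using Python's standard graphlib module. The stage identifiers are hypothetical shorthand for the boxes in the diagram, not names taken from the ViMax code base, and the real scheduler may order or parallelize stages differently.

```python
from graphlib import TopologicalSorter

# Each key maps a stage to the set of stages it depends on (edges from the diagram).
# Stage names are hypothetical shorthand, not ViMax identifiers.
STAGE_DEPENDENCIES = {
    "input": set(),
    "orchestration": {"input"},
    "script_understanding": {"orchestration"},
    "shot_planning": {"orchestration"},
    "asset_planning": {"script_understanding", "shot_planning"},
    "asset_indexing": {"asset_planning"},
    "consistency": {"asset_planning"},
    "synthesis_assembly": {"asset_indexing", "consistency"},
    "output": {"synthesis_assembly"},
}


def execution_order(dependencies):
    """Return one valid order in which the stages can run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(STAGE_DEPENDENCIES)
print(" -> ".join(order))
```

Stages with no dependency on each other, such as script understanding and shot planning, can appear in either order, which is where a scheduler would have room to run them concurrently.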

3. What It Can Do

3.1 Generating Video from an Idea: Idea2Video

Idea2Video takes a sentence or a paragraph of creative input. The user describes an idea, target audience, and style, and ViMax automatically completes:

-Idea understanding

-Story structure generation

-Character design

-Scene division

-Storyboard planning

-Image generation prompts

-Video generation calls

-Clip assembly

Example:

idea = """
If a cat and a dog are best friends, what would happen when they meet a new cat?
"""
user_requirement = """
For children, do not exceed 3 scenes.
"""
style = "Cartoon"

Pre-sales takeaway: this suits low-cost sample generation for marketing short film concept validation, children's educational stories, brand mini-dramas, and product creative films.

3.2 Adapting Novels and Long Texts into Video: Novel2Video

Novel2Video is what makes ViMax more interesting than ordinary video generation tools. Instead of just turning a single prompt into a few seconds of video, it attempts to compress a complete novel or long narrative into diverse video content.

It will handle:

-Long text understanding

-Narrative compression

-Key plot retention

-Character tracking

-Scene segmentation

-Shot-level visual adaptation

Both the paper and the README emphasize a RAG-based long script design engine: it analyzes novel-length texts through RAG and automatically divides them into a multi-scene script format, keeping key plot points and character dialogue as far as possible.
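
To make the retrieval idea concrete, the following is a minimal sketch of chunking a long text and pulling back the chunks most relevant to a query. It is not ViMax's implementation: the function names are hypothetical, and a bag-of-words cosine similarity stands in for a real embedding model and vector index such as FAISS.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, max_words=40):
    """Split long text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(chunks, query, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(bag_of_words(c), query_vec), reverse=True)
    return ranked[:top_k]


# Made-up sample text for illustration only.
novel = (
    "Mira found the brass key under the old oak tree. "
    "The village market was loud and crowded that morning. "
    "At night Mira used the brass key to open the tower door."
)
chunks = split_into_chunks(novel, max_words=11)
top = retrieve(chunks, "Mira brass key", top_k=2)
print(top)
```

In a scene-writing step, the retrieved chunks would be passed to the LLM as context so that each scene is written against the relevant passages rather than the whole novel.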

Pre-sales value:

-Automatic adaptation of web fiction, short dramas, and educational stories.

-Quick visualization of IP content.

-Turning long-form text into sample videos.

-Helping the content team reduce the cost of up-front storyboarding and script adaptation.

3.3 Generating Video from a Script: Script2Video

Script2Video is for clients who already have a script. It lets users provide a screenplay and set requirements for style, pacing, number of shots, and so on.

The sample input is similar to a movie script format:

EXT. SCHOOL GYM - DAY
A group of students are practicing basketball in the gym...
John: I'm going to score a basket!
Jane: Good job, John!

At the same time, users can add requirements:

Fast-paced with no more than 20 shots.
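
As an illustration of how such a screenplay could be broken into structured scenes before shot planning, here is a small parser sketch. It assumes only the conventions visible in the sample above (an INT./EXT. heading with a location and time of day, and dialogue written as "Name: line"). The Scene class and parse_script function are hypothetical and are not part of ViMax, which relies on an LLM for script understanding.

```python
import re
from dataclasses import dataclass, field

# Heading such as "EXT. SCHOOL GYM - DAY"; dialogue such as "John: Hello".
SCENE_HEADING = re.compile(r"^(INT|EXT)\.\s+(?P<location>.+?)\s+-\s+(?P<time>\w+)$")
DIALOGUE = re.compile(r"^(?P<speaker>[A-Z][\w .']*):\s+(?P<line>.+)$")


@dataclass
class Scene:
    location: str
    time_of_day: str
    action: list = field(default_factory=list)    # narrative/action lines
    dialogue: list = field(default_factory=list)  # (speaker, line) tuples


def parse_script(text):
    """Split a screenplay into scenes with action lines and dialogue."""
    scenes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = SCENE_HEADING.match(line)
        if heading:
            scenes.append(Scene(heading["location"], heading["time"]))
            continue
        if not scenes:
            continue  # ignore text before the first scene heading
        spoken = DIALOGUE.match(line)
        if spoken:
            scenes[-1].dialogue.append((spoken["speaker"], spoken["line"]))
        else:
            scenes[-1].action.append(line)
    return scenes


sample = """EXT. SCHOOL GYM - DAY
A group of students are practicing basketball in the gym...
John: I'm going to score a basket!
Jane: Good job, John!
"""
scenes = parse_script(sample)
print(scenes[0].location, scenes[0].dialogue)
```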

Pre-sales value:

-When the customer already has a script or copy, visual samples can be quickly generated.

-Suitable for advertising scripts, training scripts, skit scripts, brand stories.

-Lets customers preview narrative pacing and the feel of the shots before shooting.

3.4 Generating Cameo Videos from Photos: AutoCameo

The AutoCameo concept is to upload a photo of yourself or a pet and have that character appear in a creative script and cinematic footage. It is essentially aimed at personalized interactive video:

-Individual cameos

-Pet Cameos

-Brand spokesperson/digital avatar

-Fan interactive content

Pre-sales can position this for entertainment marketing, event promotion, educational interaction, and personalized short dramas. Customers must also be reminded that portrait rights, authorization, face swapping/identity abuse, and content compliance need careful review.

3.5 Multi-Agent Production Process

ViMax's focus is not a single model but a multi-agent production process:

| Phase | What It Does | Why It Matters |
| --- | --- | --- |
| Script Understanding | Extracts characters, environments, scene boundaries, and style intent | Lets the system know "who, where, and what" in the story |
| Scene & Shot Planning | Generates the storyboard, shot list, key frames, and beats | Turns text narrative into shot language |
| Visual Asset Planning | Selects reference images and generates visual prompts | Improves character/environment consistency |
| Asset Indexing | Maintains frames, reference images, embeddings, and reusable footage | Reuses visual assets across scenes in long videos |
| Consistency & Continuity | Tracks character, environment, and temporal continuity | Addresses the common AI video problem of characters changing and scenes becoming disordered |
| Visual Synthesis & Assembly | Image generation, best-frame selection, first/last-frame-to-video, timeline assembly | Takes the material through to the final video |

3.6 Consistency Check with VLM/MLLM

The README mentions that ViMax generates multiple images in parallel and uses an MLLM/VLM to select the most consistent one as the first frame of the video, mimicking the workflow of a human creator.

This is critical for long videos, because the following are hard to maintain through prompts alone:

-The same character keeps the same appearance.

-The environment stays consistent within a scene.

-Multiple characters are positioned plausibly in space.

-Consecutive shots do not clash with each other.

Pre-sales talking point:

ViMax's idea is not to have the model generate a complete long video in one pass, but to split the long video into a production line of shots, reference images, and consistency checks, reducing the risk of a long video going out of control.
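
The "generate several candidates, then keep the most consistent one" step can be sketched as a best-of-N selection. This is a hedged illustration, not ViMax's code: Candidate, select_first_frame, and toy_consistency_score are hypothetical names, and the toy scoring function stands in for an actual VLM/MLLM call that would compare each image against the reference images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    image_id: str
    description: str  # stand-in for the image itself


def select_first_frame(candidates, score_fn, min_score=0.0):
    """Return the highest-scoring candidate, or None if none reaches min_score."""
    if not candidates:
        return None
    best = max(candidates, key=score_fn)
    return best if score_fn(best) >= min_score else None


def toy_consistency_score(candidate, reference_traits=frozenset({"orange", "cat", "scarf"})):
    """Fraction of reference traits found in the description (placeholder for a VLM judgment)."""
    words = set(candidate.description.lower().split())
    return len(words & reference_traits) / len(reference_traits)


candidates = [
    Candidate("a", "grey cat with red scarf"),
    Candidate("b", "orange cat with blue scarf"),
    Candidate("c", "orange dog"),
]
chosen = select_first_frame(candidates, toy_consistency_score, min_score=0.5)
print(chosen.image_id)
```

Returning None when no candidate clears the threshold gives the caller a natural point to trigger a retry, which is also where generation cost accumulates.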

4. Applicable Scenarios

4.1 Short Drama / Web Fiction / Novel IP Visualization

Customers:

-Short drama companies

-Web fiction platforms

-IP operations teams

-MCNs/content studios

Pain points:

-Turning a novel into a short drama requires investment in screenwriting, filming, directing, and up-front art work.

-It is hard to see the visual result quickly before shooting.

-There are many IPs, but trial and error is expensive.

Where ViMax fits:

-Novel2Video handles plot compression and storyboarding.

-Script2Video produces samples from a specified script.

-Consistency mechanisms keep characters and environments stable.

PoC approach:

-Select one short chapter.

-Generate video samples of 3-5 scenes.

-Compare against the cost of manual scripting/storyboarding.

4.2 Brand Marketing and Advertising Creative Samples

Customers:

-Brand marketing departments

-Advertising agencies

-Content creative teams

Pain points:

-Creative proposals need samples, but shooting is expensive.

-Customers often want to see the "feel"; text alone is not enough.

-Multiple versions of an idea require quick trial and error.

Where ViMax fits:

-Idea2Video generates story clips from ideas.

-Script2Video visualizes the ad script.

-AutoCameo supports interactive, shareable campaigns.

Suitable outputs:

-Product story short films.

-Event teaser videos.

-Brand concept pieces.

-Interactive social media videos.

4.3 Education/Children's Story Video

Customers:

-Educational content companies

-Picture book/children's story platforms

-Home education products

Pain points:

-Large numbers of stories need to be turned into video.

-Characters must stay stable, the style unified, and scenes simple.

-Limited budget for content production.

Where ViMax fits:

-Idea2Video generates children's stories.

-Novel2Video adapts picture books/chapters.

-The number of scenes, styles, and shots can be controlled.

4.4 Game/Animation Early Concept Validation

Customers:

-Game companies

-Animation studios

-Virtual human/digital content teams

Pain points:

-The worldview, characters, and plot need to be visualized early.

-Storyboarding and concept previews are expensive.

-Teams want to test different shot pacing quickly.

Where ViMax fits:

-Generates concept videos from scripts and character settings.

-Maintains character appearance with reference images and consistency checks.

-Serves as a previsualization tool, not a final production tool.

4.5 Personalized Interactive Video / Event Promotion

Customers:

-Cultural tourism events

-Brand campaigns

-Fan interaction platforms

-Pet/parent-child content products

Pain points:

-Users want to put themselves in the story.

-Producing personalized video by hand is expensive.

Where ViMax fits:

-AutoCameo generates cameo videos from user photos.

-Can support event posters and shareable short videos.

Points to raise with the customer:

-Portrait authorization.

-Protection of minors.

-Content security review.

-Deepfake risk.

5. Unsuitable Scenarios

| Scenario | Reason |
| --- | --- |
| Direct production of film/TV-grade finished works | Currently closer to a research/PoC framework; the final cut still needs professional post-production and manual review |
| Commercials that require extremely high character consistency | Despite the consistency mechanisms, the generative model may still drift |
| Strictly regulated content such as medical, financial, and government communications | The facts, imagery, and wording of AI-generated video need strict review |
| Teams without API keys or generation model resources | ViMax relies on LLM, image generator, and video generator APIs/models |
| Business users who want an ordinary editing-software UI | Currently mainly code and a TUI, not a mature SaaS product |
| Fully automated mass production of long series | Long video continuity remains a hard problem; suitable for segmented PoCs, not for committing to mass production |

6. Core Capability List

| Capability | Description | Pre-Sales Value |
| --- | --- | --- |
| Idea2Video | From an idea to a complete video story | Suitable for quick creative samples |
| Novel2Video | Novel/long text to varied video content | Suited to IP visualization and early-stage short drama work |
| Script2Video | Script to video | Suitable for ad scripts, training scripts, and short drama scripts |
| AutoCameo | Generates cameo videos from photos | Suitable for interactive marketing and personalized content |
| RAG long script generation | Long text analysis and multi-scene script segmentation | Addresses long text adaptation |
| Storyboard design | Plans shot-level storyboards using film language | Lowers the barrier to storyboarding |
| Multi-camera simulation | Simulates multi-camera shooting | Improves viewing experience and shot variety |
| Reference image selection | Selects reference images for the current video's first frame, including images from the historical timeline | Improves character/environment consistency |
| Consistency check | Generates multiple images in parallel and uses a VLM/MLLM to pick the most consistent one | Reduces out-of-control visuals |
| Parallel shot generation | Processes shots from the same camera in parallel | Improves production efficiency |
| TUI / Agent Loop | Interactive planning, revision, rendering control, and sessions | Better suited to human-in-the-loop authoring |
| Configurable providers | LLM, image, video, embedding, and reranker can each be configured | Easy to connect different model services |
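
The parallel shot generation capability listed above can be pictured as grouping shots by camera and generating each group concurrently. The sketch below is an assumption about the general shape of such a step, not ViMax's actual scheduler: group_by_camera, generate_camera_group, and fake_generate are hypothetical names, and fake_generate stands in for a real video generation API call.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_camera(shots):
    """Group (shot_id, camera_id) pairs by camera, preserving shot order."""
    groups = defaultdict(list)
    for shot_id, camera_id in shots:
        groups[camera_id].append(shot_id)
    return dict(groups)


def fake_generate(shot_id):
    """Placeholder for a real video generation call."""
    return f"clip_{shot_id}.mp4"


def generate_camera_group(shot_ids, generate=fake_generate, max_workers=4):
    """Generate all shots of one camera concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate, shot_ids))


shots = [(1, "cam_A"), (2, "cam_B"), (3, "cam_A"), (4, "cam_A")]
clips = {cam: generate_camera_group(ids) for cam, ids in group_by_camera(shots).items()}
print(clips)
```

Threads are a reasonable fit here because the work is dominated by waiting on remote APIs. In practice, max_workers would need to respect the provider's rate limits.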

7. Architecture/Deployment/Integration

7.1 Installation

git clone https://github.com/HKUDS/ViMax.git
cd ViMax
uv sync

Requirements:

-Python >= 3.12

-uv

-Linux or Windows

-LLM API

-Image generation API

-Video generation API

7.2 Agent TUI Configuration

configs/agent.local.yaml:

llm:
  model_provider: openai
  model: 
  base_url: 
  api_key: ''

image:
  model: 
  base_url: 
  api_key: ''

video:
  model: 
  base_url: 
  api_key: ''

embedding:
  model_provider: openai
  model: 
  base_url: 
  api_key: ''

reranker:
  model: 
  base_url: 
  api_key: ''

Start:

vimax tui
vimax tui new
vimax tui resume
vimax tui resume 

You can also pass keys through environment variables such as VIMAX_LLM_API_KEY, VIMAX_IMAGE_API_KEY, and VIMAX_VIDEO_API_KEY.
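
When wrapping ViMax for a PoC, it can help to fail early if a key is missing. The following sketch resolves a key from a loaded config dictionary and falls back to the environment variables named above. The precedence (config file first, then environment) is an assumption made for illustration, resolve_api_key is a hypothetical helper, and ViMax's own loading logic may differ.

```python
import os

# Environment variable names as listed above; other config sections are left out of this sketch.
ENV_VARS = {
    "llm": "VIMAX_LLM_API_KEY",
    "image": "VIMAX_IMAGE_API_KEY",
    "video": "VIMAX_VIDEO_API_KEY",
}


def resolve_api_key(section, config, environ=None):
    """Return the key from config if set, else from the environment (assumed precedence)."""
    environ = os.environ if environ is None else environ
    configured = (config.get(section) or {}).get("api_key") or ""
    if configured:
        return configured
    key = environ.get(ENV_VARS[section], "")
    if not key:
        raise KeyError(f"no API key for {section!r}: set api_key or {ENV_VARS[section]}")
    return key


# Example with a fake environment so the sketch does not depend on real secrets.
config = {"llm": {"api_key": ""}, "image": {"api_key": "from-file"}}
fake_env = {"VIMAX_LLM_API_KEY": "from-env"}
llm_key = resolve_api_key("llm", config, fake_env)
image_key = resolve_api_key("image", config, fake_env)
print(llm_key, image_key)
```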

7.3 Direct Pipeline Entry Points

Idea2Video:

python main_idea2video.py

Script2Video:

python main_script2video.py

Corresponding configuration file:

-'configs/idea2video.yaml'

-'configs/script2video.yaml'

The README example uses OpenRouter/Gemini, Google image generation, and Veo video generation. In an actual PoC, these need to be replaced according to the customer's available APIs and budget.

8. Pre-Sales Talking Points

Business-oriented

Most AI video tools today are good at producing clips a few seconds long, but real brand films, short dramas, novel adaptations, and training stories need scripts, characters, storyboards, shots, and continuity. The value of ViMax is that it splits these production steps into a multi-agent process, so that the AI does not just "generate a video" but works like a small production team: it first plans the story, then designs the shots, and finally generates and assembles the footage.

Business Value:

-Shorten the timeline from concept to sample.

-Reduce up-front trial-and-error costs for short dramas, advertising, and educational content.

-Allow non-professional teams to generate more complete visual narrative drafts.

-Provide automatic storytelling and video preview capabilities for content production teams.

Technology-oriented

ViMax is a multi-agent video generation framework. The top layer offers Idea2Video, Novel2Video, Script2Video, and AutoCameo. The middle layer handles script understanding, storyboard planning, reference image selection, and consistency tracking. The bottom layer connects to LLMs, image generators, video generators, embedding/reranker models, and video processing tools. It is suited to evaluating a technical route for long narrative video generation rather than serving as a drop-in replacement for a single video model.

Technical value:

-Addresses long video generation at the framework level.

-Providers are configurable rather than tied to a single model.

-Supports long texts and long timelines through RAG and asset indexing.

-Reduces visual drift with VLM consistency checks.

9. Frequently Asked Customer Questions

Q: What is the difference between ViMax and Runway/Kling/Veo?
A: Runway/Kling/Veo are closer to underlying video generation models. ViMax is a multi-agent production framework that can call image/video generators to carry out scripting, storyboarding, consistency checking, and assembly.

Q: Can it directly generate videos that are minutes long?
A: The goal is long narratives with multiple shots, but the actual quality depends on the underlying model, API, script complexity, and human intervention. A short-chapter PoC should be run before anything is sold.

Q: Can it keep characters consistent?
A: ViMax has reference image selection, character/environment tracking, and VLM consistency checks, but it cannot guarantee 100% consistency at a commercial level and requires manual review of the output.

Q: Is it suitable for Chinese short dramas?
A: There is a Chinese README, and in theory it can process Chinese input, but Chinese text quality, character extraction, prompts, and underlying model support need to be verified with Chinese samples.

Q: Will the cost be high?
A: Yes. Long videos involve multiple rounds of LLM, image generation, and video generation API calls, and the cost must be estimated from the number of scenes, the number of shots, and the number of retries.

Q: Is it a mature product now?
A: It is closer to a research-oriented open source framework and PoC tool, with a TUI and pipelines, but it is not a mature SaaS for business users.
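
The cost answer above can be turned into a rough budgeting formula before a PoC. The sketch below multiplies the number of shots by a per-shot cost made up of LLM calls, candidate images, and video attempts including retries. Every price and count in it is a placeholder assumption, not real provider pricing, and CostAssumptions and estimate_cost are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CostAssumptions:
    # All values are placeholders for illustration, not real provider pricing.
    llm_cost_per_shot: float        # planning and checking calls per shot
    image_cost: float               # one generated image
    video_cost: float               # one generated video clip
    image_candidates_per_shot: int  # images generated in parallel per shot
    expected_retries: float         # average extra video attempts per shot


def estimate_cost(scenes, shots_per_scene, a):
    """Rough total cost: shots x (LLM + candidate images + video attempts)."""
    shots = scenes * shots_per_scene
    per_shot = (
        a.llm_cost_per_shot
        + a.image_candidates_per_shot * a.image_cost
        + (1 + a.expected_retries) * a.video_cost
    )
    return shots * per_shot


assumptions = CostAssumptions(
    llm_cost_per_shot=0.02,
    image_cost=0.04,
    video_cost=0.50,
    image_candidates_per_shot=4,
    expected_retries=0.5,
)
total = estimate_cost(scenes=5, shots_per_scene=4, a=assumptions)
print(f"{total:.2f}")
```

Capping the number of shots and retries, as suggested in the risk table, bounds this estimate directly because both appear as multipliers.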

10. PoC Recommendations

PoC 1: Novel Chapter to Short Video

Input:

-A novel chapter of 1,000-3,000 words.

-Target style: traditional Chinese (guofeng), animation, realism, children's picture book, etc.

-Limits: 3-5 scenes, 10-20 shots.

Verification:

-Whether the key plot points are retained.

-Whether characters stay consistent.

-Whether the scenes and shots are reasonable.

-Whether the video clips can convey the story.

Success Metrics:

-Time spent manually rewriting the script is reduced.

-A storyboard is generated that the content team can use.

-At least 60%-70% of the shots can move on to subsequent refinement.

PoC 2: Advertising Script Visualization

Input:

-A 30-second product advertising script.

-Brand tone and target demographic.

-Product images or reference images.

Verification:

-Whether it can generate storyboards that conform to the brand's tone.

-Whether the product/character appearance can be maintained.

-Whether samples usable in a proposal can be exported.

PoC 3: Interactive Guest Video

Input:

-User photo or pet photo.

-A short story script.

-Style: Anime/Fantasy/City/Campus.

Verification:

-Whether the guest character is recognizable.

-Whether there is a risk of misuse of the portrait.

-Whether a content security review process can be integrated.

11. Risks and Considerations

| Risk | Description | Recommended Response |
| --- | --- | --- |
| Underlying model dependency | ViMax depends on LLM, image generation, and video generation APIs | Confirm available models, prices, and rate limits before the PoC |
| Non-negligible cost | Multiple shots trigger a large number of image/video generations and retries | Set a maximum number of shots and a budget |
| Consistency is still uncertain | The framework has consistency mechanisms, but the video model may still drift | Manual review, locked reference images, and multiple samples |
| Copyright and portrait risks | AutoCameo and IP adaptation involve authorization | Establish authorization, content security, and compliance processes |
| Not a mature SaaS | Currently mainly code/TUI/scripts | Requires secondary packaging by a technical team |
| Unstable production quality | AI video may show physical errors, image distortion, and style jumps | Position it for samples, previsualization, and assisted creation rather than final cuts |
| Effect on Chinese content to be verified | Chinese scripts and Chinese cultural context depend on model ability | Evaluate with real Chinese samples |

12. My Pre-Sales Judgment

ViMax is a good fit for the next stage of AI video: moving from short prompts that generate clips a few seconds long to agentic video production with scripts, storyboards, character continuity, and multi-shot scheduling.

The most suitable pre-sales positioning is:

A PoC framework for long narrative video generation, not a mature video SaaS.

Recommended key customers:

  1. Short drama/web fiction/novel platforms: validate automatic visualization of IP content.
  2. Advertising/brand teams: quickly generate samples from ad scripts.
  3. Educational content teams: turn stories and course content into video.
  4. AI video platform teams: use its multi-agent architecture as a reference for productization.

It is not recommended to tell customers that "ViMax can directly replace directors, screenwriters, editors, and post-production". A safer message is:

ViMax can greatly speed up the generation of early ideas, scripts, storyboards, and samples, allowing human creators to focus on selection, aesthetics, and final quality control.