1. Project Overview
| Dimension | Information |
|---|---|
| Project | HKUDS/ViMax |
| Official Title | ViMax: Agentic Video Generation |
| Official Description | Director, Screenwriter, Producer, and Video Generator All-in-One |
| Paper | arXiv:2606.07649 ViMax: Agentic Video Generation |
| Core Problem | Existing video generation tools can only produce short clips and lack narrative structure, character consistency, scene continuity, and audio-visual coordination |
| Core Capabilities | Multi-agent long-video generation, RAG-based long-script design, storyboard and shot planning, reference image selection, consistency checking, parallel multi-shot generation |
| Main Pipelines | Idea2Video, Novel2Video, Script2Video, AutoCameo |
| Interactive Mode | Agent Loop TUI |
| Environment | Python 3.12, managed with uv, Linux/Windows |
| Main Dependencies | LangChain, OpenAI SDK, Google GenAI, FAISS, MoviePy, OpenCV, PySceneDetect, Pillow, etc. |
| License | MIT |
| Information Checked On | 2026-06-30 |
The paper's abstract states the problem clearly: long video generation requires systematic narrative planning and visual consistency, whereas current short-clip methods usually produce isolated sequences and lack a mechanism for keeping characters and environments consistent across scenes. ViMax uses multi-agent collaboration to coordinate narrative decisions, visual continuity, and production quality.
2. Diagrams and Demos Provided by the Project
2.1 Project main diagram
[Image: ViMax]
2.2 Official video demos in the README
The README embeds several videos hosted on `github.com/user-attachments`, covering clips of different subjects generated from scratch, such as underwater scenes, animals, a sky city cameo, and a cat cameo. Obsidian does not always render remote videos reliably, so for demonstrations it is better to open the GitHub README page and play them there.
2.3 Multi-agent Video Generation Pipeline
The README presents the architecture as an HTML table; its core can be organized as follows:
A["Idea / Script / Novel / Prompt / Reference Images / Style / Config"] --> B["Central Orchestration: Agent Scheduling / Stage Transitions / Resource Management / Retry"]
B --> C["Script Understanding: Character and Environment Extraction / Scene Boundaries / Style Intent"]
B --> D["Scene and Shot Planning: Storyboard / Shot List / Key Frames / Beats"]
C --> E["Visual Asset Planning: Reference Image Selection / Style Guidance / Prompt Conditioning"]
D --> E
E --> F["Asset Indexing: Frames & Refs Catalog / Embeddings / Retrieval"]
E --> G["Consistency and Continuity: Character Tracking / Environment Tracking / Ref Matching / Temporal Coherence"]
F --> H["Visual Synthesis and Assembly: Image Generation / Best Frame Selection / First-Last Frame to Video / Timeline Assembly"]
G --> H
H --> I["Output: Frames / Clips / Final Videos / Logs / Working Directory Artifacts"]
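The edges in the diagram above form a dependency graph, and it can help to see which stages could in principle run side by side. The following is a minimal sketch, not ViMax's actual scheduler: the single-letter keys mirror the diagram's node IDs, while the snake_case stage names and the `execution_waves` helper are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# Illustrative names for the diagram nodes (not identifiers from the ViMax codebase).
STAGE_NAMES = {
    "A": "input",
    "B": "central_orchestration",
    "C": "script_understanding",
    "D": "scene_shot_planning",
    "E": "visual_asset_planning",
    "F": "asset_indexing",
    "G": "consistency_continuity",
    "H": "synthesis_assembly",
    "I": "output",
}

# Each stage maps to the set of stages it depends on (edges copied from the diagram).
DEPENDS_ON = {
    "A": set(),
    "B": {"A"},
    "C": {"B"},
    "D": {"B"},
    "E": {"C", "D"},
    "F": {"E"},
    "G": {"E"},
    "H": {"F", "G"},
    "I": {"H"},
}


def execution_waves(depends_on):
    """Group stages into waves; every stage in a wave has all its dependencies done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(DEPENDS_ON)
for wave in waves:
    print([STAGE_NAMES[stage] for stage in wave])
```

Under this reading, script understanding and shot planning share a wave, as do asset indexing and consistency tracking; whether ViMax actually runs them concurrently is not stated in the source.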
3. What It Can Do
3.1 Generating video from an idea: Idea2Video
Idea2Video takes a sentence or a paragraph of creative input. The user describes an idea, a target audience, and a style, and ViMax automatically completes:
- Idea understanding
- Story structure generation
- Character design
- Scene division
- Storyboard planning
- Image generation prompts
- Video generation calls
- Clip assembly
Example:
idea = """
If a cat and a dog are best friends, what would happen when they meet a new cat?
"""
user_requirement = """
For children, do not exceed 3 scenes.
"""
style = "Cartoon"
Pre-sales takeaway: this suits low-cost sample generation for marketing short-film concept validation, children's educational stories, branded mini-dramas, and product concept films.
3.2 Adapting novels and long texts into video: Novel2Video
Novel2Video is what makes ViMax more interesting than ordinary video generation tools. Instead of turning a single prompt into a few seconds of video, it attempts to compress a complete novel or long narrative into diverse video content.
It handles:
- Long-text understanding
- Narrative compression
- Key plot retention
- Character tracking
- Scene splitting
- Shot-level visual adaptation
Both the paper and the README emphasize a RAG-based long-script design engine: it analyzes novel-length texts through RAG and automatically divides them into a multi-scene script format, keeping key plot points and character dialogue as far as possible.
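The retrieval half of that RAG step can be illustrated with a toy example. This is a sketch under stated assumptions, not ViMax's engine: it splits text into sentences and ranks them with bag-of-words cosine similarity, whereas a real pipeline would more likely use an embedding model and a vector index (FAISS appears in the dependency list). All function names and the sample text are hypothetical.

```python
import math
import re
from collections import Counter


def split_into_passages(text):
    """Naive stand-in for scene-aware chunking: one passage per sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def bag_of_words(text):
    """Lowercased word counts; a crude substitute for an embedding vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, passages, top_k=2):
    """Return the top_k passages most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bag_of_words(p)), reverse=True)
    return ranked[:top_k]


novel = (
    "Mira found the brass key under the lighthouse stairs. "
    "The storm came that night and the harbor flooded. "
    "Years later, Mira used the brass key to open the keeper's chest."
)
passages = split_into_passages(novel)
hits = retrieve("Where does the brass key appear?", passages, top_k=2)
for hit in hits:
    print(hit)
```

The point of retrieving before writing each scene is that the script writer only sees the passages relevant to the current plot thread, which is one way a long text can be adapted without fitting it all into a single prompt.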
Pre-sales value:
- Automatic adaptation of web novels, short dramas, and educational stories.
- Quick visualization of IP content.
- Long-form text becomes sample videos.
- Helps content teams reduce the up-front cost of storyboarding and script adaptation.
3.3 Generating video from a script: Script2Video
Script2Video is for clients who already have a script. It lets users provide a screenplay and set requirements for style, pacing, number of shots, and so on.
The sample input resembles a screenplay format:
EXT. SCHOOL GYM - DAY
A group of students are practicing basketball in the gym...
John: I'm going to score a basket!
Jane: Good job, John!
Users can also add requirements:
Fast-paced with no more than 20 shots.
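To make the screenplay input concrete, the sample above can be split into scene headings, action lines, and dialogue before any shot planning happens. The parser below is a hypothetical sketch: ViMax's own script understanding is LLM-driven according to the source, and the regular expressions here only cover the simple `EXT. LOCATION - TIME` and `Name: line` shapes shown in the sample.

```python
import re

# Matches headings such as "EXT. SCHOOL GYM - DAY" or "INT. KITCHEN - NIGHT".
SCENE_HEADING = re.compile(r"^(INT|EXT)\.\s+(?P<location>.+?)\s+-\s+(?P<time>.+)$")
# Matches dialogue written as "Speaker: line".
DIALOGUE = re.compile(r"^(?P<speaker>[A-Z][\w ]*):\s+(?P<line>.+)$")


def parse_script(text):
    """Return a list of scenes, each with location, time, action lines, and dialogue."""
    scenes = []
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = SCENE_HEADING.match(line)
        if heading:
            scenes.append({
                "location": heading["location"],
                "time": heading["time"],
                "action": [],
                "dialogue": [],
            })
            continue
        if not scenes:
            continue  # ignore anything before the first scene heading
        spoken = DIALOGUE.match(line)
        if spoken:
            scenes[-1]["dialogue"].append((spoken["speaker"], spoken["line"]))
        else:
            scenes[-1]["action"].append(line)
    return scenes


script = """
EXT. SCHOOL GYM - DAY
A group of students are practicing basketball in the gym...
John: I'm going to score a basket!
Jane: Good job, John!
"""
scenes = parse_script(script)
print(scenes[0]["location"], scenes[0]["time"], len(scenes[0]["dialogue"]))
```

A structured form like this is also where a requirement such as a shot limit could be enforced, since shots can be counted per scene.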
Pre-sales value:
- When the customer already has a script or copy, visual samples can be generated quickly.
- Suitable for advertising scripts, training scripts, short-drama scripts, and brand stories.
- Lets customers preview the narrative pacing and the feel of the shots before shooting.
3.4 Generating cameo videos from photos: AutoCameo
The AutoCameo concept is to upload a photo of yourself or a pet and have that character appear in a creative script and cinematic footage. It is essentially aimed at personalized interactive video:
- Personal cameos
- Pet cameos
- Brand spokespersons/digital avatars
- Fan interaction content
In pre-sales, this can be positioned for entertainment marketing, event promotion, educational interaction, and personalized short dramas. Customers must also be reminded that portrait rights, authorization, face swapping and identity abuse, and content compliance need careful review.
3.5 Multi-Agent Production Process
ViMax's focus is not on a single model but on a multi-agent production process:
| Phase | What it does | Why it matters |
|---|---|---|
| Script Understanding | Extracts characters, environments, scene boundaries, and style intent | Lets the system know who is in the story, where it happens, and what happens |
| Scene & Shot Planning | Generates storyboards, shot lists, key frames, and beats | Turns text narrative into shot language |
| Visual Asset Planning | Selects reference images and generates visual prompts | Improves character/environment consistency |
| Asset Indexing | Maintains frames, reference images, embeddings, and reusable footage | Reuses visual assets across scenes in long videos |
| Consistency & Continuity | Tracks character, environment, and temporal continuity | Addresses the common AI-video problem of characters changing appearance and scenes becoming incoherent |
| Visual Synthesis & Assembly | Image generation, best frame selection, first/last frame to video, timeline assembly | Turns raw material into the final video |
3.6 Consistency Check with VLM/MLLM
The README states that ViMax generates multiple images in parallel and uses an MLLM/VLM to select the most consistent one as the first frame of the video, mimicking a human creator's workflow.
This is critical for long videos, because prompts alone rarely maintain the following:
- The same character keeps the same appearance.
- The environment stays consistent within a scene.
- The spatial positions of multiple characters are plausible.
- Consecutive shots do not clash.
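The generate-in-parallel-then-select step described above is a best-of-N pattern. The sketch below shows only the control flow, under loud assumptions: `generate_candidate` and `consistency_score` are hypothetical stubs standing in for an image generation API and a VLM/MLLM judgment, and the attribute dictionaries are fake data, not anything ViMax produces.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_candidate(prompt, seed):
    """Hypothetical stub for an image generation call; returns a fake image record."""
    return {
        "seed": seed,
        "prompt": prompt,
        # Fake attributes that vary with the seed, so candidates differ.
        "attributes": {
            "coat": "orange" if seed % 2 else "grey",
            "collar": seed % 3 == 0,
        },
    }


def consistency_score(candidate, reference):
    """Hypothetical stub for a VLM/MLLM check: fraction of reference attributes matched."""
    matched = sum(candidate["attributes"].get(k) == v for k, v in reference.items())
    return matched / len(reference)


def pick_first_frame(prompt, reference, n_candidates=4):
    """Generate candidates concurrently, then keep the one that best matches the reference."""
    with ThreadPoolExecutor(max_workers=n_candidates) as pool:
        candidates = list(pool.map(lambda seed: generate_candidate(prompt, seed), range(n_candidates)))
    return max(candidates, key=lambda c: consistency_score(c, reference))


reference = {"coat": "orange", "collar": True}
best = pick_first_frame("an orange cat with a collar on a sofa", reference, n_candidates=4)
print(best["seed"], consistency_score(best, reference))
```

The trade-off is visible in the parameters: a larger `n_candidates` raises the chance of a consistent first frame but multiplies image generation cost for every shot.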
Pre-sales talking point:
ViMax's idea is not to have a model generate a complete long video in one pass, but to split the long video into a production line of shots, reference images, and consistency checks, reducing the risk of the long video going off the rails.
4. Applicable Scenarios
4.1 Short Drama/Web Novel/Novel IP Visualization
Customers:
- Short drama companies
- Web novel platforms
- IP operations teams
- MCNs/content studios
Pain points:
- Adapting a novel into a short drama requires up-front investment in screenwriting, filming, directing, and early-stage art.
- It is hard to see visual results quickly before shooting.
- There is plenty of IP, but trial and error is expensive.
Where ViMax fits:
- Novel2Video handles plot compression and storyboarding.
- Script2Video produces samples from a given script.
- Consistency mechanisms maintain characters and environments.
PoC approach:
- Select one short chapter.
- Generate video samples of 3-5 scenes.
- Compare against the cost of manual scripting and storyboarding.
4.2 Brand Marketing and Advertising Creative Samples
Customers:
- Brand marketing departments
- Advertising agencies
- Creative content teams
Pain points:
- Creative proposals need samples, but shooting is expensive.
- Customers often want to see the "feel"; text alone is not enough.
- Multiple versions of an idea require quick trial and error.
Where ViMax fits:
- Idea2Video generates story clips from ideas.
- Script2Video visualizes the ad script.
- AutoCameo enables interactive, shareable campaigns.
Suitable outputs:
- Product story short films.
- Event teaser videos.
- Brand concept films.
- Social media interactive videos.
4.3 Education/Children's Story Video
Customers:
- Educational content companies
- Picture book/children's story platforms
- Home education products
Pain points:
- Large numbers of stories need to be turned into video.
- Characters must stay stable, the style unified, and scenes simple.
- Content production budgets are limited.
Where ViMax fits:
- Idea2Video generates children's stories.
- Novel2Video adapts picture books and chapters.
- The number of scenes, the style, and the number of shots can be controlled.
4.4 Game/Animation Early Concept Validation
Customers:
- Game companies
- Animation studios
- Virtual human/digital content teams
Pain points:
- Worldbuilding, characters, and plot need to be visualized early.
- Storyboarding and concept previews are expensive.
- Teams want to test different shot pacing quickly.
Where ViMax fits:
- Generate concept videos from scripts and character settings.
- Maintain character appearance with reference images and consistency checks.
- Use it as a previsualization tool, not as a final production tool.
4.5 Personalized Interactive Video/Event Promotion
Customers:
- Cultural tourism events
- Brand campaigns
- Fan interaction platforms
- Pet/parent-child content products
Pain points:
- Users want to put themselves in the story.
- Producing personalized videos manually is expensive.
Where ViMax fits:
- AutoCameo generates cameo videos from user photos.
- It can be used for event posters and short-video sharing campaigns.
Required reminders:
- Portrait authorization.
- Protection of minors.
- Content safety review.
- Deepfake risk.
5. Unsuitable Scenarios
| Scenario | Reason |
|---|---|
| Directly producing film- and TV-grade finished works | It is currently closer to a research/PoC framework; the final cut still needs professional post-production and manual review |
| Commercials that require extremely high character consistency | Despite the consistency mechanisms, the generative model may still drift |
| Strictly regulated content, such as medical, financial, and government communications | Facts, imagery, and wording in AI-generated video require strict review |
| Teams without API keys or generation model resources | ViMax relies on LLM, image generator, and video generator APIs/models |
| Business users who expect an ordinary editing-software UI | It is currently mainly code and a TUI, not a mature SaaS product |
| Fully automatic mass production of long series | Long-video continuity is still an open problem; it suits segmented PoCs, and mass production should not be promised |
6. Core Capability List
| Capability | Description | Pre-Sales Value |
|---|---|---|
| Idea2Video | From an idea to a complete video story | Suitable for quick creative samples |
| Novel2Video | Novel/long text to diverse video content | Suited to IP visualization and short-drama pre-production |
| Script2Video | Script to video | Suitable for ad scripts, training scripts, and short-drama scripts |
| AutoCameo | Generates cameo videos from photos | Suitable for interactive marketing and personalized content |
| RAG long-script generation | Long-text analysis and multi-scene script segmentation | Addresses long-text adaptation |
| Storyboard design | Plans shot-level storyboards using film language | Lowers the barrier to storyboarding |
| Multi-camera simulation | Simulates multi-camera shooting | Improves the viewing experience and shot variety |
| Reference image selection | Selects reference images for the current video's first frame, including from the earlier timeline | Improves character/environment consistency |
| Consistency check | Generates multiple images in parallel and uses a VLM/MLLM to pick the most consistent one | Reduces off-model frames |
| Parallel shot generation | Processes shots from the same camera in parallel | Improves production efficiency |
| TUI / Agent Loop | Interactive planning, revision, render control, and sessions | Better suited to human-in-the-loop authoring |
| Configurable providers | LLM, image, video, embedding, and reranker can each be configured | Easy to connect different model services |
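The reference image selection row in the table above can be pictured as a ranking over earlier frames on the timeline. The following is a sketch of one plausible heuristic, not ViMax's method: the `Frame` record, the overlap score, and the recency tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical record for a frame already placed on the timeline."""
    shot_id: int
    characters: frozenset
    environment: str


def select_references(current, timeline, max_refs=2):
    """Pick earlier frames that share characters or environment with the current shot."""

    def overlap(past):
        shared_characters = len(current.characters & past.characters)
        same_environment = 1 if past.environment == current.environment else 0
        # Second tuple element breaks ties in favor of more recent shots.
        return (shared_characters + same_environment, past.shot_id)

    candidates = [
        frame for frame in timeline
        if frame.shot_id < current.shot_id and overlap(frame)[0] > 0
    ]
    return sorted(candidates, key=overlap, reverse=True)[:max_refs]


timeline = [
    Frame(1, frozenset({"cat"}), "garden"),
    Frame(2, frozenset({"dog"}), "kitchen"),
    Frame(3, frozenset({"cat", "dog"}), "garden"),
]
current = Frame(4, frozenset({"cat", "dog"}), "garden")
refs = select_references(current, timeline)
print([frame.shot_id for frame in refs])
```

In a real system the tags would come from script understanding and the asset index rather than hand-written sets, and the score would more likely involve embeddings or a VLM.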
7. Architecture/Deployment/Integration
7.1 Installation
git clone https://github.com/HKUDS/ViMax.git
cd ViMax
uv sync
Requirements:
- Python >= 3.12
- uv
- Linux or Windows
- LLM API
- Image generation API
- Video generation API
7.2 Agent TUI Configuration
`configs/agent.local.yaml`:
llm:
  model_provider: openai
  model:
  base_url:
  api_key: ''
image:
  model:
  base_url:
  api_key: ''
video:
  model:
  base_url:
  api_key: ''
embedding:
  model_provider: openai
  model:
  base_url:
  api_key: ''
reranker:
  model:
  base_url:
  api_key: ''
Start:
vimax tui
vimax tui new
vimax tui resume
vimax tui resume
You can also pass keys through environment variables such as `VIMAX_LLM_API_KEY`, `VIMAX_IMAGE_API_KEY`, and `VIMAX_VIDEO_API_KEY`.
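When wrapping ViMax for a PoC, it can be useful to check up front that every provider has a key. The sketch below uses the three environment variable names mentioned above and a dictionary that mirrors the YAML sections; the `resolve_api_key` helper and the rule that the environment variable wins over the config file are assumptions, since the source does not state ViMax's actual precedence.

```python
import os

# Environment variable names as given in the text above.
ENV_KEYS = {
    "llm": "VIMAX_LLM_API_KEY",
    "image": "VIMAX_IMAGE_API_KEY",
    "video": "VIMAX_VIDEO_API_KEY",
}


def resolve_api_key(section, config, environ=None):
    """Return the key for a section; the env var is tried first (assumed precedence)."""
    environ = os.environ if environ is None else environ
    from_env = environ.get(ENV_KEYS[section], "").strip()
    if from_env:
        return from_env
    from_file = (config.get(section) or {}).get("api_key") or ""
    if from_file:
        return from_file
    raise KeyError(f"no API key for '{section}': set {ENV_KEYS[section]} or {section}.api_key")


# A dict standing in for the parsed YAML config; the values are placeholders.
config = {"llm": {"api_key": "file-key"}, "image": {"api_key": ""}}

# Explicit dicts are passed as the environment so the demo does not depend on the real one.
print(resolve_api_key("llm", config, {"VIMAX_LLM_API_KEY": "env-key"}))
print(resolve_api_key("llm", config, {}))
```

Failing early with a clear message is cheaper than discovering a missing video key after the LLM and image stages have already consumed budget.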
7.3 Direct pipeline entry points
Idea2Video:
python main_idea2video.py
Script2Video:
python main_script2video.py
Corresponding configuration files:
- `configs/idea2video.yaml`
- `configs/script2video.yaml`
The README example uses OpenRouter/Gemini, Google image generation, and Veo video generation. In an actual PoC, these need to be replaced according to the customer's available APIs and budget.
8. Pre-Sales Talking Points
For business audiences
Most AI video tools today are good at making clips a few seconds long, but real brand films, short dramas, novel adaptations, and training stories need scripts, characters, storyboards, shots, and continuity. ViMax's value is that it splits these production steps into a multi-agent process, so that the AI does not just "generate a video" but works like a small production team: it first plans the story, then designs the shots, and finally generates and assembles the footage.
Business value:
- Shortens the timeline from concept to sample.
- Reduces up-front trial-and-error costs for short dramas, advertising, and educational content.
- Lets non-professional teams produce more complete visual narrative drafts.
- Gives content production teams automatic storytelling and video preview capabilities.
For technical audiences
ViMax is a multi-agent video generation framework. The top layer offers Idea2Video, Novel2Video, Script2Video, and AutoCameo. The middle layer handles script understanding, storyboard planning, reference image selection, and consistency tracking. The bottom layer connects to LLMs, image generators, video generators, embedding/reranker models, and video processing tools. It suits evaluating the technical route of long narrative video generation, rather than directly replacing a single video model.
Technical value:
- A framework-level approach to long video generation.
- Providers are configurable rather than tied to a single model.
- RAG and asset indexing support long texts and long timelines.
- VLM consistency checks reduce visual drift.
9. Frequently Asked Customer Questions
| Question | Answer |
|---|---|
| How does ViMax differ from Runway/Kling/Veo? | Runway, Kling, and Veo are closer to underlying video generation models. ViMax is a multi-agent production framework that calls image/video generators to carry out scripting, storyboarding, consistency checking, and assembly. |
| Can it directly generate videos that are minutes long? | It targets long narratives and multiple shots, but actual quality depends on the underlying model, the API, script complexity, and human intervention. Run a short-chapter PoC before making commitments. |
| Can it keep characters consistent? | ViMax has reference image selection, character/environment tracking, and VLM consistency checking, but it cannot guarantee 100% consistency at a commercial level, and the output requires manual review. |
| Is it suitable for Chinese short dramas? | There is a Chinese README, and in theory it can process Chinese input, but Chinese text quality, character extraction, prompts, and underlying model support need to be verified with Chinese samples. |
| Will the cost be high? | Yes. Long videos trigger multiple rounds of LLM, image generation, and video generation API calls, so cost must be estimated from the number of scenes, the number of shots, and the number of retries. |
| Is it a mature product now? | It is closer to a research-oriented open-source framework and PoC tool, with a TUI and pipelines, but it is not a mature SaaS for business users. |
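The cost answer in the table above names three drivers: scenes, shots, and retries. A back-of-the-envelope estimator makes that concrete. This is a sketch only: the formula, the `UnitPrices` fields, and every number below are placeholder assumptions, not ViMax measurements or real provider prices.

```python
from dataclasses import dataclass


@dataclass
class UnitPrices:
    """Placeholder per-call prices; substitute the customer's actual provider rates."""
    llm_call: float
    image: float
    video_clip: float


def estimate_cost(scenes, shots_per_scene, candidates_per_shot, retry_rate, llm_calls_per_shot, prices):
    """Rough cost model: retries inflate image and video attempts, not LLM planning calls."""
    shots = scenes * shots_per_scene
    attempts = shots * (1 + retry_rate)
    image_cost = attempts * candidates_per_shot * prices.image
    video_cost = attempts * prices.video_clip
    llm_cost = shots * llm_calls_per_shot * prices.llm_call
    return {
        "shots": shots,
        "image": image_cost,
        "video": video_cost,
        "llm": llm_cost,
        "total": image_cost + video_cost + llm_cost,
    }


prices = UnitPrices(llm_call=0.01, image=0.05, video_clip=0.5)
estimate = estimate_cost(
    scenes=4,
    shots_per_scene=4,
    candidates_per_shot=3,
    retry_rate=0.25,
    llm_calls_per_shot=5,
    prices=prices,
)
print(estimate)
```

Even with made-up prices, the structure shows why shot count and candidate count are the levers to cap in a PoC: both multiply the most expensive calls.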
10. PoC Recommendations
PoC 1: Novel Chapter to Short Video
Input:
- A novel chapter of 1000-3000 words.
- Target style: traditional Chinese style, animation, realism, children's picture book, etc.
- Limits: 3-5 scenes, 10-20 shots.
Verification:
- Whether key plot points are preserved.
- Whether the characters are consistent.
- Whether the scenes and shots are reasonable.
- Whether the video clips convey the story.
Success Metrics:
- Time spent manually rewriting the script is reduced.
- A storyboard the content team can use is generated.
- At least 60%-70% of the shots can move on to subsequent refinement.
PoC 2: Advertising Script Visualization
Input:
- A 30-second product advertising script.
- Brand tone and target demographic.
- Product images or reference images.
Verification:
- Whether it can generate storyboards that match the brand's tone.
- Whether the product/character appearance can be maintained.
- Whether samples usable in a proposal can be exported.
PoC 3: Interactive Guest Video
Input:
- A user photo or pet photo.
- A short story script.
- Style: anime/fantasy/city/campus.
Verification:
- Whether the cameo character is recognizable.
- Whether there is a risk of portrait misuse.
- Whether a content safety review process can be integrated.
11. Risks and Considerations
| Risk | Description | Response Recommendations |
|---|---|---|
| Underlying model dependency | ViMax depends on LLM, image generation, and video generation APIs | Confirm available models, prices, and rate limits before the PoC |
| Non-negligible cost | Multiple shots trigger a large number of image/video generations and retries | Set a maximum number of shots and a budget |
| Consistency is still uncertain | The framework has consistency mechanisms, but the video model may still drift | Manual review, locked reference images, multiple samples |
| Copyright and portrait risks | AutoCameo and IP adaptation involve authorization | Establish authorization, content safety, and compliance processes |
| Not a mature SaaS | It is currently mainly code, a TUI, and scripts | Requires secondary packaging by a technical team |
| Unstable production quality | AI video may show physical errors, visual distortion, and style jumps | Position it for samples, previsualization, and assisted creation rather than final cuts |
| Chinese content quality unverified | Chinese scripts and Chinese cultural context depend on model capability | Evaluate with real Chinese samples |
12. My Pre-Sales Judgment
ViMax fits what looks like the next stage of AI video: moving from short prompts that generate a few seconds of footage to agentic video production with scripts, storyboards, character continuity, and multi-shot scheduling.
The best pre-sales positioning is:
A PoC framework for long narrative video generation, not a mature video SaaS.
Recommended key customers:
- Short drama/web novel/novel platforms: verify automatic visualization of IP content.
- Advertising/brand teams: quickly generate ad script samples.
- Educational content teams: turn stories and course content into video.
- AI video platform teams: use its multi-agent architecture as a reference for production systems.
It is not recommended to tell customers that "ViMax can directly replace the director, screenwriter, editor, and post-production". A safer way to put it is:
ViMax can greatly speed up the generation of early ideas, scripts, storyboards, and samples, allowing human creators to focus on selection, aesthetics, and final quality control.