🔴 Advanced ⚙️ Type: World Model / Image-to-Video 💸 Free & Open Source (Apache-2.0) ⭐ 100+ Hugging Face Likes
What is SANA-WM?
SANA-WM (World Model) is a 2.6-billion-parameter open-source AI model developed by researchers at NVIDIA and MIT. It takes a single starting image, a text prompt, and a camera trajectory, and generates a continuous 60-second, 720p video that moves through the depicted environment along that trajectory.
Where standard AI video generators give you little explicit control over viewpoint, SANA-WM is designed as a “world model”: it aims to keep 3D layout, object permanence, and lighting consistent as the camera moves. If you tell it to walk forward through a virtual room and turn left, it synthesizes a continuous video that approximates what a physical camera would capture along that path.
Historically, generating a 60-second video at 720p required large multi-GPU servers. SANA-WM uses an architecture called a Hybrid Linear Diffusion Transformer, which substantially reduces memory costs. This allows you to generate long, consistent video rollouts on a single high-end GPU or, via community ports, on a high-memory Apple Silicon Mac.
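The “Linear” in the architecture name suggests linear attention, which is one common way to cut attention memory on long token sequences. This article does not describe SANA-WM's actual layers, so the snippet below is only a generic sketch of the idea, with a ReLU feature map chosen as an assumption. It shows that reordering the matrix products gives the same output while replacing the N x N score matrix with a small d x d summary.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def feature_map(m):
    """ReLU plus a small constant, so every feature is strictly positive (an assumption)."""
    return [[max(x, 0.0) + 1e-6 for x in row] for row in m]

def quadratic_attention(q, k, v):
    """Computes (phi(Q) phi(K)^T) V, which materializes an N x N score matrix."""
    fq, fk = feature_map(q), feature_map(k)
    scores = matmul(fq, transpose(fk))      # N x N: grows with the square of the token count
    out = matmul(scores, v)
    return [[x / sum(s_row) for x in o_row] for s_row, o_row in zip(scores, out)]

def linear_attention(q, k, v):
    """Computes phi(Q) (phi(K)^T V), which only needs a d x d summary of keys and values."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)           # d x d: size does not depend on the token count
    k_sum = [sum(col) for col in zip(*fk)]  # per-feature sum of keys, used for normalization
    out = matmul(fq, kv)
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / n for x in o_row] for o_row, n in zip(out, norms)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

if __name__ == "__main__":
    rng = random.Random(0)
    n_tokens, dim = 64, 8
    q, k, v = (random_matrix(n_tokens, dim, rng) for _ in range(3))
    a = quadratic_attention(q, k, v)
    b = linear_attention(q, k, v)
    max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print(f"largest difference between the two orderings: {max_diff:.2e}")
```

For a minute of 720p video the token count N is far larger than the feature size d, so avoiding the N x N matrix is where the memory saving would come from in a design like this.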
Who is it for?
- VFX Artists and Directors who want to pre-visualize complex camera movements (like drone sweeps or dolly zooms) through environments generated from a single piece of concept art.
- Game Developers looking to prototype level designs or experiment with interactive “playable” generative worlds.
- AI Researchers and Enthusiasts who want a powerful, open-source alternative to closed systems like OpenAI’s Sora to run locally on their own hardware.
- Developers looking to integrate long-horizon, spatially consistent video generation into their own custom applications without paying per-minute API fees.
What makes it special?
- One Image = Infinite Explorations: You can feed it the same starting image with three different camera paths (e.g., pan left, fly up, walk forward), and it will generate three different explorations of that same world.
- Native 60-Second Generation: Most open-source video models are built for clips of a few seconds and lose coherence when pushed further. SANA-WM is designed to hold spatial coherence for a full 60 seconds in a single pass.
- Precise 6-DoF Camera Control: You explicitly feed the model a six-degrees-of-freedom (6-DoF) camera path, meaning three translation axes plus three rotation axes. This gives you direct control over the camera’s position and orientation rather than hoping a text prompt like “pan slowly” is interpreted correctly.
- Single-GPU Efficiency: While it was trained on 64 H100 GPUs, the bidirectional inference variant can run entirely on a single H100, and its distilled versions can run on an RTX 5090 or a high-memory Mac.
- Two-Stage Refiner: The pipeline first drafts the 60-second layout, then passes it through a 17B long-video refiner to fix visual artifacts and stabilize textures.
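To make the 6-DoF idea concrete, here is a minimal sketch of a “walk forward, then turn left” path expressed as one pose per frame. The `CameraPose` class, the axis conventions, and the function name are all hypothetical and chosen for illustration. The trajectory file format that SANA-WM actually expects is not described in this article, so treat this as a way to think about the data, not as input you can pass to the model.

```python
import math
from dataclasses import dataclass

@dataclass
class CameraPose:
    # Translation in arbitrary scene units: x = right, y = up, z = forward (assumed convention).
    x: float
    y: float
    z: float
    # Rotation in radians. Positive yaw turns the camera to the left (assumed convention).
    yaw: float
    pitch: float
    roll: float

def walk_forward_then_turn_left(num_frames, step, turn_start, yaw_per_frame):
    """Return one pose per frame: straight ahead first, then a steady left turn."""
    poses = []
    x = z = yaw = 0.0
    for frame in range(num_frames):
        poses.append(CameraPose(x, 0.0, z, yaw, 0.0, 0.0))
        if frame >= turn_start:
            yaw += yaw_per_frame
        # Advance along the current heading; a left turn moves the camera toward -x.
        x -= step * math.sin(yaw)
        z += step * math.cos(yaw)
    return poses

if __name__ == "__main__":
    path = walk_forward_then_turn_left(
        num_frames=10, step=0.1, turn_start=5, yaw_per_frame=math.radians(18)
    )
    for i, pose in enumerate(path):
        print(i, round(pose.x, 3), round(pose.z, 3), round(math.degrees(pose.yaw), 1))
```

All six numbers are set explicitly for every frame, which is the difference from describing the motion in a text prompt.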
Requirements before you start
Because this is a massive world model, it has strict hardware requirements. Ensure your system meets these before attempting installation:
- A High-VRAM GPU: An NVIDIA GPU with 24GB+ VRAM (like an RTX 4090/5090 or A6000) for the distilled variants, or a workstation-class card with 80GB VRAM (like an H100) for the full bidirectional model.
- Apple Silicon Alternative: Thanks to community ports, you can also run this on an M2/M3/M4 Max or Ultra Mac, provided you have at least 96GB of Unified Memory.
- Python 3.10+: Required to run the PyTorch backend.
- Hugging Face CLI: Installed on your system to download the large weight files (pip install -U "huggingface_hub[cli]").
- At least 30 GB of free SSD space: The model weights, text encoders, and the long-video refiner are large files.
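Before installing anything, you can check the software-side requirements with a short script. The sketch below uses only the Python standard library and covers the Python version, free disk space, and whether the Hugging Face CLI is on your PATH. The function names are made up for this example, and it deliberately skips the GPU check because that would need PyTorch, which is not installed yet at this stage.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # from the requirements list above
MIN_FREE_GB = 30       # from the requirements list above

def python_ok(version_info=sys.version_info):
    """True if the interpreter is at least Python 3.10."""
    return tuple(version_info[:2]) >= MIN_PYTHON

def disk_ok(free_bytes, required_gb=MIN_FREE_GB):
    """True if the free space covers the required number of gigabytes."""
    return free_bytes >= required_gb * 10**9

def cli_on_path(name="huggingface-cli"):
    """True if the named executable can be found on PATH."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    free_bytes = shutil.disk_usage(".").free
    print("Python 3.10+:        ", "OK" if python_ok() else "too old")
    print("30 GB free disk:     ", "OK" if disk_ok(free_bytes) else "not enough space")
    print("huggingface-cli:     ", "OK" if cli_on_path() else "not found on PATH")
```

Run it from the drive where you plan to store the weights, since the disk check looks at the current directory.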
Step-by-step installation
Step 1 — Clone the official repository
Open your terminal and clone the NVlabs repository:
git clone https://github.com/NVlabs/Sana.git
cd Sana
Step 2 — Create a virtual environment and install dependencies
It is highly recommended to isolate the heavy AI packages:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
(Windows users: Use venv\Scripts\activate to activate the environment.)
Step 3 — Download the SANA-WM and Gemma weights
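Use the Hugging Face CLI to download the bidirectional model and its required text encoder into a local directory. Gemma is a gated repository on Hugging Face, so accept its license on the model page and run huggingface-cli login before downloading. The download will take a while depending on your internet speed: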
huggingface-cli download Efficient-Large-Model/SANA-WM_bidirectional --local-dir models/SANA-WM_bidirectional
huggingface-cli download google/gemma-2-2b-it --local-dir models/gemma-2-2b-it
Step 4 — Provide a starting image
Place a high-quality 1280×720 .jpg or .png image into your project folder. This will act as frame 0 for your world generation. Let’s assume you named it start_frame.jpg.
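If your source image is not already 16:9, you will need to crop it before resizing to 1280×720. The helper below is a small sketch that computes the largest centered 16:9 crop box for any input size. The function name is hypothetical, and it only does the arithmetic. It does not open or modify image files.

```python
TARGET_W, TARGET_H = 1280, 720  # frame size named in this step

def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered 16:9 region."""
    if width * TARGET_H > height * TARGET_W:
        # Image is wider than 16:9, so trim the left and right edges.
        new_w = height * TARGET_W // TARGET_H
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is 16:9 or taller, so trim the top and bottom edges.
    new_h = width * TARGET_H // TARGET_W
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)

if __name__ == "__main__":
    for size in [(1920, 1080), (1000, 1000), (4000, 1000)]:
        print(size, "->", center_crop_box(*size))
```

The returned tuple uses the same (left, top, right, bottom) order as Pillow's `Image.crop`, so if you have Pillow installed, one option is to crop with that box and then call `resize((1280, 720))` on the result.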
Step 5 — Run the world model generation
Execute the inference script, passing in your downloaded models, the starting image, your text description, and the camera trajectory preset:
python scripts/inference_wm.py \
--model_path models/SANA-WM_bidirectional \
--prompt "A first-person view moving slowly forward through a bioluminescent alien forest" \
--image_path start_frame.jpg \
--trajectory forward_slow \
--output_dir results/
Wait for the pipeline to finish processing the base model and the refiner. Once complete, you will find a 60-second 720p .mp4 file waiting in your results/ folder!
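Since the same starting image can be explored along several camera paths, it can be handy to queue one run per trajectory. The sketch below only builds the command lines, reusing the script path and flags from the command above. Note that forward_slow is the only preset named in this guide. The other two names, pan_left and fly_up, are hypothetical placeholders, so check the repository for the presets it actually ships.

```python
import shlex
from pathlib import Path

def build_command(trajectory, prompt, image_path="start_frame.jpg",
                  model_path="models/SANA-WM_bidirectional", output_root="results"):
    """Build the inference command for one trajectory, with its own output folder."""
    out_dir = Path(output_root) / trajectory
    return [
        "python", "scripts/inference_wm.py",
        "--model_path", model_path,
        "--prompt", prompt,
        "--image_path", image_path,
        "--trajectory", trajectory,
        "--output_dir", out_dir.as_posix(),
    ]

if __name__ == "__main__":
    prompt = "A first-person view moving slowly forward through a bioluminescent alien forest"
    # Only "forward_slow" comes from this guide; the other names are placeholders.
    for name in ["forward_slow", "pan_left", "fly_up"]:
        print(shlex.join(build_command(name, prompt)))
```

Printing the commands first lets you review them before committing GPU time. To execute them from Python instead, you could pass each list to `subprocess.run(cmd, check=True)`.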
Common errors and fixes
| Error | What it means | How to fix it |
|---|---|---|
| CUDA Out of Memory (OOM) | Your graphics card ran out of VRAM trying to load the model, the text encoder, and the refiner simultaneously. | You must enable memory offloading. Pass the --offload or --skip_refiner flags in the command line so it dumps inactive models to system RAM to save GPU space. |
| ModuleNotFoundError: flash_attn | Flash Attention is missing or failed to compile for your specific hardware. | Install it directly using pip install flash-attn --no-build-isolation. If you are on a Mac, you must use a community Apple Silicon fork, as native Flash Attention is CUDA-only. |
| Video generates but lacks texture or looks blurry | The first-stage base model completed, but the second-stage long-video refiner failed to initialize. | Ensure you have downloaded the 17B refiner adapter weights and that your GPU has enough memory to hold both the latents and the refiner flow-matching states. |
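A rough back-of-the-envelope calculation shows why the out-of-memory error is so common when everything is loaded at once. The sketch below estimates weight memory only, assuming bf16 storage at 2 bytes per parameter. The 2.6B and 17B figures come from this article, while the 2B figure for Gemma is simply read off the model name and is an assumption. Activations, latents, and framework overhead are ignored, so real usage will be higher.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

def weights_gb(params_billion, dtype="bf16"):
    """Approximate weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]

# Parameter counts in billions. Only the first two are stated in this article.
COMPONENTS = {
    "SANA-WM base model": 2.6,
    "long-video refiner": 17.0,
    "Gemma text encoder (assumed from its name)": 2.0,
}

if __name__ == "__main__":
    total = 0.0
    for name, size in COMPONENTS.items():
        gb = weights_gb(size)
        total += gb
        print(f"{name}: about {gb:.1f} GB")
    print(f"all three resident at once: about {total:.1f} GB")
    print("fits in 24 GB without offloading:", total <= 24)
```

Under these assumptions the refiner dominates the total, which is consistent with the advice above to offload inactive models or skip the refiner on smaller cards.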
Free vs Paid comparison
| Feature | SANA-WM (Free Local AI) | Premium Video APIs (Runway Gen-3 / Sora) |
|---|---|---|
| Cost per Generation | $0 (Just your electricity) | $0.10 to $0.50+ per clip |
| Max Video Duration | ✅ 60 seconds natively | Usually capped at 5 to 10 seconds |
| Camera Control Precision | ✅ Absolute math-based 6-DoF control | ❌ Relies on vague text prompts or simple sliders |
| Data Privacy | ✅ Complete — data never leaves your machine | ❌ Rendered entirely on cloud servers |
| Hardware Required | ⚠️ Massive (Requires high VRAM GPUs) | 🟢 None (Runs in a browser) |
Bottom line: If you are a developer, researcher, or VFX artist who already has access to serious hardware and needs precise 60-second camera paths for free, SANA-WM is currently unmatched in the open-source space. If you just want a quick, photorealistic 5-second video clip and only have a standard laptop, use a paid cloud platform.
Alternatives — 3 similar tools
1. Open-Sora
A community-driven open-source initiative attempting to reproduce OpenAI’s Sora capabilities. It is fantastic for generating high-quality text-to-video clips, though it currently struggles to maintain the strict 60-second spatial consistency and precise camera control that SANA-WM achieves.
🔗 github.com/hpcaitech/Open-Sora
2. CogVideoX (by Zhipu AI)
An incredibly powerful and highly accessible open-source video generation model. It is heavily optimized, allowing users to generate beautiful 6-second videos on consumer-grade GPUs (like an RTX 3060). It is much easier to run than SANA-WM, but lacks the minute-long world simulation focus.
3. Luma Dream Machine
A closed-source, cloud-based platform that offers some of the best image-to-video generation on the market right now. It features intuitive camera control keyframing directly in the browser, making it the best option for users who don’t want to touch Python code, provided they are willing to pay for credits.