A powerful toolkit for fine-tuning Hunyuan Video using LoRA, plus advanced video inference and automatic captioning via QWEN-VL. This guide focuses on the most important aspects: how to run fine-tuning (training) and generation (inference) using Cog, with detailed explanations of all parameters.
- Place your training videos in a ZIP file. Optionally include <video_name>.txt captions alongside each <video_name>.mp4, e.g. (a minimal zip command is sketched after these steps):
  your_data.zip/
  ├── dance_scene.mp4
  ├── dance_scene.txt
  ├── city_stroll.mp4
  └── ...
  Tip: You can use create-video-dataset to easily prepare your training data with automatic QWEN-VL captioning.
- Run the training example command (see below).
- After training, run the inference example command to generate a new video.
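As a minimal sketch of the first step (the clips/ folder and file names are hypothetical, and the caption text is just an example using the default TOK trigger word):

  # Write a caption next to its video, then zip everything flat (-j drops folder paths)
  echo "A cinematic scene of TOK, a dancer spinning under neon lights." > clips/dance_scene.txt
  zip -j your_data.zip clips/*.mp4 clips/*.txt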
- Install Docker (required by Cog).
- Install Cog from cog.run (a quick version check is sketched after this list):
  curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
  chmod +x /usr/local/bin/cog
  pip install cog
- Clone or download this repository.
- From the project root directory, you can run Cog commands with parameters:
  # For training:
  sudo cog train -i "input_videos=@your_videos.zip" -i "trigger_word=MYSTYLE"
  # For inference:
  sudo cog predict -i "prompt=your prompt here" -i "replicate_weights=@/tmp/trained_model.tar"
- See below for detailed parameter explanations and more examples.
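To confirm the Cog CLI is installed and on your PATH before training, you can print its version:

  cog --version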
Use:
sudo cog train \
-i "input_videos=@your_videos.zip" \
[other parameters...]
The result of training is saved to /tmp/trained_model.tar, containing:
- LoRA weights (.safetensors)
- (Optional) ComfyUI-compatible LoRA
- Any logs or training artifacts
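You can inspect the archive with standard tar tooling, for example:

  tar -tf /tmp/trained_model.tar   # lists the .safetensors file and other artifacts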
You can use this output directly in inference by passing it to the replicate_weights parameter:
sudo cog predict \
-i "prompt='Your prompt here'" \
-i "replicate_weights=@/tmp/trained_model.tar" \
[other parameters...]
Below are the key parameters you can supply to cog train. All parameters have validated types and ranges:
• input_videos (Path)
- Description: A ZIP file containing videos (and optional .txt captions).
- Example: -i "input_videos=@my_videos.zip"
• trigger_word (str)
- Description: A "fake" or "rare" word that represents the style or concept you're training on.
- Default: "TOK"
- Example: -i "trigger_word=STYLE3D"
• autocaption (bool)
- Description: Whether to auto-caption your videos using QWEN-VL.
- Default: True
- Example: -i "autocaption=false"
• autocaption_prefix (str)
- Description: Text prepended to all generated captions (helps set consistent context).
- Default: None
- Example: -i "autocaption_prefix='A cinematic scene of TOK, '"
• autocaption_suffix (str)
- Description: Text appended to all generated captions (helps reinforce the concept).
- Default: None
- Example: -i "autocaption_suffix='in the art style of TOK.'"
• epochs (int)
- Description: Number of full passes (epochs) over the dataset.
- Range: 1–2000
- Default: 16
• max_train_steps (int)
- Description: Limit the total number of steps (each step processes one batch). -1 for unlimited.
- Range: -1–1,000,000
- Default: -1
• rank (int)
- Description: LoRA rank. Higher rank can capture more detail but also uses more resources.
- Range: 1–128
- Default: 32
• batch_size (int)
- Description: Batch size (frames per iteration). Lower it to reduce VRAM usage.
- Range: 1–8
- Default: 4
• learning_rate (float)
- Description: Training learning rate.
- Range: 1e-5–1
- Default: 1e-3
• optimizer (str)
- Description: Which optimizer to use. Usually "adamw8bit" is a good default.
- Choices: ["adamw", "adamw8bit", "AdaFactor", "adamw16bit"]
- Default: "adamw8bit"
• timestep_sampling (str)
- Description: Sampling strategy across diffusion timesteps.
- Choices: ["sigma", "uniform", "sigmoid", "shift"]
- Default: "sigmoid"
• consecutive_target_frames (str)
- Description: How many consecutive frames to pull from each video.
- Choices: ["[1, 13, 25]", "[1, 25, 45]", "[1, 45, 89]", "[1, 13, 25, 45]"]
- Default: "[1, 25, 45]"
• frame_extraction_method (str)
- Description: How frames are extracted: from the head of the video, in fixed chunks, with a sliding window, or uniformly sampled.
- Choices: ["head", "chunk", "slide", "uniform"]
- Default: "head"
• frame_stride (int)
- Description: Stride used for slide-based extraction.
- Range: 1–100
- Default: 10
• frame_sample (int)
- Description: Number of samples used in uniform extraction.
- Range: 1–20
- Default: 4
• seed (int)
- Description: Random seed. Use <= 0 for a truly random seed.
- Default: 0
• hf_repo_id (str)
- Description: If you want to push your LoRA to Hugging Face, specify "username/my-video-lora".
- Default: None
• hf_token (Secret)
- Description: Hugging Face token for uploading to a private or public repository.
- Default: None
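Putting several of these together, a run that combines captioning options with a Hugging Face upload might look like the following sketch (the repo ID and token are placeholders):

  sudo cog train \
    -i "input_videos=@your_videos.zip" \
    -i "trigger_word=MYSTYLE" \
    -i "autocaption=true" \
    -i "autocaption_prefix='A cinematic scene of MYSTYLE, '" \
    -i "epochs=16" \
    -i "rank=32" \
    -i "hf_repo_id=username/my-video-lora" \
    -i "hf_token=YOUR_HF_TOKEN"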
- Simple Training
  sudo cog train \
    -i "input_videos=@your_videos.zip" \
    -i "trigger_word=MYSTYLE" \
    -i "epochs=4"
This runs 4 epochs with default batch size and autocaption.
- Memory-Constrained Training
  sudo cog train \
    -i "input_videos=@your_videos.zip" \
    -i "rank=16" \
    -i "batch_size=1" \
    -i "gradient_checkpointing=true"
Uses a lower rank and smaller batch size to reduce VRAM usage, plus gradient checkpointing.
- Motion-Focused Training
  sudo cog train \
    -i "input_videos=@your_videos.zip" \
    -i "consecutive_target_frames=[1, 45, 89]" \
    -i "frame_extraction_method=slide" \
    -i "frame_stride=10"
Extracts frames in sliding windows to capture more motion variety.
- Quick Test Run
  sudo cog train \
    -i "input_videos=@your_videos.zip" \
    -i "rank=16" \
    -i "epochs=4" \
    -i "max_train_steps=100" \
    -i "batch_size=1" \
    -i "gradient_checkpointing=true"
Minimal training to verify your setup and data.
- Style Focus
  sudo cog train \
    -i "input_videos=@your_videos.zip" \
    -i "consecutive_target_frames=[1]" \
    -i "frame_extraction_method=uniform" \
    -i "frame_sample=8" \
    -i "epochs=16"
Optimized for learning static style elements rather than motion.
Use:
sudo cog predict \
-i "prompt='Your prompt here'" \
[other parameters...]
The generated video is saved to the output directory (usually /src or /outputs inside Docker), and Cog returns the path.
Below are the key parameters for cog predict:
• prompt (str)
- Description: Your text prompt for the scene or style.
- Example: -i "prompt='A cinematic shot of a forest in MYSTYLE'"
• lora_url (str)
- Description: URL or Hugging Face repo ID for the LoRA weights.
- Example: -i "lora_url='myuser/my-lora-repo'"
• lora_strength (float)
- Description: How strongly the LoRA style is applied.
- Range: -10.0–10.0
- Default: 1.0
• scheduler (str)
- Description: The diffusion sampling/flow algorithm.
- Choices: ["FlowMatchDiscreteScheduler", "SDE-DPMSolverMultistepScheduler", "DPMSolverMultistepScheduler", "SASolverScheduler", "UniPCMultistepScheduler"]
- Default: "DPMSolverMultistepScheduler"
• steps (int)
- Description: Number of diffusion steps.
- Range: 1–150
- Default: 50
• guidance_scale (float)
- Description: How strongly the prompt influences the generation.
- Range: 0.0–30.0
- Default: 6.0
• flow_shift (int)
- Description: Adjusts motion consistency across frames.
- Range: 0–20
- Default: 9
• num_frames (int)
- Description: Total frames in the output video.
- Range: 1–1440
- Default: 33
• width (int), height (int)
- Description: Dimensions of generated frames.
- Range: width (64–1536), height (64–1024)
- Default: 640x360
• denoise_strength (float)
- Description: Controls how strongly noise is applied each step: 0 = minimal noise, 2 = heavy noise.
- Range: 0.0–2.0
- Default: 1.0
• force_offload (bool)
- Description: Offload layers to CPU for lower VRAM usage.
- Default: True
• frame_rate (int)
- Description: Frames per second in the final video.
- Range: 1–60
- Default: 16
• crf (int)
- Description: H.264 compression quality. Lower values mean better quality (and larger files).
- Range: 0–51
- Default: 19
• enhance_weight (float)
- Description: Strength of optional enhancement effect.
- Range: 0.0–2.0
- Default: 0.3
• enhance_single (bool) & enhance_double (bool)
- Description: Whether to enable enhancement on single frames or across pairs of frames.
- Default: True, True
• enhance_start (float) & enhance_end (float)
- Description: Control when in the video enhancement starts and ends (fractional times in the 0.0–1.0 range).
- Defaults: 0.0 (start), 1.0 (end)
• seed (int)
- Description: Random seed for reproducible output.
- Default: random if not provided
• replicate_weights (Path)
- Description: Path to a local .tar containing LoRA weights from replicate training.
- Default: None
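As one sketch combining the enhancement and seeding options above (the prompt and values are illustrative):

  sudo cog predict \
    -i "prompt='A foggy harbor at dawn in the style of MYSTYLE'" \
    -i "replicate_weights=@/tmp/trained_model.tar" \
    -i "enhance_weight=0.5" \
    -i "enhance_start=0.2" \
    -i "enhance_end=0.8" \
    -i "seed=42"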
- Basic Inference with Local LoRA
  sudo cog predict \
    -i "prompt='A serene lake at sunrise in the style of MYSTYLE'" \
    -i "lora_url='local-file.safetensors'" \
    -i "width=512" \
    -i "height=512" \
    -i "steps=30"
- Advanced Motion and Quality
  sudo cog predict \
    -i "prompt='TOK winter cityscape, moody lighting'" \
    -i "lora_url='myuser/my-lora-repo'" \
    -i "steps=50" \
    -i "flow_shift=15" \
    -i "num_frames=80" \
    -i "frame_rate=30" \
    -i "crf=17" \
    -i "lora_strength=1.2"
Here, we use more frames, a higher frame rate, and a lower CRF for higher quality.
- Using Replicate Tar
  sudo cog predict \
    -i "prompt='An astronaut dancing on Mars in style TOK'" \
    -i "replicate_weights=@trained_model.tar" \
    -i "guidance_scale=8" \
    -i "num_frames=45"
Instead of lora_url, we pass a local .tar with LoRA weights.
- Quick Preview
  sudo cog predict \
    -i 'steps=30' \
    -i 'width=512' \
    -i 'height=512' \
    -i 'num_frames=33' \
    -i 'force_offload=true'
- Smooth Motion
  sudo cog predict \
    -i 'scheduler=FlowMatchDiscreteScheduler' \
    -i 'flow_shift=15' \
    -i 'frame_rate=30' \
    -i 'num_frames=89'
- Reduce OOM Errors
  - Use a smaller batch_size or lower rank during training.
  - Enable force_offload=true during inference.
- Better Quality
  - Increase steps and guidance_scale.
  - Use a lower crf (e.g., 17 or 18).
- Faster Training
  - For smaller datasets, reduce epochs.
  - Increase learning_rate slightly (e.g., 2e-3) while monitoring for overfitting.
- Motion Emphasis
  - Use frame_extraction_method=slide or consecutive_target_frames=[1, 25, 45] during training for improved motion consistency.
  - Adjust flow_shift (5–15 range) during inference.
- Style Activation
  - Always include your trigger_word in the inference prompt (see the end-to-end sketch below).
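For example, an end-to-end pass that bakes these tips in might look like the following sketch (file names and the prompt are placeholders):

  # 1. Train with a distinctive trigger word
  sudo cog train \
    -i "input_videos=@your_videos.zip" \
    -i "trigger_word=MYSTYLE"

  # 2. Generate, repeating the trigger word in the prompt to activate the style
  sudo cog predict \
    -i "prompt='A rainy street at night in the style of MYSTYLE'" \
    -i "replicate_weights=@/tmp/trained_model.tar" \
    -i "force_offload=true"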
This project is released under the MIT License.
Please see the LICENSE file for details.