-
You'll probably need to use a quantised base model, which frees up enough memory to experiment with adaptive optimisers like Prodigy, D-Adaptation, and Adafactor (though here we really just use it like a more efficient AdamW).
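For reference, a minimal sketch of what int8 quantisation of the base transformer looks like with optimum-quanto, which is roughly what `--base_model_precision=int8-quanto` asks the trainer to do. The model repo and which module you quantise are assumptions here, not the trainer's exact internals:

```python
import torch
from diffusers import FluxTransformer2DModel
from optimum.quanto import quantize, freeze, qint8

# Load only the FLUX transformer (the part that dominates VRAM use).
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Quantise the weights to int8 and freeze them; any LoRA adapters added on top
# are typically kept in higher precision, which is what frees headroom for
# heavier optimisers.
quantize(transformer, weights=qint8)
freeze(transformer)
```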
-
@billnye2 How are you running the FLUX prompts with the new LoRA after you trained it?
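(For anyone landing here with the same question: one way to do this is with diffusers, assuming the LoRA was exported in a diffusers-compatible format. The output directory and weight filename below are placeholders, not the poster's actual setup.)

```python
import torch
from diffusers import FluxPipeline

# Load the base FLUX.1-dev model in bf16.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # optional: trade speed for lower VRAM use

# Attach the freshly trained LoRA (hypothetical path and filename).
pipe.load_lora_weights("output/models", weight_name="pytorch_lora_weights.safetensors")

image = pipe(
    "a handsome young girl leaning at the wall, wearing a red jacket",
    num_inference_steps=20,
    guidance_scale=3.5,
).images[0]
image.save("lora_test.png")
```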
-
Think I finally got the kinks out; it's training now. The odd thing is that my config/config.env file isn't being picked up, so I had to pass everything as command-line options; if anyone has troubleshooting tips there, I'd appreciate them.
I'm running on an NVIDIA RTX 6000 Ada 48 GB GPU; at batch size 1 I'm getting around 10-12 seconds per iteration, with ~30 GB of VRAM used. One thing I can't figure out: I have 76 sample images, but it seems to pick up only 19 of them.
-
Could be a fluke, but FWIW I'm seeing better results after making a couple of changes in SimpleTuner:
This is with LR 2e-4, batch size 1, gradient accumulation steps 1, rank 32, AdamW8Bit, and a sine LR schedule with a 2000-step period, set up for 10k steps total, but I'm already seeing better results by 1300 steps than after several thousand steps on previous runs. LoRA, not DoRA, since Comfy couldn't load DoRAs last I checked.
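For anyone unsure what a sine schedule with a 2000-step period does to the learning rate, here's a small standalone sketch using PyTorch's LambdaLR. This is a conceptual illustration, not SimpleTuner's actual scheduler implementation; the base LR and period are just the values from the run above:

```python
import math
import torch

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for the LoRA parameters
optimizer = torch.optim.AdamW(params, lr=2e-4)  # base LR from the run above

period = 2000  # steps per full sine cycle

def sine_lr(step: int) -> float:
    # Scales the base LR by a factor between 0 and 1 over each 2000-step cycle.
    return 0.5 * (1.0 + math.sin(2.0 * math.pi * step / period))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=sine_lr)

for step in range(3):
    optimizer.step()   # gradients omitted; this only demonstrates the schedule
    scheduler.step()
    print(step, scheduler.get_last_lr())
```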
-
Hey! This could be a simple mistake I've made in my settings. Running `bash train.sh` fails with:

```
subprocess.CalledProcessError: Command '['/workspace/SimpleTuner/.venv/bin/python', 'train.py', '--model_type=lora', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--gradient_checkpointing', '--set_grads_to_none', '--gradient_accumulation_steps=1', '--resume_from_checkpoint=latest', '--snr_gamma=5', '--data_backend_config=config/multidatabackend.json', '--num_train_epochs=0', '--max_train_steps=30000', '--metadata_update_interval=65', '--use_8bit_adam', '--learning_rate=1e-5', '--lr_scheduler=constant', '--seed', '42', '--lr_warmup_steps=300', '--output_dir=output/models', '--inference_scheduler_timestep_spacing=trailing', '--training_scheduler_timestep_spacing=trailing', '--allow_tf32', '--mixed_precision=bf16', '--base_model_precision=int8-quanto', '--flux', '--train_batch=1', '--max_workers=32', '--read_batch_size=25', '--write_batch_size=64', '--caption_dropout_probability=0.1', '--torch_num_threads=8', '--image_processing_batch_size=32', '--vae_batch_size=4', '--validation_prompt=a handsome young girl leaning at the wall, wearing a red jacket', '--num_validation_images=1', '--validation_num_inference_steps=20', '--validation_seed=42', '--minimum_image_size=1024', '--resolution=1024', '--validation_resolution=1024', '--resolution_type=pixel', '--checkpointing_steps=100', '--checkpoints_total_limit=3', '--validation_steps=50', '--tracker_run_name=flux-winterwonderland', '--tracker_project_name=lora-training', '--validation_guidance=3.5', '--validation_guidance_rescale=0.0', '--validation_negative_prompt=']' returned non-zero exit status 2.
```
-
Here are some config.env recommendations. The Prodigy optimizer is currently the easiest option, since it doesn't require guessing a reasonable learning rate.
The adamw_bf16 optimizer is faster and uses less VRAM, but requires guessing a good learning rate. Here are some settings I'm currently testing:
The default setting … People with more than 24 GB of VRAM can try running without quantization (remove the quantization setting from config.env). Don't be alarmed if you see a few bad validation images; the training may still recover. Update: if you want to increase …
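To illustrate why Prodigy removes the LR-guessing step, here's a small standalone sketch using the upstream prodigyopt package. This follows the Prodigy library's own API rather than SimpleTuner's config, so treat it as a conceptual example:

```python
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(16, 16)  # stand-in for the LoRA parameters being trained

# With Prodigy you typically leave lr at 1.0 and let the optimizer estimate
# the effective step size on its own as training progresses.
optimizer = Prodigy(model.parameters(), lr=1.0, weight_decay=0.01)

x = torch.randn(4, 16)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```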
-
A bunch of updates have happened. If you're starting a new LoRA run, it may be better to begin with the new defaults in mind.
To match the x-flux trainer: …
-
So I've been testing my results, and most of my LoRAs turned out quite bad. Or so I thought. After some more extended testing, I think the big issue is that training might be breaking the CFG distillation, so regular CFG has to be reintroduced during sampling. With that, the quality of my LoRA outputs increased quite a bit.
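For context, "regular CFG" here means running both a conditional and an unconditional prediction at each sampling step and blending them, rather than relying on FLUX's distilled guidance embedding alone. A minimal conceptual sketch of that blend (the predict_noise callable and its arguments are hypothetical placeholders, not a real pipeline API):

```python
import torch

def classifier_free_guidance(
    predict_noise,          # hypothetical callable: (latents, embeds, t) -> noise prediction
    latents: torch.Tensor,
    cond_embeds: torch.Tensor,
    uncond_embeds: torch.Tensor,
    t: torch.Tensor,
    cfg_scale: float = 3.5,
) -> torch.Tensor:
    # Two forward passes per step: one with the prompt, one with the empty/negative prompt.
    noise_cond = predict_noise(latents, cond_embeds, t)
    noise_uncond = predict_noise(latents, uncond_embeds, t)
    # Standard CFG blend: push the prediction away from the unconditional direction.
    return noise_uncond + cfg_scale * (noise_cond - noise_uncond)
```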
-
Just starting this thread so people can contribute what they've learned about optimal hyperparameters etc. for training FLUX LoRAs.
For myself, using 10 images at a 1e-3 or 1e-4 learning rate learns a little, but burns heavily by 500-1000 steps; surprisingly, it fluctuates every few hundred steps between being badly burned and generating something coherent. A 1e-7 learning rate didn't really produce any change after 1000 steps.