Hi everyone,
I’m training a conditional diffusion model from scratch on ~13k paired images, each 1×224×224.
My UNet2DModel takes 2 channels (noisy_opd + condition image) and outputs 1 channel, with block_out_channels (128,128,128,256,256) and attention on the last down/up block.
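For reference, a minimal sketch of that configuration (simplified; the exact down/up block types here are my interpretation of "attention on the last down/up block"):

```python
from diffusers import UNet2DModel

# 2 input channels (noisy OPD + condition), 1 output channel,
# attention only at the deepest resolution level.
model = UNet2DModel(
    sample_size=224,
    in_channels=2,             # concatenated [noisy_opd, condition]
    out_channels=1,            # predicted noise for the OPD channel
    layers_per_block=2,
    block_out_channels=(128, 128, 128, 256, 256),
    down_block_types=(
        "DownBlock2D", "DownBlock2D", "DownBlock2D", "DownBlock2D", "AttnDownBlock2D",
    ),
    up_block_types=(
        "AttnUpBlock2D", "UpBlock2D", "UpBlock2D", "UpBlock2D", "UpBlock2D",
    ),
)
```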
Training uses batch size 8 with gradient accumulation 2 (effective batch size 16), AdamW with lr 2e-4, EMA decay 0.9995, a classifier-free guidance (CFG) condition-drop probability of 0.1, a cosine LR schedule with 500 warmup steps, and a DDIMScheduler with 1000 training timesteps (prediction_type="epsilon").
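The training step looks roughly like this (simplified sketch; variable names like `opd`, `cond`, and `total_steps` are just for illustration):

```python
import torch
import torch.nn.functional as F
from diffusers import DDIMScheduler
from diffusers.optimization import get_cosine_schedule_with_warmup
from diffusers.training_utils import EMAModel

# Assumes `model` is the UNet above and each batch gives `opd` / `cond`
# tensors of shape (B, 1, 224, 224); `total_steps` is the planned optimizer-step count.
noise_scheduler = DDIMScheduler(num_train_timesteps=1000, prediction_type="epsilon")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=total_steps
)
ema = EMAModel(model.parameters(), decay=0.9995)

def training_step(opd, cond):
    noise = torch.randn_like(opd)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (opd.shape[0],), device=opd.device
    )
    # Forward diffusion: add noise to the clean OPD at the sampled timesteps.
    noisy_opd = noise_scheduler.add_noise(opd, noise, timesteps)

    # CFG training: zero out the condition for ~10% of the samples.
    drop = torch.rand(opd.shape[0], device=opd.device) < 0.1
    cond = torch.where(drop[:, None, None, None], torch.zeros_like(cond), cond)

    # Predict epsilon from the concatenated [noisy_opd, condition] input.
    noise_pred = model(torch.cat([noisy_opd, cond], dim=1), timesteps).sample
    return F.mse_loss(noise_pred, noise)

# In the loop (gradient accumulation of 2 omitted for brevity):
#   loss = training_step(opd, cond); loss.backward()
#   optimizer.step(); lr_scheduler.step(); ema.step(model.parameters()); optimizer.zero_grad()
```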
So far I’ve trained for ~150 epochs, which is around 121,800 optimizer steps. For sampling I run DDIM with 50 inference steps, starting from noise and concatenating [noisy, condition] at each timestep.
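And the sampling loop, roughly (guidance at inference omitted for brevity):

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000, prediction_type="epsilon")
scheduler.set_timesteps(50)

@torch.no_grad()
def sample(model, cond):
    # EMA weights would normally be copied into the model first, e.g. ema.copy_to(model.parameters()).
    x = torch.randn(cond.shape[0], 1, 224, 224, device=cond.device)  # start from pure noise
    for t in scheduler.timesteps:
        model_input = torch.cat([x, cond], dim=1)      # [noisy, condition]
        noise_pred = model(model_input, t).sample
        x = scheduler.step(noise_pred, t, x).prev_sample
    return x
```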
The problem is that even after all this training, the generated OPD images remain very noisy and never converge to clean outputs.
Is DDIMScheduler appropriate for training from scratch, or should I use DDPM for training and DDIM only for inference? Could my setup (UNet size, scheduler choice, EMA, or number of inference steps) explain why the model still outputs so much noise?
Any advice would be greatly appreciated!