Running Stable Diffusion on a Budget GPU: What Actually Works Below 8GB VRAM

Generating images on the GPU you already own, not the one Reddit insists you buy

Every thread about running Stable Diffusion locally eventually arrives at the same smug conclusion: just buy a 4090. This is wonderful advice if you have a couple of spare grand and a power supply that doesn’t sound like a hairdryer. The rest of us are sitting on a 6GB laptop card, an old GTX 1060, or a 4GB GPU that the internet has decided is e-waste. Good news: the internet is wrong, and I have spent enough late nights proving it to write this down.

I run a 6GB card in a machine that mostly exists to transcode video, and it generates images perfectly happily. The trick is understanding where the VRAM actually goes and which knobs reclaim it without turning every render into a minute-long ordeal.

So where does the VRAM go? There are three consumers. The model weights themselves are the big one: a standard SD 1.5 checkpoint in fp16 is roughly 2GB, while SDXL is closer to 6.5GB before you’ve generated a single pixel. Then there are the latents and the activations computed on them, which grow with pixel count rather than side length, so doubling both dimensions at least quadruples the memory. Finally there’s the VAE decode step at the end, a notorious memory spike that has crashed more runs than anything else.
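To put rough numbers on that, here is a back-of-envelope sketch. The parameter count for SD 1.5 is my own approximate assumption (UNet plus text encoder plus VAE), and the function names are made up for illustration. The latent tensor itself is tiny; what matters is that the work done on it scales with this count.

```python
# Back-of-envelope VRAM arithmetic; estimates, not measurements.

BYTES_FP16 = 2

def weights_gb(param_count, bytes_per_param=BYTES_FP16):
    """Size of the model weights in GB (10^9 bytes)."""
    return param_count * bytes_per_param / 1e9

def latent_elements(width, height, channels=4, downscale=8):
    """Number of values in the latent tensor for one image.

    Stable Diffusion's VAE shrinks each side by 8 and uses 4 latent channels.
    """
    return channels * (width // downscale) * (height // downscale)

# Assumed, approximate parameter count for SD 1.5 (about 1.07 billion).
SD15_PARAMS = 1.07e9

print(f"SD 1.5 fp16 weights: ~{weights_gb(SD15_PARAMS):.1f} GB")
for side in (512, 1024):
    print(f"{side}x{side}: {latent_elements(side, side)} latent values")
```

Going from 512x512 to 1024x1024 multiplies the latent count by four, which is the floor for how much extra memory the denoising pass needs.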

On 8GB or less, SDXL is a fight and SD 1.5 is a pleasure. I’d rather generate ten good 512x512 images and upscale the keeper than wait for one SDXL render that OOMs at the VAE step anyway.

If you’re running AUTOMATIC1111’s web UI — still the path of least resistance — most of the battle is won in the launch arguments. Here is the configuration I actually use on the 6GB box:

# webui-user.sh
export COMMANDLINE_ARGS="--medvram-sdxl \
  --xformers \
  --opt-split-attention \
  --no-half-vae"

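--medvram (or --medvram-sdxl, which applies the same offloading only while an SDXL model is loaded, so SD 1.5 keeps its full speed) moves model components in and out of VRAM as needed, trading a little speed for a lot of headroom. --lowvram exists for 4GB cards and below; it works, but it’s slow enough that you’ll feel every second. --xformers is non-negotiable on NVIDIA: it cuts attention memory dramatically and speeds things up at the same time, which is a rare free lunch. --opt-split-attention is an older attention optimization; with xformers active it should be redundant, so treat it as a fallback. --no-half-vae fixes the black-image-output bug that plagues certain cards when the VAE runs in fp16, at the cost of running the VAE in fp32, which makes the decode spike bigger.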
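If you want the same rules of thumb in one place, here is a small sketch that turns a VRAM figure into an argument string. The function name and the exact thresholds are my own illustrative assumptions; only the flags themselves come from the web UI.

```python
def launch_args(vram_gb, nvidia=True):
    """Return a COMMANDLINE_ARGS string for a given amount of VRAM.

    Thresholds are illustrative assumptions based on the rules of thumb
    in this article, not values taken from the web UI itself.
    """
    args = []
    if vram_gb <= 4:
        args.append("--lowvram")       # slowest option, smallest footprint
    elif vram_gb <= 8:
        args.append("--medvram-sdxl")  # offload only while an SDXL model is loaded
    if nvidia:
        args.append("--xformers")      # memory-efficient attention
    args.append("--no-half-vae")       # avoid black images from an fp16 VAE
    return " ".join(args)

print(launch_args(6))
```

Paste the result into COMMANDLINE_ARGS in webui-user.sh and adjust from there; a card sitting right on a threshold is worth testing both ways.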

The two operations that blow your memory budget are high-resolution generation and the VAE decode. Both have tiling solutions. For the VAE, enabling tiled VAE through an extension chops the decode into chunks the card can swallow one at a time. (--opt-sdp-no-mem-attention sometimes gets suggested for this, but it selects an attention implementation and does nothing to tile the VAE.) For going large, don’t generate at high resolution directly; generate at 512x512 and use a tiled upscaler.
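A rough way to see why tiling the decode helps is to size a single full-resolution activation tensor inside the VAE decoder. This is a sketch under assumptions: the 128-channel width for the decoder's last stage is my assumption, and the decoder holds several such tensors at once, so real peaks are higher than these figures.

```python
def activation_mib(width, height, channels=128, bytes_per_value=4):
    """Size in MiB of one full-resolution activation tensor in the VAE decoder.

    channels=128 is an assumed width for the decoder's final stage;
    bytes_per_value=4 means fp32, which is what --no-half-vae runs.
    """
    return channels * width * height * bytes_per_value / 2**20

whole = activation_mib(1024, 1024)  # decode the whole frame at once
tile = activation_mib(512, 512)     # decode one 512x512 tile at a time
print(f"whole frame: {whole:.0f} MiB per tensor, one tile: {tile:.0f} MiB per tensor")
```

The absolute numbers are approximate, but the ratio is the point: a tile a quarter of the area needs a quarter of the memory at the moment that usually kills the run.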

This is where the real workflow lives:

1. Generate base image at 512x512 (or 768x512)
2. Pick the keeper
3. Send to img2img with denoising ~0.3
4. Set "SD upscale" script, tile overlap 64, scale 2x
5. VAE decodes per-tile, never the whole frame at once

A 512x512 base upscaled 2x gives you a clean 1024x1024 that never required more than a few gigabytes at any single moment. The card thinks it’s doing easy work the whole time, because it is.
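To make the tiling concrete, here is a minimal sketch of how a 1024x1024 target can be covered by overlapping 512x512 tiles. It is my own illustration of the idea, not the code the SD upscale script runs, and the function names are hypothetical. The 64 pixel overlap is treated as a minimum; tiles are spread evenly, so the actual overlap can be larger.

```python
import math

def tile_origins(size, tile=512, overlap=64):
    """Left (or top) coordinates of overlapping tiles covering `size` pixels."""
    if size <= tile:
        return [0]
    stride = tile - overlap
    count = math.ceil((size - overlap) / stride)
    # Spread tiles evenly so the last one ends exactly at the image edge.
    step = (size - tile) / (count - 1)
    return [round(i * step) for i in range(count)]

def tile_boxes(width, height, tile=512, overlap=64):
    """(left, top, right, bottom) boxes for every tile, row by row."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]

boxes = tile_boxes(1024, 1024)
print(len(boxes), "tiles of 512x512")
```

Each of those tiles is a plain 512x512 job as far as the card is concerned, which is why the peak memory stays where SD 1.5 is comfortable.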

Below 4GB you start running out of tricks. You can offload to system RAM, but at that point the GPU is barely helping and a CPU render — slow as it is — becomes almost competitive for SD 1.5. There’s also a steady trickle of quantised and pruned checkpoints; an fp16 model pruned of its training-only weights can shave a few hundred megabytes, which is occasionally the difference between OOM and success.

Honestly, though, the cleanest win on a truly tiny card is just patience and SD 1.5. The community’s best 1.5 fine-tunes still produce gorgeous output, and they fit where SDXL never will.

On my 6GB card, SD 1.5 at 512x512 with 20 sampling steps lands around 8 seconds an image with xformers on. SDXL with --medvram-sdxl at 1024x1024 takes me somewhere north of 40 seconds and occasionally falls over on the VAE if I’ve got a browser hogging VRAM in the background. That ratio tells you everything about where to spend your time.
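Turning those timings into throughput makes the gap easier to feel. This is plain arithmetic on my own numbers above, with the 40 second SDXL figure treated as a best case.

```python
def images_per_hour(seconds_per_image):
    """Images generated in an hour at a steady per-image time."""
    return 3600 / seconds_per_image

sd15 = images_per_hour(8)   # about 8 s per image for SD 1.5 at 512x512
sdxl = images_per_hour(40)  # about 40 s per image for SDXL, a best case
print(f"SD 1.5: {sd15:.0f}/hour, SDXL: {sdxl:.0f}/hour, ratio {sd15 / sdxl:.0f}x")
```

Your card will give different absolute numbers, so swap in your own timings before drawing conclusions.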

So is it worth doing? If you already own any reasonably modern NVIDIA card with 4GB or more, absolutely. You will generate thousands of images for the cost of electricity, keep every prompt private, and never wait in a queue or hit a credit limit. The compromise is real (you live in SD 1.5 land and upscale your way to resolution rather than generating it directly), but the results are genuinely good and the iteration loop is fast.

Who should not bother? Anyone who needs SDXL’s native high-resolution coherence for paid work, on a deadline, every day. For them the cloud or a bigger card pays for itself. For everyone else — the tinkerers, the hobbyists, the people who just want to make weird pictures at 1am — the budget GPU you already own is more than enough. Don’t let the 4090 evangelists tell you otherwise.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.