Running Stable Diffusion on a Budget GPU: What Actually Works Below 8GB VRAM

Generating images on the GPU you already own, not the one Reddit insists you buy

Smarc

27-02-2024 1834 words 9 min read

Every thread about running Stable Diffusion locally eventually arrives at the same smug conclusion: just buy a 4090. This is wonderful advice if you have a spare grand and a power supply that doesn’t sound like a hairdryer. The rest of us are sitting on a 6GB laptop card, an old GTX 1060, or a 4GB GPU that the internet has decided is e-waste. Good news: the internet is wrong, and I have spent enough late nights proving it to write this down.

I run a 6GB card in a machine that mostly exists to transcode video, and it generates images perfectly happily. It has produced somewhere north of ten thousand images at this point, none of which cost me a per-image credit or leaked a prompt to a third party. The trick is understanding where the VRAM actually goes and which knobs reclaim it without turning every render into a minute-long ordeal. Nothing here requires a soldering iron or a firmware flash — just a handful of launch flags, an honest appraisal of what your card can and cannot do, and a workflow that plays to its strengths instead of pretending it’s something bigger.

Where your VRAM actually goes

There are three consumers, and if you understand them you can predict every out-of-memory error before it happens.

The model weights themselves are the big one: a standard SD 1.5 checkpoint in fp16 is roughly 2GB, while SDXL is closer to 6.5GB before you’ve generated a single pixel. That number alone tells you why 8GB cards struggle with SDXL — most of your budget is gone before any work begins. If you add a couple of LoRAs and a refiner, SDXL wants more headroom than a nominal 8GB card actually exposes to CUDA once the desktop compositor has taken its cut.
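To make that budget concrete, here is a back-of-envelope sketch in shell. The checkpoint sizes are the ballpark figures above converted to MiB; the 500 MiB desktop overhead and the function name `vram_headroom_mib` are my own placeholders, so measure your own machine before trusting the result.

```shell
# Rough VRAM headroom in MiB, integer arithmetic only.
# Arguments: card size, resident model weights, optional desktop overhead.
# The 500 MiB default overhead is an assumption, not a measurement.
vram_headroom_mib() {
    card_mib=$1
    weights_mib=$2
    desktop_mib=${3:-500}
    echo $(( card_mib - weights_mib - desktop_mib ))
}

vram_headroom_mib 6144 2048   # SD 1.5 (~2GB) on a 6GB card -> 3596
vram_headroom_mib 8192 6656   # SDXL (~6.5GB) on an 8GB card -> 1036
```

Whatever is left has to hold the sampling activations and the VAE decode, which is why the second number is so uncomfortable.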

Then there’s the latent image, which grows with resolution, and crucially with the square of the side length: doubling both dimensions quadruples the memory. A 512×512 latent is trivial; a 1536×1536 one is nine times larger. This is why “just render bigger” is a trap on a small card, and why the whole low-VRAM game is really a game of keeping the latent small and adding detail afterwards.
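The scaling is easy to check by hand. A minimal sketch, assuming the usual Stable Diffusion latent layout of 4 channels with each side divided by 8 (the function name `latent_elements` is mine). The raw latent is tiny in bytes; the ratio is what matters, because the activations computed from it grow at least as fast.

```shell
# Number of values in an SD-style latent: 4 channels, width/8 by height/8.
latent_elements() {
    width=$1
    height=$2
    echo $(( 4 * (width / 8) * (height / 8) ))
}

small=$(latent_elements 512 512)     # 4 * 64 * 64   = 16384
large=$(latent_elements 1536 1536)   # 4 * 192 * 192 = 147456
echo "ratio: $(( large / small ))"   # prints "ratio: 9"
```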

Finally there’s the VAE decode step at the end, which is a notorious memory spike that has crashed more runs than anything else. It’s maddening precisely because it happens last: you wait through the whole sampling process, the progress bar hits 100%, and then the card falls over converting the finished latent back into pixels. Everything you do to survive on a budget card is, in one way or another, about smoothing out that spike.

On 8GB or less, SDXL is a fight and SD 1.5 is a pleasure. I’d rather generate ten good 512×512 images and upscale the keeper than wait for one SDXL render that OOMs at the VAE step anyway. If you’re new to local generation and coming at this from the API angle, my write-up on turning ComfyUI into an image API covers the automation side once you’ve got the basics rendering.

The flags that actually matter

If you’re running AUTOMATIC1111’s web UI — still the path of least resistance for a first setup — most of the battle is won in the launch arguments. Here is the configuration I actually use on the 6GB box:

# webui-user.sh
# Launch flags for the 6GB NVIDIA box; each one is explained below.
export COMMANDLINE_ARGS="--medvram-sdxl \
  --xformers \
  --opt-split-attention \
  --no-half-vae"

--medvram (or --medvram-sdxl, which only applies the penalty to SDXL) moves model components in and out of VRAM as needed, trading a little speed for a lot of headroom. The way it works is that the text encoder, the U-Net, and the VAE are never all resident at once; the framework shuttles them across the PCIe bus as each stage runs. You pay for that shuttling in wall-clock time, but you buy the ability to run a model that otherwise wouldn’t fit at all.
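A toy illustration of why that shuttling helps: if only one component is resident at a time, the weight footprint is the largest component rather than the sum. The sizes below are approximate fp16 figures for SD 1.5 (parameter count times two bytes) that I am using purely for illustration; real usage also includes activations, and the helper names are hypothetical.

```shell
# Approximate fp16 sizes in MB for SD 1.5 (parameter count x 2 bytes).
# Rounded figures, used here only for illustration.
TEXT_ENCODER_MB=246   # ~123M parameters
UNET_MB=1720          # ~860M parameters
VAE_MB=168            # ~84M parameters

# Sum of all arguments: everything resident at once.
sum_mb() {
    total=0
    for m in "$@"; do total=$(( total + m )); done
    echo "$total"
}

# Largest argument: only one stage resident at a time.
max_mb() {
    peak=0
    for m in "$@"; do
        if [ "$m" -gt "$peak" ]; then peak=$m; fi
    done
    echo "$peak"
}

echo "all resident:      $(sum_mb "$TEXT_ENCODER_MB" "$UNET_MB" "$VAE_MB") MB"   # 2134
echo "one stage at once: $(max_mb "$TEXT_ENCODER_MB" "$UNET_MB" "$VAE_MB") MB"   # 1720
```

On SD 1.5 the saving is modest because the U-Net dominates; the same arithmetic is far kinder to SDXL, where the text encoders are a much bigger slice.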

--lowvram exists for 4GB cards and below; it splits the model even more aggressively, and it works, but it’s slow enough that you’ll feel every second. Reach for it only when --medvram still OOMs.

--xformers is non-negotiable on NVIDIA: it swaps in memory-efficient attention kernels that cut attention memory dramatically and speed things up at the same time, which is a rare free lunch. Newer builds can use PyTorch’s native scaled-dot-product attention (--opt-sdp-attention) to similar effect if xformers won’t install cleanly for your CUDA version, so keep that in your back pocket.

--no-half-vae fixes the black-image-output bug that plagues certain cards when the VAE runs in fp16. It forces the decode to full precision, costs a little memory, and saves you from staring at a black square wondering what you did wrong.
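If you would rather not pick the memory flag by hand, something like this sketch can make a first guess from the card's reported VRAM. The thresholds are my own rule of thumb based on the advice above, not anything the web UI defines, and `suggest_vram_flag` is a hypothetical helper; the `nvidia-smi` query options are standard.

```shell
# Suggest a memory flag from total VRAM in MiB.
# Thresholds are an assumption: <=4GB -> --lowvram, <=8GB -> --medvram.
suggest_vram_flag() {
    total_mib=$1
    if [ "$total_mib" -le 4096 ]; then
        echo "--lowvram"
    elif [ "$total_mib" -le 8192 ]; then
        echo "--medvram"
    else
        echo ""
    fi
}

# Only query the card if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
    if [ -n "$total" ]; then
        echo "Detected ${total} MiB; suggested flag: $(suggest_vram_flag "$total")"
    fi
fi
```

Treat the output as a starting point: if --medvram still runs out of memory, drop to --lowvram regardless of what the arithmetic says.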

A word on the environment: mismatched CUDA, driver, and PyTorch versions are the single most common reason a “working” flag combination suddenly stops working. Pin them, note what you pinned, and don’t upgrade the driver casually the night before you need the setup.
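One cheap guard against a version mismatch is to test whether xformers actually imports before asking for it, and fall back to the native attention flag if not. This is a sketch, not part of the web UI: `pick_attention_flag` and the `PYTHON` variable are names I made up, and the check only means anything if it runs with the same Python environment the web UI uses.

```shell
# Print --xformers if the module imports cleanly, else the SDP fallback.
# PYTHON is a hypothetical override; point it at the web UI's interpreter.
pick_attention_flag() {
    if "${PYTHON:-python3}" -c 'import xformers' >/dev/null 2>&1; then
        echo "--xformers"
    else
        echo "--opt-sdp-attention"
    fi
}

echo "attention flag: $(pick_attention_flag)"
```

An import that succeeds does not prove the kernels match your CUDA build, so a quick test render is still worth doing after any upgrade.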

Tiling the bits that spike

The two operations that blow your memory budget are high-resolution generation and the VAE decode. Both have tiling solutions, and tiling is the single most important idea on a budget card.

For the VAE, enabling tiled VAE — either through the built-in option in newer builds or the Tiled VAE extension — chops the decode into overlapping chunks the card can swallow one at a time, then stitches them back together. This alone eliminates the majority of end-of-run crashes. For going large, don’t generate at high resolution directly — generate at 512×512 and use a tiled upscaler.

This is where the real workflow lives:

1. Generate base image at 512x512 (or 768x512)
2. Pick the keeper
3. Send to img2img with denoising ~0.3
4. Set "SD upscale" script, tile overlap 64, scale 2x
5. VAE decodes per-tile, never the whole frame at once

A 512×512 base upscaled 2× gives you a clean 1024×1024 that never required more than a few gigabytes at any single moment. The card thinks it’s doing easy work the whole time, because it is. Keep the denoising strength low on the upscale pass — around 0.3 to 0.4 — or the tiles start hallucinating detail that doesn’t match across seams. Too high and you get the classic tiled-upscale artefact where each square grows its own tiny extra face or window; too low and the upscale adds nothing. It’s the one number worth dialling in by eye.
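To see what the tiling amounts to, here is a small sketch that lays overlapping tiles along one axis of the image. It is my own illustration of the idea, not the code the SD upscale script or the Tiled VAE extension actually runs, and `tile_starts` is a hypothetical name.

```shell
# Print the start offsets of overlapping tiles covering one axis.
# Arguments: axis length, tile size, overlap (all in pixels).
tile_starts() {
    length=$1
    tile=$2
    overlap=$3
    stride=$(( tile - overlap ))
    pos=0
    # Emit tiles while a further tile is still needed beyond this one.
    while [ $(( pos + tile )) -lt "$length" ]; do
        printf '%s ' "$pos"
        pos=$(( pos + stride ))
    done
    # Pin the final tile to the far edge so nothing is left uncovered.
    last=$(( length - tile ))
    if [ "$last" -lt 0 ]; then last=0; fi
    printf '%s\n' "$last"
}

tile_starts 1024 512 64   # prints "0 448 512"
```

Three starts per axis means nine 512×512 tiles for the 1024×1024 result, and the card only ever holds one of them at a time.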

When to abandon ship to CPU or quantisation

Below 4GB you start running out of tricks. You can offload to system RAM, but at that point the GPU is barely helping and a CPU render — slow as it is — becomes almost competitive for SD 1.5. There’s also a steady trickle of quantised and pruned checkpoints; an fp16 model pruned of its training-only weights (the so-called “pruned” or “EMA-only” checkpoints) can shave a few hundred megabytes, which is occasionally the difference between OOM and success.

The newer frontier here is genuinely quantised inference — 8-bit and even 4-bit weights via tooling that’s maturing quickly. It’s not as turnkey on the image side as it is for large language models yet, but it’s worth watching if you’re stuck below the SD 1.5 comfort line. Honestly, though, the cleanest win on a truly tiny card is just patience and SD 1.5. The community’s best 1.5 fine-tunes still produce genuinely lovely output, and they fit where SDXL never will.
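The appeal of quantisation is plain arithmetic: weight memory scales with bits per parameter. A rough sketch, assuming an approximately 860M-parameter U-Net (the commonly quoted SD 1.5 figure) and ignoring the extra scale factors and higher-precision layers that real quantisation schemes keep around; `weights_mb` is a hypothetical helper.

```shell
# Weight size in MB for a model with the given millions of parameters
# stored at the given bit width. Ignores quantisation metadata overhead.
weights_mb() {
    params_millions=$1
    bits=$2
    echo $(( params_millions * bits / 8 ))
}

weights_mb 860 16   # fp16  -> 1720
weights_mb 860 8    # 8-bit -> 860
weights_mb 860 4    # 4-bit -> 430
```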

Troubleshooting: the errors you will actually hit

Because so much of this is trial and error, here is the short list of failures I hit most often and what each one actually means.

CUDA out of memory at the very end of a run. This is the VAE decode spike. Enable tiled VAE first. If it persists, --no-half-vae isn’t the fix here, since it costs memory rather than saving it; instead lower your target resolution or upscale in a separate pass rather than generating large directly.
CUDA out of memory immediately, before sampling. The model itself doesn’t fit. Move from --medvram to --lowvram, or switch from SDXL to SD 1.5. No amount of tiling saves you if the weights won’t load.
A completely black output image. Classic fp16 VAE bug on certain cards. Add --no-half-vae. If you’re using a custom VAE, try the default one to isolate the cause.
xformers refuses to install or import. Almost always a PyTorch/CUDA version mismatch. Drop --xformers and use --opt-sdp-attention instead; you lose a little speed but keep the memory savings.
Renders that were fast yesterday are crawling today. Something else is holding VRAM: a browser with hardware acceleration, another model server, or a stuck previous process. Check with nvidia-smi and kill the freeloader.

That last one is worth internalising: on a card this small, the difference between success and OOM is often a single Chrome tab. If you’re sharing the machine with other GPU workloads, my notes on GPU scheduling and resource sharing for AI inference cover how to stop two jobs fighting over the same 6GB.
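A quick pre-flight check along these lines is a sketch of how I would look for the freeloader before launching. The `nvidia-smi` query options are standard; `vram_free_mib` is a hypothetical helper that just subtracts used from total.

```shell
# Read lines of "used, total" (MiB) on stdin and print free MiB for each.
vram_free_mib() {
    awk -F',' '{ print $2 - $1 }'
}

# Only touch the GPU if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "Free VRAM (MiB):"
    nvidia-smi --query-gpu=memory.used,memory.total \
        --format=csv,noheader,nounits | vram_free_mib
    # The process table at the bottom of the plain output shows who holds memory.
    nvidia-smi
fi
```

If the free figure is well below what the card should have with nothing running, the process table usually names the culprit.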

A realistic benchmark

On my 6GB card, SD 1.5 at 512×512 with 20 sampling steps lands around 8 seconds an image with xformers on. SDXL with --medvram-sdxl at 1024×1024 takes me somewhere north of 40 seconds and occasionally falls over on the VAE if I’ve got a browser hogging VRAM in the background. That five-to-one ratio tells you everything about where to spend your time: for the cost of one SDXL render you could have generated five SD 1.5 candidates, picked the best, and upscaled it — and ended up somewhere better.

Steps matter too. Going from 20 to 40 sampling steps roughly doubles the time for a difference most people can’t spot in a blind comparison. Twenty steps on a good sampler is plenty for iteration; save the high step counts for the final render of a keeper.
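The arithmetic behind those trade-offs, as a sketch. It assumes render time scales linearly with step count, which ignores fixed costs such as model loading and the VAE decode, and it uses my own timings from above; `est_seconds` is a hypothetical helper.

```shell
# Estimate render time at a new step count from a measured baseline.
# Arguments: baseline seconds, baseline steps, target steps.
est_seconds() {
    echo $(( $1 * $3 / $2 ))
}

est_seconds 8 20 40                             # 40 steps at the SD 1.5 rate -> 16
echo "candidates per SDXL render: $(( 40 / 8 ))" # prints 5
```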

Is it worth it?

If you already own any reasonably modern NVIDIA card with 4GB or more, absolutely. You will generate thousands of images for the cost of electricity, keep every prompt private, and never wait in a queue or hit a credit limit. The compromise is real — you live in SD 1.5 land and upscale your way to resolution rather than generating it directly — but the results are genuinely good and the iteration loop is fast enough that you’ll experiment more, not less.

Who should not bother? Anyone who needs SDXL’s native high-resolution coherence for paid work, on a deadline, every day. For them the cloud or a bigger card pays for itself, and pretending otherwise is false economy of exactly the kind I’ve argued against elsewhere — running your own hardware only makes sense when the sums, including your time, actually add up. For everyone else — the tinkerers, the hobbyists, the people who just want to make weird pictures at 1am — the budget GPU you already own is more than enough. Don’t let the 4090 evangelists tell you otherwise.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #machine-learning #stable-diffusion #GPU #self-hosting
