The GGUF is working but I need to reduce the GPU layers

#2
by gopi87 - opened

Hi, the GGUF is working but I need to reduce the GPU layers and increase the CPU layers while doing quantization.

@ubergarm any help for this? This looks promising btw.

Deciding which layers go where sounds like it is configurable at inference / model loading time, right?

@yhavinga

Yes you can manually specify which layers go where.
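For example, a minimal sketch assuming a recent llama.cpp or ik_llama.cpp build with the `--override-tensor` / `-ot` flag (the model path, context size, and thread count below are placeholders):

```bash
# Put everything on the GPU by default, then push the routed-expert tensors
# (names containing "exps") back onto CPU/RAM:
./llama-server -m ./model-Q6_K.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 32768 -t 16

# Or simply cap the number of layers offloaded to the GPU:
./llama-server -m ./model-Q6_K.gguf -ngl 20
```

This all happens at load time, so no re-quantization is needed just to change the layer placement.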

I took only a very quick peek into the GGUFs and these look like mostly vanilla llama.cpp flavored mixtures? Already the attn/shexp/first N dense layers are quite small relative to the routed experts. In my own recipes I tend to keep those attn/shexp/first N dense layers larger and more heavily quantize the routed experts.

Ideally the attn/shexp/first N dense layers all fit into VRAM/GPU and only routed experts will run on the CPU/RAM for best speeds.
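On the quantization side, a rough sketch of that kind of recipe with ik_llama.cpp's `llama-quantize` could look like the following; the `--custom-q` regexes and quant types here are illustrative assumptions only, so check `llama-quantize --help` on your build for the exact syntax:

```bash
# Keep attention, shared experts, and the first few dense layers at a higher
# bit-width, and quantize the routed experts more aggressively:
./build/bin/llama-quantize \
  --imatrix imatrix.dat \
  --custom-q "blk\..*\.attn_.*=iq5_ks,blk\..*\.ffn_.*_shexp.*=iq5_ks,blk\.[0-2]\.ffn_.*=iq5_ks,blk\..*\.ffn_.*_exps.*=iq2_ks" \
  model-f16.gguf model-IQ2_KS-mix.gguf IQ2_KS 16
```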

I'm curious how fast the REAP versions run here, as anecdotally I've heard REAP can run slower than the originals for some reason despite having far fewer weights. Also I have some full-size versions that are smaller than this REAP version, so it would be interesting to compare perplexity. My smol-IQ1_KT can fit entirely in 96GB of VRAM for example and still runs okay haha...
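For the perplexity comparison, something along these lines should work with the stock tooling (a sketch; the test file and offload settings are placeholders):

```bash
# Use the same context size and test corpus for each GGUF so the numbers are comparable:
./llama-perplexity -m ./model-REAP-Q6_K.gguf -f wiki.test.raw -c 512 -ngl 99 -ot "exps=CPU"
./llama-perplexity -m ./smol-IQ1_KT.gguf -f wiki.test.raw -c 512 -ngl 99
```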

Anyway cool to see so many options these days! Cheers!

@ubergarm I would love to try your IQ_K quants for this one. Did extensive Python grinding
with Q6 and the results are really good (@yhavinga, thank you for that). Achieved around 3 t/s with all experts mlocked in RAM in vanilla llama.cpp. Maybe with IQ4_KS we might get close to these results but with much better speed on ik_llama.cpp?
