Anyone running via CPU + GPU + RPC GPU?
I am getting very slow performance when I run it like this lol, any help?
Heya gopi! I'd have to see your full command to help better and understand your setup, especially given you are using RPC.
RPC is not the most supported feature, and not likely to give best performance as you probably know.
Your best bet is trying to run on a single system with CPU+GPU(s).
Maybe a smaller quant would fit on a single rig? Otherwise, you'll have to play some games and do some research to figure out the best way of organizing the order of devices when using RPC. Also, I don't think RPC could take advantage of the new -sm parallel option last time I tried, but things move so quickly, who knows today haha...
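If you do end up experimenting with RPC, the rough shape is usually something like the below (assuming your build was compiled with the RPC backend; the address and port here are just placeholders):

# on the box with the remote GPU
./bin/rpc-server -H 0.0.0.0 -p 50052

# on the main rig, point llama-server at the remote backend
./bin/llama-server --model your-model.gguf --rpc 192.168.1.50:50052 -ngl 99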
Happy new year!
happy new year!
ubergarm I have 256GB RAM, dual CPUs, a 12GB RTX card, and an externally connected GPU, a 3080 with 16GB VRAM.
Looks like I need to add another GPU to my server to boost things up.
CUDA_VISIBLE_DEVICES="" ./bin/llama-server
--model "/home/gopi/deepresearch-ui/model/MiMo-V2-Flash-Q4_K_M-00001-of-00004.gguf"
--ctx-size 30000
--threads 40
--threads-batch 40
--host 0.0.0.0
--jinja
--port 8080
--mlock
--no-mmap
CUDA_VISIBLE_DEVICES="0" ./bin/llama-server
--model "/home/gopi/deepresearch-ui/model/MiniMax-M2.1-MXFP4_MOE-00001-of-00007.gguf"
--ctx-size 20000
-ngl 99
--n-cpu-moe 63
--threads 28
--threads-batch 28
--host 0.0.0.0
--mlock
--no-mmap
--jinja
--port 8080
I am currently exploring these two models. I am also thinking of creating a website for people to share their model tricks and system specifications so it would be helpful to everyone. What are your thoughts on this?
ubergarm I have 256GB RAM, dual CPUs, a 12GB RTX card, and an externally connected GPU, a 3080 with 16GB VRAM.
Looks like I need to add another GPU to my server to boost things up.
Dual CPU can be tricky depending on how you have configured your NUMA nodes. You might want to try something like this when you need all the RAM:
numactl --interleave=all llama-server --numa distribute ...
If your quant can fit in a single NUMA node, consider something like this:
numactl -N 0 -m 0 llama-server --numa numactl ...
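If you're not sure how your RAM and cores are split across nodes, something like this will show the layout so you can tell whether a given quant fits in a single node:

numactl --hardware
lscpu | grep -i numa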
MXFP4
I haven't tried MiniMax, but looking at the model card https://huggingface.co/MiniMaxAI/MiniMax-M2.1 it does not seem to be released in MXFP4, so I would avoid that format unless the original release specifies it is the correct quant type to use. I would usually prefer ik's IQ4_KSS at 4.0 BPW or similar newer types if you can find or quantize them. I haven't done this model myself, assuming it is even supported on ik. Otherwise Q4_K_M is probably a pretty good choice.
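If you wanted to try rolling your own IQ4_KSS, the usual ik_llama.cpp flow is roughly the below (assuming the architecture is supported there; paths and filenames are just placeholders):

# optionally add --imatrix <file> for better low-bit quality
./bin/llama-quantize /path/to/MiniMax-M2.1-BF16.gguf /path/to/MiniMax-M2.1-IQ4_KSS.gguf IQ4_KSS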
I am thinking of creating a website for people to share their model tricks and system specifications so it would be helpful to everyone. What are your thoughts on this?
There are so many features in ik_llama.cpp that something like this could be useful. Though it is difficult to keep it up to date as things change so quickly. Usually people can track the ik_llama.cpp PR history for the latest information, or sometimes read my most recent discussions on huggingface to get some tips. Or hang out on the Beaver AI discord for real-time chat.
But definitely share the link if you give it a go! Cheers!