I saw memory usage climb to ~98% while the LLM was being loaded to the GPU (I didn't have much free RAM at that moment), and then the process printed "Terminated" in the terminal. So I enabled the MMAP option in koboldcpp and the model loaded, albeit more slowly; the delay happened at the "Model warm up" message in the terminal. It then took about 20 seconds to generate the first token.
A smaller model that loaded fine without MMAP ran equally fast whether loaded with or without it. With MMAP, RAM usage for the smaller model was lower the whole time, including after the load. For the larger model, memory usage was about the same as for the smaller one (both loaded with MMAP). The larger model used to run fine (fast) before, back when free RAM was abundant.
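For context on what I think MMAP changes: my understanding (an assumption on my part, not something from the koboldcpp docs) is that a memory-mapped file is not copied into RAM up front; the OS only creates a virtual-memory mapping and faults pages in from disk the first time they are touched, which would explain both the lower RAM usage and the slow warm-up/first token. A minimal sketch of that behavior, using a small dummy file as a stand-in for a model file:

```python
import mmap
import os
import tempfile

# Create a small dummy "model file" on disk (stand-in for a real GGUF file).
path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (1 << 20))  # 1 MiB of fake weights

# mmap only establishes a virtual-memory mapping; no data is read yet.
# Each 4 KiB page is loaded from disk on first access (a page fault),
# so the cost of reading the file is paid lazily, during first use.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_byte = mm[0]        # touching a byte faults in that page
    mapped_size = len(mm)     # full file size is mapped, not resident
    mm.close()

print(mapped_size, first_byte)
```

If this picture is right, it would also hint at question 3 below: with a mapping backed by the file on disk, the pages can't simply be dropped the way a plain in-RAM copy can, though that part is my guess.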
Questions:
1. Why is the larger model so slow if it should be fully in VRAM after load? If it is not in VRAM, why not? Can I fully put the LLM into VRAM with koboldcpp (or, if not, with another tool)?
2. It seems that after the load onto the GPU (Vulkan on NVIDIA), the RAM used during model loading is not released. Why?
3. What does the MMAP help text in the koboldcpp GUI mean by "model will not be unloadable"? How do I unload models if MMAP is not used?