Change batch size to the llama.cpp default of 8. I've seen issues in llama.cpp where batch size affects the quality of generations (it shouldn't, but in case that's still an issue, I changed it to the default). Also set the auto-determined number of threads to half the system core count: ggml will sometimes peg cores at 100% while doing nothing, and although this is being addressed, it can make for a bad user experience if cores sit pegged at 100%.
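In code, the two defaults described above might be applied roughly like this; this is a minimal sketch, assuming the `llama_cpp.Llama` constructor exposes `n_batch` and `n_threads` parameters, and the model path shown is hypothetical.

```python
import multiprocessing

from llama_cpp import Llama

# Use the llama.cpp default batch size of 8.
DEFAULT_N_BATCH = 8


def default_thread_count() -> int:
    # Default to half of the system's logical cores (but at least one)
    # so ggml does not peg every core at 100% while doing nothing.
    return max(multiprocessing.cpu_count() // 2, 1)


llm = Llama(
    model_path="./models/ggml-model.bin",  # hypothetical model path
    n_batch=DEFAULT_N_BATCH,
    n_threads=default_thread_count(),
)
```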
fastapi_server.py
high_level_api_embedding.py
high_level_api_inference.py
high_level_api_streaming.py
langchain_custom_llm.py