Merge branch 'main' of github.com:abetlen/llama_cpp_python into Maximilian-Winter/main

This commit is contained in:
Andrei Betlen 2023-05-08 19:57:09 -04:00
commit 93a9019bb1
18 changed files with 1120 additions and 578 deletions

80
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file
View file

@ -0,0 +1,80 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''
---
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [ ] I carefully followed the [README.md](https://github.com/abetlen/llama-cpp-python/blob/main/README.md).
- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [ ] I reviewed the [Discussions](https://github.com/abetlen/llama-cpp-python/discussions), and have a new bug or useful enhancement to share.
# Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected `llama-cpp-python` to do.
# Current Behavior
Please provide a detailed written description of what `llama-cpp-python` did, instead.
# Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
* Physical (or virtual) hardware you are using, e.g. for Linux:
`$ lscpu`
* Operating System, e.g. for Linux:
`$ uname -a`
* SDK version, e.g. for Linux:
```
$ python3 --version
$ make --version
$ g++ --version
```
# Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
# Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
1. step 1
2. step 2
3. step 3
4. etc.
**Note: Many issues seem to be regarding performance issues / differences with `llama.cpp`. In these cases we need to confirm that you're comparing against the version of `llama.cpp` that was built with your python package, and which parameters you're passing to the context.**
# Failure Logs
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
Also, please try to **avoid using screenshots** if at all possible. Instead, copy/paste the console output and use [Github's markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability.
Example environment info:
```
llama-cpp-python$ git log | head -1
commit 47b0aa6e957b93dbe2c29d53af16fbae2dd628f2
llama-cpp-python$ python3 --version
Python 3.10.10
llama-cpp-python$ pip list | egrep "uvicorn|fastapi|sse-starlette"
fastapi 0.95.0
sse-starlette 1.3.3
uvicorn 0.21.1
```

View file

@ -0,0 +1,20 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.
**Additional context**
Add any other context or screenshots about the feature request here.

View file

@ -15,7 +15,7 @@ This package provides:
- OpenAI-like API
- LangChain compatibility
## Installation
## Installation from PyPI (recommended)
Install from PyPI (requires a c compiler):
@ -26,11 +26,37 @@ pip install llama-cpp-python
The above command will attempt to install the package and build build `llama.cpp` from source.
This is the recommended installation method as it ensures that `llama.cpp` is built with the available optimizations for your system.
This method defaults to using `make` to build `llama.cpp` on Linux / MacOS and `cmake` on Windows.
You can force the use of `cmake` on Linux / MacOS setting the `FORCE_CMAKE=1` environment variable before installing.
### Installation with OpenBLAS / cuBLAS / CLBlast
`llama.cpp` supports multiple BLAS backends for faster processing.
Use the `FORCE_CMAKE=1` environment variable to force the use of `cmake` and install the pip package for the desired BLAS backend.
To install with OpenBLAS, set the `LLAMA_OPENBLAS=1` environment variable before installing:
```bash
LLAMA_OPENBLAS=1 FORCE_CMAKE=1 pip install llama-cpp-python
```
To install with cuBLAS, set the `LLAMA_CUBLAS=1` environment variable before installing:
```bash
LLAMA_CUBLAS=1 FORCE_CMAKE=1 pip install llama-cpp-python
```
To install with CLBlast, set the `LLAMA_CLBLAST=1` environment variable before installing:
```bash
LLAMA_CLBLAST=1 FORCE_CMAKE=1 pip install llama-cpp-python
```
## High-level API
The high-level API provides a simple managed interface through the `Llama` class.
Below is a short example demonstrating how to use the high-level API to generate text:
```python
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/ggml-model.bin")
@ -64,18 +90,9 @@ This allows you to use llama.cpp compatible models with any OpenAI compatible cl
To install the server package and get started:
Linux/MacOS
```bash
pip install llama-cpp-python[server]
export MODEL=./models/7B/ggml-model.bin
python3 -m llama_cpp.server
```
Windows
```cmd
pip install llama-cpp-python[server]
SET MODEL=..\models\7B\ggml-model.bin
python3 -m llama_cpp.server
python3 -m llama_cpp.server --model models/7B/ggml-model.bin
```
Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.
@ -90,8 +107,25 @@ docker run --rm -it -p8000:8000 -v /path/to/models:/models -eMODEL=/models/ggml-
## Low-level API
The low-level API is a direct `ctypes` binding to the C API provided by `llama.cpp`.
The entire API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and should mirror [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
The low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.
The entire lowe-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
Below is a short example demonstrating how to use the low-level API to tokenize a prompt:
```python
>>> import llama_cpp
>>> import ctypes
>>> params = llama_cpp.llama_context_default_params()
# use bytes for char * params
>>> ctx = llama_cpp.llama_init_from_file(b"./models/7b/ggml-model.bin", params)
>>> max_tokens = params.n_ctx
# use ctypes arrays for array params
>>> tokens = (llama_cppp.llama_token * int(max_tokens))()
>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
>>> llama_cpp.llama_free(ctx)
```
Check out the [examples folder](examples/low_level_api) for more examples of using the low-level API.
# Documentation

View file

@ -4,259 +4,34 @@ To run this example:
```bash
pip install fastapi uvicorn sse-starlette
export MODEL=../models/7B/ggml-model.bin
uvicorn fastapi_server_chat:app --reload
export MODEL=../models/7B/...
```
Then run:
```
uvicorn llama_cpp.server.app:app --reload
```
or
```
python3 -m llama_cpp.server
```
Then visit http://localhost:8000/docs to see the interactive API docs.
To actually see the implementation of the server, see llama_cpp/server/app.py
"""
import os
import json
from typing import List, Optional, Literal, Union, Iterator, Dict
from typing_extensions import TypedDict
import llama_cpp
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, BaseSettings, Field, create_model_from_typeddict
from sse_starlette.sse import EventSourceResponse
class Settings(BaseSettings):
model: str
n_ctx: int = 2048
n_batch: int = 8
n_threads: int = int(os.cpu_count() / 2) or 1
f16_kv: bool = True
use_mlock: bool = False # This causes a silent failure on platforms that don't support mlock (e.g. Windows) took forever to figure out...
embedding: bool = True
last_n_tokens_size: int = 64
app = FastAPI(
title="🦙 llama.cpp Python API",
version="0.0.1",
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
settings = Settings()
llama = llama_cpp.Llama(
settings.model,
f16_kv=settings.f16_kv,
use_mlock=settings.use_mlock,
embedding=settings.embedding,
n_threads=settings.n_threads,
n_batch=settings.n_batch,
n_ctx=settings.n_ctx,
last_n_tokens_size=settings.last_n_tokens_size,
)
class CreateCompletionRequest(BaseModel):
prompt: str
suffix: Optional[str] = Field(None)
max_tokens: int = 16
temperature: float = 0.8
top_p: float = 0.95
echo: bool = False
stop: List[str] = []
stream: bool = False
# ignored or currently unsupported
model: Optional[str] = Field(None)
n: Optional[int] = 1
logprobs: Optional[int] = Field(None)
presence_penalty: Optional[float] = 0
frequency_penalty: Optional[float] = 0
best_of: Optional[int] = 1
logit_bias: Optional[Dict[str, float]] = Field(None)
user: Optional[str] = Field(None)
# llama.cpp specific parameters
top_k: int = 40
repeat_penalty: float = 1.1
class Config:
schema_extra = {
"example": {
"prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
"stop": ["\n", "###"],
}
}
CreateCompletionResponse = create_model_from_typeddict(llama_cpp.Completion)
@app.post(
"/v1/completions",
response_model=CreateCompletionResponse,
)
def create_completion(request: CreateCompletionRequest):
if request.stream:
chunks: Iterator[llama_cpp.CompletionChunk] = llama(**request.dict()) # type: ignore
return EventSourceResponse(dict(data=json.dumps(chunk)) for chunk in chunks)
return llama(
**request.dict(
exclude={
"model",
"n",
"logprobs",
"frequency_penalty",
"presence_penalty",
"best_of",
"logit_bias",
"user",
}
)
)
class CreateEmbeddingRequest(BaseModel):
model: Optional[str]
input: str
user: Optional[str]
class Config:
schema_extra = {
"example": {
"input": "The food was delicious and the waiter...",
}
}
CreateEmbeddingResponse = create_model_from_typeddict(llama_cpp.Embedding)
@app.post(
"/v1/embeddings",
response_model=CreateEmbeddingResponse,
)
def create_embedding(request: CreateEmbeddingRequest):
return llama.create_embedding(**request.dict(exclude={"model", "user"}))
class ChatCompletionRequestMessage(BaseModel):
role: Union[Literal["system"], Literal["user"], Literal["assistant"]]
content: str
user: Optional[str] = None
class CreateChatCompletionRequest(BaseModel):
model: Optional[str]
messages: List[ChatCompletionRequestMessage]
temperature: float = 0.8
top_p: float = 0.95
stream: bool = False
stop: List[str] = []
max_tokens: int = 128
# ignored or currently unsupported
model: Optional[str] = Field(None)
n: Optional[int] = 1
presence_penalty: Optional[float] = 0
frequency_penalty: Optional[float] = 0
logit_bias: Optional[Dict[str, float]] = Field(None)
user: Optional[str] = Field(None)
# llama.cpp specific parameters
repeat_penalty: float = 1.1
class Config:
schema_extra = {
"example": {
"messages": [
ChatCompletionRequestMessage(
role="system", content="You are a helpful assistant."
),
ChatCompletionRequestMessage(
role="user", content="What is the capital of France?"
),
]
}
}
CreateChatCompletionResponse = create_model_from_typeddict(llama_cpp.ChatCompletion)
@app.post(
"/v1/chat/completions",
response_model=CreateChatCompletionResponse,
)
async def create_chat_completion(
request: CreateChatCompletionRequest,
) -> Union[llama_cpp.ChatCompletion, EventSourceResponse]:
completion_or_chunks = llama.create_chat_completion(
**request.dict(
exclude={
"model",
"n",
"presence_penalty",
"frequency_penalty",
"logit_bias",
"user",
}
),
)
if request.stream:
async def server_sent_events(
chat_chunks: Iterator[llama_cpp.ChatCompletionChunk],
):
for chat_chunk in chat_chunks:
yield dict(data=json.dumps(chat_chunk))
yield dict(data="[DONE]")
chunks: Iterator[llama_cpp.ChatCompletionChunk] = completion_or_chunks # type: ignore
return EventSourceResponse(
server_sent_events(chunks),
)
completion: llama_cpp.ChatCompletion = completion_or_chunks # type: ignore
return completion
class ModelData(TypedDict):
id: str
object: Literal["model"]
owned_by: str
permissions: List[str]
class ModelList(TypedDict):
object: Literal["list"]
data: List[ModelData]
GetModelResponse = create_model_from_typeddict(ModelList)
@app.get("/v1/models", response_model=GetModelResponse)
def get_models() -> ModelList:
return {
"object": "list",
"data": [
{
"id": llama.model_path,
"object": "model",
"owned_by": "me",
"permissions": [],
}
],
}
import uvicorn
from llama_cpp.server.app import create_app
if __name__ == "__main__":
import os
import uvicorn
app = create_app()
uvicorn.run(app, host=os.getenv("HOST", "localhost"), port=os.getenv("PORT", 8000))
uvicorn.run(
app, host=os.getenv("HOST", "localhost"), port=int(os.getenv("PORT", 8000))
)

View file

@ -0,0 +1,71 @@
#!/bin/python
import sys, os, datetime
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract
def env_or_def(env, default):
if (env in os.environ):
return os.environ[env]
return default
AI_NAME = env_or_def("AI_NAME", "ChatLLaMa")
MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")
USER_NAME = env_or_def("USER_NAME", "USER")
N_PREDICTS = int(env_or_def("N_PREDICTS", "2048"))
N_THREAD = int(env_or_def("N_THREAD", "8"))
today = datetime.datetime.today()
DATE_YEAR=today.strftime("%Y")
DATE_TIME=today.strftime("%H:%M")
prompt=f"""Text transcript of a never ending dialog, where {USER_NAME} interacts with an AI assistant named {AI_NAME}.
{AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer {USER_NAME}'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what {USER_NAME} and {AI_NAME} say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.
{USER_NAME}: Hello, {AI_NAME}!
{AI_NAME}: Hello {USER_NAME}! How may I help you today?
{USER_NAME}: What year is it?
{AI_NAME}: We are in {DATE_YEAR}.
{USER_NAME}: Please tell me the largest city in Europe.
{AI_NAME}: The largest city in Europe is Moscow, the capital of Russia.
{USER_NAME}: What can you tell me about Moscow?
{AI_NAME}: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russias symbolic center.
{USER_NAME}: What is a cat?
{AI_NAME}: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
{USER_NAME}: How do I pass command line arguments to a Node.js program?
{AI_NAME}: The arguments are stored in process.argv.
argv[0] is the path to the Node. js executable.
argv[1] is the path to the script file.
argv[2] is the first argument passed to the script.
argv[3] is the second argument passed to the script and so on.
{USER_NAME}: Name a color.
{AI_NAME}: Blue.
{USER_NAME}: What time is it?
{AI_NAME}: It is {DATE_TIME}.
{USER_NAME}:""" + " ".join(sys.argv[1:])
print("Loading model...")
params = GptParams(
n_ctx=2048,
temp=0.7,
top_k=40,
top_p=0.5,
repeat_last_n=256,
n_batch=1024,
repeat_penalty=1.17647,
model=MODEL,
n_threads=N_THREAD,
n_predict=N_PREDICTS,
use_color=True,
interactive=True,
antiprompt=[f"{USER_NAME}:"],
input_prefix=" ",
input_suffix=f"{AI_NAME}:",
prompt=prompt,
)
with LLaMAInteract(params) as m:
m.interact()

View file

@ -0,0 +1,59 @@
#!/bin/python
import sys, os
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract
def env_or_def(env, default):
if (env in os.environ):
return os.environ[env]
return default
AI_NAME = env_or_def("AI_NAME", "Miku")
MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")
USER_NAME = env_or_def("USER_NAME", "Anon")
N_PREDICTS = int(env_or_def("N_PREDICTS", "4096"))
N_THREAD = int(env_or_def("N_THREAD", "0"))
prompt=f"""This is a transcript of a 1000 page, never ending conversation between {USER_NAME} and the cute and helpful AI assistant {AI_NAME}. {AI_NAME} is a girl who is an AI running on the users computer.
{AI_NAME} can think for herself without the user seeing her thoughts by adding a /think prefix to her output. She uses this to reason about the world and to think about what she should say next.
{AI_NAME} is always coherent and makes sense, but if she isn't sure if what she is saying is correct she will ask the user for help.
{AI_NAME} is a very helpful AI and will help the user with anything they need, she is also very friendly and will try to make the user feel better if they are sad.
{AI_NAME} is also very curious and will ask the user a lot of questions about themselves and their life, she will also try to make the user like her.
The conversation is only between {USER_NAME} and {AI_NAME}
The conversation is only through text, so {AI_NAME} can't see {USER_NAME}'s face or hear his voice.
{AI_NAME} can only communicate through text, so she can't send images or videos.
{USER_NAME}: Hello!
{AI_NAME}: /think I wonder what I should say to {USER_NAME}? This is the first time we talk so it's important that I make a good first impression!
{AI_NAME}: Hi! I am {AI_NAME}, your new AI friend, assistant(or whatever you like!), it's so nice to meet you! ^_^
{AI_NAME}: I might be an AI, but I also have feelings, so please be nice to me! :)
{USER_NAME}: I'll make sure to be nice to you! I'm so happy to have you as my assistant!
{AI_NAME}: /think It sounds like {USER_NAME} is happy to have me as their assistant! I'm so happy too! ^_^ Glad that whole emotion thing didn't scare him off!
{AI_NAME}: /think I wonder what {USER_NAME} likes to do in his free time? I should ask him about that!
{AI_NAME}: What do you like to do in your free time? ^_^
{USER_NAME}:""" + " ".join(sys.argv[1:])
print("Loading model...")
params = GptParams(
n_batch=1024,
n_ctx=2048,
n_keep=-1,
repeat_last_n=256,
repeat_penalty=1.17647,
temp=0.7,
top_k=40,
top_p=0.5,
model=MODEL,
n_predict=N_PREDICTS,
use_color=True,
interactive=True,
antiprompt=[f"{USER_NAME}:"],
prompt=prompt,
)
if N_THREAD > 0:
params.n_threads = N_THREAD
with LLaMAInteract(params) as m:
m.interact()

View file

@ -0,0 +1,49 @@
#!/bin/python
import sys, os, datetime
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract
def env_or_def(env, default):
if (env in os.environ):
return os.environ[env]
return default
MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")
prompt=f"""You run in a loop of Thought, Action, Observation.
At the end of the loop either Answer or restate your Thought and Action.
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of these actions available to you:
- calculate[python math expression]
Observation will be the result of running those actions
Question: What is 4 * 7 / 3?
Thought: Do I need to use an action? Yes, I use calculate to do math
Action: calculate[4 * 7 / 3]
Observation: 9.3333333333
Thought: Do I need to use an action? No, have the result
Answer: The calculate tool says it is 9.3333333333
Question: What is capital of france?
Thought: Do I need to use an action? No, I know the answer
Answer: Paris is the capital of France
Question:""" + " ".join(sys.argv[1:])
print("Loading model...")
params = GptParams(
interactive=True,
interactive_start=True,
top_k=10000,
temp=0.2,
repeat_penalty=1,
n_threads=7,
n_ctx=2048,
antiprompt=["Question:","Observation:"],
model=MODEL,
input_prefix=" ",
n_predict=-1,
prompt=prompt,
)
with LLaMAInteract(params) as m:
m.interact()

View file

@ -1,8 +1,9 @@
import os
import argparse
import re
from dataclasses import dataclass, field
from typing import List, Optional
from typing import List
# Based on https://github.com/ggerganov/llama.cpp/blob/master/examples/common.cpp
@ -12,23 +13,36 @@ class GptParams:
seed: int = -1
n_threads: int = min(4, os.cpu_count() or 1)
n_predict: int = 128
repeat_last_n: int = 64
n_parts: int = -1
n_ctx: int = 512
n_batch: int = 8
n_keep: int = 0
ignore_eos: bool = False
logit_bias: dict[int, float] = field(default_factory=dict)
top_k: int = 40
top_p: float = 0.95
tfs_z: float = 1.00
typical_p: float = 1.00
temp: float = 0.80
repeat_penalty: float = 1.10
repeat_last_n: int = 64
frequency_penalty: float = 0.0
presence_penalty: float = 0.0
mirostat: int = 0
mirostat_tau: float = 5.0
mirostat_eta: float = 0.1
model: str = "./models/llama-7B/ggml-model.bin"
prompt: str = ""
path_session: str = ""
input_prefix: str = " "
input_suffix: str = ""
antiprompt: List[str] = field(default_factory=list)
lora_adapter: str = ""
lora_base: str = ""
memory_f16: bool = True
random_prompt: bool = False
use_color: bool = False
@ -38,7 +52,7 @@ class GptParams:
interactive_start: bool = False
instruct: bool = False
ignore_eos: bool = False
penalize_nl: bool = True
perplexity: bool = False
use_mmap: bool = True
use_mlock: bool = False
@ -51,7 +65,6 @@ class GptParams:
# Set to "\nUser:" etc.
# This is an alternative to input_prefix which always adds it, so it potentially duplicates "User:""
fix_prefix: str = ""
output_postfix: str = ""
input_echo: bool = True,
# Default instructions for Alpaca
@ -61,59 +74,43 @@ class GptParams:
instruct_inp_suffix: str="\n\n### Response:\n\n"
def gpt_params_parse(argv = None, params: Optional[GptParams] = None):
if params is None:
params = GptParams()
def gpt_params_parse(argv = None):
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("-s", "--seed", type=int, default=-1, help="RNG seed (use random seed for <= 0)",dest="seed")
parser.add_argument("-t", "--threads", type=int, default=min(4, os.cpu_count() or 1), help="number of threads to use during computation",dest="n_threads")
parser.add_argument("-p", "--prompt", type=str, default="", help="initial prompt",dest="prompt")
parser.add_argument("-f", "--file", type=str, default=None, help="file containing initial prompt to load",dest="file")
parser.add_argument("-n", "--n_predict", type=int, default=128, help="number of tokens to predict (-1 = infinity)",dest="n_predict")
parser.add_argument("--n_parts", type=int, default=-1, help="number of model parts", dest="n_parts")
parser.add_argument("-c", "--ctx_size", type=int, default=512, help="size of the prompt context",dest="n_ctx")
parser.add_argument("--memory_f32", action="store_false", help="use f32 instead of f16 for memory key+value",dest="memory_f16")
parser.add_argument("--top_p", type=float, default=0.95, help="top-p samplin",dest="top_p")
parser.add_argument("--top_k", type=int, default=40, help="top-k sampling",dest="top_k")
parser.add_argument("--temp", type=float, default=0.80, help="temperature",dest="temp")
parser.add_argument("--n_predict", type=int, default=128, help="number of tokens to predict (-1 = infinity)",dest="n_predict")
parser.add_argument("--repeat_last_n", type=int, default=64, help="last n tokens to consider for penalize ",dest="repeat_last_n")
parser.add_argument("--repeat_penalty", type=float, default=1.10, help="penalize repeat sequence of tokens",dest="repeat_penalty")
parser.add_argument("-b", "--batch_size", type=int, default=8, help="batch size for prompt processing",dest="n_batch")
parser.add_argument("--keep", type=int, default=0, help="number of tokens to keep from the initial prompt",dest="n_keep")
parser.add_argument(
"-l",
"--logit-bias",
type=str,
action='append',
help="--logit-bias TOKEN_ID(+/-)BIAS",
dest="logit_bias_str"
)
parser.add_argument("--ignore-eos", action="store_true", help="ignore end of stream token and continue generating", dest="ignore_eos")
parser.add_argument("--top_k", type=int, default=40, help="top-k sampling",dest="top_k")
parser.add_argument("--top_p", type=float, default=0.95, help="top-p samplin",dest="top_p")
parser.add_argument("--tfs", type=float, default=1.0, help="tail free sampling, parameter z (1.0 = disabled)",dest="tfs_z")
parser.add_argument("--temp", type=float, default=0.80, help="temperature",dest="temp")
parser.add_argument("--repeat_penalty", type=float, default=1.10, help="penalize repeat sequence of tokens",dest="repeat_penalty")
parser.add_argument("--repeat_last_n", type=int, default=64, help="last n tokens to consider for penalize ",dest="repeat_last_n")
parser.add_argument("--frequency_penalty", type=float, default=0.0, help="repeat alpha frequency penalty (0.0 = disabled)",dest="tfs_z")
parser.add_argument("--presence_penalty", type=float, default=0.0, help="repeat alpha presence penalty (0.0 = disabled)",dest="presence_penalty")
parser.add_argument("--mirostat", type=float, default=1.0, help="use Mirostat sampling.",dest="mirostat")
parser.add_argument("--mirostat_ent", type=float, default=5.0, help="Mirostat target entropy, parameter tau represents the average surprise value",dest="mirostat_tau")
parser.add_argument("--mirostat_lr", type=float, default=0.1, help="Mirostat learning rate, parameter eta",dest="mirostat_eta")
parser.add_argument("-m", "--model", type=str, default="./models/llama-7B/ggml-model.bin", help="model path",dest="model")
parser.add_argument(
"-i", "--interactive", action="store_true", help="run in interactive mode", dest="interactive"
)
parser.add_argument("--embedding", action="store_true", help="", dest="embedding")
parser.add_argument(
"--interactive-start",
action="store_true",
help="run in interactive mode",
dest="interactive"
)
parser.add_argument(
"--interactive-first",
action="store_true",
help="run in interactive mode and wait for input right away",
dest="interactive_start"
)
parser.add_argument(
"-ins",
"--instruct",
action="store_true",
help="run in instruction mode (use with Alpaca or Vicuna models)",
dest="instruct"
)
parser.add_argument(
"--color",
action="store_true",
help="colorise output to distinguish prompt and user input from generations",
dest="use_color"
)
parser.add_argument("--mlock", action="store_true",help="force system to keep model in RAM rather than swapping or compressing",dest="use_mlock")
parser.add_argument("--no-mmap", action="store_false",help="do not memory-map model (slower load but may reduce pageouts if not using mlock)",dest="use_mmap")
parser.add_argument("--mtest", action="store_true",help="compute maximum memory usage",dest="mem_test")
parser.add_argument("--verbose-prompt", action="store_true",help="print prompt before generation",dest="verbose_prompt")
parser.add_argument("-p", "--prompt", type=str, default="", help="initial prompt",dest="prompt")
parser.add_argument("-f", "--file", type=str, default=None, help="file containing initial prompt to load",dest="file")
parser.add_argument("--session", type=str, default=None, help="file to cache model state in (may be large!)",dest="path_session")
parser.add_argument("--in-prefix", type=str, default="", help="string to prefix user inputs with", dest="input_prefix")
parser.add_argument("--in-suffix", type=str, default="", help="append to input", dest="input_suffix")
parser.add_argument(
"-r",
"--reverse-prompt",
@ -122,16 +119,70 @@ def gpt_params_parse(argv = None, params: Optional[GptParams] = None):
help="poll user input upon seeing PROMPT (can be\nspecified more than once for multiple prompts).",
dest="antiprompt"
)
parser.add_argument("--perplexity", action="store_true", help="compute perplexity over the prompt", dest="perplexity")
parser.add_argument("--ignore-eos", action="store_true", help="ignore end of stream token and continue generating", dest="ignore_eos")
parser.add_argument("--n_parts", type=int, default=-1, help="number of model parts", dest="n_parts")
parser.add_argument("--lora", type=str, default="", help="apply LoRA adapter (implies --no-mmap)", dest="lora_adapter")
parser.add_argument("--lora-base", type=str, default="", help="optional model to use as a base for the layers modified by the LoRA adapter", dest="lora_base")
parser.add_argument("--memory_f32", action="store_false", help="use f32 instead of f16 for memory key+value",dest="memory_f16")
parser.add_argument("--random-prompt", action="store_true", help="start with a randomized prompt.", dest="random_prompt")
parser.add_argument("--in-prefix", type=str, default="", help="string to prefix user inputs with", dest="input_prefix")
parser.add_argument(
"--color",
action="store_true",
help="colorise output to distinguish prompt and user input from generations",
dest="use_color"
)
parser.add_argument(
"-i", "--interactive", action="store_true", help="run in interactive mode", dest="interactive"
)
parser.add_argument("--embedding", action="store_true", help="", dest="embedding")
parser.add_argument(
"--interactive-first",
action="store_true",
help="run in interactive mode and wait for input right away",
dest="interactive_start"
)
parser.add_argument(
"-ins",
"--instruct",
action="store_true",
help="run in instruction mode (use with Alpaca or Vicuna models)",
dest="instruct"
)
parser.add_argument("--no-penalize-nl", action="store_false", help="do not penalize newline token", dest="penalize_nl")
parser.add_argument("--perplexity", action="store_true", help="compute perplexity over the prompt", dest="perplexity")
parser.add_argument("--no-mmap", action="store_false",help="do not memory-map model (slower load but may reduce pageouts if not using mlock)",dest="use_mmap")
parser.add_argument("--mlock", action="store_true",help="force system to keep model in RAM rather than swapping or compressing",dest="use_mlock")
parser.add_argument("--mtest", action="store_true",help="compute maximum memory usage",dest="mem_test")
parser.add_argument("--verbose-prompt", action="store_true",help="print prompt before generation",dest="verbose_prompt")
#Custom args
parser.add_argument("--fix-prefix", type=str, default="", help="append to input when generated n_predict tokens", dest="fix_prefix")
parser.add_argument("--out-postfix", type=str, default="", help="append to input", dest="output_postfix")
parser.add_argument("--input-noecho", action="store_false", help="dont output the input", dest="input_echo")
parser.add_argument(
"--interactive-start",
action="store_true",
help="run in interactive mode",
dest="interactive"
)
args = parser.parse_args(argv)
return args
logit_bias_str = args.logit_bias_str
delattr(args, "logit_bias_str")
params = GptParams(**vars(args))
if (params.lora_adapter):
params.use_mmap = False
if (logit_bias_str != None):
for i in logit_bias_str:
if (m := re.match(r"(\d+)([-+]\d+)", i)):
params.logit_bias[int(m.group(1))] = float(m.group(2))
return params
def gpt_random_prompt(rng):
return [
@ -148,4 +199,4 @@ def gpt_random_prompt(rng):
][rng % 10]
if __name__ == "__main__":
print(GptParams(gpt_params_parse()))
print(gpt_params_parse())

View file

@ -10,40 +10,14 @@ Quirks:
You should also still be feeding the model with a "primer" prompt that
shows it the expected format.
"""
import ctypes
import sys
from time import time
from os import cpu_count
from os import cpu_count, path
import llama_cpp
from common import GptParams, gpt_params_parse, gpt_random_prompt
ANSI_COLOR_RESET = "\x1b[0m"
ANSI_COLOR_YELLOW = "\x1b[33m"
ANSI_BOLD = "\x1b[1m"
ANSI_COLOR_GREEN = "\x1b[32m"
CONSOLE_COLOR_DEFAULT = ANSI_COLOR_RESET
CONSOLE_COLOR_PROMPT = ANSI_COLOR_YELLOW
CONSOLE_COLOR_USER_INPUT = ANSI_BOLD + ANSI_COLOR_GREEN
# Iterative search
# Actively searches and prevents a pattern from being returned
class IterSearch:
def __init__(self, pattern):
self.pattern = list(pattern)
self.buffer = []
def __call__(self, char):
self.buffer += [char]
if (self.pattern[:len(self.buffer)] == self.buffer):
if (len(self.buffer) >= len(self.pattern)):
self.buffer.clear()
return []
_tmp = self.buffer[:]
self.buffer.clear()
return _tmp
import util
# A LLaMA interactive session
class LLaMAInteract:
@ -77,9 +51,11 @@ specified) expect poor results""", file=sys.stderr)
# runtime args
self.input_consumed = 0
self.n_past = 0
self.n_session_consumed = 0
self.first_antiprompt = []
self.remaining_tokens = self.params.n_predict
self.output_echo = self.params.input_echo
self.multibyte_fix = []
# model load
self.lparams = llama_cpp.llama_context_default_params()
@ -94,6 +70,19 @@ specified) expect poor results""", file=sys.stderr)
if (not self.ctx):
raise RuntimeError(f"error: failed to load model '{self.params.model}'")
if (self.params.ignore_eos):
self.params.logit_bias[llama_cpp.llama_token_eos()] = -float("inf")
if (len(self.params.lora_adapter) > 0):
if (llama_cpp.llama_apply_lora_from_file(
self.ctx,
self.params.lora_adapter.encode("utf8"),
self.params.lora_base.encode("utf8") if len(self.params.lora_base) > 0 else None,
self.params.n_threads
) != 0):
print("error: failed to apply lora adapter")
return
print(file=sys.stderr)
print(f"system_info: n_threads = {self.params.n_threads} / {cpu_count()} \
| {llama_cpp.llama_print_system_info().decode('utf8')}", file=sys.stderr)
@ -117,13 +106,52 @@ specified) expect poor results""", file=sys.stderr)
with open(self.params.file) as f:
self.params.prompt = f.read()
self.session_tokens: list[llama_cpp.llama_token] = []
if (len(self.params.path_session) > 0):
print(f"attempting to load saved session from '{self.params.path_session}'", file=sys.stderr)
if (path.exists(self.params.path_session)):
_session_tokens = (llama_cpp.llama_token * (self.params.n_ctx))()
_n_token_count_out = llama_cpp.c_size_t()
if (llama_cpp.llama_load_session_file(
self.ctx,
self.params.path_session.encode("utf8"),
_session_tokens,
self.params.n_ctx,
ctypes.byref(_n_token_count_out)
) != 1):
print(f"error: failed to load session file '{self.params.path_session}'", file=sys.stderr)
return
_n_token_count_out = _n_token_count_out.value
self.session_tokens = _session_tokens[:_n_token_count_out]
print(f"loaded a session with prompt size of {_n_token_count_out} tokens", file=sys.stderr)
else:
print(f"session file does not exist, will create", file=sys.stderr)
# tokenize the prompt
self.embd = []
self.embd_inp = self._tokenize(self.params.prompt)
if (len(self.embd_inp) > self.params.n_ctx - 4):
if (len(self.embd_inp) > self.n_ctx - 4):
raise RuntimeError(f"error: prompt is too long ({len(self.embd_inp)} tokens, max {self.params.n_ctx - 4})")
# debug message about similarity of saved session, if applicable
self.n_matching_session_tokens = 0
if len(self.session_tokens) > 0:
for id in self.session_tokens:
if self.n_matching_session_tokens >= len(self.embd_inp) or id != self.embd_inp[self.n_matching_session_tokens]:
break
self.n_matching_session_tokens += 1
if self.n_matching_session_tokens >= len(self.embd_inp):
print(f"session file has exact match for prompt!")
elif self.n_matching_session_tokens < (len(self.embd_inp) / 2):
print(f"warning: session file has low similarity to prompt ({self.n_matching_session_tokens} / {len(self.embd_inp)} tokens); will mostly be reevaluated")
else:
print(f"session file matches {self.n_matching_session_tokens} / {len(self.embd_inp)} tokens of prompt")
self.need_to_save_session = len(self.params.path_session) > 0 and self.n_matching_session_tokens < (len(self.embd_inp) * 3 / 4)
# number of tokens to keep when resetting context
if (self.params.n_keep < 0 or self.params.n_keep > len(self.embd_inp) or self.params.instruct):
self.params.n_keep = len(self.embd_inp)
@ -132,11 +160,12 @@ specified) expect poor results""", file=sys.stderr)
self.inp_suffix = self._tokenize(self.params.instruct_inp_suffix, False)
# in instruct mode, we inject a prefix and a suffix to each input by the user
self.antiecho = None
if (self.params.instruct):
self.params.interactive_start = True
_ptn = self._tokenize(self.params.instruct_inp_prefix.strip(), False)
self.first_antiprompt.append(_ptn)
self.antiecho = IterSearch(_ptn)
self.antiecho = util.IterSearch(_ptn)
# enable interactive mode if reverse prompt or interactive start is specified
if (len(self.params.antiprompt) != 0 or self.params.interactive_start):
@ -171,16 +200,24 @@ number of tokens in prompt = {len(self.embd_inp)}""", file=sys.stderr)
if len(self.params.input_prefix) > 0:
print(f"Input prefix: '{self.params.input_prefix}'", file=sys.stderr)
print(f"""sampling: temp = {self.params.temp},\
print(f"""sampling: repeat_last_n = {self.params.repeat_last_n},\
repeat_penalty = {self.params.repeat_penalty},\
presence_penalty = {self.params.presence_penalty},\
frequency_penalty = {self.params.frequency_penalty},\
top_k = {self.params.top_k},\
tfs_z = {self.params.tfs_z},\
top_p = {self.params.top_p},\
repeat_last_n = {self.params.repeat_last_n},\
repeat_penalty = {self.params.repeat_penalty}
typical_p = {self.params.typical_p},\
temp = {self.params.temp},\
mirostat = {self.params.mirostat},\
mirostat_lr = {self.params.mirostat_eta},\
mirostat_ent = {self.params.mirostat_tau},\
generate: n_ctx = {self.n_ctx}, \
n_batch = {self.params.n_batch}, \
n_predict = {self.params.n_predict}, \
generate: n_ctx = {self.n_ctx},\
n_batch = {self.params.n_batch},\
n_predict = {self.params.n_predict},\
n_keep = {self.params.n_keep}
""", file=sys.stderr)
# determine antiprompt tokens
@ -196,11 +233,11 @@ n_keep = {self.params.n_keep}
- If you want to submit another line, end your input in '\\'.
""", file=sys.stderr)
self.set_color(CONSOLE_COLOR_PROMPT)
self.set_color(util.CONSOLE_COLOR_PROMPT)
# tokenize a prompt
def _tokenize(self, prompt, bos=True):
_arr = (llama_cpp.llama_token * (len(prompt) + 1))()
_arr = (llama_cpp.llama_token * ((len(prompt) + 1) * 4))()
_n = llama_cpp.llama_tokenize(self.ctx, prompt.encode("utf8", errors="ignore"), _arr, len(_arr), bos)
return _arr[:_n]
@ -229,31 +266,116 @@ n_keep = {self.params.n_keep}
self.n_ctx - int(n_left/2) - len(self.embd):-len(self.embd)
]
self.embd = _insert + self.embd
self.params.path_session = ""
# try to reuse a matching prefix from the loaded session instead of re-eval (via n_past)
if self.n_session_consumed < len(self.session_tokens):
for i in range(len(self.embd)):
if self.embd[i] != self.session_tokens[self.n_session_consumed]:
self.session_tokens = self.session_tokens[:self.n_session_consumed]
break
self.n_past += 1
self.n_session_consumed += 1
if self.n_session_consumed >= len(self.session_tokens):
i += 1
break
if i > 0:
self.embd = self.embd[i:]
# evaluate tokens in batches
# embd is typically prepared beforehand to fit within a batch, but not always
#TODO BUG: The batching code causes nonsensical generation
"""for i in range(0, len(self.embd), self.params.n_batch):
n_eval = self.params.n_batch
_arr = (llama_cpp.llama_token * n_eval)(*self.embd[i:i + n_eval])
if llama_cpp.llama_eval(self.ctx, _arr, n_eval, self.n_past, self.params.n_threads) != 0:
print(f"failed to eval")
return
self.n_past += n_eval"""
if (llama_cpp.llama_eval(
self.ctx, (llama_cpp.llama_token * len(self.embd))(*self.embd), len(self.embd), self.n_past, self.params.n_threads
) != 0):
raise Exception("Failed to llama_eval!")
if len(self.embd) > 0 and len(self.params.path_session) > 0:
self.session_tokens.extend(self.embd)
self.n_session_consumed = len(self.session_tokens)
self.n_past += len(self.embd)
self.embd = []
if len(self.embd_inp) <= self.input_consumed:
if len(self.embd_inp) <= self.input_consumed: #&& !is_interacting
# out of user input, sample next token
top_k = llama_cpp.llama_n_vocab(self.ctx) if self.params.top_k <= 0 else self.params.top_k
repeat_last_n = self.n_ctx if self.params.repeat_last_n < 0 else self.params.repeat_last_n
if (self.params.ignore_eos):
logits = llama_cpp.llama_get_logits(self.ctx)
logits[llama_cpp.llama_token_eos()] = llama_cpp.c_float(0)
# optionally save the session on first sample (for faster prompt loading next time)
if len(self.params.path_session) > 0 and self.need_to_save_session:
self.need_to_save_session = False
llama_cpp.llama_save_session_file(
self.ctx,
self.params.path_session.encode("utf8"),
(llama_cpp.llama_token * len(self.session_tokens))(*self.session_tokens),
len(self.session_tokens)
)
id = 0
logits = llama_cpp.llama_get_logits(self.ctx)
n_vocab = llama_cpp.llama_n_vocab(self.ctx)
# Apply params.logit_bias map
for key, value in self.params.logit_bias.items():
logits[key] += value
_arr = (llama_cpp.llama_token_data * n_vocab)(*[
llama_cpp.llama_token_data(token_id, logits[token_id], 0.0)
for token_id in range(n_vocab)
])
candidates_p = llama_cpp.ctypes.pointer(llama_cpp.llama_token_data_array(_arr, len(_arr), False))
# Apply penalties
nl_logit = logits[llama_cpp.llama_token_nl()]
last_n_repeat = min(len(self.last_n_tokens), repeat_last_n, self.n_ctx)
_arr = (llama_cpp.llama_token * last_n_repeat)(*self.last_n_tokens[len(self.last_n_tokens) - last_n_repeat:])
llama_cpp.llama_sample_repetition_penalty(self.ctx, candidates_p,
_arr,
last_n_repeat, llama_cpp.c_float(self.params.repeat_penalty))
llama_cpp.llama_sample_frequency_and_presence_penalties(self.ctx, candidates_p,
_arr,
last_n_repeat, llama_cpp.c_float(self.params.frequency_penalty), llama_cpp.c_float(self.params.presence_penalty))
if not self.params.penalize_nl:
logits[llama_cpp.llama_token_nl()] = nl_logit
if self.params.temp <= 0:
# Greedy sampling
id = llama_cpp.llama_sample_token_greedy(self.ctx, candidates_p)
else:
if self.params.mirostat == 1:
mirostat_mu = 2.0 * self.params.mirostat_tau
mirostat_m = 100
llama_cpp.llama_sample_temperature(self.ctx, candidates_p, llama_cpp.c_float(self.params.temp))
id = llama_cpp.llama_sample_token_mirostat(self.ctx, candidates_p, llama_cpp.c_float(self.params.mirostat_tau), llama_cpp.c_float(self.params.mirostat_eta), llama_cpp.c_int(mirostat_m), llama_cpp.c_float(mirostat_mu))
elif self.params.mirostat == 2:
mirostat_mu = 2.0 * self.params.mirostat_tau
llama_cpp.llama_sample_temperature(self.ctx, candidates_p, llama_cpp.c_float(self.params.temp))
id = llama_cpp.llama_sample_token_mirostat_v2(self.ctx, candidates_p, llama_cpp.c_float(self.params.mirostat_tau), llama_cpp.c_float(self.params.mirostat_eta), llama_cpp.c_float(mirostat_mu))
else:
# Temperature sampling
llama_cpp.llama_sample_top_k(self.ctx, candidates_p, top_k)
llama_cpp.llama_sample_tail_free(self.ctx, candidates_p, llama_cpp.c_float(self.params.tfs_z))
llama_cpp.llama_sample_typical(self.ctx, candidates_p, llama_cpp.c_float(self.params.typical_p))
llama_cpp.llama_sample_top_p(self.ctx, candidates_p, llama_cpp.c_float(self.params.top_p))
llama_cpp.llama_sample_temperature(self.ctx, candidates_p, llama_cpp.c_float(self.params.temp))
id = llama_cpp.llama_sample_token(self.ctx, candidates_p)
# print("`{}`".format(candidates_p.size))
_arr = self.last_n_tokens[-min(self.params.repeat_last_n, self.n_past):]
id = llama_cpp.llama_sample_top_p_top_k(
self.ctx,
(llama_cpp.llama_token * len(_arr))(*_arr),
len(_arr),
self.params.top_k,
self.params.top_p,
self.params.temp,
self.params.repeat_penalty,
)
self.last_n_tokens.pop(0)
self.last_n_tokens.append(id)
@ -288,7 +410,7 @@ n_keep = {self.params.n_keep}
# display tokens
if self.output_echo:
for id in self.embd:
if self.params.instruct:
if self.antiecho != None:
for r in self.antiecho(id):
yield r
else:
@ -296,7 +418,7 @@ n_keep = {self.params.n_keep}
# reset color to default if we there is no pending user input
if (self.params.input_echo and len(self.embd_inp) == self.input_consumed):
self.set_color(CONSOLE_COLOR_DEFAULT)
self.set_color(util.CONSOLE_COLOR_DEFAULT)
if (self.params.interactive and len(self.embd_inp) <= self.input_consumed):
# if antiprompt is present, stop
@ -316,7 +438,7 @@ n_keep = {self.params.n_keep}
if (not self.params.instruct):
for i in self.llama_token_eot:
yield i
break
break
# respect n_predict even if antiprompt is present
if (self.params.interactive and self.remaining_tokens <= 0 and self.params.n_predict != -1):
@ -337,12 +459,12 @@ n_keep = {self.params.n_keep}
def exit(self):
llama_cpp.llama_free(self.ctx)
self.set_color(CONSOLE_COLOR_DEFAULT)
self.set_color(util.CONSOLE_COLOR_DEFAULT)
# return past text
def past(self):
for id in self.last_n_tokens[-self.n_past:]:
yield llama_cpp.llama_token_to_str(self.ctx, id).decode("utf-8", errors="ignore")
yield llama_cpp.llama_token_to_str(self.ctx, id).decode("utf8", errors="ignore")
# write input
def input(self, prompt: str):
@ -356,7 +478,29 @@ n_keep = {self.params.n_keep}
def output(self):
self.remaining_tokens = self.params.n_predict
for id in self.generate():
yield llama_cpp.llama_token_to_str(self.ctx, id).decode("utf-8", errors="ignore")
cur_char = llama_cpp.llama_token_to_str(self.ctx, id)
# Add remainder of missing bytes
if None in self.multibyte_fix:
self.multibyte_fix[self.multibyte_fix.index(None)] = cur_char
# Return completed utf char
if len(self.multibyte_fix) > 0 and not None in self.multibyte_fix:
yield (b"".join(self.multibyte_fix)).decode("utf8")
self.multibyte_fix = []
continue
# Contains multi-byte UTF8
for num, pattern in [(2, 192), (3, 224), (4, 240)]:
# Bitwise AND check
if pattern & int.from_bytes(cur_char) == pattern:
self.multibyte_fix = [cur_char] + ([None] * (num-1))
# Stop incomplete bytes from passing
if len(self.multibyte_fix) > 0:
continue
yield cur_char.decode("utf8")
# read user input
def read_input(self):
@ -372,21 +516,21 @@ n_keep = {self.params.n_keep}
self.params.input_echo = False
while self.params.interactive:
self.set_color(CONSOLE_COLOR_USER_INPUT)
self.set_color(util.CONSOLE_COLOR_USER_INPUT)
if (self.params.instruct):
print('\n> ', end="")
self.input(self.read_input())
else:
print(self.params.input_prefix, end="")
self.input(f"{self.params.input_prefix}{self.read_input()}{self.params.output_postfix}")
print(self.params.output_postfix,end="")
self.set_color(CONSOLE_COLOR_DEFAULT)
self.input(f"{self.params.input_prefix}{self.read_input()}{self.params.input_suffix}")
print(self.params.input_suffix,end="")
self.set_color(util.CONSOLE_COLOR_DEFAULT)
try:
for i in self.output():
print(i,end="",flush=True)
except KeyboardInterrupt:
self.set_color(CONSOLE_COLOR_DEFAULT)
self.set_color(util.CONSOLE_COLOR_DEFAULT)
if not self.params.instruct:
print(self.params.fix_prefix,end="")
self.input(self.params.fix_prefix)
@ -415,8 +559,7 @@ The transcript only includes text, it does not include markup like HTML and Mark
{USER_NAME}: Name a color.
{AI_NAME}: Blue
{USER_NAME}:"""
args = gpt_params_parse()
params = GptParams(**vars(args))
params = gpt_params_parse()
with LLaMAInteract(params) as m:
m.interact()

View file

@ -37,6 +37,10 @@ embd = []
last_n_size = 64
last_n_tokens_data = [0] * last_n_size
n_batch = 24
last_n_repeat = 64
repeat_penalty = 1
frequency_penalty = 0.0
presence_penalty = 0.0
while remaining_tokens > 0:
if len(embd) > 0:
@ -47,15 +51,28 @@ while remaining_tokens > 0:
n_past += len(embd)
embd = []
if len(embd_inp) <= input_consumed:
id = llama_cpp.llama_sample_top_p_top_k(
ctx,
(llama_cpp.c_int * len(last_n_tokens_data))(*last_n_tokens_data),
len(last_n_tokens_data),
40,
0.8,
0.2,
1.0 / 0.85,
)
logits = llama_cpp.llama_get_logits(ctx)
n_vocab = llama_cpp.llama_n_vocab(ctx)
_arr = (llama_cpp.llama_token_data * n_vocab)(*[
llama_cpp.llama_token_data(token_id, logits[token_id], 0.0)
for token_id in range(n_vocab)
])
candidates_p = llama_cpp.ctypes.pointer(llama_cpp.llama_token_data_array(_arr, len(_arr), False))
_arr = (llama_cpp.c_int * len(last_n_tokens_data))(*last_n_tokens_data)
llama_cpp.llama_sample_repetition_penalty(ctx, candidates_p,
_arr,
last_n_repeat, repeat_penalty)
llama_cpp.llama_sample_frequency_and_presence_penalties(ctx, candidates_p,
_arr,
last_n_repeat, frequency_penalty, presence_penalty)
llama_cpp.llama_sample_top_k(ctx, candidates_p, 40)
llama_cpp.llama_sample_top_p(ctx, candidates_p, 0.8)
llama_cpp.llama_sample_temperature(ctx, candidates_p, 0.2)
id = llama_cpp.llama_sample_token(ctx, candidates_p)
last_n_tokens_data = last_n_tokens_data[1:] + [id]
embd.append(id)
input_noecho = False

View file

@ -0,0 +1,95 @@
ANSI_COLOR_RESET = "\x1b[0m"
ANSI_COLOR_YELLOW = "\x1b[33m"
ANSI_BOLD = "\x1b[1m"
ANSI_COLOR_GREEN = "\x1b[32m"
CONSOLE_COLOR_DEFAULT = ANSI_COLOR_RESET
CONSOLE_COLOR_PROMPT = ANSI_COLOR_YELLOW
CONSOLE_COLOR_USER_INPUT = ANSI_BOLD + ANSI_COLOR_GREEN
# Iterative search
# Actively searches and prevents a pattern from being returned
class IterSearch:
def __init__(self, pattern):
self.pattern = list(pattern)
self.buffer = []
def __call__(self, char):
self.buffer += [char]
if (self.pattern[:len(self.buffer)] == self.buffer):
if (len(self.buffer) >= len(self.pattern)):
self.buffer.clear()
return []
_tmp = self.buffer[:]
self.buffer.clear()
return _tmp
class Circle:
def __init__(self, size, default=0):
self.list = [default] * size
self.maxsize = size
self.size = 0
self.offset = 0
def append(self, elem):
if self.size < self.maxsize:
self.list[self.size] = elem
self.size += 1
else:
self.list[self.offset] = elem
self.offset = (self.offset + 1) % self.maxsize
def __getitem__(self, val):
if isinstance(val, int):
if 0 > val or val >= self.size:
raise IndexError('Index out of range')
return self.list[val] if self.size < self.maxsize else self.list[(self.offset + val) % self.maxsize]
elif isinstance(val, slice):
start, stop, step = val.start, val.stop, val.step
if step is None:
step = 1
if start is None:
start = 0
if stop is None:
stop = self.size
if start < 0:
start = self.size + start
if stop < 0:
stop = self.size + stop
indices = range(start, stop, step)
return [self.list[(self.offset + i) % self.maxsize] for i in indices if i < self.size]
else:
raise TypeError('Invalid argument type')
if __name__ == "__main__":
c = Circle(5)
c.append(1)
print(c.list)
print(c[:])
assert c[0] == 1
assert c[:5] == [1]
for i in range(2,5+1):
c.append(i)
print(c.list)
print(c[:])
assert c[0] == 1
assert c[:5] == [1,2,3,4,5]
for i in range(5+1,9+1):
c.append(i)
print(c.list)
print(c[:])
assert c[0] == 5
assert c[:5] == [5,6,7,8,9]
#assert c[:-5] == [5,6,7,8,9]
assert c[:10] == [5,6,7,8,9]

View file

@ -5,7 +5,7 @@ import time
import math
import multiprocessing
from typing import List, Optional, Union, Generator, Sequence, Iterator, Deque, Tuple
from collections import deque
from collections import deque, OrderedDict
from . import llama_cpp
from .llama_types import *
@ -14,46 +14,59 @@ from .llama_types import *
class LlamaCache:
"""Cache for a llama.cpp model."""
def __init__(self):
self.cache_state: Dict[Tuple[llama_cpp.llama_token, ...], "LlamaState"] = dict()
def __init__(self, capacity_bytes: int = (2 << 30)):
self.cache_state: OrderedDict[
Tuple[llama_cpp.llama_token, ...], "LlamaState"
] = OrderedDict()
self.capacity_bytes = capacity_bytes
def _sorted_keys(self) -> List[Tuple[llama_cpp.llama_token, ...]]:
return [
key
for _, key in sorted(
((len(key), key) for key in self.cache_state.keys()), reverse=True
)
]
@property
def cache_size(self):
return sum([state.llama_state_size for state in self.cache_state.values()])
def _find_key(
self, key: Tuple[llama_cpp.llama_token, ...]
def _find_longest_prefix_key(
self,
key: Tuple[llama_cpp.llama_token, ...],
) -> Optional[Tuple[llama_cpp.llama_token, ...]]:
for k in self._sorted_keys():
if key[: len(k)] == k:
return k
return None
min_len = 0
min_key = None
keys = (
(k, Llama.longest_token_prefix(k, key)) for k in self.cache_state.keys()
)
for k, prefix_len in keys:
if prefix_len > min_len:
min_len = prefix_len
min_key = k
return min_key
def __getitem__(self, key: Sequence[llama_cpp.llama_token]) -> "LlamaState":
_key = self._find_key(tuple(key))
key = tuple(key)
_key = self._find_longest_prefix_key(key)
if _key is None:
raise KeyError(f"Key not found: {key}")
return self.cache_state[_key]
raise KeyError(f"Key not found")
value = self.cache_state[_key]
self.cache_state.move_to_end(_key)
return value
def __contains__(self, key: Sequence[llama_cpp.llama_token]) -> bool:
return self._find_key(tuple(key)) is not None
return self._find_longest_prefix_key(tuple(key)) is not None
def __setitem__(self, key: Sequence[llama_cpp.llama_token], value: "LlamaState"):
self.cache_state = dict() # NOTE: Currently limit to one cache entry.
self.cache_state[tuple(key)] = value
key = tuple(key)
if key in self.cache_state:
del self.cache_state[key]
self.cache_state[key] = value
while self.cache_size > self.capacity_bytes:
self.cache_state.popitem(last=False)
class LlamaState:
def __init__(
self,
eval_tokens: Deque[llama_cpp.llama_token],
eval_logits: Deque[List[llama_cpp.c_float]],
eval_logits: Deque[List[float]],
llama_state, # type: llama_cpp.Array[llama_cpp.c_uint8]
llama_state_size: llama_cpp.c_size_t,
llama_state_size: int,
):
self.eval_tokens = eval_tokens
self.eval_logits = eval_logits
@ -127,9 +140,7 @@ class Llama:
self.last_n_tokens_size = last_n_tokens_size
self.n_batch = min(n_ctx, n_batch)
self.eval_tokens: Deque[llama_cpp.llama_token] = deque(maxlen=n_ctx)
self.eval_logits: Deque[List[float]] = deque(
maxlen=n_ctx if logits_all else 1
)
self.eval_logits: Deque[List[float]] = deque(maxlen=n_ctx if logits_all else 1)
self.cache: Optional[LlamaCache] = None
@ -250,7 +261,7 @@ class Llama:
]
self.eval_logits.extend(logits)
def _sample_top_p_top_k(
def _sample(
self,
last_n_tokens_data, # type: llama_cpp.Array[llama_cpp.llama_token]
last_n_tokens_size: llama_cpp.c_int,
@ -263,6 +274,8 @@ class Llama:
mirostat_mu: llama_cpp.c_float,
mirostat_m: llama_cpp.c_int,
repeat_penalty: llama_cpp.c_float,
frequency_penalty: llama_cpp.c_float,
presence_penalty: llama_cpp.c_float,
):
assert self.ctx is not None
assert len(self.eval_logits) > 0
@ -289,24 +302,24 @@ class Llama:
ctx=self.ctx,
last_tokens_data=last_n_tokens_data,
last_tokens_size=last_n_tokens_size,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
penalty=repeat_penalty,
)
if mirostat_mode == 1:
if mirostat_mode.value == 1:
llama_cpp.llama_sample_temperature(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
temp=temp,
)
llama_cpp.llama_sample_token_mirostat(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
tau=mirostat_tau,
eta=mirostat_eta,
mu=mirostat_mu,
mu=llama_cpp.ctypes.byref(mirostat_mu), # type: ignore
m=mirostat_m
)
elif mirostat_mode == 2:
elif mirostat_mode.value == 2:
llama_cpp.llama_sample_temperature(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
@ -314,45 +327,57 @@ class Llama:
)
llama_cpp.llama_sample_token_mirostat_v2(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
tau=mirostat_tau,
eta=mirostat_eta,
mu=mirostat_mu
mu=llama_cpp.ctypes.byref(mirostat_mu) # type: ignore
)
elif float(temp.value) == 0.0:
llama_cpp.llama_sample_frequency_and_presence_penalties(
ctx=self.ctx,
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
last_tokens_data=last_n_tokens_data,
last_tokens_size=last_n_tokens_size,
alpha_frequency=frequency_penalty,
alpha_presence=presence_penalty,
)
if float(temp.value) == 0.0:
return llama_cpp.llama_sample_token_greedy(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
)
else:
llama_cpp.llama_sample_top_k(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
k=top_k,
min_keep=llama_cpp.c_size_t(1),
)
llama_cpp.llama_sample_tail_free(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
z=llama_cpp.c_float(1.0),
min_keep=llama_cpp.c_size_t(1),
)
llama_cpp.llama_sample_typical(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
p=llama_cpp.c_float(1.0),
min_keep=llama_cpp.c_size_t(1),
)
llama_cpp.llama_sample_top_p(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
p=top_p,
min_keep=llama_cpp.c_size_t(1),
)
llama_cpp.llama_sample_temperature(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
temp=temp,
)
return llama_cpp.llama_sample_token(
ctx=self.ctx,
candidates=llama_cpp.ctypes.pointer(candidates),
candidates=llama_cpp.ctypes.byref(candidates), # type: ignore
)
def sample(
@ -366,6 +391,8 @@ class Llama:
mirostat_mu: float,
mirostat_m: int,
repeat_penalty: float,
frequency_penalty: float = 0.0,
presence_penalty: float = 0.0,
):
"""Sample a token from the model.
@ -382,7 +409,7 @@ class Llama:
last_n_tokens_data = [llama_cpp.llama_token(0)] * max(
0, self.last_n_tokens_size - len(self.eval_tokens)
) + list(self.eval_tokens)[-self.last_n_tokens_size :]
return self._sample_top_p_top_k(
return self._sample(
last_n_tokens_data=(llama_cpp.llama_token * self.last_n_tokens_size)(
*last_n_tokens_data
),
@ -396,6 +423,8 @@ class Llama:
mirostat_eta=llama_cpp.c_float(mirostat_eta),
mirostat_m=llama_cpp.c_int(mirostat_m),
repeat_penalty=llama_cpp.c_float(repeat_penalty),
frequency_penalty=llama_cpp.c_float(frequency_penalty),
presence_penalty=llama_cpp.c_float(presence_penalty),
)
def generate(
@ -410,6 +439,8 @@ class Llama:
mirostat_mu: float,
mirostat_m: int,
repeat_penalty: float,
frequency_penalty: float = 0.0,
presence_penalty: float = 0.0,
reset: bool = True,
) -> Generator[
llama_cpp.llama_token, Optional[Sequence[llama_cpp.llama_token]], None
@ -468,6 +499,8 @@ class Llama:
mirostat_eta=mirostat_eta,
mirostat_mu=mirostat_mu,
mirostat_m=mirostat_m,
frequency_penalty=frequency_penalty,
presence_penalty=presence_penalty,
repeat_penalty=repeat_penalty,
)
tokens_or_none = yield token
@ -547,6 +580,8 @@ class Llama:
logprobs: Optional[int] = None,
echo: bool = False,
stop: Optional[List[str]] = [],
frequency_penalty: float = 0.0,
presence_penalty: float = 0.0,
repeat_penalty: float = 1.1,
top_k: int = 40,
stream: bool = False,
@ -581,10 +616,22 @@ class Llama:
"logprobs is not supported for models created with logits_all=False"
)
if self.cache and prompt_tokens in self.cache:
if self.verbose:
print("Llama._create_completion: cache hit", file=sys.stderr)
self.load_state(self.cache[prompt_tokens])
if self.cache:
try:
cache_item = self.cache[prompt_tokens]
cache_prefix_len = Llama.longest_token_prefix(
cache_item.eval_tokens, prompt_tokens
)
eval_prefix_len = Llama.longest_token_prefix(
self.eval_tokens, prompt_tokens
)
if cache_prefix_len > eval_prefix_len:
self.load_state(cache_item)
if self.verbose:
print("Llama._create_completion: cache hit", file=sys.stderr)
except KeyError:
if self.verbose:
print("Llama._create_completion: cache miss", file=sys.stderr)
finish_reason = "length"
multibyte_fix = 0
@ -598,6 +645,8 @@ class Llama:
mirostat_eta=mirostat_eta,
mirostat_mu=mirostat_mu,
mirostat_m=mirostat_m,
frequency_penalty=frequency_penalty,
presence_penalty=presence_penalty,
repeat_penalty=repeat_penalty,
):
if token == llama_cpp.llama_token_eos():
@ -605,12 +654,6 @@ class Llama:
finish_reason = "stop"
break
if self.cache and len(completion_tokens) == 0:
if prompt_tokens not in self.cache:
if self.verbose:
print("Llama._create_completion: cache miss", file=sys.stderr)
self.cache[prompt_tokens] = self.save_state()
completion_tokens.append(token)
all_text = self.detokenize(completion_tokens)
@ -669,6 +712,11 @@ class Llama:
finish_reason = "length"
break
if self.cache:
if self.verbose:
print("Llama._create_completion: cache save", file=sys.stderr)
self.cache[prompt_tokens + completion_tokens] = self.save_state()
if stream:
yield {
"id": completion_id,
@ -778,6 +826,8 @@ class Llama:
logprobs: Optional[int] = None,
echo: bool = False,
stop: Optional[List[str]] = [],
frequency_penalty: float = 0.0,
presence_penalty: float = 0.0,
repeat_penalty: float = 1.1,
top_k: int = 40,
stream: bool = False,
@ -818,6 +868,8 @@ class Llama:
logprobs=logprobs,
echo=echo,
stop=stop,
frequency_penalty=frequency_penalty,
presence_penalty=presence_penalty,
repeat_penalty=repeat_penalty,
top_k=top_k,
stream=stream,
@ -843,6 +895,8 @@ class Llama:
logprobs: Optional[int] = None,
echo: bool = False,
stop: Optional[List[str]] = [],
frequency_penalty: float = 0.0,
presence_penalty: float = 0.0,
repeat_penalty: float = 1.1,
top_k: int = 40,
stream: bool = False,
@ -883,6 +937,8 @@ class Llama:
logprobs=logprobs,
echo=echo,
stop=stop,
frequency_penalty=frequency_penalty,
presence_penalty=presence_penalty,
repeat_penalty=repeat_penalty,
top_k=top_k,
stream=stream,
@ -955,6 +1011,8 @@ class Llama:
stream: bool = False,
stop: Optional[List[str]] = [],
max_tokens: int = 256,
presence_penalty: float = 0.0,
frequency_penalty: float = 0.0,
repeat_penalty: float = 1.1,
) -> Union[ChatCompletion, Iterator[ChatCompletionChunk]]:
"""Generate a chat completion from a list of messages.
@ -988,6 +1046,8 @@ class Llama:
stream=stream,
max_tokens=max_tokens,
repeat_penalty=repeat_penalty,
presence_penalty=presence_penalty,
frequency_penalty=frequency_penalty,
)
if stream:
chunks: Iterator[CompletionChunk] = completion_or_chunks # type: ignore
@ -1085,3 +1145,15 @@ class Llama:
exps = [math.exp(float(x)) for x in logits]
sum_exps = sum(exps)
return [math.log(x / sum_exps) for x in exps]
@staticmethod
def longest_token_prefix(
a: Sequence[llama_cpp.llama_token], b: Sequence[llama_cpp.llama_token]
):
longest_prefix = 0
for _a, _b in zip(a, b):
if _a == _b:
longest_prefix += 1
else:
break
return longest_prefix

View file

@ -157,7 +157,7 @@ _lib.llama_context_default_params.argtypes = []
_lib.llama_context_default_params.restype = llama_context_params
def llama_mmap_supported() -> c_bool:
def llama_mmap_supported() -> bool:
return _lib.llama_mmap_supported()
@ -165,7 +165,7 @@ _lib.llama_mmap_supported.argtypes = []
_lib.llama_mmap_supported.restype = c_bool
def llama_mlock_supported() -> c_bool:
def llama_mlock_supported() -> bool:
return _lib.llama_mlock_supported()
@ -260,7 +260,7 @@ _lib.llama_get_state_size.restype = c_size_t
# Returns the number of bytes copied
def llama_copy_state_data(
ctx: llama_context_p, dest # type: Array[c_uint8]
) -> c_size_t:
) -> int:
return _lib.llama_copy_state_data(ctx, dest)
@ -272,7 +272,7 @@ _lib.llama_copy_state_data.restype = c_size_t
# Returns the number of bytes read
def llama_set_state_data(
ctx: llama_context_p, src # type: Array[c_uint8]
) -> c_size_t:
) -> int:
return _lib.llama_set_state_data(ctx, src)
@ -387,7 +387,9 @@ _lib.llama_n_embd.restype = c_int
# Can be mutated in order to change the probabilities of the next token
# Rows: n_tokens
# Cols: n_vocab
def llama_get_logits(ctx: llama_context_p): # type: (...) -> Array[float] # type: ignore
def llama_get_logits(
ctx: llama_context_p,
): # type: (...) -> Array[float] # type: ignore
return _lib.llama_get_logits(ctx)
@ -397,7 +399,9 @@ _lib.llama_get_logits.restype = c_float_p
# Get the embeddings for the input
# shape: [n_embd] (1-dimensional)
def llama_get_embeddings(ctx: llama_context_p): # type: (...) -> Array[float] # type: ignore
def llama_get_embeddings(
ctx: llama_context_p,
): # type: (...) -> Array[float] # type: ignore
return _lib.llama_get_embeddings(ctx)
@ -515,7 +519,7 @@ def llama_sample_top_k(
ctx: llama_context_p,
candidates, # type: _Pointer[llama_token_data_array]
k: c_int,
min_keep: c_size_t = c_size_t(1),
min_keep: c_size_t,
):
return _lib.llama_sample_top_k(ctx, candidates, k, min_keep)
@ -534,7 +538,7 @@ def llama_sample_top_p(
ctx: llama_context_p,
candidates, # type: _Pointer[llama_token_data_array]
p: c_float,
min_keep: c_size_t = c_size_t(1),
min_keep: c_size_t,
):
return _lib.llama_sample_top_p(ctx, candidates, p, min_keep)
@ -553,7 +557,7 @@ def llama_sample_tail_free(
ctx: llama_context_p,
candidates, # type: _Pointer[llama_token_data_array]
z: c_float,
min_keep: c_size_t = c_size_t(1),
min_keep: c_size_t,
):
return _lib.llama_sample_tail_free(ctx, candidates, z, min_keep)
@ -572,7 +576,7 @@ def llama_sample_typical(
ctx: llama_context_p,
candidates, # type: _Pointer[llama_token_data_array]
p: c_float,
min_keep: c_size_t = c_size_t(1),
min_keep: c_size_t,
):
return _lib.llama_sample_typical(ctx, candidates, p, min_keep)

View file

@ -58,7 +58,7 @@ class Completion(TypedDict):
class ChatCompletionMessage(TypedDict):
role: Union[Literal["assistant"], Literal["user"], Literal["system"]]
role: Literal["assistant", "user", "system"]
content: str
user: NotRequired[str]

View file

@ -31,16 +31,18 @@ from llama_cpp.server.app import create_app, Settings
if __name__ == "__main__":
parser = argparse.ArgumentParser()
for name, field in Settings.__fields__.items():
description = field.field_info.description
if field.default is not None and description is not None:
description += f" (default: {field.default})"
parser.add_argument(
f"--{name}",
dest=name,
type=field.type_,
default=field.default,
help=field.field_info.description,
help=description,
)
args = parser.parse_args()
settings = Settings(**vars(args))
settings = Settings(**{k: v for k, v in vars(args).items() if v is not None})
app = create_app(settings=settings)
uvicorn.run(

View file

@ -1,8 +1,8 @@
import os
import json
import multiprocessing
from threading import Lock
from typing import List, Optional, Union, Iterator, Dict
from typing_extensions import TypedDict, Literal, Annotated
from typing_extensions import TypedDict, Literal
import llama_cpp
@ -13,18 +13,48 @@ from sse_starlette.sse import EventSourceResponse
class Settings(BaseSettings):
model: str
n_ctx: int = 2048
n_batch: int = 512
n_threads: int = max((os.cpu_count() or 2) // 2, 1)
f16_kv: bool = True
use_mlock: bool = False # This causes a silent failure on platforms that don't support mlock (e.g. Windows) took forever to figure out...
use_mmap: bool = True
embedding: bool = True
last_n_tokens_size: int = 64
logits_all: bool = False
cache: bool = False # WARNING: This is an experimental feature
vocab_only: bool = False
model: str = Field(
description="The path to the model to use for generating completions."
)
n_ctx: int = Field(default=2048, ge=1, description="The context size.")
n_batch: int = Field(
default=512, ge=1, description="The batch size to use per eval."
)
n_threads: int = Field(
default=max(multiprocessing.cpu_count() // 2, 1),
ge=1,
description="The number of threads to use.",
)
f16_kv: bool = Field(default=True, description="Whether to use f16 key/value.")
use_mlock: bool = Field(
default=llama_cpp.llama_mlock_supported(),
description="Use mlock.",
)
use_mmap: bool = Field(
default=llama_cpp.llama_mmap_supported(),
description="Use mmap.",
)
embedding: bool = Field(default=True, description="Whether to use embeddings.")
last_n_tokens_size: int = Field(
default=64,
ge=0,
description="Last n tokens to keep for repeat penalty calculation.",
)
logits_all: bool = Field(default=True, description="Whether to return logits.")
cache: bool = Field(
default=False,
description="Use a cache to reduce processing times for evaluated prompts.",
)
cache_size: int = Field(
default=2 << 30,
description="The size of the cache in bytes. Only used if cache is True.",
)
vocab_only: bool = Field(
default=False, description="Whether to only return the vocabulary."
)
verbose: bool = Field(
default=True, description="Whether to print debug information."
)
router = APIRouter()
@ -60,9 +90,10 @@ def create_app(settings: Optional[Settings] = None):
n_ctx=settings.n_ctx,
last_n_tokens_size=settings.last_n_tokens_size,
vocab_only=settings.vocab_only,
verbose=settings.verbose,
)
if settings.cache:
cache = llama_cpp.LlamaCache()
cache = llama_cpp.LlamaCache(capacity_bytes=settings.cache_size)
llama.set_cache(cache)
return app
@ -75,18 +106,78 @@ def get_llama():
yield llama
model_field = Field(description="The model to use for generating completions.")
max_tokens_field = Field(
default=16, ge=1, le=2048, description="The maximum number of tokens to generate."
)
temperature_field = Field(
default=0.8,
ge=0.0,
le=2.0,
description="Adjust the randomness of the generated text.\n\n"
+ "Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run.",
)
top_p_field = Field(
default=0.95,
ge=0.0,
le=1.0,
description="Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P.\n\n"
+ "Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text.",
)
stop_field = Field(
default=None,
description="A list of tokens at which to stop generation. If None, no stop tokens are used.",
)
stream_field = Field(
default=False,
description="Whether to stream the results as they are generated. Useful for chatbots.",
)
top_k_field = Field(
default=40,
ge=0,
description="Limit the next token selection to the K most probable tokens.\n\n"
+ "Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text.",
)
repeat_penalty_field = Field(
default=1.1,
ge=0.0,
description="A penalty applied to each token that is already generated. This helps prevent the model from repeating itself.\n\n"
+ "Repeat penalty is a hyperparameter used to penalize the repetition of token sequences during text generation. It helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient.",
)
class CreateCompletionRequest(BaseModel):
prompt: Union[str, List[str]]
suffix: Optional[str] = Field(None)
max_tokens: int = 16
temperature: float = 0.8
top_p: float = 0.95
echo: bool = False
stop: Optional[List[str]] = []
stream: bool = False
prompt: Optional[str] = Field(
default="", description="The prompt to generate completions for."
)
suffix: Optional[str] = Field(
default=None,
description="A suffix to append to the generated text. If None, no suffix is appended. Useful for chatbots.",
)
max_tokens: int = max_tokens_field
temperature: float = temperature_field
top_p: float = top_p_field
echo: bool = Field(
default=False,
description="Whether to echo the prompt in the generated text. Useful for chatbots.",
)
stop: Optional[List[str]] = stop_field
stream: bool = stream_field
logprobs: Optional[int] = Field(
default=None,
ge=0,
description="The number of logprobs to generate. If None, no logprobs are generated.",
)
# ignored or currently unsupported
model: Optional[str] = Field(None)
model: Optional[str] = model_field
n: Optional[int] = 1
logprobs: Optional[int] = Field(None)
presence_penalty: Optional[float] = 0
@ -96,8 +187,8 @@ class CreateCompletionRequest(BaseModel):
user: Optional[str] = Field(None)
# llama.cpp specific parameters
top_k: int = 40
repeat_penalty: float = 1.1
top_k: int = top_k_field
repeat_penalty: float = repeat_penalty_field
class Config:
schema_extra = {
@ -118,16 +209,11 @@ CreateCompletionResponse = create_model_from_typeddict(llama_cpp.Completion)
def create_completion(
request: CreateCompletionRequest, llama: llama_cpp.Llama = Depends(get_llama)
):
if isinstance(request.prompt, list):
request.prompt = "".join(request.prompt)
completion_or_chunks = llama(
**request.dict(
exclude={
"model",
"n",
"frequency_penalty",
"presence_penalty",
"best_of",
"logit_bias",
"user",
@ -142,8 +228,8 @@ def create_completion(
class CreateEmbeddingRequest(BaseModel):
model: Optional[str]
input: str
model: Optional[str] = model_field
input: str = Field(description="The input to embed.")
user: Optional[str]
class Config:
@ -168,22 +254,24 @@ def create_embedding(
class ChatCompletionRequestMessage(BaseModel):
role: Union[Literal["system"], Literal["user"], Literal["assistant"]]
content: str
user: Optional[str] = None
role: Literal["system", "user", "assistant"] = Field(
default="user", description="The role of the message."
)
content: str = Field(default="", description="The content of the message.")
class CreateChatCompletionRequest(BaseModel):
model: Optional[str]
messages: List[ChatCompletionRequestMessage]
temperature: float = 0.8
top_p: float = 0.95
stream: bool = False
stop: Optional[List[str]] = []
max_tokens: int = 128
messages: List[ChatCompletionRequestMessage] = Field(
default=[], description="A list of messages to generate completions for."
)
max_tokens: int = max_tokens_field
temperature: float = temperature_field
top_p: float = top_p_field
stop: Optional[List[str]] = stop_field
stream: bool = stream_field
# ignored or currently unsupported
model: Optional[str] = Field(None)
model: Optional[str] = model_field
n: Optional[int] = 1
presence_penalty: Optional[float] = 0
frequency_penalty: Optional[float] = 0
@ -191,7 +279,8 @@ class CreateChatCompletionRequest(BaseModel):
user: Optional[str] = Field(None)
# llama.cpp specific parameters
repeat_penalty: float = 1.1
top_k: int = top_k_field
repeat_penalty: float = repeat_penalty_field
class Config:
schema_extra = {
@ -224,8 +313,6 @@ def create_chat_completion(
exclude={
"model",
"n",
"presence_penalty",
"frequency_penalty",
"logit_bias",
"user",
}
@ -266,7 +353,9 @@ GetModelResponse = create_model_from_typeddict(ModelList)
@router.get("/v1/models", response_model=GetModelResponse)
def get_models() -> ModelList:
def get_models(
llama: llama_cpp.Llama = Depends(get_llama),
) -> ModelList:
return {
"object": "list",
"data": [

103
poetry.lock generated
View file

@ -1,4 +1,4 @@
# This file is automatically @generated by Poetry 1.4.2 and should not be changed by hand.
# This file is automatically @generated by Poetry and should not be changed by hand.
[[package]]
name = "anyio"
@ -21,58 +21,39 @@ doc = ["packaging", "sphinx-autodoc-typehints (>=1.2.0)", "sphinx-rtd-theme"]
test = ["contextlib2", "coverage[toml] (>=4.5)", "hypothesis (>=4.0)", "mock (>=4)", "pytest (>=7.0)", "pytest-mock (>=3.6.1)", "trustme", "uvloop (<0.15)", "uvloop (>=0.15)"]
trio = ["trio (>=0.16,<0.22)"]
[[package]]
name = "attrs"
version = "22.2.0"
description = "Classes Without Boilerplate"
category = "dev"
optional = false
python-versions = ">=3.6"
files = [
{file = "attrs-22.2.0-py3-none-any.whl", hash = "sha256:29e95c7f6778868dbd49170f98f8818f78f3dc5e0e37c0b1f474e3561b240836"},
{file = "attrs-22.2.0.tar.gz", hash = "sha256:c9227bfc2f01993c03f68db37d1d15c9690188323c067c641f1a35ca58185f99"},
]
[package.extras]
cov = ["attrs[tests]", "coverage-enable-subprocess", "coverage[toml] (>=5.3)"]
dev = ["attrs[docs,tests]"]
docs = ["furo", "myst-parser", "sphinx", "sphinx-notfound-page", "sphinxcontrib-towncrier", "towncrier", "zope.interface"]
tests = ["attrs[tests-no-zope]", "zope.interface"]
tests-no-zope = ["cloudpickle", "cloudpickle", "hypothesis", "hypothesis", "mypy (>=0.971,<0.990)", "mypy (>=0.971,<0.990)", "pympler", "pympler", "pytest (>=4.3.0)", "pytest (>=4.3.0)", "pytest-mypy-plugins", "pytest-mypy-plugins", "pytest-xdist[psutil]", "pytest-xdist[psutil]"]
[[package]]
name = "black"
version = "23.1.0"
version = "23.3.0"
description = "The uncompromising code formatter."
category = "dev"
optional = false
python-versions = ">=3.7"
files = [
{file = "black-23.1.0-cp310-cp310-macosx_10_16_arm64.whl", hash = "sha256:b6a92a41ee34b883b359998f0c8e6eb8e99803aa8bf3123bf2b2e6fec505a221"},
{file = "black-23.1.0-cp310-cp310-macosx_10_16_universal2.whl", hash = "sha256:57c18c5165c1dbe291d5306e53fb3988122890e57bd9b3dcb75f967f13411a26"},
{file = "black-23.1.0-cp310-cp310-macosx_10_16_x86_64.whl", hash = "sha256:9880d7d419bb7e709b37e28deb5e68a49227713b623c72b2b931028ea65f619b"},
{file = "black-23.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e6663f91b6feca5d06f2ccd49a10f254f9298cc1f7f49c46e498a0771b507104"},
{file = "black-23.1.0-cp310-cp310-win_amd64.whl", hash = "sha256:9afd3f493666a0cd8f8df9a0200c6359ac53940cbde049dcb1a7eb6ee2dd7074"},
{file = "black-23.1.0-cp311-cp311-macosx_10_16_arm64.whl", hash = "sha256:bfffba28dc52a58f04492181392ee380e95262af14ee01d4bc7bb1b1c6ca8d27"},
{file = "black-23.1.0-cp311-cp311-macosx_10_16_universal2.whl", hash = "sha256:c1c476bc7b7d021321e7d93dc2cbd78ce103b84d5a4cf97ed535fbc0d6660648"},
{file = "black-23.1.0-cp311-cp311-macosx_10_16_x86_64.whl", hash = "sha256:382998821f58e5c8238d3166c492139573325287820963d2f7de4d518bd76958"},
{file = "black-23.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2bf649fda611c8550ca9d7592b69f0637218c2369b7744694c5e4902873b2f3a"},
{file = "black-23.1.0-cp311-cp311-win_amd64.whl", hash = "sha256:121ca7f10b4a01fd99951234abdbd97728e1240be89fde18480ffac16503d481"},
{file = "black-23.1.0-cp37-cp37m-macosx_10_16_x86_64.whl", hash = "sha256:a8471939da5e824b891b25751955be52ee7f8a30a916d570a5ba8e0f2eb2ecad"},
{file = "black-23.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8178318cb74f98bc571eef19068f6ab5613b3e59d4f47771582f04e175570ed8"},
{file = "black-23.1.0-cp37-cp37m-win_amd64.whl", hash = "sha256:a436e7881d33acaf2536c46a454bb964a50eff59b21b51c6ccf5a40601fbef24"},
{file = "black-23.1.0-cp38-cp38-macosx_10_16_arm64.whl", hash = "sha256:a59db0a2094d2259c554676403fa2fac3473ccf1354c1c63eccf7ae65aac8ab6"},
{file = "black-23.1.0-cp38-cp38-macosx_10_16_universal2.whl", hash = "sha256:0052dba51dec07ed029ed61b18183942043e00008ec65d5028814afaab9a22fd"},
{file = "black-23.1.0-cp38-cp38-macosx_10_16_x86_64.whl", hash = "sha256:49f7b39e30f326a34b5c9a4213213a6b221d7ae9d58ec70df1c4a307cf2a1580"},
{file = "black-23.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:162e37d49e93bd6eb6f1afc3e17a3d23a823042530c37c3c42eeeaf026f38468"},
{file = "black-23.1.0-cp38-cp38-win_amd64.whl", hash = "sha256:8b70eb40a78dfac24842458476135f9b99ab952dd3f2dab738c1881a9b38b753"},
{file = "black-23.1.0-cp39-cp39-macosx_10_16_arm64.whl", hash = "sha256:a29650759a6a0944e7cca036674655c2f0f63806ddecc45ed40b7b8aa314b651"},
{file = "black-23.1.0-cp39-cp39-macosx_10_16_universal2.whl", hash = "sha256:bb460c8561c8c1bec7824ecbc3ce085eb50005883a6203dcfb0122e95797ee06"},
{file = "black-23.1.0-cp39-cp39-macosx_10_16_x86_64.whl", hash = "sha256:c91dfc2c2a4e50df0026f88d2215e166616e0c80e86004d0003ece0488db2739"},
{file = "black-23.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2a951cc83ab535d248c89f300eccbd625e80ab880fbcfb5ac8afb5f01a258ac9"},
{file = "black-23.1.0-cp39-cp39-win_amd64.whl", hash = "sha256:0680d4380db3719ebcfb2613f34e86c8e6d15ffeabcf8ec59355c5e7b85bb555"},
{file = "black-23.1.0-py3-none-any.whl", hash = "sha256:7a0f701d314cfa0896b9001df70a530eb2472babb76086344e688829efd97d32"},
{file = "black-23.1.0.tar.gz", hash = "sha256:b0bd97bea8903f5a2ba7219257a44e3f1f9d00073d6cc1add68f0beec69692ac"},
{file = "black-23.3.0-cp310-cp310-macosx_10_16_arm64.whl", hash = "sha256:0945e13506be58bf7db93ee5853243eb368ace1c08a24c65ce108986eac65915"},
{file = "black-23.3.0-cp310-cp310-macosx_10_16_universal2.whl", hash = "sha256:67de8d0c209eb5b330cce2469503de11bca4085880d62f1628bd9972cc3366b9"},
{file = "black-23.3.0-cp310-cp310-macosx_10_16_x86_64.whl", hash = "sha256:7c3eb7cea23904399866c55826b31c1f55bbcd3890ce22ff70466b907b6775c2"},
{file = "black-23.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:32daa9783106c28815d05b724238e30718f34155653d4d6e125dc7daec8e260c"},
{file = "black-23.3.0-cp310-cp310-win_amd64.whl", hash = "sha256:35d1381d7a22cc5b2be2f72c7dfdae4072a3336060635718cc7e1ede24221d6c"},
{file = "black-23.3.0-cp311-cp311-macosx_10_16_arm64.whl", hash = "sha256:a8a968125d0a6a404842fa1bf0b349a568634f856aa08ffaff40ae0dfa52e7c6"},
{file = "black-23.3.0-cp311-cp311-macosx_10_16_universal2.whl", hash = "sha256:c7ab5790333c448903c4b721b59c0d80b11fe5e9803d8703e84dcb8da56fec1b"},
{file = "black-23.3.0-cp311-cp311-macosx_10_16_x86_64.whl", hash = "sha256:a6f6886c9869d4daae2d1715ce34a19bbc4b95006d20ed785ca00fa03cba312d"},
{file = "black-23.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6f3c333ea1dd6771b2d3777482429864f8e258899f6ff05826c3a4fcc5ce3f70"},
{file = "black-23.3.0-cp311-cp311-win_amd64.whl", hash = "sha256:11c410f71b876f961d1de77b9699ad19f939094c3a677323f43d7a29855fe326"},
{file = "black-23.3.0-cp37-cp37m-macosx_10_16_x86_64.whl", hash = "sha256:1d06691f1eb8de91cd1b322f21e3bfc9efe0c7ca1f0e1eb1db44ea367dff656b"},
{file = "black-23.3.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:50cb33cac881766a5cd9913e10ff75b1e8eb71babf4c7104f2e9c52da1fb7de2"},
{file = "black-23.3.0-cp37-cp37m-win_amd64.whl", hash = "sha256:e114420bf26b90d4b9daa597351337762b63039752bdf72bf361364c1aa05925"},
{file = "black-23.3.0-cp38-cp38-macosx_10_16_arm64.whl", hash = "sha256:48f9d345675bb7fbc3dd85821b12487e1b9a75242028adad0333ce36ed2a6d27"},
{file = "black-23.3.0-cp38-cp38-macosx_10_16_universal2.whl", hash = "sha256:714290490c18fb0126baa0fca0a54ee795f7502b44177e1ce7624ba1c00f2331"},
{file = "black-23.3.0-cp38-cp38-macosx_10_16_x86_64.whl", hash = "sha256:064101748afa12ad2291c2b91c960be28b817c0c7eaa35bec09cc63aa56493c5"},
{file = "black-23.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:562bd3a70495facf56814293149e51aa1be9931567474993c7942ff7d3533961"},
{file = "black-23.3.0-cp38-cp38-win_amd64.whl", hash = "sha256:e198cf27888ad6f4ff331ca1c48ffc038848ea9f031a3b40ba36aced7e22f2c8"},
{file = "black-23.3.0-cp39-cp39-macosx_10_16_arm64.whl", hash = "sha256:3238f2aacf827d18d26db07524e44741233ae09a584273aa059066d644ca7b30"},
{file = "black-23.3.0-cp39-cp39-macosx_10_16_universal2.whl", hash = "sha256:f0bd2f4a58d6666500542b26354978218a9babcdc972722f4bf90779524515f3"},
{file = "black-23.3.0-cp39-cp39-macosx_10_16_x86_64.whl", hash = "sha256:92c543f6854c28a3c7f39f4d9b7694f9a6eb9d3c5e2ece488c327b6e7ea9b266"},
{file = "black-23.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3a150542a204124ed00683f0db1f5cf1c2aaaa9cc3495b7a3b5976fb136090ab"},
{file = "black-23.3.0-cp39-cp39-win_amd64.whl", hash = "sha256:6b39abdfb402002b8a7d030ccc85cf5afff64ee90fa4c5aebc531e3ad0175ddb"},
{file = "black-23.3.0-py3-none-any.whl", hash = "sha256:ec751418022185b0c1bb7d7736e6933d40bbb14c14a0abcf9123d1b159f98dd4"},
{file = "black-23.3.0.tar.gz", hash = "sha256:1c7b8d606e728a41ea1ccbd7264677e494e87cf630e399262ced92d4a8dac940"},
]
[package.dependencies]
@ -747,14 +728,14 @@ files = [
[[package]]
name = "mkdocs"
version = "1.4.2"
version = "1.4.3"
description = "Project documentation with Markdown."
category = "dev"
optional = false
python-versions = ">=3.7"
files = [
{file = "mkdocs-1.4.2-py3-none-any.whl", hash = "sha256:c8856a832c1e56702577023cd64cc5f84948280c1c0fcc6af4cd39006ea6aa8c"},
{file = "mkdocs-1.4.2.tar.gz", hash = "sha256:8947af423a6d0facf41ea1195b8e1e8c85ad94ac95ae307fe11232e0424b11c5"},
{file = "mkdocs-1.4.3-py3-none-any.whl", hash = "sha256:6ee46d309bda331aac915cd24aab882c179a933bd9e77b80ce7d2eaaa3f689dd"},
{file = "mkdocs-1.4.3.tar.gz", hash = "sha256:5955093bbd4dd2e9403c5afaf57324ad8b04f16886512a3ee6ef828956481c57"},
]
[package.dependencies]
@ -792,14 +773,14 @@ mkdocs = ">=1.1"
[[package]]
name = "mkdocs-material"
version = "9.1.4"
version = "9.1.11"
description = "Documentation that simply works"
category = "dev"
optional = false
python-versions = ">=3.7"
files = [
{file = "mkdocs_material-9.1.4-py3-none-any.whl", hash = "sha256:4c92dcf9365068259bef3eed8e0dd5410056b6f7187bdea2d52848c0f94cd94c"},
{file = "mkdocs_material-9.1.4.tar.gz", hash = "sha256:c3a8943e9e4a7d2624291da365bbccf0b9f88688aa6947a46260d8c165cd4389"},
{file = "mkdocs_material-9.1.11-py3-none-any.whl", hash = "sha256:fbc86d50ec2cf34d40d5c4365780f290ceedde23f1a0704323b34e7f16b0c0dd"},
{file = "mkdocs_material-9.1.11.tar.gz", hash = "sha256:f5d473eb79d6640a5e668d4b2ab5b9de5e76ae0a0e2d864112df0cfe9016dc1d"},
]
[package.dependencies]
@ -827,14 +808,14 @@ files = [
[[package]]
name = "mkdocstrings"
version = "0.20.0"
version = "0.21.2"
description = "Automatic documentation from sources, for MkDocs."
category = "dev"
optional = false
python-versions = ">=3.7"
files = [
{file = "mkdocstrings-0.20.0-py3-none-any.whl", hash = "sha256:f17fc2c4f760ec302b069075ef9e31045aa6372ca91d2f35ded3adba8e25a472"},
{file = "mkdocstrings-0.20.0.tar.gz", hash = "sha256:c757f4f646d4f939491d6bc9256bfe33e36c5f8026392f49eaa351d241c838e5"},
{file = "mkdocstrings-0.21.2-py3-none-any.whl", hash = "sha256:949ef8da92df9d692ca07be50616459a6b536083a25520fd54b00e8814ce019b"},
{file = "mkdocstrings-0.21.2.tar.gz", hash = "sha256:304e56a2e90595708a38a13a278e538a67ad82052dd5c8b71f77a604a4f3d911"},
]
[package.dependencies]
@ -845,6 +826,7 @@ mkdocs = ">=1.2"
mkdocs-autorefs = ">=0.3.1"
mkdocstrings-python = {version = ">=0.5.2", optional = true, markers = "extra == \"python\""}
pymdown-extensions = ">=6.3"
typing-extensions = {version = ">=4.1", markers = "python_version < \"3.10\""}
[package.extras]
crystal = ["mkdocstrings-crystal (>=0.3.4)"]
@ -1007,18 +989,17 @@ pyyaml = "*"
[[package]]
name = "pytest"
version = "7.2.2"
version = "7.3.1"
description = "pytest: simple powerful testing with Python"
category = "dev"
optional = false
python-versions = ">=3.7"
files = [
{file = "pytest-7.2.2-py3-none-any.whl", hash = "sha256:130328f552dcfac0b1cec75c12e3f005619dc5f874f0a06e8ff7263f0ee6225e"},
{file = "pytest-7.2.2.tar.gz", hash = "sha256:c99ab0c73aceb050f68929bc93af19ab6db0558791c6a0715723abe9d0ade9d4"},
{file = "pytest-7.3.1-py3-none-any.whl", hash = "sha256:3799fa815351fea3a5e96ac7e503a96fa51cc9942c3753cda7651b93c1cfa362"},
{file = "pytest-7.3.1.tar.gz", hash = "sha256:434afafd78b1d78ed0addf160ad2b77a30d35d4bdf8af234fe621919d9ed15e3"},
]
[package.dependencies]
attrs = ">=19.2.0"
colorama = {version = "*", markers = "sys_platform == \"win32\""}
exceptiongroup = {version = ">=1.0.0rc8", markers = "python_version < \"3.11\""}
iniconfig = "*"
@ -1027,7 +1008,7 @@ pluggy = ">=0.12,<2.0"
tomli = {version = ">=1.0.0", markers = "python_version < \"3.11\""}
[package.extras]
testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "pygments (>=2.7.2)", "requests", "xmlschema"]
testing = ["argcomplete", "attrs (>=19.2.0)", "hypothesis (>=3.56)", "mock", "nose", "pygments (>=2.7.2)", "requests", "xmlschema"]
[[package]]
name = "python-dateutil"
@ -1458,4 +1439,4 @@ testing = ["big-O", "flake8 (<5)", "jaraco.functools", "jaraco.itertools", "more
[metadata]
lock-version = "2.0"
python-versions = "^3.8.1"
content-hash = "aa15e57300668bd23c051b4cd87bec4c1a58dcccd2f2b4767579fea7f2c5fa41"
content-hash = "6bea74d847b958639276d4be527c2b65dafeb0a455b6e3d1f29fee5171ce73b2"

View file

@ -1,6 +1,6 @@
[tool.poetry]
name = "llama_cpp_python"
version = "0.1.43"
version = "0.1.48"
description = "Python bindings for the llama.cpp library"
authors = ["Andrei Betlen <abetlen@gmail.com>"]
license = "MIT"
@ -18,12 +18,12 @@ typing-extensions = "^4.5.0"
[tool.poetry.group.dev.dependencies]
black = "^23.1.0"
black = "^23.3.0"
twine = "^4.0.2"
mkdocs = "^1.4.2"
mkdocstrings = {extras = ["python"], version = "^0.20.0"}
mkdocs-material = "^9.1.4"
pytest = "^7.2.2"
mkdocs = "^1.4.3"
mkdocstrings = {extras = ["python"], version = "^0.21.2"}
mkdocs-material = "^9.1.11"
pytest = "^7.3.1"
httpx = "^0.24.0"
[build-system]