# llama.cpp server overview and usage
Get an overview of the llama.cpp server and its usage, including configuration options and API endpoints.
## Quick start
Start llama-server with a pre-trained model from the Hugging Face model hub. The server will download the specified model and listen on port 8080 by default.
```sh
llama-server -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf
```
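Once the model has downloaded and loaded, you can verify the server is reachable with a quick request to its health endpoint (documented under Endpoints below). A minimal check, assuming the default `127.0.0.1:8080` address:

```sh
# Should print {"status": "ok"} once the model has finished loading
curl http://127.0.0.1:8080/health
```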
## Command line options

### Common options
- `-c, --ctx-size`: Size of the prompt context (default: `0`, `0` = loaded from model) (env: `LLAMA_ARG_CTX_SIZE`)
- `-n, --predict, --n-predict`: Number of tokens to predict (default: `-1`, `-1` = infinity, `-2` = until context filled) (env: `LLAMA_ARG_N_PREDICT`)
- `--keep`: Number of tokens to keep from the initial prompt (default: `0`, `-1` = all)
- `-m, --model`: Model path (default: `models/$filename` with filename taken from `--hf-file` or `--model-url` if set, otherwise `models/7B/ggml-model-f16.gguf`) (env: `LLAMA_ARG_MODEL`)
- `-mu, --model-url`: Model download URL (default: unused) (env: `LLAMA_ARG_MODEL_URL`)
- `-hfr, --hf-repo`: Hugging Face model repository (default: unused) (env: `LLAMA_ARG_HF_REPO`)
- `-hff, --hf-file`: Hugging Face model file (default: unused) (env: `LLAMA_ARG_HF_FILE`)
- `-hft, --hf-token`: Hugging Face access token (default: value from the `HF_TOKEN` environment variable) (env: `HF_TOKEN`)
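The paths and values below are only illustrative, but they sketch the two equivalent ways of setting these options: on the command line or through the corresponding environment variables.

```sh
# Load a local model with an 8192-token context and a 512-token generation limit
llama-server -m models/7B/ggml-model-f16.gguf -c 8192 -n 512

# The same configuration expressed with environment variables
LLAMA_ARG_MODEL=models/7B/ggml-model-f16.gguf \
LLAMA_ARG_CTX_SIZE=8192 \
LLAMA_ARG_N_PREDICT=512 \
llama-server
```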
### Server options
- `--host`: IP address to listen on (default: `127.0.0.1`) (env: `LLAMA_ARG_HOST`)
- `--port`: Port to listen on (default: `8080`) (env: `LLAMA_ARG_PORT`)
- `--api-key`: API key to use for authentication (default: none) (env: `LLAMA_API_KEY`)
- `--api-key-file`: Path to a file containing API keys (default: none)
- `--ssl-key-file`: Path to a file containing a PEM-encoded SSL private key (env: `LLAMA_ARG_SSL_KEY_FILE`)
- `--ssl-cert-file`: Path to a file containing a PEM-encoded SSL certificate (env: `LLAMA_ARG_SSL_CERT_FILE`)
- `-to, --timeout`: Server read/write timeout in seconds (default: `600`) (env: `LLAMA_ARG_TIMEOUT`)
- `--metrics`: Enable a Prometheus-compatible metrics endpoint (default: disabled) (env: `LLAMA_ARG_ENDPOINT_METRICS`)
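As a sketch (the API key below is a placeholder), exposing the server on all interfaces with authentication and metrics enabled looks roughly like this:

```sh
# Listen on all interfaces, require an API key, and enable the metrics endpoint
llama-server \
  -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
  -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  --api-key my-secret-key \
  --metrics

# Requests now carry the key as a bearer token;
# Prometheus-compatible metrics are served at /metrics
curl -H "Authorization: Bearer my-secret-key" http://localhost:8080/metrics
```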
### Default parameters
- Seed: `-1` (random). The seed value used for random number generation.
- Temperature: `0.8`. Controls the randomness of the output; lower values produce more conservative results, while higher values produce more diverse results.
- Top-k: `40`. The number of highest-probability tokens considered when sampling.
- Top-p: `0.9`. The cumulative probability threshold for the tokens considered when sampling.
- Min-p: `0.1`. The minimum probability threshold for a token to be considered.
- Presence penalty: `0.0` (disabled). A penalty applied to tokens that are already present in the output.
- Frequency penalty: `0.0` (disabled). A penalty applied to tokens in proportion to how often they already appear in the output.
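These defaults apply when a request does not specify its own values. They can be changed at startup with the corresponding sampling flags, which llama-server accepts even though they are not listed in this section; the values below are only a sketch:

```sh
# Start the server with more conservative default sampling
# (--temp, --top-k, --top-p, and --min-p mirror the parameters above)
llama-server \
  -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
  -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --temp 0.2 --top-k 20 --top-p 0.8 --min-p 0.05
```

They can also be overridden per request, as shown in the chat completions example at the end of this page.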
## Endpoints
### GET /health: Health check

This endpoint checks the health status of the service.

- Success response
  - Code: `200`
  - Example: `{"status": "ok"}`
- Error response
  - Code: `503`
  - Example: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
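Since the endpoint returns `503` while the model is still loading, it is useful for gating scripts on server readiness. A minimal polling sketch, assuming the default address:

```sh
# Block until the server reports a loaded model (HTTP 200)
until curl -sf http://127.0.0.1:8080/health > /dev/null; do
  echo "waiting for llama-server..."
  sleep 1
done
echo "llama-server is ready"
```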
### POST /v1/chat/completions: OpenAI-compatible chat completions
This endpoint generates chat completions in response to user messages and is compatible with the OpenAI API.
```python
import openai

# Point the OpenAI client at the llama.cpp server; any api_key works
# unless the server was started with --api-key
client = openai.OpenAI(base_url='http://llama-cpp:8080/v1', api_key='dummy')

# The model name is required by the client, but llama-server simply serves
# whichever model it was started with
response = client.chat.completions.create(
    model='dummy',
    messages=[
        {'role': 'system', 'content': 'You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests.'},
        {'role': 'user', 'content': 'Write a limerick about python exceptions'},
    ],
    stream=True,
)

# Print the streamed tokens as they arrive
for chunk in response:
    print(chunk.choices[0].delta.content or '', end='', flush=True)
```
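The same request can be made without the OpenAI client. A rough curl equivalent (non-streaming, using the same address as the Python example and assuming no `--api-key` was set); the `temperature` and `top_p` fields override the server defaults for this request only:

```sh
curl http://llama-cpp:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dummy",
    "messages": [
      {"role": "system", "content": "You are ChatGPT, an AI assistant."},
      {"role": "user", "content": "Write a limerick about python exceptions"}
    ],
    "temperature": 0.8,
    "top_p": 0.9
  }'
```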