Test Drive: llamafile
Today I test drove llamafile to “distribute and run LLMs with a single file.” I typically prefer hosted solutions in order to avoid distraction, but it's also fun to tinker. Now that I have this model lying around, I could work on an application that uses the OpenAI API even if the internet goes out. I wanted to try the combination of llamafile in --server mode plus litellm, since server mode and litellm are both OpenAI compatible. The process is:
- Start llamafile in server mode
- Point litellm at the local endpoint
- Profit
Newer llamafiles default to a much improved chat interface, so you now have to pass the server flag explicitly (e.g. ./Llama-3.2-1B-Instruct.Q6_K.llamafile --server --port 8081). Once it's running, the code can be simple:
import litellm

# The "openai/" prefix tells litellm to use its OpenAI-compatible provider;
# the name after the slash is passed through to the local server.
# The api_key is a placeholder: litellm requires one, but the local
# llamafile server doesn't check it by default.
response = litellm.completion(
    model="openai/local",
    api_key="sk-1234",
    api_base="http://localhost:8081/v1",
    messages=[
        {"role": "user", "content": "Write a limerick about LLMs"}
    ],
)

print(response.choices[0].message.content)
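Since the endpoint behaves like OpenAI's, streaming works through litellm the same way. A minimal sketch, assuming the same local server is still running on port 8081; stream=True yields OpenAI-style chunks whose delta content can be empty on the final message:

import litellm

# Same local endpoint as above, but print tokens as they arrive.
stream = litellm.completion(
    model="openai/local",
    api_key="sk-1234",
    api_base="http://localhost:8081/v1",
    messages=[
        {"role": "user", "content": "Write a limerick about LLMs"}
    ],
    stream=True,
)

for chunk in stream:
    # Each chunk mirrors OpenAI's streaming format; content may be None.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()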