
Test Drive: llamafile


Today I test drove llamafile to “distribute and run LLMs with a single file.” I typically prefer hosted solutions in order to avoid distraction, but it's also fun to tinker. Now that I have this model lying around, I could work on an application that uses the OpenAI API even if the internet goes out. I wanted to try the combination of llamafile + --server mode + litellm, since server mode and LiteLLM are both OpenAI-compatible. The process is:

  1. Start llamafile in server mode
  2. Profit

Newer llamafiles launch a much improved chat interface by default, so now you have to pass the server flag explicitly (e.g. ./Llama-3.2-1B-Instruct.Q6_K.llamafile --server --port 8081). Once it's running, the code can be simple:

import litellm

# The "openai/" prefix routes the call through LiteLLM's OpenAI-compatible
# provider; api_base points it at the local llamafile server. The model name
# and API key are placeholders, since the local server doesn't check them.
response = litellm.completion(
    model="openai/local",
    api_key="sk-1234",
    api_base="http://localhost:8081/v1",
    messages=[
        {"role": "user", "content": "Write a limerick about LLMs"}
    ],
)
print(response.choices[0].message.content)
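
Because the endpoint is OpenAI-compatible, the same request also works with the openai Python package pointed at the local server. A minimal sketch, assuming openai v1+ is installed; the api_key and model values here are arbitrary placeholders, since the llamafile server doesn't validate them:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8081/v1",
    api_key="sk-no-key-required",  # placeholder; the local server ignores it
)

response = client.chat.completions.create(
    model="local",  # placeholder; the server answers with whatever model it loaded
    messages=[
        {"role": "user", "content": "Write a limerick about LLMs"}
    ],
)
print(response.choices[0].message.content)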
