Llama-Server is (Almost) All You Need
If you’re running LLMs locally, you’ve probably used Ollama or LM Studio. Both are effectively wrappers around llama.cpp. They’re convenient, but they impose limits that get in the way once you start trying to squeeze maximum performance out of your hardware.
LM Studio is primarily a desktop app, and while it can act as a server, it can’t be used in a truly headless environment. Ollama, on the other hand, is headless, but it is very opinionated and hides the advanced configuration parameters. It also frequently breaks compatibility with community models.
For more control, there’s llama-server from llama.cpp. It offers extensive configuration options, provides OpenAI-compatible APIs, and, in my opinion, is noticeably faster than Ollama. But it’s CLI-only: whenever I wanted to change the running model, I had to stop the current process and start a new one with the desired model parameters.
I wanted to manage my home LLM server from anywhere without constantly SSH-ing just to switch models. Since I couldn’t find an existing solution, I decided to build the management layer myself.
Orchestrating Llama-Server
To solve the friction of constant terminal access, I developed Llamactl. It acts as a management server and proxy that sits on top of llama-server. It exposes the full configuration surface of the underlying llama-server engine while providing a web dashboard for orchestration.
Rather than switching models via SSH, you interact with a modern React UI or standard REST APIs. You can run multiple instances (for instance, pairing a fast 7B model for quick responses with a 70B model for complex reasoning) and switch between them simply by changing the model parameter in your API request.
Because Llamactl maintains full OpenAI API compatibility, your existing ecosystem, like Open WebUI, works without modification. Just point your client at the Llamactl instance instead of the OpenAI endpoint.
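As a sketch of what that looks like in practice: switching models is just a matter of changing the `model` field in a standard OpenAI-style chat-completions payload. The endpoint URL and instance names below are placeholders, not anything Llamactl prescribes.

```python
import json

# Hypothetical Llamactl endpoint -- adjust host and port for your setup.
LLAMACTL_URL = "http://localhost:8080/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    """Build a standard OpenAI-style chat completion request.

    Llamactl routes the request to the llama-server instance whose
    name matches the `model` field, starting it on demand if needed.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# A quick question goes to the small, fast instance...
fast = chat_payload("qwen3-7b", "Summarize this commit message.")
# ...while heavier reasoning targets the large one.
slow = chat_payload("llama3-70b", "Plan a refactor of this module.")

# Sending it is a plain HTTP POST, e.g. with the requests library:
#   requests.post(LLAMACTL_URL, json=fast, headers={...})
print(json.dumps(fast, indent=2))
```

Any OpenAI-compatible client works the same way, since only the base URL changes.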
Features:
- Instance Management: Create, start, stop, and delete llama-server instances directly from the dashboard.
- Resource Lifecycle: Automatic idle timeouts, LRU eviction, and configurable instance limits.
- On-Demand Loading: Automatically loads models into memory when an API request is received for an inactive instance.
- API Security: Provides separate keys for management operations and inference requests. Inference keys can be scoped with per-instance permissions.
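The resource lifecycle rules above amount to a small eviction policy. Here is a toy sketch of idle-timeout plus LRU eviction; it is a simplified illustration of the concept, not Llamactl’s actual implementation:

```python
class InstancePool:
    """Toy model of idle-timeout + LRU eviction for running instances."""

    def __init__(self, max_instances: int, idle_timeout: float):
        self.max_instances = max_instances
        self.idle_timeout = idle_timeout
        self.last_used: dict[str, float] = {}  # instance name -> last request time

    def touch(self, name: str, now: float) -> None:
        """Record a request; evict the LRU instance if over the limit."""
        if name not in self.last_used and len(self.last_used) >= self.max_instances:
            lru = min(self.last_used, key=self.last_used.get)
            del self.last_used[lru]  # a real manager would stop the process here
        self.last_used[name] = now

    def reap_idle(self, now: float) -> list[str]:
        """Stop instances that have been idle longer than the timeout."""
        idle = [n for n, t in self.last_used.items() if now - t > self.idle_timeout]
        for n in idle:
            del self.last_used[n]
        return idle

pool = InstancePool(max_instances=2, idle_timeout=300)
pool.touch("gemma-27b", now=0)
pool.touch("qwen-32b", now=10)
pool.touch("llama-7b", now=20)  # pool full: evicts gemma-27b, the least recently used
```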
My Setup
I run my LLMs on a Mac Mini M4 Pro at home. The 48GB of unified memory gives me enough room to run larger models like Gemma 3 27B or Qwen 3 Coder 32B and switch between them as needed.
Both my home Mac Mini and a cloud VPS are connected via Tailscale. This creates a secure, private network that lets them communicate as if they were on the same LAN. Llamactl runs on the Mac Mini, managing my llama-server instances. Open WebUI also runs locally, providing a ChatGPT-like interface. Traefik runs on my VPS as a reverse proxy.
Traefik on the VPS proxies requests through the Tailscale network to my home setup, giving me a clean public URL (like llm.mydomain.com) that securely tunnels to my home lab.
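For reference, the routing piece might look something like the following Traefik dynamic configuration (file provider). The Tailscale IP, port, and certificate resolver name are illustrative placeholders, not my actual values:

```yaml
# Sketch of a Traefik dynamic config on the VPS (assumptions noted above).
http:
  routers:
    llm:
      rule: "Host(`llm.mydomain.com`)"
      entryPoints: ["websecure"]
      tls:
        certResolver: letsencrypt
      service: llamactl
  services:
    llamactl:
      loadBalancer:
        servers:
          - url: "http://100.64.0.10:8080"  # the Mac Mini's Tailscale address
```

Because the upstream URL is a Tailscale address, the traffic never traverses the public internet unencrypted.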
What’s Next
Llamactl is still evolving, and I have several ideas for future improvements:
- Enhanced Admin Dashboard: I’d like to add proper user authentication with usernames and passwords.
- Multiple Backend Support: Right now it’s focused on llama-server, but adding support for other inference engines like vLLM or mlx_lm.server could make it even more versatile.
- Better Resource Scheduling: Smart load balancing and automatic model placement based on hardware capabilities.
Llamactl is open source and available on GitHub. You can find complete documentation and guides at llamactl.org. I welcome all feedback and contributions.