
llama.cpp Official WebUI is Finally Here! Create the Ultimate Local AI Chat Experience

November 5, 2025
Updated Nov 5
6 min read

Say goodbye to complex setups! llama.cpp has officially launched its new WebUI, built on SvelteKit, powerful and completely free. This article walks you through getting started quickly and exploring features like multimodal input, parallel conversations, and JSON-constrained generation, so you can enjoy a 100% private AI assistant on your own computer.


If you’re the kind of enthusiast who likes to run large language models (LLMs) on your own computer, you’re definitely familiar with the name llama.cpp. It’s lightweight, efficient, and runs on almost any hardware, making it practically synonymous with local AI. But honestly, finding a handy and powerful graphical interface (UI) for it used to take some effort.

But now, that trouble can officially come to an end. The core development team of llama.cpp has launched a brand new official web user interface (WebUI)! This is not just a simple chat window, but a complete solution that aims to create the “ultimate local AI chat experience.”

So, what’s special about this official WebUI?

You might be thinking: aren’t there already plenty of WebUIs out there? Yes, but an official product always has that “favorite child” advantage. Built with SvelteKit, this interface integrates tightly with the llama-server backend and brings several great features:

  • Completely free and open source: Community-driven, you have complete control over everything.
  • Excellent performance: Whether your computer has a high-end graphics card or just an ordinary CPU, it delivers great results.
  • Advanced caching technology: With advanced context and prefix caching, response speed is faster.
  • Lightweight and efficient: Extremely low memory footprint, won’t drag down your system.
  • 100% privacy: All computations are done on your computer, and your conversation data won’t go anywhere.

Sounds great, right? Next, let’s see how easy it is to get started.

Get started quickly in three steps

Ready to start? The process is really simple; you don’t need to be a programming expert to get it done.

  1. Get llama.cpp: First, you need the llama.cpp main program. You can install it with a package manager, download a prebuilt release, or build it from source (see the example after this list).

  2. Start the llama-server: Next, open your terminal (Terminal or Command Prompt) and enter the command to start the backend server. Here’s an example that downloads and runs a model:

    # Run an example server using the gpt-oss-20b model
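    # -hf downloads the model from Hugging Face, --jinja enables the model's built-in chat template,
    # and -c 0 uses the context size defined in the model file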
    llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -c 0 --host 127.0.0.1 --port 8033
    
  3. Open your browser and start chatting: After the server starts, open http://127.0.0.1:8033 directly in your browser (Chrome, Edge, Firefox, etc.), and you’ll see the clean chat interface!
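
If you need a concrete starting point for step 1, here are two common routes, sketched with the assumption that you’re on macOS or Linux (exact package names and build options vary by platform):

    # Option A: install with Homebrew
    brew install llama.cpp

    # Option B: build from source (requires git and CMake)
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build
    cmake --build build --config Release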

Tip: If you’re a Mac user and don’t like dealing with commands, you can try LlamaBarn, an application that provides a simpler graphical interface for setting up llama.cpp.

More than just chat: Explore the powerful features of WebUI

This WebUI is not just about looks; it comes with many practical and powerful features that take your local AI experience to a new level.

Documents, PDFs, images? Throw them all in!

This is probably one of the most useful features. You can directly drag and drop multiple text files (.txt), PDF files, and even images into the conversation.

  • Document processing: Text files can be added from your hard drive or pasted straight from the clipboard, and their content is appended to the conversation’s context.
  • PDF processing: By default, it converts PDF content to plain text. If your AI model supports vision capabilities, you can even set it to treat PDFs as images, directly analyzing charts or layouts within them.
  • Image input: For models with vision support (such as LLaVA or Qwen-VL), you can upload images and have the AI describe their content, answer related questions, or hold multimodal conversations mixing text and images (see the example command after this list).
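
To try image input, the server has to be started with a vision-capable model. A minimal sketch, assuming a Hugging Face repo that ships the matching multimodal projector (the repo name below is only an illustrative assumption):

    # Download and run a vision-capable model from Hugging Face (repo name is an assumption)
    llama-server -hf ggml-org/gemma-3-4b-it-GGUF --jinja -c 0 --host 127.0.0.1 --port 8033

    # Or use local files and point at the multimodal projector explicitly
    llama-server -m /path/to/model.gguf --mmproj /path/to/mmproj.gguf --jinja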

Multitasking? Parallel conversations and branch management

Have you ever wanted to discuss several different topics with the AI at the same time? Or to try a different way of asking based on one of its answers?

  • Parallel conversations: This WebUI allows you to open multiple independent chat windows simultaneously, each with its own context, without interfering with each other.
  • Conversation branching: You can go back at any time to edit any message from yourself or the AI, and then “branch” out a new conversation direction from that point in time. This is very useful for comparing the effects of different prompts or correcting the AI’s response direction.

Make AI obey: Precise control and formatted output

For advanced users and developers, precise control over the model’s output format is crucial.

  • Constrained generation: This is a super cool feature! You can provide a custom JSON Schema to force the AI’s response to conform to your specified format. For example, you can have it automatically extract fields like “company name,” “amount,” and “date” from a pile of invoice images and output them as standard JSON, greatly simplifying downstream data processing (a sample schema follows this list).
  • Render mathematical formulas and code: It can perfectly render LaTeX mathematical expressions and code blocks (HTML/JS), making academic discussions and code development more intuitive.
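
As a rough illustration of the invoice example above, a schema along these lines (the field names are purely hypothetical) could be pasted into the WebUI’s custom JSON setting:

    {
      "type": "object",
      "properties": {
        "company_name": { "type": "string" },
        "amount": { "type": "number" },
        "date": { "type": "string" }
      },
      "required": ["company_name", "amount", "date"]
    }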

Take it with you anywhere: Perfect mobile experience

That’s right, this WebUI is also mobile-friendly! You can open it in your phone or tablet’s browser, and the interface will automatically adapt to the screen size, allowing you to use your local AI assistant anytime, anywhere.

Frequently Asked Questions (FAQ)

Some common questions have come up in community discussions; here are the answers, summarized for you.

Q: How do I enable parallel conversations? A: When starting llama-server, add the --parallel N parameter, where N is the number of conversations you want to handle simultaneously (e.g., --parallel 2). For a single-user, multi-conversation setup, it’s also recommended to add --kv-unified so that all conversations share one KV cache instead of splitting it, which makes better use of the context space.
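
A command along these lines would do it (the model path is a placeholder):

    # Sketch: serve two parallel conversations that share one unified KV cache
    llama-server -m /path/to/your/model.gguf --parallel 2 --kv-unified --host 127.0.0.1 --port 8033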

Q: I don’t want to download models from Hugging Face. How do I load model files from my own computer? A: It’s very simple. Use the -m or --model parameter, followed by the path to your local GGUF model file. For example: llama-server -m /path/to/your/model.gguf

Q: How do I make the AI reply strictly in the JSON format I specify? A: That’s the “constrained generation” feature mentioned earlier. Find the “Custom JSON” option in the WebUI’s developer settings and paste in your JSON Schema definition (like the sample schema shown earlier).

Conclusion

This new official WebUI for llama.cpp is undoubtedly a feature-rich, high-performance, and user-friendly choice for local AI enthusiasts. It not only makes getting started easier but also offers rich customization options for advanced users.

Thanks go to the project’s lead developer Aleksander Grygier, major contributor ServeurpersoCom, and the Hugging Face community for their extensive support.

If you’re also passionate about running AI on your own computer, now is the best time. Go check out the GitHub project page and experience this powerful new tool firsthand!
