
Chatbot application

Serving a large language model (LLM) over HTTP is the foundation, but talking to it through curl is not how most people want to interact with a model. Flama ships a complete, ready-to-use chat interface as part of the native serving layer, so any model you serve comes with a working web application out of the box, with no frontend code required. This page covers what the built-in chat interface is, how to enable it, and how to talk to your model through it.

What is the built-in chat interface?

In Flama, the chat interface is a single-page web application served directly from your model's resource. It is part of the native serving dialect: when the native layer is enabled, the model exposes a GET /chat/ route that returns a self-contained HTML application wired to the model's streaming endpoint. Open it in a browser and you have a working chat client against your own model.

Why is it important?

Zero frontend effort: A production-quality chat UI ships with the framework; you do not write or build any JavaScript.
Rich rendering: Responses render as Markdown, with LaTeX math (via KaTeX) and Mermaid diagrams, so technical answers display the way they were meant to.
Streaming and resumption: The interface consumes the native Server-Sent Events (SSE) stream, displaying tokens as they arrive and reconnecting transparently with Last-Event-ID if the connection drops.
Self-contained: The whole application is a single HTML file with no external runtime dependencies, so it works behind a firewall and deploys wherever your API does.

The headline benefit is a usable product on day one: every model you serve becomes immediately approachable by real users, not only by command-line clients.
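The resumption behaviour listed above relies on the standard SSE `Last-Event-ID` request header. As a rough sketch of what a non-browser client might send when reconnecting, the snippet below builds such a request with the standard library. The helper name `resume_request` and the stream URL are illustrative assumptions, not part of Flama's API.

```python
import urllib.request
from typing import Optional


def resume_request(stream_url: str, last_event_id: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for an SSE stream, optionally resuming after a known event.

    `stream_url` is a placeholder for the stream endpoint of a served model.
    """
    headers = {"Accept": "text/event-stream"}
    if last_event_id is not None:
        # Standard SSE resumption header: the server may replay events after this id.
        headers["Last-Event-ID"] = last_event_id
    return urllib.request.Request(stream_url, headers=headers)


# Illustrative URL and event id only; a real client would use the values it received.
request = resume_request("http://127.0.0.1:8000/llm/stream/abc123/", "42")
```

A browser's EventSource adds this header on its own when it reconnects, which is why the built-in interface needs no extra code for it.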

Enabling the chat interface

The chat interface is part of the native serving dialect, so you enable it by including "native" in the resource's serving tuple (it is included by default when serving is omitted):

```python
app.models.add_model(
    path="/llm/",
    model="google_gemma-4-E2B-it.flm",
    name="assistant",
    serving=("native",),
)
```

With the native dialect enabled, the resource gains a GET /llm/chat/ route that returns the self-contained chat application, alongside the streaming endpoints it talks to.

Using the chat interface

With the application running, navigate to http://127.0.0.1:8000/llm/chat/ in your browser. You are greeted with a chat window where you can type a prompt and watch the model's reply stream in token by token. Markdown formatting, code blocks, mathematical expressions, and diagrams are all rendered as the response arrives.

Because the interface is mounted relative to the resource path, each model you serve gets its own chat window. If you register a second model under /coder/, its chat interface lives at /coder/chat/, talking to that model's own streaming endpoint.
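To make the path rule concrete, here is a small sketch that derives the chat URL for each resource path. The helper `chat_url` is hypothetical and only mirrors the "resource path plus chat/" convention described above; it does not call Flama.

```python
from urllib.parse import urljoin


def chat_url(base_url: str, resource_path: str) -> str:
    """Return the chat page URL for a model resource mounted at `resource_path`."""
    # Normalise so "/llm", "llm/" and "/llm/" all map to "/llm/".
    segments = resource_path.strip("/")
    resource = f"/{segments}/" if segments else "/"
    return urljoin(base_url, resource + "chat/")


# One chat window per registered resource path.
for path in ("/llm/", "/coder/"):
    print(chat_url("http://127.0.0.1:8000", path))
```

Each resource path maps to exactly one chat page, so adding a model never changes the URL of an existing one.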

How it works

The chat interface is a thin client over the native streaming endpoints described on the previous page. When the page loads, it resolves the URL of the resource's create_stream endpoint and uses it to drive a two-step exchange for every message:

A POST to /llm/stream/ creates a generation and returns a stream identifier.
A native EventSource connects to GET /llm/stream/{id}/ and renders each block.delta event as it arrives, until the closing message.stop event.

This is the same flow any custom frontend would use, which means the built-in interface is also a reference implementation: if you outgrow it, you can build your own client against exactly the same endpoints.
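For a custom client, the second step amounts to reading a text/event-stream response. The following is a minimal sketch of that parsing, following the standard SSE wire format, plus a loop that gathers block.delta events until message.stop. The names `Event`, `parse_sse`, and `collect_reply` are hypothetical, and treating each delta's data as plain text is an assumption: the actual payload schema is not specified on this page.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    event: str
    data: str
    id: Optional[str]


def parse_sse(lines: Iterable[str]) -> Iterator[Event]:
    """Parse text/event-stream lines; a blank line dispatches the pending event."""
    name, data, last_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Per the SSE format, an event without any data field is not dispatched.
            if data:
                yield Event(name, "\n".join(data), last_id)
            name, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            name = value
        elif field == "data":
            data.append(value)
        elif field == "id":
            last_id = value  # the last seen id persists across events


def collect_reply(events: Iterable[Event]) -> Tuple[str, Optional[str]]:
    """Join block.delta payloads until message.stop; return the text and last event id."""
    parts: List[str] = []
    last_id: Optional[str] = None
    for ev in events:
        last_id = ev.id or last_id
        if ev.event == "block.delta":
            parts.append(ev.data)  # assumption: data is plain text, not a JSON envelope
        elif ev.event == "message.stop":
            break
    return "".join(parts), last_id


# Hand-written sample stream, for illustration only (not captured from a real server).
sample = [
    "id: 1", "event: block.delta", "data: Hello", "",
    "id: 2", "event: block.delta", "data: , world", "",
    "id: 3", "event: message.stop", "data: ", "",
]
text, last_id = collect_reply(parse_sse(sample))
```

The returned `last_id` is the value a client would send as Last-Event-ID if it had to reconnect mid-stream.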

Example

Putting it together, here is a complete application that serves a single model with the native dialect, exposing both the streaming endpoints and the chat interface built on top of them:

```python
# examples/chatbot.py
import flama
from flama import Flama

app = Flama(
    openapi={
        "info": {
            "title": "Generative AI API",
            "version": "1.0.0",
            "description": "A chatbot powered by Flama 🔥",
        },
    },
    docs="/docs/",
)

app.models.add_model(
    path="/llm/",
    model="google_gemma-4-E2B-it.flm",
    name="assistant",
    serving=("native",),
)

if __name__ == "__main__":
    flama.run(flama_app=app, server_host="0.0.0.0", server_port=8000)
```

Run the application as usual:

```
flama run examples.chatbot:app
INFO:     Started server process [44021]
INFO:     Waiting for application startup.
INFO:     Model ready (name: assistant)
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
```
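If you prefer to confirm the route from a script before opening a browser, a quick smoke check might look like the sketch below. Both helper names are hypothetical, and the expectation of a 200 response with a text/html content type is an assumption based on the route returning an HTML application.

```python
import urllib.request


def looks_like_chat_page(status: int, content_type: str) -> bool:
    """True when a response is a successful HTML document."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status == 200 and media_type == "text/html"


def chat_page_is_served(url: str, timeout: float = 5.0) -> bool:
    """Fetch `url` and report whether it answers like the chat page."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return looks_like_chat_page(response.status, content_type)
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

With the server from this example running, `chat_page_is_served("http://127.0.0.1:8000/llm/chat/")` is the call to make.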

Open http://127.0.0.1:8000/llm/chat/ to talk to your model. On the next page we move from serving a model to exposing tools, resources, and prompts to AI clients through the Model Context Protocol.
