Flama CLI~ 5 min read

Serve

Serve models without writing an app

In the previous section we introduced the run command of the Flama CLI, which serves any Flama app you have written. The serve command comes to the rescue of those who want an instantaneous deployment of one or more models without writing a single line of code. It builds the application for you and exposes the models you point it at, whether they are traditional predictive models or generative models.

Models must be packaged as binary .flm files beforehand. For predictive models, see packaging models; for generative models, see getting models.

Example files

The best way to see the power of serve is by example. You can use any of the following predictive example files:

For generative serving, fetch a small generative model:

flama get --source huggingface --family llm google/gemma-4-E2B-it

This produces a google_gemma-4-E2B-it.flm artifact.

Codeless serving

Every model is declared with a --model option. In its simplest form, you pass the path to a .flm file:

flama serve --model sklearn_model.flm
INFO:     Started server process [15822]INFO:     Waiting for application startup.INFO:     Application startup complete.INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

Does it seem familiar? Indeed, we have a Flama app up and running on http://127.0.0.1 listening on port 8000, without having written a single line of code. This is the codeless approach of Flama to model deployment, which makes it possible to expose an already trained model as an API ready to receive requests almost without any effort.

The `--model` specification

The bare path is shorthand. The full form of --model is a comma-separated list of key=value pairs that configure how each model is mounted:

flama serve --model file=model.flm,url=/route,name=my-model

The keys are:

file (required): the path to the .flm artifact.
url: the route under which the model is served (default: /).
name: the model name, used for documentation (default: model).
serving: for generative models, a colon-separated list of serving dialects, e.g. serving=native:openai (see serving generative models).
params: for generative models, a colon-separated list of generation defaults, e.g. params=temperature=0.7:max_tokens=200.
channel_scanner, tool_scanner, tool_parser: advanced decoder controls for generative models, defaulting to auto-detection.

Serving a predictive model

To serve a predictive model under a named route, point --model at its artifact and choose a URL and name:

flama serve --model file=sklearn_model.flm,url=/xor,name=xor-classifier
INFO:     Started server process [15999]INFO:     Waiting for application startup.INFO:     Model ready (name: xor-classifier)INFO:     Application startup complete.INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

The model is now reachable under /xor/, exposing the prediction and documentation endpoints described in how do we interact with the model?.

Serving a generative model

Serving a generative model is the same command with a generative model artifact and a choice of dialects. The following exposes the model under /llm/ with both the native and OpenAI-compatible dialects, and a default sampling temperature:

flama serve --model file=google_gemma-4-E2B-it.flm,url=/llm,name=assistant,serving=native:openai,params=temperature=0.7
INFO:     Started server process [15999]INFO:     Waiting for application startup.INFO:     Model ready (name: assistant)INFO:     Application startup complete.INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

The model is now reachable through the native endpoints (including the chat interface at /llm/chat/) and the OpenAI-compatible endpoints under /llm/openai/v1/. See serving generative models for the full route map.

Serving multiple models

The --model option is repeatable, so a single command can serve several models at once, each on its own route:

flama serve \  --model file=sklearn_model.flm,url=/sklearn,name=logistic-regression \  --model file=pytorch_model.flm,url=/pytorch,name=pytorch-model

Specification files

For anything beyond a couple of inline options, a whole model specification can be loaded from a JSON, YAML, or TOML file with the @ prefix:

// assistant.json{  "file": "google_gemma-4-E2B-it.flm",  "url": "/llm",  "name": "assistant",  "serving": ["native", "openai"],  "params": {"temperature": 0.7, "max_tokens": 512}}

With the specification saved, pass it to serve with the @ prefix:

flama serve --model @assistant.json

How do we interact with the model?

All predictive models served with flama serve come with the following endpoints for free:

GET http://127.0.0.1:8000/docs/: An interactive HTML documentation page to explore and call every route in the browser. This is the recommended starting point.
GET http://127.0.0.1:8000/schema/: The OpenAPI schema describing the served API.
POST http://127.0.0.1:8000/predict/: Expects the input data (observations) to be passed as argument to predict their output.

For a prediction against a predictive model served at the default route:

curl --request POST \  --url http://127.0.0.1:8000/predict/ \  --header 'Content-Type: application/json' \  --data '{"input": [[0, 0], [1, 0], [0, 1], [0, 0]]}'
{"output": [0, 1, 1, 0]}

Generative models served with serve expose the dialect endpoints described in the Generative AI section instead.

Parameters

Beyond --model, the serve command accepts application and server parameters. To inspect them all, run:

flama serve --help

App parameters

app-debug: enable debug mode (default: False).
app-title: name of the application (default: Flama).
app-version: version of the application (default: 0.1.0).
app-description: description of the application.
app-docs: route of the application documentation (default: /docs/).
app-schema: route of the application schema (default: /schema/).

The parameter app-debug brings useful tools that make debugging easier, e.g. highly-detailed error messages and interactive error webpages.

Server parameters

All uvicorn options can be passed to the command serve with the format server-<UVICORN_OPTION_NAME>, as discussed for the command run, e.g.:

server-host: bind socket to this host (default: 127.0.0.1).
server-port: bind socket to this port (default: 8000).

How to use parameters

Each parameter can be passed with --<option-name> <value> or via the equivalent environment variable, named FLAMA_<OPTION_NAME> (uppercase, underscored). For instance, to change the host:

flama serve --model sklearn_model.flm --server-host=0.0.0.0

or, equivalently:

export FLAMA_SERVER_HOST=0.0.0.0flama serve --model sklearn_model.flm

Example

As a quick example populating application and model parameters, we can run:

flama serve \  --model file=sklearn_model.flm,url=/xor,name="XOR dummy model" \  --app-debug \  --app-title="Predictive AI API" \  --app-version="1.0.0" \  --app-description="XOR model serving for predictions" \  --app-docs="/docs/" \  --app-schema="/schema/"

With these changes, the documentation at http://127.0.0.1:8000/docs/ reflects the customised application metadata and serves the model at /xor/.

Introduction

Getting Started

Fundamentals

Flama CLI

Advanced Topics

Predictive AI

Generative AI

Domain driven design

Contributing

Serve

Serve models without writing an app

Example files

Codeless serving

The `--model` specification

Serving a predictive model

Serving a generative model

Serving multiple models

Specification files

How do we interact with the model?

Parameters

App parameters

Server parameters

How to use parameters

Example

Introduction

Getting Started

Fundamentals

Flama CLI

Advanced Topics

Predictive AI

Generative AI

Domain driven design

Contributing

Serve

Serve models without writing an app

Example files

Codeless serving

The --model specification

Serving a predictive model

Serving a generative model

Serving multiple models

Specification files

How do we interact with the model?

Parameters

App parameters

Server parameters

How to use parameters

Example

The `--model` specification