The production framework for Predictive and Generative AI

Turn any model into a production API in a single line of code. Serve predictive and generative models on a Rust-powered core, and expose your tools to AI agents over the Model Context Protocol (MCP).

Flama is the Framework for Lightweight Applications, artificial intelligence Models, and Automation.

Light up your models 🔥

Run a model
flama serve --model model.flm
INFO: Started server process [15822]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

Any Framework, One Format

There isn’t a single ML framework, and the models you build in scikit-learn, TensorFlow, or PyTorch should be just as easy to ship together. That integration is usually the unproductive, fiddly part of a data scientist’s day.


Flama packages a model from any of the mainstream frameworks into a single portable format, the .flm file, so every model looks the same to your API no matter where it came from.

scikit-learn
import flama
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(activation="tanh", hidden_layer_sizes=(10,))
# Training
flama.dump(model, "model.flm")

Models on Demand

Not every model lives on your disk. Whether you need a predictive model or a generative one, the right one is often a single download away on a hub such as HuggingFace.


With a single command, Flama downloads a model from the hub and serialises it straight into the portable .flm format, ready to serve. No glue code, no manual conversion, no boilerplate.

Predictive
flama get --source huggingface \
  --family ml \
  scikit-learn/Fish-Weight
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4.1 MB 12.0 MB/s 0:00:00
Packaging...
Model saved to scikit-learn_Fish-Weight.flm

Generative AI Serving

Serving a generative model should be as simple as serving any other model. With Flama, it is. Package a model, point the CLI at it, and you have a production-ready generative API in seconds.


Pick the dialects your clients already speak (OpenAI, Anthropic, Ollama, and the channel-aware native protocol) and Flama exposes them side by side.

Native
flama serve --model file=google_gemma-4-E2B-it.flm,serving=native
INFO: Model ready (name: assistant)
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

Chatbot Out of the Box

Every model you serve with the native dialect comes with a polished chat interface for free, served straight from your application at /chat/. No frontend code, no build step.


Responses stream in token by token and render as Markdown, with LaTeX maths and Mermaid diagrams. Built with Flama and shipped as a single self-contained page, it is the fastest way to put a model in front of real users.

Chatbot

Model Context Protocol

Models are far more useful when they can reach into your world. Flama ships native support for the Model Context Protocol (MCP), the open standard for exposing tools, resources, and prompts to AI clients.


Declare each capability with a single decorator, mount the server on your application, and Flama derives the JSON Schema from your type hints and serves it over a stateless protocol, with Tasks, Elicitation, and MCP Apps included.

Tool
import flama
app = flama.Flama()
app.mcp.add_server("/mcp/server/", "server")

@app.mcp.tool("add", description="Add two integers", mcp="server")
def add(a: int, b: int) -> int:
    return a + b
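To make the idea of deriving a JSON Schema from type hints concrete, here is a minimal standard-library sketch. It is not Flama's implementation: the `JSON_TYPES` mapping and the `input_schema` helper are hypothetical names introduced for illustration, and only a handful of primitive types are handled.

```python
import inspect
import typing

# Hypothetical mapping from Python primitives to JSON Schema type names.
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def input_schema(func: typing.Callable) -> dict:
    """Build a JSON Schema object describing the parameters of ``func``.

    Illustrative only: raises KeyError for parameters that are unannotated
    or annotated with a type missing from JSON_TYPES.
    """
    hints = typing.get_type_hints(func)
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": JSON_TYPES[hints[name]]}
        # A parameter without a default value must be supplied by the caller.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


def add(a: int, b: int) -> int:
    return a + b


schema = input_schema(add)
```

For `add`, both parameters map to `"integer"` and both are listed as required, which is the kind of description an MCP client needs in order to call a tool correctly.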

Production-Ready First

Going from a packaged model to a running service should take minutes, not months. Flama makes that the default, whether you are serving a predictive model or a generative one.


Point it at a packaged model from the command line, in Python, with a specification file, or inside a container, and it is ready to serve over HTTP in seconds.

Command Line
flama serve --model file=model.flm,url=/model,name=model-name
INFO: Started server process [78260]
INFO: Waiting for application startup.
INFO: Model ready (name: model-name)
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
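Once the server is up, a client could query the model over HTTP. The sketch below only builds the request, using the standard library. It rests on assumptions that should be checked against the running service's schema: that the model mounted at /model exposes a POST endpoint at /model/predict/, and that it accepts a JSON body of the form {"input": [...]}. `BASE_URL` and `build_predict_request` are hypothetical names.

```python
import json
import urllib.request

# Assumption: the server started by the command above is listening locally.
BASE_URL = "http://127.0.0.1:8000"


def build_predict_request(rows: list) -> urllib.request.Request:
    """Build a POST request for the (assumed) /model/predict/ endpoint."""
    # Assumption: the endpoint expects the feature rows under the "input" key.
    body = json.dumps({"input": rows}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/model/predict/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request([[0.5, 1.2, 3.4]])

# Sending the request needs the server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes the payload easy to inspect before it reaches the service.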

A Rust-Powered Core

Flama moves its performance-critical paths into a compiled Rust core, built with PyO3 and shipped as native wheels for every supported Python version.


You get the speed-ups with a plain pip install (no Rust toolchain required), while the ergonomic Python API you already know stays exactly the same.

Routing

Path matching and route resolving compiled to native code.

JSON encoding

Request and response serialisation handled by the Rust crate.

Request parsing

Multipart and URL-encoded form parsing, the fast way.

Compression

Stream-based gzip, brotli, bzip2, lzma, and zstd codecs.
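As an illustration of what "stream-based" means for a codec, the sketch below compresses a body chunk by chunk with Python's standard `zlib` module instead of buffering it whole. It is not Flama's Rust implementation, and `gzip_stream` is a hypothetical helper name.

```python
import gzip
import typing
import zlib


def gzip_stream(chunks: typing.Iterable[bytes]) -> typing.Iterator[bytes]:
    """Compress an iterable of byte chunks incrementally, yielding gzip data."""
    # wbits=31 selects the gzip container (16) plus the largest window (15).
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        data = compressor.compress(chunk)
        if data:  # the compressor may buffer input and emit nothing yet
            yield data
    # Flush whatever is still buffered, including the gzip trailer.
    yield compressor.flush()


body = [b"hello ", b"streaming ", b"world"]
compressed = b"".join(gzip_stream(body))

# Round trip: the concatenated output is a regular gzip payload.
restored = gzip.decompress(compressed)
```

Because output is produced as input arrives, a response can start leaving the server before the full body exists in memory.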

Effortless Development

Flama is designed to be quick to learn and use. It achieves this with a simple, clear syntax and a rich set of built-in functionality, reducing boilerplate and development time.

Python has a wide range of data validation libraries that combine data types into structures, validate them, and serialise app-level objects to primitive Python types.


Flama natively supports Pydantic, Typesystem, and Marshmallow, now split into optional packages so you install only what you need. These libraries make it possible to standardise the API through generated OpenAPI schemas, and let you define API schemas effortlessly.


Flama's schema generator gathers all the API information it needs directly from your code and infers a schema for your API based on the OpenAPI standard. By default, the schema is also served at the route /schema/.

Extensibility

Flama ships a focused core for building, maintaining, and deploying model APIs, but the ecosystem around models moves fast and new tools appear all the time. Being able to plug those into your API matters.


Flama is extensible by design. With a simple Module you can build your own plugins and grow what Flama integrates with, without touching the core.

Extensibility
import typing

import mlflow
from flama import Flama, Module


class MLFlowModule(Module):
    name = "mlflow"

    def __init__(self, app: Flama, url: typing.Optional[str] = None, *args, **kwargs):
        super().__init__(app, *args, **kwargs)
        self.url = url

    async def on_startup(self):
        mlflow.set_tracking_uri(self.url)

    async def on_shutdown(self):
        ...

    def search_runs(self, experiment_ids: typing.List[str], filter_string: str):
        return mlflow.search_runs(experiment_ids, filter_string)


app = Flama(modules=[MLFlowModule])

# Module usage example
runs = app.mlflow.search_runs(["foo"], "tags.name = 'bar'")
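The pattern behind this example (plugins registered by name, reachable as attributes of the application, with lifecycle hooks) can be sketched without any dependencies. The `App`, `Module`, and `CounterModule` classes below are hypothetical stand-ins written for illustration. They are not Flama's classes and make no claim about how Flama implements modules internally.

```python
import asyncio


class Module:
    """Hypothetical plugin base class (a stand-in, not Flama's Module)."""

    name: str

    def __init__(self, app: "App"):
        self.app = app

    async def on_startup(self) -> None: ...

    async def on_shutdown(self) -> None: ...


class App:
    """Toy application exposing each module as an attribute named after it."""

    def __init__(self, modules: list):
        self.modules = {cls.name: cls(self) for cls in modules}

    def __getattr__(self, name: str) -> Module:
        # Only invoked when normal attribute lookup fails.
        try:
            return self.__dict__["modules"][name]
        except KeyError:
            raise AttributeError(name) from None

    async def startup(self) -> None:
        for module in self.modules.values():
            await module.on_startup()


class CounterModule(Module):
    name = "counter"

    async def on_startup(self) -> None:
        self.value = 0  # state is prepared when the application starts

    def increment(self) -> int:
        self.value += 1
        return self.value


app = App(modules=[CounterModule])
asyncio.run(app.startup())
first = app.counter.increment()
second = app.counter.increment()
```

Resolving plugins by their `name` keeps the application core unaware of any specific integration, which is what lets new ones be added without touching it.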

Development Tools

Building and debugging APIs can be slow and frustrating, especially when an error gives you nothing to go on. Identifying and fixing problems, from a simple typo to a misconfigured resource, eats into development time.


Flama provides graphical tools that make debugging direct: trace server errors (Internal Server Error) or requests to resources that do not exist (Not Found) at a glance.

Internal Server Error