From REST APIs to predictive and generative AI, seamlessly

Flama is a framework to rapidly build modern, robust APIs, with a Rust-powered core and first-class serving for both machine-learning models and large language models. Deploy models, ship a chatbot, or expose tools over MCP, in seconds.
Fire up your models with the flame 🔥

Framework
import flama
app = flama.Flama()
app.models.add_model("/puppy/", "/path/to/puppy_model.flm", name="Puppy")
if __name__ == "__main__": flama.run(flama_app=app, server_host="0.0.0.0", server_port=8080)

Machine Learning Responsive

Let’s face it, there isn’t a single ML framework. Models developed in such different frameworks should be easily integrated together in a single API. However this integration presents a technical challenge, typically unproductive and annoying for a data scientist.


Flama is thought from the very beginning to be compatible with the mainstream data-science frameworks, and it makes easy and simple the packaging of ML models to be integrated together.

Scikit Learn
import flamafrom sklearn.neural_network import MLPClassifier
model = MLPClassifier(activation="tanh", max_iter=2000, hidden_layer_sizes=(10,))model.fit( np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), np.array([0, 1, 1, 0]),)
flama.dump(model, "sklearn_model.flm")

Generative AI Serving

Serving a large language model should be as simple as serving any other model. With Flama 2.0, it is. Package a model, point an application at it, and you have a production-ready generative API in a single line of code.


Pick the dialects your clients already speak — OpenAI, Anthropic, Ollama, and the channel-aware native protocol — and Flama exposes them side by side. Your existing SDKs work by changing nothing but the base URL, with HuggingFace, vLLM, and MLX backends selected automatically for your hardware.

Native
import flama
app = flama.Flama()
# Flama's channel-aware native dialect powers /llm/query/, /llm/stream/ and /llm/chat/app.models.add_model( "/llm/", "Qwen_Qwen2.5-0.5B.flm", name="assistant", serving=("native",),)
if __name__ == "__main__": flama.run(flama_app=app, server_host="0.0.0.0", server_port=8080)

Chatbot Out of the Box

Every model you serve with the native dialect comes with a polished chat interface for free, served straight from your application at /llm/chat/. No frontend code, no build step.


Responses stream in token by token and render as Markdown, with LaTeX maths and Mermaid diagrams. Built with Flama and shipped as a single self-contained page, it is the fastest way to put a model in front of real users.

Chat interface
What is Flama in one sentence?

Flama is a Python framework that unifies REST API development with predictive and generative model serving into a single production stack. 🔥

flama serve --model model.flm
Send a message…

Model Context Protocol

Models are far more useful when they can reach into your world. Flama 2.0 ships native support for the Model Context Protocol (MCP), the open standard for exposing tools, resources, and prompts to AI clients.


Turn any Python function into an agent-ready tool with a single decorator. Flama derives its JSON Schema from your type hints and serves it over the stateless 2026-07-28 protocol, with Tasks, Elicitation, and MCP Apps included.

MCP Server
import flamafrom flama.mcp.server import MCPServer
app = flama.Flama()
tools = MCPServer("tools", version="2.0.0")

@tools.tool("add", description="Add two integers")def add(a: int, b: int) -> int: return a + b

app.mcp.add_server("/mcp/tools/", "tools", server=tools)
if __name__ == "__main__": flama.run(flama_app=app, server_host="0.0.0.0", server_port=8080)

Production-Ready First

Need your models serving ASAP? It does not feel right to have to wait months to see if your models work outside a Jupyter notebook, does it?


Flama makes the deployment of ML models into production as straightforwardly as possible. With the ease of a single command line your packaged models will be ready to serve via HTTP requests in seconds. Flama transforms any model into an ML-API ready to serve its purpose.

Command Line
> flama serve --model model.flm
INFO: Started server process [78260]INFO: Waiting for application startup.INFO: Application startup complete.INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

A Rust-Powered Core

Flama 2.0 moves its performance-critical paths into a compiled Rust core, built with PyO3 and shipped as native wheels for every supported Python version.


You get the speed-ups with a plain pip install — no Rust toolchain required — while the same ergonomic Python API you already know stays exactly the same.

Routing

Path matching and route resolving compiled to native code.

JSON encoding

Request and response serialisation handled by the Rust crate.

Request parsing

Multipart and URL-encoded form parsing, the fast way.

Compression

Stream-based gzip, brotli, bzip2, lzma, and zstd codecs.

Effortless Development

Flama is designed to be quick to learn and use. This goal is accomplished with a simple and clear syntax, and a rich spectrum of built-in functionality, reducing boilerplating and development time.

There is a wide spectrum of data validation libraries for Python to combine data types into structures, validate them, and provide tools for serialisation of app-level objects to primitive Python types.


Flama natively supports Pydantic, Typesystem, and Marshmallow, now split into optional packages so you install only what you need. These data-type validation libraries make possible the standardisation of the API via generation of OpenAPI schemas, and allow the user to define API schemas effortlessly.


Flama Schema generator gathers all the API information needed directly from your code and infers the schema that represents your API based on OpenAPI standard. The schema will be also served at the route /schema/ by default.

Models Lifecycle

Loading ML models in a production application is a demanding and prone-to-error task, which also depends on the specific ML framework.


Flama provides a clean solution to the problem via Components, which load models seamlessly.

Models Lifecycle
from flama import Flama, ModelComponentBuilder

with open("/path/to/model.flm", "rb") as f: component = ModelComponentBuilder.loads(f.read()) ModelType = component.get_model_type # Get the type to allow inject dependency

app = Flama(components=[component])

@app.get("/")def model_view(model: ModelType, model_input: str): """ tags: - model summary: Model prediction. description: Interact with the model to generate a prediction based on given input responses: 200: description: Model prediction. """ model_output = model.predict(model_input) return {"model_output": model_output}

Extensibility

Flama consists of a core of functionality for creating, maintaining and deploying ML-APIs. However, the ML arena is constantly changing, with new products for managing ML projects appearing very often. Being able to integrate your API with such third parties is of crucial importance.


Flama is natively an extensible framework. With the ease of Module you will be able to rapidly develop your own plugins and keep improving Flama integrability.

Extensibility
import typing
import mlflowfrom flama import Module, Flama

class MLFlowModule(Module): name = "mlflow"
def __init__(self, app: Flama, url: str = None, *args, **kwargs): super().__init__(app, *args, **kwargs) self.url = url
async def on_startup(self): mlflow.set_tracking_uri(self.url)
async def on_shutdown(self): ...
def search_runs(self, experiment_ids: typing.List[str], filter_string: str): return mlflow.search_runs(experiment_ids, filter_string)

app = Flama(modules=[MLFlowModule])
# Module usage examplemodel = app.mlflow.search_runs(["foo"], "tags.name = 'bar'")

Development Tools

The process of developing APIs for Machine Learning can be complex and time-consuming, especially when it comes to debugging. Debugging refers to the process of identifying and fixing errors in the code, which can range from simple syntax errors to more complex issues such as incorrect data access or resource management.


Flama provides graphical tools that make debugging simple and direct, allowing you to trace code errors (Internal Server Error), or access to non-existent resources (Not Found) with ease.

Internal Server Error