Flama
Flama is a Python framework that establishes a standard for the development and deployment of production-ready APIs, with a special focus on predictive machine learning (ML) and generative AI. Its guiding philosophy is to make the deployment of an API ridiculously simple, reducing the entire process, where possible, to a single line of code. This is the Framework for Lightweight Applications, artificial intelligence Models, and Automation.
Flama spans the full spectrum of modern model serving: from REST APIs, through predictive ML models, all the way to large language models (LLMs). The same one-line developer experience serves generative models through a first-class serving layer, exposes them behind OpenAI, Anthropic, and Ollama compatible endpoints, and provides tools to AI agents through the Model Context Protocol (MCP).
Core
Whatever you are serving, every Flama application is built on the same foundations. Some of its most remarkable characteristics are:
-
A Rust-powered core: the performance-critical paths (routing, JSON encoding, request parsing, and compression) are compiled into a native extension, so you get the speed-ups simply by running
pip install, with no Rust toolchain required. -
Streaming-first HTTP: Server-Sent Events (SSE) and NDJSON responses, plus stream-oriented HTTP throughout, so long generations stay responsive.
-
A pluggable schema system, with optional Pydantic, Typesystem, and Marshmallow backends, for declaring endpoint inputs and outputs with reliable, automatic data-type validation.
-
Dependency injection to ease the management of the parameters needed in endpoints via the use of Components. Flama objects such as Request, Response, and Session are defined as Components ready to be injected, and are also the base of the plugin ecosystem, so you can create custom ones or reuse those already defined.
-
Generic classes for API Resources with the convenience of standard CRUD methods over SQLAlchemy tables.
-
Auto-generated API schema using the OpenAPI standard, with interactive docs served from a Flama docs endpoint.
-
Automatic handling of pagination, with several methods at your disposal such as limit, offset, and page numbers, to name a few.
You can learn these foundations in depth in the Fundamentals section, and explore the Advanced topics for production-ready patterns.
Predictive AI
Flama turns a trained model into a fully-fledged API with minimal boilerplate. You can package a model trained with Scikit-Learn, TensorFlow, or PyTorch as a lightweight binary file, integrate it into your application, and customise how clients interact with it through Components and Resources. See the Predictive AI section to get started.
Generative AI
The same one-line experience extends to large language models. Flama provides a first-class serving layer for LLMs, with vLLM and MLX backends auto-detected at load time based on the available hardware (CUDA on Linux, Metal on macOS). The same model file works everywhere since the framework picks the right backend. It also provides a compatibility layer that exposes them through OpenAI, Anthropic, and Ollama compatible endpoints. It also ships a built-in chatbot application and native support for the Model Context Protocol (MCP), to expose tools, resources, and prompts to AI agents. See the Generative AI section for the details.
Domain Driven Design
Flama helps you place your business domain at the centre of the architecture from the very beginning, with the building blocks to apply Domain-Driven Design patterns such as repositories, workers, and domain models. Good practices and a clear separation of responsibilities should not be an afterthought. Flama removes all the boilerplate so you get them at zero cost. See the Domain Driven Design section to learn how to assemble an application around your domain.
Now that you know the main advantages of Flama, you can learn how to install it and run your first API here.