Model
Serverless interaction
We have discussed how to serve models as APIs that we interact with over HTTP requests. However, it might be the case that we want to interact with a model directly without the overhead of a server, e.g.:
- Development and testing: We are working with a model locally and want to try it out on some data to quickly check everything behaves as expected. This is typical during the development stage of the model lifecycle.
- Streaming workflow: We want to use a model as part of a larger pipeline where it acts as a data processor in a stream, piping data in and getting output back.
It is then clear that we need a different way to make use of a model than the client-server approach discussed so far.
The command model lets us interact with models directly from the command line, with no server. It works with both
traditional predictive models and generative models; the artifact family recorded in the .flm manifest is
dispatched accordingly. To inspect the command options, run:
flama model --help
🔥 Flama v2.0.4
Usage: flama model [OPTIONS] FLAMA_MODEL_PATH COMMAND [ARGS]...
Interact with a packaged model without server.
Works with both traditional ML models and large language models. The artifact family is recorded in the .flm manifest at download time (``flama get --family ...``) and dispatched accordingly. LLM artifacts automatically pick the available runtime - vLLM on Linux/CUDA, MLX on macOS / Apple Silicon - based on what is importable in the current environment. To serve models over HTTP, use 'flama serve'.
<FLAMA_MODEL_PATH> is the path of the model to be used, e.g. 'path/to/model.flm'. This can be passed directly as argument of the command line, or by environment variable.
--channel-scanner, --tool-scanner and --tool-parser are LLM-only; they are rejected for ML artifacts at build time.
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --channel-scanner  DECODER_REF  Channel scanner format (LLM-only).           │
│ --tool-scanner     DECODER_REF  Tool scanner format (LLM-only).              │
│ --tool-parser      DECODER_REF  Tool body parser (LLM-only).                 │
│ --help                          Show this message and exit.                  │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────╮
│ inspect  Inspect a model artifact.                                           │
│ run      Run a one-shot inference.                                           │
│ stream   Stream output from a model.                                         │
╰──────────────────────────────────────────────────────────────────────────────╯
The --channel-scanner, --tool-scanner, and --tool-parser options are decoder controls for generative models; they
default to auto-detection and are rejected for predictive models.
Sub-commands
The model command exposes three sub-commands, each operating on a packaged .flm artifact: inspect reads its
metadata without loading the weights, while run and stream perform inference.
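When the command is driven from a script rather than typed by hand, the invocation shape can be assembled programmatically. The sketch below is only an illustration: `model_argv` is a hypothetical helper name, not part of Flama, and it merely builds the argument list following the `flama model FLAMA_MODEL_PATH COMMAND [ARGS]...` usage line without running anything.

```python
from typing import Sequence

# The three sub-commands listed by `flama model --help`.
SUBCOMMANDS = ("inspect", "run", "stream")


def model_argv(model_path: str, subcommand: str, args: Sequence[str] = ()) -> list[str]:
    """Build the argument list for `flama model <path> <subcommand> [args...]`.

    Hypothetical helper: it validates the sub-command name and returns the
    list; it does not execute the command.
    """
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown sub-command {subcommand!r}, expected one of {SUBCOMMANDS}")
    return ["flama", "model", model_path, subcommand, *args]


print(model_argv("path/to/model.flm", "inspect", ["--pretty"]))
```

The resulting list could then be handed to `subprocess.run`, which takes the input as its `input` argument and captures standard output when `capture_output=True` is set.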
Inspect
The sub-command inspect gives us access to the model metadata and the list of bundled artifacts, without loading the model:
flama model path/to/model.flm inspect --help
Usage: flama model FLAMA_MODEL_PATH inspect [OPTIONS]
Inspect a model artifact.
Extracts the model metadata, including the ID, time when the model was created, information of the framework, and the model info; and the list of artifacts packaged with the model.
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ -p, --pretty  Pretty print the model inspection.                             │
│ --help        Show this message and exit.                                    │
╰──────────────────────────────────────────────────────────────────────────────╯
Run
The sub-command run performs a one-shot inference. For predictive models, the input is a JSON list of feature vectors and the output is a JSON list of predictions. For generative models, the input is a prompt and the output is the generated response:
flama model path/to/model.flm run --help
Usage: flama model FLAMA_MODEL_PATH run [OPTIONS]
Run a one-shot inference.
For ML models, the input must be a JSON list of feature vectors and the output is a JSON list of predictions. For LLM models, the input is a prompt and the output is the generated response.
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ -i, --input   FILENAME  Input file to read from (defaults to stdin).         │
│ --transport             LLM input shape (raw|chat|conversation).             │
│ --system      TEXT      LLM system instruction (chat transport only).        │
│ --param       TEXT      Generation parameter as key=value (LLM only,         │
│                         repeatable).                                         │
│ -o, --output  FILENAME  File to be used as output. (default: stdout).        │
│ -p, --pretty            Pretty print the output.                             │
│ --channel     TEXT      Channel(s) to include in output (LLM only,           │
│                         repeatable).                                         │
│ --help                  Show this message and exit.                          │
╰──────────────────────────────────────────────────────────────────────────────╯
Stream
The sub-command stream emits output incrementally. For predictive models, the model emits one output per input item; for generative models, it emits the response one block at a time:
flama model path/to/model.flm stream --help
Usage: flama model FLAMA_MODEL_PATH stream [OPTIONS]
Stream output from a model.
For ML models, the input must be a JSON list of feature vectors and the model emits one output per item. For LLM models, the input is a prompt and the model emits one block at a time.
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ -i, --input   FILENAME  Input file to read from (defaults to stdin).         │
│ --transport             LLM input shape (raw|chat|conversation).             │
│ --system      TEXT      LLM system instruction (chat transport only).        │
│ --param       TEXT      Generation parameter as key=value (LLM only,         │
│                         repeatable).                                         │
│ -o, --output  FILENAME  File to be used as output for the stream.            │
│                         (default: stdout).                                   │
│ -b, --buffer            Buffer all output and write at once instead of       │
│                         streaming.                                           │
│ --channel     TEXT      Channel(s) to include in output (LLM only,           │
│                         repeatable).                                         │
│ --help                  Show this message and exit.                          │
╰──────────────────────────────────────────────────────────────────────────────╯
Predictive model examples
To illustrate the predictive usage, we use the Scikit-Learn model introduced in the Serve section.
Model inspection
To start with, let us inspect the model:
flama model sklearn_model.flm inspect --pretty
{
    "manifest": [
        "foo.json"
    ],
    "meta": {
        "capabilities": {
            "kind": "ml"
        },
        "extra": {
            "model_author": "John Doe",
            "model_description": "This is a test model",
            "model_license": "MIT",
            "model_version": "1.0.0",
            "tags": ["test", "example"]
        },
        "framework": {
            "config": null,
            "family": "ml",
            "lib": "sklearn",
            "version": "1.9.0"
        },
        "id": "cb659dec-ca09-40f8-a804-63c1f89113f6",
        "model": {
            "info": { ... },
            "metrics": {"recall": "0.95"},
            "obj": "MLPClassifier",
            "params": {"solver": "adam"}
        },
        "timestamp": "2026-06-17T16:47:58.002794"
    }
}
The usefulness of inspect is that it lets us see all the relevant information about a model without loading it, which
is handy for verifying model integrity and version within a continuous integration and continuous deployment (CI/CD)
process.
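As an illustration of such a CI/CD check, the sketch below validates the JSON printed by inspect against an expected framework and model version. It assumes the key layout shown in the sample output above (`meta.framework.lib` and `meta.extra.model_version`); `check_inspection` and `SAMPLE_INSPECTION` are hypothetical names, and the sample is a trimmed stand-in for real output rather than something produced by running the command.

```python
import json


def check_inspection(raw: str, expected_lib: str, expected_version: str) -> list[str]:
    """Return a list of problems found in inspect output; an empty list means it passed."""
    meta = json.loads(raw)["meta"]
    problems = []

    lib = meta["framework"]["lib"]
    if lib != expected_lib:
        problems.append(f"framework lib is {lib!r}, expected {expected_lib!r}")

    # "extra" holds user-supplied metadata, so treat it as optional here.
    version = (meta.get("extra") or {}).get("model_version")
    if version != expected_version:
        problems.append(f"model version is {version!r}, expected {expected_version!r}")

    return problems


# Trimmed stand-in for the output of `flama model sklearn_model.flm inspect`.
SAMPLE_INSPECTION = json.dumps(
    {
        "manifest": ["foo.json"],
        "meta": {
            "extra": {"model_version": "1.0.0"},
            "framework": {"family": "ml", "lib": "sklearn", "version": "1.9.0"},
        },
    }
)

print(check_inspection(SAMPLE_INSPECTION, "sklearn", "1.0.0"))
```

A pipeline step could fail the build whenever the returned list is non-empty.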
Inline input inference
Now let us run an inference by piping input through run:
echo '[[0, 0], [0, 1], [1, 0], [1, 1]]' | flama model sklearn_model.flm run
[0, 1, 1, 0]
The --pretty option formats the output:
echo '[[0, 0], [0, 1], [1, 0], [1, 1]]' | flama model sklearn_model.flm run --pretty
[
    0,
    1,
    1,
    0
]
File input inference
The run command can also read its input from a file and write its output to another:
flama model sklearn_model.flm run --input input.json --output output.json
Piping several models together
Because run reads from standard input and writes to standard output by default, we can pipe several models together in
a single command. For example, to feed the output of model_a.flm into model_b.flm:
echo '[0, 1, 2, 3]' | flama model model_a.flm run | flama model model_b.flm run
[1, 0, 1, 0]
This generalises to as many models as you need:
echo 'input' | flama model model_1.flm run | flama model model_2.flm run | ... | flama model model_n.flm run
Generative model examples
For generative usage, we use the google_gemma-4-E2B-it.flm artifact fetched in
getting models.
One-shot generation
Pipe a prompt to run to get the full response once generation completes:
echo "What is Python?" | flama model google_gemma-4-E2B-it.flm run
Python is a high-level, general-purpose programming language known for its readability.
Add a system instruction (chat transport only) or override generation parameters with repeatable --param:
echo "Explain dependency injection." | flama model google_gemma-4-E2B-it.flm run \
    --system "Be concise." \
    --param temperature=0.7 \
    --param max_tokens=100
Streaming generation
The stream command emits the response as it is produced, which is the natural way to read long generations at the
terminal:
echo "Explain dependency injection." | flama model google_gemma-4-E2B-it.flm stream
Dependency injection is a pattern where a component receives its dependencies from the outside...
Multi-turn conversations
With --transport conversation, the input is a JSON list of messages rather than a single prompt:
flama model google_gemma-4-E2B-it.flm run --transport conversation -i conversation.json
where conversation.json is a list of role/content objects, e.g.:
[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Flama?"}
]
Selecting channels
By default only the user-visible output channel is shown. Use --channel (repeatable) to include reasoning or other
channels; pass all to include every channel:
echo "Solve 2 + 2 step by step." | flama model google_gemma-4-E2B-it.flm run --channel all
With a single channel the output is plain text; with several channels (or all) the output is a JSON list of
{channel, text} blocks so you can tell each block's origin apart.
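To show how such multi-channel output might be consumed downstream, the sketch below groups the blocks by channel and joins their text in order. It assumes each block is a JSON object with `channel` and `text` keys, as described above. The channel names `reasoning` and `output`, the sample payload, and the `group_by_channel` helper are illustrative assumptions, not actual model output or part of Flama.

```python
import json
from collections import defaultdict


def group_by_channel(raw: str) -> dict[str, str]:
    """Join the text of all blocks per channel, preserving block order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for block in json.loads(raw):
        grouped[block["channel"]].append(block["text"])
    return {channel: "".join(parts) for channel, parts in grouped.items()}


# Invented payload in the {channel, text} shape; channel names are assumptions.
SAMPLE_BLOCKS = json.dumps(
    [
        {"channel": "reasoning", "text": "2 + 2 "},
        {"channel": "reasoning", "text": "= 4."},
        {"channel": "output", "text": "4"},
    ]
)

print(group_by_channel(SAMPLE_BLOCKS))
```

Keeping the blocks grouped this way lets a caller log the reasoning separately while passing only the user-visible text onwards.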