Getting models
Before serving a large language model (LLM) with Flama, you need it packaged as a self-contained artifact. In
the Predictive AI section we saw how traditional models are serialised
into .flm files; generative models use the very same format, but rather than serialising a model you have trained
yourself, you typically fetch a pre-trained model from a hub and let Flama package it for you. This page covers
the flama get command, the one-step way to download and package a model ready for serving.
What is an LLM artifact?
In Flama, an LLM artifact is a .flm file (a Flama Lightweight Model) whose manifest records the model
family as llm. As we saw with predictive models, a .flm file bundles the model weights, its configuration, and
the metadata Flama needs to build a serving application around it. The family is the crucial difference: it is
persisted in the manifest at download time and drives runtime dispatch, so an llm artifact is routed through the LLM
machinery (vLLM or MLX) while an ml artifact runs through its predictive framework.
Why does the family matter?
- Dispatch: The family tells Flama whether to wrap the artifact in an LLMResource or a predictive MLResource, and which backend to load.
- Portability: The on-disk weights are framework-agnostic; the family, not the original library, decides how they are run, so the same artifact serves on vLLM (Linux/CUDA) or MLX (Apple Silicon).
- Immutability: The family is fixed at packaging time and cannot be changed without repacking, which keeps the serving behaviour reproducible across environments.
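The dispatch rule in the list above can be pictured as a small lookup. The sketch below is illustrative only and is not Flama's implementation: `pick_resource` and `pick_llm_backend` are hypothetical helpers, the probed module names (`vllm`, `mlx`) and their order are assumptions, and only the resource names `LLMResource` and `MLResource` come from the text.

```python
import importlib.util
from typing import Optional, Sequence

# Resource names as they appear in the text; the mapping itself is a
# hypothetical illustration, not Flama's internal table.
RESOURCE_BY_FAMILY = {"llm": "LLMResource", "ml": "MLResource"}


def pick_resource(family: str) -> str:
    """Return the name of the resource wrapper for a manifest family."""
    try:
        return RESOURCE_BY_FAMILY[family]
    except KeyError:
        raise ValueError(f"unknown family {family!r}; expected 'ml' or 'llm'") from None


def pick_llm_backend(candidates: Sequence[str] = ("vllm", "mlx")) -> Optional[str]:
    """Return the first candidate module that is importable, or None.

    The candidate names and their order are assumptions made for this sketch.
    """
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None
```

Because the family is read from the manifest rather than inferred from the weights, a lookup of this kind gives the same answer on every machine; only the backend probe depends on what is installed locally.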
Fetching models
The flama get command downloads a model from a supported source and serialises it into a .flm artifact in a single
step. Its help output lists the available options:
flama get --help
Usage: flama get [OPTIONS] MODEL_NAME
  Download and package a model as .flm.
  Download a model from a supported source and serialize it into Flama's .flm format, ready for serving with 'flama serve' or interaction with 'flama model'. The artifact family must be declared explicitly via ``--family``: ML artifacts run through the framework recorded in the manifest, while LLM artifacts are dispatched to vLLM or MLX at load time depending on what is installed.
  Example:
    flama get --source huggingface --family ml scikit-learn/Fish-Weight
    flama get --source huggingface --family llm Qwen/Qwen2.5-0.5B
Options:
  --source                        Model source provider.
  --family                        Artifact family recorded in the manifest. Use 'ml' for traditional ML models and 'llm' for large language models. The choice is persisted in the .flm manifest and drives runtime dispatch at load time; it cannot be changed without repacking.
  -o, --output TEXT               Output .flm path (default: <model-name>.flm).
  --max-concurrent INTEGER RANGE  Maximum number of files to download concurrently.
  --help                          Show this message and exit.
Two options are required: --source selects the provider (currently HuggingFace), and
--family declares whether the artifact is a traditional ml model or an llm. For generative models you always pass
--family llm.
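When fetching models from a script, the same invocation can be assembled programmatically. The following is a minimal sketch, assuming the `flama` executable is on the `PATH`; `build_get_command` and `fetch_model` are hypothetical helpers, and only the flags listed in the help output are used.

```python
import subprocess
from typing import List, Optional

VALID_FAMILIES = ("ml", "llm")


def build_get_command(
    model_name: str,
    family: str,
    source: str = "huggingface",
    output: Optional[str] = None,
    max_concurrent: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `flama get` without running it."""
    if family not in VALID_FAMILIES:
        raise ValueError(f"family must be one of {VALID_FAMILIES}, got {family!r}")
    cmd = ["flama", "get", "--source", source, "--family", family]
    if output is not None:
        cmd += ["--output", output]
    if max_concurrent is not None:
        cmd += ["--max-concurrent", str(max_concurrent)]
    cmd.append(model_name)
    return cmd


def fetch_model(model_name: str, family: str = "llm", **options) -> None:
    """Run `flama get`; raises CalledProcessError if the command fails."""
    subprocess.run(build_get_command(model_name, family, **options), check=True)
```

Validating the family before spawning the process surfaces a typo immediately instead of after the CLI has started, which matters because the family cannot be changed without repacking.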
Examples
Let us fetch a small instruction-tuned model so the examples in the rest of this section have something concrete to serve. We will use Gemma 4 E2B, the smallest model in the Gemma 4 family:
flama get --source huggingface --family llm google/gemma-4-E2B-it
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5.1 GB 24.3 MB/s 0:00:00
Packaging...
Model saved to google_gemma-4-E2B-it.flm
By default the artifact is written to <model-name>.flm with slashes replaced by underscores, hence
google_gemma-4-E2B-it.flm. To choose a different path, pass --output:
flama get --source huggingface --family llm google/gemma-4-E2B-it --output models/assistant.flm
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5.1 GB 31.0 MB/s 0:00:00
Packaging...
Model saved to models/assistant.flm
Inspecting the artifact
Once the model is packaged, you can verify its metadata without loading the weights using the flama model command
introduced in the CLI section. The inspect sub-command reads only the cheap metadata header:
flama model google_gemma-4-E2B-it.flm inspect --pretty
{
  "manifest": [
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json"
  ],
  "meta": {
    "capabilities": {
      "kind": "llm"
    },
    "extra": {
      "model_name": "google/gemma-4-E2B-it"
    },
    "framework": {
      "config": null,
      "family": "llm",
      "lib": "transformers",
      "version": "4.57.0"
    },
    "id": "0d7c3a4e-1f2b-4c8a-9e6d-5b4a3c2d1e0f",
    "model": {
      "info": null,
      "metrics": {},
      "obj": null,
      "params": {}
    },
    "timestamp": "2026-06-13T09:12:44.512030"
  }
}
Note the framework.family field set to llm: this is what tells Flama to treat the artifact as a generative
model when it is served. With a packaged model in hand, we are ready to serve it.
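The default naming rule and the family check can both be automated, for instance in a deployment script. This is a rough sketch rather than part of Flama: `default_artifact_name` and `artifact_family` are hypothetical helpers, and the second one assumes the inspect output is JSON with the `meta.framework.family` layout shown in the sample.

```python
import json


def default_artifact_name(model_name: str) -> str:
    """Apply the documented default: slashes become underscores, plus .flm."""
    return model_name.replace("/", "_") + ".flm"


def artifact_family(inspect_output: str) -> str:
    """Extract the family from the JSON text printed by the inspect sub-command.

    Assumes the layout of the sample output: meta -> framework -> family.
    """
    document = json.loads(inspect_output)
    return document["meta"]["framework"]["family"]
```

One way to use it is to capture the standard output of `flama model google_gemma-4-E2B-it.flm inspect` and pass the text to `artifact_family`, refusing to deploy when the result is not `llm`.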
On the next page we will register this artifact in a Flama application and interact with it over the native and vendor-compatible dialects.