Getting models

Before you can serve a large language model (LLM) with Flama, it must be packaged as a self-contained artifact. In the Predictive AI section we saw how traditional models are serialised into .flm files. Generative models use the very same format, but instead of serialising a model you trained yourself, you typically fetch a pre-trained model from a hub and let Flama package it for you. This page covers the flama get command, which downloads a model and packages it for serving in a single step.

What is an LLM artifact?

In Flama, an LLM artifact is a .flm file (a Flama Lightweight Model) whose manifest records the model family as llm. As we saw with predictive models, a .flm file bundles the model weights, its configuration, and the metadata Flama needs to build a serving application around it. The family is the crucial difference: it is persisted in the manifest at download time and drives runtime dispatch, so an llm artifact is routed through the LLM machinery (vLLM or MLX) while an ml artifact runs through its predictive framework.

Why does the family matter?

  • Dispatch: The family tells Flama whether to wrap the artifact in an LLMResource or a predictive MLResource, and which backend to load.
  • Portability: The on-disk weights are framework-agnostic; the family, not the original library, decides how they are run, so the same artifact serves on vLLM (Linux/CUDA) or MLX (Apple Silicon).
  • Immutability: The family is fixed at packaging time and cannot be changed without repacking, which keeps the serving behaviour reproducible across environments.
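
The dispatch rule above can be pictured with a small sketch. It is a conceptual illustration only, not Flama's implementation: the function name choose_runtime, the probed module names (vllm, mlx) and the order in which backends are tried are all assumptions made for the example. Only the resource names LLMResource and MLResource come from the text above.

```python
from importlib.util import find_spec
from typing import Callable, Optional, Tuple

# Assumption: module names probed for each LLM backend, tried in this order.
LLM_BACKENDS = {"vllm": "vllm", "mlx": "mlx"}


def is_installed(module: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module) is not None


def choose_runtime(
    family: str, installed: Callable[[str], bool] = is_installed
) -> Tuple[str, Optional[str]]:
    """Map a manifest family to a (resource name, LLM backend) pair.

    Hypothetical helper: it mirrors the documented behaviour, nothing more.
    """
    if family == "ml":
        # Predictive artifacts run through the framework recorded in the manifest.
        return ("MLResource", None)
    if family == "llm":
        # Generative artifacts use whichever LLM backend is available.
        for backend, module in LLM_BACKENDS.items():
            if installed(module):
                return ("LLMResource", backend)
        raise RuntimeError("no LLM backend (vLLM or MLX) is installed")
    raise ValueError(f"unknown family: {family!r}")
```

Because the family is read from the manifest rather than inferred from the weights, the same decision is reached on every machine that loads the artifact; only the available backend differs.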

Fetching models

The flama get command downloads a model from a supported source and serialises it into a .flm artifact in a single step. Its help output describes the available options:

flama get --help
Usage: flama get [OPTIONS] MODEL_NAME

  Download and package a model as .flm.

  Download a model from a supported source and serialize it into Flama's .flm format, ready for serving with 'flama serve' or interaction with 'flama model'. The artifact family must be declared explicitly via ``--family``: ML artifacts run through the framework recorded in the manifest, while LLM artifacts are dispatched to vLLM or MLX at load time depending on what is installed.

  Example:
    flama get --source huggingface --family ml scikit-learn/Fish-Weight
    flama get --source huggingface --family llm Qwen/Qwen2.5-0.5B

╭─ Options ────────────────────────────────────────────────────────────────────╮
│     --source                         Model source provider.                  │
│     --family                         Artifact family recorded in the         │
│                                      manifest. Use 'ml' for traditional ML   │
│                                      models and 'llm' for large language     │
│                                      models. The choice is persisted in the  │
│                                      .flm manifest and drives runtime        │
│                                      dispatch at load time; it cannot be     │
│                                      changed without repacking.              │
│ -o, --output          TEXT           Output .flm path (default:              │
│                                      <model-name>.flm).                      │
│     --max-concurrent  INTEGER RANGE  Maximum number of files to download     │
│                                      concurrently.                           │
│     --help                           Show this message and exit.             │
╰──────────────────────────────────────────────────────────────────────────────╯

Two options are required: --source selects the provider (currently Hugging Face, passed as huggingface), and --family declares whether the artifact is a traditional ml model or an llm. For generative models you always pass --family llm.
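
When the download is scripted rather than typed by hand, the command line can be assembled programmatically. The sketch below is a hypothetical helper (build_get_command is not part of Flama); it uses only the options listed in the help output above and rejects any family other than ml or llm before the command is run.

```python
import shlex
from typing import List, Optional


def build_get_command(
    model_name: str,
    family: str,
    source: str = "huggingface",
    output: Optional[str] = None,
    max_concurrent: Optional[int] = None,
) -> List[str]:
    """Build the argv list for a `flama get` invocation (hypothetical helper)."""
    if family not in ("ml", "llm"):
        raise ValueError(f"family must be 'ml' or 'llm', got {family!r}")
    cmd = ["flama", "get", "--source", source, "--family", family]
    if output is not None:
        cmd += ["--output", output]
    if max_concurrent is not None:
        cmd += ["--max-concurrent", str(max_concurrent)]
    # The model name is the single positional argument.
    cmd.append(model_name)
    return cmd


if __name__ == "__main__":
    # Print the command as it would be typed in a shell.
    print(shlex.join(build_get_command("Qwen/Qwen2.5-0.5B", "llm")))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues with model names or output paths.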

Examples

Let us fetch a small instruction-tuned model so the examples in the rest of this section have something concrete to serve. We will use Gemma 4 E2B, the smallest model in the Gemma 4 family:

flama get --source huggingface --family llm google/gemma-4-E2B-it
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5.1 GB 24.3 MB/s 0:00:00
Packaging...
Model saved to google_gemma-4-E2B-it.flm

By default the artifact is written to <model-name>.flm with slashes replaced by underscores, hence google_gemma-4-E2B-it.flm. To choose a different path, pass --output:

flama get --source huggingface --family llm google/gemma-4-E2B-it --output models/assistant.flm
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5.1 GB 31.0 MB/s 0:00:00
Packaging...
Model saved to models/assistant.flm
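
If a script needs to know where the artifact will land when --output is omitted, the default naming rule described above (slashes replaced by underscores, then the .flm suffix) can be reproduced in a couple of lines. This is a sketch of the documented rule with a hypothetical function name, not code taken from Flama, and it does not cover any other characters Flama may normalise.

```python
def default_output_path(model_name: str) -> str:
    """Return the default artifact file name for a model (sketch of the documented rule)."""
    # Slashes in the hub identifier become underscores, then ".flm" is appended.
    return model_name.replace("/", "_") + ".flm"
```

For the model used on this page, default_output_path("google/gemma-4-E2B-it") gives google_gemma-4-E2B-it.flm, matching the path reported by the first command.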

Inspecting the artifact

Once the model is packaged, you can verify its metadata without loading the weights using the flama model command introduced in the CLI section. The inspect sub-command reads only the cheap metadata header:

flama model google_gemma-4-E2B-it.flm inspect --pretty
{
  "manifest": [
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json"
  ],
  "meta": {
    "capabilities": {
      "kind": "llm"
    },
    "extra": {
      "model_name": "google/gemma-4-E2B-it"
    },
    "framework": {
      "config": null,
      "family": "llm",
      "lib": "transformers",
      "version": "4.57.0"
    },
    "id": "0d7c3a4e-1f2b-4c8a-9e6d-5b4a3c2d1e0f",
    "model": {
      "info": null,
      "metrics": {},
      "obj": null,
      "params": {}
    },
    "timestamp": "2026-06-13T09:12:44.512030"
  }
}

Note the framework.family field set to llm: this is what tells Flama to treat the artifact as a generative model when it is served. With a packaged model in hand, we are ready to serve it.
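
The same check can be automated, for example in a deployment script that should refuse anything other than an llm artifact. The sketch below parses JSON shaped like the inspect output shown above; the helper names are hypothetical, and it assumes the command writes nothing but that JSON document to standard output.

```python
import json


def artifact_family(inspect_output: str) -> str:
    """Extract meta.framework.family from the JSON printed by the inspect sub-command."""
    data = json.loads(inspect_output)
    return data["meta"]["framework"]["family"]


def require_llm(inspect_output: str) -> None:
    """Raise ValueError unless the inspected artifact is an LLM artifact."""
    family = artifact_family(inspect_output)
    if family != "llm":
        raise ValueError(f"expected an 'llm' artifact, found family {family!r}")
```

One way to obtain the input is subprocess.run(["flama", "model", "google_gemma-4-E2B-it.flm", "inspect", "--pretty"], capture_output=True, text=True, check=True).stdout, which reuses the command shown above.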

On the next page we will register this artifact in a Flama application and interact with it over the native and vendor-compatible dialects.