Getting models

Before serving a large language model (LLM) with Flama, you need it packaged as a self-contained artifact. In the Predictive AI section we saw how traditional models are serialised into .flm files; generative models use the very same format, but rather than serialising a model you have trained yourself, you typically fetch a pre-trained model from a hub and let Flama package it for you. This page covers the flama get command, the one-step way to download and package a model ready for serving.

What is an LLM artifact?

In Flama, an LLM artifact is a .flm file (a Flama Lightweight Model) whose manifest records the model family as llm. As we saw with predictive models, a .flm file bundles the model weights, its configuration, and the metadata Flama needs to build a serving application around it. The family is the crucial difference: it is persisted in the manifest at download time and drives runtime dispatch, so an llm artifact is routed through the LLM machinery (vLLM or MLX) while an ml artifact runs through its predictive framework.

Why does the family matter?

  • Dispatch: The family tells Flama whether to wrap the artifact in an LLMResource or a predictive ModelResource, and which backend to load.
  • Portability: The on-disk weights are framework-agnostic; the family, not the original library, decides how they are run, so the same artifact serves on vLLM (Linux/CUDA) or MLX (Apple Silicon).
  • Immutability: The family is fixed at packaging time and cannot be changed without repacking, which keeps the serving behaviour reproducible across environments.
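
The dispatch rule described above can be pictured as a simple lookup keyed on the family. The sketch below is illustrative only: resource_for_family is a hypothetical helper, not part of Flama's API, and it returns the resource names mentioned in the text as plain strings rather than real classes.

```python
# Hypothetical illustration of family-based dispatch; this is NOT Flama's code.
FAMILY_TO_RESOURCE = {
    "llm": "LLMResource",   # generative models, run through vLLM or MLX
    "ml": "ModelResource",  # predictive models, run through their own framework
}


def resource_for_family(family: str) -> str:
    """Return the name of the resource type that would wrap an artifact."""
    try:
        return FAMILY_TO_RESOURCE[family]
    except KeyError:
        # Any family other than the two recorded in the manifest is rejected.
        raise ValueError(f"Unknown artifact family: {family!r}") from None
```

Because the lookup depends only on the value stored in the manifest, the same artifact resolves to the same resource type in every environment.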

Fetching a model with flama get

The flama get command downloads a model from a supported source and serialises it into a .flm artifact in a single step. Its help output lists the available options:

flama get --help
Usage: flama get [OPTIONS] MODEL_NAME

  Download and package a model as .flm.

  Download a model from a supported source and serialize it into Flama's .flm format, ready for serving with 'flama serve' or interaction with 'flama model'.

Options:
  --source [huggingface]          Model source provider.  [required]
  --family [ml|llm]               Artifact family recorded in the manifest.  [required]
  -o, --output TEXT               Output .flm path (default: <model-name>.flm).
  --max-concurrent INTEGER RANGE  Maximum number of files to download concurrently.  [default: 8; x>=1]
  --help                          Show this message and exit.

Two options are required: --source selects the provider (currently only Hugging Face), and --family declares whether the artifact is a traditional ml model or an llm. For generative models, always pass --family llm.
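
If you prefer to drive the download from a script, the options listed in the help output can be assembled into an argument list. This is a minimal sketch, not an official Flama API: build_get_command and fetch_model are hypothetical helpers that only wrap the documented command-line flags, and fetch_model assumes the flama executable is on your PATH and that you have network access.

```python
import subprocess
from typing import List, Optional


def build_get_command(
    model_name: str,
    family: str = "llm",
    source: str = "huggingface",
    output: Optional[str] = None,
    max_concurrent: Optional[int] = None,
) -> List[str]:
    """Assemble the argument list for `flama get` from the documented options."""
    if family not in ("ml", "llm"):
        raise ValueError("family must be 'ml' or 'llm'")
    command = ["flama", "get", "--source", source, "--family", family]
    if output is not None:
        command += ["--output", output]
    if max_concurrent is not None:
        # The help text documents the valid range as x>=1.
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        command += ["--max-concurrent", str(max_concurrent)]
    command.append(model_name)
    return command


def fetch_model(model_name: str, **options) -> None:
    """Run `flama get`; raises CalledProcessError if the command fails."""
    subprocess.run(build_get_command(model_name, **options), check=True)
```

Keeping the command construction separate from its execution makes the helper easy to check without downloading anything.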

Examples

Let us fetch a small pre-trained model so that the examples in the rest of this section have something concrete to serve. We will use Qwen2.5-0.5B, which is small enough to download quickly:

flama get --source huggingface --family llm Qwen/Qwen2.5-0.5B
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1.2 GB 24.3 MB/s 0:00:00
Packaging...
Model saved to Qwen_Qwen2.5-0.5B.flm

By default the artifact is written to <model-name>.flm with slashes replaced by underscores, hence Qwen_Qwen2.5-0.5B.flm. To choose a different path, pass --output:

flama get --source huggingface --family llm Qwen/Qwen2.5-0.5B --output models/assistant.flm
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1.2 GB 31.0 MB/s 0:00:00
Packaging...
Model saved to models/assistant.flm
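
The default naming rule mentioned above (slashes replaced by underscores, followed by the .flm extension) can be reproduced in a script, for example to predict where an artifact will land. The helper below is a hypothetical sketch inferred from that single rule; Flama's actual implementation may sanitise other characters as well.

```python
def default_output_path(model_name: str) -> str:
    """Mirror the documented default: slashes become underscores, then '.flm'."""
    # Assumption: only '/' is rewritten, as described in the text.
    return model_name.replace("/", "_") + ".flm"


print(default_output_path("Qwen/Qwen2.5-0.5B"))  # Qwen_Qwen2.5-0.5B.flm
```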

Inspecting the artifact

Once the model is packaged, you can verify its metadata without loading the weights by using the flama model command introduced in the CLI section. The inspect sub-command reads only the metadata header, which is cheap to parse:

flama model Qwen_Qwen2.5-0.5B.flm inspect --pretty
{
  "manifest": [
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json"
  ],
  "meta": {
    "id": "0d7c3a4e-1f2b-4c8a-9e6d-5b4a3c2d1e0f",
    "timestamp": "2026-06-13T09:12:44.512030",
    "framework": {
      "family": "llm",
      "lib": "transformers",
      "version": "4.57.0",
      "config": null
    },
    "model": {
      "obj": null,
      "info": null,
      "params": {},
      "metrics": {}
    },
    "extra": {
      "model_name": "Qwen/Qwen2.5-0.5B"
    }
  }
}

Note the framework.family field set to llm: this is what tells Flama to treat the artifact as a generative model when it is served. With a packaged model in hand, we are ready to serve it.
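
The family check can also be automated, for instance as a sanity step before deployment. The following is a minimal sketch that assumes the inspect output is valid JSON with the meta.framework.family layout shown above; family_from_inspect and inspect_artifact are hypothetical helper names, and inspect_artifact requires the flama CLI to be installed.

```python
import json
import subprocess


def family_from_inspect(inspect_output: str) -> str:
    """Extract meta.framework.family from the JSON printed by `inspect`."""
    document = json.loads(inspect_output)
    return document["meta"]["framework"]["family"]


def inspect_artifact(path: str) -> str:
    """Run `flama model <path> inspect --pretty` and return its standard output."""
    result = subprocess.run(
        ["flama", "model", path, "inspect", "--pretty"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Abridged sample based on the inspect output shown above.
sample = '{"meta": {"framework": {"family": "llm", "lib": "transformers"}}}'
assert family_from_inspect(sample) == "llm"
```

With the CLI available, family_from_inspect(inspect_artifact("Qwen_Qwen2.5-0.5B.flm")) would perform the same check against the real artifact.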

On the next page we will register this artifact in a Flama application and interact with it over the native and vendor-compatible dialects.