
Packaging models

Any machine-learning model built with one of the mainstream data-science frameworks, e.g. Scikit-Learn, TensorFlow or PyTorch, can be served using Flama. This is indeed what we explained in the previous sections on the Flama CLI commands run, serve, and start. For this to happen, we needed either of the following two options:

  • A model packaged as a binary file (.flm files)
  • A model embedded in a Flama App

The second option will be explained in detail in the following sections: add models, model resource, and model components. The first option (which is the one we discuss in what follows) requires us to save the models following a certain procedure. For the sake of convenience, and to speed up the process of integrating these models into an API, Flama comes with the functionality required to serialise and package them, automatically adding important metadata which makes the resulting files operational.

FLM files

The binary files needed by the Flama CLI are typically named with the suffix .flm. We call them flama files for simplicity, but FLM stands for Flama Lightweight Model. The name reflects the fact that FLM files are a lightweight representation of ML models, enriched with useful metadata needed for later purposes, e.g. building a wrapper Flama app containing the model.

FLM file structure

The structure of an FLM file is designed to be as simple as possible, and aims at keeping all the information needed to load and use the model in a single file. It looks as follows:

```
├── model.flm
│   └── model
│       ├── model (python object)
│       └── meta
│           ├── id
│           ├── timestamp
│           ├── framework
│           ├── model
│           │   ├── obj
│           │   ├── info
│           │   ├── params
│           │   └── metrics
│           └── extra
└── artifacts
    ├── foo.json
    └── bar.csv
```

Dump & load

Let's consider the following familiar situation, which is the day-to-day routine of many data scientists. After careful experimentation, cross-validation, testing, and so on, we have found the optimal ML model for our problem. Great job! Now, we want to take our model out of our Jupyter Notebook and offer it as a service to make predictions on demand. The first thing we think about is pickling the model (i.e., using pickle.dump) and passing the resulting file to the corresponding team or colleague, who will develop the wrapper API that eventually unpickles the object (i.e., using pickle.load) and exposes the predict method. It sounds like a very repetitive and boring task, doesn't it?
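That manual hand-off can be sketched in a few lines of plain pickle. The model below is just a stand-in object, since the actual estimator class is irrelevant to the workflow:

```python
import os
import pickle
import tempfile

# Stand-in for a trained estimator; any picklable object works the same way.
model = {"weights": [0.1, -0.4, 2.3], "bias": 0.7}

# The data scientist's side: serialise the model to a file...
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and the API developer's side: load it back and wrap it in a service.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # → True
```

Every extra piece of information (model ID, training metrics, artifacts) would have to be passed around separately, which is exactly the bookkeeping FLM files take care of.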

As we have seen already when we introduced serve and start, Flama comes equipped with a very convenient CLI which does all the boring part for you seamlessly, with a single line of code. For this, we only need our models to be packaged with the Flama counterparts of pickle's dump and load functions, namely flama.dump and flama.load.

Dump method

Flama's dump method uses optimal compression to make the packing process more efficient and faster. The packing step can live completely outside any Flama application. Indeed, the natural place to package your models is at the model-building stage, which will very likely happen in your Jupyter notebook. An example of usage of this method:

```python
import datetime
import uuid

import flama

flama.dump(
    model,
    path="path/to/file.flm",
    compression="zstd",
    model_id=uuid.uuid4(),
    timestamp=datetime.datetime(2023, 3, 10, 11, 30, 0),
    params={"optimizer": "adam"},
    metrics={"recall": "0.95"},
    extra={
        "model_version": "1.0.0",
        "model_description": "This is a test model",
        "model_author": "John Doe",
        "model_license": "MIT",
        "tags": ["test", "example"],
    },
    artifacts={"foo.json": "path/to/artifact.json"},
)
```

The first parameter is the model object itself, and the keyword argument path specifies where the resulting file will be stored. The remaining parameters are optional: compression controls how the file is compressed, and the rest add metadata to the resulting file, which can be quite useful for model-management purposes:

  • model_id: a unique identifier for the model. If not provided, a random UUID will be generated.
  • compression: the compression format to be used. It can be one of the following: "bz2", "lzma", "zlib", or "zstd". The default value is "zstd" (Zstandard), which offers an excellent balance between compression ratio and speed.
  • timestamp: the timestamp of the model. If not provided, the current timestamp will be used.
  • params: a dictionary containing hyper-parameters used to train the model.
  • metrics: a dictionary containing metrics of the model, e.g. accuracy, recall, precision, etc.
  • extra: a dictionary containing any other metadata you might want to add to the model. This is a good place to add information about model version, description, author, license, tags, etc.
  • artifacts: a dictionary containing any artifacts associated with the model. The keys are the names of the artifacts, and the values are the paths to the files containing the artifacts. These files will be automatically packed and unpacked when the model is loaded.
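The defaulting behaviour described for model_id and timestamp can be illustrated with a small sketch. Note that resolve_metadata is a hypothetical helper written for illustration, not part of Flama's API:

```python
import datetime
import uuid

# Hypothetical helper mirroring the documented defaults: a random UUID
# when no model_id is given, and the current time when no timestamp is given.
def resolve_metadata(model_id=None, timestamp=None):
    if model_id is None:
        model_id = uuid.uuid4()
    if timestamp is None:
        timestamp = datetime.datetime.now()
    return model_id, timestamp

generated_id, generated_ts = resolve_metadata()
print(isinstance(generated_id, uuid.UUID))          # → True
print(isinstance(generated_ts, datetime.datetime))  # → True
```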

Compression

When serialising ML models, the resulting binary data can be quite large, especially for deep learning models with millions of parameters. To address this, Flama applies compression to the model data during the packing process. The compression parameter of the dump method controls which algorithm is used, and this is one of the key features that makes FLM files lightweight.

Flama supports four compression formats, all of which are well-established algorithms widely used in the Python ecosystem:

  • bz2 (Burrows-Wheeler): A block-sorting compressor that achieves good compression ratios but at the cost of slower compression and decompression speeds. Part of the Python standard library.
  • lzma (Lempel-Ziv-Markov chain): Achieves the highest compression ratios among the supported formats, but is significantly slower than the alternatives. Part of the Python standard library.
  • zlib (DEFLATE): A fast and well-proven compressor that offers good compression ratios with low latency. Part of the Python standard library.
  • zstd (Zstandard): A modern compression algorithm developed by Meta, designed to provide compression ratios comparable to lzma while approaching the speed of zlib. This is the default format in Flama.

To illustrate the practical impact of these choices, the following table summarises the key characteristics of each format, measured on a simple Scikit-Learn model (MLPClassifier trained on XOR):

| Format | File size | Dump speed | Load speed | Standard library |
|--------|-----------|------------|------------|------------------|
| bz2    | ~24.9 KB  | ~9.3 ms    | ~3.5 ms    | Yes              |
| lzma   | ~21.2 KB  | ~23.4 ms   | ~2.8 ms    | Yes              |
| zlib   | ~22.9 KB  | ~2.9 ms    | ~1.0 ms    | Yes              |
| zstd   | ~20.5 KB  | ~2.2 ms    | ~1.0 ms    | From Python 3.14 |

As the results clearly show, Zstandard ("zstd") stands out as the winner in most practical scenarios, combining the smallest file size with the fastest dump and load times. This is precisely why Flama uses it as the default compression format. For Python versions prior to 3.14, where zstd is not yet part of the standard library, Flama relies on the python-zstd package as a compatibility layer. From Python 3.14 onwards, zstd is included in the new compression standard library module, and Flama will use it natively.
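The trade-off between the three formats that ship with every Python version is easy to reproduce with a quick experiment. The payload here is synthetic, so the absolute sizes will differ from the table above, but the round-trip guarantee holds for any input:

```python
import bz2
import lzma
import zlib

# A compressible synthetic payload standing in for serialised model bytes.
payload = b"weights:0.123456789;" * 2000

sizes = {
    "bz2": len(bz2.compress(payload)),
    "lzma": len(lzma.compress(payload)),
    "zlib": len(zlib.compress(payload)),
}
for fmt, size in sizes.items():
    print(f"{fmt}: {len(payload)} -> {size} bytes")

# Whatever the ratios, every format must round-trip losslessly.
assert bz2.decompress(bz2.compress(payload)) == payload
assert lzma.decompress(lzma.compress(payload)) == payload
assert zlib.decompress(zlib.compress(payload)) == payload
```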

You can specify the desired compression format when calling flama.dump:

```python
import flama

# Using the default (zstd), recommended for most use cases
flama.dump(model, path="model.flm")

# Explicitly specifying the compression format
flama.dump(model, path="model.flm", compression="lzma")  # High compression, slowest dump speed
flama.dump(model, path="model.flm", compression="zlib")  # Good balance, no external dependency
flama.dump(model, path="model.flm", compression="bz2")   # Standard-library alternative
```

The compression format is stored within the FLM file itself, so flama.load automatically detects and applies the correct decompression; you do not need to specify the format when loading:

```python
model_artifact = flama.load(path="model.flm")  # Decompression is automatic
```

This transparent handling means you can change the compression format at any point without affecting downstream code that loads and uses the model.
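Flama's on-disk layout is an internal detail, but the general principle behind self-describing compressed files is simple: store an identifier for the codec next to the payload, and dispatch on it when reading. A minimal sketch with the standard-library codecs (dump_blob and load_blob are hypothetical helpers written for illustration, not Flama's API):

```python
import bz2
import lzma
import zlib

# One-byte codec tags mapped to (compress, decompress) pairs.
CODECS = {
    b"B": (bz2.compress, bz2.decompress),
    b"L": (lzma.compress, lzma.decompress),
    b"Z": (zlib.compress, zlib.decompress),
}

def dump_blob(data: bytes, fmt: bytes) -> bytes:
    # Prefix the compressed payload with its codec tag.
    compress, _ = CODECS[fmt]
    return fmt + compress(data)

def load_blob(blob: bytes) -> bytes:
    # Read the tag and pick the matching decompressor automatically.
    _, decompress = CODECS[blob[:1]]
    return decompress(blob[1:])

data = b"model bytes" * 100
# The caller never names the codec on load, regardless of how it was dumped.
assert load_blob(dump_blob(data, b"L")) == data
```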

Load method

Flama's load method is responsible for efficiently unpacking the model file. The unpacking stage will typically happen within the context of a Flama application. If you're not planning to develop one yourself, because you'll be using the Flama CLI instead, then you won't have to use the load method at all. An example of usage of this method:

```python
import flama

model_artifact = flama.load(path="path/to/file.flm")
```

The keyword argument path specifies the path to the file containing the model. The method returns a ModelArtifact object, which contains the attributes passed to the dump method, plus the model itself. The model can be accessed through the model attribute of the ModelArtifact object. This object also contains the artifacts dictionary, which you can inspect to find the paths where the artifacts were automatically unpacked. This is a very convenient feature: it lets you keep track of the artifacts associated with your model and access them easily, all from a single binary file.

Once we have introduced the methods which allow for packing (flama.dump) and loading (flama.load), we can proceed and show how the example files we've been using so far were generated.

Examples

Let's proceed to show how to pack scikit-learn, tensorflow, and pytorch models, respectively. The following examples are not intended to be complete or fully functional pieces of code. They aim at showing the steps relevant to packaging models, so they omit the natural surrounding stages: data loading and cleansing, training, and testing.

Scikit-Learn

```python
import flama
import numpy as np
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(activation="tanh", max_iter=2000, hidden_layer_sizes=(10,))
model.fit(
    np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
    np.array([0, 1, 1, 0]),
)

flama.dump(model, path="sklearn_model.flm")
```

TensorFlow

```python
import flama
import numpy as np
import tensorflow as tf

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(2,)),
        tf.keras.layers.Dense(10, activation="tanh"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)

model.compile(optimizer="adam", loss="mse")
model.fit(
    np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
    np.array([[0], [1], [1], [0]]),
    epochs=2000,
    verbose=0,
)

flama.dump(model, path="tensorflow_model.flm")
```

PyTorch

```python
import flama
import numpy as np
import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(2, 10)
        self.l2 = torch.nn.Linear(10, 1)

    def forward(self, x):
        x = torch.tanh(self.l1(x))
        x = torch.sigmoid(self.l2(x))
        return x

    def _train(self, X, Y, loss, optimizer):
        # Re-initialise the linear layers before training.
        for m in self.modules():
            if isinstance(m, torch.nn.Linear):
                m.weight.data.normal_(0, 1)

        steps = X.size(0)
        for i in range(2000):
            for j in range(steps):
                data_point = np.random.randint(steps)
                x_var = X[data_point]
                y_var = Y[data_point]

                optimizer.zero_grad()
                y_hat = self(x_var)
                loss_result = loss.forward(y_hat, y_var)
                loss_result.backward()
                optimizer.step()

        return self


X = torch.Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = torch.Tensor([0, 1, 1, 0]).view(-1, 1)
model = Model()
model._train(X, Y, loss=torch.nn.BCELoss(), optimizer=torch.optim.Adam(model.parameters()))

flama.dump(model, path="pytorch_model.flm")
```