Introduction to The kedro-mlflow Implementation: Path Setting, Artifact Storing, and Metrics Saving

M Hamid Asn
7 min read · Nov 13, 2022


Imagine you have a large machine learning project that requires various kinds of experiments in collaboration with company A. You and your team will certainly not only think about how to make a good model; the process before modeling, after modeling, the duration of the modeling, and the flow are also important things to consider.

When presenting your team’s progress to company A, you presented many things. You started with how the Kedro framework makes your code reproducible, maintainable, and modular. You also presented the results of many models built with various algorithms, various parameter combinations, and various conditions (differently preprocessed data). You think your progress is very good and quite satisfactory, until company A asks, “With so many models and conditions to experiment with, wouldn’t it be much better if all experimental results were integrated and managed in one place?”

Of course, integrating all the results of machine learning experiments into one place will be much better and more efficient. In addition to keeping track of all experiments, this integration can also make it easier for company A to find out which experiments have been carried out. It would be great if there was a platform or tool that automatically did it all.

In this article, you will learn about a tool that can help you on your machine learning experiment journey by integrating all your experiments into one place! It is called kedro-mlflow.

MLflow Overview

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. MLflow has four main functions: MLflow Tracking, MLflow Projects, MLflow Models, and the MLflow Model Registry.

In this article we will take advantage of one of the four main functions, namely MLflow Tracking. As the name suggests, MLflow Tracking is used to record experiments so that their parameters and results can be compared.

kedro Overview

Kedro is an open source Python framework that makes machine learning code (or data science code in general) reproducible, maintainable, and modular. Kedro is inspired by the software engineering pipeline concept.

By using kedro, we can easily reproduce our machine learning projects, compared, for example, to running one cell at a time in a Jupyter notebook to get our model. With kedro, we just need to run kedro run on the CLI to execute all the steps needed to create our model!

Kedro has many plugins (kedro-plugins) that can be used for our project’s purposes. Kedro-mlflow is one of the kedro-plugins that can be used to integrate our experimental results into one place.

kedro-mlflow

Kedro-mlflow is a plugin for the kedro framework. With kedro-mlflow, we can integrate all the experimental results (parameters, artifacts, metrics, or models) of the machine learning work we do.

Because, as previously mentioned, kedro-mlflow is a plugin for the kedro framework, it can only be used on kedro projects. To explore and understand more about kedro-mlflow, you need to be familiar with kedro first (its structure, how to use it, etc.).

In this article, a basic kedro-mlflow implementation will be carried out on a topic modelling project using the OCTIS framework (an ML framework for topic modelling). Because OCTIS already handles parameter optimization for the model, and the model will not be used for prediction, the kedro-mlflow implementation in this article focuses on storing artifacts and saving metrics.

What is an artifact? An artifact is a very flexible and convenient way to “bind” data to your code. Binding the data can be done easily by saving the data locally in the desired file format, then uploading it to the artifact store.

What is a metric? A metric is a value used to evaluate a model. There are various kinds of metrics, each with its own function and use. In this topic modelling case, the coherence value (with the ‘c_v’ measure) will be used as the metric.

To initialize the kedro-mlflow plugin on an existing kedro project, first make sure that the plugin is installed (easily done by running pip install kedro-mlflow on the CLI). Then, on the CLI, change directory to the intended kedro project and run:

kedro mlflow init

There will be a message, "conf/local/mlflow.yml" has been successfully updated, as a sign that kedro-mlflow has been successfully initialized in your project.

To keep track of metrics and artifacts, we only need to edit three files in our kedro project: catalog.yml (the project’s dataset catalog), nodes.py (the modelling pipeline’s node), and mlflow.yml. To make it clearer, below is the directory structure of the kedro topic modelling project using OCTIS.

#Kedro (octis_topic_modelling) Project Structure
+---conf
|   +---base
|   |   +---parameters
|   |   \---catalog.yml      #This File
|   \---local
|       \---mlflow.yml       #This File
+---data
+---docs
+---logs
+---notebooks
+---src
|   \---octis_tm
|       \---pipelines
|           +---data_preprocessing
|           \---modelling
|               \---nodes.py #This File
+---tests

First, before setting up the artifacts and metrics, we need to configure the mlflow_tracking_uri path in mlflow.yml. mlflow_tracking_uri is the path where the runs will be recorded. If you have configured your own MLflow server, you can specify its tracking URL in mlflow_tracking_uri. Otherwise, if you will only run this project locally, you can use the default path, mlruns, which will automatically create an mlruns folder to store your experiment records.

conf/local/mlflow.yml

# SERVER CONFIGURATION -------------------
...
server:
  mlflow_tracking_uri: mlruns
  credentials: null
...
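By contrast, if you did have your own MLflow server, the same key would simply point at its URL instead. The address below is hypothetical, for illustration only:

```yaml
# conf/local/mlflow.yml -- hypothetical remote-server setup, not used in this article
server:
  mlflow_tracking_uri: http://my-mlflow-server:5000  # replace with your server's URL
  credentials: null
```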

To save metrics in kedro-mlflow, we only need to initialize MlflowMetricDataSet in the node that produces the metric values we are using (in this case, nodes.py in the modelling pipeline). There are many ways to use MlflowMetricDataSet in kedro-mlflow; more details can be found here. In this case, I already have the coherence metric from the previous model training, so I just need to save the metric value as follows:

src/octis_tm/pipelines/modelling/nodes.py

from kedro_mlflow.io.metrics import MlflowMetricDataSet

def run_model():
    ...
    highest_model = _get_highest_model(model_par)
    coh_score = npmi.score(highest_model)
    # Save the coherence score to MLflow under the key "coherence"
    metric_ds = MlflowMetricDataSet(key="coherence")
    metric_ds.save(coh_score)
    ...

To store artifacts in the artifact store, we need to know in advance which data will be used as artifacts. In this case, I will save the two output files (result.xlsx and model_info.txt) as artifacts. To save a file as an artifact, we just need to set the type kedro_mlflow.io.artifacts.MlflowArtifactDataSet on the data entry in catalog.yml. Here’s how to add that type to our data:

conf/base/catalog.yml

...
_excel: &excel
  type: pandas.ExcelDataSet
  load_args:
    engine: openpyxl

#08_reporting
result:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    <<: *excel
    filepath: data/08_reporting/result.xlsx

model_info:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: text.TextDataSet
    filepath: data/08_reporting/model_info.txt
...

After setting up the MLflow path, metrics, and artifacts, we can run our kedro project as usual with kedro run to run the entire pipeline, or run a single pipeline with kedro run --pipeline=<pipeline_name>.

To see the UI of kedro-mlflow, you can run kedro mlflow ui on the CLI; a link will then appear on the CLI that directs you to your kedro-mlflow UI. The following is an example of the kedro-mlflow UI for the topic modelling project using OCTIS.

In the figure above, you can see that I have run kedro three times: one run for the data preprocessing pipeline and the rest for modelling. The summary of each kedro run is quite clear, from the duration to the metrics and more. To find out more about a run, as well as the artifacts stored with it, you can click on the desired start time.

The image below is a more detailed view of the kedro run above. On this page we can also view and download the artifacts stored with this run.

That’s it for “Introduction to The kedro-mlflow Implementation”, covering the basics of the plugin (path setting, artifact storing, and metrics saving). Just by implementing these basics, I can already feel the benefits of this plugin: I can keep track of various types of experiments just by looking at the UI, and it absolutely helps me develop my ML projects!

There are still many kedro-mlflow features not covered in this article, such as versioning, model saving, parameter saving, etc. But these three things are a great first step toward exploring and taking further advantage of the kedro-mlflow plugin. To learn more about kedro-mlflow, you can check the documentation here.

Learning Experience?

So what do you think about kedro-mlflow? Do you think this plugin could help you with your project? In my experience, learning this plugin was not easy at first because there were a lot of new things I needed to understand (for example, the artifact concept, and making sure that my files are all connected to each other), but it was worth it. The biggest difference since I learned about this plugin is that when I conduct ML experiments, I no longer need to note the evaluation of each model/experiment one by one in a spreadsheet or another manual document. I can simply run my code, and it will automatically save the metrics and artifacts. A tip from me when you want to learn this useful plugin: understand how to use kedro first, because it will help you with your kedro-mlflow learning process!
