Motivation
When should I use kedro-mlflow?
Basically, you should use `kedro-mlflow` in any Kedro project which involves machine learning / deep learning. As stated in the introduction, Kedro’s current versioning (as of version 0.16.6) is not sufficient for machine learning projects: it lacks a UI and a run management system. Besides, the ability of `KedroPipelineModel` to serve a Kedro pipeline as an API or a batch in one line of code is a great addition for collaboration and transition to production.
If you do not use Kedro, or if you do pure data processing that does not involve machine learning, this plugin is not what you are looking for ;)
Why should I use kedro-mlflow?
Benchmark of existing solutions
This paragraph gives a (quick) overview of existing solutions for mlflow integration inside Kedro projects.
Mlflow is very simple to add to any existing code. It is a 2-step process:
- add `log_{XXX}` (either param, artifact, metric or model) functions where they are needed inside the code
- add an `MLProject` file at the root of the project to enable CLI execution. This file must contain all the possible execution steps (like the `pipeline.py` / `hooks.py` in a Kedro project).
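The first step can be sketched as follows. This is only an illustration, not code from any of the projects mentioned here: `train_and_log`, its toy "training" formula and the optional `tracker` argument are hypothetical names. When no tracker is passed, the function falls back to the `mlflow` module (assumed to be installed) and calls its `log_param` and `log_metric` functions.

```python
from typing import Any, Mapping, Optional


def train_and_log(params: Mapping[str, Any], tracker: Optional[Any] = None) -> float:
    """Toy training step with tracking calls mixed into the business code."""
    if tracker is None:
        # Assumption: mlflow is installed; its fluent API starts a run
        # automatically if none is active.
        import mlflow

        tracker = mlflow

    # Step 1 of the process above: log_{XXX} calls placed where needed.
    for name, value in params.items():
        tracker.log_param(name, value)

    # Placeholder for a real training loop (hypothetical formula).
    score = 1.0 / (1.0 + params["learning_rate"])
    tracker.log_metric("score", score)
    return score
```

Note that the tracking calls live inside the training function itself, which is exactly the pattern this page goes on to question.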
Including mlflow inside a Kedro project is consequently very easy: the logging functions can be added in the code, and the `MLProject` is very simple and is composed almost only of the `kedro run` command. You can find examples of such implementations:
- the Medium article by QuantumBlack employees.
- the associated GitHub repo
- other examples can be found on GitHub, but AFAIK all of them follow the very same principles.
Enforcing Kedro principles
The implementations above have the advantage of being very straightforward and mlflow compliant, but they break several Kedro principles:
- the `MLFLOW_TRACKING_URI`, which registers the database where runs are logged, is declared inside the code instead of a configuration file, which hinders portability across environments and makes transition to production more difficult
- the logging of different elements can be put in many places in the Kedro template (in the code of any function involved in a `node`, in a `Hook`, in the `ProjectContext`, in a `transformer`…). This is not compliant with the Kedro template, where any object has a dedicated location. We want to avoid logging calls being scattered anywhere because:
    - it is very error-prone (one can forget to log one parameter)
    - it is hard to modify (if you want to remove / add / modify an mlflow action you have to find it in the code)
    - it prevents reuse (reusable functions must not contain mlflow-specific code unrelated to their functional specificities; only their execution must be tracked).
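The first point can be illustrated with a minimal sketch that keeps the tracking URI out of the code. Everything here is an assumption made for illustration: the `TRACKING_CONFIG` dictionary stands in for a per-environment configuration file, and `resolve_tracking_uri`, the environment names and the example URIs are hypothetical. Only the `MLFLOW_TRACKING_URI` environment variable name comes from mlflow itself.

```python
import os
from typing import Mapping

# Hypothetical per-environment settings. In a real project these values
# would be read from a configuration file rather than defined in Python.
TRACKING_CONFIG = {
    "local": {"mlflow_tracking_uri": "mlruns"},
    "prod": {"mlflow_tracking_uri": "http://mlflow.example.com:5000"},
}


def resolve_tracking_uri(
    config: Mapping[str, Mapping[str, str]],
    env: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Pick the tracking URI for an environment without hard-coding it."""
    # An explicit MLFLOW_TRACKING_URI environment variable takes precedence.
    override = environ.get("MLFLOW_TRACKING_URI")
    if override:
        return override
    return config[env]["mlflow_tracking_uri"]
```

The resolved value could then be handed to `mlflow.set_tracking_uri` in a single place, so that switching environments changes configuration rather than code.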
`kedro-mlflow` enforces these best practices while implementing a clear interface for each mlflow action in the Kedro template. The chart below maps each mlflow action to the Python API provided by `kedro-mlflow` and to the location in the Kedro template where the action should be performed.
Mlflow action | Template file | Python API |
---|---|---|
Set up configuration | `mlflow.yml` | `MlflowHook` |
Logging parameters | `mlflow.yml` | `MlflowHook` |
Logging artifacts | `catalog.yml` | `MlflowArtifactDataset` |
Logging models | `catalog.yml` | `MlflowModelTrackingDataset` and `MlflowModelLocalFileSystemDataset` |
Logging metrics | `catalog.yml` | `MlflowMetricsHistoryDataset` |
Logging Pipeline as model | `hooks.py` | `KedroPipelineModel` and `pipeline_ml_factory` |
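The idea behind this mapping, that reusable functions stay free of tracking code and only their execution is tracked, can be sketched without any Kedro or mlflow dependency. This is not the `kedro-mlflow` API: `scale`, `run_tracked` and the `log_param` callable are hypothetical names, with `log_param` standing in for whatever tracking backend is used.

```python
from typing import Any, Callable, List


def scale(values: List[float], factor: float) -> List[float]:
    """Reusable node function: contains no tracking code at all."""
    return [v * factor for v in values]


def run_tracked(
    func: Callable[..., Any],
    log_param: Callable[[str, Any], None],
    **kwargs: Any,
) -> Any:
    """Hook-like wrapper: records scalar inputs, then runs the function."""
    for name, value in kwargs.items():
        # Only simple scalar values are treated as parameters here.
        if isinstance(value, (int, float, str, bool)):
            log_param(name, value)
    return func(**kwargs)
```

A possible usage, with a plain dictionary as the tracking backend: `logged = {}` followed by `run_tracked(scale, logged.__setitem__, values=[1.0, 2.0], factor=3.0)`. The node function can be reused anywhere, and the tracking behaviour can be changed in one place.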
`kedro-mlflow` does not currently provide an interface to set tags outside a Kedro `Pipeline`. Some of the above choices are design decisions that remain open to debate (for instance, metrics are often updated in a loop during each epoch / training iteration, and it does not always make sense to register the metric between computation steps, e.g. as an I/O operation after a node run).
Note
You do not need any `MLProject` file to use mlflow inside your Kedro project. As seen in the introduction, this file overlaps with Kedro configuration files.