How to use kedro-mlflow in a notebook
You need to install ipython to access notebook functionalities.
Reminder on mlflow’s limitations with interactive use
Data science project lifecycles are very iterative. Mlflow intends to track parameter changes to improve reproducibility. However, one must be aware that being able to execute functions outside of an end-to-end pipeline puts a strong burden on the user's shoulders: they are in charge of keeping the code execution coherent by running the notebook cells in the right order. Any back and forth during execution to change some parameters in a previous notebook cell and then retrain a model creates an operational risk that the parameter recorded in mlflow differs from the parameter actually used to train the model.
To make a long story short: forget about efficient reproducibility when using mlflow interactively.
It may still be useful to track some experiment results, especially if they take long to run and vary wildly with parameters, e.g. if you are performing hyperparameter tuning.
These limitations are inherent to the data science process, not to mlflow itself or the plugin.
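One way to narrow the gap described above is to record the parameters and train the model in a single call, so that a cell can never train with values other than the ones it records. The following is only a minimal sketch under stated assumptions: tracked_fit, toy_train and the in-memory recorded store are hypothetical names introduced for illustration, and the two logging callbacks stand in for the tracking calls you would use in a real notebook.

```python
from typing import Any, Callable, Dict

Params = Dict[str, Any]
Metrics = Dict[str, float]


def tracked_fit(
    params: Params,
    train_fn: Callable[[Params], Metrics],
    log_params: Callable[[Params], None],
    log_metrics: Callable[[Metrics], None],
) -> Metrics:
    """Record `params`, train with exactly those values, then record the metrics."""
    frozen = dict(params)  # copy, so edits made in another cell cannot leak in
    log_params(frozen)
    metrics = train_fn(frozen)
    log_metrics(metrics)
    return metrics


# Stand-ins for illustration only (hypothetical): an in-memory "tracking store"
# and a toy training function that returns a single metric.
recorded: Dict[str, Any] = {}


def toy_train(p: Params) -> Metrics:
    return {"score": 1.0 / (1.0 + p["learning_rate"])}


result = tracked_fit(
    params={"learning_rate": 0.25},
    train_fn=toy_train,
    log_params=lambda p: recorded.update(params=p),
    log_metrics=lambda m: recorded.update(metrics=m),
)
```

In a notebook, you could pass mlflow.log_params and mlflow.log_metrics as the two callbacks and make the call inside a `with mlflow.start_run():` block. This does not remove the limitation, it only makes one common source of mismatch harder to hit.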
Setup mlflow configuration in your notebook
Open your notebook / ipython session with the Kedro CLI:
kedro jupyter notebook
Or if you are on JupyterLab,
Kedro creates a bunch of global variables, including a context and a catalog, which are automatically accessible.
When the context was created, the plugin loaded and set up (created the tracking URI, exported credentials…) the mlflow configuration of your mlflow.yml, and mlflow is now accessible in your notebook.
If you change your mlflow.yml, reload the kedro extension for the changes to take effect.
Difference with running through the CLI
- Dataset save methods work as usual. You can call catalog.save("my_artifact_dataset", data) inside a cell, and your data will be logged in mlflow properly (assuming “my_artifact_dataset” is declared as a kedro-mlflow artifact dataset in your catalog).
- The hooks which automatically save all parameters/metrics/artifacts in mlflow will work if you run the session interactively, e.g.:

    session.run(
        pipeline_name="my_ml_pipeline",
        tags=["training"],
        from_inputs=["data_2"],
        to_outputs=["data_7"],
    )

  but it is not very likely in a notebook.
- If you need to interact manually with the mlflow server, you can use
Guidelines and best practices suggestions
During the experimentation phase, you will likely not run entire pipelines (or sub-pipelines filtered between some inputs and outputs). Hence, you cannot benefit from Kedro’s hooks (and hence from kedro-mlflow tracking). From this moment on, perfect reproducibility is impossible to achieve: there is no chance that you will manage to maintain a perfectly linear workflow, as you will go back and forth modifying parameters and code to create your model.
I suggest that you:
- focus on versioning parameters and metrics. The goal is to fine-tune your hyperparameters and to be able to remember the best setup later. It is not very important at this stage to version all parameters (e.g. preprocessing ones) nor models (after all, you will need an entire pipeline to predict, and it is very unlikely that you will ever need to reuse these experiment models). It may be interesting to use the mlflow.autolog() feature to get an easy basic setup.
- transition quickly to kedro pipelines. For instance, when your preprocessing is roughly defined, try to put it in kedro pipelines. You can then use notebooks to experiment / perform hyperparameter tuning while keeping preprocessing “fixed” to enhance reproducibility. You can run this pipeline interactively with:

    res = session.run(
        pipeline_name="my_preprocessing_pipeline",
        tags=["training"],
        from_inputs=["data_2"],
        to_outputs=["data_7"],
    )

  res is a Python dict with the outputs of your pipeline (e.g. a “preprocessed_data” pandas.DataFrame), and you can use it interactively in your notebook.
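The first suggestion above (version parameters and metrics so the best setup can be found again later) can be sketched as a small tuning loop. This is an illustrative sketch, not part of kedro-mlflow: grid, tune, toy_evaluate and the trials list are hypothetical names, the search space is made up, and log_trial is a placeholder for whatever tracking call you use.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence, Tuple

Params = Dict[str, Any]


def grid(space: Dict[str, Sequence[Any]]) -> List[Params]:
    """Expand a search space into the list of all parameter combinations."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def tune(
    space: Dict[str, Sequence[Any]],
    evaluate: Callable[[Params], float],
    log_trial: Callable[[Params, float], None],
) -> Tuple[Params, float]:
    """Evaluate every combination, log each trial, and return the best one."""
    best: Tuple[Params, float] = ({}, float("-inf"))
    for params in grid(space):
        score = evaluate(params)
        log_trial(params, score)  # parameters and metric are recorded together
        if score > best[1]:
            best = (params, score)
    return best


# Stand-ins for illustration only (hypothetical): a toy objective that is
# highest at max_depth=4 and learning_rate=0.1, and an in-memory trial log.
trials: List[Tuple[Params, float]] = []


def toy_evaluate(p: Params) -> float:
    return -((p["max_depth"] - 4) ** 2) - abs(p["learning_rate"] - 0.1)


best_params, best_score = tune(
    {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1]},
    toy_evaluate,
    lambda p, s: trials.append((p, s)),
)
```

In a notebook, log_trial could open a nested run per trial, for example with mlflow.start_run(nested=True), and call mlflow.log_params and mlflow.log_metric inside it, while evaluate would train on the preprocessed data returned by the pipeline.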