# Versioning Kedro DataSets
## What is artifact tracking?
Mlflow defines artifacts as “any data a user may want to track during code execution”. This includes, but is not limited to:

- data needed for the model (e.g. encoders, vectorizers, the machine learning model itself…)
- graphs (e.g. ROC or PR curve, variable importances, margins, confusion matrix…)
Artifacts are a very flexible and convenient way to “bind” any data type to your code execution. Mlflow has a two-step process for such binding:

1. persist the data locally in the desired file format
2. upload the data to the artifact store
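For illustration, here is what those two steps look like with the plain mlflow API (a minimal sketch; the local path and the data are placeholders):

```python
import mlflow
import pandas as pd

data = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

with mlflow.start_run():
    # step 1: persist the data locally in the desired file format
    data.to_csv("/tmp/file.csv", index=False)
    # step 2: upload the local file to the artifact store of the active run
    mlflow.log_artifact("/tmp/file.csv")
```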
## How to version data in a kedro project?
`kedro-mlflow` introduces a new `AbstractDataset` called `MlflowArtifactDataset`. It is a wrapper for any `AbstractDataset` which decorates the underlying dataset's `save` method and automatically logs the file to mlflow as an artifact each time the `save` method is called.

Since it is an `AbstractDataset`, it can be used with the YAML API. Assume that you have the following entry in the `catalog.yml`:
```yaml
my_dataset_to_version:
  type: pandas.CSVDataset
  filepath: /path/to/a/destination/file.csv
```
You can change it to:
```yaml
my_dataset_to_version:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.CSVDataset  # or any valid kedro DataSet
    filepath: /path/to/a/LOCAL/destination/file.csv  # must be a local file, wherever you want to log the data in the end
```
and this dataset will be automatically versioned in each pipeline execution.
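To see the wrapper in action, here is a hedged sketch of a pipeline whose node output is bound to the catalog entry above (the function and pipeline names are made up for illustration). Each `kedro run` that executes this node calls `save` on the dataset, which writes the local csv and then logs it to the active mlflow run:

```python
import pandas as pd
from kedro.pipeline import node, pipeline


def make_report() -> pd.DataFrame:
    # any function whose output is mapped to the catalog entry above
    return pd.DataFrame({"a": [1, 2], "b": [3, 4]})


# saving "my_dataset_to_version" goes through MlflowArtifactDataset:
# the csv is persisted locally, then logged as an artifact in mlflow
report_pipeline = pipeline(
    [node(make_report, inputs=None, outputs="my_dataset_to_version")]
)
```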
## Frequently asked questions
### Can I pass extra parameters to the `MlflowArtifactDataset` for finer control?
The `MlflowArtifactDataset` takes a `dataset` argument, which is a python dictionary passed to the `__init__` method of the dataset declared in `type`. This means that you can pass any argument accepted by the underlying dataset in this dictionary. If you want to pass `load_args` and `save_args` in the previous example, add them in the `dataset` argument:
```yaml
my_dataset_to_version:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.CSVDataset  # or any valid kedro DataSet
    filepath: /path/to/a/local/destination/file.csv
    load_args:
      sep: ;
    save_args:
      sep: ;
    # ... any other valid arguments for the dataset
```
### Can I use the `MlflowArtifactDataset` in interactive mode?
Like all Kedro `AbstractDataset`s, the `MlflowArtifactDataset` is callable in the python API:
```python
import pandas as pd

from kedro_datasets.pandas import CSVDataset
from kedro_mlflow.io.artifacts import MlflowArtifactDataset

csv_dataset = MlflowArtifactDataset(
    dataset={
        "type": CSVDataset,  # either a string "pandas.CSVDataset" or the class
        "filepath": r"/path/to/a/local/destination/file.csv",
    }
)
csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
```
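Note that `save` both writes the local file and logs it in the current mlflow run; since no `run_id` is passed here, `load` simply delegates to the wrapped `CSVDataset` and reads the local file back as a `pandas.DataFrame`.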
### How do I upload an artifact to a non-local destination (e.g. S3 or blob storage)?
The location where the artifact will be stored does not depend on the logging function, but rather on the artifact store specified when configuring the mlflow server. Read the mlflow documentation to see how to configure an artifact store with cloud storage.
Setting the `mlflow_tracking_uri` key of `mlflow.yml` to the url of this server is the only additional configuration you need to send your datasets to this remote server. Note that you still need to specify a local path for the underlying dataset; mlflow will take care of the upload to the server by itself.

You can refer to this issue for further details.
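For illustration, a minimal sketch of the relevant entry in `mlflow.yml` (the url is a placeholder for your own tracking server, and the exact nesting of the key may differ between kedro-mlflow versions):

```yaml
server:
  mlflow_tracking_uri: http://my-mlflow-server.example.com:5000  # placeholder url
```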
### Can I log an artifact in a specific run?
The `MlflowArtifactDataset` has an extra attribute `run_id` which specifies the run you will log the artifact in. Be cautious, because this argument will take precedence over the current run when you call `kedro run`, causing the artifact to be logged in another run than all the other data of the run.
```yaml
my_dataset_to_version:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.CSVDataset  # or any valid kedro DataSet
    filepath: /path/to/a/local/destination/file.csv  # must be a local filepath, no matter what your actual mlflow storage is (S3 or other)
  run_id: 13245678910111213  # a valid mlflow run to log in. If None, defaults to the active run
```
### Can I reload an artifact from an existing run to use it in another run?
You may want to reuse the artifact of a previous run in another one, e.g. to continue training from a pretrained model, or to select the best model among several runs created during hyperparameter tuning. The `MlflowArtifactDataset` has an extra attribute `run_id` you can use to specify the run you will load the artifact from. Be cautious, because:

- this argument will take precedence over the current run when you call `kedro run`, causing the artifact to be loaded from another run than all the other data of the run
- the artifact will be downloaded and will overwrite the existing file at your local filepath
```yaml
my_dataset_to_reload:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.CSVDataset  # or any valid kedro DataSet
    filepath: /path/to/a/local/destination/file.csv  # must be a local filepath, no matter what your actual mlflow storage is (S3 or other)
  run_id: 13245678910111213  # a valid mlflow run with the existing artifact. It must be named "file.csv"
```
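The same works in interactive mode; a hedged sketch, assuming the constructor accepts `run_id` as a keyword argument (the run id is a placeholder, and the file at the local filepath will be overwritten by the download):

```python
from kedro_mlflow.io.artifacts import MlflowArtifactDataset

pretrained = MlflowArtifactDataset(
    dataset={
        "type": "pandas.CSVDataset",
        "filepath": "/path/to/a/local/destination/file.csv",  # overwritten by the download
    },
    run_id="13245678910111213",  # placeholder: the run containing the existing artifact
)
df = pretrained.load()  # downloads the artifact from that run, then reads the local file
```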
### Can I create a remote folder/subfolder architecture to organize the artifacts?
The `MlflowArtifactDataset` has an extra argument `artifact_path` which specifies a remote subfolder where the artifact will be logged. It must be a relative path.

With the example below, the artifact will be logged in mlflow within a `reporting` folder.
```yaml
my_dataset_to_version:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.CSVDataset  # or any valid kedro DataSet
    filepath: /path/to/a/local/destination/file.csv
  artifact_path: reporting  # relative path where the remote artifact must be stored. If None, it is saved in the root folder.
```
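Since mlflow's `log_artifact` accepts nested relative paths, a value such as `reporting/figures` should create the corresponding subfolder hierarchy on the artifact store (an assumption worth verifying against your mlflow version).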