Register a pipeline to mlflow with KedroPipelineModel
kedro-mlflow has a KedroPipelineModel class (which inherits from mlflow.pyfunc.PythonModel) that can turn any Kedro Pipeline object into a custom Mlflow Model.

To convert a Pipeline into a mlflow model, create a KedroPipelineModel and then log it to mlflow. An example is given in the snippet below:
import mlflow
from pathlib import Path

from kedro.framework.project import pipelines
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro_mlflow.mlflow import KedroPipelineModel
from mlflow.models import infer_signature

bootstrap_project(Path(r"<path/to/project>"))
session = KedroSession.create(project_path=r"<path/to/project>")
context = session.load_context()  # this sets up the mlflow configuration
catalog = context.catalog

# "pipeline" is the Pipeline object you want to convert to a mlflow model
pipeline = pipelines["<my-pipeline>"]
input_name = "instances"

# (optional) get the schema of the input dataset to build the model signature
input_data = catalog.load(input_name)
model_signature = infer_signature(model_input=input_data)

# you can optionally pass other arguments, like the "copy_mode" to be used for each dataset
kedro_pipeline_model = KedroPipelineModel(
    pipeline=pipeline, catalog=catalog, input_name=input_name
)

# artifacts are all the inputs of the inference pipeline that are persisted in the catalog
artifacts = kedro_pipeline_model.extract_pipeline_artifacts()

mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=kedro_pipeline_model,
    artifacts=artifacts,
    conda_env={"python": "3.10.0", "dependencies": ["kedro==0.18.11"]},
    signature=model_signature,
)
Note that you need to provide the log_model function with several pieces of information that are not trivial to retrieve: the conda environment, the "artifacts" (i.e. the persisted data you need to reuse, like tokenizers / ml models / encoders), and the model signature (i.e. the column names and types). The KedroPipelineModel object has methods like extract_pipeline_artifacts to help you, but it still needs some work on your side.
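Once logged, the model behaves like any other mlflow pyfunc model. As a minimal sketch (continuing from the snippet above, and assuming you substitute the id of the run in which the model was logged), it can be reloaded and used for inference:

# "<run-id>" is the id of the mlflow run in which the model was logged above
loaded_model = mlflow.pyfunc.load_model(model_uri="runs:/<run-id>/model")

# the input must have the same format as the "instances" dataset;
# here we simply reuse the data loaded from the catalog earlier
predictions = loaded_model.predict(input_data)

The same model URI can also be passed to mlflow's standard serving tools, which is what makes pipeline serving possible.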
Saving Kedro pipelines as Mlflow Model objects is convenient and enables pipeline serving. However, it does not solve the disconnect between training and inference: each time you trigger a training pipeline, you must remember to log the resulting model immediately afterwards.
kedro-mlflow offers a convenient API to simplify this workflow, as described in the following sections.