Logging Additional Data From a Kedro Node

Question

Hello, guys. I noticed that kedro-mlflow does not support the log_table method, so I wonder what the right way is to log additional data from a node when the plugin does not support it yet.

Right now I just do something like this at the end of the node function:

mlflow.log_table(data_for_table, output_filename)

But I am concerned, because I am not sure that it will always work and always log the data to the correct run: I was not able to retrieve the active run id from inside the node with mlflow.active_run() (it returns None all the time).

I need this because I want to use the Evaluation tab in the UI to manually compare some outputs of different runs.
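One way to make the concern above explicit is to check for an active run before logging. The following is only a sketch: log_table_to_active_run and its tracker parameter are hypothetical names, and it assumes an MLflow version that provides mlflow.log_table. As far as I know, MLflow's fluent API starts a new run when none is active, so the guard raises instead of silently logging elsewhere.

```python
from typing import Any, Optional


def log_table_to_active_run(
    table: Any, artifact_file: str, tracker: Optional[Any] = None
) -> str:
    """Log `table` to the currently active MLflow run and return its run id.

    `tracker` is any object exposing `active_run()` and `log_table()`.
    By default the `mlflow` module itself is used.
    """
    if tracker is None:
        import mlflow  # imported lazily so a stand-in can be passed in tests

        tracker = mlflow

    run = tracker.active_run()
    if run is None:
        # Assumption: failing loudly is preferable to logging into a run
        # that was not started by the pipeline.
        raise RuntimeError("No active MLflow run; refusing to log the table.")

    tracker.log_table(table, artifact_file)
    return run.info.run_id
```

Inside a node this would be called as log_table_to_active_run(data_for_table, output_filename). If mlflow.active_run() really returns None in the node, the call raises, which at least surfaces the problem early.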

Yolan Honoré-Rougé · Answer

You can just return your table at the end of the node, and use a MlflowArtifactDataset combined with a CSVDataset in your catalog.

Yolan Honoré-Rougé · Answer

It will be logged automatically

Yolan Honoré-Rougé · Answer

https://kedro-mlflow.readthedocs.io/en/stable/source/03_experiment_tracking/01_experiment_tracking/03_version_datasets.html#how-to-track-data-in-a-kedro-project

qwerty · Answer

It won't work. I mean, it will log the artifact for sure, but it will not be accessible in the Evaluation tab.

As far as I understand, it has to be logged via the mlflow.log_table method to appear among the datasets available in the Evaluation tab.

Yolan Honoré-Rougé · Answer

Even if you use a JSON Dataset instead of a CSV one?

Yolan Honoré-Rougé · Answer

But you are right, there is no support for log_table right now. Please open an issue in the repo and I'll try to add it: https://github.com/Galileo-Galilei/kedro-mlflow

qwerty · Answer

"Even if you use a JSON Dataset instead of a CSV one?"

That's a fair question. I tried to use pandas.JSONDataset (because my data is in a DataFrame) with MlflowArtifactDataset, and it produced some stringified JSON as a result, so it was not available in the Evaluation tab either. Could you recommend which of the JSON datasets to try?

qwerty · Answer

"Please open an issue in the repo and I'll try to add it"

Sure, no problem.

qwerty · Answer

Actually, maybe the JSON wasn't stringified. It might have had a different format, because MLflow uses something like:

{
  "columns": list[column_names],
  "data": list[list[values]]
}

whereas pandas converts a DataFrame into this format:

{
  "[column_name]": list[values]
}

I can't check right now, but I'm almost sure this was the problem. So, one way to address it would be to manually convert a DataFrame into MLflow's JSON format and then save it as you advised.

Yolan Honoré-Rougé · Answer

I think it should. If I remember correctly, there is a df.to_json(orient=...) argument to specify how the conversion should be done.
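The conversion discussed above can be sketched with the standard library alone. This is an illustration under assumptions: to_split_format is a hypothetical helper, and the input is taken to be a plain dict mapping each column name to a list of values, as in the second format described earlier.

```python
import json
from typing import Any, Dict, List


def to_split_format(columns_to_values: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Turn {"col": [values]} into {"columns": [...], "data": [[row], ...]}."""
    names = list(columns_to_values)
    lengths = {len(values) for values in columns_to_values.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same number of values.")

    # zip(*columns) transposes column lists into row lists.
    rows = [list(row) for row in zip(*(columns_to_values[n] for n in names))]
    return {"columns": names, "data": rows}


table = to_split_format({"question": ["q1", "q2"], "answer": ["a1", "a2"]})
print(json.dumps(table, indent=2))
```

With pandas, df.to_json(orient="split", index=False) should produce the same columns/data layout. Note that this only reproduces the file layout; whether the Evaluation tab then picks up the artifact is a separate question.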

qwerty · Answer

Yeah, this might work. Let me check

qwerty · Answer

Nope, it doesn't work 😔 I will create a feature request in the repo

qwerty · Answer

@Yolan Honoré-Rougé FYI, I've created a feature request: https://github.com/Galileo-Galilei/kedro-mlflow/issues/634

Philipp Dahlke · Answer

How about using a hook with the mlflow library? That's what I do at the moment.
You will have access to the current run, which the MlflowHook from kedro-mlflow instantiates, via mlflow.active_run(), and you can retrieve the node outputs with the after_node_run method that Kedro provides.
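A minimal sketch of the hook approach described above. LogTableHook, table_outputs, and the log_table parameter are hypothetical names chosen for illustration; the fallback definition of hook_impl is there only so the sketch can be read and run without Kedro installed. It assumes an MLflow version that provides mlflow.log_table and that the MLflow run is already active when the node finishes.

```python
from typing import Any, Callable, Dict, Iterable, Optional

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback for illustration only: a no-op decorator when Kedro is absent.
    def hook_impl(func):
        return func


class LogTableHook:
    """Log selected node outputs as MLflow tables after each node runs."""

    def __init__(
        self,
        table_outputs: Iterable[str],
        log_table: Optional[Callable[[Any, str], Any]] = None,
    ) -> None:
        # Names of the node outputs that should be logged as tables.
        self._table_outputs = set(table_outputs)
        # Optional stand-in for mlflow.log_table, useful for testing.
        self._log_table = log_table

    @hook_impl
    def after_node_run(self, node: Any, outputs: Dict[str, Any]) -> None:
        for name, data in outputs.items():
            if name not in self._table_outputs:
                continue
            log_table = self._log_table
            if log_table is None:
                import mlflow  # imported lazily, only when a table is logged

                log_table = mlflow.log_table
            # Each table is stored under "<output name>.json" in the run.
            log_table(data, f"{name}.json")
```

To try it in a project, the hook would be registered in settings.py, for example HOOKS = (LogTableHook(["evaluation_table"]),), where evaluation_table is a hypothetical output name.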

qwerty · Answer

@Philipp Dahlke Yeah, thank you, I think it makes sense too.

I just thought that it is probably overkill for now, since the simple call to mlflow.log_table does the trick.

I just don't like this as a long-term solution. So if I run into problems with it and the plugin has not been updated by then, I will probably use a Hook or some other workaround.
