
Updated 4 days ago

Logging Additional Data From a Kedro Node

Hello, guys. I noticed that there is no support for the log_table method in kedro-mlflow, so I wonder what the right way is to log additional data from a node when the plugin does not support it yet?

Right now I just do something like this at the end of the node function:

mlflow.log_table(data_for_table, output_filename)
But I am concerned, as I am not sure it will always work and always log the data to the correct run: I was not able to retrieve the active run ID from inside the node with mlflow.active_run() (it returns None every time).

I need this because I want to use the Evaluation tab in the UI to manually compare some outputs of different runs.

15 comments

You can just return your table at the end of the node and use a MlflowArtifactDataset combined with a CSVDataset in your catalog.

It will be logged automatically

It won't work. I mean, it will log the artifact for sure, but it will not be accessible in the Evaluation tab.

As far as I understand, it has to be logged via the mlflow.log_table method to appear among the datasets available in the Evaluation tab.

Even if you use a JSON Dataset instead of a CSV one?

But you are right, there is no support for log_table right now. Please open an issue in the repo and I'll try to add it: https://github.com/Galileo-Galilei/kedro-mlflow

Even if you use a JSON Dataset instead of a CSV one?

That's a fair question. I tried to use pandas.JSONDataset (because I have the data in a DataFrame) with MlflowArtifactDataset, and it produced some stringified JSON as a result, so it was not available in the Evaluation tab either. Could you recommend which of the JSON datasets to try?

Please open an issue in the repo and I'll try to add it
Sure, no problem

Actually, maybe the JSON wasn't stringified. It might have had a different format because MLflow uses something like:

{
  "columns": list[column_names], 
  "data": list[list[values]]
}
whereas pandas converts a DataFrame into this format:
{
  "[column_name]": list[values]
}
I can't check right now, but I'm almost sure this was the problem. So, one way to address it would be to manually convert a DataFrame into MLflow’s JSON format and then save it as you advised.

I think it should. If I remember correctly, df.to_json has an orient argument, df.to_json(orient=...), to specify how the conversion should be done.
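
A small sketch of what that argument does, assuming pandas is installed. The DataFrame contents are made up for illustration. It only shows the shape of the JSON; it does not guarantee that the resulting artifact is picked up by the Evaluation tab:

```python
import json

import pandas as pd

# Hypothetical example table; the column names are illustrative only.
df = pd.DataFrame(
    {"prompt": ["2+2?", "Capital of France?"], "answer": ["4", "Paris"]}
)

# Default orientation ("columns"): one key per column, values keyed by row index.
default_payload = json.loads(df.to_json())

# orient="split" with index=False keeps only the "columns" and "data" keys,
# which is the layout described above for MLflow tables.
split_payload = json.loads(df.to_json(orient="split", index=False))

print(json.dumps(split_payload, indent=2))
```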

Yeah, this might work. Let me check

Nope, it doesn't work 😔

I will create a feature request in the repo

@Yolan Honoré-Rougé FYI, I've created a feature request https://github.com/Galileo-Galilei/kedro-mlflow/issues/634

How about using a hook with the mlflow library? That's what I do at the moment.
You will have access to the current run, which the MlflowHook from kedro-mlflow starts, via mlflow.active_run, and you can retrieve the node outputs with the after_node_run method Kedro provides.

@Philipp Dahlke Yeah, thank you, I think it makes sense too.

I just thought that it is probably overkill for now, since the simple call to mlflow.log_table does the trick.

I just don't like this as a long-term solution. So if, by the time I have problems with it, there is still no update in the plugin, I will probably use a Hook or some other workaround.
