Join the Kedro community

Hi, all. I have a question about how nodes/pipelines read their input datasets. Taking the catalog configuration at the link below as an example, I assume the Kedro pipeline reads the data from the CSV file stored in Amazon S3 when you specify inputs=["cars"] in the node configuration. If multiple different nodes take "cars" as an input dataset, does the pipeline reuse the dataset from memory, or does it read from Amazon S3 every time a node needs it?

https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-multiple-datasets-with-similar-configuration-using-yaml-anchors

And if it does re-read the same dataset from the data source every time a node runs, is it possible to keep the dataset in memory after the first read (from the Amazon S3 CSV file in this case) and reuse it from memory, so that you don't have to hit the data source multiple times and can shorten processing time?
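One thing worth checking here is Kedro's `CachedDataset`, which wraps another catalog dataset and keeps the loaded data in memory for subsequent loads within the same run. A minimal catalog sketch, assuming the bucket and path from the linked docs example (the `filepath` is illustrative):

```yaml
# Hypothetical catalog entry: CachedDataset wraps the underlying CSV
# dataset, so the S3 read happens once and later loads come from memory.
cars:
  type: CachedDataset
  dataset:
    type: pandas.CSVDataset
    filepath: s3://my_bucket/data/02_intermediate/company/cars.csv
```

Whether this fits depends on dataset size, since the cached copy lives in the worker's memory for the duration of the run.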

Hi everyone. Using hooks, I've managed to display the execution time of each node. However, I also want to know how long the whole process takes, from loading the data, through executing the node, to saving the output to the Databricks catalog.

So in the attached image, I want to know the time difference between “INFO Completed 1 out of tasks” and “INFO Loading data from ‘params: …”, not just the node execution time. I could of course work out the differences manually, but with hundreds of nodes that would take at least an hour, and it would be really helpful to see at a glance how long each task takes. Is there any way to do this? Can it also be done with hooks?

https://kedro-org.slack.com/archives/C03RKP2LW64/p1728353683266369
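Since the load happens before `before_node_run` fires, one hedged approach is to start the timer in `before_dataset_loaded` and stop it in `after_dataset_saved`, so the measured span covers load, execute, and save together. A sketch of the idea (method names follow Kedro's hook specs, but the `@hook_impl` decorator and `settings.py` registration are omitted so the logic stands alone; signatures are simplified with `*args, **kwargs`):

```python
import time


class TaskTimingHook:
    """Times each node from its first input load to its last output save."""

    def __init__(self):
        self._start = {}   # node name -> wall-clock start
        self.elapsed = {}  # node name -> total seconds including I/O

    def before_dataset_loaded(self, dataset_name, node, *args, **kwargs):
        # First load of any input starts the clock; setdefault keeps the
        # earliest timestamp when a node has several inputs.
        self._start.setdefault(node.name, time.monotonic())

    def before_node_run(self, node, *args, **kwargs):
        # Fallback start for nodes whose inputs never trigger a load.
        self._start.setdefault(node.name, time.monotonic())

    def after_node_run(self, node, *args, **kwargs):
        # Nodes with no outputs to save end here.
        self.elapsed[node.name] = time.monotonic() - self._start[node.name]

    def after_dataset_saved(self, dataset_name, data, node, *args, **kwargs):
        # Saving happens after after_node_run, so this overwrites the
        # elapsed time with the full load-run-save duration.
        self.elapsed[node.name] = time.monotonic() - self._start[node.name]
```

After the run you can dump `elapsed` (log it in `after_pipeline_run`, for instance) instead of subtracting timestamps by hand.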

2 comments

Hi all! In my organization, we run a Databricks Workflow daily from our GitHub repository source code, using a Kedro pipeline.

I’d like to know which nodes take the most time to process. What are the best practices for finding out how long each node takes to run in this scenario?
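A lightweight option is a hook that records wall-clock time per node and can rank the slowest ones afterwards. A minimal sketch (method names match Kedro's hook specs; the `@hook_impl` decorator, `settings.py` registration, and the `slowest` helper name are all illustrative here):

```python
import logging
import time

logger = logging.getLogger(__name__)


class NodeTimerHook:
    """Records per-node wall-clock duration and ranks the slowest nodes."""

    def __init__(self):
        self._start = {}
        self.timings = {}  # node name -> seconds

    def before_node_run(self, node, *args, **kwargs):
        self._start[node.name] = time.monotonic()

    def after_node_run(self, node, *args, **kwargs):
        took = time.monotonic() - self._start.pop(node.name)
        self.timings[node.name] = took
        logger.info("node %s took %.3fs", node.name, took)

    def slowest(self, n=10):
        # Top-n nodes by duration, for a quick look after the run.
        return sorted(self.timings.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Since the job runs on Databricks, the logged lines end up in the workflow run output, so the slowest nodes can be spotted without manual timestamp arithmetic.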

3 comments