
Hello! I am running into issues with the Kedro 0.19.11 release while running pipelines on Databricks. Specifically, I am hitting an error where a Python module imported by a node is unable to find the active SparkSession via SparkSession.getActiveSession() (see first image). Our pipeline consists entirely of ibis.TableDataset datasets and I/O with the pyspark backend. What is throwing me is that other nodes use the pyspark connection and perform operations across the Spark session just fine, but this single node fails because the imported module cannot find the Spark session. The issue is not present in Kedro 0.19.10. My best guess is that it has something to do with the updated code in kedro/runner/sequential_runner.py using ThreadPoolExecutor, and possibly a scoping issue? Apologies for the somewhat scattered explanation; there is quite a bit I don't fully understand here, so I appreciate any help or guidance. Let me know if I can provide additional info as well.
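
For context, a stripped-down version of what the imported module does (the function name is hypothetical, and the fallback in the comments is just my current guess at a workaround, since the active session seems to be tracked per thread):

from pyspark.sql import SparkSession

def enrich_table() -> None:
    # On 0.19.11 this returns None when the node runs on one of the runner's
    # worker threads, because the "active" session is tracked per thread.
    spark = SparkSession.getActiveSession()

    # Possible workaround: getOrCreate() hands back the existing session
    # regardless of which thread calls it.
    if spark is None:
        spark = SparkSession.builder.getOrCreate()

    spark.sql("SELECT 1").show()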

3 comments

Hey all! I'm working on tooling around running Kedro pipelines in our (pre-existing) Prefect deployment. I've been following the lead of the example in the docs and things were going pretty smoothly until I came to logging. Logging in Prefect is a little finicky, but what I'd like to do is route the Kedro logs through to the Prefect loggers and handlers. Happy to go into more detail about what I've tried, but figured I'd first ask if anyone has experience here. Is there some other way to expose Kedro logs in the Prefect UI (which is ultimately my goal)?
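
For reference, the direction I've been poking at (Prefect 2.x assumed; both the PREFECT_LOGGING_EXTRA_LOGGERS setting and the APILogHandler import are my assumptions rather than anything from the Kedro docs): either set PREFECT_LOGGING_EXTRA_LOGGERS=kedro so Prefect attaches its own handlers to Kedro's loggers, or attach the API handler manually inside the flow:

import logging

from prefect import flow
from prefect.logging.handlers import APILogHandler  # assumption: Prefect 2.x

@flow
def run_kedro_flow():
    # Send records from the "kedro" logger hierarchy to the Prefect API
    # so they show up in the Prefect UI for this flow run.
    logging.getLogger("kedro").addHandler(APILogHandler())
    # ...bootstrap_project() / KedroSession.create().run() as in the docs example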

12 comments

Hello guys! I noticed there is a typing-annotation bug in kedro-mlflow 0.14.3 specific to Python 3.9. It seems that a fix is already merged in the repo. When will the fix be released? Thanks!

4 comments

Hi all, a question about the roadmap: is there any plan to add support for LLM/agentic pipelines in Kedro? E.g. I think it would be really cool to represent an agentic graph (e.g. LangGraph) as a Kedro-Viz pipeline.

6 comments

when using the [Kedro]DataCatalog as a library, what's the best way of loading the parameters too?

in other words, what should I add to

catalog = DataCatalog.from_config(conf_loader["catalog"])

so that I can do catalog.load("params:model_size")?
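
For context, the kind of thing I'm imagining (a rough sketch mirroring what KedroContext appears to do with parameters; nested keys would need recursive flattening too, and add_feed_dict is my guess at the right entry point):

from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(conf_source="conf")
catalog = DataCatalog.from_config(conf_loader["catalog"])

# Expose parameters the way KedroContext does: the whole dict under
# "parameters", plus one "params:<key>" entry per top-level key.
parameters = conf_loader["parameters"]
feed_dict = {"parameters": parameters}
for key, value in parameters.items():
    feed_dict[f"params:{key}"] = value
catalog.add_feed_dict(feed_dict)

catalog.load("params:model_size")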

11 comments

Hi guys,

Trying to run kedro viz, and I am getting some strange errors like below:

(projx) ⋊> ~/P/projx on master ⨯ uv run --with kedro-viz kedro viz run                                                          14:29:10
   Built projx @ file:///home/ftopal/Projects/projx
Uninstalled 1 package in 0.68ms
Installed 1 package in 1ms
Installed 98 packages in 109ms
[02/18/25 14:30:28] INFO     Using 'conf/logging.yml' as logging configuration. You can change this by setting the       __init__.py:270
                             KEDRO_LOGGING_CONFIG environment variable accordingly.                                                     
WARNING: Experiment Tracking on Kedro-viz will be deprecated in Kedro-Viz 11.0.0. Please refer to the Kedro documentation for migration guidance.
INFO: Running Kedro-Viz without hooks. Try `kedro viz run --include-hooks` to include hook functionality.
Starting Kedro Viz ...
[02/18/25 14:30:31] INFO     Using 'conf/logging.yml' as logging configuration. You can change this by setting the       __init__.py:270
                             KEDRO_LOGGING_CONFIG environment variable accordingly.                                                     
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/ftopal/.local/share/uv/python/cpython-3.10.16-linux-x86_64-gnu/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ftopal/.local/share/uv/python/cpython-3.10.16-linux-x86_64-gnu/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/server.py", line 121, in run_server
    load_and_populate_data(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/server.py", line 70, in load_and_populate_data
    populate_data(data_access_manager, catalog, pipelines, session_store, stats_dict)
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/server.py", line 44, in populate_data
    data_access_manager.add_pipelines(pipelines)
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/data_access/managers.py", line 124, in add_pipelines
    self.add_pipeline(registered_pipeline_id, pipeline)
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/data_access/managers.py", line 180, in add_pipeline
    input_node = self.add_node_input(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/data_access/managers.py", line 259, in add_node_input
    graph_node = self.add_dataset(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/data_access/managers.py", line 371, in add_dataset
    graph_node = GraphNode.create_data_node(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/models/flowchart/nodes.py", line 140, in create_data_node
    return DataNode(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 2 validation errors for DataNode
kedro_obj.is-instance[Node]
  Input should be an instance of Node [type=is_instance_of, input_value=[projx.models.llm.LLM(bac.../logs'), _logging=True)], input_type=list]
    For further information visit https://errors.pydantic.dev/2.10/v/is_instance_of
kedro_obj.is-instance[AbstractDataset]
  Input should be an instance of AbstractDataset [type=is_instance_of, input_value=[projx.models.llm.LLM(bac.../logs'), _logging=True)], input_type=list]
    For further information visit https://errors.pydantic.dev/2.10/v/is_instance_of

Any idea why I am getting this now? It's complaining about some input values, but everything works fine with kedro run.

6 comments

Hi everyone! I'm having trouble using tensorflow.TensorFlowModelDataset with an S3 bucket. The model saves fine locally, but when I configure it to save/load directly from S3, it doesn't work.
Some key points:

  • Credentials are fine – I can load other datasets (preprocessing outputs and split data) from S3 without issues.
  • Uploading manually works – If I explicitly upload the model file using boto3 or another script, I can access it in S3 just fine.
  • Had issues with .h5 models – Initially, I could retrieve .h5 files from S3 but loading was not working properly, so I switched to the .keras format, which works fine when handling files manually.
Has anyone successfully used tensorflow.TensorFlowModelDataset with S3? Is there a recommended workaround or configuration to get it working? Any insights would be much appreciated!

To make it clearer: I am only having problems when the node output (the model) points to S3. I am getting access denied even after checking credentials and IAM policies and testing with a manual script.
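
For reference, the smallest repro I can think of outside the pipeline (the bucket/path are made up and the credentials dict is just what my catalog credentials entry resolves to). If this direct call also gets access denied, the issue is in how the dataset writes to S3 rather than in my node:

import tensorflow as tf
from kedro_datasets.tensorflow import TensorFlowModelDataset

dataset = TensorFlowModelDataset(
    filepath="s3://my-bucket/models/model.keras",  # hypothetical location
    credentials={"key": "...", "secret": "..."},
)

# Tiny stand-in model, just to exercise save/load through the dataset.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
dataset.save(model)
reloaded = dataset.load()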

8 comments

Guys, I would like to know if any of you work with Vertex AI Pipelines and how you handle MLOps...

Is there a way to export logs without the Rich markup syntax? Rich works perfectly fine at the terminal; the problem is I don't need it when I am not using a terminal (i.e. exporting to a different application, a log store, etc.).

GH: https://github.com/kedro-org/kedro/issues/4487
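
For context, the workaround I'm testing in the meantime (plain Python logging, nothing Kedro-specific; run after Kedro has configured logging). The cleaner fix would presumably be a logging config without the Rich handler, but this shows the intent:

import logging

# Swap Rich's handler for a plain StreamHandler so exported/captured logs
# contain no Rich markup or ANSI styling.
root = logging.getLogger()
root.handlers = [h for h in root.handlers if type(h).__name__ != "RichHandler"]

plain = logging.StreamHandler()
plain.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
root.addHandler(plain)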

11 comments

Also, am I right that there is no way, via the default CLI, to run pipelines only if they include all of the tags listed in the run configuration?
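
If not, the fallback I'd use in the pipeline registry is something like this sketch (filtering on the intersection myself, since tag filtering in kedro run --tags seems to match any of the given tags rather than all of them):

from kedro.pipeline import Pipeline

def with_all_tags(pipeline: Pipeline, *tags: str) -> Pipeline:
    # Keep only the nodes carrying every one of the given tags (AND),
    # rather than any of them (OR).
    return Pipeline([node for node in pipeline.nodes if set(tags) <= node.tags])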

3 comments

Guys, could someone help with using KedroContext properly?

I want to add a --only-missing CLI parameter to kedro run so that it runs pipelines using the run_only_missing method. From what I understand, adding this parameter to the default CLI was rejected because it can be implemented via KedroContext customization.

However, I’m not sure how to do this correctly. Or maybe I am missing something 😔
Could someone share an example or a code snippet? I don't see the usage of this class in the docs (e.g. here or here).

3 comments

Hey guys, I'm having trouble appending to a CSV with the DataCatalog. My node returns a DataFrame with one row and multiple metric names as columns. It writes results.csv to the folder accordingly, but it doesn't append the rows. In addition, a blank row is created after the first row (might that indicate the flaw?). When I debug step by step, both DataFrames get written to the CSV but overwrite each other.
Metric | Seed
--------|-------
1.0 | 42

results.update(
        {
            "seed": seed,
        }
    )
return pd.DataFrame.from_dict([results])

My catalog has the save_args mode set to "a":
"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.

7 comments

Hello guys, I noticed that there is no support for the log_table method in kedro-mlflow. So I wonder: what would be the right way to log additional data from a node, something that is not yet supported by the plugin?

Right now I just do something like this at the end of the node function:

mlflow.log_table(data_for_table, output_filename)
But I am concerned, as I am not sure it will always work and always log the data to the correct run, because I was not able to retrieve the active run id from inside the node with mlflow.active_run() (it returns None all the time).

I need this because I want to use the Evaluation tab in the UI to manually compare some outputs of different runs.
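
In case it clarifies what I'm after, this is the fallback shape I'm considering (a sketch; it assumes I can get hold of the run id some other way, e.g. capture it in a hook or pass it in as a parameter, and that MlflowClient.log_table is available in my MLflow version):

from mlflow.tracking import MlflowClient

def log_table_to_run(run_id: str, data, artifact_file: str) -> None:
    # Log through the client with an explicit run_id, so it does not depend on
    # mlflow.active_run() resolving correctly inside the node.
    MlflowClient().log_table(run_id=run_id, data=data, artifact_file=artifact_file)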

15 comments

Guys, is this the right place to ask about kedro-mlflow plugin?

3 comments

Hello guys, I am just starting to learn about Kedro and noticed that micro-packaging is being deprecated. Could someone please suggest any alternatives to that feature?

5 comments

Good morning! We're looking for best practices to handle data quality issues within Kedro. Specifically:

1. We need to implement both manual and automated data curation
2. Ideally want to keep as much as possible within the Kedro pipeline structure
3. The current challenge is how to apply and track incoming data correction requests

Has anyone implemented something similar? Looking for patterns/approaches that worked well.

3 comments

Morning! Just wondering how things work with regards to submitting bug fixes? I've read the contribution guidelines, and I have an open issue for the kedro-airflow plugin. Can I just create a fix branch and open a PR?

3 comments

Hi guys,

I am having trouble running my Kedro project from a Docker build. I'm using MLflow and the kedro_mlflow.io.artifacts.MlflowArtifactDataset.

I followed the instructions for building the container from the kedro-docker repo, but at runtime those artifacts try to access my local Windows path instead of the container's path. Do you know what additional settings I need to make? All my settings are pretty much vanilla. The mlflow_tracking_uri is set to null.

"{dataset}.team_lexicon":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset  
  dataset:
    type: pandas.ParquetDataset  
    filepath: data/03_primary/{dataset}/team_lexicon.pq 
    metadata:
      kedro-viz:
        layer: primary  
        preview_args:
            nrows: 5 

Traceback (most recent call last):
  
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowParquetDataset(filepath=/home/kedro_docker/data/03_primary/D1-24-25/team_lexicon.pq, load_args={}, protocol=file, save_args={}).
[Errno 13] Permission denied: '/C:'

3 comments

Hey,
I'm using databricks.yml in conf to generate the YAML for deploying to a Databricks workflow, with the command kedro databricks bundle. Let's say I have something like:

<my_job>:
  schedule:
    # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
This works fine and the file is generated correctly.
Similarly, if I'm doing something like:
<my_job>:
  schedule:
    # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
  tasks:
    - task_key: default
      run_if: AT_LEAST_ONE_SUCCESS
Every task in my_job then has the run_if condition. However, I only want a specific task to get this run_if condition:
<my_job>:
  schedule:
    # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
  tasks:
    - task_key: <my_task>
      run_if: AT_LEAST_ONE_SUCCES
But this is not correctly converted into the resource file for this job.

Do you have any idea how I can solve this? Cheers!

3 comments

Hello! :kedro:

I am on Kedro 0.18.14 using a custom config loader based on TemplatedConfigLoader. Is there a way to access globals defined in globals.yml in Kedro nodes?

Good morning, we have a question about Kedro dataset factories that we're hoping you can help with. I will put the details in the thread to keep this channel tidy 🙂

4 comments

Can anyone suggest a good way of dynamically changing a catalog entry's path? For example, by default I want to use local paths for my intermediate datasets, but when I deploy to production I don't want anything saved locally. Duplicating catalog.yml in the conf/production/ folder is not ideal, as I would have to maintain two sets of each catalog entry.

31 comments

Hi Team!

Has anyone ever played with hyperparameter tuning frameworks within Kedro? I have found several scattered pieces of info related to this topic, but no complete solutions. Ultimately, what I would like to set up is a way to have multiple nodes running at the same time, all contributing to the same tuning experiment.

I would prefer using Optuna, and this is how I would go about it based on what I have found online:

  1. Create a node that creates an optuna study
  2. Create N nodes that each run hyperparameter tuning in parallel. Each of them loads the optuna study and if using kedro-mlflow each hyperparameter trial can be logged into its own nested run.
  3. Create a final node that processes the results of all tuning nodes

Does this sound reasonable to you? Has anyone produced such a kedro workflow already? I would love to see what it looks like.

I am also wondering:
  • I am thinking of creating an OptunaStudyDataset for the Optuna study (rough sketch below). Has anyone attempted this already?
  • For creating the N tuning nodes, I am thinking of using the approach presented in the GetInData blog post on dynamic pipelines. Would this be the recommended approach?
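
To make the first bullet concrete, this is the rough shape I have in mind for the dataset (just a sketch, untested; class and argument names are mine):

from typing import Any

import optuna
from kedro.io import AbstractDataset

class OptunaStudyDataset(AbstractDataset[optuna.Study, optuna.Study]):
    """Hand nodes an Optuna study backed by shared storage (e.g. an RDB URL),
    so parallel tuning nodes can all join the same study."""

    def __init__(self, study_name: str, storage: str):
        self._study_name = study_name
        self._storage = storage

    def _load(self) -> optuna.Study:
        # load_if_exists lets the first node create the study and later nodes join it.
        return optuna.create_study(
            study_name=self._study_name, storage=self._storage, load_if_exists=True
        )

    def _save(self, study: optuna.Study) -> None:
        # Trials are persisted by the storage backend as they complete,
        # so there is nothing extra to write here.
        pass

    def _describe(self) -> dict[str, Any]:
        return {"study_name": self._study_name, "storage": self._storage}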

Thanks!

8 comments

To keep the other thread focused: is there a way to manage a dataset of about 1 million files in AzureML? The files are about 4 KB each of binary data and are entirely independent of each other.

7 comments

I'm working on a big project that is about to hit its next phase. We are using Kedro and we have a large single Kedro project. To give you an idea of how big: we have about 500+ catalog entries and 500+ nodes across different Kedro pipelines (we disabled the default sum of all pipelines as it is too large to use). Now, I know the general guideline is to split your project into several smaller ones if it becomes too big, but I need some advice/opinions on this. I'll explain more details in the comments. Thanks!

8 comments