Hello! I am running into issues with the Kedro 0.19.11 release while running pipelines in Databricks. Specifically, I am hitting an error where a Python module imported by a node is unable to find the active SparkSession via SparkSession.getActiveSession() (see first image). Our pipeline is composed entirely of ibis.TableDataset datasets and I/O with the pyspark backend. What is throwing me off is that other nodes use the pyspark connection and perform operations across the Spark session just fine, but this single node fails because the module it imports cannot find the Spark session. The issue is not present in Kedro 0.19.10. My best guess is that it has something to do with the updated code in kedro/runner/sequential_runner.py using ThreadPoolExecutor, and possibly scoping issues? Apologies for the somewhat scattered explanation; there is quite a bit I don't fully understand here, so I'd appreciate any help or guidance. Let me know if I can provide any additional info as well.
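To illustrate what the imported module is doing, here is roughly the kind of defensive helper I could fall back to. This is only a sketch and assumes the root cause really is the thread-local lookup in getActiveSession():

```python
from pyspark.sql import SparkSession


def _get_spark() -> SparkSession:
    """Return the current Spark session, tolerating a thread-local lookup failure."""
    spark = SparkSession.getActiveSession()
    if spark is None:
        # getActiveSession() is scoped to the current thread; getOrCreate()
        # falls back to the process-wide default session created on the driver.
        spark = SparkSession.builder.getOrCreate()
    return spark
```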
Hey all! I'm working on tooling for running Kedro pipelines in our (pre-existing) Prefect deployment. I've been following the example from the docs and things were going pretty smoothly until I came to logging. Logging in Prefect is a little finicky, but what I'd like to do is route the Kedro logs through to the Prefect loggers and handlers. Happy to go into more detail about what I've tried, but I figured I'd first ask if anyone has experience here. Is there some other way to expose Kedro logs in the Prefect UI (which is ultimately my goal)?
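For context, the closest thing I have found so far is Prefect's PREFECT_LOGGING_EXTRA_LOGGERS setting, which (as I understand it) tells Prefect to also capture records from the named logger hierarchies and forward them through its own handlers. A rough sketch; the setting would need to be in place before Prefect configures logging, e.g. in the deployment's environment:

```python
import os

# Assumption: asking Prefect to also attach its handlers to the "kedro"
# logger hierarchy, so Kedro log records show up in the Prefect UI.
os.environ["PREFECT_LOGGING_EXTRA_LOGGERS"] = "kedro"
```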
Hello guys! I noticed there is a typing-annotation bug in kedro-mlflow 0.14.3, specific to Python 3.9. It seems that a fix has already been merged in the repo. When will the fix be released? Thanks!
Hi all, question about the roadmap - Is there any plan to add support for LLM-agentic pipelines in kedro? E.g. I think it would be really cool to represent an agentic graph (e.g. langgraph) as a kedro-viz pipeline.
When using the [Kedro] DataCatalog as a library, what's the best way of loading the parameters too? In other words, what should I add to the following so that the params load works?

```python
catalog = DataCatalog.from_config(conf_loader["catalog"])
catalog.load("params:model_size")
```
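For reference, this is roughly what I have been experimenting with: mirroring what KedroContext does and feeding the parameters into the catalog as params:<name> entries. A sketch, assuming a 0.19-style OmegaConfigLoader and that add_feed_dict is still available:

```python
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(conf_source="conf", base_env="base", default_run_env="local")
catalog = DataCatalog.from_config(conf_loader["catalog"])

# Expose the parameters both as a single "parameters" entry and as
# individual "params:<name>" entries, similar to what KedroContext does.
parameters = conf_loader["parameters"]
feed = {"parameters": parameters}
feed.update({f"params:{name}": value for name, value in parameters.items()})
# Note: KedroContext also flattens nested keys into dotted names such as
# "params:model.size"; this sketch only covers top-level keys.
catalog.add_feed_dict(feed)

catalog.load("params:model_size")
```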
Hi guys,
Trying to run kedro viz, and I am getting some strange errors like below:
```
(projx) ⋊> ~/P/projx on master ⨯ uv run --with kedro-viz kedro viz run
Built projx @ file:///home/ftopal/Projects/projx
Uninstalled 1 package in 0.68ms
Installed 1 package in 1ms
Installed 98 packages in 109ms
[02/18/25 14:30:28] INFO  Using 'conf/logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly. (__init__.py:270)
WARNING: Experiment Tracking on Kedro-viz will be deprecated in Kedro-Viz 11.0.0. Please refer to the Kedro documentation for migration guidance.
INFO: Running Kedro-Viz without hooks. Try `kedro viz run --include-hooks` to include hook functionality.
Starting Kedro Viz ...
[02/18/25 14:30:31] INFO  Using 'conf/logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly. (__init__.py:270)
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/ftopal/.local/share/uv/python/cpython-3.10.16-linux-x86_64-gnu/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ftopal/.local/share/uv/python/cpython-3.10.16-linux-x86_64-gnu/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/server.py", line 121, in run_server
    load_and_populate_data(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/server.py", line 70, in load_and_populate_data
    populate_data(data_access_manager, catalog, pipelines, session_store, stats_dict)
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/server.py", line 44, in populate_data
    data_access_manager.add_pipelines(pipelines)
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/data_access/managers.py", line 124, in add_pipelines
    self.add_pipeline(registered_pipeline_id, pipeline)
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/data_access/managers.py", line 180, in add_pipeline
    input_node = self.add_node_input(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/data_access/managers.py", line 259, in add_node_input
    graph_node = self.add_dataset(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/data_access/managers.py", line 371, in add_dataset
    graph_node = GraphNode.create_data_node(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/kedro_viz/models/flowchart/nodes.py", line 140, in create_data_node
    return DataNode(
  File "/home/ftopal/.cache/uv/archive-v0/TA93jbcQ_9KplZlKrI4mO/lib/python3.10/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 2 validation errors for DataNode
kedro_obj.is-instance[Node]
  Input should be an instance of Node [type=is_instance_of, input_value=[projx.models.llm.LLM(bac.../logs'), _logging=True)], input_type=list]
    For further information visit https://errors.pydantic.dev/2.10/v/is_instance_of
kedro_obj.is-instance[AbstractDataset]
  Input should be an instance of AbstractDataset [type=is_instance_of, input_value=[projx.models.llm.LLM(bac.../logs'), _logging=True)], input_type=list]
    For further information visit https://errors.pydantic.dev/2.10/v/is_instance_of
```
Hi everyone! I'm having trouble using tensorflow.TensorFlowModelDataset with an S3 bucket. The model saves fine locally, but when I configure it to save/load directly from S3, it doesn't work.
Some key points:
- Using boto3 or another script, I can access the model in S3 just fine.
- .h5 models: initially I could retrieve .h5 files from S3, but loading was not working properly, so I switched to the .keras format, which works fine when handling the files manually.
Has anyone managed to get tensorflow.TensorFlowModelDataset working with S3? Is there a recommended workaround or configuration to get it working? Any insights would be much appreciated!
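For reference, the catalog entry is set up roughly like this (bucket name and credentials key are placeholders, and save_args may need adjusting depending on the dataset version):

```yaml
# catalog.yml
trained_model:
  type: tensorflow.TensorFlowModelDataset
  filepath: s3://my-bucket/models/model.keras
  credentials: dev_s3

# credentials.yml
dev_s3:
  key: <aws_access_key_id>
  secret: <aws_secret_access_key>
```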
Guys, I would like to know if any of you work with Vertex AI pipelines and how you handle MLOps...
Is there a way to export logs without the Rich markup syntax? Rich works perfectly fine at the terminal; the problem is that I don't need the markup when I am not using a terminal (i.e. exporting to a different application, a log store, etc.).
GH: https://github.com/kedro-org/kedro/issues/4487
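The workaround I am currently considering is swapping the RichHandler for a plain StreamHandler in the project's logging config. A sketch; adjust levels and loggers to taste, and point KEDRO_LOGGING_CONFIG at the file if it is not in the default location:

```yaml
# conf/logging.yml
version: 1
disable_existing_loggers: False

formatters:
  simple:
    format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

handlers:
  console:
    class: logging.StreamHandler   # plain handler, no Rich markup
    level: INFO
    formatter: simple
    stream: ext://sys.stdout

loggers:
  kedro:
    level: INFO

root:
  handlers: [console]
```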
Also, am I right that there is no way, via the default CLI, to run pipelines only if they include all of the tags listed in the run configuration?
Guys, could someone help with using KedroContext properly?
I want to add an --only-missing CLI parameter to kedro run so that it runs pipelines using the run_only_missing method. From what I understand, adding this parameter to the default CLI was rejected because it can be implemented via KedroContext customization.
However, I'm not sure how to do this correctly. Or maybe I am missing something 😔
Could someone share an example or a code snippet? I don't see this class used in the docs (e.g. here or here).
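To make the question concrete, this is the kind of thing I imagine: a stripped-down sketch of a project-level cli.py that overrides kedro run, drops the built-in options for brevity, and approximates what AbstractRunner.run_only_missing does by filtering the pipeline to nodes whose outputs are missing. Not tested, and the group name is a placeholder:

```python
# <package_name>/cli.py
from pathlib import Path

import click
from kedro.framework.project import pipelines
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


@click.group(name="my-project")
def cli():
    """Custom project commands."""


@cli.command()
@click.option("--pipeline", "pipeline_name", default="__default__")
@click.option("--only-missing", is_flag=True, default=False)
def run(pipeline_name, only_missing):
    bootstrap_project(Path.cwd())
    with KedroSession.create() as session:
        if not only_missing:
            session.run(pipeline_name=pipeline_name)
            return

        # Approximate run_only_missing: find registered datasets that do not
        # exist yet (plus free outputs), then rerun only the nodes needed to
        # produce them and everything downstream of them.
        catalog = session.load_context().catalog
        pipeline = pipelines[pipeline_name]
        registered = set(catalog.list())
        free_outputs = pipeline.outputs() - registered
        missing = {ds for ds in registered if not catalog.exists(ds)}
        to_build = free_outputs | missing
        sub_pipeline = pipeline.only_nodes_with_outputs(*to_build) + pipeline.from_inputs(*to_build)

        session.run(
            pipeline_name=pipeline_name,
            node_names=[node.name for node in sub_pipeline.nodes],
        )
```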
Hey guys, I'm having trouble appending to a CSV with the DataCatalog. My node returns a DataFrame with one row and multiple metric names as columns. It writes results.csv to the folder accordingly, but it doesn't append the rows. In addition, a blank row is created after the first row (might that indicate the flaw?). When I debug step by step, both DataFrames get written to the CSV but overwrite each other.
Metric | Seed
--------|-------
1.0 | 42
```python
results.update({"seed": seed})
return pd.DataFrame.from_dict([results])
```
"{engine}.{variant}.results": type: pandas.CSVDataset # Underlying dataset type (CSV). filepath: data/08_reporting/{engine}/results.csv # Path to the CSV file. save_args: mode: "a" # Append mode for saving the CSV file.
Hello guys, I noticed that there is no support for the log_table method in kedro-mlflow, so I wonder what the right way would be to log additional data from a node, something that is not yet supported by the plugin?
Right now I just do something like this at the end of the node function:

```python
mlflow.log_table(data_for_table, output_filename)
```

But I am concerned that it won't always work and won't always log the data to the correct run, because I was not able to retrieve the active run id from inside the node with mlflow.active_run() (it returns None all the time). I'd like to use the Evaluation tab in the UI to manually compare some outputs of different runs.
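For completeness, this is roughly what the end of the node looks like at the moment (a sketch; names are placeholders, and the open question remains whether the fluent call attaches to the run that kedro-mlflow started):

```python
import mlflow
import pandas as pd


def evaluate(predictions: pd.DataFrame) -> pd.DataFrame:
    results_table = predictions.describe().reset_index()
    # Fluent API call at the end of the node, as described above. If no run is
    # active in this thread/process, MLflow may start a fresh, unrelated run,
    # which is exactly the concern with mlflow.active_run() returning None.
    mlflow.log_table(data=results_table, artifact_file="evaluation/results.json")
    return results_table
```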
Guys, is this the right place to ask about the kedro-mlflow plugin?
Hello guys, I am just starting to learn about Kedro and noticed that micro-packaging is being deprecated. Could someone please suggest any alternatives to that feature?
Good morning! We're looking for best practices to handle data quality issues within Kedro. Specifically:
1. We need to implement both manual and automated data curation
2. Ideally want to keep as much as possible within the Kedro pipeline structure
3. The current challenge is how to apply and track incoming data correction requests
Has anyone implemented something similar? Looking for patterns/approaches that worked well.
Morning! Just wondering how things work with regard to submitting bug fixes? I've read the contribution guidelines, and I have an open issue for the kedro-airflow plugin. Can I just create a fix branch and open a PR?
Hi guys,
I am having trouble running my Kedro project from a Docker build. I'm using MLflow and the kedro_mlflow.io.artifacts.MlflowArtifactDataset. I followed the instructions for building the container from the kedro-docker repo, but when running, those artifacts try to access my local Windows path instead of the container's path. Do you know what additional settings I have to make? All my settings are pretty much vanilla. The mlflow_tracking_uri is set to null.
"{dataset}.team_lexicon": type: kedro_mlflow.io.artifacts.MlflowArtifactDataset dataset: type: pandas.ParquetDataset filepath: data/03_primary/{dataset}/team_lexicon.pq metadata: kedro-viz: layer: primary preview_args: nrows: 5
```
Traceback (most recent call last):
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowParquetDataset(filepath=/home/kedro_docker/data/03_primary/D1-24-25/team_lexicon.pq, load_args={}, protocol=file, save_args={}).
[Errno 13] Permission denied: '/C:'
```
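In case it is relevant, the setting I am experimenting with is pinning the tracking URI to a path that exists inside the container instead of leaving it as null (a sketch, assuming kedro-mlflow's standard mlflow.yml layout):

```yaml
# conf/local/mlflow.yml
server:
  mlflow_tracking_uri: file:///home/kedro_docker/mlruns  # container path, not the Windows host path
```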
Hey,
I'm using databricks.yml in conf in order to generate the YAML to deploy to a Databricks workflow, with the command kedro databricks bundle. I want to add a run_if condition to a specific task. Let's say I have something like:

```yaml
<my_job>:
  schedule: # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
```

This works fine and the file is well generated. With the following:

```yaml
<my_job>:
  schedule: # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
  tasks:
    - task_key: default
      run_if: AT_LEAST_ONE_SUCCESS
```

every task from my_job gets the run_if condition. However, I only want a specific task to inherit this run_if condition:

```yaml
<my_job>:
  schedule: # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
  tasks:
    - task_key: <my_task>
      run_if: AT_LEAST_ONE_SUCCESS
```

But this is not correctly converted in my resource file for this job.
Hello! :kedro:
I am on Kedro 0.18.14, using a custom config loader based on TemplatedConfigLoader. Is there a way to access globals defined in globals.yml in Kedro nodes?
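The only pattern I have found so far is routing the value through parameters, since TemplatedConfigLoader also interpolates ${...} placeholders when parameters are loaded, and then passing it into the node as a regular params: input. A sketch; key names are illustrative:

```yaml
# conf/base/globals.yml
model_size: large

# conf/base/parameters.yml  (the placeholder is filled in from globals.yml)
model_size: ${model_size}
```

The node can then declare "params:model_size" as one of its inputs.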
Good morning, we have a question about Kedro dataset factories that we're hoping you'd be able to help with. I will put the details in the thread to keep this channel tidy 🙂
Can anyone suggest a good way of dynamically changing a catalog entry's path? For example, by default I want to use local paths for my intermediate datasets, but when I deploy to production I don't want anything to be saved locally. Duplicating catalog.yml in the conf/production/ folder is not ideal, as I would have to maintain two copies of each catalog entry.
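One option I have seen (assuming the 0.19 OmegaConfigLoader) is to keep a single catalog.yml and only override a small globals.yml per environment, referencing it via the globals resolver. A sketch; paths and bucket names are placeholders:

```yaml
# conf/base/globals.yml
paths:
  base: data

# conf/production/globals.yml  (the only file duplicated per environment)
paths:
  base: s3://my-prod-bucket/data

# conf/base/catalog.yml
intermediate_table:
  type: pandas.ParquetDataset
  filepath: ${globals:paths.base}/02_intermediate/table.parquet
```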
Hi Team!
Anyone ever played with hyperparameter tuning frameworks within kedro? I have found several scattered pieces of info related to this topic, but no complete solutions. Ultimately, I think what I would like to set up is a way to have multiple nodes running at the same time and all contributing to the same tuning experiment.
I would prefer using optuna and this is the way I would go about it based on what I have found online:
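(My actual plan is in the thread; purely as an illustration of the general idea, here is a generic sketch of a node attaching to a shared Optuna study so multiple workers can contribute trials to the same experiment. All names, the storage backend, and the model are placeholders.)

```python
import optuna
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def tune_hyperparameters(features, target) -> dict:
    """Kedro node: contribute trials to a shared Optuna study."""
    study = optuna.create_study(
        study_name="shared_experiment",       # same name across all workers
        storage="sqlite:///data/optuna.db",   # a shared RDB (Postgres/MySQL in practice)
        direction="maximize",
        load_if_exists=True,                  # attach to the study if it already exists
    )

    def objective(trial):
        alpha = trial.suggest_float("alpha", 1e-4, 10.0, log=True)
        return cross_val_score(Ridge(alpha=alpha), features, target, cv=3).mean()

    study.optimize(objective, n_trials=50)
    return study.best_params
```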
To keep the other thread focused: is there a way to manage a dataset of about 1 million files in AzureML? The files are about 4 KB each of binary data and are entirely independent of each other.
I'm working on a big project that is about to hit its next phase. We are using Kedro and we have a large single Kedro project. To give you an idea of how big: we have about 500+ catalog entries and 500+ nodes across different Kedro pipelines (we disabled the default sum of all pipelines as it is too large to use). Now, I know the general guideline is to split your project into several smaller ones if it becomes too big, but I need some advice/opinions on this. I'll explain more details in the comments. Thanks!