
Hello guys, I am just starting to learn about Kedro and noticed that micro-packaging is being deprecated. Could someone please suggest any alternatives to that feature?

5 comments

Good morning! We're looking for best practices to handle data quality issues within Kedro. Specifically:

1. We need to implement both manual and automated data curation
2. Ideally want to keep as much as possible within the Kedro pipeline structure
3. The current challenge is how to apply and track incoming data correction requests

Has anyone implemented something similar? Looking for patterns/approaches that worked well.

3 comments

Morning! Just wondering how things work with regard to submitting bug fixes. I've read the contribution guidelines, and I have an open issue for the kedro-airflow plugin. Can I just create a fix branch and open a PR?

3 comments

Hi guys,

I am having trouble running my Kedro project from a Docker build. I'm using MLflow and kedro_mlflow.io.artifacts.MlflowArtifactDataset.

I followed the instructions for building the container from the kedro-docker repo, but when running, those artifacts try to access my local Windows path instead of the container's path. Do you know what additional settings I need to make? All my settings are pretty much vanilla; the mlflow_tracking_uri is set to null.

"{dataset}.team_lexicon":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset  
  dataset:
    type: pandas.ParquetDataset  
    filepath: data/03_primary/{dataset}/team_lexicon.pq 
    metadata:
      kedro-viz:
        layer: primary  
        preview_args:
            nrows: 5 

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/kedro/io/core.py", line 335, in save
    save_func(self, data)
  File "/usr/local/lib/python3.12/site-packages/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py", line 81, in _save
    mlflow.log_artifact(local_path, self.artifact_path)
  File "/usr/local/lib/python3.12/site-packages/mlflow/tracking/fluent.py", line 1179, in log_artifact
    MlflowClient().log_artifact(run_id, local_path, artifact_path)
  File "/usr/local/lib/python3.12/site-packages/mlflow/tracking/client.py", line 1969, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/usr/local/lib/python3.12/site-packages/mlflow/tracking/_tracking_service/client.py", line 842, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/usr/local/lib/python3.12/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 43, in log_artifact
    mkdir(artifact_dir)
  File "/usr/local/lib/python3.12/site-packages/mlflow/utils/file_utils.py", line 211, in mkdir
    raise e
  File "/usr/local/lib/python3.12/site-packages/mlflow/utils/file_utils.py", line 208, in mkdir
    os.makedirs(target, exist_ok=True)
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  [Previous line repeated 7 more times]
  File "<frozen os>", line 225, in makedirs
PermissionError: [Errno 13] Permission denied: '/C:'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/framework/cli/cli.py", line 263, in main
    cli_collection()
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/framework/cli/cli.py", line 163, in main
    super().main(
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/framework/cli/project.py", line 228, in run
    return session.run(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/framework/session/session.py", line 399, in run
    run_result = runner.run(
                 ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/runner.py", line 123, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/sequential_runner.py", line 78, in _run
    super()._run(
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/runner.py", line 245, in _run
    node = future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/task.py", line 102, in __call__
    return self.execute()
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/task.py", line 88, in execute
    node = self._run_node_sequential(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/task.py", line 186, in _run_node_sequential
    catalog.save(name, data)
  File "/usr/local/lib/python3.12/site-packages/kedro/io/data_catalog.py", line 439, in save
    dataset.save(data)
  File "/usr/local/lib/python3.12/site-packages/kedro/io/core.py", line 827, in save
    super()._save_wrapper(save_func)(self, data)
  File "/usr/local/lib/python3.12/site-packages/kedro/io/core.py", line 340, in save
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowParquetDataset(filepath=/home/kedro_docker/data/03_primary/D1-24-25/team_lexicon.pq, load_args={}, protocol=file, save_args={}).
[Errno 13] Permission denied: '/C:'
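For context, a hedged diagnostic sketch (not kedro-mlflow internals; the file:./mlruns fallback below is an assumption): printing where MLflow would write run metadata and artifacts inside the container shows whether a Windows-style location is still baked into the tracking configuration or run metadata, which is what the '/C:' in the traceback suggests.

import mlflow

mlflow.set_tracking_uri("file:./mlruns")  # assumption: the project-local store a null tracking URI falls back to
with mlflow.start_run() as run:
    print(mlflow.get_tracking_uri())      # where run metadata is stored inside the container
    print(run.info.artifact_uri)          # where artifacts would be written; a 'C:' here explains the PermissionError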

3 comments

Hey,
I'm using databricks.yml in conf to generate the YAML to deploy on a Databricks workflow with the command kedro databricks bundle. Let's say I have something like:

<my_job>:
  schedule:
    # Run at 10:30
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
This works fine and the file is generated correctly.
Similarly, if I do something like:
<my_job>:
  schedule:
    # Run at 10:30
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
  tasks:
    - task_key: default
      run_if: AT_LEAST_ONE_SUCCESS
Every task in my_job gets the run_if condition. However, I want only a specific task to inherit this run_if condition:
<my_job>:
  schedule:
    # Run at 10:30
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
  tasks:
    - task_key: <my_task>
      run_if: AT_LEAST_ONE_SUCCESS
But this is not correctly converted in the resource file generated for this job.

Do you have any idea how I can solve this? Cheers!

3 comments

Hello! :kedro:

I am on kedro 0.18.14 using a custom config loader based on TemplatedConfigLoader. Is there a way to access globals defined in globals.yml in Kedro nodes?

Good morning, we have a question about Kedro dataset factories that we're hoping you can help with. I will put the details in the thread to keep this channel tidy 🙂

4 comments

Can anyone suggest a good way of dynamically changing a catalog entry's path? For example, by default I want to use local paths for my intermediate datasets, but when I deploy to production I don't want anything to be saved locally. Duplicating catalog.yml in the conf/production/ folder is not ideal, as I would have to maintain two sets of each catalog entry.
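One hedged option (a sketch assuming Kedro 0.19+ with OmegaConfigLoader; the resolver name and the DATA_ROOT variable are made up for illustration): register a custom resolver in settings.py so catalog entries build their base path from an environment variable, and only that variable changes between local and production.

# settings.py -- sketch only
import os

from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        # `${data_root:}` in catalog.yml resolves to $DATA_ROOT, or "data" when unset
        "data_root": lambda *args: os.getenv("DATA_ROOT", "data"),
    },
}
# a catalog entry would then read e.g.:  filepath: ${data_root:}/02_intermediate/model_input.pq

A YAML-only alternative along the same lines is to keep a single catalog in conf/base and override just one root value in a per-environment globals.yml (referenced via the globals resolver), rather than duplicating every entry.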

29 comments

Hi Team!

Anyone ever played with hyperparameter tuning frameworks within kedro? I have found several scattered pieces of info related to this topic, but no complete solutions. Ultimately, I think what I would like to set up is a way to have multiple nodes running at the same time and all contributing to the same tuning experiment.

I would prefer using optuna and this is the way I would go about it based on what I have found online:

  1. Create a node that creates an optuna study
  2. Create N nodes that each run hyperparameter tuning in parallel. Each of them loads the optuna study and if using kedro-mlflow each hyperparameter trial can be logged into its own nested run.
  3. Create a final node that processes the results of all tuning nodes

Does this sound reasonable to you? Has anyone produced such a kedro workflow already? I would love to see what it looks like.

I am also wondering:
  • I am thinking of creating an OptunaStudyDataset for the optuna study. Has anyone attempted this already?
  • For creating N tuning nodes, I am thinking of using the approach presented on the GetInData blog post on dynamic pipelines. Would this be the recommended approach?

Thanks!
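For illustration, a minimal sketch of steps 1 and 2 above, assuming an Optuna RDB (SQLite) storage shared between nodes; the node names, storage URL, and toy objective are made up, and the kedro-mlflow nested-run logging plus the dynamic-pipeline wiring are left out.

import optuna

STORAGE = "sqlite:///data/06_models/optuna.db"  # shared storage so parallel nodes see the same study


def create_study_node(study_name: str) -> str:
    """Step 1: create (or reuse) the study and pass its name downstream."""
    optuna.create_study(
        study_name=study_name,
        storage=STORAGE,
        direction="minimize",
        load_if_exists=True,
    )
    return study_name


def tune_node(study_name: str, n_trials: int = 20) -> str:
    """Step 2: one of the N parallel tuning nodes; loads the shared study and adds trials."""
    study = optuna.load_study(study_name=study_name, storage=STORAGE)

    def objective(trial: optuna.Trial) -> float:
        x = trial.suggest_float("x", -10, 10)
        return (x - 2) ** 2  # placeholder objective; swap in model training + CV score

    study.optimize(objective, n_trials=n_trials)
    return study_name


def report_node(study_name: str) -> dict:
    """Step 3: collect the best result once all tuning nodes have run."""
    study = optuna.load_study(study_name=study_name, storage=STORAGE)
    return {"best_params": study.best_params, "best_value": study.best_value}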

8 comments

To keep the other thread focused: is there a way to manage a dataset of about 1 million files in AzureML? The files are about 4 KB each of binary data, and are entirely independent of each other.

7 comments

I’m working on a big project that is about to hit its next phase. We are using Kedro and we have a large single Kedro project. To give you an idea of how big: we have about 500+ catalog entries and 500+ nodes across different Kedro pipelines (we disabled the default pipeline that sums all pipelines, as it is too large to use). Now, I know the general guideline is to split your project into several smaller ones if it becomes too big, but I need some advice/opinions on this. I’ll explain more details in the comments. Thanks!

8 comments

Hey there!
When using kedro-azureml with the AzureMLDataset type, it seems to use fsspec (as described in the documentation). Is there a way to use the "mode" parameter of AzureML's command and not have to download each file individually (through fsspec), but rather have the files available in rw_mount mode?

5 comments

Hi gang,
Is there any reasonable way to make asyncio and Kedro work together? I have an external IO client that provides only an async interface, and I can't make it work from a sync context, since there is already a running, managed asyncio loop somewhere in Kedro. Do coroutines as Kedro nodes make any sense?
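One pattern that tends to work (a hedged sketch, not a Kedro feature; AsyncClient below is a stand-in for the external IO client): keep the node itself a plain synchronous callable and drive the coroutine inside it, assuming no event loop is already running in the thread that executes the node.

import asyncio


class AsyncClient:
    """Stand-in for the external async-only IO client."""

    async def fetch(self, key: str) -> bytes:
        await asyncio.sleep(0)  # placeholder for real async IO
        return key.encode()


async def _fetch_all(keys: list[str]) -> dict[str, bytes]:
    client = AsyncClient()
    results = await asyncio.gather(*(client.fetch(k) for k in keys))
    return dict(zip(keys, results))


def fetch_node(keys: list[str]) -> dict[str, bytes]:
    # Kedro nodes are synchronous callables, so the coroutine is driven here;
    # asyncio.run() creates and closes its own event loop for the call.
    return asyncio.run(_fetch_all(keys))

# wired up as usual, e.g. node(fetch_node, inputs="params:keys", outputs="raw_blobs")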

7 comments

Hello!

I am having problems with kedro-mlflow. I am running a pipeline (pipeline-name) which terminates without any errors. The problem comes when I open the MLflow UI, where two runs are shown: on one side the pipeline-name run, and on the other a run with a random name. In the pipeline-name run the model is logged but no parameters are shown, whereas in the run with the arbitrary name the hyperparameters of the model are registered. Moreover, that run never ends, even when the execution of the pipeline is finished.

Does anyone know what could be happening?

Thank you!

23 comments

Hi team, I am looking for a real-life Kedro repo and its package dependencies. I've heard from different power users that they follow the convention of writing pipelines as packages for testing, versioning, etc.

I think it could be a really good way to attract more users, by letting them see how Kedro is used in "real life".

On a related note, is the convention to have an ML pipeline be composed of other pipelines (like in the docs: preprocessing, ds, etc.) or to have each step of the pipeline be a node?

2 comments

Guys, I'm having trouble trying to run Kedro on Azure Functions.

The error that I'm getting is No module named recsys

Does anyone know how to make sure the package wheel is installed when publishing the function to Azure?

I'm executing the following command to publish it from local to Azure:
func azure functionapp publish FUNC_APP_NAME


Further Info:

Here is my app folder

.
├── __pycache__
│   └── function_app.cpython-311.pyc
├── dist
│   ├── conf-recsys.tar.gz
│   └── recsys-0.1-py3-none-any.whl
├── function_app.py
├── host.json
├── local.settings.json
├── pyproject.toml
└── requirements.txt

The following is the function_app code:
import logging
import subprocess

import azure.functions as func

app = func.FunctionApp()

@app.route(route="DataPipeline", auth_level=func.AuthLevel.ANONYMOUS)
def DataPipeline(
    req: func.HttpRequest,
) -> func.HttpResponse:
    try:
        subprocess.run(
            [
                "python",
                "-m",
                "recsys",
                "-r",
                "ThreadRunner",
                "--conf-source=dist/conf-recsys.tar.gz",
            ],
            check=True,
            capture_output=True,
        )

        logging.info("Data successfully saved to Blob Storage.")

    except Exception as e:
        logging.error(f"Error processing data: {e}")
        return func.HttpResponse(
            f"{e}\n{e.stderr}",
            status_code=500,
        )

    return func.HttpResponse("DB Extraction Succeded")

And requirements.txt
--find-links dist
azure-functions
pyodbc
sqlalchemy
pandas
recsys 

4 comments

Hi team!

Is there any way to resolve factory datasets and access them from a DataCatalog/KedroDataCatalog instance?

I noticed that listing datasets with the CLI (kedro catalog list) automatically resolves them (for a given pipeline; see this bit of code), while calling catalog.list() in a Kedro Jupyter notebook only lists non-factory datasets (and parameters). Are those two returning different outputs by design, or is it a bug?

Thanks!

5 comments

Hi all! I have a project which uses a custom hook for logging. It seems the hook is not triggered when using the ParallelRunner. Is that intended behavior?

1 comment

hey guys, has anyone had this issue before?

AttributeError: 'CustomDataCatalog' object has no attribute '_data_sets'
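If it helps, a hedged sketch of one likely cause: the private DataCatalog attribute was renamed from _data_sets to _datasets as part of Kedro's wider "data set" to "dataset" renaming, so a custom catalog written against the old private attribute raises exactly this error on newer releases (the class and method below are illustrative only).

from kedro.io import DataCatalog, MemoryDataset


class CustomDataCatalog(DataCatalog):
    def names_with_prefix(self, prefix: str) -> list[str]:
        # old code used self._data_sets, which no longer exists on newer Kedro
        return [name for name in self._datasets if name.startswith(prefix)]


catalog = CustomDataCatalog(datasets={"model_input": MemoryDataset(), "model_output": MemoryDataset()})
print(catalog.names_with_prefix("model_"))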

11 comments

Maybe this is pushing the current dataset factories too far, but is it possible to parametrise a SQL catalog entry where the SQL is read from a file?

Like:

mytable:
  type: pandas.SQLQueryDataset
  credentials: postgres_dwh
  filepath: sql/mytable.sql

basically, I'd like to pass parameters to the SQL query
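A hedged sketch (not the asker's setup; the connection string and parameter name are made up): pandas.SQLQueryDataset forwards load_args to pandas.read_sql_query, so query parameters can be passed via load_args["params"] and bound to placeholders in the SQL file read from filepath.

from kedro_datasets.pandas import SQLQueryDataset

dataset = SQLQueryDataset(
    filepath="sql/mytable.sql",  # SQL text read from the file
    credentials={"con": "postgresql://user:pass@dwh:5432/analytics"},  # hypothetical connection string
    load_args={"params": {"start_date": "2024-01-01"}},  # bound to %(start_date)s inside mytable.sql
)
df = dataset.load()

In the catalog this maps to a load_args: {params: {...}} block, and those values could in turn come from a dataset factory placeholder or a templated global.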

7 comments

Hello,

I have this error: ValueError: Pipeline input(s) {'mlflow_run_id'} not found in the DataCatalog
What I am trying to do is pass the "mlflow_run_id" value from one pipeline (named training) to another pipeline (named deployment).
My kedro viz is attached, and this gist contains the source code for my two pipelines + nodes (training, deployment): https://gist.github.com/Noobzik/cdf7a4754067e587010d4819fae671f4
Can you help me pinpoint where I went wrong?

2 comments

Hello everyone 😄 !

I'm currently using kedro-airflow to generate my Airflow DAGs from my Kedro project. I followed the recommendation in the documentation and used a custom template to adapt the DAG for execution on Cloud Composer.
According to the documentation, it is possible to create TaskGroups if needed: Kedro-Airflow Documentation.

I’d like to group multiple nodes into TaskGroups, but I can't find any parameters that are automatically passed to the Jinja2 template to enable this grouping.

Has anyone done this before? Or does anyone know exactly what the documentation is referring to?

Thanks in advance!

8 comments

Hello! I’m using kedro==0.19.9 in a project, but would like to switch from conda to uv . Is there a recommended way to update an existing project? Thanks!

4 comments

Hi all, I'm having trouble getting kedro viz to run, even for an example repo. Here's the steps I've taken:

  1. Run uvx kedro new (uses kedro 0.19.11)
  2. Generate a project with options 1-5,7 (everything but pyspark)
  3. Run uv venv and uv sync
  4. Run kedro viz; I get the following output: Error: No such command 'viz'.

This is on a Mac with Python 3.11. Am I missing something very obvious?

6 comments

Hello everyone, how are you? I have an application that uses the Kedro Boot library to turn part of its Kedro pipelines into an API. The thing is, the application recently started to break without any dependency changes (we suspect an indirect dependency). Has anyone encountered this before? Could you provide some support?
Here is the reference to the GitHub issue: Can't instantiate abstract class KedroBootAdapter with abstract method _get_executor · Issue #40 · takikadiri/kedro-boot

3 comments