Hello guys, I am just starting to learn about Kedro and noticed that micro-packaging is being deprecated. Could someone please suggest alternatives to that feature?
Good morning! We're looking for best practices to handle data quality issues within Kedro. Specifically:
1. We need to implement both manual and automated data curation
2. Ideally want to keep as much as possible within the Kedro pipeline structure
3. The current challenge is how to apply and track incoming data correction requests
Has anyone implemented something similar? Looking for patterns/approaches that worked well.
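For concreteness, a hedged sketch of one pattern (dataset, column and function names are made up): manual corrections live in their own versioned catalog entry, and a dedicated node applies them, so correction requests are both applied and tracked inside the pipeline.

import pandas as pd

def apply_manual_corrections(raw: pd.DataFrame, corrections: pd.DataFrame) -> pd.DataFrame:
    """Apply curated corrections to the raw data.

    `corrections` is assumed to hold one row per correction request, with
    columns: record_id, column, new_value.
    """
    fixed = raw.copy()
    for _, corr in corrections.iterrows():
        fixed.loc[fixed["record_id"] == corr["record_id"], corr["column"]] = corr["new_value"]
    return fixed

Versioning the corrections dataset in the catalog then leaves a record of exactly which corrections were applied on each run.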
Morning! Just wondering how things work with regard to submitting bug fixes. I've read the contribution guidelines, and I have an open issue for the kedro-airflow plugin. Can I just create a fix branch and open a PR?
Hi guys,
I am having trouble running my Kedro project from a Docker build. I'm using MLflow and the kedro_mlflow.io.artifacts.MlflowArtifactDataset. I followed the instructions for building the container from the kedro-docker repo, but when running, those artifacts try to access my local Windows path instead of the container's path. Do you know what additional settings I have to make? All my settings are pretty much vanilla. The mlflow_tracking_uri is set to null.
"{dataset}.team_lexicon": type: kedro_mlflow.io.artifacts.MlflowArtifactDataset dataset: type: pandas.ParquetDataset filepath: data/03_primary/{dataset}/team_lexicon.pq metadata: kedro-viz: layer: primary preview_args: nrows: 5
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/kedro/io/core.py", line 335, in save
    save_func(self, data)
  File "/usr/local/lib/python3.12/site-packages/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py", line 81, in _save
    mlflow.log_artifact(local_path, self.artifact_path)
  File "/usr/local/lib/python3.12/site-packages/mlflow/tracking/fluent.py", line 1179, in log_artifact
    MlflowClient().log_artifact(run_id, local_path, artifact_path)
  File "/usr/local/lib/python3.12/site-packages/mlflow/tracking/client.py", line 1969, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/usr/local/lib/python3.12/site-packages/mlflow/tracking/_tracking_service/client.py", line 842, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/usr/local/lib/python3.12/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 43, in log_artifact
    mkdir(artifact_dir)
  File "/usr/local/lib/python3.12/site-packages/mlflow/utils/file_utils.py", line 211, in mkdir
    raise e
  File "/usr/local/lib/python3.12/site-packages/mlflow/utils/file_utils.py", line 208, in mkdir
    os.makedirs(target, exist_ok=True)
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  [Previous line repeated 7 more times]
  File "<frozen os>", line 225, in makedirs
PermissionError: [Errno 13] Permission denied: '/C:'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.12/site-packages/kedro/framework/cli/cli.py", line 263, in main
    cli_collection()
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/kedro/framework/cli/cli.py", line 163, in main
    super().main(
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/kedro/framework/cli/project.py", line 228, in run
    return session.run(
  File "/usr/local/lib/python3.12/site-packages/kedro/framework/session/session.py", line 399, in run
    run_result = runner.run(
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/runner.py", line 123, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/sequential_runner.py", line 78, in _run
    super()._run(
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/runner.py", line 245, in _run
    node = future.result()
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/task.py", line 102, in __call__
    return self.execute()
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/task.py", line 88, in execute
    node = self._run_node_sequential(
  File "/usr/local/lib/python3.12/site-packages/kedro/runner/task.py", line 186, in _run_node_sequential
    catalog.save(name, data)
  File "/usr/local/lib/python3.12/site-packages/kedro/io/data_catalog.py", line 439, in save
    dataset.save(data)
  File "/usr/local/lib/python3.12/site-packages/kedro/io/core.py", line 827, in save
    super()._save_wrapper(save_func)(self, data)
  File "/usr/local/lib/python3.12/site-packages/kedro/io/core.py", line 340, in save
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowParquetDataset(filepath=/home/kedro_docker/data/03_primary/D1-24-25/team_lexicon.pq, load_args={}, protocol=file, save_args={}).
[Errno 13] Permission denied: '/C:'
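A hedged guess at where the Windows path comes from, with a sketch (not from the thread): the MLflow file store records an absolute artifact location per run, so if the mlruns folder was first created on the Windows host, runs inside the container can still point at C:\ paths. kedro-mlflow reads the tracking URI from conf/<env>/mlflow.yml, so one option, assuming that layout, is to point it at a store created inside the container:

# conf/local/mlflow.yml (the path below is an assumption for this container)
server:
  mlflow_tracking_uri: /home/kedro_docker/mlruns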
Hey,
I'm using databricks.yml in conf to generate the YAML for deploying a Databricks workflow, with the command kedro databricks bundle. Let's say I have something like:

<my_job>:
  schedule:
    # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris

This works fine and the file is generated correctly.

<my_job>:
  schedule:
    # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
  tasks:
    - task_key: default
      run_if: AT_LEAST_ONE_SUCCESS

Every task from my_job gets the run_if condition. However, I want only a specific task to inherit this run_if condition:

<my_job>:
  schedule:
    # Run at 12:50
    quartz_cron_expression: '00 30 10 * * ?'
    timezone_id: Europe/Paris
  tasks:
    - task_key: <my_task>
      run_if: AT_LEAST_ONE_SUCCESS

But this is not correctly converted into the resource file generated from this job.
Hello! :kedro:
I am on kedro 0.18.14, using a custom config loader based on TemplatedConfigLoader. Is there a way to access globals defined in globals.yml in Kedro nodes?
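A hedged sketch of one workaround: TemplatedConfigLoader substitutes ${...} placeholders from globals.yml into the other config files, so a global can be surfaced to nodes by referencing it from parameters.yml and passing it in as a parameter (key names below are made up).

# conf/base/globals.yml
model_bucket: s3://my-bucket/models

# conf/base/parameters.yml
model_bucket: ${model_bucket}  # filled in from globals.yml by TemplatedConfigLoader

A node declared with inputs="params:model_bucket" then receives the global's value as a normal parameter.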
Good morning, we have a question about Kedro dataset factories that we're hoping you'd be able to help with. I will put the details in the thread to keep this channel tidy 🙂
Can anyone suggest a good way of dynamically changing a catalog entry's path? For example, by default I want to use local paths for my intermediate datasets, but when I deploy to production I don't want anything to be saved locally. Duplicating the catalog.yml in the conf/production/ folder is not ideal, as I would have to maintain two sets of each catalog entry.
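A hedged sketch of one way to avoid the duplication, assuming Kedro 0.19's OmegaConfigLoader and its globals resolver (folder and dataset names below are made up): keep the path prefix in per-environment globals and reference it from a single catalog entry.

# conf/base/globals.yml
folders:
  intermediate: data/02_intermediate

# conf/production/globals.yml
folders:
  intermediate: s3://my-bucket/02_intermediate

# conf/base/catalog.yml
model_input_table:
  type: pandas.ParquetDataset
  filepath: ${globals:folders.intermediate}/model_input_table.parquet

Running with kedro run --env=production then swaps the prefix without maintaining a second catalog.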
Hi Team!
Anyone ever played with hyperparameter tuning frameworks within kedro? I have found several scattered pieces of info related to this topic, but no complete solutions. Ultimately, I think what I would like to set up is a way to have multiple nodes running at the same time and all contributing to the same tuning experiment.
I would prefer using optuna and this is the way I would go about it based on what I have found online:
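A hedged sketch of one possible shape for this (names are made up): each node opens the same Optuna study through shared RDB storage, so nodes running in parallel all contribute trials to a single tuning experiment.

import optuna

def tune_partition(data, params: dict) -> float:
    """One Kedro node's share of a shared Optuna study."""

    def objective(trial: optuna.Trial) -> float:
        lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
        depth = trial.suggest_int("max_depth", 2, 12)
        # train/evaluate on `data` and return the validation metric
        return train_and_score(data, lr=lr, max_depth=depth)  # hypothetical helper

    study = optuna.create_study(
        study_name="shared-tuning-experiment",
        storage="postgresql://optuna:***@db:5432/optuna",  # shared storage so nodes cooperate
        direction="maximize",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=params["n_trials_per_node"])
    return study.best_value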
To keep the other thread focused: is there a way to manage a dataset of about 1 million files in AzureML? The files are about 4 KB each of binary data and are entirely independent of each other.
I’m working on a big project that is about to hit its next phase. We are using Kedro and we have a large single Kedro project. To give you an idea of how big: we have about 500+ catalog entries and 500+ nodes across different Kedro pipelines (we disabled the default sum of all pipelines as it is too large to use). Now, I know the general guideline is to split your project into several smaller ones if it becomes too big, but I need some advice/opinions on this. I’ll explain more details in the comments. Thanks!
Hey there!
When using kedro-azureml with the AzureMLDataset type, it seems to use fsspec (as described in the documentation). Is there a way to use the "mode" parameter in AzureML's command and, rather than downloading each file individually (through fsspec), have them available in rw_mount mode?
Hi gang,
Is there any reasonable way to make asyncio and Kedro work together? I have an external IO client that provides only an async interface, and I can't make it work from a sync context, since there is already a running, managed asyncio loop somewhere in Kedro. Do coroutines as Kedro nodes make any sense?
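A hedged sketch of one pattern (names are made up, and whether a loop is already running depends on how the session is started, e.g. from a notebook): keep the node itself synchronous and drive the coroutine to completion inside it, falling back to a separate thread when a loop is already running in the current thread.

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def _fetch_records(query: str) -> list[dict]:
    # call the async-only IO client here (placeholder)
    ...

def fetch_records_node(query: str) -> list[dict]:
    """Synchronous Kedro node wrapping an async-only client."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # no loop in this thread: safe to drive the coroutine directly
        return asyncio.run(_fetch_records(query))
    # a loop is already running (e.g. Jupyter): run the coroutine in its own thread and loop
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, _fetch_records(query)).result()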
Hello!
I am having problems with kedro-mlflow. I am running a pipeline (pipeline-name) which terminates without any errors. The problem comes when I open the MLflow UI, where two runs are shown: on one side the pipeline-name run, and on the other a run with a random name. In the pipeline-name run the model is logged but no parameters are shown, whereas in the run with the arbitrary name the hyperparameters of the model are indeed registered. Moreover, that run never ends, even after the execution of the pipeline has finished.
Does anyone know what could be happening?
Thank you!
Hi team, I am looking for a real-life Kedro repo and its package dependencies. I've heard from different power users that they follow the convention of writing pipelines as packages for testing, versioning, etc.
I think it could be a really good way to attract more users, letting them see how Kedro is used in "real life".
On a related note, is the convention to have an ML pipeline be composed of other pipelines (like in the docs: preprocessing, ds, etc.) or to have each step of the pipeline be a node?
Guys, I'm having trouble trying to run Kedro on Azure Functions.
The error that I'm getting is No module named recsys
Does anyone know how to make sure the package wheel is installed when publishing the function to Azure?
I'm executing the following command to publish it from local to Azure:
func azure functionapp publish FUNC_APP_NAME
Further Info:
Here is my app folder:
.
├── __pycache__
│   └── function_app.cpython-311.pyc
├── dist
│   ├── conf-recsys.tar.gz
│   └── recsys-0.1-py3-none-any.whl
├── function_app.py
├── host.json
├── local.settings.json
├── pyproject.toml
└── requirements.txt
import logging
import subprocess

import azure.functions as func

app = func.FunctionApp()


@app.route(route="DataPipeline", auth_level=func.AuthLevel.ANONYMOUS)
def DataPipeline(req: func.HttpRequest) -> func.HttpResponse:
    try:
        subprocess.run(
            [
                "python",
                "-m",
                "recsys",
                "-r",
                "ThreadRunner",
                "--conf-source=dist/conf-recsys.tar.gz",
            ],
            check=True,
            capture_output=True,
        )
        logging.info("Data successfully saved to Blob Storage.")
    except Exception as e:
        logging.error(f"Error processing data: {e}")
        return func.HttpResponse(
            f"{e}\n{e.stderr}",
            status_code=500,
        )
    return func.HttpResponse("DB Extraction Succeded")
--find-links dist
azure-functions
pyodbc
sqlalchemy
pandas
recsys
Hi team!
Is there any way to resolve factory datasets and access them from a DataCatalog/KedroDataCatalog instance?
I notice that using the CLI to list datasets (kedro catalog list) will automatically resolve them (for a given pipeline - see this bit of code), while doing catalog.list() in a kedro jupyter notebook will just list non-factory datasets (and parameters). Are those two returning different outputs by design, or is it a bug?
Thanks!
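A hedged sketch of what this can look like programmatically, leaning on private APIs that may differ between Kedro versions (treat it as an assumption rather than the supported route): iterate over the dataset names a pipeline actually uses and let the catalog materialise any factory-pattern matches on access.

from kedro.framework.project import pipelines

# `catalog` as provided by a kedro jupyter session
pipeline = pipelines["__default__"]

resolved = {}
for name in pipeline.datasets():  # every dataset name the pipeline touches
    resolved[name] = catalog._get_dataset(name)  # private API: resolves factory patterns on access

print(sorted(resolved))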
Hi all! I have a project which uses a custom hook for logging. It seems the hook is not triggered when using the ParallelRunner. Is that intended behavior?
hey guys, has anyone had this issue before?
AttributeError: 'CustomDataCatalog' object has no attribute '_data_sets'
Maybe this is pushing the current dataset factories too far but is it possible to parametrise a SQL Catalog entry where the SQL is read from a file?
Like:
mytable:
  type: pandas.SQLQueryDataset
  credentials: postgres_dwh
  filepath: sql/mytable.sql
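A hedged sketch of the factory version, assuming dataset-factory placeholders can also be substituted inside filepath (worth verifying on your Kedro version; names are made up):

"{table}__query":
  type: pandas.SQLQueryDataset
  credentials: postgres_dwh
  filepath: sql/{table}.sql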
Hello,
I have this error: ValueError: Pipeline input(s) {'mlflow_run_id'} not found in the DataCatalog
What I am trying to do is pass the "mlflow_run_id" value from one pipeline (named training) to another pipeline (named deployment).
My kedro viz is attached.
This gist contains the source code for my two pipelines + nodes (training) (deployment): https://gist.github.com/Noobzik/cdf7a4754067e587010d4819fae671f4
Can you help me point out where I went wrong?
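Not from the gist, just a hedged sketch of how this error is often resolved: a pipeline's free input must exist in the DataCatalog, so the training pipeline can write mlflow_run_id to a persisted dataset that the deployment pipeline then declares as an input (dataset type and path below are assumptions).

mlflow_run_id:
  type: text.TextDataset
  filepath: data/08_reporting/mlflow_run_id.txt

With that entry, a training node outputs mlflow_run_id and a deployment node takes it as an input, instead of relying on an in-memory value that never reaches the catalog.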
Hello everyone 😄 !
I'm currently using kedro-airflow to generate my Airflow DAGs from my Kedro project. I followed the recommendation in the documentation and used a custom template to adapt the DAG for execution on Cloud Composer.
According to the documentation, it is possible to create TaskGroups if needed: Kedro-Airflow Documentation.
I’d like to group multiple nodes into TaskGroups, but I can't find any parameters that are automatically passed to the Jinja2 template to enable this grouping.
Has anyone done this before? Or does anyone know exactly what the documentation is referring to?
Thanks in advance!
Hello! I’m using kedro==0.19.9 in a project, but would like to switch from conda to uv. Is there a recommended way to update an existing project? Thanks!
Hi all, I'm having trouble getting kedro viz to run, even for an example repo. Here are the steps I've taken:
1. uvx kedro new (uses kedro 0.19.11), choosing tools 1-5,7 (everything but pyspark)
2. uv venv and uv sync
3. When I run kedro viz, I get the following output: Error: No such command 'viz'.
Hello everyone, how are you? I have an application that uses the Kedro Boot library to turn part of the Kedro pipelines into an API. The thing is, the application recently started to break without any direct dependency changes (we suspect an indirect dependency). Has anyone encountered this before? Could you provide some support?
Here is the reference to the GitHub issue: Can't instantiate abstract class KedroBootAdapter with abstract method _get_executor · Issue #40 · takikadiri/kedro-boot