Join the Kedro community


In Kedro pipeline tests, what's the best way to mock the underlying nodes? We use pytest.
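
One minimal sketch (hypothetical project, dataset and function names, using pytest-mock): patch the node function on the module that builds the pipeline, before the pipeline object is constructed, then run it against an in-memory catalog.

import pandas as pd
from kedro.io import DataCatalog, MemoryDataset
from kedro.runner import SequentialRunner

from my_project.pipelines import data_science as ds  # hypothetical pipeline module


def test_pipeline_with_mocked_node(mocker):
    # replace the real training function with a stub; patch it on the module that
    # create_pipeline() reads it from, *before* the pipeline is constructed
    mocker.patch.object(ds, "train_model", return_value="fake_model")

    catalog = DataCatalog(
        {"model_input": MemoryDataset(pd.DataFrame({"x": [1, 2], "y": [0, 1]}))}
    )
    outputs = SequentialRunner().run(ds.create_pipeline(), catalog)

    # "model" is assumed to be a free (unregistered) output of the pipeline
    assert outputs["model"] == "fake_model"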

1 comment

Hello Team!
So it's been a few months since we started using kedro and it's time to deploy some of the pipelines we have created.
We need to choose an orchestrator, but this is not our field of expertise, so I wanted to ask for some help. We would like something simple to set up and use collaboratively. My company also requires that it be free (at least for now), our cloud provider is AWS, and we already use mlflow. Here are the alternatives we found:

  • Prefect (open-source, seems nice to use, kedro support, but free tier imposes limitations)
  • Flyte (free?, open-source, seems nice to use, no kedro support)
  • MLRun (free and open-source, no kedro support? seems nice to use but a bit more than an orchestrator, requires python 3.9)
  • Kubeflow Pipelines (free and open-source, kedro plugin, but others seem to think it is complex to set up and maintain)
  • Airflow (free and open-source, kedro plugin)
  • Sagemaker (Amazon, kedro plugin, personally dislike its UI and how other AWS services are organized around it)

What would you recommend? What should we consider to make such a decision?

Thanks for your help :)

2 comments

Hi, all. I have a question about how nodes/pipelines read their input datasets. Taking the catalog configuration in the following link as an example, I assume the kedro pipeline reads data from the CSV file stored in Amazon S3 when you specify inputs=["cars"] in the node configuration. If there are multiple different nodes that take "cars" as an input dataset, does the kedro pipeline reuse the dataset from memory, or does it read from Amazon S3 every time a node needs it?

https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-multiple-datasets-with-similar-configuration-using-yaml-anchors

And if it does re-read the same dataset from the data source every time one of those nodes runs, is it possible to keep the dataset in memory after the first read (from the Amazon S3 CSV file in this case) and reuse it from there, so you don't read from the data source multiple times, possibly shortening processing time?
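
If the dataset is file-based, each node that lists it as an input triggers a fresh load by default. One hedged option (a sketch with a made-up bucket path) is to wrap the entry in kedro.io.CachedDataset, which keeps the data in memory after the first load for the rest of the run; the same wrapping can also be declared in the catalog YAML.

from kedro.io import CachedDataset
from kedro_datasets.pandas import CSVDataset

# first load reads the CSV from S3; later loads in the same run reuse the cached copy
cars = CachedDataset(
    CSVDataset(filepath="s3://my_bucket/data/02_intermediate/company/cars.csv")
)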

Kedro + GetInData Folks! :kedro:

I am following this repo to submit a kedro pyspark job to dataproc serverless: https://github.com/getindata/kedro-pyspark-dataproc-demo

On submitting the job:

gcloud dataproc batches submit pyspark file:///home/kedro/src/entrypoint.py \
    --project my-project \
    --region=europe-central2 \
    --container-image=europe-central2-docker.pkg.dev/my-project/kedro-dataproc-demo/kedro-dataproc-iris:latest \
    --service-account dataproc-worker@my-project.iam.gserviceaccount.com \
    --properties spark.app.name="kedro-pyspark-iris",spark.dynamicAllocation.minExecutors=2,spark.dynamicAllocation.maxExecutors=2 \
    -- \
    run

The entry point script contains the following:

import os
from kedro.framework import cli

os.chdir("/home/kedro")
cli.main()


I am getting the following error:


[10/15/24 17:30:21] INFO     Loading data from               data_catalog.py:343
                             'example_iris_data'                                
                             (SparkDataSet)...                                  
[10/15/24 17:30:22] WARNING  There are 3 nodes that have not run.  runner.py:178
                             You can resume the pipeline run by                 
                             adding the following argument to your              
                             previous command:                                  
                                                                                
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/local/lib/python3.9/site-packages/kedro/io/core.py:186 in load          │
│                                                                              │
│   183 │   │   self._logger.debug("Loading %s", str(self))                    │
│   184 │   │                                                                  │
│   185 │   │   try:                                                           │
│ ❱ 186 │   │   │   return self._load()                                        │
│   187 │   │   except DataSetError:                                           │
│   188 │   │   │   raise                                                      │
│   189 │   │   except Exception as exc:                                       │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ message = 'Failed while loading data from data set                       │ │
│ │           SparkDataSet(file_format=csv, filepath=g'+2319                 │ │
│ │    self = <kedro.extras.datasets.spark.spark_dataset.SparkDataSet object │ │
│ │           at 0x7f4163077730>                                             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /usr/local/lib/python3.9/site-packages/kedro/extras/datasets/spark/spark_dat │
│ aset.py:380 in _load                                                         │
│                                                                              │
│   377 │                                                                      │
│   378 │   def _load(self) -> DataFrame:                                      │
│   379 │   │   load_path = _strip_dbfs_prefix(self._fs_prefix + str(self._get │
│ ❱ 380 │   │   read_obj = self._get_spark().read                              │
│   381 │   │                                                                  │
│   382 │   │   # Pass schema if defined                                       │
│   383 │   │   if self._schema:                                               │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ load_path = 'gs://aa-dev-crm-users/abhishek/misc/iris.csv'               │ │
│ │      self = <kedro.extras.datasets.spark.spark_dataset.SparkDataSet      │ │
│ │             object at 0x7f4163077730>                                    │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /usr/lib/spark/python/pyspark/sql/session.py:1706 in read                    │
│                                                                              │
│   1703 │   │   |100|Hyukjin Kwon|                                            │
│   1704 │   │   +---+------------+                                            │
│   1705 │   │   """                                                           │
│ ❱ 1706 │   │   return DataFrameReader(self)                                  │
│   1707 │                                                                     │
│   1708 │   @property                                                         │
│   1709 │   def readStream(self) -> DataStreamReader:                         │
│                                                                              │
│ ╭────────────────────────────── locals ──────────────────────────────╮       │
│ │ self = <pyspark.sql.session.SparkSession object at 0x7f4174ebcf40> │       │
│ ╰────────────────────────────────────────────────────────────────────╯       │
│                                                                              │
│ /usr/lib/spark/python/pyspark/sql/readwriter.py:70 in __init__               │
│                                                                              │
│     67 │   """                                                               │
│     68 │                                                                     │
│     69 │   def __init__(self, spark: "SparkSession"):                        │
│ ❱   70 │   │   self._jreader = spark._jsparkSession.read()                   │
│     71 │   │   self._spark = spark                                           │
│     72 │                                                                     │
│     73 │   def _df(self, jdf: JavaObject) -> "DataFrame":                    │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │  self = <pyspark.sql.readwriter.DataFrameReader object at                │ │
│ │         0x7f41631fa700>                                                  │ │
│ │ spark = <pyspark.sql.session.SparkSession object at 0x7f4174ebcf40>      │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322 in │
│ __call__                                                                     │
│                                                                              │
│ [Errno 20] Not a directory:                                                  │
│ '/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py'       │
│                                                                              │
│ /usr/lib/spark/python/pyspark/errors/exceptions/captured.py:185 in deco      │
│                                                                              │
│   182 │   │   │   if not isinstance(converted, UnknownException):            │
│   183 │   │   │   │   # Hide where the exception came from that shows a non- │
│   184 │   │   │   │   # JVM exception message.                               │
│ ❱ 185 │   │   │   │   raise converted from None                              │
│   186 │   │   │   else:                                                      │
│   187 │   │   │   │   raise                                                  │
│   188                                                                        │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │         a = (                                                            │ │
│ │             │   'xro91',                                                 │ │
│ │             │   <py4j.clientserver.JavaClient object at 0x7f417cb199d0>, │ │
│ │             │   'o88',                                                   │ │
│ │             │   'read'                                                   │ │
│ │             )                                                            │ │
│ │ converted = IllegalArgumentException()                                   │ │
│ │         f = <function get_return_value at 0x7f417b8c0310>                │ │
│ │        kw = {}                                                           │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
IllegalArgumentException: The value of property spark.app.name must not be null


I'm almost 100% sure that this error is not due to any mis-specification in my Dockerfile or requirements, because it works perfectly if I change the entrypoint script to the following:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

df = spark.read.csv("gs://aa-dev-crm-users/abhishek/misc/iris.csv", inferSchema=True, header=True)
print(df.show())

2 comments

I have a question about the memory dataset's default copy method. I noticed that if the data is a pandas DataFrame or a NumPy array, a copy rather than an assignment (i.e. a reference) is used by default. I'm wondering what the rationale for that is. Often making a reference is cheaper in runtime than making either a shallow or deep copy. Why is assignment not the default?

https://docs.kedro.org/en/stable/_modules/kedro/io/memory_dataset.html#MemoryDataset
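
For context, MemoryDataset takes a copy_mode argument ("deepcopy", "copy" or "assign"), so assignment can be opted into per dataset; a small sketch:

import pandas as pd
from kedro.io import MemoryDataset

df = pd.DataFrame({"id": [1, 2, 3]})

# the inferred default for pandas/numpy data is "copy"; "assign" stores the reference
ds = MemoryDataset(data=df, copy_mode="assign")
assert ds.load() is df  # no copy made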

8 comments

Hey there! Quick question about kedro-azureml. We are using AzureML, and we'd like to use AzureMLAssetDataset with dataset factories.
After a lot of headache and debugging, it seems impossible to use both: the credentials are passed to the AzureMLAssetDataset through a hook (after_catalog_created), but if you use dataset patterns (i.e. declare your dataset as "{name}.csv" or something similar), the hook is called before the patterned dataset is instantiated.
After that, before_node_run is called, and then AzureMLAssetDataset._load() is called, but the AzureMLAssetDataset.azure_config setter hasn't run yet (it is only called in the after_catalog_created hook). At first glance it looks like a kedro-azureml issue, since AzureMLAssetDataset._load() can be called without the setter having run when the dataset comes from a factory. But it might also be a kedro issue, as I think there should be an obvious way to set up credentials in that specific scenario, and I don't quite see it in the docs on hooks either.

7 comments

Hey Everyone

I am getting the errors below while the pipeline is trying to push some data to S3. Any heads-up?

ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request
The above exception was the direct cause of the following exception:

DatasetError: Failed while saving data to data set CSVDataset(filepath=ml-datawarehouse/warehouse/extraction/doc_table_insert.csv, load_args={},
protocol=s3, save_args={'index': False}, version=Version(load=None, save='2024-10-15T15.35.46.341Z')).
[Errno 22] Bad Request

3 comments

Hi all,

When running uv run kedro run, the node in blue gets run before the nodes upstream of it, even though those are inputs for the blue node (it basically unions two datasets back together). I would not expect this behavior, as I thought the entire pipeline would be executed as a DAG. Am I wrong in this assumption? I have the following pipelines: ingestion, data_prep, feature, model_input, modeling and reporting.

7 comments

Hi everyone

I have been exploring ibis for some time. I just wanted to understand whether there is a better way to write the code below in a more optimised fashion.

import ibis
import ibis.expr.types as ir

con = ibis.connect(POSTGRES_CONNECTION_STRING)
training_meta_table: ir.Table = con.table("training_metadata")

filters = {
    "customer_ids" : [59] ,
    "queue_names" : ["General Lit - Misclassifications", "MoveDocs-MR"],
    "start_date" : "2024-09-5 00:00:00",
    "end_date" : "2024-09-11 00:00:00",
    "doc_types" : [],
    "fields" : ["patientFirstName", "patientLastName", "Service Date", "Doctor"]
}
field_conditions = training_meta_table.fields_present.contains(filters["fields"][0]) | training_meta_table.fields_present.contains(filters["fields"][1]) | training_meta_table.fields_present.contains(filters["fields"][2]) | training_meta_table.fields_present.contains(filters["fields"][3])

So there are many OR conditions that we would like to dynamically join together into one final condition, based on the input filters.
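
One common way to build that dynamically (a sketch reusing the table and filters above) is to fold the generated conditions together with functools.reduce:

import functools
import operator

# OR together one .contains(...) condition per requested field
field_conditions = functools.reduce(
    operator.or_,
    (training_meta_table.fields_present.contains(f) for f in filters["fields"]),
)
# note: guard against an empty filters["fields"] list, since reduce needs at least one item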

5 comments

Hi all!

I am working with a clustering pipeline that I regularly want to rerun to monitor cluster migrations. I am using SnowflakeTableDatasets to save data directly to the data warehouse. Since it is not possible to have the same input and output dataset in Kedro, I was wondering what the best practice would be to rerun the clustering and store to the same SnowparkTableDataset, for example storing under a different timestamp. Would appreciate your help here!

5 comments

Hello Team,
I want to save a df back to a Snowpark Table dataset object, but I'm running into this error:

DatasetError: Failed while saving data to data set SnowparkTableDataset(...).
'DataFrame' object has no attribute 'write'
Code snippet in thread, please let me know if there is a way to do this 😄 Thanks so much!

6 comments

Hey Kedroids! :kedro:

(Apologies in advance for the long message, but I would really appreciate a good discussion on the points below from the kedro community! 🙂)

I have a use case of deploying kedro pipelines using the VertexAI SDK.

  1. In the production system (web app), I want to be able to trigger a kedro pipeline (or multiple pipelines) with specified parameters (say, from the UI).
  2. Let's say we have an API endpoint https://my.web.app/api/v1/some-task
     • The body includes parameters to trigger one or multiple kedro pipelines as a Vertex AI DAG

My VertexAI DAG has a combination of nodes (steps), and each node:

  1. May or may not be a kedro pipeline
  2. May be a pyspark workload running on dataproc or non spark workload running on a single compute VM
  3. May run a bigquery job
  4. May or may not run in a docker container

Let's take the example of submitting a kedro pipeline on Dataproc serverless running on a custom docker container using VertexAI SDK.

Questions:

  1. Do you package the kedro code as part of the Docker container or just the dependencies?

For example, I have seen this done a lot, and it packages the kedro code as well:

RUN mkdir /usr/kedro
WORKDIR /usr/kedro/
COPY . .

which means copying the whole project, and then in src/entrypoint.py:

from kedro.framework import cli
import os

os.chdir("/usr/kedro")
cli.main()

2. Do I need to package my kedro project as a wheel file and submit it with the job to Dataproc? If so, how have you seen that done with DataprocPySparkBatchOp?

3. How do you recommend to pass dynamic parameters to the kedro pipeline run?

As I understand it, cli.main() picks up sys.argv to infer the pipeline name and parameters, so one could do:

kedro run --pipeline <my_pipeline> --params=param_key1=value1,param_key2=2.0

Is there a better recommended way of doing this?
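
One alternative sketch (hypothetical paths and pipeline name): skip cli.main() in the entrypoint and use the programmatic KedroSession API, passing the dynamic values as extra_params.

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path("/usr/kedro")  # matches the Dockerfile layout above
bootstrap_project(project_path)

with KedroSession.create(
    project_path=project_path,
    extra_params={"param_key1": "value1", "param_key2": 2.0},
) as session:
    session.run(pipeline_name="my_pipeline")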

Thanks a lot, and hoping for a good discussion! 🙂

3 comments

Hey guys,
I've been experimenting with packaging a Kedro project using the kedro package command and I am running into an issue.

First off, I am attempting to run it like this:

from <my-package>.__main__ import main
main(
    ["--tags", "<my-tags>", "--env", "base"]
)

Is this correct?

When I do try to run it like this, the following error is raised:
ImportError: cannot import name 'TypeAliasType' from 'typing_extensions' (/databricks/python/lib/python3.10/site-packages/typing_extensions.py)
File <command-3656540420037005>, line 2
      1 from <my-package>.__main__ import main
----> 2 main(
      3     ["--tags", "int_tms_hotel_reservations", "--env", "base"]
      4 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-30b382f4-147d-466f-a67b-6ce8dcc92265/lib/python3.10/site-packages/sqlalchemy/util/typing.py:56
     54 from typing_extensions import TypeGuard as TypeGuard  # 3.10
     55 from typing_extensions import Self as Self  # 3.11
---> 56 from typing_extensions import TypeAliasType as TypeAliasType  # 3.12
     58 _T = TypeVar("_T", bound=Any)
     59 _KT = TypeVar("_KT")
How can I overcome this? I tried upgrading the version of the typing-extensions package without any luck. The current version of this package installed on my cluster is 4.12.2.

I am running this project on Databricks and I think it is best to avoid running the package using python -m ... That is why I am looking for a Python option. I am using Kedro 0.19.4.

2 comments

Hello everyone,

I am encountering some issues regarding the use of placeholders in the data catalog, and I was hoping you could shed some light on this.
I have the following pipeline:

load_date = settings.LOAD_DATE_COMPARISON.get("current")
previous_load_date = settings.LOAD_DATE_COMPARISON.get("previous")

def create_pipeline(**kwargs) -> Pipeline:
    format_data_quality = pipeline(
        [
            node(
                func=compare_id,
                inputs=[
                    f"maestro_indicadores_{load_date}",
                    f"maestro_indicadores_{previous_load_date}",
                ],
                outputs=f"compare_id_{load_date}_{previous_load_date}",
                name="compare_id_node",
                tags="compare_id",
            ),
        ]
    )
    return format_data_quality
With the corresponding catalog entry for the output:

compare_id_{load_date}_{previous_load_date}:
  type: json.JSONDataset
  filepath: reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json
The issue here is that whenever the value of load_date is something like 2024_07_01, it will generate a path like:
reports/2024/id_comparison/id_comparison_2024_07_01_2024_05_01.json

Note that the first placeholder is not being substituted with the intended value, while the others are.
This only happens when the value of load_date contains underscores; it does not happen with dots or hyphens.
Why does this happen?

12 comments

Hi everyone!
Does it make sense to combine temporal.io with kedro? Does anyone have any experience?
Thanks!

3 comments

Hey everyone! Interested to know how you all manage your requirements.txt file to reproduce the same environment. What tools do you prefer to keep the requirements.txt file updated?

60 comments

Hi everyone. By using hooks I’ve managed to show the execution time of each node. However, I also want to know how long the whole process takes, from loading data, through executing nodes, to saving data to the Databricks catalog.

So in the attached image, I want to know the time difference between “INFO Completed 1 out of tasks” and “INFO Loading data from ‘params: …”, not just the node execution time. I could of course work out the time difference by manual calculation, but because there are hundreds of nodes it would take at least an hour to calculate all of them, and it would be really helpful to see at first glance how long each task takes. Is there any way to do this? Is it also possible by utilizing hooks?

https://kedro-org.slack.com/archives/C03RKP2LW64/p1728353683266369
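
Hooks can cover this; here is a rough sketch (class and logger names are mine) that logs per-dataset load/save times in addition to total pipeline time, declaring only a subset of each hook spec's arguments:

import logging
import time

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class TimingHooks:
    """Rough sketch: log per-dataset load/save time plus total pipeline time."""

    def __init__(self):
        self._starts = {}

    @hook_impl
    def before_pipeline_run(self, run_params):
        self._starts["pipeline"] = time.perf_counter()

    @hook_impl
    def after_pipeline_run(self, run_params):
        logger.info("Whole run took %.2fs", time.perf_counter() - self._starts["pipeline"])

    @hook_impl
    def before_dataset_loaded(self, dataset_name):
        self._starts[f"load:{dataset_name}"] = time.perf_counter()

    @hook_impl
    def after_dataset_loaded(self, dataset_name, data):
        elapsed = time.perf_counter() - self._starts.pop(f"load:{dataset_name}")
        logger.info("Loading %s took %.2fs", dataset_name, elapsed)

    @hook_impl
    def before_dataset_saved(self, dataset_name, data):
        self._starts[f"save:{dataset_name}"] = time.perf_counter()

    @hook_impl
    def after_dataset_saved(self, dataset_name, data):
        elapsed = time.perf_counter() - self._starts.pop(f"save:{dataset_name}")
        logger.info("Saving %s took %.2fs", dataset_name, elapsed)

It would then be registered via HOOKS = (TimingHooks(),) in the project's settings.py.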

2 comments

Hi everyone!
I'm trying to run the following node in Kedro:
def test(a):
    print(a)
    return 2 + 2

node(
    func=test,
    inputs=["params:parameter"],
    outputs="not_in_catalog",
    name="test_node",
),
test() is in nodes.py and the node in pipeline.py. When I run kedro run --nodes test_node I get the following log:

(pamflow_kedro_env) s0nabio@hub:~/kedroPamflow$ kedro run --nodes test_node
[10/10/24 14:49:06] INFO     Using '/home/s0nabio/miniconda3/envs/pamflow_kedro_env/lib/python3.10/site-packages/kedro/framework/project/rich_logging.yml' as logging configuration.                                                                                                          __init__.py:249
[10/10/24 14:49:07] INFO     Kedro project kedroPamflow                                                                                                                                                                                                                                        session.py:327
Illegal instruction (core dumped)
I already ran Kedro in the active environment (Python 3.10.14) on a Windows machine and it worked. Now I'm trying to run it in a Linux VM, and that is when I get the error. The only libraries I have installed are:

birdnetlib==0.17.2
contextily==1.6.2
fsspec==2024.9.0
geopandas==1.0.1
kedro==0.19.8
kedro_datasets==4.1.0
librosa==0.10.2
matplotlib==3.6.2
numpy==1.23.5
pandas==2.2.3
pytest==8.3.3
PyYAML==6.0.2
scikit-maad==1.4.1
seaborn==0.13.2
statsmodels==0.14.4
tensorflow==2.17.0
If I run test() using Python directly in the terminal instead of through Kedro, I don't get the error. That's why I'm here: without any warnings, and just when I try to run the simplest kedro node, I get the error.

2 comments

Hi, I copy your question here. My team and I are using kedro with Databricks without problems. Our sources are Databricks native tables, which can be handled with the specific ManagedTableDataset, see here. You can unit-test your nodes with a local Spark cluster without issue too.

2 comments

Hello everyone

Just wanted to know: is there a way to access the values of command-line arguments like --env in our kedro pipeline source code?
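
One hedged sketch (class name is mine): capture the value from the context in an after_context_created hook, since the resolved environment is available as context.env.

from kedro.framework.hooks import hook_impl


class EnvCaptureHook:
    """Sketch: stash the active Kedro environment (whatever --env resolved to)."""

    env = None

    @hook_impl
    def after_context_created(self, context):
        EnvCaptureHook.env = context.env  # e.g. "base" or "local"

Registered via HOOKS in settings.py, the stored value can then be imported wherever it is needed.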

1 comment

Hi everyone! I have a couple of questions about Kedro:

  1. I'm using an external Java tool to convert XML to linked data in one of my nodes, and the tool produces an output, but it's created outside of the Python function. Right now, I'm using a dummy dataset as an output and then using that as an input for the next node to make Kedro Viz visualize the connection properly. However, this feels a bit clumsy. Is there a more elegant way to sequentially connect nodes in Kedro without requiring a dataset in between?
  2. I would like to use Kedro for a project that performs the ETL for multiple institutes. I'm planning to use namespaces since the ETL process is similar for most institutes. After running the individual pipelines, there is part of the ETL that can either be run with the output from a single institute or sometimes needs to be run with the outputs from all institutes together. Currently, with a pure Python approach, we output each institute's data into a shared directory and then run the shared part using the content of that directory. However, Kedro doesn't allow multiple nodes to output to the same dataset (folder in this case). How could I connect the shared pipeline with each institute's pipeline in this case?
Thanks in advance for your help!

5 comments

Hi kedroids :kedro:

We have a use case in which we are scheduling BigQuery queries to run in a specific order using a kedro pipeline.

We use the BigQuery client simply to trigger the SQL query on BigQuery, as follows:

from google.cloud import bigquery


def trigger_query_on_bigquery(
    query: str,
):
    client = bigquery.Client()
    query_job = client.query_and_wait(query)

    return True

The kedro DAG to schedule multiple queries in order looks as follows:

def create_retail_data_primary_pipeline() -> Pipeline:
    nodes = [
        node(
            func=trigger_prm_customer_on_big_query,
            outputs="prm_customer@status",
        ),
        node(
            func=trigger_prm_transaction_detail_ecom_on_big_query,
            inputs=["prm_product_hierarchy@status"],
            outputs="prm_transaction_detail_ecom@status",
        ),
        node(
            func=trigger_prm_transaction_detail_retail_on_big_query,
            inputs=["prm_product_hierarchy@status"],
            outputs="prm_transaction_detail_retail@status",
        ),
        node(
            func=trigger_prm_transaction_detail_on_big_query,
            inputs=[
                "prm_transaction_detail_ecom@status",
                "prm_transaction_detail_retail@status",
                "prm_product_hierarchy@status",
                "prm_customer@status",
            ],
            outputs="prm_transaction_detail@status",
        ),
        node(
            func=trigger_prm_incident_on_big_query,
            outputs="prm_incident@status",
        ),
        node(
            func=trigger_prm_product_hierarchy_on_big_query,
            outputs="prm_product_hierarchy@status",
        ),

    ]

Since the node can't output the dataframe itself, we output a transcoded entry with @status (which is just True), and then use the actual BigQuery spark.SparkDataset transcoded versions of these datasets in the downstream pipeline to enforce the order.

So I will use the prm_product_hierarchy@bigquery dataset in a downstream node, just so that kedro runs the node that triggers the BigQuery query first.

Is there a better way to do this?

14 comments

Hey everyone, I am trying to define the column dtypes of a CSV dataset because some columns contain IDs that Kedro interprets as floats, but should be interpreted as strings instead. Setting

load_args:
  dtype:
    user_id: str

save_args:
  dtype:
    user_id: str

does not seem to work for me. Appreciate your help!
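
One thing worth noting (not a full diagnosis): dtype is a pandas read_csv option, so only load_args can carry it; DataFrame.to_csv has no dtype argument. A quick check outside Kedro, with assumed file and column names:

import pandas as pd

# read_csv honours dtype, so the IDs stay strings (and keep any leading zeros)
df = pd.read_csv("data/01_raw/users.csv", dtype={"user_id": str})
print(df["user_id"].dtype)  # object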

9 comments

Hey Everyone

Interested to know which orchestration service you prefer for running kedro in production environments, and how the experience has been so far.

Recently I have been trying to run kedro on kubeflow and have been facing multiple issues.

11 comments

good morning all!
We are facing an error when using global variable interpolation with the OmegaConfigLoader. The error occurs when launching a Jupyter notebook, e.g. with kedro jupyter lab.
The issue seems very similar/identical to the one reported here: https://kedro-org.slack.com/archives/C03RKP2LW64/p1726216824633969

The full error stack is below. The global var is located in conf\globals.yml. The issue also occurs for the location conf\base\globals.yml.


Any help from the kedro team is very much appreciated

Traceback (most recent call last):
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\IPython\core\shellapp.py", line 322, in init_extensions
    self.shell.extension_manager.load_extension(ext)
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\IPython\core\extensions.py", line 62, in load_extension
    return self._load_extension(module_str)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\IPython\core\extensions.py", line 79, in _load_extension
    if self._call_load_ipython_extension(mod):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\IPython\core\extensions.py", line 129, in _call_load_ipython_extension
    mod.load_ipython_extension(self.shell)
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\kedro\ipython\__init__.py", line 62, in load_ipython_extension
    reload_kedro()
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\kedro\ipython\__init__.py", line 123, in reload_kedro
    catalog = context.catalog
              ^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\kedro\framework\context\context.py", line 187, in catalog
    return self._get_catalog()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\kedro\framework\context\context.py", line 223, in _get_catalog
    conf_catalog = self.config_loader["catalog"]
                   ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\kedro\config\omegaconf_config.py", line 201, in __getitem__
    base_config = self.load_and_merge_dir_config(  # type: ignore[no-untyped-call]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\kedro\config\omegaconf_config.py", line 341, in load_and_merge_dir_config
    for k, v in OmegaConf.to_container(
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\omegaconf.py", line 573, in to_container
    return BaseContainer._to_content(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\basecontainer.py", line 292, in _to_content
    value = get_node_value(key)
            ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\basecontainer.py", line 247, in get_node_value
    value = BaseContainer._to_content(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\basecontainer.py", line 292, in _to_content
    value = get_node_value(key)
            ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\basecontainer.py", line 244, in get_node_value
    conf._format_and_raise(key=key, value=None, cause=e)
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\base.py", line 231, in _format_and_raise
    format_and_raise(
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\basecontainer.py", line 242, in get_node_value
    node = node._dereference_node()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\base.py", line 246, in _dereference_node
    node = self._dereference_node_impl(throw_on_resolution_failure=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\base.py", line 277, in _dereference_node_impl
    return parent._resolve_interpolation_from_parse_tree(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\base.py", line 584, in _resolve_interpolation_from_parse_tree
    resolved = self.resolve_parse_tree(
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\base.py", line 764, in resolve_parse_tree
    return visitor.visit(parse_tree)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\antlr4\tree\Tree.py", line 34, in visit
    return tree.accept(self)
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\grammar\gen\OmegaConfGrammarParser.py", line 206, in accept
    return visitor.visitConfigValue(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\grammar_visitor.py", line 101, in visitConfigValue
    return self.visit(ctx.getChild(0))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\antlr4\tree\Tree.py", line 34, in visit
    return tree.accept(self)
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\grammar\gen\OmegaConfGrammarParser.py", line 342, in accept
    return visitor.visitText(self)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\grammar_visitor.py", line 301, in visitText
    return self._unescape(list(ctx.getChildren()))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\grammar_visitor.py", line 389, in _unescape
    text = str(self.visitInterpolation(node))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\grammar_visitor.py", line 125, in visitInterpolation
    return self.visit(ctx.getChild(0))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\antlr4\tree\Tree.py", line 34, in visit
    return tree.accept(self)
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\grammar\gen\OmegaConfGrammarParser.py", line 1041, in accept
    return visitor.visitInterpolationResolver(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\grammar_visitor.py", line 179, in visitInterpolationResolver
    return self.resolver_interpolation_callback(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\base.py", line 750, in resolver_interpolation_callback
    return self._evaluate_custom_resolver(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\base.py", line 694, in _evaluate_custom_resolver
    return resolver(
           ^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\omegaconf\omegaconf.py", line 445, in resolver_wrapper
    ret = resolver(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\IonutBarbu\miniconda3\envs\EIT-Epsilon\Lib\site-packages\kedro\config\omegaconf_config.py", line 384, in _get_globals_value
    raise InterpolationResolutionError(
omegaconf.errors.InterpolationResolutionError: Globals key 'model_to_use' not found and no default value provided.
    full_key: performance_metrics_best_model.filepath
    object_type=dict

16 comments