Confusion with Credential Configuration in Kedro 0.19 vs 0.18

Hello Kedro team,
I have encountered an issue regarding the configuration of credentials for accessing storage via abfss in Kedro 0.19.3, which was not present in version 0.18. Here is a summary of the problem:
In Kedro 0.18, I configured the credentials for accessing storage through Spark configurations with Azure Service Principal, and everything worked fine. However, after upgrading to Kedro 0.19.3, the same setup stopped working. After spending a couple of days troubleshooting, I discovered that adding the credentials as environment variables resolved the issue.
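For context, the cluster-level Spark configuration was the standard Hadoop ABFS OAuth setup for a Service Principal, roughly like the sketch below (everything in angle brackets is a placeholder, not my real values; `spark` is the SparkSession Databricks provides, and in practice these live in the cluster's Spark config rather than in code):

# Hadoop ABFS OAuth settings for an Azure Service Principal (placeholders only)
spark.conf.set("fs.azure.account.auth.type.<account>.dfs.core.windows.net", "OAuth")
spark.conf.set(
    "fs.azure.account.oauth.provider.type.<account>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set("fs.azure.account.oauth2.client.id.<account>.dfs.core.windows.net", "<client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<account>.dfs.core.windows.net", "<client-secret>")
spark.conf.set(
    "fs.azure.account.oauth2.client.endpoint.<account>.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)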
My questions are:

  1. Does Kedro 0.19.3 read these environment variables directly?
  2. Is this behavior managed by Kedro itself or by the abfss library?

Additionally, it seems redundant to add the credentials both in the Spark configuration and as environment variables. This redundancy is confusing and feels like a bug rather than a feature. Could you please clarify if this is the intended behavior?
Execution Environment:
  • This is being executed in Databricks.
  • The Spark configuration for the Azure Service Principal is added to the Databricks cluster that is used. (The cluster configuration includes credentials for multiple storage accounts.)
  • Only one storage account's credentials can be added as environment variables, but since the Spark config already authenticates the Spark session, just filling in these variables (even with incorrect values) allows access to all the storage accounts. (The exact variables I set are sketched below.)
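For reference, these are the environment variables that made it work. My assumption (I may be wrong about which layer consumes them) is that adlfs, the fsspec backend behind abfss://, falls back to azure-identity's DefaultAzureCredential, whose EnvironmentCredential reads exactly these names:

import os

# Placeholder values; in practice I set these at the cluster level.
# These are the names azure-identity's EnvironmentCredential looks for.
os.environ["AZURE_CLIENT_ID"] = "<service-principal-client-id>"
os.environ["AZURE_TENANT_ID"] = "<tenant-id>"
os.environ["AZURE_CLIENT_SECRET"] = "<client-secret>"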

Thank you for your assistance!


Hi @Carlos Prieto - Tomtom, thanks for the detailed explanation and sorry you had a bumpy experience. We're looking into this.

I have a few follow-up questions:

  • when on Kedro 0.18, what exact version were you using?
  • I assume in both cases you were using OmegaConfigLoader, is that correct?
  • as far as I understand (but I could be wrong), Kedro doesn't do any magic env-variable loading for credentials. Apart from PySpark, are there any relevant Python dependencies in your environment?

Thanks for the quick response! Here are the details:

  • I upgraded from Kedro 0.18.6.
  • In Kedro 0.18 I was using TemplatedConfigLoader, whereas in 0.19 we use the OmegaConfigLoader in the following way:
# Class that manages how configuration is loaded.
from kedro.config import OmegaConfigLoader  # noqa: E402

CONFIG_LOADER_CLASS = OmegaConfigLoader
# Keyword arguments to pass to the CONFIG_LOADER_CLASS constructor.
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
    }
}
  • I wonder why, even though the Spark session is already authenticated through the Databricks cluster config, Kedro raises a missing-credentials error for the dataset when it instantiates the Kedro Spark dataset and no environment variables are set. (A possible workaround is sketched after the dependency list below.) Here are the dependencies for my project:
delta-spark==2.3.0
kedro==0.19.3
pyspark==3.3.2
azure-identity==1.12.0
azure-keyvault-secrets==4.7.0
pandas==1.5.3
country_converter==1.0.0
unidecode==1.3.6
haversine==2.8.0
rapidfuzz==3.1.2
numpy==1.23.1
azure-mgmt-network==25.2.0
azure-mgmt-compute==30.4.0
kedro-viz==8.0.1
kedro-datasets[spark-sparkdataset, spark-sparkjdbcdataset, pandas-csvdataset, pickle-pickledataset]==3.0.1
hdfs==2.7.3
s3fs==2024.3.1
postal==1.1.10
deltalake==0.16.3
opentraveldata==0.0.9.post2
fuzzywuzzy==0.18.0
python-Levenshtein==0.25.0
country-converter==1.0.0
babel==2.14.0
langchain==0.0.347
openai>=0.27.0
geopandas~=0.11.0
tiktoken==0.6.0
faiss-cpu==1.8.0
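
For what it's worth, the workaround I am considering instead of environment variables is to pass the Service Principal to the dataset directly. This is only a sketch under the assumption that kedro-datasets forwards the credentials dict to fsspec/adlfs for its existence check; I have not verified that in the 3.0.1 source, and all angle-bracket values are placeholders:

# Hypothetical direct instantiation mirroring a catalog entry with credentials.
from kedro_datasets.spark import SparkDataset

ds = SparkDataset(
    filepath="abfss://<container>@<account>.dfs.core.windows.net/data/my_table",
    file_format="delta",
    credentials={
        # assumed to be forwarded as adlfs/AzureBlobFileSystem kwargs
        "account_name": "<account>",
        "tenant_id": "<tenant-id>",
        "client_id": "<client-id>",
        "client_secret": "<client-secret>",
    },
)
ds.exists()  # the fsspec-backed check that raised the missing-credentials error

In catalog.yml this would be the same entry with a credentials key pointing at an entry in conf/local/credentials.yml, so the secrets stay out of the catalog itself.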

Thanks for the help 🙂
