
Hey all, I'm running into a curious situation: when running a Kedro pipeline in Databricks and saving the results to MLflow (through the kedro_mlflow plugin), occasionally some parallel code will trigger a new run in the experiment. The clearest example is hyperparameter optimization with Optuna using n_jobs=-1 for parallel execution: out of 100 trials, maybe ~4 will randomly trigger a new MLflow run inside the experiment (the other trials run normally without triggering new runs).

This is driving me nuts. Any guess on possible causes for it?
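
For context, the Optuna part of the node looks roughly like this (simplified sketch; train_and_score stands in for the real training code, and the explicit run_id logging at the end is just an idea I'm experimenting with, not something kedro_mlflow does for you):

import mlflow
import optuna
from mlflow.tracking import MlflowClient

def train_and_score(lr: float) -> float:
    return -(lr - 0.01) ** 2  # stand-in for the real training code

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    return train_and_score(lr)

# the run that kedro_mlflow opened for this node
parent_run_id = mlflow.active_run().info.run_id
client = MlflowClient()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, n_jobs=-1)

# log trial results against the known run id rather than relying on the thread-local
# "active run", which is my current suspicion for what goes missing in worker threads
for t in study.trials:
    client.log_metric(parent_run_id, "trial_score", t.value, step=t.number)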

5 comments

Hello team!
Where can I find a list of all available hook methods and their signatures? I checked the docs, but apologies if I somehow missed it.
Many thanks!
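
For context, this is the kind of thing I mean; a minimal sketch from my understanding (the spec classes seem to live in kedro.framework.hooks.specs, and with pluggy you only declare the arguments you actually need):

from kedro.framework.hooks import hook_impl

class ProjectHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # called once before the pipeline starts
        print(f"About to run {len(pipeline.nodes)} nodes")

    @hook_impl
    def after_node_run(self, node, outputs):
        # called after each node; only the arguments named here are injected
        print(f"{node.name} produced {list(outputs)}")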

2 comments

Hey folks, has anyone used the kedro-azureml plugin on an Apple M1 Mac? I seem to be unable to install it locally due to a dependency on packages that are unsupported on M1 chips (azureml-sdk, etc.).

2 comments

Hello,
I hope that this finds you well.

One of my colleagues (using Windows) has an issue with kedro not being recognized as a CLI command.
She is using the Anaconda Prompt, created a virtual environment, and installed kedro (and other deps), but when running `kedro run` (from the activated conda env) she gets:

'kedro' is not recognized as an internal or external command, operable program or batch file.

NB: if we try to `import kedro` using that same conda env, it works properly.

Any ideas?
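
For what it's worth, these are the kind of quick checks that could be run from inside that same activated env (illustrative only); I believe `python -m kedro run` can also work as a stop-gap, since it bypasses the console-script shim:

import os
import sys
import kedro

print(kedro.__version__, kedro.__file__)           # the package itself imports fine
scripts_dir = os.path.join(sys.prefix, "Scripts")  # where pip puts console entry points on Windows
print(os.path.exists(os.path.join(scripts_dir, "kedro.exe")))     # is the CLI shim there at all?
print(scripts_dir.lower() in os.environ.get("PATH", "").lower())  # and is that folder on PATH?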

11 comments

Is there a way to use the ImageDataGenerator and flow_from_directory functions in Kedro? I would like to keep the dataset in memory and then use it later for model training, but I get the error message: DatasetError: Failed while saving data to data set MemoryDataset().
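
One workaround sketch I'm wondering about, based on my understanding that the default deep copy is what fails for Keras generator objects (copy_mode="assign" is my assumption, and on older Kedro versions the class is spelled MemoryDataSet):

from kedro.io import DataCatalog, MemoryDataset

# hold the generator in memory but hand it to downstream nodes by reference
# instead of deep-copying it
catalog = DataCatalog({"train_image_generator": MemoryDataset(copy_mode="assign")})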

3 comments

Guys, is there any built-in solution for handling large databases, so that nodes process them partially, say 100k rows running in batches of 10k each, instead of doing it by hand with a for loop or something like it?
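
For clarity, this is the kind of hand-rolled loop I'd rather not write in every node (the per-batch logic is a placeholder):

import pandas as pd

def process_in_batches(df: pd.DataFrame, batch_size: int = 10_000) -> pd.DataFrame:
    # work on 10k-row slices and stitch the results back together
    parts = []
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        parts.append(chunk.assign(score=chunk["value"] * 2))  # placeholder per-batch logic
    return pd.concat(parts, ignore_index=True)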

7 comments

Am I being dumb - is there no way to do this without writing a custom resolver like this?

20 comments

Hi all, quick Kedro-Viz question: I have Kedro-Viz 10 installed, but whenever I run kedro viz run, the pipeline rendered is out of date (new pipelines that are part of the default pipeline are not shown), and the version in the top right-hand corner of the rendered page shows Kedro-Viz v7. Any ideas on how to fix this? Is it a caching issue?

22 comments

Hello team,
I wonder if there is a proper Kedro way to do the following.

"{namespace}.{variant}.anomaly_scores":
  type: polars.CSVDataset
  filepath: data/08_reporting/{namespace}/anomaly_scores/{variant}.anomaly_scores.csv
I use this catalog entry to save data from a pipeline with different namespaces. Then, I take all these CSVs at the same time, from another pipeline, with this entry:
anomaly_scores:
  type: partitions.PartitionedDataset
  path: data/08_reporting/train_evaluation/anomaly_scores
  dataset:
    type: polars.CSVDataset
  filename_suffix: ".csv"
It works, but since it is not the same catalog entry, when I execute the two pipelines as part of a bigger one, the pipeline that reads the data, which has to come after the other, sometimes runs first. I thought of using a dummy entry/output dataset to force the order (see the sketch below). Is there a better way?
Thank you so much!
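
The dummy-dependency idea would look roughly like this (sketch; dataset and function names are placeholders): the writing pipeline emits a small marker dataset, and the collecting node declares it as an extra input, so Kedro's dependency resolution always schedules the reader after the writer.

from kedro.pipeline import node, pipeline

def mark_scores_written(anomaly_scores):
    return True  # the value is irrelevant; only the dependency edge matters

def collect_scores(partitioned_scores, _scores_written):
    return {name: load() for name, load in partitioned_scores.items()}

writer_tail = pipeline([
    node(mark_scores_written, "train_evaluation.variant_a.anomaly_scores", "anomaly_scores_written"),
])
reader = pipeline([
    node(collect_scores, ["anomaly_scores", "anomaly_scores_written"], "anomaly_scores_report"),
])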

2 comments

Hello, I want to use a namespaced pipeline and data catalog to get a series of dataframes, do some manipulations, and then save them all in one Excel spreadsheet in different sheets. I thought something like this would work in the catalog:

"{namespace}.spreadsheet_data":
  type: pandas.ExcelDataset
  filepath: data/03_primary/all_data_sources.xlsx
  save_args:
    sheet_name: "{namespace}_data"
but this doesn't work: I just end up with a spreadsheet with one sheet, named after whichever namespace ran last, i.e. each save must be overwriting the file.

I have read that I will need to specify a writer if I want to write to a file that already exists (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html) but I can't get that to work.
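
The closest I've got is sketching a single node that collects all the namespaced dataframes and writes the workbook in one pass with pandas' ExcelWriter, so later namespaces don't overwrite earlier sheets (function and wiring names below are placeholders):

import pandas as pd

def write_all_sheets(**spreadsheet_data: pd.DataFrame) -> None:
    # one node receives every namespaced dataframe and writes the whole workbook
    # in a single pass
    with pd.ExcelWriter("data/03_primary/all_data_sources.xlsx") as writer:
        for namespace, df in spreadsheet_data.items():
            df.to_excel(writer, sheet_name=f"{namespace}_data", index=False)
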
Is what I would like to do possible?
Many thanks

4 comments

Hi, does Kedro support the Google Cloud Logging library out of the box in its logging config? From the documentation I am not clear how far custom handlers are supported, or whether I have to add them manually. When would it be better to initialize it: before or after Kedro loads?
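
To make the question concrete, this is roughly what I mean (sketch, assuming the google-cloud-logging package; where exactly to call this relative to Kedro's own logging setup is the part I'm unsure about):

import logging

import google.cloud.logging  # assumes the google-cloud-logging package is installed

# attach Cloud Logging to the root logger before the Kedro session starts
# (e.g. at the top of settings.py, or early in an entry-point script)
client = google.cloud.logging.Client()
client.setup_logging(log_level=logging.INFO)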

17 comments

Hi Team! :kedro:

  • I have deployed my model inference pipeline as Kedro pipelines served as a Dockerized web API.
  • The handling of input data and parameters from the incoming HTTP request is done, and I am able to run the Kedro pipeline by initializing the KedroSession in code ✅


However, I am concerned about kedro pipeline run time per request, which is too high (~1 minute).
Questions:

  1. Is there a way to reduce kedro startup time?
  2. My pipelines have a lot of persistent catalog entries. One idea: if I convert every entry into a MemoryDataSet, persistence won't be needed, saving on I/O time. However, transcoding would be a problem in that case. Any ideas?
  3. Any other ways to speed up Kedro init and the general pipeline run?

Ideally I want to make zero changes between the actual Kedro pipeline and the inference Kedro pipeline; the current per-request pattern is sketched below.
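
Current per-request pattern, roughly (simplified sketch; the path, the "inference" pipeline name, and the use of extra_params are placeholders/assumptions for my Kedro version):

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = Path("/app/my-kedro-project")  # placeholder path inside the container
bootstrap_project(PROJECT_PATH)               # heavy project bootstrapping, done once at API startup

def handle_request(payload: dict) -> dict:
    # per request: only create a session and run the inference pipeline
    with KedroSession.create(project_path=PROJECT_PATH, extra_params=payload) as session:
        return session.run(pipeline_name="inference")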

Thanks! 🙂

3 comments

Hey team. Looking for some advice or insights on how to think about unit testing complex nodes in Kedro (or rather, nodes taking in complex data with a lot of edge cases). In these cases I usually follow the approach of integrating a lot of functionality into a single node, composed of several smaller private functions.
My question: how do I best test the node's actual output (standard stuff like: column a shouldn't have any nulls, column b should never be lower than 10)?

  • I feel like it would be impossible to create dummy data to account for all edge cases in the test function itself
  • Reading from the production input table, on the other hand, defeats the purpose of unit testing.
  • Does it make sense to generate synthetic or sample data from the node's input tables and store it somewhere to be read at testing time? (The kind of test I have in mind is sketched below.)
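
The kind of test I mean looks roughly like this (module and function names are made up):

import pandas as pd
import pytest

from my_project.pipelines.features.nodes import build_features  # hypothetical node

@pytest.fixture
def edge_case_frame() -> pd.DataFrame:
    # tiny hand-written frame encoding the edge cases I care about,
    # instead of a sample pulled from the production table
    return pd.DataFrame({"a": ["x", None, "y"], "b": [5, 10, 250]})

def test_output_invariants(edge_case_frame):
    out = build_features(edge_case_frame)
    assert out["a"].notna().all(), "column a should not contain any nulls"
    assert (out["b"] >= 10).all(), "column b should never be lower than 10"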

5 comments

Hello all! Is there a place where I can specify global options for my Kedro project? For instance, I'd like to preview 20 rows instead of 5 (the default) in Kedro Viz (and I don't want to do it individually for each dataset).

4 comments

Hi Team! :kedro:

My kedro pipeline is just stuck even before running any nodes

[11/14/24 17:09:07] WARNING  /root/.venv/lib/python3.9/site-packages/kedro/framework/startup.py:99: KedroDeprecationWarning: project_version in pyproject.toml is deprecated, use kedro_init_version instead  warnings.py:109
[11/14/24 17:09:15] INFO     Kedro project project  session.py:365
[11/14/24 17:09:17] WARNING  /root/.venv/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader  warnings.py:109
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/14 17:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[11/14/24 17:12:53] WARNING  /root/.venv/lib/python3.9/site-packages/pyspark/pandas/__init__.py:49: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.  warnings.py:109

  • kedro: 0.18.14
  • python: 3.9
  • Running inside a docker container (since requirements don't compile on M* macs)

I understand this is too little information to help with, but I have the same problem. Is there anywhere I could look to see where it is stuck?

8 comments

Hi all - I saw this was completed today (https://github.com/kedro-org/kedro/pull/4263) and was very excited. Now that it's done, what should my typical Kedro/uv initialization be? How do I get them both to work together without using the copier that created it? What should my working pattern with it be?

3 comments

Hello, team!
Does anyone know the best (or maybe the most Kedro-ic) way to work with a PartitionedDataset by processing the partitions individually (merging them would consume all the memory)? I want to apply the same operations to all partitions. Would it be a better idea to use/add namespaces for this (all my files have the format f"sessions_{YYYY-MM-DD}.parquet")? Thank you!
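
For context, the per-partition pattern I'm referring to is roughly this (sketch; transform stands in for the shared operations):

from typing import Any, Callable, Dict

def transform(df):
    return df  # placeholder for the shared per-partition operations

def process_partitions(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    # a PartitionedDataset hands the node a dict of {partition_id: load_function};
    # loading lazily inside the loop keeps only one partition in memory at a time,
    # and returning a dict lets a PartitionedDataset output write one file per key
    results = {}
    for partition_id, load in sorted(partitions.items()):
        results[partition_id] = transform(load())
    return results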

2 comments

Hey team,
Is it possible (or is there any workaround) to use a parameter in the catalog.yml when using the OmegaConfigLoader? My use case is that I want to select a parameter in Databricks Workflows and have it override a Kedro param at runtime. I was trying to use a global (in globals.yml), as those can be used in catalog.yml, but unfortunately they cannot be overridden at runtime, according to the docs.

13 comments

Has anyone successfully implemented a custom expectation for use with kedro-expectations? When I copy an example of a custom expectation (https://github.com/great-expectations/great_expectations/blob/develop/contrib/expe[…]erimental/expectations/expect_multicolumn_values_to_be_equal.py) to gx/plugins/expectations, gx is not able to find it and throws an exception.

2 comments

Hi guys, what is the purpose of the session_store.db file after you run a pipeline? Should it be committed to version control? Is it necessary for Kedro to run, or is there a way to configure things so this file won't be created?

1 comment

Hi Team,

Is there a way to not run certain Kedro hooks when Kedro-Viz loads? I have a Spark hook defined which runs every time I run kedro viz, and I want to disable it there.

Thanks! 🙂

19 comments

Hi folks,
We have our own MLflow server backed by internal S3.
Below are the settings I use locally:

os.environ["MLFLOW_TRACKING_URI"] = "<a target="_blank" rel="noopener noreferrer" href="https://xxx.com/mlflow/">https://xxx.com/mlflow/</a>"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "<a target="_blank" rel="noopener noreferrer" href="http://s3xxx.com">http://s3xxx.com</a>"
os.environ["S3_BUCKET_PATH"] = "<a target="_blank" rel="noopener noreferrer" href="s3://xxx/mlflow">s3://xxx/mlflow</a>"
os.environ["AWS_ACCESS_KEY_ID"] = "xxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxx"
os.environ['MLFLOW_TRACKING_USERNAME'] = 'xxx'
os.environ['MLFLOW_TRACKING_PASSWORD'] = 'xxx'
os.environ["MLFLOW_TRACKING_SERVER_CERT_PATH"] = "C:\\xxx\\ca-bundle.crt"
EXPERIMENT_NAME = "ZeMC012"
To use this within the Kedro framework, I created an mlflow.yml file in the conf/local folder with content like this:
server: 
  mlflow_tracking_uri: https://xxx.com/mlflow/
  MLFLOW_S3_ENDPOINT_URL: http://s3xxx.com
  S3_BUCKET_PATH: s3://xxx/mlflow
  AWS_ACCESS_KEY_ID: xxx
  AWS_SECRET_ACCESS_KEY: xxx
  MLFLOW_TRACKING_USERNAME: xxx
  MLFLOW_TRACKING_PASSWORD: xxx
  MLFLOW_EXPERIMENT_NAME: ZeMC012
  MLFLOW_TRACKING_SERVER_CERT_PATH: C:/xxx/ca-bundle.crt
But I get the error ValidationError: 8 validation errors for KedroMlflowConfig.
How should I modify it?

6 comments

Question on project setup.

My workflow usually looks like:

mkdir new-project
cd new-project
uv venv --python 3.xx
source .venv/bin/activate
uv pip install kedro
kedro new --name new-project 
Then my directories look like:
new-project/
    .venv/
    new-project/
        ... kedro stuff ...
but really I wanted the current directory itself to be my Kedro project (at the level where .venv is).
Is there a good way to do this?

of course I could just create the venv a directory up, like so:
new-project/
    ... kedro stuff ...
.venv/
but I wanted everything in the same directory without having to move all the Kedro project files one directory up.

2 comments

Hey team, how can I dynamically overwrite an existing dataset in the Kedro catalog with a new configuration or data (e.g. changing the file path or the dataset content) when running a pipeline from a Jupyter notebook on Databricks? The same goes for dynamically overwriting a parameter. This would be a one-time test run, so currently I'm planning to change the notebook on Databricks and then delete the added code for future runs. Any help on this would be great!
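
For context, the kind of one-off override I mean would look roughly like this in the notebook (a sketch under assumptions: the kedro.ipython extension is loaded so catalog is available, the dataset class and names are arbitrary, and how the interactive catalog interacts with a fresh session run is exactly the part I'm unsure about):

# inside the Databricks notebook, assuming %load_ext kedro.ipython has provided
# `catalog`, `context`, `session`, etc.
from kedro.framework.session import KedroSession
from kedro_datasets.pandas import ParquetDataset  # dataset class is a placeholder choice

# one-off dataset override on the interactive catalog (only affects catalog.load/save
# done here in the notebook; a fresh session run rebuilds the catalog from YAML)
catalog.add("model_input", ParquetDataset(filepath="/dbfs/tmp/new_input.parquet"), replace=True)

# one-off parameter override: create a session with extra_params for the test run
# (pass project_path=... if the working directory is not the project root)
with KedroSession.create(extra_params={"model_options": {"threshold": 0.7}}) as session:
    session.run(pipeline_name="__default__")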

6 comments