Hey all, I'm running into a curious situation: when running a Kedro pipeline in Databricks and saving the results to MLflow (through the kedro_mlflow plugin), occasionally some parallel code will trigger a new run in the experiment. The biggest example is running hyperparameter optimization with Optuna with n_jobs=-1 for parallel execution: out of 100 trials, maybe ~4 will randomly trigger a new MLflow run inside the experiment (the other trials run normally without triggering new runs).
This is driving me nuts. Any guess on possible causes for it?
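For illustration, a minimal sketch of one possible workaround, assuming the stray runs come from joblib worker threads that don't see the parent MLflow run and therefore create their own when they log. The idea is to capture the parent run id before optimization and log through MlflowClient, which never starts an implicit run (the `train_and_score` helper and metric names are made up):

```python
import mlflow
import optuna
from mlflow.tracking import MlflowClient

def tune(train_df):
    parent_run = mlflow.active_run()        # run opened by kedro_mlflow
    client = MlflowClient()

    def objective(trial):
        score = train_and_score(train_df, trial)  # hypothetical training helper
        # log against the parent run explicitly instead of mlflow.log_metric,
        # so worker threads never create a fresh run of their own
        client.log_metric(parent_run.info.run_id, f"trial_{trial.number}_score", score)
        return score

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100, n_jobs=-1)
    return study.best_params
```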
Hello team!
Where can I find a list of all hook methods available and their signatures? I checked the docs but I apologize if I somehow missed it.
Many thanks!
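For what it's worth, in recent Kedro versions the hook specifications are defined in kedro.framework.hooks.specs; a quick sketch to print them and their signatures from your own install (the module path is an assumption and may differ between versions):

```python
import inspect
from kedro.framework.hooks import specs

# print every hook spec class and the signature of each of its hook methods
for _, cls in inspect.getmembers(specs, inspect.isclass):
    if cls.__module__ != specs.__name__:
        continue
    print(cls.__name__)
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        print(f"    {name}{inspect.signature(func)}")
```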
Hey folks, has anyone used the kedro-azureml plugin on an Apple M1 Mac? I seem to be unable to install it locally due to a dependency on packages that are unsupported on M1 chips (azureml-sdk, etc.).
Hello,
I hope that this finds you well.
One of my mates (using Windows) has an issue with kedro not being recognized as a CLI.
She is using the Anaconda prompt, created a virtual environment, installed kedro (and other deps), but when running kedro run (from the activated conda env) she gets:
'kedro' is not recognized as an internal or external command, operable program or batch file.
NB: if we try to `import kedro` from that same conda env, it works properly.
Is there a way to use the ImageDataGenerator and flow_from_directory functions in Kedro? I would like to save the dataset in memory and then use it later for model training, but I get the error message: DatasetError: Failed while saving data to data set MemoryDataset().
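A guess worth checking, sketched below: MemoryDataset deep-copies data by default, which can fail for objects like Keras generators, and copy_mode="assign" stores the object as-is (the dataset name is made up; in older Kedro versions the class is spelled MemoryDataSet):

```python
from kedro.io import MemoryDataset

# hold the generator with plain assignment instead of deepcopy;
# the equivalent catalog.yml entry would set `copy_mode: assign`
image_generator = MemoryDataset(copy_mode="assign")
```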
Guys, are there any built-in solutions to handle large databases, so that nodes process them partially? Let's say 100k rows would be processed in batches of 10k each, instead of doing it by hand with a for loop or something like that...
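Not a definitive answer, but a sketch of one route, assuming a pandas SQL dataset: passing chunksize through load_args makes the load return an iterator of DataFrames, and a node can consume (and even yield) the chunks one at a time. Dataset, table, and column names below are made up:

```python
from typing import Iterator

import pandas as pd

# catalog.yml (for context, assumed setup):
#   transactions:
#     type: pandas.SQLQueryDataset
#     sql: SELECT * FROM transactions
#     load_args:
#       chunksize: 10000   # load() then returns an iterator of 10k-row chunks

def clean_transactions(chunks: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for chunk in chunks:
        # process 10k rows at a time instead of the whole table
        yield chunk.dropna(subset=["amount"])
```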
Am I being dumb - is there no way to do this without writing a custom resolver like this?
Hi all, quick kedro-viz question: I have Kedro-Viz 10 installed; however, whenever I run kedro viz run, the rendered pipeline is out of date (new pipelines that are part of default are not shown), and the version in the top right-hand corner of the rendered page shows Kedro-Viz v7. Any ideas on how to fix this? Is it a caching issue?
Hello team,
I wonder if there is a way to do the following in a proper kedro way.
```yaml
"{namespace}.{variant}.anomaly_scores":
  type: polars.CSVDataset
  filepath: data/08_reporting/{namespace}/anomaly_scores/{variant}.anomaly_scores.csv
```
I use this catalog entry to save data from a pipeline with different namespaces. Then, I take all these CSVs at the same time, from another pipeline, with this entry:
```yaml
anomaly_scores:
  type: partitions.PartitionedDataset
  path: data/08_reporting/train_evaluation/anomaly_scores
  dataset:
    type: polars.CSVDataset
  filename_suffix: ".csv"
```
It works, but since it is not the same entry, when I execute the two pipelines as part of a bigger one, the pipeline that reads the data, which has to come after the other, sometimes runs before it. I thought of using a dummy entry/output variable to force the order. Is there a better way?
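For what it's worth, a sketch of that dummy-dependency idea with made-up node and dataset names: the producing node emits an extra "done" marker that the consuming node lists as an input, so Kedro's topological ordering keeps the two pipelines in sequence.

```python
from kedro.pipeline import node, pipeline

def write_scores(features):
    scores = compute_scores(features)   # hypothetical existing logic
    return scores, True                 # second value is only a "done" marker

def collect_scores(partitions, _done):
    # _done is never used; it just forces this node to run after write_scores
    return combine(partitions)          # hypothetical existing logic

producer = pipeline([
    node(write_scores, inputs="features", outputs=["anomaly_scores", "scores_done"]),
])
consumer = pipeline([
    node(collect_scores, inputs=["anomaly_scores_partitioned", "scores_done"], outputs="evaluation"),
])
```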
Hello, I want to use a namespaced pipeline and data catalog to get a series of dataframes, do some manipulations, and then save them all in one Excel spreadsheet in different sheets. I thought something like this would work in the catalog:
```yaml
"{namespace}.spreadsheet_data":
  type: pandas.ExcelDataset
  filepath: data/03_primary/all_data_sources.xlsx
  save_args:
    sheet_name: "{namespace}_data"
```
but this doesn't work. I just end up with a spreadsheet with one sheet - with the name of whatever namespace ran last. I.e. it must be overwriting it each time.
Anyone tried out combining SQLModel and pydantic-kedro?
Hi, does Kedro support the Google Cloud Logging library out of the box in its logging configuration? It's not clear to me from the documentation how far adding custom handlers can go, or whether I have to do it manually. And when would it be better to initialize them, before or after Kedro loads?
Hi Team! :kedro:
If I create the KedroSession in code ✅ and use MemoryDataSet for the intermediate data, then persistence won't be needed, saving on I/O time. However, transcoding would be a problem in this case. Any ideas?
Hey team. Looking for some advice or insights on how to think about unit testing complex nodes in Kedro (or rather, nodes taking in complex data with a lot of edge cases). In these cases I usually follow the approach of integrating a lot of functionality into a single node, composed of several smaller private functions.
My question: how best to test the node's actual output (standard stuff like column a shouldn't have any nulls, column b should never be lower than 10)?
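For those output checks specifically, a minimal pytest sketch, assuming the node function is importable (the module, function, and column names are made up for illustration):

```python
import pandas as pd
import pytest

from my_project.pipelines.features.nodes import create_features  # hypothetical node


@pytest.fixture
def raw_data() -> pd.DataFrame:
    # small hand-written frame covering the edge cases you care about
    return pd.DataFrame({"a": [1, 2, None], "b": [15, 42, 11]})


def test_column_a_has_no_nulls(raw_data):
    result = create_features(raw_data)
    assert result["a"].notna().all()


def test_column_b_never_below_10(raw_data):
    result = create_features(raw_data)
    assert (result["b"] >= 10).all()
```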
Hello all! Is there a place where I can specify global options for my Kedro project? For instance, I'd like to preview 20 rows instead of 5 (the default) in Kedro Viz (and I don't want to do it individually for each dataset).
Hi Team! :kedro:
My kedro pipeline is just stuck even before running any nodes
```
[11/14/24 17:09:07] WARNING  /root/.venv/lib/python3.9/site-packages/kedro/framework/startup.py:99: KedroDeprecationWarning: project_version in pyproject.toml is deprecated, use kedro_init_version instead
[11/14/24 17:09:15] INFO     Kedro project project
[11/14/24 17:09:17] WARNING  /root/.venv/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/14 17:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[11/14/24 17:12:53] WARNING  /root/.venv/lib/python3.9/site-packages/pyspark/pandas/__init__.py:49: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
```
Kedro version: 0.18.14
Python version: 3.9
Hi all - I saw this was completed today https://github.com/kedro-org/kedro/pull/4263 and was very excited. Now that it's done, what should my typical Kedro/uv initialization be? How do I get them both to work together without using the copier that created it? What should my working pattern with it be?
Hello, team!
Does anyone know the best (or maybe most kedroic) way to work with a PartitionedDataset by processing the partitions individually (merging them would consume all memory)? I want to apply the same operations to all partitions. Would it be a better idea to use/add namespaces for this (all my files have the format f"sessions_{YYYY-MM-DD}.parquet")? Thank you!
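The usual pattern, sketched below with made-up function names: a PartitionedDataset loads as a dict of partition id to load-callable, so one node can loop over partitions one at a time and return a dict to write one output partition per input partition.

```python
from typing import Any, Callable

def process_sessions(partitions: dict[str, Callable[[], Any]]) -> dict[str, Any]:
    results = {}
    for partition_id, load in partitions.items():  # e.g. "sessions_2024-11-14"
        df = load()                                 # only this partition is in memory
        results[partition_id] = transform(df)       # same operation for every partition
    return results

def transform(df):
    # placeholder for the real per-partition logic
    return df
```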
Hey team,
Is it possible (or is there any workaround) to use a parameter in catalog.yml when using the OmegaConfigLoader? My use case is that I want to select a parameter in Databricks Workflows and have it override a Kedro param at runtime. I was trying to use a global (in globals.yml), as those can be used in catalog.yml, but unfortunately they cannot be overridden at runtime, according to the docs.
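Not a definitive answer, but one workaround sketch for the Databricks side, assuming the job passes the value in as a widget: forward it to the KedroSession as extra_params, which overrides the parameter for that run (recent Kedro versions also document a runtime_params resolver that OmegaConfigLoader can use inside catalog.yml, which may be worth checking). The widget, path, parameter, and pipeline names below are made up.

```python
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = "/Workspace/Repos/me/my-kedro-project"   # hypothetical project path
bootstrap_project(PROJECT_PATH)

threshold = dbutils.widgets.get("threshold")            # Databricks-provided widget

with KedroSession.create(
    project_path=PROJECT_PATH,
    extra_params={"model.threshold": float(threshold)},  # overrides params:model.threshold
) as session:
    session.run(pipeline_name="training")
```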
Has anyone successfully implemented a custom expectation for use with kedro-expectations? When I copy an example of a custom expectation (https://github.com/great-expectations/great_expectations/blob/develop/contrib/expe[…]erimental/expectations/expect_multicolumn_values_to_be_equal.py) into gx/plugins/expectations, GX is not able to find it and throws an exception.
Hi guys, what is the purpose of the session_store.db file after you run a pipeline? Should it be committed to version control? Is it necessary for Kedro to run, or is there a way to configure things so this file won't be created?
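For context, a guess at where it may come from: session_store.db is typically created when kedro-viz's SQLite session store is enabled in settings.py (used for experiment tracking). If that matches your settings.py, removing or commenting out that configuration should stop the file from being created. A sketch of the relevant lines, assuming that setup:

```python
# settings.py -- lines like these enable the SQLite session store that writes
# session_store.db; delete or comment them out if you don't need it
from pathlib import Path

from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
```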
Hi Team,
Is there a way to not run certain Kedro hooks when kedro viz loads? I have a Spark hook defined which runs every time I run kedro viz, and I want to disable it there.
Thanks! 🙂
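One workaround sketch, assuming it's acceptable to gate the hook on an environment variable that you set before launching kedro viz (the variable name is made up; there may be a cleaner switch):

```python
import os

from kedro.framework.hooks import hook_impl


class SparkHooks:
    @hook_impl
    def after_context_created(self, context):
        # e.g. run `DISABLE_SPARK_HOOK=1 kedro viz` to skip Spark startup
        if os.environ.get("DISABLE_SPARK_HOOK") == "1":
            return
        # ... normal SparkSession initialisation goes here ...
```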
Hi folks,
We have our own MLflow server on internal S3.
Below are the settings I use locally:
```python
os.environ["MLFLOW_TRACKING_URI"] = "https://xxx.com/mlflow/"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://s3xxx.com"
os.environ["S3_BUCKET_PATH"] = "s3://xxx/mlflow"
os.environ["AWS_ACCESS_KEY_ID"] = "xxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxx"
os.environ["MLFLOW_TRACKING_USERNAME"] = "xxx"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "xxx"
os.environ["MLFLOW_TRACKING_SERVER_CERT_PATH"] = "C:\\xxx\\ca-bundle.crt"
EXPERIMENT_NAME = "ZeMC012"
```
To use this within the Kedro framework, I created an mlflow.yml file in the conf/local folder with content like this:
```yaml
server:
  mlflow_tracking_uri: https://xxx.com/mlflow/
  MLFLOW_S3_ENDPOINT_URL: http://s3xxx.com
  S3_BUCKET_PATH: s3://xxx/mlflow
  AWS_ACCESS_KEY_ID: xxx
  AWS_SECRET_ACCESS_KEY: xxx
  MLFLOW_TRACKING_USERNAME: xxx
  MLFLOW_TRACKING_PASSWORD: xxx
  MLFLOW_EXPERIMENT_NAME: ZeMC012
  MLFLOW_TRACKING_SERVER_CERT_PATH: C:/xxx/ca-bundle.crt
```
But I got this error:
ValidationError: 8 validation errors for KedroMlflowConfig
Question on project setup.
My workflow usually looks like:
```bash
mkdir new-project
cd new-project
uv venv --python 3.xx
source .venv/bin/activate
uv pip install kedro
kedro new --name new-project
```
Then my directories look like:
```
new-project/
    .venv/
    new-project/
        ... kedro stuff ...
```
but really I wanted the *current* directory to be my Kedro project (at the level where .venv is):
```
new-project/
    ... kedro stuff ...
    .venv/
```
i.e. I want everything in the same directory, without having to move all the Kedro project files one directory up.
Hey team, how can I dynamically overwrite an existing dataset in the Kedro catalog with a new configuration or new data (e.g., changing the file path or dataset content) when running a pipeline from a Jupyter notebook on Databricks? Same question for dynamically overwriting a parameter. This would be a one-time test run, so currently I'm planning to change the notebook on Databricks and then delete the added code for future runs. Any help on this would be great!
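For a one-off notebook run, a sketch of one approach, assuming `catalog` comes from %load_ext kedro.ipython and you then run the pipeline yourself with a runner; the dataset name, file path, and parameter name are made up, and in older kedro-datasets versions the class is spelled ParquetDataSet:

```python
from kedro.framework.project import pipelines
from kedro.runner import SequentialRunner
from kedro_datasets.pandas import ParquetDataset

# point an existing catalog entry at a different file for this run only
catalog.add(
    "model_input",
    ParquetDataset(filepath="/dbfs/tmp/alternative_model_input.parquet"),
    replace=True,
)

# overwrite a parameter the same way (parameters live in the catalog as "params:...")
catalog.add_feed_dict({"params:train.test_size": 0.3}, replace=True)

# run the registered pipeline against the modified catalog
SequentialRunner().run(pipelines["__default__"], catalog)
```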