Hi Team! :kedro:
My kedro pipeline is just stuck even before running any nodes
[11/14/24 17:09:07] WARNING /root/.venv/lib/python3.9/site-packages/kedro/framework/startup.py:99: KedroDeprecationWarning: project_version in pyproject.toml is deprecated, use kedro_init_version instead warnings.warn( warnings.py:109
[11/14/24 17:09:15] INFO Kedro project project session.py:365
[11/14/24 17:09:17] WARNING /root/.venv/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader warnings.warn( warnings.py:109
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/14 17:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[11/14/24 17:12:53] WARNING /root/.venv/lib/python3.9/site-packages/pyspark/pandas/__init__.py:49: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.warn( warnings.py:109
Kedro version: 0.18.14
Python version: 3.9
Yes, that's correct. I think it probably has to do with the Docker container on my M3 Mac. The same container environment works perfectly on a GCP VM.
What's weird is that only the kedro pipeline gets stuck, not other Python scripts doing similar things (but without kedro).
Summary: the problem only occurs on my local Mac, inside the Docker container, when I run the kedro pipeline.
In case it's related, could you try to address the "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable" warning?
Thanks for pointing this out! Let me look for some fixes and see if that solves the problem.
Update: It took 5 minutes to start the 1st node, which means kedro was stuck on something even before starting the pipeline. It then got stuck at another node for hours and never finished.
Update: Solved ✅
Too few cores were allocated to the Spark driver (only 1). I increased spark.driver.cores to 4, and then it ran smoothly.
This might be because pyspark installations in different environments set the driver cores differently, hence the difference (not sure).
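For anyone landing here later, a minimal sketch of how the setting above could be applied when the SparkSession is built. The helper names (`spark_options`, `build_session`) and the app name are made up for illustration, and where this lives in a Kedro project (for example a hook or a `spark.yml` entry) is an assumption, not something confirmed in this thread:

```python
from typing import Dict


def spark_options(driver_cores: int = 4) -> Dict[str, str]:
    """Return Spark settings as a flat dict (hypothetical helper)."""
    if driver_cores < 1:
        raise ValueError("driver_cores must be at least 1")
    # spark.driver.cores is the setting raised from 1 to 4 in this thread.
    return {"spark.driver.cores": str(driver_cores)}


def build_session(options: Dict[str, str]):
    """Create (or reuse) a SparkSession with the given options applied."""
    # Imported lazily so spark_options() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("kedro-local")  # app name is an assumption
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Usage would be something like `spark = build_session(spark_options(4))` before the pipeline runs. One caveat: Spark's configuration docs describe `spark.driver.cores` as applying in cluster mode, and with a local master the thread count comes from the master URL (e.g. `local[4]`), so that may be worth checking too if the hang comes back.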