My Kedro Pipeline Is Just Stuck Even Before Running Any Nodes

Hi Team!

My Kedro pipeline gets stuck before it even starts running any nodes:

[11/14/24 17:09:07] WARNING  /root/.venv/lib/python3.9/site-packages/kedro/framework/startup.py:99 warnings.py:109
                             : KedroDeprecationWarning: project_version in pyproject.toml is                      
                             deprecated, use kedro_init_version instead                                           
                               warnings.warn(                                                                     
                                                                                                                  
[11/14/24 17:09:15] INFO     Kedro project project                                                  session.py:365
[11/14/24 17:09:17] WARNING  /root/.venv/lib/python3.9/site-packages/kedro/framework/session/sessi warnings.py:109
                             on.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will                 
                             be deprecated in Kedro 0.19. Please use the OmegaConfigLoader                        
                             instead. To consult the documentation for OmegaConfigLoader, see                     
                             here:                                                                                
                             https://docs.kedro.org/en/stable/configuration/advanced_configuration
                             .html#omegaconfigloader                                                              
                               warnings.warn(                                                                     
                                                                                                                  
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/14 17:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[11/14/24 17:12:53] WARNING  /root/.venv/lib/python3.9/site-packages/pyspark/pandas/__init__.py:49 warnings.py:109
                             : UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not                
                             set. It is required to set this environment variable to '1' in both                  
                             driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark                 
                             will set it for you but it does not work if there is a Spark context                 
                             already launched.                                                                    
                               warnings.warn(  

  • kedro: 0.18.14
  • python: 3.9
  • Running inside a Docker container (since the requirements don't compile on M* Macs)

I understand this is too little information to help, but I have the same problem. Is there anywhere I could look to see where it is stuck?
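
One generic way to see where a hung Python process is sitting is the standard-library faulthandler module, which can dump the traceback of every thread on a timer. The sketch below is not Kedro-specific; the helper names arm_hang_watchdog and disarm_hang_watchdog are made up for illustration, and the 120-second interval is an arbitrary assumption.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=120.0, stream=None):
    """Dump every thread's Python traceback each `timeout_s` seconds until disarmed."""
    # faulthandler writes straight to the file descriptor, so `stream` needs a
    # real fileno() (sys.stderr normally has one, an in-memory buffer does not).
    faulthandler.dump_traceback_later(
        timeout_s,
        repeat=True,
        file=stream if stream is not None else sys.stderr,
    )


def disarm_hang_watchdog():
    """Stop the periodic dumps, e.g. once the pipeline has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling arm_hang_watchdog() as early as possible (for example at the top of the project's settings.py, which is an assumption about where it fits best) should print a stack trace to the console every two minutes while the run is stuck. The dump only covers Python frames, so a main thread parked inside a py4j call would suggest the wait is on the Spark JVM side rather than in Kedro itself.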

Hi! To clarify, nothing happens after the [11/14/24 17:12:53] log line?

Yes, that's correct. I think it probably has to do with the Docker container on my M3 Mac. The same container environment works perfectly on a GCP VM.

What's weird, though, is that only the Kedro pipeline gets stuck, not other Python scripts doing similar things (but without Kedro).

Summary: the problem only occurs on my local Mac, inside the Docker container, when I run a Kedro pipeline.

In case it's related, could you try to address the warning "WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"?

Thanks for pointing this out! Let me look for a fix and see if that solves the problem.

Update: It took 5 minutes to start the first node, which means Kedro was stuck on something even before starting the pipeline. It then got stuck at another node for hours and never finished.

Attachment
Screenshot 2024-11-15 at 7.25.23 AM.png

Update: Solved ✅
The Spark driver had too few cores allocated (only 1). I increased spark.driver.cores to 4, and then it ran smoothly.

This might be related to how PySpark installations in different environments set the driver cores differently, hence the difference (not sure).
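
For anyone hitting the same thing, here is a minimal sketch of how the setting could be applied and how the CPU count seen inside the container could be compared across machines. The helper names and the app name are hypothetical, and the thread does not say where the option was actually set; in projects that follow the Kedro PySpark docs it may instead belong in conf/base/spark.yml.

```python
import os


def visible_cpus():
    """CPUs the OS exposes to this process (may not reflect cgroup CPU limits)."""
    return os.cpu_count() or 1


def driver_settings(cores=4):
    """Spark options matching the fix described above (4 is the value from the thread)."""
    return {"spark.driver.cores": str(cores)}


def build_session(options):
    """Create a SparkSession with the given options (requires pyspark)."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("kedro-debug")  # hypothetical app name
    for key, value in options.items():
        builder = builder.config(key, value)
    # If a session already exists, getOrCreate() returns it and driver-level
    # options set here may not take effect.
    return builder.getOrCreate()
```

Printing visible_cpus() inside both the Mac container and the GCP VM is a quick way to check whether the two environments really expose different resources, and build_session(driver_settings(4)) would apply the value that resolved the hang here.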

Glad you could solve it!