My Kedro Pipeline Is Just Stuck Even Before Running Any Nodes

At a glance

A community member's Kedro pipeline gets stuck before running any nodes. The logs show Kedro deprecation warnings and several warnings from the Spark environment. The problem is specific to their local Docker container on an M3 Mac: the same container environment works fine on a GCP VM. They tried addressing the "WARN NativeCodeLoader" warning, but the pipeline still got stuck, taking 5 minutes to start the first node and then stalling at another node for hours. The issue was eventually solved by increasing spark.driver.cores from 1 to 4, after which the pipeline ran smoothly.

Abhishek Bhatia

Hi Team! :kedro:

My kedro pipeline is just stuck even before running any nodes

[11/14/24 17:09:07] WARNING  /root/.venv/lib/python3.9/site-packages/kedro/framework/startup.py:99 warnings.py:109
                             : KedroDeprecationWarning: project_version in pyproject.toml is                      
                             deprecated, use kedro_init_version instead                                           
                               warnings.warn(                                                                     
                                                                                                                  
[11/14/24 17:09:15] INFO     Kedro project project                                                  session.py:365
[11/14/24 17:09:17] WARNING  /root/.venv/lib/python3.9/site-packages/kedro/framework/session/sessi warnings.py:109
                             on.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will                 
                             be deprecated in Kedro 0.19. Please use the OmegaConfigLoader                        
                             instead. To consult the documentation for OmegaConfigLoader, see                     
                             here:                                                                                
                             https://docs.kedro.org/en/stable/configuration/advanced_configuration
                             .html#omegaconfigloader
                               warnings.warn(                                                                     
                                                                                                                  
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/14 17:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[11/14/24 17:12:53] WARNING  /root/.venv/lib/python3.9/site-packages/pyspark/pandas/__init__.py:49 warnings.py:109
                             : UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not                
                             set. It is required to set this environment variable to '1' in both                  
                             driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark                 
                             will set it for you but it does not work if there is a Spark context                 
                             already launched.                                                                    
                               warnings.warn(

kedro: 0.18.14
python: 3.9
Running inside a docker container (since requirements don't compile on M* macs)

I understand this is too little information to help, but I have the same problem. Is there anywhere I could look to see where it is stuck?
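
For the general question of finding where a Python process is stuck, one option is the standard library's faulthandler module, which can print every thread's stack trace on a timer. The following is only a minimal sketch: the helper names and the 60-second interval are illustrative assumptions, and the call would need to run early in the process (for example from the project's settings.py, which is also an assumption about placement).

```python
import faulthandler
import sys


def enable_hang_dumps(interval_seconds: int = 60, file=None) -> None:
    """Print the stack of every Python thread every `interval_seconds`.

    Hypothetical helper: the name and default interval are illustrative.
    `file` must be a real file object with a file descriptor.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=target)


def disable_hang_dumps() -> None:
    """Stop the periodic dumps once the stuck spot has been identified."""
    faulthandler.cancel_dump_traceback_later()
```

The dumps only show Python frames. If the top frames sit inside a py4j call, the Python side is waiting on the Spark JVM rather than on Kedro itself.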

8 comments

Juan Luis Cano Rodríguez

Hi! To clarify, nothing happens after the [11/14/24 17:12:53] log?

Abhishek Bhatia

Yes, that's correct. I think it probably has to do with the Docker container on my M3 Mac. The same container environment works perfectly on a GCP VM.

What's weird is that only the Kedro pipeline gets stuck, not other Python scripts doing similar things without Kedro.

Abhishek Bhatia

Summary: The problem occurs only on my local Mac, inside the Docker container, when I run the Kedro pipeline.

Juan Luis Cano Rodríguez

In case it's related, could you try to address this warning? WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Abhishek Bhatia

Thanks for pointing this out! Let me look for fixes for this and see if that solves the problem.

Abhishek Bhatia

Update: It took 5 minutes to start the first node, which means Kedro was stuck on something even before starting the pipeline. It then got stuck at another node for hours and never finished.

Abhishek Bhatia

Update: Solved ✅
The Spark driver had too few cores allocated (only 1). I increased spark.driver.cores to 4, and then it ran smoothly.

It might be related to how PySpark installations in different environments set the driver cores differently, hence the difference (not sure).
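
The thread does not show where the setting was changed. One possible way to apply it from Python is sketched below: the function names, the app name, and the optional master argument are illustrative assumptions, not taken from the poster's project. Only spark.driver.cores = 4 comes from the post above.

```python
from typing import Dict, Optional


def spark_settings(driver_cores: int = 4, master: Optional[str] = None) -> Dict[str, str]:
    """Return Spark settings as a plain dict of strings.

    Hypothetical helper; only spark.driver.cores = 4 is taken from the thread.
    """
    settings = {"spark.driver.cores": str(driver_cores)}
    if master is not None:
        # Optional and not from the thread, e.g. "local[4]" for local mode.
        settings["spark.master"] = master
    return settings


def build_spark_session(app_name: str = "kedro-spark", **kwargs):
    """Create (or reuse) a SparkSession configured with spark_settings()."""
    # Imported lazily so spark_settings() can be used without pyspark installed.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf().setAll(list(spark_settings(**kwargs).items()))
    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
```

Driver settings like this one are read when the session is first created, so they need to be in place before any other code starts Spark. Note also that Spark's configuration documentation describes spark.driver.cores as a cluster-mode setting, while in local mode the master URL (for example local[4]) sets the number of worker threads, so that value may be worth checking as well.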

Juan Luis Cano Rodríguez

Glad you could solve it!
