Issues With Kedro 0.19.11 Release While Running Pipelines In Databricks

Question

Hello! I am running into issues with the Kedro 0.19.11 release while running pipelines in Databricks. Specifically, a Python module imported by one of my nodes is unable to find the active SparkSession via SparkSession.getActiveSession(). Our pipeline consists entirely of Ibis.TableDataset datasets and I/O with the pyspark backend. What is throwing me is that other nodes use the pyspark connection and perform operations across the Spark session without problems; only this single node fails, because the imported module it relies on cannot find the Spark session. This issue is not present in Kedro 0.19.10. My best guess is that it has something to do with the updated code in kedro/runner/sequential_runner.py, which now uses ThreadPoolExecutor, and possible scoping issues? Apologies for the somewhat scattered explanation; there is quite a bit I don't fully understand here, so I appreciate any help or guidance. Let me know if I can provide any additional info as well.

Elena Khaustova · Answer

Hi @Jacob Pieniazek, indeed, the sequential runner was changed to use ThreadPoolExecutor with a single thread. However, we've found that this affects runs that are not thread-safe (see https://github.com/kedro-org/kedro/issues/4486), so we're currently rolling back to the old approach. A quick way to check whether that's the case for you is to see whether you get the same error on Kedro 0.19.10 when you use ThreadRunner instead of the default SequentialRunner.
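
To illustrate the kind of scoping problem a single worker thread can introduce, here is a minimal sketch using only the Python standard library. It assumes the "active session" is tracked per thread; whether PySpark behaves exactly this way in a given environment is an assumption, and `_state`, `set_active_session`, and `get_active_session` are hypothetical stand-ins, not Spark or Kedro APIs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-thread "active session" state.
# This is NOT how Spark or Kedro is implemented; it only mimics the symptom.
_state = threading.local()


def set_active_session(name):
    """Register a session for the calling thread only."""
    _state.session = name


def get_active_session():
    """Return the calling thread's session, or None if it never set one."""
    return getattr(_state, "session", None)


def run_node_directly():
    # Runs in the calling (main) thread, like the pre-0.19.11 sequential runner.
    return get_active_session()


def run_node_in_pool():
    # Runs in a worker thread, like a runner built on ThreadPoolExecutor.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(get_active_session).result()


set_active_session("main-thread-session")
direct_result = run_node_directly()
pooled_result = run_node_in_pool()

print(direct_result)  # main-thread-session
print(pooled_result)  # None
```

Even with one worker, the node body executes on a different thread from the one that set up the session, so any lookup that depends on thread-local state can come back empty.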

Jacob Pieniazek · Answer

Hey @Elena Khaustova, thanks for the quick reply! Running the pipeline with ThreadRunner in Kedro 0.19.10 did indeed throw the same error.

Elena Khaustova · Answer

Then I would recommend staying on Kedro 0.19.10 for now. A fix will be available in Kedro 0.19.12.
