Hello! I am running into issues with the Kedro 0.19.11 release while running pipelines in Databricks. Specifically, a Python module imported by one node is unable to find the active SparkSession via SparkSession.getActiveSession() (see first image). Our pipeline consists entirely of Ibis.TableDataset datasets and I/O with the PySpark backend. What is throwing me off is that other nodes use the PySpark connection and operate on the Spark session without problems; only this single node fails, when it calls into an imported module that cannot find the Spark session. This issue is not present in Kedro 0.19.10. My best guess is that it has something to do with the updated code in kedro/runner/sequential_runner.py using ThreadPoolExecutor, and possibly scoping issues? Apologies for the somewhat scattered explanation; there is quite a bit I don't fully understand here, so I appreciate any help or guidance. Let me know if I can provide any additional info as well.
Hi @Jacob Pieniazek, indeed there were some changes in the sequential runner that made it use ThreadPoolExecutor with one thread. But we've figured out that this affects runs that are not thread-safe:
https://github.com/kedro-org/kedro/issues/4486
So we're currently rolling back to the old approach. A quick way to check whether that's the case for you is to see whether you get the same error on Kedro 0.19.10 if you use ThreadRunner instead of the default SequentialRunner.
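To illustrate why a single-thread ThreadPoolExecutor can still break code like this, here is a minimal sketch in plain Python. It assumes the "active session" is tracked per thread, which is how the symptom in this thread behaves; it does not use PySpark or Kedro, and the names set_active_session, get_active_session and node are hypothetical stand-ins, not real APIs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-thread "active session" state.
# This only mimics the symptom; it is not PySpark's implementation.
_state = threading.local()


def set_active_session(name):
    """Register a session for the calling thread only."""
    _state.session = name


def get_active_session():
    """Return the calling thread's session, or None if none was set."""
    return getattr(_state, "session", None)


def node():
    """A node whose imported helper looks up the active session."""
    return get_active_session()


# The session is registered on the main thread.
set_active_session("main-thread-session")

# Calling the node directly runs it on the main thread.
direct_result = node()

# Submitting it to an executor runs it on a worker thread,
# even when the pool has only one worker.
with ThreadPoolExecutor(max_workers=1) as pool:
    pooled_result = pool.submit(node).result()

print(direct_result)  # main-thread-session
print(pooled_result)  # None
```

The point is that "one thread" in the pool is still a different thread from the one that created the session, so any per-thread lookup can come back empty.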
Hey @Elena Khaustova, thanks for the quick reply! Running the pipeline with ThreadRunner in Kedro 0.19.10 did indeed throw the same error.
Then I would recommend staying on Kedro 0.19.10 for now. The fix will be available in Kedro 0.19.12.
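If it helps to guard against accidentally picking up the affected release, a small check along these lines could run at project start-up. This is only a sketch: the helper names are hypothetical, and the set of affected versions contains just 0.19.11 because that is the only release named in this thread.

```python
from importlib.metadata import PackageNotFoundError, version

# Only 0.19.11 is named as affected in this thread (assumption: no others).
AFFECTED_VERSIONS = {(0, 19, 11)}


def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints; return None if it does not parse."""
    try:
        return tuple(int(part) for part in text.split("."))
    except ValueError:
        # Pre-release or otherwise non-numeric versions are not handled here.
        return None


def is_affected(version_text):
    """Return True if the given Kedro version string is a known affected one."""
    return parse_version(version_text) in AFFECTED_VERSIONS


def installed_kedro_is_affected():
    """Check the installed Kedro, treating 'not installed' as not affected."""
    try:
        return is_affected(version("kedro"))
    except PackageNotFoundError:
        return False


if installed_kedro_is_affected():
    print("Kedro 0.19.11 detected; consider pinning to 0.19.10 until 0.19.12.")
```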