Running Kedro Pipelines in a Prefect Deployment

Hey all! I'm working on tooling around running Kedro pipelines in our (pre-existing) Prefect deployment. I've been following the lead of the example from the docs and things were going pretty smoothly until I came around to logging. Logging in Prefect is a little finicky, but what I'd like to do is route the Kedro logs through to the Prefect loggers and handlers. Happy to go into more detail about what I've tried, but figured I'd first ask if anyone has experience here? Is there some other way to expose Kedro logs in the Prefect UI (which is ultimately my goal)?


Prior to using Kedro (currently working on the first "real" pilot), our pattern would be to grab the logger via Prefect's get_logger methods and then inject that down into our pipeline code.
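Roughly, the injection pattern looks like this (a minimal sketch with a stdlib logger standing in for Prefect's get_run_logger(); the function names and pipeline shape are illustrative, not our actual code):

```python
import logging

def run_pipeline(records: list[int], logger: logging.Logger) -> int:
    """Hypothetical pipeline step that receives its logger explicitly."""
    logger.info("processing %d records", len(records))
    return sum(records)

# Inside a real Prefect flow this would be: logger = get_run_logger()
logger = logging.getLogger("demo.flow")
total = run_pipeline([1, 2, 3], logger)
```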

But, this doesn't really seem like what we'd want with Kedro (at least I don't think).

Am I thinking about this right?

more context: I've implemented two custom datasets and it's the logs from that code that I want to expose via the prefect logger

the closest I've gotten was defining a custom handler for the kedro logger, and having that code instantiate the prefect run logger and pass the logs through, but this wasn't capturing all of the logs from the datasets.
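For the record, the custom-handler approach looked roughly like this (a minimal stdlib sketch; the real version resolved Prefect's run logger inside emit(), which is exactly where it fell over):

```python
import logging

class ForwardingHandler(logging.Handler):
    """Re-route records from one logger hierarchy to another.

    Sketch of the approach described above. A real version would look up
    Prefect's run logger in emit() instead of holding a fixed target.
    """
    def __init__(self, target: logging.Logger):
        super().__init__()
        self.target = target

    def emit(self, record: logging.LogRecord) -> None:
        # Dispatch the record through the target logger's handlers.
        self.target.handle(record)

# Wire it up: anything logged under "kedro" also reaches "prefect.demo".
target = logging.getLogger("prefect.demo")
logging.getLogger("kedro").addHandler(ForwardingHandler(target))
```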

i sort of suspect the issue was conflicting logging configuration between kedro and prefect, but haven't gone down that rabbit hole yet

or maybe something wrong with how I configured this, but it felt like too deep a rabbit hole to go down before asking y'all for help 😅

I also didn't pursue that further because I switched to trying to get Prefect to pick up the kedro logs using https://docs.prefect.io/v3/develop/logging#include-logs-from-other-libraries but I haven't been able to get that to work either (no matter how I try to configure this I don't see the kedro logs in the UI)

I found the same thing last time I tried running Kedro in Prefect. I tweaked the logging a bit, does this help? https://github.com/astrojuanlu/workshop-from-zero-to-mlops/blob/main/conf/logging_prefect.yml
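For anyone skimming later, the gist of that file is to attach Prefect's API log handler to the kedro logger, something like (class path as used in the linked config; double-check it against your Prefect version):

```yaml
version: 1
disable_existing_loggers: false

handlers:
  prefect:
    class: prefect.logging.handlers.APILogHandler

loggers:
  kedro:
    level: INFO
    handlers: [prefect]
```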

I see, this is much simpler than what I did: just tell the kedro logger to use the prefect handler. It isn't resolving the issue, but it does give a clue:

[02/20/25 16:34:16] WARNING  /usr/local/lib/python3.12/site-packages/kedro/framework/project/__init__.py:270: UserWarning: Logger 'kedro.framework.project' attempted to send logs to the API without a flow run id. The API log handler can only send logs within flow run contexts unless the flow run id is manually provided. (warnings.py:110)
my handler was eating these warnings :facepalming: .... I think this means that the logs that I'm looking for are being generated outside a prefect flow/task

well, it definitely means this

i'm just surprised by it

yeah I don’t know why logging in Prefect is so special

Probably due to the distributed nature? not so sure tho

yeah I think that's right, but it does feel a bit over engineered

in order to get the logs to the UI (which is not where the actual tasks are run) the logs are sent through an API

and in order for that to work properly the prefect loggers need to be used so that the appropriate metadata (task run id, etc...) can be attached

but after digging deeper, something is happening down inside Kedro that is OUTSIDE the Prefect context

hard to say exactly what without knowing/learning more about the implementation details of Kedro

it looks like the actual execution of the node implementations is not inheriting the prefect context

i assume because of async things or threading things or something

but as far as I can tell there are places in kedro (the load and save methods of a Dataset being an example) where there's no practical way of getting logs to the prefect UI

because the execution of that code is happening outside the prefect flow/task context and you can't call get_run_logger

It may be possible to manually track the necessary metadata (task run ids, etc...) and then manually inject that before calling the prefect api log handler

but i am very hesitant to go down that road because it feels like it would be very brittle (ie, if prefect changes internal implementations of the logging api, this code will break)

in our particular case we have things set up so that all logs from any container in our cluster get shipped to logzio, so I think we'll have to live with a world where we have to go there to see all the logs (as opposed to the prefect UI), which is not ideal but fine

i think the tldr for this thread is that:
1. it appears kedro is not preserving context when doing threading or something similar
2. maybe this could be a nice contribution? attempt to, in general, preserve any active context through to all parts of the pipeline execution (assuming this is even possible/feasible)

what does "context" mean in this... ehm... context? 😬

I think I mean "context" in the python context management sense

but a very fair question 😆

(I say I think because I would still consider everything I've said in this thread conjecture)

and I also don't REALLY know the details of what kedro is doing

lol. or what prefect is really doing i guess

ie, it may actually be some prefect specific notion of context

i'm going to dig a little deeper and see if this really does come down to how kedro is handling Python Context variables.

please keep us posted with whatever you find 🙏 I have no idea if Kedro is doing anything funky here

i sort of suspect it's more like kedro isn't looking for and passing through context variables rather than kedro doing something funky

if that's the case, it's a totally reasonable decision to not worry about it without some specific need for it

which this may be

this = handling things like prefect logging

confirmed that Prefect is using context variables to store the flow/task context information https://github.com/PrefectHQ/prefect/blob/main/src/prefect/context.py#L129

well, at this point this is more like a github issue than a slack thread, so i'm going to collect notes and if it seems like something worth considering i'll open an issue

ok one more post to close the loop.... I think this may be a fast and good enhancement after all

it looks like Kedro is using concurrent.futures to actually run things https://github.com/kedro-org/kedro/blob/main/kedro/runner/runner.py#L243

which means adding a context = contextvars.copy_context() and passing that into submit may do the trick

going to try this and if it works i'll open an issue
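The idea above can be checked with a self-contained stdlib snippet (a ContextVar standing in for Prefect's flow-run context): values set in the main thread are invisible inside a ThreadPoolExecutor worker unless the copied context is passed through submit.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the per-run context Prefect stores in a ContextVar.
flow_run_id = contextvars.ContextVar("flow_run_id", default=None)

def read_ctx():
    return flow_run_id.get()

flow_run_id.set("abc-123")

with ThreadPoolExecutor(max_workers=1) as pool:
    # Worker threads start with a fresh top-level context: the value is lost.
    lost = pool.submit(read_ctx).result()
    # Copy the caller's context and run the task inside that copy instead.
    ctx = contextvars.copy_context()
    kept = pool.submit(ctx.run, read_ctx).result()

# lost is None; kept is "abc-123"
```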

There have been some changes around this, so be careful about which version of Kedro you're using. This was added only in the most recent release (or the one before).
