Hello, I want to use a namespaced pipeline and data catalog to get a series of dataframes, do some manipulations, and then save them all in one Excel spreadsheet in different sheets. I thought something like this would work in the catalog:
"{namespace}.spreadsheet_data": type: pandas.ExcelDataset filepath: data/03_primary/all_data_sources.xlsx save_args: sheet_name: "{namespace}_data"but this doesn't work. I just end up with a spreadsheet with one sheet - with the name of whatever namespace ran last. I.e. it must be overwriting it each time.
Hello all,
I have a question about datasets and namespaces. I am not even sure if what I am asking for is possible.
Here is a simplified version of the issue:
I have postgresql database which updates daily with data (predictions from some other models but that is beside the point)
One of the columns in the results_table is called "run_date". So if I want today's results I can do this:
(in catalog.yml):oneday_of_data:
type: pandas.SQLQueryDataset
credentials: db_credentials
sql: "select * from results_table where run_date = %(run_date)s"
load_args:
params:
run_date: 2024-11-01
this dataset combined with this one node pipeline lets me get the data from the database into my local drive.
(in pipeline.py)pipeline([
node(
func=lambda x: x,
inputs="database_data",
outputs="local_data",
name="sql_to_local_node"
),]
)
now, if I wanted more than one day's data as different datasets, it seems like this a great candidate for namespacing because nothing changes except for the run date. Like this:
(in catalog.yml)_run_dates:
run_1: 2024-11-01 #today
run_2: 2024-10-30 #yesterday
run_3: 2024-10-25 #a week ago
"{run_name}.oneday_of_data":
type: pandas.SQLQueryDataset
credentials: db_credentials
sql: "select * from results_table where run_date = %(run_date)"
load_args:
params:
run_date: ${_run_dates.{run_name}}
but no matter what I try I can't get this to work. I know I can specify {run_name} in the filepath field (if it was a csv dataset say) but is it possible to use inside a templated/variable-interpolated field like this?
I have tried writing my own custom resolver (called "picker") defined as:
(in settings.py)def pick_from_rundates(dict1, key):
return dict1[key]
CONFIG_LOADER_ARGS = {
"base_env": "base",
"default_run_env": "local",
"custom_resolvers": {"picker": lambda x, y: pick_from_rundates(x, y)}
}
and then tried this...which also failed:
(in catalog.yml)"{run_name}.oneday_of_data":
type: pandas.SQLQueryDataset
credentials: db_credentials
sql: "select * from results_table where run_date = %(run_date)"
load_args:
params:
run_date: ${picker:${_run_dates},"{run_name}"}
So am I missing something simple here or is this fundamentally not allowed? Ideally the run dates would be specified in from the globals.yml instead of directly in the catalog.yml but I am trying to walk before I run here.
I will be grateful for any advice here.
minesh