Enforcing Schema Data Type on Catalog

Hey Guys, quick question:

Is there a way to enforce the schema data types on the catalog?

Like:

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
    schema:


so that I can specify which columns I want to be in a certain data type

When reading, you can use the dtype arg of pandas’ read_csv method in the load_args field. Is that what you are looking for? So:

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
    dtype: {}
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
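For reference, here is a minimal sketch of what that `dtype` in `load_args` maps to in pandas (the column names and dtypes below are hypothetical, and an in-memory string stands in for the CSV file):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data/01_raw/company/cars.csv
csv_text = "name,year,price\nastra,2017,17500.0\ncorsa,2018,15000.0\n"

# Equivalent of load_args: {sep: ',', dtype: {...}} in the catalog entry:
# pandas applies the given dtypes while parsing instead of inferring them.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",
    dtype={"name": "string", "year": "int32", "price": "float64"},
)

print(df.dtypes)
```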

Not on loading, but when saving it.

Unfortunately no… pandas does not have such an option, as it infers the types from df.dtypes when writing to CSV. You can always cast to the correct dtypes from within the node.
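Casting inside a node can look like this minimal sketch (the node name, columns, and dtypes are hypothetical):

```python
import pandas as pd


def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Node that casts columns to the desired dtypes before the dataset saves them."""
    schema = {"name": "string", "year": "int32", "price": "float64"}
    return df.astype(schema)


# Example usage: everything arrives as strings, the node enforces the schema.
cars = pd.DataFrame({"name": ["astra"], "year": ["2017"], "price": ["17500"]})
typed = enforce_schema(cars)
print(typed.dtypes)
```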

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html

table_schema list of dicts, optional
List of BigQuery table fields to which according DataFrame columns conform to, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If schema is not provided, it will be generated according to dtypes of DataFrame columns. See BigQuery API documentation on available names of a field.
New in version 0.3.1 of pandas-gbq.
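Assuming the dataset forwards `save_args` to pandas-gbq's `to_gbq` (the dataset, table, and project names below are hypothetical), a catalog entry passing `table_schema` could look like:

```yaml
cars_bq:
  type: pandas.GBQTableDataset
  dataset: company
  table_name: cars
  project: my-gcp-project
  save_args:
    table_schema:
      - {name: name, type: STRING}
      - {name: year, type: INTEGER}
      - {name: price, type: FLOAT}
```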

so it seems that it is possible

To be clear, it's not a pandas issue. CSV has no types because it is plain text, which is why it's generally known as a bad format for any data processing pipeline.

The reason you can do that with BigQuery is that it is a typed system that stores your dataframe as a table.

If you want to preserve type, use something like Parquet
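A quick sketch of the difference: a CSV round trip throws the dtypes away and pandas re-infers them, while Parquet stores them in the file's schema (the Parquet part is left commented since it needs pyarrow installed):

```python
import io

import pandas as pd

df = pd.DataFrame({"year": pd.array([2017, 2018], dtype="int32")})

# CSV round trip: dtypes are not stored, so pandas re-infers int64 on load.
csv_back = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(csv_back["year"].dtype)  # not int32 anymore

# Parquet round trip (requires pyarrow): the int32 dtype survives.
# df.to_parquet("cars.parquet")
# pd.read_parquet("cars.parquet")["year"].dtype
```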

So there's no way to feed a BQ table with specific data types for the columns using type: pandas.GBQTableDataSet?

You can definitely do that with BigQuery; I said you cannot save CSV with types, as that was the original example you provided.

oh yeah, my bad, I'm using
type: pandas.GBQTableDataSet

but for some reason I got

kedro.io.core.DatasetError: Failed while saving data to data set GBQTableDataset
Could not convert DataFrame to Parquet.
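A common cause of that error is an object column with mixed types, which the Parquet serialisation step can't map to a single column type. A sketch of the symptom and one possible fix (the column name is hypothetical, and `errors="coerce"` is an assumption about how you want to handle bad rows):

```python
import pandas as pd

# A mixed-type object column like this is a typical trigger for
# "Could not convert DataFrame to Parquet" during the save step.
df = pd.DataFrame({"price": [17500.0, "unknown", 15000.0]})
print(df["price"].dtype)  # object: floats and strings mixed

# Casting to one consistent dtype in a node before saving avoids it;
# errors="coerce" turns non-numeric values into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].dtype)
```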
