Optimizing ibis code for filters and field conditions

VVishal Pandey

Hi everyone

I have been exploring ibis for sometime. I just wanted to understand is there a better way to write the below code in a more optimised fashion

import ibis

con = ibis.connect(POSTGRES_CONNECTION_STRING)
training_meta_table:ir.Table = con.table("training_metadata")

filters = {
    "customer_ids" : [59] ,
    "queue_names" : ["General Lit - Misclassifications", "MoveDocs-MR"],
    "start_date" : "2024-09-5 00:00:00",
    "end_date" : "2024-09-11 00:00:00",
    "doc_types" : [],
    "fields" : ["patientFirstName", "patientLastName", "Service Date", "Doctor"]
}
field_conditions = training_meta_table.fields_present.contains(filters["fields"][0]) | training_meta_table.fields_present.contains(filters["fields"][1]) | training_meta_table.fields_present.contains(filters["fields"][2]) | training_meta_table.fields_present.contains(filters["fields"][3])

So there are many or conditions we would like to dynamically join together to create 1 final condition based on the input filters

5 comments

VVishal Pandey

Any experience with ibis

ddatajoely

https://ibis-project.org/how-to/analytics/chain_expressions.html

VVishal Pandey

just referred it , i could not make much use of it.

if you see we need to kind of iterate through all field conditions(List) present in the filters dictionary and create ibis expression for each of them.

ddatajoely

A bit of playing with chatgpt, I think this is the most readable:

import ibis
from functools import reduce
import operator

# Establish connection
con = ibis.connect(POSTGRES_CONNECTION_STRING)
training_meta = con.table("training_metadata")

# Define filters
filters = {
    "customer_ids": [59],
    "queue_names": ["General Lit - Misclassifications", "MoveDocs-MR"],
    "start_date": "2024-09-05 00:00:00",
    "end_date": "2024-09-11 00:00:00",
    "fields": ["patientFirstName", "patientLastName", "Service Date", "Doctor"]
}

# Helper function to build filter conditions
def build_filters(table, filters):
    conditions = [
        table.customer_id.isin(filters["customer_ids"]),
        table.queue_name.isin(filters["queue_names"]),
        table.date.between(filters["start_date"], filters["end_date"]),
        table.fields_present.overlaps(filters["fields"])
    ]
    return reduce(operator.and_, conditions)

# Apply filters and select fields using the helper function
result = (
    training_meta
    .filter(build_filters(training_meta, filters))
    .select(filters["fields"])
)

# Execute the query (uncomment when ready)
# fetched_results = result.execute()

VVishal Pandey

Thanks will test and let you know. But I am curious to know the use of overlaps() here.

Just a bit of context here - The fields_present column in the table contains comma separated strings , attaching some examples as well here -

Document Date, Dr. Name, Doctor, Sender Free Text, Sender, Sender Logo, Sender Signature

Document Type, patientDOB, patientFirstName, patientLastName, providerEntity, referringPhysician, serviceDate

and we have to return the row if any of the value present in filters["fields"] is found as a substring in the fields_present column and that is why I used the code below :

training_meta_table.fields_present.contains(filters["fields"][0])

contains() comes from ibis string expressions.

But you see it is difficult to manually create such ibis expressions for the whole list of values

Add a reply

Join on Slack

Join the Kedro community

Optimizing ibis code for filters and field conditions