Avoid triggering expensive computations prematurely. Operations such as collect(), heavy I/O, or algorithms with quadratic complexity should be deferred until their results are actually needed, so that no work is spent on values that end up discarded.
For example, instead of:
# This could trigger heavy computation that gets discarded
if strict:
    nrows = first.select(F.len()).collect()[0, 0]
Consider deferring the operation:
# Defer the expensive operation to when it's actually needed
pl.scan_iceberg(iceberg_path, snapshot_id=1234567890).collect()
In performance-critical code, this approach reduces wasted work and can noticeably improve performance, especially with large datasets and complex query plans.
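The deferral pattern can be sketched without any dataframe library. The example below is a minimal, standard-library-only illustration, not the Polars API: `Deferred`, `expensive_row_count`, and `validate` are hypothetical names introduced here, and the `calls` list exists only to show when the expensive step actually runs.

```python
from typing import Callable, List, Optional


class Deferred:
    """Wrap a zero-argument callable and run it at most once, on first use."""

    def __init__(self, compute: Callable[[], int]) -> None:
        self._compute = compute
        self._value: Optional[int] = None
        self._done = False

    def get(self) -> Optional[int]:
        # The wrapped callable runs here, not at construction time.
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value


calls: List[str] = []  # records each time the expensive step actually runs


def expensive_row_count() -> int:
    # Hypothetical stand-in for something costly, such as a collect()
    # on a large query plan.
    calls.append("ran")
    return sum(1 for _ in range(1_000))


def validate(expected_rows: int, row_count: Deferred, strict: bool) -> bool:
    # The count is materialized only on the strict path.
    if not strict:
        return True
    return row_count.get() == expected_rows


lenient_ok = validate(1_000, Deferred(expensive_row_count), strict=False)
calls_after_lenient = len(calls)

strict_ok = validate(1_000, Deferred(expensive_row_count), strict=True)
calls_after_strict = len(calls)
```

On the non-strict path the count is never computed; on the strict path it is computed exactly once, at the point where the comparison needs it.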