When improving performance, make sure code neither (a) fetches too much data, nor (b) does the same expensive work repeatedly, nor (c) runs unbounded workloads.
Practical rules

- Prefer get_by_id-style lookups over broad tenant-scoped queries when you already have the identifier.
- Push filtering down to the store (e.g. include doc_id / source_id in the search condition) instead of filtering in application code.
- Cap user-supplied top_k so a bad request can't trigger massive retrieval.

Examples
```python
top = int(req.get("top_k", 1024))
top = min(top, 200)  # cap user input at a server-side maximum
results = SearchService.search(..., top_k=top)
```
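The capping pattern above can be pulled into a small helper. This is a self-contained sketch: the `req` dict, the default of 1024, and the cap of 200 mirror the snippet above, while the fallback handling for non-numeric input is an assumption, not part of the original.

```python
def capped_top_k(req: dict, default: int = 1024, cap: int = 200) -> int:
    """Clamp a user-supplied top_k to a sane server-side range."""
    try:
        top = int(req.get("top_k", default))
    except (TypeError, ValueError):
        # non-numeric input falls back to the default (assumption)
        top = default
    # never below 1, never above the cap
    return max(1, min(top, cap))
```

For example, `capped_top_k({"top_k": 5000})` returns 200 and `capped_top_k({"top_k": 10})` returns 10, so a hostile or buggy client cannot force an oversized retrieval.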
```python
batch_size = max(1, int(os.getenv("PDF_PARSER_PAGE_BATCH_SIZE", "50")))
for page_from in range(from_page, to_page, batch_size):
    # process one bounded window of pages at a time
    page_to = min(page_from + batch_size, to_page)
    __images__(fnm, page_from=page_from, page_to=page_to)
    chunk_boxes = parse_window_into_boxes()
    all_boxes.extend(to_global(chunk_boxes))
```
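The windowing arithmetic in that loop can be isolated and tested on its own. A minimal sketch (`page_windows` is a hypothetical helper, not from the original code):

```python
def page_windows(from_page: int, to_page: int, batch_size: int):
    """Yield (page_from, page_to) half-open windows covering the page range.

    Guards against a zero or negative batch size, matching the
    max(1, ...) clamp in the example above.
    """
    batch_size = max(1, batch_size)
    for page_from in range(from_page, to_page, batch_size):
        yield page_from, min(page_from + batch_size, to_page)
```

With `batch_size=50`, a 120-page document yields `(0, 50)`, `(50, 100)`, `(100, 120)`: the last window is truncated rather than running past the end, and memory use stays bounded regardless of document size.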
```python
# include source_id/doc_id in the search condition so the store filters
cond = {"source_id": [doc_id], **other_filters}
res = SearchService.search(dataset_id=dataset_id, condition=cond, limit=1)
```
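Rule (b), not repeating the same expensive work, has no example above. A minimal memoization sketch using the standard library; `embed` and its body are purely illustrative stand-ins for an expensive call such as model inference or a remote API:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def embed(text: str) -> tuple:
    # stand-in for expensive work (assumption: real code would call a
    # model or remote service here); tuples are used because lru_cache
    # requires hashable return-by-value semantics for safe sharing
    return tuple(ord(c) % 7 for c in text)
```

Repeated calls with the same argument hit the cache instead of redoing the work; `embed.cache_info()` exposes hit/miss counts for verifying this. Note that `lru_cache` requires hashable arguments, so dicts or lists must be converted (e.g. to tuples) before being passed in.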