When implementing parsing, storage, or retrieval logic (or any pipeline with downstream queries), design the algorithm so that data representations, storage keys, and batching decisions are stable across runs and consistent between the code that writes them and the code that reads them:

Example pattern (parallel token/raw + bounded, deterministic detection):

# 1) Parallel values: keep token fields tokenized
if role in ("vectorize", "both"):
    chunk[f"{typed_key}_tks"] = tokenize(value)  # tokenized copy for search/indexing
if role in ("metadata", "both"):
    chunk[f"_raw_{col}"] = str(value)  # human-readable for aggregation/UI
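A minimal runnable sketch of the parallel-values pattern above. The `store_field` helper, the whitespace-based `tokenize`, and the sample column names are illustrative assumptions, not a specific library's API; the point is that the tokenized and raw representations are always written together under a fixed key convention.

```python
def tokenize(value):
    # Stand-in tokenizer for illustration: lowercase whitespace split.
    return " ".join(str(value).lower().split())

def store_field(chunk, typed_key, col, value, role):
    # Hypothetical helper: write the tokenized and raw representations
    # side by side, so vector/full-text search and metadata/UI queries
    # always see consistent keys for the same source value.
    if role in ("vectorize", "both"):
        chunk[f"{typed_key}_tks"] = tokenize(value)
    if role in ("metadata", "both"):
        chunk[f"_raw_{col}"] = str(value)
    return chunk

chunk = store_field({}, "title_str", "title", "Quarterly Report", "both")
# chunk now holds both "title_str_tks" and "_raw_title"
```

Because both keys are derived deterministically from the column name, a reader never has to guess which representation exists for a given field.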

# 2) Bounded windowing
for page_from in range(0, total_pages, batch_size):
    page_to = min(page_from + batch_size, total_pages)
    load_pages(page_from=page_from, page_to=page_to)
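The windowing loop above can be isolated into a small generator, which makes the batching decision easy to test on its own. This is a sketch; `windows` is an assumed helper name, and a real pipeline would pass each pair to its page loader.

```python
def windows(total_pages, batch_size):
    # Yield half-open (page_from, page_to) pairs that cover all pages
    # in bounded, deterministic batches; the final window is clamped.
    for page_from in range(0, total_pages, batch_size):
        yield page_from, min(page_from + batch_size, total_pages)

print(list(windows(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

The same inputs always produce the same windows, so retries and parallel workers partition the document identically.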

# 3) Deterministic detector sampling
sample = page_chars[:200]  # fixed prefix, not random.sample: same input, same verdict
is_garbled = detect(sample, threshold=0.3)
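A self-contained sketch of the deterministic-sampling step. The `detect` heuristic here (ratio of non-printable characters against a threshold) is a placeholder assumption; the property being demonstrated is that sampling a fixed prefix makes the detector reproducible, where `random.sample` would make the verdict flip between runs.

```python
def detect(sample, threshold=0.3):
    # Placeholder heuristic: flag the text when the fraction of
    # non-printable characters exceeds the threshold.
    if not sample:
        return False
    bad = sum(1 for ch in sample if not ch.isprintable())
    return bad / len(sample) > threshold

def is_garbled(page_chars):
    # Deterministic sample: always the first 200 characters, so the
    # same page yields the same verdict on every run.
    return detect(page_chars[:200], threshold=0.3)
```

Any detector with a tunable threshold benefits from this: flaky pass/fail behavior usually comes from the sampling, not the heuristic itself.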

Adopting this standard prevents subtle retrieval and search regressions, reduces flakiness in heuristic checks, and improves both the correctness and the performance of chunked algorithms.