Invariant Diff Matching

When implementing matching/diff logic (especially for code/entities), define and test the algorithmic invariants you rely on, and avoid fragile heuristics.

copy reviewer prompt

Prompt

Reviewer Prompt

When implementing matching/diff logic (especially for code/entities), define and test the algorithmic invariants you rely on, and avoid fragile heuristics.

Apply:

  • Use a staged matcher: (1) exact identity, (2) invariant structural equivalence (e.g., rename/formatting-insensitive hash), (3) fuzzy similarity for remaining candidates.
  • Define structural hashes by masking the exact varying span (e.g., replace the name bytes with whitespace before hashing), rather than using brittle line-based assumptions.
  • For “moved”/positional classifications, prefer local/context-based reasoning over global thresholds, and add regression tests for common false positives (like insertion above without reordering).
  • If you generate candidates via multiple discovery paths (e.g., tree indexing vs filesystem probing), ensure both paths use the same filtering rules and dedupe inputs before matching.

Example (structural hash rename-invariant):

fn compute_body_hash_invariant(content: &str, start: usize, end: usize, body_bytes: &[u8]) -> u64 {
    // Mask the entity name span so renames don’t change the structural hash.
    let mut bytes = body_bytes.to_vec();
    // Replace non-whitespace characters in [start, end) with spaces (conceptually).
    for b in &mut bytes {
        if *b != b' ' && *b != b'\n' && *b != b'\t' { *b = b' '; }
    }
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

Outcome: fewer misclassified diffs (renames vs modifications vs moves), deterministic behavior, and safer results under edge cases and degraded indexing.

Source discussions