Avoid duplicate function calls, repeated tensor operations, and redundant computations that can significantly impact performance. Implement caching mechanisms for expensive operations and pre-allocate tensors when possible.
Key optimization strategies:

- Cache the results of expensive calls such as get_normalized_target_modules() or is_npu() to avoid repeated execution.
- Pre-allocate tensors instead of creating them with torch.arange() or torch.zeros() in hot paths.

Example of caching expensive operations:
# Before: redundant calls, one per config
for lora_id, config in self.configs.items():
    user_normalized_modules = get_normalized_target_modules(config.target_modules)

# After: cache and reuse
# (assumes config.target_modules is hashable; convert a list to a tuple before using it as a key)
normalized_cache = {}
for lora_id, config in self.configs.items():
    if config.target_modules not in normalized_cache:
        normalized_cache[config.target_modules] = get_normalized_target_modules(
            config.target_modules
        )
    user_normalized_modules = normalized_cache[config.target_modules]
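When the expensive function is pure and its arguments are hashable, memoization with functools.lru_cache gives the same effect with less bookkeeping. The sketch below is illustrative only: normalize_target_modules() is a hypothetical stand-in, not the actual get_normalized_target_modules() implementation.

from functools import lru_cache

# Hypothetical pure helper used for illustration; the real function in the
# codebase may have a different signature and behavior.
@lru_cache(maxsize=None)
def normalize_target_modules(target_modules: tuple) -> tuple:
    # The expensive normalization runs once per distinct argument; later calls
    # with the same tuple return the cached result.
    return tuple(sorted(m.lower() for m in target_modules))

configs = {
    "lora_a": ("q_proj", "k_proj"),
    "lora_b": ("q_proj", "k_proj"),  # identical modules -> cache hit
}
for lora_id, target_modules in configs.items():
    normalized = normalize_target_modules(target_modules)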
Example of pre-allocation:
# Before: creating a tensor in the hot path on every call
q_indptr = torch.arange(0, bs + 1, dtype=torch.int32, device=device)

# After: pre-allocate once during initialization
self.q_indptr_decode = torch.arange(0, max_bs + 1, dtype=torch.int32, device=device)
# In the hot path, slice the pre-allocated buffer instead of allocating:
q_indptr = self.q_indptr_decode[:bs + 1]
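A minimal self-contained sketch of the same pattern; the class and method names (DecodeBuffers, get_q_indptr) are hypothetical and chosen only for illustration.

import torch

class DecodeBuffers:
    """Pre-allocates index buffers once so the decode hot path only slices."""

    def __init__(self, max_bs: int, device: torch.device):
        # Allocated once for the largest possible batch size.
        self.q_indptr_decode = torch.arange(
            0, max_bs + 1, dtype=torch.int32, device=device
        )

    def get_q_indptr(self, bs: int) -> torch.Tensor:
        # Slicing returns a view; no new tensor is allocated per step.
        return self.q_indptr_decode[: bs + 1]

buffers = DecodeBuffers(max_bs=256, device=torch.device("cpu"))
q_indptr = buffers.get_q_indptr(bs=8)  # reuses the pre-allocated buffer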
The performance impact can be substantial: in one measured case, caching reduced execution time from 133μs to 5μs. Always profile critical paths and eliminate redundant work through strategic caching and pre-allocation.
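To verify such wins rather than assume them, a quick timing harness is often enough. This sketch uses time.perf_counter with a hypothetical expensive_normalize() stand-in; it is not the measurement that produced the numbers above.

import time

def expensive_normalize(modules):
    # Stand-in for an expensive call such as get_normalized_target_modules().
    return tuple(sorted(m.lower() for m in modules))

def timeit(fn, *args, repeat=1000):
    # Average wall-clock time per call over `repeat` iterations.
    start = time.perf_counter()
    for _ in range(repeat):
        fn(*args)
    return (time.perf_counter() - start) / repeat

modules = ("q_proj", "k_proj", "v_proj")
cache = {}

def cached_normalize(modules):
    if modules not in cache:
        cache[modules] = expensive_normalize(modules)
    return cache[modules]

print(f"uncached: {timeit(expensive_normalize, modules) * 1e6:.1f}us per call")
print(f"cached:   {timeit(cached_normalize, modules) * 1e6:.1f}us per call")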