Ensure concurrent execution is race-free by (a) not storing per-inference mutable state in shared layer objects, (b) keeping OpenMP loop temporaries thread-local, and (c) using the correct synchronization primitive for the target platform.

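Apply these rules:

1) State must be owned by the running instance, not by the shared model/layer. Recurrent state such as hidden and cell values should enter and leave forward() through its input and output blobs rather than being cached in mutable members of the layer.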

Example pattern:

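// Instead of declaring `mutable Mat hidden, cell;` on the layer and
// assigning hidden=...; cell=...; inside forward(),
// read the state from the inputs and write it to the outputs: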
int forward(const std::vector<Blob*>& bottom_blobs,
            std::vector<Blob*>& top_blobs) {
    const Mat& hidden_in = bottom_blobs[HIDDEN_BLOB_INDEX]->data;
    const Mat& cell_in   = bottom_blobs[CELL_BLOB_INDEX]->data;
    Mat hidden_out, cell_out;
    // compute hidden_out/cell_out...
    top_blobs[HIDDEN_OUT]->data = hidden_out;
    top_blobs[CELL_OUT]->data   = cell_out;
    return 0;
}

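2) In parallel loops, never let per-iteration data become shared state. Declare OpenMP loop temporaries inside the loop body (or mark them private) so that each thread works on its own copy.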
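
A minimal sketch of rule 2, assuming a flat row-major buffer. The function name normalize_rows and its parameters are hypothetical and do not come from any particular library. The point it illustrates is that row_sum and inv are declared inside the loop body, so each iteration (and therefore each thread) gets its own copy. Hoisting them above the pragma would make them shared and introduce a race.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: divides every element of a row by that row's sum.
// Rows are independent, so the outer loop can run in parallel.
std::vector<float> normalize_rows(const std::vector<float>& in, int rows, int cols) {
    assert(in.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<float> out(in.size());

    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int r = 0; r < rows; r++) {
        const float* src = in.data() + static_cast<std::size_t>(r) * cols;
        float* dst = out.data() + static_cast<std::size_t>(r) * cols;

        // Declared inside the loop body: private to the executing thread.
        float row_sum = 0.f;
        for (int c = 0; c < cols; c++) row_sum += src[c];

        const float inv = (row_sum != 0.f) ? 1.f / row_sum : 0.f;
        for (int c = 0; c < cols; c++) dst[c] = src[c] * inv;
    }
    return out;
}
```

Each thread writes only to the rows it owns, so the shared output buffer needs no lock.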

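3) Locking primitives must match platform semantics. Use the primitive the target platform actually provides, and do not assume that behavior such as recursive locking carries over from one platform's primitive to another's.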
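
A minimal sketch of rule 3, assuming a build that targets either Windows or a POSIX system. The class names Mutex, MutexLockGuard, and SharedCounter are hypothetical wrappers written for this illustration. Both branches use a non-recursive exclusive lock: a Windows SRWLOCK cannot be acquired recursively, and relocking a default pthread mutex from the owning thread is undefined. Code written against this wrapper therefore must not re-enter a lock it already holds on either platform.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#endif

// Hypothetical wrapper that selects the native exclusive lock per platform.
class Mutex {
public:
#ifdef _WIN32
    Mutex() { InitializeSRWLock(&lock_); }
    ~Mutex() {}  // SRW locks need no explicit destruction
    void lock() { AcquireSRWLockExclusive(&lock_); }
    void unlock() { ReleaseSRWLockExclusive(&lock_); }
#else
    Mutex() {
        int rc = pthread_mutex_init(&lock_, nullptr);
        assert(rc == 0);
        (void)rc;
    }
    ~Mutex() { pthread_mutex_destroy(&lock_); }
    void lock() { pthread_mutex_lock(&lock_); }
    void unlock() { pthread_mutex_unlock(&lock_); }
#endif
    Mutex(const Mutex&) = delete;
    Mutex& operator=(const Mutex&) = delete;

private:
#ifdef _WIN32
    SRWLOCK lock_;
#else
    pthread_mutex_t lock_;
#endif
};

// RAII guard: every lock() is paired with the matching unlock().
class MutexLockGuard {
public:
    explicit MutexLockGuard(Mutex& m) : mutex_(m) { mutex_.lock(); }
    ~MutexLockGuard() { mutex_.unlock(); }
    MutexLockGuard(const MutexLockGuard&) = delete;
    MutexLockGuard& operator=(const MutexLockGuard&) = delete;

private:
    Mutex& mutex_;
};

// Hypothetical shared state protected by the wrapper.
struct SharedCounter {
    Mutex mutex;
    long value = 0;

    void add(long n) {
        MutexLockGuard guard(mutex);
        value += n;
    }
};
```

SharedCounter::add can be called from several threads at once because each call takes the lock for the duration of the update and releases it through the guard's destructor.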