But that's manageable. This was Nightly, so we could totally push updates that introduced additional checks/logging, even if they hurt performance. Which is what we did. It was like printf debugging, with a turnaround cycle of 2+ days while you wait for new crashes.
We tweaked the hashmap code to use special canary bit patterns for empty and freed entries. This let us verify that the bug was essentially a bit flip in the hashmap's entry occupancy buffer: the map would think that a never-occupied entry was occupied.
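(Roughly, the canary idea looks like the sketch below. The bit patterns, types, and slot layout here are hypothetical stand-ins, not the actual Firefox/Rust std code.)

```rust
// Sketch: mark never-occupied and freed slots with distinctive canary
// patterns, so a slot that the map claims is occupied but still holds a
// canary proves the occupancy metadata flipped rather than the payload.

const CANARY_EMPTY: u64 = 0xDEAD_BEEF_DEAD_BEEF; // hypothetical pattern
const CANARY_FREED: u64 = 0xFEED_FACE_FEED_FACE; // hypothetical pattern

#[derive(Clone, Copy, PartialEq)]
enum SlotState { Empty, Freed, Occupied }

struct Slot {
    state: SlotState,
    payload: u64, // stand-in for the real entry type
}

impl Slot {
    fn new_empty() -> Self {
        Slot { state: SlotState::Empty, payload: CANARY_EMPTY }
    }

    fn free(&mut self) {
        self.state = SlotState::Freed;
        self.payload = CANARY_FREED;
    }

    /// Run wherever the map believes the slot is occupied; finding a canary
    /// here means the occupancy metadata is lying about the slot.
    fn check(&self) {
        if self.state == SlotState::Occupied
            && (self.payload == CANARY_EMPTY || self.payload == CANARY_FREED)
        {
            panic!("occupancy metadata says occupied, but payload is a canary");
        }
    }
}
```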
This would cause the hashmap destructor to attempt to destroy a struct built from uninitialized memory, which of course caused segmentation faults.
We added a bunch of journaling and ran integrity checks when mutating the map. The checks were failing before we even entered hashmap code, a further indication that the bug was not in Rust's hashmap.
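(Something along these lines, as a sketch. The wrapper type, the journal, and the invariant being checked are all illustrative; the real instrumentation lived inside Firefox's crash reporting.)

```rust
// Sketch: wrap the map so every mutation is journaled and bracketed by
// integrity checks. A check that fails on *entry* to the wrapper shows the
// corruption happened outside hashmap code.

use std::collections::HashMap;

struct JournaledMap {
    inner: HashMap<u64, u64>,
    journal: Vec<String>, // hypothetical in-memory journal
}

impl JournaledMap {
    fn new() -> Self {
        JournaledMap { inner: HashMap::new(), journal: Vec::new() }
    }

    fn check_integrity(&self, context: &str) {
        // Stand-in invariant; the real checks verified the occupancy
        // buffer against the canary patterns described above.
        for (k, v) in &self.inner {
            assert!(*v != 0xDEAD_BEEF, "integrity check failed on {context}: key {k}");
        }
    }

    fn insert(&mut self, k: u64, v: u64) {
        self.check_integrity("insert (before mutation)");
        self.journal.push(format!("insert {k}"));
        self.inner.insert(k, v);
        self.check_integrity("insert (after mutation)");
    }
}
```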
Alright, so the bug is probably in Firefox code that's deciding to stomp on this hashmap. We tweaked the code to mprotect the buffer outside of hashmap API calls; this would ideally move the segfault to the stompy code.
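(In spirit, something like the sketch below, assuming the occupancy buffer is page-aligned and its address and length are known. The helper names are made up; only the mprotect trick itself is from the thread.)

```rust
// Sketch: make the buffer read-only between hashmap API calls, so any
// outside writer faults immediately at the offending write instead of
// corrupting memory silently.

use libc::{c_void, mprotect, PROT_READ, PROT_WRITE};

/// Lock the buffer read-only on exit from every hashmap API function.
unsafe fn lock_buffer(ptr: *mut c_void, len: usize) {
    if mprotect(ptr, len, PROT_READ) != 0 {
        panic!("mprotect(PROT_READ) failed");
    }
}

/// Unlock it on entry to every hashmap API function so legitimate
/// mutations still work.
unsafe fn unlock_buffer(ptr: *mut c_void, len: usize) {
    if mprotect(ptr, len, PROT_READ | PROT_WRITE) != 0 {
        panic!("mprotect(PROT_READ | PROT_WRITE) failed");
    }
}
```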
We found nothing. So it seems the hashmap itself is not breaking invariants, and code outside the hashmap is not stomping on it, yet invariants are broken anyway?? By whom? We had eliminated all possible actors, right?
Eventually someone noticed that there were similar crashes, at a reduced rate, in similar C++ hashmaps in the old style system.
The hypothesis we ended up with was that Rust hashmaps are designed in a way (with a separate occupancy buffer) that makes them more vulnerable to corruption from random bitflips, especially when the hashmap is huge, as it was here.
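(To illustrate why a separate occupancy buffer is bitflip-sensitive: the constants below follow the SwissTable control-byte scheme used by hashbrown, which backs Rust's std HashMap, but this is an illustration rather than the real implementation.)

```rust
// Sketch: one flipped bit in a control byte turns a never-occupied slot
// into a "full" one, so the destructor will try to drop whatever garbage
// sits in the corresponding entry.

const EMPTY: u8 = 0b1111_1111;   // slot never occupied
const DELETED: u8 = 0b1000_0000; // slot freed

fn is_full(ctrl: u8) -> bool {
    // A slot reads as "full" whenever the top bit is clear;
    // the low 7 bits hold hash bits.
    ctrl & 0x80 == 0
}

fn main() {
    let ctrl = EMPTY;
    // Flip the single top bit of an EMPTY control byte...
    let corrupted = ctrl ^ 0b1000_0000;
    // ...and the slot now looks occupied.
    assert!(!is_full(ctrl));
    assert!(!is_full(DELETED));
    assert!(is_full(corrupted));
    println!("EMPTY {ctrl:#010b} -> {corrupted:#010b}, is_full = {}", is_full(corrupted));
}
```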
And the reason random bitflips happen? There's basically only one thing that can cause bitflips that won't be detected by mprotect: "Some People Just Have Bad RAM." It's bound to happen at least a little bit with a deployment base as large as Firefox's.
I remember standing in the SFO office with @davidbaron looking at multiple different kinds of crashes that could only be explained by bad RAM.
Replying to @khuey_ @ManishEarth
I also remember one that was pretty clearly bad disk -- a repeated startup crash on a single machine that was explained by a single bitflip in the instructions being executed.

