At Recall.ai we process a lot of real-time media. We build bots that join video conferences across Zoom, Teams, Google Meet and more. These bots ingest, transcode and analyze the audio and video in real time. This comes with a number of engineering challenges, and this is the story of how we identified and fixed a segfault in our audio processing pipeline that was triggered by roughly 1 in every 36,000,000 seconds of audio.
A few months ago we began to notice a strange pattern in some of our bots. Very occasionally, a bot would abruptly crash without any prior indication of anything going awry. We use Kubernetes to orchestrate our bots and Karpenter to manage our elastic scaling. kubectl describe pod showed that the bots exited with code 139 = 128 + 11 (the signal number). Signal 11 corresponds to SIGSEGV, a segmentation fault (or segfault for short), which occurs when a program tries to access memory it is not supposed to. This mistake is bad enough that execution of the program must halt immediately.
Getting a core dump
So how do we determine the cause of a SIGSEGV? The answer is often a core dump file. A core dump lets us inspect the state of the program at the instant of the crash, revealing many useful things such as the stack trace, variable and memory contents, and more.
Core dumps are written to disk when a process crashes. However, because our bots run as containers in Kubernetes, the disk is ephemeral and becomes inaccessible once the container exits and Karpenter terminates the underlying VM. In order to diagnose the cause of this segfault, we needed to start reliably collecting core dumps across the nodes in our cluster.
We wanted something lightweight that could be easily integrated into our existing Docker images, so we built Garbage Truck, which detects core dump files and uploads them to S3 before the pod terminates.
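In spirit it boils down to something like this (a minimal sketch, assuming cores land under /cores via kernel.core_pattern and that the pod can write to a hypothetical bucket):

```sh
#!/bin/sh
# Minimal sketch of the Garbage Truck idea. Assumes the kernel writes
# core files into /cores, e.g. via:
#   sysctl -w kernel.core_pattern=/cores/core.%e.%p
# The bucket name below is hypothetical.
CORE_DIR=/cores
BUCKET=s3://example-core-dumps

while true; do
  for f in "$CORE_DIR"/core.*; do
    [ -e "$f" ] || continue
    # Ship the dump off the ephemeral disk before the pod goes away.
    aws s3 cp "$f" "$BUCKET/$(hostname)/$(basename "$f")" && rm -f "$f"
  done
  sleep 5
done
```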
It is not often that I've been eager to see segfaults occur in production, but this was one such occasion. After deploying Garbage Truck, we waited for the next crash so we could continue the investigation.
Days passed, then a week, and then two. At this point we started to wonder if Garbage Truck itself had changed the execution environment in a way that prevented the segfaults from happening. However, on the 16th day we observed another crash, and Garbage Truck diligently transported the core dump to S3 for us.
Inspecting the stack trace
Excitedly, I spun up a new pod on our cluster using the same image in which the crash occurred, to ensure all binaries and system libraries were where the core dump expected them to be. We use a Debian base image, so I installed gdb and loaded up the core dump:
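(Paths and addresses abridged.)

```
$ apt-get update && apt-get install -y gdb
$ gdb ./bot core
(gdb) bt
#0  0x00007f... in voAACEnc_pow2_xy () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
#1  0x00007f... in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
#2  0x00007f... in ?? ()
#3  0x00007f... in ?? ()
...
```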
Now we have the stack trace telling us where our program crashed. We use GStreamer as our media processing framework, so it is unsurprising to see the crash originating there. The immediate location of the crash was /lib/x86_64-linux-gnu/libvo-aacenc.so.0, our AAC audio encoding library. The majority of the stack frames are designated with ??, meaning gdb does not have the symbol information for the associated binaries. Without this, besides staring blindly at the disassembly, there is little we can do to better understand the crash. At this point I suspected the following, in order of probability:
- We are subtly misusing some GStreamer API causing a use-after-free or corrupting the encoder state
- The voaacenc GStreamer plugin has a bug causing the crash
- libvo-aacenc has a bug causing the crash
From the above we know that this crash occurs in voAACEnc_pow2_xy in libvo-aacenc. A quick Google search does not reveal much either: only one poor soul, using the same encoder via ffmpeg, potentially ran into this over 9 years ago.
It turns out that GStreamer relies on v0.1.3 of this library, a release that is over ten years old. Perhaps this is not the most well-maintained GStreamer plugin...
The source of voAACEnc_pow2_xy boils down to the following (a paraphrased sketch rather than the verbatim upstream code):
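```c
/* Paraphrased sketch of voAACEnc_pow2_xy from vo-aacenc v0.1.3, not a
 * verbatim copy. It computes 2^(x/y) in fixed point via a lookup
 * table, assuming x <= 0 and y > 0.                                  */
#include <stdint.h>

typedef int32_t  Word32;
typedef uint32_t UWord32;

#define POW2_TABLE_SIZE 16                 /* illustrative size        */
static const Word32 pow2Table[POW2_TABLE_SIZE] = { 0x7fffffff /* ... */ };

Word32 pow2_xy(Word32 x, Word32 y)
{
    Word32  tmp   = -x;                    /* non-negative iff x <= 0  */
    UWord32 iPart = tmp / y;               /* integer part of -x/y     */
    UWord32 fPart = tmp - iPart * y;       /* fractional remainder     */

    /* If x > 0, tmp is negative and fPart wraps to a huge unsigned
     * value (x = 6, y = 28 gives fPart = 0xfffffffa), so the index
     * below lands far outside pow2Table.                              */
    return pow2Table[POW2_TABLE_SIZE * fPart / y] >> iPart;
}
```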
It seems that we are indexing out of bounds of the statically-allocated pow2Table variable. The function itself is an optimised implementation of 2^(x/y) that assumes some invariants (x <= 0, y > 0). So perhaps we are breaking these invariants somehow.
Digging (much) deeper
As the next step in my investigation, I looked at the code that calls this pow2_xy function. Here is an annotated version (again paraphrased; the arithmetic feeding pow2_xy is my reconstruction):
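```c
/* Annotated sketch of the caller in vo-aacenc's "avoid hole" logic.
 * enDiff, avgEn and minEn follow the real source; the -6 multiplier
 * feeding pow2_xy is an assumption for illustration.                 */
for (sfb = 0; sfb < sfbCnt; sfb++) {
    /* avgEn: average band energy, minEn: minimum band energy.
     * iLog4 is an integer log base 4, so enDiff is a small integer
     * measure of how far the average sits above the minimum.         */
    Word32 enDiff = iLog4(avgEn) - iLog4(minEn);

    /* pow2_xy requires x <= 0. As long as avgEn >= minEn we have
     * enDiff >= 0 and the first argument stays non-positive.         */
    Word32 thrFactor = pow2_xy(-6 * enDiff, 28);

    /* ... thrFactor then scales this band's threshold ...            */
}
```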
To try to decipher which invariant (if any) was violated, I inspected the variables and registers through gdb.
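Frame 0 has no debug info, so the values come from registers; which register held which variable is inferred from the disassembly (output abridged):

```
(gdb) frame 0
#0  0x00007f... in voAACEnc_pow2_xy () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
(gdb) info registers rsi rax
rsi            0x1c                28            # y
rax            0xfffffffa          4294967290    # fPart
```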
Because this is a heavily-optimised release build, the core dump stubbornly refuses to reveal most of the variables' state. From the above we know that y is 28 and fPart is an absurdly large 0xfffffffa. This points to some kind of overflow. We don't know x.
To figure out what x might have been, I modified the function and fuzzed over a range of x:
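Something along these lines, as a standalone reconstruction of that harness reusing the arithmetic from the sketch above:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int32_t y = 28;                  /* y observed in the dump   */

    for (int32_t x = -64; x <= 64; x++) {
        int32_t  tmp   = -x;
        uint32_t iPart = tmp / y;
        uint32_t fPart = tmp - iPart * y;

        /* Flag any x that reproduces the fPart seen in the core dump. */
        if (fPart == 0xfffffffaU)
            printf("x = %d reproduces fPart = 0x%08x\n", x, fPart);
    }
    return 0;
}
```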
Running the reconstructed harness prints the following:
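```
x = 6 reproduces fPart = 0xfffffffa
x = 34 reproduces fPart = 0xfffffffa
x = 62 reproduces fPart = 0xfffffffa
```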
So it is probable that we were invoking pow2_xy with x = 6, the smallest of these candidates, causing our bot to crash abruptly.
I next looked at a simplified version of the calling code from earlier, with the for loop unrolled, to see if that shed any light:
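```c
/* The caller from earlier, simplified and with the loop unrolled
 * (same caveat: the -6 multiplier is a reconstruction).              */
Word32 enDiff    = iLog4(avgEn) - iLog4(minEn);
Word32 x         = -6 * enDiff;      /* enDiff = -1  =>  x = 6        */
Word32 thrFactor = pow2_xy(x, 28);   /* any x > 0 reads out of bounds */
```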
That seems to suggest that enDiff = -1 would reproduce our crash exactly. In general, if enDiff < 0 for any reason, it would trigger a segfault in pow2_xy. Because enDiff = iLog4(avgEn) - iLog4(minEn), if enDiff < 0 then iLog4(avgEn) < iLog4(minEn), which implies that avgEn < minEn. If average(energy) < minimum(energy), something else has gone very wrong.
If we could reproduce this locally, then we could step through the code and (hopefully) pinpoint exactly where things started to go awry. Reading through the libvo-aacenc source some more, it seems that the encoding functions are pure over the encoder state. Put simply, they rely only on the current state of the encoder to operate. So what if we dump the encoder state from our core dump and resume execution locally? gdb's dump binary memory is our friend here. Inspecting our encoder structure in gdb:
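(Output abridged; the struct layout comes from the vo-aacenc headers, the addresses are illustrative.)

```
(gdb) print *(AAC_ENCODER *) handle
$1 = {config = {...}, bitStream = 0x7f9c4a51e010,
      psyKernel = {psyConfLong = {...}, psyData = {{...}, {...}}, ...},
      qcKernel = {...}, ...}
(gdb) dump binary memory encoder.bin 0x7f9c4a510000 0x7f9c4a5a0000
```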
Yikes! It seems that our encoder state is a complicated structure. Oh well, we should still be able to reconstruct it...
Now we can hack together some C to restore our encoder state from the memory dumps:
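One way to hack this together is to map each dumped region back in at the address it originally occupied, so the encoder's internal pointers stay valid (file names and addresses illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a region produced by gdb's `dump binary memory` back at the
 * address it occupied in the crashed process.                        */
static void restore_region(const char *path, void *addr)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); exit(1); }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); exit(1); }

    /* MAP_FIXED pins the mapping to the original address, so any
     * pointers embedded in the dumped state still resolve.           */
    if (mmap(addr, st.st_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_FIXED, fd, 0) == MAP_FAILED) {
        perror("mmap"); exit(1);
    }
    close(fd);
}

int main(void)
{
    /* Addresses taken from the core dump (illustrative values). */
    restore_region("encoder.bin", (void *)0x7f9c4a510000);
    restore_region("heap.bin",    (void *)0x7f9c4a600000);

    /* ... hand the restored state to the encoder entrypoint ...      */
    return 0;
}
```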
Now we should be able to execute the encoder and reproduce the crash!
Done!? Wait, where is our burning pile of invalid memory dereferences? It seems that the resumed encoder exited cleanly. What if we add some logging to pow2_xy?
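Something like this, bolted onto the pow2_xy sketch from earlier:

```c
/* Log every call so an invariant violation cannot slip past us. */
Word32 pow2_xy(Word32 x, Word32 y)
{
    if (x > 0 || y <= 0)
        fprintf(stderr, "pow2_xy: invariant violated (x=%d, y=%d)\n",
                (int)x, (int)y);
    /* ... original body unchanged ...                                */
}
```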
There are no invalid x values here either. Stepping through the code using gdb made it clear why: many of the internal functions mutate the encoder state throughout the lifetime of the GetOutputData entrypoint. So the state of the encoder that we restored was the state at the instant of the segfault, not the actual state when the bot called GetOutputData prior to the crash. Unfortunately, we cannot recover the original encoder state from the core dump either.
So let us go deeper then. What if we restore the state at the frame above pow2_xy? Let's dump the memory behind those arguments and try again:
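(Addresses illustrative, as before.)

```
(gdb) frame 1
(gdb) dump binary memory frame_args.bin 0x7f9c4a540000 0x7f9c4a548000
```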
Now we revise our main:
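Instead of driving the whole encoder, we now call straight into the faulting code path. caller_fn below is a stand-in name for the function one frame above pow2_xy:

```c
/* Hypothetical prototype; the real function and its argument types
 * come from the vo-aacenc source.                                    */
extern void caller_fn(void *state);

int main(void)
{
    /* Restore the arguments captured one frame above the crash. */
    restore_region("frame_args.bin", (void *)0x7f9c4a540000);

    /* Re-invoke the faulting code path with that exact state. */
    caller_fn((void *)0x7f9c4a540000);
    return 0;
}
```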
Finally, we can run the program:
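(Build invocation illustrative.)

```
$ gcc -g resume.c -lvo-aacenc -o resume
$ ./resume
Segmentation fault (core dumped)
```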
Success! It seems we can now reproduce the crash, even if only from the stack frame directly above it. Now we can add -O0 to our compiler flags to disable optimisations and inspect the local state during execution.
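With optimisations disabled the locals are finally visible (values illustrative, consistent with the analysis below):

```
(gdb) print avgEn
$1 = 510
(gdb) print minEn
$2 = 512
```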
We have just confirmed our previous conjecture that avgEn < minEn! So why does this happen? Let's step through the relevant code.
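The computation of avgEn has roughly this shape (a paraphrased sketch; Div_32's three-argument form follows the library's oper_32b helpers):

```c
shift = norm_l(sumEn);                    /* create headroom           */
num   = sumEn << shift;                   /* normalised numerator      */
quot  = Div_32(num, denom_hi, denom_lo);  /* fixed-point num / denom   */
avgEn = fixmul(quot, scale) << 1;         /* scale the quotient back   */
```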
So now we know, at least roughly, where things have gone wrong. The fixed-point calculation of the average energy avgEn introduces an error that causes avgEn < minEn in rare cases.
But what is actually happening inside norm_l, Div_32 and fixmul that leads to this error? The source code mentions these operations are in "double precision", yet most of the values are stored in 32-bit integers. What is going on here? Google leads us to this relic, last modified in 1996 by France Telecom:
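The header comment of those basic operators (lightly abridged) explains the representation:

```
 * This file contains operations in double precision.
 * These operations are not standard double precision operations.
 * They are used where single precision is not enough but the full
 * 32 bits precision is not necessary. For example, the function
 * Div_32() has a 24 bits precision.
 *
 * The double precision numbers use a special representation:
 *
 *     L_32 = hi<<16 + lo<<1
 *
 * L_32 is a 32 bit integer. hi and lo are 16 bit signed integers.
```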
Ahhh, so it is a special kind of "double precision" split across 32 bits. Now with this in mind we can annotate the relevant code:
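```c
/* The averaging code again, annotated with the DPF representation in
 * mind (still a paraphrased sketch).                                 */
shift = norm_l(sumEn);                    /* sign-bit headroom in sumEn */
num   = sumEn << shift;

/* Div_32 takes its denominator as a DPF pair (denom = hi<<16 + lo<<1)
 * and offers ~24 bits of precision; it also assumes num < denom.     */
quot  = Div_32(num, denom_hi, denom_lo);

/* fixmul keeps the high bits of a 32x32 -> 64-bit product, another
 * spot where a low-order bit can drop.                               */
avgEn = fixmul(quot, scale) << 1;         /* the <<1 doubles any error */
```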
So where are we losing precision then? Let's step through our scenario in gdb:
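(Values illustrative, reconstructed to match the analysis below.)

```
(gdb) step
Div_32 (num=0x20000000, denom_hi=0x2000, denom_lo=0x0000)
# denom reconstructs to 0x2000<<16 = 0x20000000, i.e. num == denom
(gdb) finish
Value returned is $1 = 0x7ffffffe   # ideal quotient is 1.0; 1 LSB short
(gdb) next
(gdb) print avgEn                   # after the 64-bit multiply
$2 = 255                            # should be 256: error of -1
(gdb) next
(gdb) print avgEn                   # after the final left shift
$3 = 510                            # should be 512: error doubled to -2
```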
There are a couple of things that do not look right. Firstly, we are executing Div_32($num, $denom) where $num = $denom, breaking the num < denom invariant. We then lose some precision in our Div_32 call, and this propagates to our 64-bit multiplication of $avgEn; at this stage the floating point equivalent is ~255 and our error is -1. The final left-shift then doubles the error to -2, and we are left with 510.
What if Div_32 did not lose precision?
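We can probe that by special-casing the exact quotient (a quick experiment, not a fix):

```c
/* If num == denom the true quotient is 1.0, so return the closest
 * Q31 value directly instead of letting Div_32 round it down.        */
if (num == denom)
    quot = 0x7fffffff;                    /* ~1.0 in Q31              */
else
    quot = Div_32(num, denom_hi, denom_lo);
```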
Voila! avgEn is now exactly what we expect it to be after dividing it by 1.
So how do we fix this? Could we modify Div_32 to never lose precision? Probably not without harming its performance characteristics.
In this particular case the local invariant is minEn <= avgEn. So let's just implement a lower bound for avgEn:
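Along these lines:

```c
/* Clamp avgEn from below so minEn <= avgEn holds even when the
 * fixed-point average drops a couple of low-order bits.              */
if (avgEn < minEn)
    avgEn = minEn;
```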
Now running the program again:
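(Same illustrative harness as before; a clean exit this time.)

```
$ ./resume
$ echo $?
0
```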
That seems to have worked! So let's push this fix to prod, shall we?
Uncovering a fossil
Not so fast... At this point I was struggling to believe that nobody else had encountered this issue previously. I scoured Google, GitHub and finally Bing (of all places) to find this mailing list thread. A quick read reveals that it is almost certainly referring to the same bug, and that it was fixed upstream nearly 9 years ago. However, the fixes were never released to the GStreamer plugin.
Although it was satisfying to understand and fix the bug that had been causing these sporadic crashes, pressing on with an unmaintained codec is a liability we do not want to take on moving forward. Rather than relying on the unmaintained AAC encoder from gst-plugins-bad, how about we find something more broadly used and hopefully more stable? The libav AAC encoder, used in ffmpeg and available as a GStreamer plugin, seems like a reasonable choice.
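The swap itself is pleasantly small; in gst-launch terms it amounts to replacing the voaacenc element with avenc_aac from gst-libav (illustrative pipelines):

```
# Before: the vo-aacenc element from gst-plugins-bad
gst-launch-1.0 audiotestsrc num-buffers=500 ! audioconvert \
    ! voaacenc ! aacparse ! mp4mux ! filesink location=old.mp4

# After: the libav (ffmpeg) AAC encoder from gst-libav
gst-launch-1.0 audiotestsrc num-buffers=500 ! audioconvert \
    ! avenc_aac ! aacparse ! mp4mux ! filesink location=new.mp4
```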
UPDATE: We have since rolled out this shiny new codec and are yet to spot another segfault! Kudos to the ffmpeg maintainers!