Over July and August, we received several reports of the same cryptic bug, always happening quite sparsely. We were methodically disproving every hypothesis about it and slowly accepting that we would have to build a very specific and complex test setup to reproduce and properly analyze the bug.
Last week (and the weekend before), this changed significantly. Over just 48 hours, we received more crash reports than in the previous two months. Expecting most of them to be the same bug over and over again, we calmly opened them one after another, only to find out that … we had never seen a single one of these before.
One should not be surprised by this. People run a wide range of very different configurations, and in September everybody starts deploying new versions. But as always, even though anticipated and expected, we never actually account for it, so we were caught a little off guard.
Fortunately, all the bugs had quite obvious solutions, so we chewed through them easily, until I opened the core dump sent by Rob Lister, and froze. The backtrace alone told me enough. The failed assertion was in a completely legitimate code path, and anybody could crash a remote BIRD just by closing their BGP session with an additional message. In BIRD terms, running disable bgp1 "lol bye" was not only a bye for this one session; the BIRD on the other side would also crash on this assertion. Always.
There is no doubt about why we hadn’t caught this exact crash much earlier: this feature is not covered by our autotests. It will be soon, but it isn’t now.
Well, but you haven’t come for the story, have you? Why is there an assertion in the first place?
The Assert
BIRD 3 is multithreaded and therefore needs locking. To manage this, we have split BIRD 3 into separate locking domains. If we were writing BIRD 3 from scratch, we would probably have used a memory model where data is always passed between domains asynchronously, through message queues and similar mechanisms. But we are carrying all the BIRD 2 code along, because nobody wanted to reimplement all the protocols.
BIRD 2 is very liberal in what can be accessed synchronously and from where. You may end up with multiple protocols and tables on the stack, and the function importing a route into a table does not return until all the export routines have been processed. It’s an ingenious approach indeed, but a deadlock hell for multithreading.
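As a rough illustration of that synchronous style (with hypothetical names, not the real BIRD 2 API), importing a route walks all the exporting channels within the same call, so table and protocol code ends up nested on a single stack:

/* Hypothetical sketch of the BIRD 2 style of synchronous calls. */
struct rte;                                  /* a route; details omitted */

struct channel {
  struct channel *next;
  void (*export_one)(struct channel *, struct rte *);  /* runs protocol code */
};

struct rtable {
  struct channel *exports;                   /* channels exporting from this table */
};

/* Called from a protocol's import path; does not return until every
 * export routine has processed the route, all on the same stack. */
static void rte_update_sync(struct rtable *t, struct rte *r)
{
  for (struct channel *c = t->exports; c; c = c->next)
    c->export_one(c, r);                     /* protocol -> table -> protocols again */
}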
To handle everything without needing to rewrite all the internal APIs at once, we arranged the locking domains into a partial order, with the rule that a routine may lock a domain only if all the domains it currently holds are strictly before the one being locked. To give an example, a protocol instance is allowed to enter a table context, because all tables are ordered strictly after all protocols. But a table routine cannot enter a protocol instance and has to use some notification mechanism instead. Likewise, BGP cannot directly enter a BMP context and dump the messages synchronously, which brought quite some pain when merging the BMP implementation into BIRD 3.
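To make the rule concrete, here is a minimal sketch of such an ordering check, assuming a simple numeric order; the enum values, struct domain and the domain_lock()/domain_unlock() helpers are hypothetical, not BIRD’s actual locking API.

#include <assert.h>
#include <pthread.h>

enum lock_order {            /* lower numbers are "before" higher ones */
  ORDER_THE_BIRD = 10,       /* toplevel domain */
  ORDER_PROTO    = 20,       /* protocol instances */
  ORDER_TABLE    = 30,       /* routing tables */
};

struct domain {
  pthread_mutex_t mutex;
  int order;
};

/* Per-thread stack of orders currently held, assuming locks are
 * released in the reverse order of acquisition. */
static __thread int held[8];
static __thread int held_cnt;

static void domain_lock(struct domain *d)
{
  /* Deadlock prevention: the domain being locked must come strictly
   * after everything this thread already holds; otherwise abort now
   * instead of risking a lock-order cycle later. */
  assert(held_cnt == 0 || held[held_cnt - 1] < d->order);
  pthread_mutex_lock(&d->mutex);
  held[held_cnt++] = d->order;
}

static void domain_unlock(struct domain *d)
{
  pthread_mutex_unlock(&d->mutex);
  held_cnt--;
}

With this ordering, a routine holding a protocol domain may still lock a table, but any attempt in the opposite direction trips the assert immediately.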
What happened here is simple. The BGP protocol routine handling the Notification message asked for an allocation from proto_pool, which is a global resource pool, the parent pool of all protocol instances. This pool is locked by the the_bird domain, the toplevel domain nobody is allowed to enter; it is always held by the main thread, which handles reconfiguration, toplevel operations, the CLI and several protocols not yet threadified. And because the BGP routine was not running inside the the_bird context, the deadlock prevention mechanism crashed BIRD instead of risking a deadlock.
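Reusing the sketch above, the failing path can be reconstructed roughly like this (again with hypothetical names; the real code asks proto_pool for memory rather than taking an explicit lock):

static struct domain the_bird = { PTHREAD_MUTEX_INITIALIZER, ORDER_THE_BIRD };
static struct domain bgp1     = { PTHREAD_MUTEX_INITIALIZER, ORDER_PROTO };

static void bgp_handle_notification(void)
{
  domain_lock(&bgp1);        /* the BGP routine already holds its own domain */

  /* Allocating the shutdown message from proto_pool means entering
   * the_bird, which is ordered strictly *before* the protocol domain,
   * so the assert in domain_lock() fires and BIRD aborts right here. */
  domain_lock(&the_bird);

  domain_unlock(&the_bird);
  domain_unlock(&bgp1);
}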
Deadlock Prevention
Honestly, in this exact situation, a deadlock was very improbable, so crashing the whole BIRD might look like overkill. Until you have to deal with it in real life. First, even detecting a deadlock is hard. BIRD handles amounts of work large enough that you would simply suspect some performance issue first. Worse than that, not crashing means not being restarted by systemd.
Also, getting a meaningful dump from a deadlocked BIRD means you have to kill the process with the right signal (e.g. ABRT) to produce a core dump. Are you really thinking about this when fixing a stuck router? Maybe if it gets stuck for the fifth time in a row, but with a rare bug like this, nobody cares; just KILL and restart. The deadlock prevention mechanism, on the other hand, always produces a core dump.
Last but not least, the deadlock actually wouldn’t be so improbable if you executed a targeted shutdown against an IXP where the protocol and route states are periodically harvested via the CLI. If you timed your shutdown well enough, just after the harvesting routine started, you could very easily get into a situation where one thread requests the the_bird lock from BGP, while the CLI, running in the_bird, requests that exact protocol’s lock because of show proto all, and the whole BIRD gets stuck. It’s better to just crash outright.
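For illustration only, that inversion boils down to the textbook two-lock deadlock. This standalone pthreads sketch (plain mutexes, not BIRD code) shows the two threads blocking each other when no ordering check is in place:

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t the_bird_lock = PTHREAD_MUTEX_INITIALIZER; /* toplevel / CLI */
static pthread_mutex_t bgp1_lock     = PTHREAD_MUTEX_INITIALIZER; /* one BGP instance */

static void *bgp_thread(void *arg)    /* handling the incoming Notification */
{
  (void) arg;
  pthread_mutex_lock(&bgp1_lock);
  sleep(1);                           /* widen the race window for the demo */
  pthread_mutex_lock(&the_bird_lock); /* wants the global pool -> blocks forever */
  pthread_mutex_unlock(&the_bird_lock);
  pthread_mutex_unlock(&bgp1_lock);
  return NULL;
}

static void *cli_thread(void *arg)    /* harvesting protocol state via the CLI */
{
  (void) arg;
  pthread_mutex_lock(&the_bird_lock);
  sleep(1);
  pthread_mutex_lock(&bgp1_lock);     /* wants the protocol state -> blocks forever */
  pthread_mutex_unlock(&bgp1_lock);
  pthread_mutex_unlock(&the_bird_lock);
  return NULL;
}

int main(void)
{
  pthread_t a, b;
  pthread_create(&a, NULL, bgp_thread, NULL);
  pthread_create(&b, NULL, cli_thread, NULL);
  pthread_join(a, NULL);              /* never returns: the classic deadlock */
  pthread_join(b, NULL);
  return 0;
}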
But why does the protocol allocate from the global proto_pool at all? Well, all the resource pools available to a protocol are bound to the lifetime of the protocol instance’s up state, and are therefore flushed when the instance goes DOWN. This specific message, however, needs to persist over the protocol shutdown so that it can be displayed to the user when queried for. Thus we added another global resource pool just for these shutdown messages, not at the the_bird level but ordered after the protocol instances themselves, to allow it to be entered from protocol routines.
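In terms of the earlier sketch, the fix amounts to giving the shutdown messages their own domain placed somewhere after the protocol instances; the names and the exact ordering value below are made up for illustration:

enum { ORDER_SHUTDOWN_MSG = 25 };     /* hypothetical: strictly after ORDER_PROTO */

static struct domain shutdown_msg_pool = {
  PTHREAD_MUTEX_INITIALIZER, ORDER_SHUTDOWN_MSG
};

static void bgp_store_shutdown_message(void)
{
  domain_lock(&bgp1);                 /* protocol domain held, as before */
  domain_lock(&shutdown_msg_pool);    /* now allowed: ordered after the protocol */
  /* ... copy the Notification text into the long-lived pool ... */
  domain_unlock(&shutdown_msg_pool);
  domain_unlock(&bgp1);
}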
Future Development
For BIRD 3.2, we expect to resolve this in a different way. There is actually a bit more data that needs to be kept over protocol instance downtime, and the protocol pool system will change to accommodate this, but that’s an update too complex for this kind of bug, where the fix has to be targeted. Refactoring is not desirable when dealing with a remote exploit.
I’d like to thank Rob Lister from LONAP for reporting the bug and then sending the crash dump in private, so that we could fix it under an embargo and execute some amount of limited disclosure.
Limited disclosure processes are like fire drills. We all laugh at them, mock them and neglect them, until something starts to burn. This was our first real security bug with some denial-of-service potential. It was probably low-impact, because many people stick with BIRD 2, which is not affected. Yet it was scary enough, and we found some problems in our processes that we have to fix.
Let’s hope that the next time we use the limited disclosure processes, it will be just a drill.