BIRD Journey to Threads. Chapter 3½: Route server performance

All the work on multithreading needs to be justified by performance improvements. This chapter compares the times measured with versions 3.0-alpha0 and 2.0.8, showing some data and discussing them.

BIRD is a fast, robust and memory-efficient routing daemon designed and implemented at the end of the 20th century. We’re making significant changes to BIRD’s internal structure to make it run in multiple threads in parallel.

Testing setup

There are two machines in one rack. One of these simulates the peers of a route server, the other runs BIRD in a route server configuration. First, the peers are launched, then the route server is started and one of the peers measures the convergence time until routes are fully propagated. Other peers drop all incoming routes.

There are four configurations: Single, where all BGPs are connected directly to the main table; Multi, where every BGP has its own table and filters run on the pipes between them; and finally Imex and Mulimex, which are effectively Single and Multi with the auxiliary import and export tables enabled on all BGPs.
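
For illustration, here is a rough sketch of what one peer could look like in the Multi variant. The names, addresses and AS numbers are made up; the actual test configurations are generated by the test scripts in the repository.

    # one table and one pipe per peer; the import filter runs on the pipe
    ipv4 table peer1_tab;

    protocol bgp peer1 {
      local as 64500;
      neighbor 198.51.100.1 as 65001;
      ipv4 {
        table peer1_tab;
        import all;
        export all;
      };
    }

    protocol pipe peer1_pipe {
      table master4;
      peer table peer1_tab;
      import filter dummy_import;   # the dummy filter described below
      export all;
    }

In the Single variant the same filter sits directly on the BGP channel instead; the Imex and Mulimex variants additionally enable the auxiliary tables with the import table and export table switches inside the BGP channel.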

All of these use the same short dummy filter on route import to provide a consistent load. The filter contains no meaningful logic; it is just some dummy work to keep the CPU busy without memory contention. Real filters also don’t suffer from memory contention, with the exception of ROA checks. Optimizing ROA is a task for another day.
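
A minimal sketch of such a dummy import filter (not the exact one from the test suite) could look like this:

    filter dummy_import
    int i;
    {
      # no meaningful decisions, just a bit of work for the filter interpreter
      i = 3;
      i = i * 7;
      if i != 21 then reject;   # never taken
      bgp_local_pref = 100;
      accept;
    }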

There are other parts of BIRD waiting for performance assessment. As the (by far) most demanding BIRD setup is a route server in an IXP, we chose to optimize and measure BGP and filters first.

The hardware used for testing is an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 8 physical cores and two hyperthreads on each, and 32 GB of RAM.

Test parameters and statistics

A BIRD setup may scale on two major axes: the number of peers and the number of routes / destinations. (There are more axes, e.g. filter complexity, the routes / destinations ratio, or the IGP topology size.)

Scaling the test on route count is easy, just by adding more routes to the testing peers. Currently, the largest test data I feed BIRD with is about 2M routes for around 800K destinations, due to memory limitations. The routes / destinations ratio is around 2.5 in this testing setup, trying to stay close to real-world route servers.

BIRD can handle much more in real life, the actual software limit is currently a 32-bit unsigned route counter in the table structure. Hardware capabilities are already there and checking how BIRD handles more than 4G routes is certainly going to be a real thing soon.

Scaling the test on peer count is easy, until you get to higher numbers. When I was setting up the test, I configured one Linux network namespace for each peer, connecting them by virtual links to a bridge and by a GRE tunnel to the other machine. This works well for 10 peers but setting up and removing 1000 network namespaces takes more than 15 minutes in total. (Note to myself: try this with a newer Linux kernel than 4.9.)
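
For the record, the per-peer plumbing was roughly of the following shape. This is only a sketch with made-up names and addresses, not the actual test scripts.

    # bridge on the peer machine (created once)
    ip link add br0 type bridge
    ip link set br0 up

    # one namespace per peer, connected to the bridge by a veth pair
    ip netns add peer1
    ip link add veth-peer1 type veth peer name eth0 netns peer1
    ip link set veth-peer1 master br0 up
    ip -n peer1 addr add 198.51.100.1/24 dev eth0
    ip -n peer1 link set eth0 up

    # GRE tunnel towards the route server machine
    ip tunnel add gre-rs mode gre local 192.0.2.1 remote 192.0.2.2
    ip link set gre-rs up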

Another problem of test scaling is bandwidth. With 10 peers, everything is OK. With 1000 peers, version 3.0-alpha0 peaks at more than 600 Mbps of traffic, which is just about the bandwidth of the whole setup. I’m planning to design a better test setup with fewer chokepoints in the future.

Hypothesis

Two versions are subjected to the test. One of them is 2.0.8, serving as the initial reference point. The other is version 3.0-alpha0, named bgp, as parallel BGP is implemented there.

The major problem of large-scale BIRD setups is convergence time on startup. We assume that a multithreaded version should reduce the overall convergence time, at most by a factor equal to the number of cores involved. Here we have 16 hyperthreads, so in theory we should reduce the times up to 16-fold, yet this is almost impossible, as a non-negligible amount of time is spent in bottleneck code like best route selection or some cleanup routines, which became the bottleneck only after the other parts were made parallel.
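
As a back-of-the-envelope illustration (the serial fraction below is just a made-up example, not a measured value), Amdahl’s law gives the achievable speedup with a serial fraction s of the work and N threads:

    S(N) = 1 / (s + (1 - s) / N)

With s = 0.1 and N = 16, the limit is about 6.4-fold, so even a small serial part pushes the ceiling well below 16.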

Data

Four charts are included here, one for each setup. All axes have a logarithmic scale. The route count on the X axis is the total route count in the tested BIRD, different color shades belong to different versions and peer counts, and time is plotted on the Y axis.

Raw data is available in Git, as well as the chart generator. Strange results caused by testbed bugs have already been omitted.

There is also a line drawn at the 2-second mark. Convergence is checked by periodically running birdc show route count on one of the peers, and BGP peers also have a 1-second connect delay (the default is 5 seconds). All measured times shorter than 2 seconds are therefore highly unreliable.
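
The check itself can be as simple as the following sketch, assuming the measuring peer runs BIRD and EXPECTED holds the route count we wait for; the actual testbed scripts are more involved.

    EXPECTED=2000000
    start=$(date +%s)
    while :; do
      count=$(birdc show route count 2>/dev/null | grep -o '^[0-9]\+' | head -n1)
      [ "${count:-0}" -ge "$EXPECTED" ] && break
      sleep 1
    done
    echo "converged after $(( $(date +%s) - start )) s"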

Single-table setup

Data plot for Single. See raw data in Git: doc/threads/stats.csv

The single-table setup has its times reduced to about 1/8 when comparing 3.0-alpha0 to 2.0.8. The speedup for the 10-peer setup is slightly worse than expected and there is still some room for improvement, yet an 8-fold speedup on 8 physical cores and 16 hyperthreads is good enough for me now.

The most demanding case with 2M routes and 1k peers failed. On 2.0.8, my configuration converges after almost two hours, with the speed of route processing steadily decreasing until only several routes per second are processed. Version 3.0-alpha0 bloats in memory for some non-obvious reason and couldn’t fit into 32 GB of RAM. There is definitely some work ahead to stabilize BIRD’s behavior in extreme setups.

Multi-table setup

Data plot for Multi. See raw data in Git: doc/threads/stats.csv

The multi-table setup got the same speedup as the single-table setup, no big surprise. The largest cases were not tested at all, as they don’t fit well into 32 GB of RAM even with 2.0.8.

Single-table setup with import and export tables

Data plot for Imex. See raw data in Git: doc/threads/stats.csv

Multi-table setup with import and export tables

Data plot for Mulimex. See raw data in Git: doc/threads/stats.csv

Setups with import / export tables are also sped up, by a factor of about 6 to 8. The data for the largest setups (2M routes) show some strangely inefficient behaviour. Considering that both the single-table and multi-table setups yield similar performance data, there is probably some unwanted inefficiency in the auxiliary table code.

Conclusion

BIRD 3.0-alpha0 is a good version for preliminary testing in IXPs. There is some speedup in every testcase and the code is stable enough to handle typical use cases. Some test scenarios ran out of available memory and there is definitely a lot of work ahead to stabilize this, yet for now it makes no sense to postpone this alpha version any longer.

We don’t recommend upgrading a production machine to this version yet; however, if you have a test setup, running version 3.0-alpha0 there and reporting bugs is very welcome.

Notice: Multithreaded BIRD, at least in version 3.0-alpha0, doesn’t limit its number of threads. It will spawn at least one thread per BGP, RPKI and Pipe protocol, one thread per routing table (including auxiliary tables) and possibly several more. It’s up to the machine administrator to set up a limit on CPU core usage by BIRD. When running with many threads and protocols, you may also need to raise the filedescriptor limit: BIRD uses 2 filedescriptors per thread for internal messaging.
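
One possible way to do both when starting BIRD by hand is sketched below; this is an illustration, not an official recommendation. With systemd, the CPUAffinity= and LimitNOFILE= unit directives achieve the same.

    ulimit -n 16384                              # room for 2 fds per thread plus sockets
    taskset -c 0-7 bird -c /etc/bird/bird.conf   # confine BIRD to CPUs 0-7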

It’s a long road to version 3. By releasing this alpha version, we’d like to encourage every user to try this preview. If you want to know more about what is being done and why, you may also check the full blogpost series about multithreaded BIRD. Thank you for your ongoing support!

Comments (9)

  1. Douglas Fischer says:

    Hello Maria!
    I am very happy and grateful to see these evolutions!

    I’ll take the liberty of making some comments and suggestions:

  2. Douglas Fischer says:

    The 1st one is about the memory limitation in your test environment.
    If we do a Crowdfunding to quickly and easily get a memory upgrade in this scenario to something like 256GB of memory:
    – Would that help you?
    – Would it be accepted by your project?

    P.S. Do you believe that if there were more resources for these test environments, it would help the evolution of the BIRD project?
    Is this a limiting factor?

    • Maria Matějka says:

      Thank you! First, there is a larger machine on the way, so this problem should be solved in several months. Still, I’ll try to find out what machinery would help us most if there was some crowdfunding; it looks like a good idea indeed.

      On the other hand, the single major problem we face is lack of developers, all the development and testing simply takes lots and lots of people-days. We are hiring: https://www.nic.cz/page/321/kariera-v-cznic/#prog_bird

  3. Douglas Fischer says:

    The 2nd question – BGP filters and their impact on performance, especially when multithreaded.

    I could be wrong, but I can’t imagine any scenario where the complexity of the BGP filters is greater than in the IXP Route-Servers. Route-server operators have to be real soothsayers to foresee the types of mistakes and mischief that can be made by IXP participants.

    Another practical observation I can share is that the performance impact on route-servers and route-reflectors is very noticeable when using stricter route filtering rules.

    I’m not a programmer, and I know almost nothing about it. But I imagine that the more complex the filtering tests on the BGP routes, the greater the chance of some kind of lock in the mutex.

    Could the filtering rules used in these comparative performance tests be made heavier?
    Things like: next-hop validation, per-peer as-path regex, per-peer prefix-lists including different passes depending on marked communities and prefix length come to my mind.

    Just thinking about what I just mentioned above, I can imagine that randomly creating these validation criteria for a load test must be something so complicated that it might make implementation unfeasible…
    But… As they say here in Brazil: “The answer “No” I already have before I even ask.”

    • Maria Matějka says:

      Good questions!

      Of course the IXPs are driving the multithreaded initiative the most; nobody else has a 20-minute startup time until everything converges. There are other kinds of usage (route collectors) where BIRD is also scaled up quite a lot, yet in a different way. These should also benefit from multithreading.

      Regarding the filtering rules, we are quite lucky. Almost everything is lockless, with the exceptions of ROA checks and nexthop resolving; both of these should get improvements in some future version. Therefore, the heavier your filtering rules are, the better speedup you should get.

      I think the data with heavy filtering rules have already been deleted, as they were measured on some development version, yet there was (on an 8-core, 16-thread CPU) a speedup of almost 16x. We will measure that again in the future, of course.

      (Pseudo)randomly creating validation criteria is something I already do to some extent, yet it definitely needs a lot more effort to be polished and done better. (It is a nasty combination of Bash and Perl. It needs a deep breath and a bottle of wine before reading.)

  4. Douglas Fischer says:

    3rd Question – Tests beyond performance.

    I imagine the biggest drivers for BIRD multi-threading support are IXPs with route-servers in really huge scenarios.

    On the other hand, trying to think with the head of those who operate IXPs, I wonder how worrying and difficult it is to consider the possibility of a version change as significant as the one proposed from BIRD v2 to v3.
    And this concern is compounded and confirmed if we look at the fact that some of the biggest IXPs in the world have not even upgraded from v1.6 to v2.0 of BIRD.

    I’m convinced that the BIRD development flow already has a good test-oriented development component.
    But it is known that code tests are different from production tests in the real world.

    So I would like to ask whether, in addition to focusing on performance in these tests that are being carried out, it would also be possible to include stability tests with criteria suggested collaboratively by IXP operators.

    • Maria Matějka says:

      We already have some functionality tests; you may see their results in our GitLab instance here, in the rightmost column: https://gitlab.nic.cz/labs/bird/-/pipelines/95892

      (You may ignore the FreeBSD build failure; it is a temporary problem with our test setup, BIRD builds well on FreeBSD.)

      My colleagues are working heavily on these tests and the source code of that suite is in the bird-tools repository. https://gitlab.nic.cz/labs/bird-tools/

      We should move these tests to the main repository; it is in the plan. If there were more stability tests, it would be deeply appreciated; we’re open to discussing what these tests should look like.

  5. lubo says:

    Just curious, what does peer mean? Is it a virtual DNS user (sending some random DNS requests), or is it some other BIRD server? Why is there a problem with 1000 of them? Is it a problem in reality (or does it not reflect a real-world use case)? I ask because I like simulating many clients. My last project in this area was simulating 250k IoT devices with TCP connections (on a single virtual server).

    • Maria Matějka says:

      BIRD is not involved in DNS at all. The peers are other routers (BIRD, Juniper, Cisco, whatever) connected by BGP, sending and receiving routes. BGP connections are TCP; they are kept alive for a long time, updating the neighbors as changes happen. BIRD keeps quite a lot of state information about every peer, together with some TX buffers (these are probably the reason for the memory bloating on cold start).

      1000 peers is a real problem; there are route servers with even more peers and we have to be stable and memory-efficient in such cases. For now, my preferred solution is limiting the per-peer BGP TX buffers, yet there are some other tasks to do before thoroughly looking into this.
