Knot Resolver 6 News: DoS protection – technical solution

In the previous article of this series, we outlined, from a high-level point of view, how Knot Resolver 6 and Knot DNS 3.4 protect themselves, as well as other participants on the Internet, against denial-of-service attacks. Let us now dive deeper into the implementation and take a look at the actual technical solution used to achieve this kind of protection.


In this article we shall introduce a custom-designed data structure that we use to keep track of potential attackers.

Once again, this effort is a part of the ongoing DNS4EU project, co-funded by the European Union[1]. All of the code described here is free and open source under the GPL license.

KRU: The keeper of counters

Let us first start with a correction of the most glaring inaccuracy we left in the last article. Quite obviously, it is not possible to keep an accurate count of queries and/or CPU time for every single IP address in the IPv4 space, much less in the IPv6 space, in a performant manner. However, we do not actually need the counting to be perfectly accurate. Since we are working with multiple levels of accuracy anyway – measuring multiple prefix lengths at once to be able to protect against attacks coming from networks of different sizes – we can make do with an approximation, trading a little precision for increased performance and manageable memory usage.

The KRU[2] is a custom data structure we use to make the counter data fit into the memory of an actual physical computer. It was designed as a combination of several known techniques, adapted and optimized for the particular requirements of DoS protection in Knot Resolver.

Top-level structure

Let us now describe KRU’s design in an iterative manner, with each step adding on top of what we already know.

Space-saving algorithm

The basis of KRU is a very simple “space-saving” algorithm[3], which has surprisingly good properties on practical inputs[4]. The algorithm keeps a fixed number of key-counter pairs, where the key is the source IP address a query came from, or a prefix of it.

One interesting situation to note here is insertion while the table is full. At that point, we find the lowest counter in the structure and overwrite its key, but we keep the counter value and increment it as usual (assume an equal instant limit for all counters for now).

Keeping the counter value instead of zeroing it is important – it improves the behavior in the edge case where two or more different keys keep evicting each other. The keys then effectively share the counter, instead of it getting repeatedly reset to zero.
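
To make this concrete, below is a minimal sketch of such a table in C. The names, types, and sizes are illustrative rather than the actual Knot code, and the decay and limit logic described later is omitted:

```c
#include <stdint.h>

#define TABLE_SIZE 15   /* capacity; the real choice is explained later */

struct pair {
    uint64_t key;       /* source address or prefix (illustrative) */
    uint32_t counter;   /* accumulated load of that key */
};

/* Charge one query to `key`. If the key is not tracked and the table is
 * full, evict the pair with the lowest counter, but KEEP the counter
 * value, so that mutually evicting keys effectively share it. */
static void space_saving_hit(struct pair table[TABLE_SIZE], uint64_t key)
{
    int min_i = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].key == key) {
            table[i].counter++;    /* key already present */
            return;
        }
        if (table[i].counter < table[min_i].counter)
            min_i = i;
    }
    table[min_i].key = key;        /* overwrite the minimum's key... */
    table[min_i].counter++;        /* ...but keep and bump its counter */
}
```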

Sharding by hashing

Running the space-saving algorithm with a very large number of keys would not be very efficient. To solve this, why not create a hash-indexed array in which each entry is a small bucket running the space-saving algorithm internally?

This way, we can first find the right bucket in the array in O(1) time, using a calculated hash of the key, and then find the correct key-counter pair in the much smaller bucket.

Load-balancing the buckets

Hashing like this has one disadvantage: the (pseudo-)random distribution of items caused by the hashing leads to a good level of utilization in most buckets, but a small fraction of them tends to get significantly overloaded. To counter this, we utilize the two-choice paradigm (the “power of two choices”), which brings a very significant load-balancing effect fairly cheaply.

This way, a key actually points us to two buckets, one in each of two separate tables, and we search for the matching key-counter pair in both buckets at once. If it is already present in either one, we use it and increment the counter. If not, we select the lowest counter of the whole set of counters from both buckets, evict its key, and replace it with the new one. In some sense, we thus always choose the “less full” bucket of the two available.

The important thing to note here is that each distinct key will pair the buckets in a different way, since we use different hashes to index each of the tables (more on this below), i.e. there is no direct relationship between any two buckets from each of the tables. This provides us with the strong load-balancing effect we want to achieve.
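
A sketch of this lookup follows, again with illustrative names and types (idx1 and idx2 stand for the two independently hashed bucket indices; more on how they are derived below):

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKET_SIZE 15

struct pair   { uint64_t key; uint32_t counter; };
struct bucket { struct pair slot[BUCKET_SIZE]; };

/* Find or allocate the counter for `key` across its two buckets, one per
 * table. If the key is in neither bucket, evict the overall minimum of
 * both buckets, keeping its counter value as described above. */
static uint32_t *two_choice_fetch(struct bucket *table1, struct bucket *table2,
                                  uint64_t key, size_t idx1, size_t idx2)
{
    struct bucket *b[2] = { &table1[idx1], &table2[idx2] };
    struct pair *min = &b[0]->slot[0];
    for (int t = 0; t < 2; t++) {
        for (int i = 0; i < BUCKET_SIZE; i++) {
            struct pair *p = &b[t]->slot[i];
            if (p->key == key)
                return &p->counter;   /* present in one of the buckets */
            if (p->counter < min->counter)
                min = p;              /* track the minimum of both buckets */
        }
    }
    min->key = key;                   /* evict the minimum, keep its counter */
    return &min->counter;
}
```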

Saving memory, saving time

Now this is a good basis for our structure, but we are not done just yet. To minimize memory usage and maximize performance, we still have some optimizations to do.

Lazy decay

As described in the previous article, we do not just add to the counters, but we also apply exponential decay.

Decaying the whole KRU periodically would be costly in terms of memory accesses.

Instead, we could keep a last-access timestamp along with each counter, then scale the decay by the difference between the timestamp and the current time each time we access a counter (and update said timestamp). But that would cost a lot of memory.

As a sort of middle ground, we store a shared timestamp in each bucket. Whenever we access a bucket, we first ensure that all of its counters are decayed according to the time difference, update the timestamp, then do the rest of the intended operations on it.

This decay update happens at most once per millisecond for each bucket, as that’s the time granularity that we’ve chosen for our use case.
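
A sketch of this lazy decay (HALF_LIFE_MS and the floating-point pow() are purely illustrative; the real code works with precomputed fixed-point multipliers):

```c
#include <math.h>
#include <stdint.h>

#define BUCKET_SIZE  15
#define HALF_LIFE_MS 1000   /* illustrative decay half-life */

struct bucket {
    uint32_t time;                   /* time of the last decay, in ms */
    uint16_t counter[BUCKET_SIZE];
};

/* Apply the decay owed since the last access, at most once per ms. */
static void bucket_decay(struct bucket *b, uint32_t now_ms)
{
    uint32_t elapsed = now_ms - b->time;
    if (elapsed == 0)
        return;                      /* already decayed this millisecond */
    double factor = pow(0.5, (double)elapsed / HALF_LIFE_MS);
    for (int i = 0; i < BUCKET_SIZE; i++)
        b->counter[i] = (uint16_t)((double)b->counter[i] * factor);
    b->time = now_ms;
}
```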

No keys, all hashes

Storing entire keys would be relatively costly in terms of memory, as IPv6 addresses take 16 bytes each – but as it turns out, we do not actually need to. Instead, we store only 16-bit hashes and let items collide. 16 bits may look like too little at first glance, but remember that the sharding has already reduced collisions significantly, so each bucket only “sees” a small fraction of all keys.

You may be wondering how we calculate all the hashes we need. Well, for each key, we actually calculate a single 64-bit hash, then cut three parts out of it, bitwise. One 16-bit piece is used as the key for the space-saving algorithm and two pieces are used as indices to the hash tables. The bit-length of said indices is variable, based on the table capacity, which is a user-configurable parameter.
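
As a sketch (the concrete bit positions and the TABLE_BITS value are illustrative; in the real code the index width follows the configured capacity):

```c
#include <stdint.h>

#define TABLE_BITS 16   /* log2 of a table's size; configuration-dependent */

struct kru_hashes {
    uint16_t key16;     /* 16-bit stand-in for the key inside a bucket */
    uint32_t idx1;      /* bucket index into the first table */
    uint32_t idx2;      /* bucket index into the second table */
};

/* Cut three bit-fields out of a single 64-bit hash of the key. */
static struct kru_hashes split_hash(uint64_t h)
{
    const uint32_t mask = (1u << TABLE_BITS) - 1;
    return (struct kru_hashes) {
        .key16 = (uint16_t)h,
        .idx1  = (uint32_t)(h >> 16) & mask,
        .idx2  = (uint32_t)(h >> (16 + TABLE_BITS)) & mask,
    };
}
```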

Counter precision

To further save memory, our counters are only 16-bit, which also gives us an opportunity to improve speed through vectorization (more on that below).

This may appear to make the mechanism very imprecise, but we use a little trick to minimize the impact. All calculations are actually performed in 32-bit arithmetic, with fixed-point real numbers whose point is exactly in the middle: 16 bits on either side. After a calculation is done and before the result is written into memory, the result is rounded probabilistically, using the fractional part of the number (i.e. the 16 least significant bits) as the probability that the value will be rounded up. This way, even if counter increments are smaller than “one”, the overall value keeps summing up very accurately. Since for limiting we are only interested in recognizing whether the summed-up values have reached the limit, this is a good way to keep memory usage low while keeping the precision high.
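
A sketch of this probabilistic rounding; random16() stands in for whatever cheap pseudo-random source is available:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative 16-bit pseudo-random source; the real code uses its own. */
static uint16_t random16(void)
{
    return (uint16_t)(rand() ^ (rand() << 8));
}

/* Round a 32-bit fixed-point value (16 integer + 16 fractional bits)
 * down to a 16-bit counter, rounding up with a probability equal to the
 * fractional part, so that small increments still sum up accurately. */
static uint16_t round_probabilistic(uint32_t fixed_16_16)
{
    uint16_t integer  = (uint16_t)(fixed_16_16 >> 16);
    uint16_t fraction = (uint16_t)fixed_16_16;
    if (random16() < fraction)
        integer++;   /* e.g. a .25 fraction rounds up 25% of the time */
    return integer;
}
```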

Furthermore, for rate-limiting we scale the prices of DNS operations depending on the configured instant limit L_I. A single query actually does not increment the counter by one. Rather, the increments and the respective limits are scaled using the ratio 2^16/L_I, so that the range of counter values corresponds to the range allowed by the configured limits. This way, we better utilize the scarce counter bits, and values of counters with different limits become comparable. Depending on the configured limit, we may end up scaling either up or down. Scaling up gives us higher counter precision, while scaling down lets the limit be higher at the cost of a precision decrease (still using the aforementioned probabilistic rounding to make up for some of the loss).
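
A simplified sketch of the scaling, expressed in the same 16.16 fixed point as above (illustrative; the real computation differs in details):

```c
#include <stdint.h>

/* Fixed-point (16.16) counter increment for a unit-price query, given the
 * configured instant limit L_I: (2^16 / L_I) << 16 == 2^32 / L_I.
 * Assumes instant_limit > 1, so the result fits into 32 bits. */
static uint32_t unit_increment(uint32_t instant_limit)
{
    return (uint32_t)(((uint64_t)1 << 32) / instant_limit);
}
```

For example, with L_I = 1000, each unit-price query adds about 65.54 counter units (stored via the probabilistic rounding above); roughly 1000 such queries, minus whatever the decay eats in the meantime, bring the counter to its 16-bit maximum.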

By the way, in the previous article we mentioned that counters never increase beyond the hard instant limit. This is the reason. Since the limits and the increments are scaled up to the counters’ 16 bits, the limit is actually just above the maximum value the counter can technically represent.

CPU cache awareness

For better efficiency, especially on concurrent accesses from many CPU cores, it is good to limit the number of cache lines accessed while processing a unit of data. Therefore, we size buckets to map exactly onto x86 CPU cache lines, which are 64 bytes long.

When we put all the described optimizations together, we only need 16+16 bits per hash-counter pair and a single (32-bit) timestamp per bucket. This way, we manage to fit 15 entries into a single cache-line-mapped bucket.
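
A sketch of such a bucket layout (GCC/Clang syntax; the field names are illustrative, not the actual Knot definitions):

```c
#include <stdint.h>

#define BUCKET_SIZE 15

/* One bucket mapped exactly onto a 64-byte x86 cache line:
 * 4 B timestamp + 15 x 2 B hashes + 15 x 2 B counters = 64 B. */
struct kru_bucket {
    uint32_t time;                  /* shared decay timestamp */
    uint16_t hash[BUCKET_SIZE];     /* 16-bit key hashes */
    uint16_t counter[BUCKET_SIZE];  /* 16-bit counters */
} __attribute__((aligned(64)));

_Static_assert(sizeof(struct kru_bucket) == 64,
               "a bucket must fill exactly one cache line");
```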

Summarizing picture

[Diagram: the internal implementation of the KRU, putting together all the parts described above.]

Low-level optimizations

Concurrency

The KRU structure needs to be shared by multiple threads (in the case of Knot DNS) or processes (in the case of Knot Resolver). For example, source port randomization typically causes the UDP queries of a single client to be distributed among multiple threads/processes on the machine, as opposed to a single connection (TCP, TLS, DoH), which is bound to a single thread/process.

Accesses to the KRU are lock-free. The hashing scheme itself already reduces collisions from concurrent accesses, and we use atomic instructions to further reduce inaccuracies that might arise from races.
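
For illustration, a counter update using C11 atomics; relaxed ordering is enough here, as occasional races only cause negligible counting error:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Lock-free increment of a shared 16-bit counter. With relaxed memory
 * ordering this compiles down to a plain atomic add; no locks needed. */
static inline void counter_add(_Atomic uint16_t *counter, uint16_t inc)
{
    atomic_fetch_add_explicit(counter, inc, memory_order_relaxed);
}
```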

Cache line access optimization

While evaluating a query, we need to work with data from two KRU buckets for each of its source address prefixes. To speed these accesses up, we utilize the ability of x86_64 CPUs to prefetch specific cache lines before we actually start working with the data.
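
A sketch using the SSE prefetch intrinsic (GCC/Clang’s __builtin_prefetch would serve equally well):

```c
#include <xmmintrin.h>   /* _mm_prefetch */

/* Ask the CPU to start pulling both buckets' cache lines towards L1
 * early, so the data has (hopefully) arrived by the time we read it. */
static inline void buckets_prefetch(const void *bucket1, const void *bucket2)
{
    _mm_prefetch((const char *)bucket1, _MM_HINT_T0);
    _mm_prefetch((const char *)bucket2, _MM_HINT_T0);
}
```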

We then perform decays on the buckets and check whether any of the considered counters would exceed their limits if increased. If they would, we refrain from increasing any of them; otherwise, we increase all of them.

This way, we often need only read accesses for restricted queries (except for the decay once per millisecond), which provides us with further speed-ups. Apart from simply reducing the number of operations, this also helps because read-only operations do not require evicting cache lines from other cores’ dedicated caches[5], which may save us some more cycles during intensive attacks.
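
In a simplified sketch, with the per-prefix counters already located within their buckets (the flat arrays and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Check all counters first; charge the query's price only if none of
 * them would exceed its limit. A restricted query thus causes no writes
 * at all, so the cache lines can stay shared between CPU cores. */
static bool kru_try_charge(uint16_t *counter[], int n,
                           const uint16_t limit[], const uint16_t price[])
{
    for (int i = 0; i < n; i++)
        if ((uint32_t)*counter[i] + price[i] > limit[i])
            return false;           /* over a limit: restrict, read-only */
    for (int i = 0; i < n; i++)
        *counter[i] += price[i];    /* all within limits: charge them all */
    return true;
}
```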

SIMD instructions

Most reasonably recent x86_64 processors support so-called SIMD[6] instructions as part of the SSE*, AVX2, and AES instruction set extensions. These are able to perform the same operation on multiple values at once. If these instructions are available, we use their intrinsics[7] in the code to get better performance.

The bulk of this hand-optimization went into searching arrays of 16-bit numbers using SSE* and AVX2. This is the vectorization mentioned above. We also use the AES instructions to get faster hashing with “good enough” properties (not of cryptographic strength, but good enough for hash table accesses).
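
For a taste of it, here is a simplified SSE2 sketch searching a 16-entry array of 16-bit hashes, eight values per comparison (the real code also handles the 15-entry buckets and an AVX2 variant):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Return the index of `needle` in `hashes`, or -1 if absent. Each
 * iteration compares 8 of the 16-bit values at once. */
static int find_hash_sse2(const uint16_t hashes[16], uint16_t needle)
{
    const __m128i n = _mm_set1_epi16((int16_t)needle);
    for (int base = 0; base < 16; base += 8) {
        __m128i h = _mm_loadu_si128((const __m128i *)(hashes + base));
        int mask  = _mm_movemask_epi8(_mm_cmpeq_epi16(h, n));
        if (mask != 0)   /* two mask bits per matching 16-bit lane */
            return base + __builtin_ctz((unsigned)mask) / 2;
    }
    return -1;
}
```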

Conclusion

This concludes the second part of DoS protection in Knot Resolver 6.

We have introduced our new data structure, the KRU, which we use to store counters per source IP address. We have described the top-level structure, starting from the space-saving algorithm, which is used as the base. Then we moved on to sharding via hash-indexed arrays and load balancing via splitting the data into two distinctly hashed tables.

To save memory and CPU usage, we needed to take precautions from the very beginning of the structure’s design. We have described how we apply exponential decay to the KRU’s entries lazily, to avoid processing the whole table every millisecond; how keys are only ever matched by their hashes, never by exact comparison; how we use counters with reduced precision to save memory (recovering some of the precision via probabilistic rounding); and how the structure is CPU cache-aware to save time while loading data from memory.

In the last section, we described some of the more advanced optimizations: lock-free concurrency, cache line access optimizations, and the usage of SIMD instructions.


  1. Project number: 101095329 21-EU-DIG-EU-DNS
    Project name: DNS4EU and European DNS Shield.
    This project is co-funded by the European Union.

  2. Short for “(k)not recently used”.

  3. A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, 2005.

  4. G. Cormode and M. Hadjieleftheriou. Methods for finding frequent items in data streams. The VLDB Journal, 19(1):3–20, 2010.

  5. See MESI and similar cache-coherence protocols.

  6. Single instruction, multiple data.

  7. Function-like constructs built into the C compiler, which are replaced with their respective CPU instructions during machine code generation.


Authors: Vladimír Čunát, Lukáš Ondráček, Oto Šťáva
