New Statistics – What’s under the Hood?

CZ.NIC has a long tradition of acquiring, processing and publishing data about the operation of its authoritative DNS servers, the public resolver ODVR, the CZ domain registry and the mojeID service. Although the previous (and still active) web pages with statistics offered many graphs and extensive options for setting their parameters, they tended to evoke the refrain of the popular Czech song by Zdeněk Svěrák and Jaroslav Uhlíř: “Statistics is boring, albeit having valuable data …”. One of the very few positive outcomes of the covid pandemic was, thanks to the efforts of Johns Hopkins University and many other institutions, a considerably raised standard of statistical visualisation, one that our old statistics certainly do not meet.

After more than a year of development and testing in the ADAM project, we are now introducing new web pages with graphs and statistics. Their dashboard-like organization, together with innovative types of graphs, makes it possible to communicate even quite complex information about the status and operation of our major services in a relatively small space.

In fact, our old statistics turned out to be far from boring whenever we needed to change or update them. This invariably required us to look into multiple places, some of them with limited accessibility, and modify various obscure scripts written in several languages. Therefore, our goal was not only to make the contents and form of the statistics more comprehensive and attractive, but also to improve and consolidate the underlying data-processing system. Its current schema is depicted in the following figure.

ADAM architecture

Main components and data-processing workflow

I have already described the principal tools for raw data acquisition in my earlier blog posts:

  • DNS Probe is now installed on all authoritative DNS servers operated by CZ.NIC, as well as on the ODVR resolvers. One of the apparent benefits is the improved ability to pair DNS queries with replies, which is especially significant for DNS-over-TCP (50–100% increase).
  • DNS Crawler has been in routine operation for more than one year. Apart from supplying data for our statistics, it is also used for other purposes, mostly related to the activities of the CSIRT.CZ team.
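To illustrate what pairing DNS queries with replies involves, here is a minimal Python sketch. It is not DNS Probe's actual implementation: the `DnsMessage` record, the matching key (client address, client port, transaction ID and query name) and the five-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsMessage:
    """Hypothetical, simplified record of one captured DNS message."""
    client_ip: str
    client_port: int
    txid: int          # 16-bit DNS transaction ID
    qname: str         # queried domain name
    is_response: bool
    timestamp: float   # seconds


def pair_messages(messages, timeout=5.0):
    """Pair each response with the pending query that has the same key.

    Returns (pairs, unmatched_queries, unmatched_responses).
    """
    pending = {}            # key -> query still waiting for its response
    pairs = []
    unmatched_queries = []
    unmatched_responses = []

    for msg in sorted(messages, key=lambda m: m.timestamp):
        # Domain names are case-insensitive, so the key uses lower case.
        key = (msg.client_ip, msg.client_port, msg.txid, msg.qname.lower())
        if not msg.is_response:
            # A repeated query with the same key replaces the older one.
            if key in pending:
                unmatched_queries.append(pending[key])
            pending[key] = msg
            continue
        query = pending.pop(key, None)
        if query is None:
            unmatched_responses.append(msg)
        elif msg.timestamp - query.timestamp <= timeout:
            pairs.append((query, msg))
        else:
            # The response arrived too late to count as a match.
            unmatched_queries.append(query)
            unmatched_responses.append(msg)

    unmatched_queries.extend(pending.values())
    return pairs, unmatched_queries, unmatched_responses


captured = [
    DnsMessage("192.0.2.1", 5353, 1, "nic.cz.", False, 0.00),
    DnsMessage("192.0.2.1", 5353, 1, "NIC.cz.", True, 0.01),
    DnsMessage("192.0.2.2", 4000, 2, "example.cz.", False, 0.50),
    DnsMessage("192.0.2.3", 4001, 3, "example.cz.", True, 0.60),
]
pairs, lost_queries, lost_responses = pair_messages(captured)
```

In this sample only the first two messages form a pair. The other query and response differ in client address and transaction ID, so each ends up in its unmatched list.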

Both sources mentioned above generate huge data volumes that are stored in Apache Hadoop, a platform for distributed data storage and processing. My colleagues Pavel Doležal and Maciej Andziński wrote about it in their recent report. In addition, we also use data from the CZ domain registry and other internal databases. All data used for the statistics are first pre-processed and stored in tables of the service database (PostgreSQL). Using the amazing software PostgREST, we automatically generate a REST API on top of it, which provides easy access to the data not only to us but also to other internal or external users.
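To give an idea of how a PostgREST-generated API is queried, the following Python sketch builds a request URL using PostgREST's standard query syntax (`select` for columns, `column=operator.value` for filters, `order` and `limit`). The base URL, the table name `dns_daily_traffic` and its columns are hypothetical and were chosen only for illustration; they do not describe the actual ADAM database.

```python
from urllib.parse import urlencode


def postgrest_url(base_url, table, columns=(), filters=None, order=None, limit=None):
    """Build a PostgREST query URL for a single table.

    filters maps a column name to an (operator, value) tuple,
    for example {"day": ("gte", "2021-01-01")}.
    """
    params = []
    if columns:
        params.append(("select", ",".join(columns)))
    for column, (operator, value) in (filters or {}).items():
        params.append((column, f"{operator}.{value}"))
    if order:
        params.append(("order", order))
    if limit is not None:
        params.append(("limit", str(limit)))
    # Keep commas readable; all other special characters are percent-encoded.
    query = urlencode(params, safe=",")
    url = f"{base_url.rstrip('/')}/{table}"
    return f"{url}?{query}" if query else url


# Hypothetical endpoint and table, for illustration only.
url = postgrest_url(
    "https://api.example.org",
    "dns_daily_traffic",
    columns=("day", "queries"),
    filters={"day": ("gte", "2021-01-01")},
    order="day.desc",
    limit=7,
)
print(url)
```

Sending a GET request to such a URL would return the matching rows as JSON, so the same endpoint can serve R scripts, dashboards and external users.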

Our tool of choice for statistical data processing and visualization is the R language, which seems to be a perfect fit for such tasks. Thanks to the efforts of Hadley Wickham and the RStudio company, a set of highly useful and well-documented open-source libraries is now available as the Tidyverse package, which makes the R system accessible to mere mortals.

For the preparation of our dashboards, as well as other ADAM deliverables such as reports, we use R Markdown. This is another remarkable invention of RStudio that allows for combining text in the popular Markdown format with R scripts. Web pages are generated from R Markdown using the flexdashboard package, also developed by RStudio.

The contents of the pages with statistics are created simultaneously in Czech and English versions. Textual parts of every page are kept in two separate files, whereas R code blocks are shared – we use GNU gettext for their internationalization. This is particularly useful in our case because our R experts are both foreigners, so the Czech translations have to be prepared later by somebody else.
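The idea of sharing code blocks between language versions can be sketched as follows. The example uses Python's standard `gettext` module rather than the R tooling used in the project, and it replaces compiled message catalogues with a small in-memory dictionary so that it runs on its own. The class `DictTranslations`, the function `chart_labels` and the sample strings are hypothetical.

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Translations backed by a plain dict instead of a compiled .mo file."""

    def __init__(self, catalog):
        super().__init__()
        self._messages = dict(catalog)

    def gettext(self, message):
        # Fall back to the original (English) string if no translation exists.
        return self._messages.get(message, message)


# English is the source language, so it needs no catalogue.
ENGLISH = gettext.NullTranslations()
CZECH = DictTranslations({
    "Queries per day": "Dotazy za den",
    "Date": "Datum",
})


def chart_labels(translations):
    """Shared code: the same function serves every language version."""
    _ = translations.gettext
    return {"title": _("Queries per day"), "x_axis": _("Date")}


print(chart_labels(ENGLISH))
print(chart_labels(CZECH))
```

The code that draws a chart is written once with English strings, and a translator can add or correct the Czech catalogue later without touching the code.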

A great advantage of the current setup is that all code related to the statistics (R Markdown, but also other R scripts, CSS stylesheets etc.) is stored in GitLab, in the Adam Dashboard project. For generating and publishing the web pages we use continuous integration and deployment (CI/CD) in GitLab. After updated source files are pushed to the devel branch of the repository, a new development version of the pages is generated in a matter of minutes. The new contents are verified there, merged into the main branch, and eventually appear at https://stats.nic.cz.

The GitLab project repository is public (for reading), so interested readers may find all technical details there. We will appreciate error reports and requests for changes or enhancements, preferably via the Issues page.
