A daemon to build and keep fedmsg statistics.
Motivation: we have a neat service called datagrepper with which you can query the history of the fedmsg bus. It is cool, but insufficient for some more advanced reporting and analysis that we would like to do. Take, for example, the releng-dash. In order to render the page, it has to make a dozen or more requests to datagrepper to try to find the 'latest' events in large, awkward pages of results. Consequently, it takes a long time to load.
Enter statscache. It is a plugin to the fedmsg-hub that sits listening in our infrastructure. When new messages arrive, it passes them off to plugins that calculate and store various statistics. If we want a new kind of statistic to be kept, we write a new plugin for it. It comes with a tiny flask frontend, much like datagrepper, that allows you to query for this or that stat in this or that format (csv, json, maybe html or svg too, but that might be overkill). The idea is that we can then build neater, smarter frontends that render fedmsg-based activity very quickly, and perhaps later drill down into the details kept in datagrepper.
It is kind of like a data mart.
Create a virtualenv, then git clone this repo and the statscache_plugins repo. Run `python setup.py develop` in the statscache directory first, and then again in statscache_plugins.
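For reference, the whole sequence looks something like this (assuming the two repositories live under the fedora-infra organization on GitHub; adjust the URLs if your copies live elsewhere):

```console
$ virtualenv statscache-env && source statscache-env/bin/activate
$ git clone https://github.com/fedora-infra/statscache.git
$ git clone https://github.com/fedora-infra/statscache_plugins.git
$ cd statscache && python setup.py develop && cd ..
$ cd statscache_plugins && python setup.py develop && cd ..
```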
In the main statscache repo directory, run `fedmsg-hub` to start the daemon. You should see lots of fun stats being logged to stdout. To launch the web interface (which currently only serves JSON and CSV responses), run `python statscache/app.py` in the same directory. You can now view a list of the available plugins in JSON by visiting localhost:5000/api/, and you can retrieve the statistics recorded by a given plugin by appending its identifier to that same URL.
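For example, you can poke at the API from Python along these lines (a sketch; the plugin identifier below is a placeholder, and the exact response shape depends on the plugin):

```python
import requests

BASE = 'http://localhost:5000/api/'

# List the available plugins (served as JSON).
print(requests.get(BASE).json())

# Retrieve a given plugin's statistics by appending its identifier to
# the same URL. 'some.plugin.identifier' is a placeholder; use one of
# the identifiers from the listing above.
print(requests.get(BASE + 'some.plugin.identifier').json())
```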
You can run the tests with `python setup.py test`.
When a message arrives, a fedmsg consumer receives it and hands a copy to each loaded plugin for processing. Each plugin internally caches the results of this message processing until a polling producer instructs it to update its database model and empty its cache. The frequency at which the polling producer does so is configurable at the application level and is set to one second by default.
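In code, that cache-then-flush pattern looks roughly like the sketch below. The class, method names, and signatures here are assumptions for illustration, not the literal statscache plugin interface:

```python
from collections import Counter

class MessageCountPlugin(object):
    """ Toy plugin: counts fedmsg messages per topic.

    NOTE: 'process' and 'update' are illustrative names, not the
    literal statscache plugin API.
    """

    def __init__(self):
        # Results are cached in memory between flushes.
        self._cache = Counter()

    def process(self, message):
        # Called by the fedmsg consumer for each incoming message.
        self._cache[message.get('topic', 'unknown')] += 1

    def update(self, store):
        # Called by the polling producer (every second by default):
        # persist the cached results, then empty the cache.
        for topic, count in self._cache.items():
            store[topic] = store.get(topic, 0) + count
        self._cache.clear()

# Minimal usage, with a dict standing in for the database model:
plugin = MessageCountPlugin()
plugin.process({'topic': 'org.fedoraproject.prod.buildsys.build.state.change'})
db = {}
plugin.update(db)
print(db)
```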
There are base sqlalchemy models that each of the plugins should use to store their results (and we can add more types of base models as we discover new needs). But the important thing to know about the base models is that they are responsible for knowing how to serialize themselves to different formats for the REST API (e.g., `.to_csv()` and `.to_json()`).
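A model in that spirit might look roughly like this sketch (the table and column names are invented for the example; the `.to_csv()`/`.to_json()` responsibility is the part that comes from the actual design):

```python
import datetime
import json

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class TopicCountRecord(Base):
    """ Illustrative model: one row of per-topic message counts. """
    __tablename__ = 'topic_count_records'

    id = Column(Integer, primary_key=True)
    timestamp = Column(DateTime, nullable=False)
    topic = Column(String, nullable=False)
    count = Column(Integer, nullable=False)

    def to_json(self):
        # Serialize this row for the REST API's JSON responses.
        return json.dumps({
            'timestamp': self.timestamp.isoformat(),
            'topic': self.topic,
            'count': self.count,
        })

    def to_csv(self):
        # Serialize this row as a single CSV line.
        return '%s,%s,%d' % (self.timestamp.isoformat(), self.topic, self.count)

# Example (no database session is needed just to serialize):
record = TopicCountRecord(
    timestamp=datetime.datetime.utcnow(),
    topic='buildsys.build.state.change', count=3)
print(record.to_json())
print(record.to_csv())
```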
Even though statscache is intended to be a long-running service, the occasional reboot is inevitable. However, gaps in the processed history of fedmsg data may lead some plugins to produce inaccurate statistics. Luckily, statscache comes with a built-in mechanism to handle this transparently: on start-up, it checks the timestamp of each plugin's most recent database update and queries datagrepper for the fedmsg data needed to fill in any gap. If a plugin specifically does not need a continuous view of the fedmsg history, it may instead declare a "backlog delta," the maximum backlog of fedmsg data that would be useful to it.
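Sketched in code, the start-up catch-up decision might look like this (the attribute and function names are assumptions for illustration; only the behavior, resuming from the last database update but never reaching back further than a declared backlog delta, comes from the description above):

```python
import datetime

class FakePlugin(object):
    """ Stand-in plugin that only wants an hour of backlog (illustrative). """
    backlog_delta = datetime.timedelta(hours=1)

    def latest_update(self):
        # Timestamp of this plugin's most recent database update.
        return datetime.datetime.utcnow() - datetime.timedelta(days=2)

def catchup_window(plugin, now=None):
    """ Decide how far back to query datagrepper for a plugin on start-up. """
    now = now or datetime.datetime.utcnow()
    start = plugin.latest_update()
    # A declared backlog delta caps how much history is worth fetching.
    delta = getattr(plugin, 'backlog_delta', None)
    if delta is not None:
        start = max(start, now - delta)
    return start, now

print(catchup_window(FakePlugin()))
```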