Uh, yet another collector/grapher. That's nice but..
We have tons of collectors. And tons of graphers. What we have not is a little bit of smarts in that tools. Ability to predict and ability to react.
Predict. We have Holt-Winters Forecasting Algorithm implemented in RRDTool from 2005 and a couple of papers.
React. I'm not talking about 'fix it automagically'. But everyone wants to know 'wtf was that peak on this graph last night?'. Usually your never know, except the simplest cases. Because you cannot collect everything about everything all the time. But monitoring system could enable 'collect everything we can' for short period of time when it detects something. Something wrong or something strange, something out of the pattern. Does anybody hear about system with something like that?
Is what you're referring to something like Growl that pops up and says 'Hey, metrics.sessions.active just dropped by 70%", or something pre-configured to spin up additional VMs/instances/dynos when some metrics misbehave?
TL;DR: How autonomous is it?
It's certainly not available out-of-box, and it would take a bit of work, but it should be possible to build something like this with Heka. You could write a dynamic filter plugin that watched for peaks in a specific graph (or any other arbitrary trigger) and, if found, generate a message to trigger the activation of a bunch of new dynamic filters, quickly turning on a more comprehensive set of data analyzers.
The question is whether reacting to peaks identified this way can be helpful or not. If you have hundreds of thousands of metrics, how many peaks get detected per minute and how many of those indicate something that actually requires attention?
There are many ways to sort peaks out. For instance, there is a script for RRDtool that removes obviously irrelevant spikes [1].
The vast majority of hundreds of thousands metrics are common for any node/server actually, so predefined recipes/settings should be used for them. Only aplication-level metrics for only in-house applications have to be tuned manually.
We have tons of collectors. And tons of graphers. What we have not is a little bit of smarts in that tools. Ability to predict and ability to react.
Predict. We have Holt-Winters Forecasting Algorithm implemented in RRDTool from 2005 and a couple of papers.
React. I'm not talking about 'fix it automagically'. But everyone wants to know 'wtf was that peak on this graph last night?'. Usually your never know, except the simplest cases. Because you cannot collect everything about everything all the time. But monitoring system could enable 'collect everything we can' for short period of time when it detects something. Something wrong or something strange, something out of the pattern. Does anybody hear about system with something like that?