Historical Reporting for your PowerFlex cluster with Grafana

Ever wished you could drill down to any point in time in the history of your PowerFlex / VxFlex / ScaleIO cluster? Then please read on!

Using nothing but Open Source tools and the native REST API functionality of PowerFlex, you can unlock a tremendous amount of visibility into your environment:

  • Capacity utilisation – free space, thin, thick, compression
  • IOPS / BW / latency metrics
  • Rebuilds / Rebalances
  • Different views for the components, SDS, SDC, Volumes, etc.
  • Extensible – add-on email alert triggers, ping monitoring, and so on.

Enough talk though, please download the virtual appliance and documentation here. It’s a very straightforward setup which should take you less than an hour in total. (A big shout-out to Brian Dean for making this so simple.)

If you do get stuck with anything, or have any other wish list items, please do let us know in the comments section of this blog below.

Email Alerting:

Proper monitoring is critical to the success of any system, and PowerFlex is no exception in this regard. With the email alerting functionality in Grafana, it’s easy to set up critical thresholds that will trigger alerts on the conditions you care about.

Setup Instructions:

1. Edit the grafana.ini file:

[root@vxflexos-monitor ~]# vim /etc/grafana/grafana.ini

2. Search for the SMTP section:

/smtp

3. Edit this section of the file according to your environment:

######## SMTP / Emailing ########

[smtp]
enabled = true
host = your.mail.server:25
;user =
;# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
;password =
;cert_file =
;key_file =
skip_verify = true
from_address = your.email@company.com
from_name = Grafana PowerFlex Production
# EHLO identity in SMTP dialog (defaults to instance_name)
ehlo_identity = grafana.company.com
[emails]
welcome_email_on_sign_up = true

4. Save the file (Escape, :wq)

5. Restart the grafana server:

[root@vxflexos-monitor ~]# systemctl restart grafana-server

6. If you ever need to see the traffic hitting your email server for troubleshooting, use this tcpdump command (you will need to install tcpdump first with 'yum install tcpdump'):

[root@vxflexos-monitor ~]# tcpdump -vv -x -X -s 1500 -i eth0 'port 25'

7. As another test, you can (and should) also verify that you can send an email directly through your mail server via telnet (yum install telnet). There is a good guide on this here.
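Before restarting Grafana in step 5, it can be handy to print only the active (uncommented) settings of the [smtp] section, to confirm that the edit from step 3 looks the way you expect. The following is a minimal sketch, not part of the appliance: the helper name show_ini_section is hypothetical, and it assumes the file path used in step 1.

```shell
#!/bin/sh
# show_ini_section SECTION FILE
# Hypothetical helper (not shipped with Grafana): prints the lines of one
# INI section, skipping blank lines and lines commented out with ';' or '#'.
show_ini_section() {
    awk -v section="[$1]" '
        /^\[/ { in_section = ($0 == section); next }
        in_section && !/^[[:space:]]*([;#]|$)/ { print }
    ' "$2"
}

# Example (path as used in step 1):
#   show_ini_section smtp /etc/grafana/grafana.ini
```

Any line that shows up here without a "key = value" shape is worth a second look, since a stray uncommented line can stop the file from parsing cleanly.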

Jump back into your Grafana web console, and go to the Alerting section:

Next, set up a new Notification channel with your email address(es):

Copy any table into a new Dashboard:

And then remove the variable (Grafana alerts do not like variables):

Next, set your alert. In the example below, I have set it to send an alert if the Read or Write IOPS exceed 5000 for more than 5 minutes.

Probably a more useful one is to be alerted if any forward rebuilds are occurring:

Do also remember to Save your new dashboard!

While this email alerting functionality is very handy, it should be set up in conjunction with the ESRS, syslog and SNMP trap alerting that PowerFlex also supports. You should also periodically verify that your alerting system is working correctly.

Log Retention Period

By default, the InfluxDB database on the Grafana appliance retains data indefinitely, meaning that eventually you’ll either need to increase the disk space or simply run out of it!

So first things first, you can check how much space your current DB is using:

[root@vxflexos-monitor ~]# du -sh /var/lib/influxdb/data/telegraf/
1.5G    /var/lib/influxdb/data/telegraf/

From this size and the time that has passed since you began logging, you can estimate the growth rate of the DB, and therefore how much capacity you might need in the future.
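This estimate is simple proportional arithmetic. As a rough sketch, assuming growth is linear and that the 1.5G (1536 MB) shown above accumulated over 30 days (the 30 days is an invented figure for illustration, so substitute your own), a hypothetical helper could project the size over a longer period:

```shell
#!/bin/sh
# project_db_size CURRENT_MB DAYS_LOGGED TARGET_DAYS
# Hypothetical helper: linear projection of the database size, in MB.
# Uses integer arithmetic, so the result is rounded down.
project_db_size() {
    echo $(( $1 * $3 / $2 ))
}

# 1536 MB after 30 days (assumed), projected to 365 days:
project_db_size 1536 30 365   # prints 18688
```

That works out to roughly 18 GB for a year of data, which you can compare against the free space reported by df -h on the appliance.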

Next, if you know that you will not need data after ‘x’ amount of time, you can configure InfluxDB to drop that older data:

[root@vxflexos-monitor ~]# influx
Connected to http://localhost:8086 version 1.7.6
InfluxDB shell version: 1.7.6
Enter an InfluxQL query

> use telegraf

Using database telegraf

> show retention policies

name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

> ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 52w

> show retention policies

name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 8736h0m0s 168h0m0s           1        true

> exit
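The same change can also be made without entering the interactive shell, which is handy for scripting. This sketch assumes the InfluxDB 1.x influx client shown above, whose -database and -execute options run a single statement; the two helper functions are hypothetical names introduced here for illustration. It also shows where the 8736h in the output comes from: 52 weeks x 7 days x 24 hours.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# weeks_to_hours WEEKS: convert weeks to hours, the unit InfluxDB reports.
weeks_to_hours() {
    echo $(( $1 * 7 * 24 ))
}

# retention_query POLICY DATABASE DURATION: build the InfluxQL statement.
# Note the plain ASCII double quotes around the identifiers.
retention_query() {
    printf 'ALTER RETENTION POLICY "%s" ON "%s" DURATION %s\n' "$1" "$2" "$3"
}

weeks_to_hours 52                      # prints 8736
retention_query autogen telegraf 52w   # prints the ALTER statement

# To apply it for real (uncomment on the appliance):
# influx -database telegraf -execute "$(retention_query autogen telegraf 52w)"
```

Building the statement in a function keeps the quoting in one place, which matters because InfluxQL only accepts plain double quotes around identifiers.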

There are other tricks too, like down-sampling the data to keep less information, but I haven’t tried that myself yet:

https://docs.influxdata.com/influxdb/v1.7/query_language/database_management/#retention-policy-management

Clearing InfluxDB

If you’ve got yourself into a situation where you want to wipe your InfluxDB data and start your metrics from scratch:

[root@vxflexos-monitor telegraf-vxflex]# influx
> use telegraf
> DROP SERIES FROM /.*/
> exit
[root@vxflexos-monitor telegraf-vxflex]# reboot
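Because DROP SERIES FROM /.*/ cannot be undone, a scripted version benefits from a confirmation step. The following is only a sketch: confirm_wipe is a hypothetical helper that asks you to type the database name, and the influx call (using the 1.x client's -database and -execute options) is left commented out so the block is safe to run as-is.

```shell
#!/bin/sh
# confirm_wipe DATABASE
# Hypothetical guard: succeeds only if the name typed on stdin matches.
confirm_wipe() {
    printf 'Type "%s" to drop ALL series in it: ' "$1" >&2
    IFS= read -r answer || return 1
    [ "$answer" = "$1" ]
}

# Usage on the appliance (uncomment the influx line to actually wipe):
# if confirm_wipe telegraf; then
#     influx -database telegraf -execute 'DROP SERIES FROM /.*/'
# fi
```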

Do let us know in the comments section how you get on with this and what other monitoring features you’d like to see in your PowerFlex environment.