Grafana monitoring itself for PowerFlex

Lisa: Dad, don’t you see you’re abusing your power like all vigilantes? I mean, if you’re the police, who will police the police?
Homer: I dunno. Coast Guard?

Recently my Grafana appliance refused to boot 🙁 (it would only drop into grub rescue mode). Fortunately, I was able to spin up a replacement very quickly by following the instructions I’d written earlier.

This time around, though, I thought I should be a bit more prudent and put in some additional safeguards so that whatever happened doesn’t happen again. My best guess is that the Grafana appliance itself ran out of disk space.
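
For a quick manual check of that theory on a running box, something like the sketch below reports how full a filesystem is. disk_used_pct is just a name made up for illustration, and the sketch assumes a POSIX df and awk are available.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grafana or Telegraf): print the used
# percentage of the filesystem that holds the given path, as a bare number.
disk_used_pct() {
    # df -P forces the POSIX one-line-per-filesystem layout;
    # column 5 is the capacity, e.g. "42%", so strip the percent sign.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Running disk_used_pct / prints the current figure for the root filesystem, which is the same value the disk panel will graph once Telegraf is collecting it.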

I found a nice dashboard that does roughly what I wanted here:
https://grafana.com/grafana/dashboards/1443

The page also includes some basic instructions, which I have modified slightly to fit my environment.

1. Edit your telegraf.conf file: vim /etc/telegraf/telegraf.conf
2. Paste the below into the file, then save and quit with :wq

####################################
#My Host metrics for Grafana server#
####################################

[[inputs.net]]
## By default, telegraf gathers stats from any up interface (excluding loopback)
## Setting interfaces will tell it to gather these explicit interfaces,
## regardless of status.
# interfaces = ["eth0", "lo"]

[[inputs.cpu]]
## Whether to report per-cpu stats or not
percpu = true
## Whether to report total system cpu stats or not
totalcpu = true
## If true, collect raw CPU time metrics.
collect_cpu_time = false

# Read metrics about disk usage by mount point
[[inputs.disk]]
## By default, telegraf gathers stats for all mountpoints.
## Setting mountpoints will restrict the stats to the specified mountpoints.
# mount_points = ["/"]
## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
## present on /run, /var/run, /dev/shm or /dev).
ignore_fs = ["tmpfs", "devtmpfs"]

# Read metrics about disk IO by device
[[inputs.diskio]]
## By default, telegraf will gather stats for all devices including
## disk partitions.
## Setting devices will restrict the stats to the specified devices.
# devices = ["sda", "sdb"]
## Uncomment the following line if you need disk serial numbers.
# skip_serial_number = false

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
# no configuration

# Read metrics about memory usage
[[inputs.mem]]
# no configuration

# Get the number of processes and group them by status
[[inputs.processes]]
# no configuration

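# Read metrics about swap memory usage
[[inputs.swap]]

# Read metrics about system load and uptime
[[inputs.system]]

# Read TCP metrics such as established, time wait and sockets counts
[[inputs.netstat]]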

3. Restart your telegraf service: systemctl restart telegraf

Telegraf should now start writing the local host metrics into InfluxDB automatically.
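
If you want to confirm the new inputs work before waiting on the dashboard, a sketch along these lines may help. check_input is a hypothetical helper, not part of Telegraf; it assumes the config path used above and relies on telegraf's --test flag, which gathers metrics once and prints them to stdout instead of sending them to InfluxDB.

```shell
#!/bin/sh
# Hypothetical helper: run a one-shot collection for a single input plugin
# and report whether any metric lines came back.
# $1 = plugin name (e.g. disk), $2 = config path (assumed default below).
check_input() {
    plugin="$1"
    conf="${2:-/etc/telegraf/telegraf.conf}"
    # In --test mode telegraf prints each metric on a line starting with "> ",
    # followed by the measurement name, so look for at least one such line.
    if telegraf --config "$conf" --input-filter "$plugin" --test | grep -q "^> $plugin"; then
        echo "$plugin: ok"
    else
        echo "$plugin: no metrics"
        return 1
    fi
}
```

Calling check_input disk (or cpu, mem, diskio and so on) should print "disk: ok" when the plugin returns data. If it does not, journalctl -u telegraf is the next place I would look.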

4. Create a new dashboard via Import, then load or paste the .json file available from: https://grafana.com/grafana/dashboards/1443

And with any luck, you should be able to start seeing some fresh data in your new Dashboard:

As a next step, you will probably also want to set up an email alert for the disk capacity panel. This was a bit trickier, as the author of the original dashboard used a format that was slightly odd (to me), which made modifying it more difficult than usual. To help you out, please try my dashboard instead, below:

Alerting-Objects-1608542487372 Alerting Objects Dashboard

You may still need to edit the Alerting section of the new Panel – this is what has worked for me: