Monitor MegaRAID controller using storcli and snmp extend

This article describes how to get RAID and disk status information out of storcli and how to configure OpenNMS to monitor it. An alarm is raised when one or more RAID volumes are not in an optimal state or when RAID disks have issues.

The scripts mentioned in this article can be found in the GitHub repository.

SNMP permissions

The Net-SNMP agent runs as the unprivileged user snmp and is not allowed to run storcli.
By creating a sudoers file, the snmp user can run just the necessary commands with sudo instead of running the whole Net-SNMP agent with root privileges.

Create the file /etc/sudoers.d/snmp_storcli from snmp_storcli.
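The file from the repository may look slightly different; a minimal sketch that only allows the two check scripts used later in this article could look like this:

# /etc/sudoers.d/snmp_storcli
# Allow the Net-SNMP agent user to run the check scripts as root and nothing else
snmp ALL=(root) NOPASSWD: /usr/local/bin/check_storcli_raid.sh, /usr/local/bin/check_storcli_disk.sh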

RAID volume status

Storcli command for RAID status in detail

The storcli command can be downloaded from Thomas Krenn.
The following steps demonstrate which information we need and how to get it. The important part is that a server can have multiple RAID volumes on one or more controllers.
The script below is able to identify this and reports which volume on which controller has problems.

To get the number of controllers:

root@megaraid:~# storcli show 
Status Code = 0
Status = Success
Description = None

Number of Controllers = 1
Host Name = megaraid
Operating System  = Linux4.4.0-77-generic

System Overview :
===============

------------------------------------------------------------------------------------
Ctl Model                   Ports PDs DGs DNOpt VDs VNOpt BBU sPR DS  EHS ASOs Hlth 
------------------------------------------------------------------------------------
  0 AVAGOMegaRAIDSAS9361-8i     8  24   4     0   4     0 Opt On  1&2 Y      4 Opt  
------------------------------------------------------------------------------------

Ctl=Controller Index|DGs=Drive groups|VDs=Virtual drives|Fld=Failed
PDs=Physical drives|DNOpt=DG NotOptimal|VNOpt=VD NotOptimal|Opt=Optimal
Msng=Missing|Dgd=Degraded|NdAtn=Need Attention|Unkwn=Unknown
sPR=Scheduled Patrol Read|DS=DimmerSwitch|EHS=Emergency Hot Spare
Y=Yes|N=No|ASOs=Advanced Software Options|BBU=Battery backup unit
Hlth=Health|Safe=Safe-mode boot
root@megaraid:~# storcli show | grep "Number of Controllers"
Number of Controllers = 1
root@megaraid:~# storcli show | grep "Number of Controllers" | awk '{print $5}'
1

To get the states of all RAID volumes of a controller (attention: the controller index starts at 0!):

[19:56]root@megaraid:~# storcli /c0 /vall show 
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

------------------------------------------------------------
DG/VD TYPE   State Access Consist Cache sCC       Size Name 
------------------------------------------------------------
0/0   RAID1  Optl  RW     Yes     RWBD  -   446.625 GB      
1/1   RAID10 Optl  RW     Yes     RWBD  -     1.744 TB      
2/2   RAID5  Optl  RW     Yes     RWBD  -     4.364 TB      
3/3   RAID5  Optl  RW     Yes     RWBD  -     4.364 TB      
------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

Some Bash Kung-Fu:

[19:56]root@megaraid:~# storcli /c0 /vall show | grep RAID 
0/0   RAID1  Optl  RW     Yes     RWBD  -   446.625 GB      
1/1   RAID10 Optl  RW     Yes     RWBD  -     1.744 TB      
2/2   RAID5  Optl  RW     Yes     RWBD  -     4.364 TB      
3/3   RAID5  Optl  RW     Yes     RWBD  -     4.364 TB 
[19:58]root@megaraid:~# storcli /c0 /vall show | grep RAID | tr -s ' ' | cut -d " " -f-3
0/0 RAID1 Optl
1/1 RAID10 Optl
2/2 RAID5 Optl
3/3 RAID5 Optl

snmpd configuration

The snmpd configuration has to be extended with the following line.

/etc/snmp/snmpd.conf

extend storcliRaid /bin/bash -c 'sudo /usr/local/bin/check_storcli_raid.sh'

The snmpd needs to be reloaded!
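On systemd-based distributions this can be done with systemctl. Afterwards the extend output can be checked directly; this is a sketch assuming SNMPv2c with the community public on localhost, and the value should be 0 when all RAID volumes are optimal:

root@megaraid:~# systemctl restart snmpd
root@megaraid:~# snmpwalk -v2c -c public localhost NET-SNMP-EXTEND-MIB::nsExtendOutLine
NET-SNMP-EXTEND-MIB::nsExtendOutLine."storcliRaid".1 = STRING: 0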

Poller configuration

The poller configuration is very easy. Just use the SNMP monitor to verify our specific OID:

${OPENNMS_HOME}/etc/poller-configuration.xml

<service name="Storcli-Raid" interval="43200000" user-defined="true" status="on">
 <parameter key="oid" value=".1.3.6.1.4.1.8072.1.3.2.4.1.2.11.115.116.111.114.99.108.105.82.97.105.100.1"/>
 <parameter key="retry" value="1"/>
 <parameter key="timeout" value="3000"/>
 <parameter key="port" value="161"/>
 <parameter key="operator" value="="/>
 <parameter key="operand" value="0"/>
</service>
<monitor service="Storcli-Raid" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

Add the service to your nodes and restart OpenNMS.

Script to check RAID volume states

The script check_storcli_raid.sh does the same steps as explained above and provides information about issues, which is used in the poller outage message.
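The script in the repository is the reference; a rough sketch of the idea (not the original script), assuming storcli is installed in /usr/sbin and that the poller expects a 0 on the first output line when everything is fine, could look like this:

#!/bin/bash
# Sketch of a RAID volume check: prints 0 when all volumes are optimal,
# otherwise the number of non-optimal volumes plus a short description.
STORCLI=/usr/sbin/storcli

problems=0
details=""

# Number of controllers (the controller index starts at 0)
controllers=$("${STORCLI}" show | grep "Number of Controllers" | awk '{print $5}')

for ((c = 0; c < controllers; c++)); do
    # DG/VD, RAID level and state of every volume on this controller
    while read -r vd raid state; do
        if [ "${state}" != "Optl" ]; then
            problems=$((problems + 1))
            details="${details} controller ${c} volume ${vd} (${raid}) state ${state};"
        fi
    done < <("${STORCLI}" "/c${c}" /vall show | grep RAID | tr -s ' ' | cut -d ' ' -f-3)
done

if [ "${problems}" -eq 0 ]; then
    echo "0"
else
    echo "${problems}"
    echo "Problems:${details}"
fi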

Disk failure prediction

Storcli command for disk failure prediction

Storcli is able to predict disk failures (similar to SMART technology). Hard disks in this state are not broken yet, but they can cause performance issues and may fail in the future, so you may want to be informed about this.
As before, here is an example of how to get the required information out of storcli. As mentioned, servers can have one or more controllers; in addition, there is the enclosure ID, which is (as far as I know) random, and a varying number of disks.

Show all disks and detailed information:

[19:58]root@megaraid:~# storcli /c0/eALL/sALL show all

The output is very long, so it is not posted here. Here is the short version to get all disk paths:

[19:58]root@megaraid:~# storcli /c0/eALL/sALL show all | grep -e '^Drive.*State :' | awk {'print $2'}
/c0/e8/s0
/c0/e8/s1
/c0/e8/s2
/c0/e8/s3
/c0/e8/s4
/c0/e8/s5
/c0/e8/s6
/c0/e8/s7
/c0/e8/s8
/c0/e8/s9
/c0/e8/s10
/c0/e8/s11
/c0/e8/s12
/c0/e8/s13
/c0/e8/s14
/c0/e8/s15
/c0/e8/s16
/c0/e8/s17
/c0/e8/s18
/c0/e8/s19
/c0/e8/s20
/c0/e8/s21
/c0/e8/s22
/c0/e8/s23

Get the disk status. As an example: Controller 0, Enclosure ID 8, Disk 1

[19:08]root@megaraid:/etc/snmp# /usr/sbin/storcli /c0/e8/s1 show all 
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e8/s1 :
===============

-----------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model                      Sp 
-----------------------------------------------------------------------------------
8:1      15 Onln   0 446.625 GB SATA SSD N   N  512B SAMSUNG MZ7LM480HCHP-00003 U  
-----------------------------------------------------------------------------------

EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded


Drive /c0/e8/s1 - Detailed Information :
======================================

Drive /c0/e8/s1 State :
=====================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 0
Drive Temperature =  28C (82.40 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e8/s1 Device attributes :
=================================
SN = S1YJNX0H504112      
Manufacturer Id = ATA     
Model Number = SAMSUNG MZ7LM480HCHP-00003
NAND Vendor = NA
WWN = 5002538c402e796f
Firmware Revision = GXT3003Q
Raw size = 447.130 GB [0x37e436b0 Sectors]
Coerced size = 446.625 GB [0x37d40000 Sectors]
Non Coerced size = 446.630 GB [0x37d436b0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 12.0Gb/s
NCQ setting = Enabled
Write cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B


Drive /c0/e8/s1 Policies/Settings :
=================================
Drive position = DriveGroup:0, Span:0, Row:1
Enclosure position = 0
Connected Port Number = 0(path0) 
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
SED Capable = No
SED Enabled = No
Secured = No
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No

Port Information :
================

-----------------------------------------
Port Status Linkspeed SAS address        
-----------------------------------------
   0 Active 12.0Gb/s  0x500304801e95b501 
-----------------------------------------


Inquiry Data = 
40 00 ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00 
00 00 00 00 31 53 4a 59 58 4e 48 30 30 35 31 34 
32 31 20 20 20 20 20 20 00 00 00 00 00 00 58 47 
33 54 30 30 51 33 41 53 53 4d 4e 55 20 47 5a 4d 
4c 37 34 4d 30 38 43 48 50 48 30 2d 30 30 33 30 
20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80 
00 40 00 2f 00 40 00 02 00 02 07 00 ff 3f 10 00 
3f 00 10 fc fb 00 10 d1 ff ff ff 0f 00 00 07 00 

Again use grep to get the important value:

[19:09]root@megaraid:/etc/snmp# /usr/sbin/storcli /c0/e8/s1 show all | grep "Predictive Failure Count"
Predictive Failure Count = 0
[19:14]root@megaraid:/etc/snmp# /usr/sbin/storcli /c0/e8/s1 show all | grep "Predictive Failure Count" | awk {'print $5'}
0

snmpd configuration

The snmpd configuration (/etc/snmp/snmpd.conf) has to be extended with this line as well.

extend storcliDisk /bin/bash -c 'sudo /usr/local/bin/check_storcli_disk.sh'

The snmpd needs to be reloaded!
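After the reload, a quick check whether the new extend answers; again a sketch assuming SNMPv2c with the community public, and the Net-SNMP MIBs installed for the name notation. A value of 0 means no disk reports a predictive failure:

root@megaraid:~# snmpget -v2c -c public localhost 'NET-SNMP-EXTEND-MIB::nsExtendOutLine."storcliDisk".1'
NET-SNMP-EXTEND-MIB::nsExtendOutLine."storcliDisk".1 = STRING: 0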

Poller configuration

The poller configuration is very easy. Just use the SNMP monitor to verify our specific OID. As mentioned above, the example commands are based on enclosure ID 8!

${OPENNMS_HOME}/etc/poller-configuration.xml

<service name="Storcli-Disk" interval="43200000" user-defined="true" status="on">
 <parameter key="oid" value=".1.3.6.1.4.1.8072.1.3.2.4.1.2.11.115.116.111.114.99.108.105.68.105.115.107.1"/>
 <parameter key="retry" value="1"/>
 <parameter key="timeout" value="3000"/>
 <parameter key="port" value="161"/>
 <parameter key="operator" value="="/>
 <parameter key="operand" value="0"/>
</service>
<monitor service="Storcli-Disk" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

OpenNMS needs to be restarted to load the configuration!

Script to check disk predictive failure states

The script check_storcli_disk.sh does the same steps as explained above and provides information about issues, which is used in the poller outage message.
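Again, the script in the repository is the reference; a rough sketch of the logic, with the same assumptions as for the RAID check (storcli in /usr/sbin, a 0 on the first output line when everything is fine), could look like this:

#!/bin/bash
# Sketch of a disk check: prints 0 when no drive reports a predictive failure,
# otherwise the number of affected drives plus their paths and counters.
STORCLI=/usr/sbin/storcli

problems=0
details=""

controllers=$("${STORCLI}" show | grep "Number of Controllers" | awk '{print $5}')

for ((c = 0; c < controllers; c++)); do
    # All drive paths on this controller, e.g. /c0/e8/s1 (enclosure IDs vary per system)
    for drive in $("${STORCLI}" "/c${c}/eALL/sALL" show all | grep -e '^Drive.*State :' | awk '{print $2}'); do
        count=$("${STORCLI}" "${drive}" show all | grep "Predictive Failure Count" | awk '{print $5}')
        if [ "${count}" != "0" ]; then
            problems=$((problems + 1))
            details="${details} ${drive} predictive failure count ${count};"
        fi
    done
done

if [ "${problems}" -eq 0 ]; then
    echo "0"
else
    echo "${problems}"
    echo "Problems:${details}"
fi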
