NMIS Documentation
Last updated 1 December 2004

Introduction

NMIS stands for Network Management Information System.  It is a Network Management System which performs multiple functions from the OSI Network Management Functional Areas, namely Performance, Configuration and Fault.  A primary function of NMIS is to make information about your network available quickly.  Some of this information is provided "raw"; other information is provided in a related manner.

It started as an SNMP polling and statistics viewer front-end to Tobi Oetiker's RRDTool.  RRDTool replaces MRTG but doesn't include a front end and backend to handle SNMP polling, display the resulting web pages, etc.  The original NMIS evolved quite rapidly to meet the demands of production environments.

The backend, or polling engine, uses SNMP to collect interface and health statistics for Cisco Routers, certain Cisco Catalyst Switches and Generic SNMP devices every 5 minutes.  It stores the collected statistics in RRDs (Round Robin Databases), ensures that devices are up, issues alerts, etc.  The front end accesses the information stored in the RRDs and displays the resulting statistics, graphs, reports, etc.

Both the front and back ends are highly extensible, and features are easy to add once the structure is learnt.  For example, the backend originally collected only interface statistics every poll cycle; it was easy to add collection of health (CPU, memory, buffer, etc.), response time and availability.

NMIS uses a backend to collect data and maintain the information about the data.  It relies on RRD for databases; additional tables are text-based configuration information.  The frontend is independent: it just reads information from the RRDs and text tables and displays it.  Simple.

It is intended that NMIS be low maintenance once it is running; it should just go and go.  More work needs to be done on this, but I think it is going well so far.

Concepts

The basic concept is that NMIS collects interface, CPU, memory, buffer and packet statistics from Cisco Routers and Switches; it is also capable of supporting generic SNMP MIB 2 collection.  Getting slightly deeper, NMIS pings a device every poll cycle to verify that it is "up".  This is called "reachability", and NMIS holds it in memory.

If no system information is available for the device, it must be a new device, so NMIS performs a capabilities discovery on it; this is the subroutine getNodeInfo.  Otherwise NMIS loads the cached system information with the loadSystemFile subroutine, then runs the updateUptime subroutine, which gets sysObjectID, sysUpTime and ifNumber.  NMIS compares these with the cached information to check that the same number of interfaces is present, that the uptime has increased and that the sysObjectID is the same.

If the number of interfaces has changed, run the createInterfaceFile subroutine to update this information.  (This should send a configuration change event.)

If the sysObjectID has changed, run the getNodeInfo subroutine.  (This should send a configuration change event.)

If the sysUpTime is less than the cached value, run the getNodeInfo subroutine.  (This should send a node reload event.)
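The checks above can be sketched as a small decision function. This is only an illustration in Python, not the NMIS code itself: the SysInfo record, the plan_update_actions name and the event strings are assumptions made for the example; only the subroutine names getNodeInfo and createInterfaceFile come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SysInfo:
    """Hypothetical record of the three values fetched by updateUptime."""
    sys_object_id: str
    sys_uptime: int  # sysUpTime in timeticks
    if_number: int


def plan_update_actions(
    cached: Optional[SysInfo], current: SysInfo
) -> List[Tuple[str, Optional[str]]]:
    """Return (subroutine to run, event to send) pairs for one poll cycle."""
    if cached is None:
        # No cached system information: treat the node as new and discover it.
        return [("getNodeInfo", None)]

    actions = []
    if current.if_number != cached.if_number:
        actions.append(("createInterfaceFile", "configuration change"))
    if current.sys_object_id != cached.sys_object_id:
        actions.append(("getNodeInfo", "configuration change"))
    if current.sys_uptime < cached.sys_uptime:
        # Uptime went backwards, so the node is taken to have reloaded.
        actions.append(("getNodeInfo", "node reload"))
    return actions
```

An empty list means the cached information is still valid and polling can continue with runHealth.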

The runHealth subroutine is then run.  It collects CPU, memory, buffers, etc., whatever is deemed necessary for that device type, and stores it all in an RRD.

Then the runInterfaces subroutine is run.  It loads the cached interface information; if none exists, it creates it with createInterfaceFile.  Then for each interface it collects ifDescr, ifOperStatus, ifInOctets and ifOutOctets.  If the ifDescr is different, the cached interface information must be out of date (this is how a shifting ifIndex is handled), so it is created again with createInterfaceFile.  If the ifOperStatus shows down when the interface is supposed to be up, an event is raised.  Otherwise ifOperStatus, ifInOctets and ifOutOctets are stored in an RRD, and ifOperStatus is added to the total interface availability of the device.
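A minimal sketch of the per-interface checks follows, in Python for illustration only. The dictionary layout, the "collect" flag (standing in for "supposed to be up"), the event text and the availability formula (percentage of managed interfaces that are up) are all assumptions for the example, not taken from NMIS.

```python
def check_interfaces(cached, polled):
    """Compare polled interface data against the cached interface table.

    cached: {ifIndex: {"ifDescr": str, "collect": bool}}
    polled: {ifIndex: {"ifDescr": str, "ifOperStatus": "up" or "down"}}

    Returns (cache_is_stale, events, availability_percent).
    """
    events = []
    managed = 0
    up = 0
    for if_index, sample in polled.items():
        known = cached.get(if_index)
        if known is None or known["ifDescr"] != sample["ifDescr"]:
            # ifIndex has shifted: the caller should rerun createInterfaceFile.
            return True, [], None
        if not known["collect"]:
            continue  # interface is not expected to be up, so ignore it
        managed += 1
        if sample["ifOperStatus"] == "up":
            up += 1
        else:
            # Hypothetical event text for an interface that should be up.
            events.append("Interface Down: " + sample["ifDescr"])
    availability = 100.0 * up / managed if managed else None
    return False, events, availability
```

Returning early on a changed ifDescr mirrors the text: once the cache is known to be stale, the remaining comparisons cannot be trusted until it has been rebuilt.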

After the interfaces are complete, NMIS calculates the response time for the device with another ping and stores some health metrics in another RRD: the reachability of the device, the interface availability of the device and the response time.  It also creates a health metric from a simple algorithm which weights the various collections to indicate the overall health of that device; more on this in the Health section.

Roles and Groups

Nodes can be put into two types of groups.  The first group is a role, which is one of core, distribution or access; the second group is used to group devices together for reports and general information.  It is logical that the second group be something like the building name or city/suburb of the device, as this helps identify problem areas.

Roles play an important part in NMIS; they allow things to be weighted for events and various other functions.  The concept of weighting according to role is simple: if it is a core device then it is important and should be treated as such; if it is an access device then it is less important.  The idea is to remove the noise, i.e. rather than all events coming in at critical, to show which ones really are.

Health

The following statistics are considered part of the health of the device:

  • Reachability - is it up or not; 
  • Availability - interface availability of all interfaces which are supposed to be up; 
  • Response Time; 
  • CPU; 
  • Memory; 

All of these metrics are weighted and a health metric is created.  This metric, when compared over time, should always indicate the relative health of the device.  Interfaces which aren't being used should be shut down so that the health metric remains realistic.  The exact calculations can be seen in the runReachability subroutine.
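To illustrate the idea of a weighted health metric, here is a rough Python sketch. The weights, the 0 to 100 scoring of each input and the max_response_ms cut-off are invented for the example; the real calculation is the one in the runReachability subroutine and may differ considerably.

```python
# Hypothetical weights in percent (they sum to 100). These are NOT the values
# used by runReachability; they only illustrate the weighting idea.
WEIGHTS = {
    "reachability": 30,
    "availability": 20,
    "response": 20,
    "cpu": 15,
    "memory": 15,
}


def health_metric(reachability, availability, response_ms, cpu_pct,
                  mem_free_pct, max_response_ms=1000.0):
    """Combine the five health statistics into one score from 0 to 100.

    Every input is first turned into a "higher is better" score out of 100,
    then the scores are combined using the weights above.
    """
    scores = {
        "reachability": reachability,
        "availability": availability,
        # A response time at or beyond max_response_ms scores zero.
        "response": 100.0 * max(0.0, 1.0 - response_ms / max_response_ms),
        "cpu": 100.0 - cpu_pct,
        "memory": mem_free_pct,
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100.0
```

With these assumed weights, a fully reachable device with all interfaces up, a 500 ms response time, 50% CPU load and 50% free memory scores 75.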

Events

  • Escalation
  • Events based on device role
  • Stateful

Thresholds

The thresholds routine runs whenever you like.  It processes the collected statistics in the RRDs, compares the numbers to stored thresholds and, if a threshold is exceeded, raises an event for that device.  The thresholds use the device role to weight the events.
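One way to picture role-weighted thresholds is sketched below in Python. The severity names, the threshold limits and the rule of shifting the severity one step up for core devices and one step down for access devices are assumptions made for illustration; the text does not say how NMIS actually applies the weighting.

```python
# Hypothetical severity ladder; index 0 means no threshold was exceeded.
LEVELS = ["Normal", "Warning", "Minor", "Major", "Critical"]

# Hypothetical role weighting: core devices are escalated, access devices
# are played down, distribution devices are left as measured.
ROLE_SHIFT = {"core": 1, "distribution": 0, "access": -1}


def weighted_severity(value, limits, role):
    """Return the event severity for a statistic, or None if within limits.

    limits: ascending thresholds for Warning, Minor, Major and Critical.
    """
    base = 0
    for index, limit in enumerate(limits, start=1):
        if value >= limit:
            base = index
    if base == 0:
        return None  # nothing exceeded, so no event is raised
    # Apply the role weighting, keeping the result between Warning and Critical.
    shifted = min(max(base + ROLE_SHIFT[role], 1), len(LEVELS) - 1)
    return LEVELS[shifted]
```

For example, with assumed CPU limits of 60, 70, 80 and 90, a reading of 85 would be raised as Critical on a core device but only Minor on an access device.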

Updates

Updates ensure that all the cached system and interface information is kept up to date.  If the network is constantly changing then updates should be run frequently; otherwise they can be run less often.

Interfaces

Interfaces which aren't in use should be shut down (admin down) so that NMIS doesn't think it is supposed to manage them.  A simple lookup is done on interface types to determine whether NMIS should collect statistics on them.  This is done during the createInterfaceFile subroutine.
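The lookup can be pictured as a simple set membership test, sketched here in Python. The particular set of interface types is an assumption for the example (the names are standard ifType values, but which types NMIS actually collects is decided in createInterfaceFile), as is the should_collect helper.

```python
# Hypothetical list of ifType names worth collecting statistics for.
COLLECT_IF_TYPES = {
    "ethernetCsmacd",
    "frameRelay",
    "ppp",
    "propPointToPointSerial",
}


def should_collect(if_type, if_admin_status):
    """Decide whether statistics should be collected for an interface.

    An interface is managed only if it is administratively up and its
    type appears in the lookup set.
    """
    return if_admin_status == "up" and if_type in COLLECT_IF_TYPES
```

Under these assumptions an Ethernet port that has been shut down, or a loopback interface of any status, would be skipped.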

Capacity Planning

The Capacity Planning tool is on the plugin menu, and displays the 95th percentile data point recorded weekly by bin/cplan.pl for each routed link. The 95th percentile data is the most accurate data point to look at as a barometer for planning link upgrades. It is the 95% point of the cumulative distribution of the week's 5 minute average utilisation data points. This metric filters out the sometimes irregular spikes that occur in peak traffic measurements. For planning purposes, a 95th percentile at 60% of line rate is approximately the utilisation at which router packet queues begin to form on the outbound interface, which will affect user response times. A more conservative view suggests that link upgrades should be planned when sustained 95th percentile data approaches a utilisation rate of 40% to 45%.
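As a worked illustration of the percentile calculation, the Python sketch below uses the nearest-rank method on a week of 5 minute samples (7 x 24 x 12 = 2016 points). The function name and the choice of nearest-rank are assumptions; bin/cplan.pl may compute the percentile differently.

```python
def percentile(samples, pct=95):
    """Nearest-rank percentile of a list of utilisation samples.

    pct is a whole number such as 95, 90 or 85.
    """
    if not samples:
        raise ValueError("no samples to summarise")
    ordered = sorted(samples)
    # Integer form of ceil(pct * n / 100), avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# A week of 5 minute samples: mostly 30% utilisation with 16 spikes to 100%.
week = [30.0] * 2000 + [100.0] * 16
p95 = percentile(week, 95)
```

Here the maximum of the week is 100% but the 95th percentile is 30%, which shows how the metric ignores short spikes when judging whether a link is approaching the 40% to 60% planning range.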

The Capacity Planning tool page has options to display by node or group, to filter by % utilisation threshold, and to set the percentile to 95%, 90% or 85%. 95th percentile values are presently recorded weekly for the previous week. Once a history has built up, I would suggest that the collection period be moved to monthly to display longer-term trend patterns.