NMIS - Features
Last updated 21 June 2001

Introduction

NMIS stands for Network Management Information System.  It is a Network Management System which performs functions from three of the OSI Network Management Functional Areas: Performance, Configuration and Fault.  A primary function of NMIS is to make information about your network available quickly.  Some of this information is provided "raw"; other information is provided in a related manner.

It started as an SNMP polling and statistics viewer front-end to Tobi Oetiker's RRDTool.  RRDTool replaces MRTG but doesn't include a front end and backend to handle SNMP polling, display the resulting web pages, etc.  The original NMIS evolved quite rapidly to meet the demands of production environments.

The backend, the polling engine, uses SNMP to collect interface and health statistics for Cisco Routers, certain Cisco Catalyst Switches and generic SNMP devices every 5 minutes.  It stores the collected statistics in RRDs (Round Robin Databases), ensures that devices are up, issues alerts, etc.  The front end accesses the information stored in the RRDs and displays the resulting statistics, graphs, reports, etc.

Both the front and back ends are highly extensible, and features are easy to add once the structure is learnt.  For example, the backend originally collected only interface statistics every poll cycle; it was easy to add collection of health (CPU, memory, buffer, etc.), response time and availability.

NMIS uses a backend to collect data and maintain the information about the data.  It relies on RRD for databases; additional tables are text-based configuration information.  The frontend is independent: it just reads information from the RRDs and text tables and displays it.  Simple.

It is intended that NMIS be low maintenance once it is running; it should just go and go.  More work needs to be done on this, but I think it is going well so far.

Concepts

The basic concept is that NMIS collects interface, CPU, memory, buffer and packet statistics from Cisco Routers and Switches; it is also capable of supporting generic SNMP MIB 2 collection.  Getting slightly deeper, NMIS pings a device every poll cycle to verify that it is "up"; this is called "reachability", and NMIS holds the result in memory.

If no system information is available for the device, it must be a new device, so NMIS performs a capabilities discovery on it; this is the subroutine getNodeInfo.  Otherwise NMIS loads the cached system information with loadSystemFile and then runs the updateUptime subroutine, which gets sysObjectID, sysUpTime and ifNumber.  NMIS compares these with the cached information to check that the same number of interfaces is present, that the uptime has increased and that the sysObjectID is the same.

If the number of interfaces has changed, run the createInterfaceFile subroutine to update this information.  (This should send a configuration change event.)

If the sysObjectID has changed, run the getNodeInfo subroutine.  (This should send a configuration change event.)

If the sysUpTime is less than the cached value, the device has been reloaded, so run the getNodeInfo subroutine.  (This should send a node reload event.)
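
The checks above can be sketched in a few lines of Python.  This is only an illustration of the decision logic, not the NMIS code: the function name plan_node_actions, the dictionary layout and the event strings are assumptions, and the choice to run getNodeInfo only once when both the sysObjectID and the uptime look wrong is a simplification.

```python
def plan_node_actions(cached, current):
    """Return (subroutine, event) pairs for one poll cycle.

    `cached` is the stored system information (None for a new device);
    `current` holds freshly polled sysObjectID, sysUpTime and ifNumber.
    Both are plain dicts here, which is an assumption of this sketch.
    """
    if cached is None:
        # No cached system information: treat it as a new device.
        return [("getNodeInfo", None)]

    actions = []
    if current["ifNumber"] != cached["ifNumber"]:
        actions.append(("createInterfaceFile", "configuration change"))
    if current["sysObjectID"] != cached["sysObjectID"]:
        actions.append(("getNodeInfo", "configuration change"))
    elif current["sysUpTime"] < cached["sysUpTime"]:
        # Uptime went backwards, so the node has reloaded.
        actions.append(("getNodeInfo", "node reload"))
    return actions
```

An empty list means the cached information still matches and the poll can carry on with the health and interface collection.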

Next the runHealth subroutine is run; it collects CPU, memory, buffers, etc., whatever is deemed necessary for that device type, and sticks it all in an RRD.

Then the runInterfaces subroutine is run.  It loads the cached interface information; if none exists it creates it with createInterfaceFile.  Then for each interface it collects ifDescr, ifOperStatus, ifInOctets and ifOutOctets.  If the ifDescr is different, the cached interface information must be out of date (this is how shifting ifIndex is handled), so it is created again with createInterfaceFile.  If the ifOperStatus shows down when the interface is supposed to be up, an event is raised.  Otherwise ifOperStatus, ifInOctets and ifOutOctets are stored in an RRD, and ifOperStatus is added to the total interface availability of the device.
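
The per-interface decision described above might look roughly like the following Python sketch.  The function name check_interface, the return values and the use of the strings "up" and "down" for ifOperStatus (SNMP itself returns integers) are assumptions made for readability; this is not the NMIS subroutine.

```python
def check_interface(cached_descr, polled, expected_up=True):
    """Classify one polled interface as "rebuild", "event" or "store".

    `polled` is a dict with ifDescr, ifOperStatus, ifInOctets and
    ifOutOctets; `cached_descr` is the ifDescr held in the cache.
    """
    if polled["ifDescr"] != cached_descr:
        # The cache is out of date, e.g. the ifIndex has shifted,
        # so the interface file needs to be created again.
        return "rebuild"
    if expected_up and polled["ifOperStatus"] != "up":
        # Down when it is supposed to be up: raise an event.
        return "event"
    # Normal case: the counters and status go into the RRD.
    return "store"
```

Comparing ifDescr before anything else means a renumbered interface is never charged with another interface's counters.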

After the interfaces are complete, NMIS calculates the response time for the device with another ping and stores some health metrics in another RRD: the reachability of the device, the interface availability of the device, the response time, and a health metric.  The health metric comes from a simple algorithm which weights the various collections to indicate the overall health of that device; more on this in the health section.

Roles and Groups

Nodes can be put into two types of groups.  The first group is a role, which is core, distribution or access; the second group is used to group devices together for reports and general information.  It is logical that the second group be something like the building name or city/suburb of the device, as this helps identify problem areas.

Roles play an important part in NMIS; they allow things to be weighted for events and various other functions.  The concept of weighting according to role is simple: if it is a core device then it is important and should be treated as such; if it is an access device then it is less important.  The idea is to try to remove the noise, i.e. to avoid all events coming in at critical and to show which ones really are.
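
To make the idea of role weighting concrete, here is a small Python sketch.  The level names and the mapping itself are hypothetical and are not taken from NMIS; the only point is that the same "node down" condition produces a different event level depending on the role of the device.

```python
# Hypothetical event levels for a "node down" condition, per role.
ROLE_EVENT_LEVEL = {
    "core": "Critical",
    "distribution": "Major",
    "access": "Minor",
}


def event_level(role):
    """Return the event level for a role, with a cautious default."""
    # Unknown roles fall back to "Warning" (an assumption of this sketch).
    return ROLE_EVENT_LEVEL.get(role, "Warning")
```

A core router going down would then arrive as Critical, while the same failure on an access switch would arrive as Minor.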

Health

The following statistics are considered part of the health of the device:

  • Reachability - is it up or not; 
  • Availability - interface availability of all interfaces which are supposed to be up; 
  • Response Time; 
  • CPU; 
  • Memory; 

All of these metrics are weighted and a health metric is created.  This metric, when compared over time, should always indicate the relative health of the device.  Interfaces which aren't being used should be shut down so that the health metric remains realistic.  The exact calculations can be seen in the runReachability subroutine.
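
As a rough illustration of how such a weighted metric could be built, the Python sketch below combines the five inputs listed above.  The weights, the names, and the assumption that every input has already been scaled to 0-100 (with 100 being best) are all hypothetical; the real calculation is the one in runReachability.

```python
# Hypothetical weights, as a percentage of the final score (they sum to 100).
HEALTH_WEIGHTS = {
    "reachability": 30,
    "availability": 20,
    "response_time": 20,
    "cpu": 20,
    "memory": 10,
}


def health_metric(scores):
    """Weighted average of scores that are each 0-100, where 100 is best."""
    total = sum(HEALTH_WEIGHTS[name] * scores[name] for name in HEALTH_WEIGHTS)
    return total / 100
```

With these made-up weights, a device that is unreachable but otherwise perfect would score 70, which shows how a single input pulls the overall figure down in proportion to its weight.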

Events

  • Escalation
  • Events based on device role
  • Stateful

Thresholds

The thresholds routine runs whenever you like.  It processes the collected statistics in the RRDs, compares the numbers to stored thresholds and, if one is exceeded, raises an event for that device.  The thresholds use the device role to weight the events.
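
One possible reading of role-weighted thresholds is sketched below in Python: each role has its own warning and critical limits, so a core device alerts earlier than an access device.  The table, the numbers and the function name check_threshold are assumptions for illustration only and do not reflect the stored NMIS thresholds.

```python
# Hypothetical CPU thresholds in percent, per role: (warning, critical).
CPU_THRESHOLDS = {
    "core": (60, 80),
    "distribution": (70, 90),
    "access": (80, 95),
}


def check_threshold(role, value):
    """Return "critical", "warning" or None for a collected CPU value."""
    warning, critical = CPU_THRESHOLDS[role]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None  # Within limits, so no event is raised.
```

Under these example numbers, 85% CPU is critical on a core device but only a warning on an access device.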

Updates

Updates ensure that all the cached system and interface information is kept up to date.  If the network is constantly changing, the update should be run frequently; otherwise it can be run less often.

Interfaces

Interfaces which aren't in use should be shut down (admin down) so that NMIS doesn't think it is supposed to manage them.  A simple lookup is done on interface types to determine if NMIS should collect statistics on them.  This is done during the createInterfaceFile subroutine.
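
A lookup of this kind could be as simple as the Python sketch below.  The set of interface types chosen here is hypothetical (the names are standard ifType labels, but which ones NMIS actually collects on is not stated in this document), and should_collect is an illustrative name, not an NMIS subroutine.

```python
# Hypothetical list of ifType names that are worth collecting on.
COLLECT_TYPES = {
    "ethernetCsmacd",
    "frameRelay",
    "ppp",
    "propPointToPointSerial",
}


def should_collect(if_type, if_admin_status):
    """Collect only on admin-up interfaces of a type in the lookup."""
    return if_admin_status == "up" and if_type in COLLECT_TYPES
```

An interface that has been shut down fails the first test, so it is skipped regardless of its type.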

The following is a list of NMIS features.  This is by no means comprehensive but provides an idea of what NMIS can do.

General

  • The entire network is summarised into a single metric, which indicates reachability, availability and health of all network devices being managed by NMIS.
  • Summary page for entire network with reachability, availability, health, response time metrics.
  • Summary pages of devices including device information, health graph, and interface summary.
  • Can be distributed across multiple "polling servers" by using included programs.
  • Policy based event and escalation.

Performance and Fault

  • Integrated Fault and Performance Management.
  • Color-coded events and status for at-a-glance interpretation.
  • Graphing of Interface, CPU, Memory stats for Cisco Routers and Switches.
  • Graphs can be drilled into.
  • Graphs produced on the fly.
  • Graphs can cover varying lengths of time, from 2 hours to 1 year.
  • Interface statistics are returned in Utilisation and/or bits per second.
  • Response time graphed and metrics for health and availability generated from statistics collected.
  • Threshold engine which sends alerts on certain thresholds.
  • Escalation subsystem based on device groups which provides a great deal of granularity.
  • Varying event levels for different device types.
  • Alert events are issued for device down or interface down.
  • Event levels are set according to how important the device is.
  • Events are "stateful", including thresholds, meaning that an event is only issued once.
  • Notification engine can be expanded to handle any "command line" notification method, including email, paging, signs, speakers, etc.
  • Integrated logging facility to view NMIS events and syslog messages.
  • A list of current events is available, showing the escalation level and the time each event has been active.
  • Event logging
  • Outage time calculated for each down event
  • Planned outages can be put in so alerts are not issued
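
Two of the points above, stateful events and outage time per down event, can be illustrated with a short Python sketch.  The class EventState, its method names and the use of plain seconds for time are assumptions of this example, not the way NMIS stores its events.

```python
class EventState:
    """Tracks active events so each one is issued only once."""

    def __init__(self):
        # Maps (node, event) to the time the event started, in seconds.
        self.active = {}

    def raise_event(self, node, event, now):
        """Return True the first time an event is seen, False on repeats."""
        key = (node, event)
        if key in self.active:
            return False  # Already active, so no new alert is issued.
        self.active[key] = now
        return True

    def clear_event(self, node, event, now):
        """Clear an event and return its outage time, or None if not active."""
        start = self.active.pop((node, event), None)
        if start is None:
            return None
        return now - start
```

For example, raising "Node Down" for a node at t=1000 and again at t=1300 only alerts the first time, and clearing it at t=1600 reports an outage of 600 seconds.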

Configuration

  • Find function which searches interface information for node name, interface name, description, type, IP address, for matching interfaces.
  • Interface information includes IP address information.
  • Dynamic handling of ifIndex changes and difficult SNMP interface handling
  • Checking of changes to device details.
  • NMIS stores contact and location information which links to the SNMP sysContact and sysLocation MIB objects.
  • Produces DNS and Host records from the collected IP addressing information
  • Produces DNS LOC records for "visible" traceroute utilities.

Reporting

  • Reports for utilisation, outages, etc
  • Snapshot and dynamic reporting for metrics on all devices and groups of devices.