7.3. Intelligent network monitoring

Almost all of the active network monitoring tools currently available appear to be, at best, rule-based systems. This lack of expert information or "intelligence" results in the symptomatic reporting described in Section 7.1. To improve this situation, it is necessary to provide network monitoring agents with the ability to interpret the information obtained from scanning the network.

7.3.1. Gathering data

As a first approach to this problem, pattern recognition techniques were employed. For example, attempts were made to write software that determined whether a particular sequence of events preceded a network failure. The intention was to progress from these techniques to full artificial intelligence, of the sort provided by the neural network approach.

While attempting to obtain a knowledge base for this system, however, it was discovered that diagnosing the vast majority of network faults is a fairly straightforward, logical process. The "grey areas" commonly associated with neural networks simply do not exist for the bulk of the faults such a system is likely to encounter.

Part of the reason for this is that, in general, the sorts of probes performed by network monitoring tools have a binary outcome: the results are clear cut, since the probe either succeeded or failed. For example, if an ICMP echo request is sent to a machine to determine if it is actively participating in the network, the machine will either respond with an ICMP echo response or it will not respond at all. The protocol does not allow for any other response to be valid.
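
By way of illustration, the short Perl fragment below performs exactly this sort of binary probe using the standard Net::Ping module. The host name is a placeholder, and sending raw ICMP packets typically requires superuser privileges.

    #!/usr/bin/perl
    # A binary reachability probe: send an ICMP echo request and wait
    # for an echo response.  (Raw ICMP normally requires root.)
    use strict;
    use warnings;
    use Net::Ping;

    my $host = 'sws-pgrad.ict';            # placeholder host name
    my $p    = Net::Ping->new('icmp');

    # Only two outcomes are possible: a response arrived, or it did not.
    if ( $p->ping( $host, 2 ) ) {          # two second timeout
        print "$host responded\n";
    } else {
        print "$host did not respond\n";
    }
    $p->close;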

This sort of binary logic lends itself to a decision tree or graph-based approach, where each node in the graph is characterised by some form of "intelligent" question. In other words, the system requires the type of expert knowledge mentioned in Section 7.2.2. Thus the idea of a pattern-based system was discarded in favour of a simpler expert system.

The most important piece of knowledge required by an expert system doing network monitoring is knowledge of the layer two and layer three topology of the network being monitored. In particular, the monitoring machine needs to know the normal route taken by packets between itself and the machine or service it is monitoring. Chapter 4 discusses methods for obtaining this information at both layers two and three of the OSI network stack.

Knowledge of the dependencies between various computers is another useful piece of information that can be supplied to expert network monitoring systems. To take the example presented in Figure 7-1, the mail server relies on the DNS server in order to be able to deliver outgoing mail (since it has to perform a DNS lookup in order to determine the IP address of the machine it wishes to deliver mail to). If the DNS server is down, the mail server must necessarily be running in a degraded state.
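
Such dependency information can be represented very simply. The sketch below, with illustrative host names, stores the dependencies as a Perl hash mapping each host to the list of hosts its services rely upon; finding every service degraded by a DNS failure is then a single pass over the hash.

    # A sketch of a dependency map: each host maps to the hosts its
    # services depend on.  The host names are illustrative only.
    my %depends = (
        'mail.example' => ['dns.example'],   # mail needs DNS lookups
        'www.example'  => ['dns.example'],   # the web server does too
        'dns.example'  => [],                # no dependencies of its own
    );

    # Every host listing the failed DNS server as a dependency is now
    # known to be running in a degraded state.
    my @degraded;
    for my $host ( keys %depends ) {
        push @degraded, $host
            if grep { $_ eq 'dns.example' } @{ $depends{$host} };
    }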

Using this information, a network-specific expert system was constructed. The set of rules for such a system turns out to be remarkably simple. Part of the reason for this is that the same set of rules gets applied to each host that is queried, resulting in multiple recursive calls to the same set of rules. Figure 7-2 shows just such a recursive set of rules forming the basis of an expert network monitoring system.

Figure 7-2. Decision process for determining network faults



As can be seen from the flow chart, the monitoring of any specific service begins with a check to see whether that service is running on the machine that it is supposed to be running on, and whether it is performing as expected. At any stage during the tests a conclusion can be reached, depending on the results of the specific test that was run.

If the service is faulty, any dependencies of that service (for example, DNS in the case of mail) should be checked. Provided all the dependencies are working correctly, a check should be made on any other services that are known to be running on the machine. This will determine whether it is a specific service on the machine that is faulty, or if there is a more general problem.

Assuming that no services can be reached, a test should be performed on the reachability of the host in general — this is simply an ICMP echo request to determine if the machine is available on the network. It cannot be assumed at this point that the machine itself is faulty (this is the mistake many current network monitoring tools make).

Before the machine itself can be declared faulty, any upstream hosts that provide connectivity between the monitoring station and the host being monitored need to be tested. As was shown in Section 4.1, there is always at least one hop between any two machines (the network interface on the machine receiving the packet). If there is more than one hop, each of these hops needs to be tested in turn before a conclusion can be drawn about where the fault lies.

The process for testing each of these upstream hosts (and each of the dependencies for that matter) is identical to the one outlined above, which is what makes this approach to testing recursive.

In practice, implementing this recursive expert system proved to be challenging. The main problem that needed to be addressed was a method for deciding when recursion should stop. While this may seem straightforward from the decision tree in Figure 7-2, real network topologies with redundant routing create loops. These loops need to be detected and dealt with in such a way that the system does not recurse ad infinitum.
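
The following Perl sketch illustrates the shape of this recursion. The four probe routines are stubs standing in for the real tests (and the check of other services on the same machine is omitted for brevity); the point of the sketch is the order of the tests and the %visited hash, which prevents redundant routing loops from causing infinite recursion.

    # A simplified sketch of the decision process in Figure 7-2.  The
    # probe routines here are stubs; only the recursion matters.
    use strict;
    use warnings;

    sub test_service     { 0 }                    # stub: service test failed
    sub dependencies_of  { () }                   # stub: no known dependencies
    sub host_reachable   { 0 }                    # stub: ICMP echo failed
    sub upstream_hops_to { ('router.example') }   # stub: one upstream hop

    my %visited;    # hosts already examined, to stop recursion in loops

    sub diagnose {
        my ($host) = @_;
        # Loop detection: a host seen before is treated as inconclusive.
        return 'ok' if $visited{$host}++;

        # 1. Is the service answering correctly?
        return 'ok' if test_service($host);

        # 2. The service is faulty: apply the same rules to each dependency.
        for my $dep ( dependencies_of($host) ) {
            return "dependency $dep at fault" if diagnose($dep) ne 'ok';
        }

        # 3. Dependencies are fine: is the host reachable at all?
        return 'service fault on host' if host_reachable($host);

        # 4. Host unreachable: test each upstream hop before blaming it.
        for my $hop ( upstream_hops_to($host) ) {
            return "upstream $hop at fault" if diagnose($hop) ne 'ok';
        }
        return 'host down';
    }

    print diagnose('sws-pgrad.ict'), "\n";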

Once these difficulties were overcome, a proof-of-concept implementation of the decision tree represented by Figure 7-2 showed that such a system could, for all the test cases at least, accurately diagnose faults on the network.

7.3.2. Testing services

One of the tests mentioned in Section 7.3.1 involves connecting to a service and testing that it functions correctly. This is perhaps the most difficult task this system has to perform, since accurately testing a network service involves an understanding of the underlying protocol that the service uses.

Supporting commonly used protocols, such as those used for e-mail and web pages, is a fairly straightforward task. These protocols are well understood and well documented, so all that is required is a minimal implementation of the client side of each protocol. The problem arises when trying to support the less common protocols.
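
As an example of such a minimal client-side implementation, the sketch below (with a placeholder host name) tests an SMTP server: RFC 821 requires the server to greet a connecting client with a 220 reply, so reading the first line of the banner is enough to confirm that a working mail service, rather than merely an open port, is answering.

    # A minimal client-side SMTP test: the server must greet us with a
    # "220" reply (RFC 821).  The host name is a placeholder.
    use strict;
    use warnings;
    use IO::Socket::INET;

    my $sock = IO::Socket::INET->new(
        PeerAddr => 'mail.example',
        PeerPort => 'smtp(25)',
        Timeout  => 5,
    ) or die "connect failed: $!\n";

    my $banner = <$sock>;          # read the greeting line
    print $sock "QUIT\r\n";        # terminate the session politely
    close $sock;

    if ( defined $banner && $banner =~ /^220[ -]/ ) {
        print "SMTP service answering correctly\n";
    } else {
        print "port open, but not a working SMTP service\n";
    }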

One method of handling these is simply to provide a basic connect test for those protocols which have not explicitly been covered. In this test, being able to connect to the port running the service is assumed to be sufficient; the service is considered to be working in this case.

This is the approach taken by the proof-of-concept system that was developed to test these ideas. A services test module was implemented in Perl, and this module was called by the expert system to test the availability of services. Test subroutines were named after the protocol's assigned port keyword, as maintained by the Internet Assigned Numbers Authority (IANA) [RFC 3232]. Examples of such keywords would be "telnet" for the Telnet protocol, "http" for the HTTP protocol, and "domain" for the DNS protocol.

Any protocol which does not have an explicitly specified test routine is handled by Perl's special AUTOLOAD subroutine, which allows programs to simulate the existence of missing subroutines. When this routine is called by the test module, it uses the name by which it was invoked (in other words, the name of the missing subroutine) as an index into the services(5) database, from where it obtains the IANA-assigned port number for the service. The system then attempts to connect to this port on the monitored host in order to determine whether any service is listening on that port.
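
The following sketch shows how such a fallback can be written. The package name and the final invocation are illustrative, but getservbyname() is Perl's standard interface to the services(5) database, and the AUTOLOAD behaviour shown follows the description above.

    package ServiceTest;
    # Sketch of the AUTOLOAD fallback: a protocol without an explicit
    # test routine falls through to a basic connect test on the port
    # listed for it in the services(5) database.
    use strict;
    use warnings;
    use IO::Socket::INET;

    our $AUTOLOAD;

    sub AUTOLOAD {
        my ($host) = @_;
        ( my $service = $AUTOLOAD ) =~ s/.*:://;   # name we were called by
        return if $service eq 'DESTROY';

        # Map the IANA keyword to a port number via services(5).
        my $port = getservbyname( $service, 'tcp' )
            or die "unknown service: $service\n";

        # Basic connect test: a successful connection is deemed a
        # working service.
        my $sock = IO::Socket::INET->new(
            PeerAddr => $host,
            PeerPort => $port,
            Timeout  => 5,
        );
        return defined $sock;
    }

    package main;
    # No telnet() routine exists, so AUTOLOAD handles the call.
    print ServiceTest::telnet('sws-pgrad.ict') ? "up\n" : "down\n";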

As has already been mentioned, simply connecting to the port is not a particularly accurate way of determining whether a service is functioning correctly. A better approach would be for the system to learn new protocols automatically, and in that way compare the results of previous tests against the current one.

The problem with this approach is that most protocols allow implementors some latitude in the way their services announce themselves or respond to queries. For example, Figure 7-3 shows the welcome banner from four different Internet Message Access Protocol (IMAP) servers. The IMAP protocol requires that an IMAP server identify itself on connection with * OK. The rest of the line is ignored by any client, and is simply provided for informational purposes [RFC 2060].

Figure 7-3. Various IMAP implementations

    * OK Courier-IMAP ready. Copyright 1998-2002 Double Precision, Inc.  See COPYING for distribution information.
      
    * OK imap.ru.ac.za Cyrus IMAP4 v2.1.9 server ready
     
    * OK Microsoft Exchange IMAP4rev1 server version 5.5.2653.23 (stork.ict.ru.ac.za) ready
    
    * OK GroupWise IMAP4rev1 Server Ready

Any application that attempts to automatically learn, for example, the IMAP protocol would need to recognise that it is only the * OK that is important, and discard all other information. This problem is a "fuzzy" one, and is perhaps best suited to higher forms of artificial intelligence, such as a neural network.
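
For comparison, a hand-written protocol-aware check encodes that rule directly: the single regular expression in the Perl fragment below accepts all four of the banners from Figure 7-3 (abbreviated here) while rejecting anything that is not a valid IMAP greeting.

    # The banners in Figure 7-3 all differ, but RFC 2060 only requires
    # that the greeting begin with "* OK".
    for my $banner (
        '* OK Courier-IMAP ready.',
        '* OK imap.ru.ac.za Cyrus IMAP4 v2.1.9 server ready',
        '* OK Microsoft Exchange IMAP4rev1 server ready',
        '* OK GroupWise IMAP4rev1 Server Ready',
    ) {
        print $banner =~ /^\* OK/ ? "valid greeting\n" : "protocol error\n";
    }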

While this approach seems to be the logical extension of the system currently employed to test services, due to time constraints, an implementation was not forthcoming. It remains to be seen if a neural network could be used to accurately determine whether services are functioning correctly, and indeed, if the results from such a system would be better than the straightforward connection approach currently employed.

7.3.3. Reporting faults

It is all very well being able to accurately determine what the cause of a network fault is, but without the ability to report the fault to a relevant person, the process is futile. Thus it is important that an intelligent method of reporting network faults is developed alongside the intelligent monitoring system.

Specifically, this reporting system needs to be able to determine who needs to be made aware of a particular fault.

In order to do this, a reporting system needs to have knowledge of which people are responsible for which services and areas of the network, and their preferred method of communication. In addition, the system needs to know the dependencies between various services, in order that it might be able to determine which other operational areas may be affected by the fault.

Two of the major shortfalls of the current symptomatic reporting are that administrators receive reports for faults outside their area of operational control, and they often receive multiple, different reports for the same fault. These problems can be solved by the use of a rule-based or expert reporting system. Such a system must necessarily be fed information by an intelligent monitoring system.

The system that was developed to report such faults used the Short Message Service (SMS) available on GSM cellular networks to notify the responsible people of the existence of faults on the network. (The method by which such reports were made is discussed in Appendix A.) Reports were divided into two categories: those that were purely informational, and those which required some action on the part of the recipient.

When a particular service failed, a message was sent to the administrator of the service in question. This message explained in as much detail as possible the location and nature of the fault. It also listed the dependencies of the service in question so that the administrator could make a judgement call on how urgently the problem needed resolving. An example of a message that could be generated by this system is "sws-pgrad.ict failed (ping,telnet,snmp), rm 42, depends diablo.ict,sears.ict (ssh,smtp,http,netbios-ns)".

At the same time, an informational message was sent to the operators of each of the dependencies informing them of the failed service. This message was not intended to invoke any response on their part, but rather to keep them aware of the fact that their service was running in a degraded state. Once a fault was resolved, a further informational message was sent to each dependency administrator informing them of the return to normality.

The system was designed in such a way that no one person received more than one message alerting them to a particular problem. Take the example in Figure 7-1 and assume that the webmaster and the hostmaster are the same person (that is, the person who runs the web server also runs the DNS server). If the DNS server were to fail, the system would notify the hostmaster of the failure. DNS is a dependency of the web server, so the webmaster should receive an informational message informing them of the DNS fault. Since the responsible person is the same, however, this message will never be sent. Should there be more than one person responsible for the web server, only those people who have not already been notified will be sent an informational message.

In the same way, the system was designed so that at any one time, no person should receive more than one message. Carrying on with the example from the previous paragraph, take the case where the webmaster and the postmaster are the same person (that is, the person who runs the web server also runs the mail server, but not the DNS server). In this case, there are two dependencies about which this person needs to be informed when the DNS server fails. The system will concatenate these messages into a single message containing information about both dependencies.
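
A sketch of these two rules follows. The operator names and message texts are illustrative, and a real implementation would hand the final string to the SMS gateway described in Appendix A; the point here is the suppression of duplicate alerts and the concatenation of the remainder into a single message per person.

    # Sketch of the reporting policy: at most one message per person.
    use strict;
    use warnings;

    my %pending;    # person => list of message texts for this dispatch
    my %alerted;    # "person|fault" => action alert already queued

    # The person responsible for the failed service gets an action report.
    sub queue_action {
        my ( $person, $fault, $text ) = @_;
        $alerted{"$person|$fault"} = 1;
        push @{ $pending{$person} }, $text;
    }

    # Informational reports are dropped for anyone already receiving the
    # action alert for the same fault (the hostmaster/webmaster case).
    sub queue_info {
        my ( $person, $fault, $text ) = @_;
        return if $alerted{"$person|$fault"};
        push @{ $pending{$person} }, $text;
    }

    queue_action( 'hostmaster', 'dns', 'dns.example failed (domain)' );
    # Two informationals to one contact model the webmaster/postmaster
    # being the same person.
    queue_info( 'webmaster', 'dns', 'www degraded: depends on dns.example' );
    queue_info( 'webmaster', 'dns', 'mail degraded: depends on dns.example' );

    # Everything queued for one person is concatenated into a single SMS.
    for my $person ( keys %pending ) {
        print "to $person: ", join( '; ', @{ $pending{$person} } ), "\n";
    }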

The reason for this policy of never sending more than one message to any particular operator is a simple social engineering technique. Experience with other notification systems dictates that when people receive multiple messages notifying them of a problem, they tend to ignore all but the first. If this first message contains all the appropriate information, there is a better chance that it will actually be read and acted upon.