Part of the fault management section of the OSI network management model covers the timely reporting of faults and error conditions to network administrators. Such reports allow administrators to respond to and diagnose problems quickly. The more accurate the information provided to the administrator, the better their ability to respond. Unfortunately, many existing network monitoring systems (such as Big Brother, which was mentioned in Section 2.2.6) are limited in their ability to provide fault reports: they report on the symptoms of a problem rather than its cause.
It would be useful if this limitation could be overcome and more intelligent reporting systems could be developed. These systems could make use of various forms of artificial intelligence in order to make their own diagnosis of faults rather than relying on the network administrators to perform this tedious task. Such intelligent systems form the focus of this chapter.
Big Brother provides a good example of a traditional network monitoring utility. It can be configured to periodically connect to various network services and check that these services are still functioning as intended. It has various notification abilities and can be configured to alert the network manager if these services are not working as they should [BigBrother, 2002].
The biggest limitation of such systems is that they can only report error conditions symptomatically, and the symptoms are limited to those of the services they have been configured to monitor. This is best illustrated by an example. Consider the network in Figure 7-1.
In this network there are two completely separate subnets that are interconnected by a layer three router (this could also be a layer three core switch). On one subnet there is a DNS server and the machine responsible for network monitoring. On the other subnet there is a web server and a mail server. The router in between the two subnets also knows how to route packets to the Internet.
The network monitoring station is configured to check that the web server is correctly functioning as a web server, the mail server is correctly functioning as a mail server, et cetera. It does this by, for example, connecting to the web server and trying to request a web page. Should any of these services go down, it will report the issue to the network administrator in some way (the actual method of reporting is not significant to this example).
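A check of this kind can be sketched as follows. This is only an illustration of the idea and not how Big Brother itself is implemented: the function name `check_web_server`, its parameters, and the decision to treat any 2xx or 3xx status as "working" are assumptions made for the example.

```python
import http.client


def check_web_server(host, port=80, path="/", timeout=5.0):
    """Return True if an HTTP GET for `path` is answered with a 2xx/3xx status.

    Hypothetical helper: the success criterion is an assumption; a real
    monitor might also inspect the page content or the response time.
    """
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("GET", path)
            status = conn.getresponse().status
        finally:
            conn.close()
    except (OSError, http.client.HTTPException):
        # Refused connection, timeout, unreachable host or malformed reply:
        # from the monitor's point of view the service is simply "down".
        return False
    return 200 <= status < 400
```

Note that the function can only answer "did the page arrive?". A failed web server and a failed device somewhere on the path to it both produce the same `False`.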
This works very well in many cases. If the web server ceases to function, the monitoring application will pick this problem up and report it.
What happens, however, when the switch serving subnet 2 stops passing packets? The monitoring machine attempts to contact the web server and fails because it is no longer reachable. It reports this matter to the network administrator. Shortly afterwards, it attempts to contact the mail server and discovers that it too is not accessible. It sends another report to the network administrator alerting them of this fact. In both cases, the reported faults are symptoms of the real fault. It is up to the network administrator to figure out where the real fault lies.
Now consider what happens when subnet 2 supports a large number of servers, each monitored and each performing a different function. The network administrator may be overwhelmed with problem reports.
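The effect can be imitated with a small simulation. The sketch below is a deliberately simplified model loosely based on Figure 7-1; the topology table `UPSTREAM`, the node names, and the functions `is_reachable` and `symptomatic_reports` are all hypothetical and exist only for illustration. Each server is reachable only if every device between it and the monitoring station is up, and the monitor produces one report per failed check.

```python
# Hypothetical topology: each node maps to the device it is reached through
# (None marks the device closest to the monitoring station).
UPSTREAM = {
    "router": None,
    "switch2": "router",
    "web": "switch2",
    "mail": "switch2",
}


def is_reachable(node, failed, upstream=UPSTREAM):
    """A node is reachable if neither it nor any device on its path has failed."""
    while node is not None:
        if node in failed:
            return False
        node = upstream[node]
    return True


def symptomatic_reports(monitored, failed, upstream=UPSTREAM):
    """Produce one report per monitored service that fails its check."""
    return [
        f"{name} is not responding"
        for name in monitored
        if not is_reachable(name, failed, upstream)
    ]


if __name__ == "__main__":
    # The switch has failed, but only the servers are being monitored.
    for line in symptomatic_reports(["web", "mail"], failed={"switch2"}):
        print(line)
```

With `switch2` marked as failed, the monitor produces one report for `web` and one for `mail`, and neither mentions the switch. Adding more monitored servers behind the same switch adds one more report each.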
In large networks where responsibility is distributed, these symptomatic fault reports can be even more problematic. Say, for example, that the web server is managed by the webmaster and the mail server is managed by the postmaster (and these are two different people). When the switch serving subnet 2 fails, the webmaster will receive a report about the web server and the postmaster will receive a report about the mail server. To continue the example, say the network infrastructure is managed by a third person, the hostmaster. In this case, the hostmaster receives no reports of any faults and relies instead on the postmaster and webmaster to inform him of the problem.
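The routing problem can be made concrete with a short sketch. The `OWNERS` table and the `route_reports` function are hypothetical; they simply assume that each monitored service has exactly one responsible role and that reports are delivered only to that role.

```python
# Hypothetical mapping from monitored service to the role that is notified.
# Only services appear here: the switch itself is not monitored.
OWNERS = {"web": "webmaster", "mail": "postmaster"}


def route_reports(failed_checks, owners=OWNERS):
    """Group failed service checks by the role that will receive the report."""
    inbox = {}
    for service in failed_checks:
        inbox.setdefault(owners[service], []).append(service)
    return inbox


if __name__ == "__main__":
    # Both checks fail because the switch in front of the servers is down.
    for role, services in route_reports(["web", "mail"]).items():
        print(role, "is told about:", ", ".join(services))
```

The resulting inbox has entries for the webmaster and the postmaster only. The one role able to repair the switch never appears in it.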
It is fairly easy to see from these simple examples why symptomatic reporting is problematic. As the complexity of the network increases, these problems are compounded. What is needed to improve the situation is some form of intelligence within the network monitoring software: it needs to be able to diagnose simple network faults by itself and report them more appropriately to a person who is in a position to resolve the problems.
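One very simple form such a diagnosis could take is sketched below. It is not presented as the approach developed in this chapter; it is a plain dependency check that assumes the monitor knows the network topology and can probe the infrastructure devices as well as the servers. The tables `UPSTREAM` and `OWNERS` and the functions `diagnose` and `diagnosed_reports` are hypothetical names introduced for the example.

```python
# Hypothetical topology: each node maps to the device it is reached through.
UPSTREAM = {
    "router": None,
    "switch2": "router",
    "web": "switch2",
    "mail": "switch2",
}

# Hypothetical responsibilities, now including the infrastructure devices.
OWNERS = {
    "router": "hostmaster",
    "switch2": "hostmaster",
    "web": "webmaster",
    "mail": "postmaster",
}


def diagnose(unreachable, upstream=UPSTREAM):
    """Return the unreachable nodes whose upstream device is still reachable.

    These are the failures closest to the monitor. Everything unreachable
    behind them is treated as a symptom and is not reported separately.
    """
    causes = set()
    for node in unreachable:
        parent = upstream[node]
        if parent is None or parent not in unreachable:
            causes.add(node)
    return sorted(causes)


def diagnosed_reports(unreachable, upstream=UPSTREAM, owners=OWNERS):
    """Pair each probable cause with the role able to resolve it."""
    return [(owners[node], node) for node in diagnose(unreachable, upstream)]


if __name__ == "__main__":
    # The switch and both servers behind it fail their probes.
    for role, node in diagnosed_reports({"switch2", "web", "mail"}):
        print(f"to {role}: probable fault at {node}")
```

When the switch and both servers behind it fail their probes, this yields a single report, addressed to the hostmaster and naming `switch2`. When only the web server fails, the report names `web` and goes to the webmaster. The sketch cannot tell a failed switch from, say, a failed cable between the router and the switch, which is one reason more capable techniques are worth investigating.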
The names "webmaster", "postmaster" and "hostmaster" are derived from RFC 2142, Mailbox Names for Common Services, Roles and Functions.