As part of its research equipment, the Department has several Rate Adaptive DSL Access Multiplexers (DSLAMs). These DSLAMs provide remote access to several members of staff and senior students. The notable thing about these, from a network monitoring perspective, is that they run the line with asymmetric bandwidth. The upstream RADSL link runs at anything up to 1Mb/s and the downstream bandwidth is anything up to 7Mb/s. It is possible for the bandwidth in each direction to be the same, but in the general case it is not.
Section 3.2 discussed the implications of such asymmetric bandwidth on normal SNMP monitoring techniques. For this reason, a custom application had to be written to monitor these lines. As was mentioned in Section 3.2, this application was written to test the feasibility of using an XML-based abstraction layer.
Since its inception, this monitoring tool has undergone many revisions to increase its functionality and make it more useful. Several problems were encountered in its development, some of which took significant amounts of time to resolve.
The most significant of these was locating appropriate MIBs and documentation for the DSLAMs. The Department has two different models of DSLAM. The older model uses RADSL modems at the remote end, and all management information for these devices is stored on the DSLAM itself. The newer model uses routers at the remote end, and these routers have built-in SNMP agents to handle requests for information about them. Unfortunately, the manufacturer refers to these SNMP agents as "protected", meaning that they are only accessible through the interface provided by the DSLAM. The documentation on these "protected" agents is somewhat poor, and it took many months and several e-mail messages to the manufacturer to work out how to address them. In the end, the correct way of talking to them was discovered almost purely by accident. While browsing an SNMP walk of the DSLAM with a new MIB, an OID called entityMIB.entityMIBObjects.entityLogical.entLogicalTable.entLogicalEntry.entLogicalCommunity.1001000 was noticed. This OID had a value of public@s1p1. When the DSLAM was queried with this as a community name, it produced data from the DSL modem connected to slot one, port one.
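The addressing scheme implied by this discovery can be sketched in a few lines of Python. This is only an illustration: it assumes that the suffix seen in the single observed value, public@s1p1, generalises to other slots and ports, and the function name is hypothetical rather than taken from the application.

```python
def protected_community(base: str, slot: int, port: int) -> str:
    """Build a community name addressing the agent behind one DSLAM port.

    Assumption: the "@s<slot>p<port>" pattern generalises from the one
    value observed in the SNMP walk, "public@s1p1" (slot one, port one).
    """
    if slot < 1 or port < 1:
        raise ValueError("slot and port are numbered from one")
    return f"{base}@s{slot}p{port}"


print(protected_community("public", 1, 1))  # public@s1p1
```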
The application, which fulfils the performance monitoring section of the OSI model, consists of two parts. The first is a data collector and the second is a graphical user interface that is used to display the data.
The data collection application is called by cron(8) every five minutes and polls each of the DSLAMs in order to obtain data about each of the RADSL lines attached to it. Data is collected on the up and downstream line speeds, the amount of traffic on the line, and the error and discard rates associated with the line. This collection is done using SNMP to each DSLAM, and in the case of RADSL routers, to each router as well.
As SNMP uses UDP as its transport protocol, care had to be taken to ensure that each of the devices the application wished to query was reachable. In the case of the two DSLAMs, it can be assumed that this is the case since they are always switched on; this assumption does not hold true for remote routers, which are turned on and off at the whim of the user. The penalty for trying to connect to a router that is not currently on is that the application has to wait for the full UDP timeout (five seconds) before it can decide that a particular router is down. For this reason, the data collection application is careful to query the controlling DSLAM for its list of active interfaces, which it uses to determine which routers it should attempt to poll. This problem would have been compounded had SNMP been designed to use TCP, since the default TCP timeout is significantly longer (about 75 seconds on a FreeBSD machine). Although both these timeouts are configurable, it rarely makes sense to change them, since the defaults are chosen to give the best performance in most scenarios.
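The selection step, and the cost it avoids, can be illustrated with a short Python sketch. The data structures and router names below are hypothetical; the sketch also assumes that routers are polled one after another with no retries, so each unreachable router costs exactly one five-second timeout.

```python
UDP_TIMEOUT_S = 5  # per-router timeout quoted in the text


def routers_to_poll(configured, active):
    """Return only the configured routers whose DSLAM interface is active.

    `configured` maps a router name to its DSLAM interface label and
    `active` is the set of labels the DSLAM reports as up. Both are
    hypothetical structures used only for this sketch.
    """
    return [name for name, iface in configured.items() if iface in active]


def wasted_seconds(configured, active, timeout=UDP_TIMEOUT_S):
    """Time that blind polling would spend waiting on switched-off routers."""
    skipped = len(configured) - len(routers_to_poll(configured, active))
    return skipped * timeout


configured = {"router-a": "s1p1", "router-b": "s1p2", "router-c": "s2p1"}
active = {"s1p1", "s2p1"}

print(routers_to_poll(configured, active))  # ['router-a', 'router-c']
print(wasted_seconds(configured, active))   # 5
```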
The information collected from the remote SNMP agents is processed to ensure that it is valid: for example, that data on the line speed never exceeds the theoretical maximum for the line. Any erroneous data is discarded (more precisely, it is recorded as an "unknown" value) and the validated information is stored in a round robin database using rrdtool(1).
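A minimal sketch of this validation step is given below, in Python. It assumes the line-speed ceilings mentioned earlier (1Mb/s upstream, 7Mb/s downstream) and an upstream-then-downstream ordering of data sources, neither of which is confirmed as the application's actual configuration; the function names are hypothetical. The "U" marker and the "N:" prefix are rrdtool's own notation for an unknown value and the current time respectively.

```python
UP_MAX_BPS = 1_000_000    # upstream ceiling from the text (1Mb/s)
DOWN_MAX_BPS = 7_000_000  # downstream ceiling from the text (7Mb/s)


def validate_speed(value, line_max):
    """Return the sample as a string, or "U" if it is missing or implausible."""
    if value is None or value < 0 or value > line_max:
        return "U"  # rrdtool's marker for an unknown value
    return str(value)


def update_argument(up, down):
    """Build the value string passed to `rrdtool update` (N means "now")."""
    fields = [validate_speed(up, UP_MAX_BPS), validate_speed(down, DOWN_MAX_BPS)]
    return "N:" + ":".join(fields)


# A downstream reading above the 7Mb/s ceiling is stored as unknown.
print(update_argument(800_000, 9_500_000))  # N:800000:U
```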
By design, RRDs consolidate data over time. Consolidation is performed using the average, maximum, and minimum functions, creating three separate groups of data. The RRD used for this application is configured to keep 600 5-minute samples (50 hours), 700 30-minute samples (14 days), 775 2-hour samples (roughly 64 days), and 797 1-day samples (just over two years).
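The archive layout described above can be expressed as rrdtool round robin archive (RRA) definitions, and the retention of each archive follows directly from its step count and row count. The Python sketch below derives both. The xfiles factor of 0.5 is an assumption, as the text does not state the value used.

```python
STEP_S = 300  # base sampling interval: five minutes

# (base steps per consolidated sample, number of rows), from the text
ARCHIVES = [(1, 600), (6, 700), (24, 775), (288, 797)]
FUNCTIONS = ["AVERAGE", "MAX", "MIN"]
XFF = 0.5  # assumed xfiles factor; the source does not state one


def rra_definitions():
    """Build RRA arguments in rrdtool's RRA:CF:xff:steps:rows form."""
    return [
        f"RRA:{cf}:{XFF}:{steps}:{rows}"
        for cf in FUNCTIONS
        for steps, rows in ARCHIVES
    ]


def retention_hours(steps, rows):
    """Hours of history held by one archive."""
    return steps * rows * STEP_S / 3600


for steps, rows in ARCHIVES:
    minutes = steps * STEP_S // 60
    print(f"{rows} samples of {minutes} min: {retention_hours(steps, rows):.0f} hours")
```

Three consolidation functions over four resolutions give twelve archives in total, which is the "three separate groups" of data referred to above.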
This configuration was chosen partly because it gives a roughly equal number of samples over each of a day, a week, a month and a year. These periods are used for graphing, as shall be seen shortly. The configuration is also compatible with other network monitoring tools, such as MRTG. By storing data for two years, any long-term trends that appear can easily be identified.
The second component of the RADSL monitoring application is a graphical user interface. The interface is web-based, largely because this makes the data easily accessible to all users of the RADSL lines. This web interface is probably the component of the system that has undergone the most change, going from a simple, static web page to a fully fledged web application with dynamically generated content.
Initially the interface was fairly straightforward, based on ideas from a previous Masters student [Irwin, 2001]. This interface provided graphs for the up and downstream line speeds as well as a measure of the traffic on each of twenty RADSL lines. These twenty lines formed the initial infrastructure, with a further twenty-four lines being added subsequently. The graphs were generated at fixed time intervals and were displayed on a static web page.
Since then, the interface has been significantly re-worked. It now monitors all forty-four of the Department's RADSL lines, with the facility to easily add more. Line speed, traffic, errors and discards are all monitored and graphed. All web pages are dynamically generated, and contain information extracted from the access multiplexers as the page is generated. An example of a web page from this system is given in Figure 3-3.
As can be seen from Figure 3-3, this application displays a number of graphs. For each of the metrics (line speed, traffic, errors, and discards), four graphs are displayed. Each of these graphs displays the up and downstream values for the metric plotted against time. Each graph displays a different time frame in the same amount of space, being a day, a week, a month, and a year.
The information collected by this system is used for a number of purposes. It is used to determine which lines are most utilised, and this data is used to help balance the load between the blades on the access multiplexer. Each access multiplexer has a number of blades that are connected to external lines. Every line on a blade shares a common uplink from the blade to the rest of the network, so to get the best performance it is important to evenly distribute the load amongst the available blades.
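The load-balancing use of this data can be illustrated with a small Python sketch that sums per-line traffic for each blade and identifies the busiest and quietest blades. The line names, blade names, traffic figures, and function names are all hypothetical; the text does not describe how the balancing decision is actually made.

```python
from collections import defaultdict


def load_per_blade(line_traffic, line_to_blade):
    """Sum the average traffic of every line attached to each blade."""
    totals = defaultdict(float)
    for line, traffic in line_traffic.items():
        totals[line_to_blade[line]] += traffic
    return dict(totals)


def busiest_and_quietest(totals):
    """Return the names of the most and least loaded blades."""
    return max(totals, key=totals.get), min(totals, key=totals.get)


# Hypothetical average traffic per line, in kb/s.
line_traffic = {"line1": 400, "line2": 350, "line3": 50, "line4": 20}
line_to_blade = {"line1": "blade1", "line2": "blade1",
                 "line3": "blade2", "line4": "blade2"}

totals = load_per_blade(line_traffic, line_to_blade)
print(totals)
print(busiest_and_quietest(totals))
```

In this invented example both heavily used lines share one blade, so moving one of them to the other blade would spread the load on the shared uplinks more evenly.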
When a line is faulty, the information collected from the system is used to determine when the fault occurred and the duration of the fault condition. If the line fault is reported to the local telephone company (telco), this information is provided with the report to aid them in their diagnosis. Since this information can be compared against the dates on job cards in the area, et cetera, the availability of this data has often allowed the telco to quickly identify and rectify faults that would otherwise have taken a significant amount of time to trace.
The configuration of this application was done using the XML approach described in Section 3.2. It uses this configuration file to determine the host name to connect to, the SNMP community to use and the OIDs to retrieve from each host. A sample of this configuration is given in Figure 3-4.
Figure 3-4. Subset of configuration file
<?xml version="1.0" standalone="yes" ?>
<monitor>
  <host name="dslam1.ict.ru.ac.za" community="public">
    <oids>
      <oid name="upspeed" type="gauge" precision="32">.220.127.116.11.4.1.1718.104.22.168.22.214.171.124.1.1.6.$port.31</oid>
      <oid name="downspeed" type="gauge" precision="32">.126.96.36.199.188.8.131.52.1.5.$port</oid>
    </oids>
  </host>
  <host name="dslam1.ict.ru.ac.za" community="public">
    <oids>
      <oid name="communities">.184.108.40.206.220.127.116.11.18.104.22.168</oid>
      <oid name="interfaces">.22.214.171.124.126.96.36.199.1.3</oid>
    </oids>
  </host>
  <host name="dslam1.ict.ru.ac.za" community="public@s1p1">
    <oids>
      <oid name="upspeed" type="gauge" precision="32">.188.8.131.52.4.1.17184.108.40.206.220.127.116.11.1.1.6.$port.31</oid>
      <oid name="downspeed" type="gauge" precision="32">.18.104.22.168.22.214.171.124.1.5.$port</oid>
    </oids>
  </host>
</monitor>
This format of configuration files makes it easy to extend the system to add new access multiplexers. The system itself is designed to automatically detect the presence of remote RADSL modems and will start logging data as soon as a new remote network is detected. This makes the system almost completely self-configuring once the initial information has been provided.
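One way a collector might read a configuration of this shape is sketched below in Python. It is not the application's actual code. The host name is a placeholder, the OID is the standard IF-MIB ifSpeed column used as a stand-in rather than one of the DSLAM's own OIDs, and the use of string.Template to expand $port is an assumption, since the text does not describe how the placeholder is substituted.

```python
import xml.etree.ElementTree as ET
from string import Template

# Illustrative configuration in the same shape as Figure 3-4.
SAMPLE = """<?xml version="1.0" standalone="yes" ?>
<monitor>
  <host name="dslam.example.org" community="public">
    <oids>
      <oid name="downspeed" type="gauge" precision="32">.1.3.6.1.2.1.2.2.1.5.$port</oid>
    </oids>
  </host>
</monitor>"""


def load_targets(xml_text, port):
    """Return (host, community, metric name, OID) tuples with $port expanded."""
    root = ET.fromstring(xml_text)
    targets = []
    for host in root.iter("host"):
        for oid in host.iter("oid"):
            # Template treats "$port" as a placeholder and leaves the
            # surrounding dotted numbers untouched.
            expanded = Template(oid.text.strip()).substitute(port=port)
            targets.append(
                (host.get("name"), host.get("community"), oid.get("name"), expanded)
            )
    return targets


for target in load_targets(SAMPLE, port=3):
    print(target)
```

Adding a new access multiplexer then amounts to adding another host element, with no change to the collector itself.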