5.3. Problems and solutions

Any application of this scale is bound to run into some problems along the way, and this network monitoring tool was no exception. These ranged from social issues to networking problems, and this section aims to document some of the more interesting ones.

5.3.1. Social considerations

One of the first problems encountered was social rather than technical. There are a few people on campus who run personal firewall software on their machines, and some of these people have their software configured to report on all unknown traffic. One of these people noticed ICMP echo requests coming from the monitoring machine and decided to investigate the matter. In a Usenet post to the University's ru.chat newsgroup, he asked if anyone knew the reason for the probe.

Several posts followed in which it was explained that the probes were harmless, useful, and accounted for an insignificant amount of traffic to his machine (128 bytes an hour). One of the comments that came out of the discussion was:

When a stream of pings arrive at my UTP socket it would be nice to 
know why *before* it happens, not after, especially in the light of 
nimba (sic) and friends.

In the light of this, and in consultation with the Systems Manager at Rhodes, it was decided to change the source address of all ICMP echo requests to reflect their purpose more accurately. This new source address had a reverse DNS entry of we.are.drawing.network.maps.of.ru.ac.za, making it obvious that the probes were intentional. In addition, a web page was set up at we.are.drawing.network.maps.of.ru.ac.za informing people of exactly what the modus operandi of the project was.
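The probes themselves are ordinary ICMP echo requests. As a minimal sketch (the identifier, sequence number, and payload size below are illustrative assumptions, not values taken from the monitoring tool), such a request can be laid out as RFC 792 specifies, and its size checked against the figure of 128 bytes an hour quoted above, assuming one probe run every half hour:

```python
import struct

def inet_checksum(data: bytes) -> int:
    """RFC 1071 internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def echo_request(ident: int, seq: int, payload: bytes = bytes(56)) -> bytes:
    """Build an ICMP echo request (type 8, code 0) per RFC 792."""
    # Checksum is computed over the message with the checksum field zeroed.
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = inet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload

pkt = echo_request(ident=0x1234, seq=1)

# With the conventional 56-byte payload, the ICMP message is 64 bytes,
# so two probe runs an hour come to 128 bytes per host.
assert len(pkt) == 64
assert inet_checksum(pkt) == 0   # a correct checksum verifies to zero
assert 2 * len(pkt) == 128       # bytes per host per hour
```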

Since this first query about the project, it has run for over a year with no further comment. It can only be assumed that those people who have noticed the requests have been satisfied with the explanation available on the web page.

5.3.2. Network cards

Early in the experiment, the monitoring machine experienced problems in transferring the ICMP echo requests it generated onto the physical network. This was characterised by errors in the system log indicating a lack of memory buffers (mbufs). Some research indicated that the likely cause of these errors was the machine's SMC 1211-TX network card. The SMC 1211-TX is based around the RealTek 8139 chipset, which is known to have problems handling a high load.

The problems with the RealTek network interface controller (NIC) chipset are best described by Bill Paul's commentary in the source code for FreeBSD's rl(4) driver. The first two paragraphs of this commentary are reproduced below [Paul, 1998].

The RealTek 8139 PCI NIC redefines the meaning of 'low end.' This is
probably the worst PCI ethernet controller ever made, with the possible
exception of the FEAST chip made by SMC. The 8139 supports bus-master
DMA, but it has a terrible interface that nullifies any performance
gains that bus-master DMA usually offers.

For transmission, the chip offers a series of four TX descriptor
registers. Each transmit frame must be in a contiguous buffer, aligned
on a longword (32-bit) boundary. This means we almost always have to
do mbuf copies in order to transmit a frame, except in the unlikely
case where a) the packet fits into a single mbuf, and b) the packet
is 32-bit aligned within the mbuf's data area. The presence of only
four descriptor registers means that we can never have more than four
packets queued for transmission at any one time.

The second paragraph is perhaps the most significant. It notes that no more than four packets can ever be queued for transmission. For an application that is trying to generate 65536 ICMP echo requests in batches of 128 at a time, this limitation is likely to have a noticeable effect.
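The scale of the mismatch can be illustrated with some back-of-the-envelope arithmetic (a simplified model of the descriptor ring, not a measurement of the actual driver):

```python
# Figures from the text: 65536 probes per run, sent in batches of 128;
# the RealTek 8139 can queue at most four frames at a time.
PROBES_PER_RUN = 65536
BATCH_SIZE = 128
TX_DESCRIPTORS = 4

batches_per_run = PROBES_PER_RUN // BATCH_SIZE         # 512 batches per run
refills_per_batch = BATCH_SIZE // TX_DESCRIPTORS       # 32 descriptor refills
refills_per_run = batches_per_run * refills_per_batch  # 16384 refills per run
```

Each 128-frame batch therefore has to be fed through the card in at least 32 separate descriptor refills, so outgoing frames back up in mbufs while the driver waits for the four slots to drain.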

With this in mind, the network card in the monitoring machine was replaced with an Intel Pro/100S card (sometimes also known as an EtherExpress Pro 10/100S, but this use is deprecated by Intel). This card was chosen because it is known to work very well with FreeBSD — the operating system on which this application was being run — and is extensively used on the FreeBSD Project's own network [FreeBSD, 2002].

The result was an immediately noticeable increase in performance and throughput. The error messages regarding the lack of mbufs disappeared, leading to the conclusion that the problem was indeed related to the use of a RealTek-based network card.

5.3.3. MySQL problems

As was mentioned in Section 5.1, MySQL was chosen as the database engine for this application because its performance was known to be substantially better than that of other open-source databases. The MySQL backend performed very well for the first five months of the project, easily handling the seventy thousand or so records that were inserted each day.

Unfortunately, as the database grew larger, performance slowly degraded. When the record count reached around eleven million, the performance of the database had become so bad that it was taking longer to update the indices on the database than it was to generate the data in the first place. This loss of performance was unacceptable because it affected the usability of the machine that was doing the monitoring.
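These growth figures are mutually consistent, as a simple calculation shows (the 30-day month is an approximation):

```python
# Figures from the text: roughly seventy thousand records inserted each
# day, with performance collapsing at around eleven million records.
RECORDS_PER_DAY = 70_000
PROBLEM_THRESHOLD = 11_000_000

days = PROBLEM_THRESHOLD / RECORDS_PER_DAY   # ~157 days
months = days / 30                           # ~5.2 months
```

At that insertion rate the eleven-million-record mark is reached in a little over five months, matching the point at which the degradation became unacceptable.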

An investigation with FreeBSD's truss(1) utility determined that the performance loss in the database engine was due to the system blocking on disk writes. The machine running the monitoring application had a single 20 GB Western Digital IDE hard drive, running at 5600 RPM. This drive is slow by modern standards, so it was conceivable that using a different hard disk would fix the problem.

Lacking a suitable replacement, another solution had to be found. As an interim measure, the existing eleven million records in the database were moved to offline storage in early April 2002. The contents of the database were then deleted, restoring the performance of the database engine and the machine as a whole. Since the contents of this database had already been summarised in a round robin database, this rather drastic solution had very little noticeable impact on the outputs of the application.

Over the next six months, the database again grew to more than ten million records. The resultant loss in performance recurred, to the point where another solution needed to be found.

This time, the records were moved to a new machine. This machine had a series of SCSI hard disks in a RAID 5 configuration, offering significantly better disk performance than the original machine.

The original eleven million records were also imported into this new database, resulting in a single MySQL database that had a little over twenty-one million records in it.

To test the performance of this new database, some complex queries were run against it. The same queries were also run against the existing ten million records on the original machine to provide a comparison. The difference between the two results was significant: a query that took a little over an hour on the original machine took just over fifteen minutes on the machine with SCSI drives. This clearly demonstrated the need for high-performance disk drives in machines running heavily utilised databases.

5.3.4. Routed subnets

One of the assumptions made while determining the impact of this monitoring application on the network's bandwidth was that all backbone links had sufficient bandwidth to carry the ICMP requests for all the machines they were serving. This assumption is reasonable since, as was explained in Section 2.4.1, the majority of the inter-switch links at Rhodes run at between 100 Mbps and 1 Gbps.

On two occasions during the year, this assumption was invalidated. Both of these occasions involved large subnets being routed over a dial-in line.

The first was a conference which Rhodes hosted at a local hotel. Connectivity for the conference was provided by a 64 Kbps ISDN dial-in line. This line had a class C network (256 IP addresses) routed over it, of which approximately ten were in use. Fortunately the problem was pre-empted and the ISDN router that provided the connection onto Rhodes' local area network was configured to reject all ICMP requests from the monitoring machine. This effectively prevented the ICMP queries from interfering with the remote dial-in network.

Unfortunately this was not so in the second case. This time an entire twenty-one bit (/21) network was routed over a 33.6 Kbps analogue dial-in connection between Grahamstown and Johannesburg. Of these 2048 IP addresses, fewer than ten were actually in use, so in many ways this routing was overkill.

Once every half hour, the monitoring system attempted to send 128 KB of data down this line. As a result, for roughly thirty seconds of each run the line was completely saturated by the ICMP echo requests. This delay was noticeable to users of the dial-in line who were attempting to communicate with Rhodes using interactive protocols such as telnet or ssh.
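The thirty-second figure can be verified with a short calculation, assuming a 64-byte ICMP echo message (consistent with the 128 KB per run quoted above) and ignoring IP and PPP framing overhead, which only makes matters worse:

```python
# A /21 network contains 2^(32-21) = 2048 addresses; at 64 bytes per
# echo message, one probe run sends 128 KB towards the dial-in line.
ADDRESSES = 2 ** (32 - 21)                  # 2048 hosts in a /21
BYTES_PER_PROBE = 64                        # assumed ICMP message size
burst_bytes = ADDRESSES * BYTES_PER_PROBE   # 131072 bytes = 128 KB

LINK_BPS = 33_600                           # 33.6 Kbps analogue line
seconds = burst_bytes * 8 / LINK_BPS        # ~31 seconds to drain
```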

After a brief telephone call with the Systems Manager, the problem was resolved by configuring the monitoring system to simply ignore the subnet in question for the duration of the event.