Efficient management of high-performance local-area and wide-area networks is essential. However, existing algorithmic methods for network management have not matured to the point where fault detection, isolation, correlation, and correction can be automated at scale. We propose a highly scalable architecture for monitoring and controlling gigabit networks with human-in-the-loop capability.
The monitoring and control system is composed of a network probe that can be inserted in an ATM physical link; an endsystem probe; software network management agents that provide extensible multi-attribute event filtering for highly scalable data/event collection; network operation centers (NOCs), which can remotely install and (re)configure these agents; and efficient online event-ordering algorithms that help synthesize and display a consistent view of network health, status, and performance. In this paper, we emphasize the design of the network probe and of the software network management agents, which provide event filtering to reduce the total data collected and processed for gigabit-speed networks.
A Gb/s data rate has several implications for the design of a network probe, which is responsible for traffic measurements at the lowest level. For example, the network probe must be able to extract and log the necessary packet header information (or ATM cell counts) for packets arriving at Gb/s rates, and it must be able to handle thousands of packet flows or ATM connections on a given physical link. It is also essential that the network probe not interfere with actual traffic while logging traffic data. The network probe will be built using the ATM Port Interconnect Controller (APIC) chip and a CPU-memory module. The APIC-based probe meets all the requirements mentioned above. With two full-duplex 1.2 Gb/s ATM ports, the APIC can easily be inserted in a link as a probe for packet/cell "snooping" to log traffic measurements without interfering with network traffic. Moreover, the APIC's external memory/bus interface has been designed to deliver very high throughput and very low latency to applications. We also note that when the APIC is snooping on native ATM connections, it needs to log only the frame/packet header information, cell counts, and corrupted or lost frames in memory; it does not need to bring the entire frame into memory. This leads to significant savings in memory bandwidth requirements.
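The header-only logging described above can be illustrated with a minimal software sketch. It is not the APIC's hardware logic: it only assumes the standard 5-byte ATM UNI cell header layout (GFC, VPI, VCI, PT, CLP, HEC), and the function names make_header, parse_header, and count_cells are hypothetical. The sketch reads the five header bytes of each cell to update a per-connection cell count and never copies or inspects the 48-byte payload.

```python
from collections import Counter

HEADER_SIZE = 5    # bytes in an ATM cell header
PAYLOAD_SIZE = 48  # bytes in an ATM cell payload


def make_header(vpi, vci):
    """Build a 5-byte UNI header for testing (GFC, PT, CLP, HEC left at zero)."""
    return bytes([
        (vpi >> 4) & 0x0F,                           # GFC (4 bits) | VPI high 4 bits
        ((vpi & 0x0F) << 4) | ((vci >> 12) & 0x0F),  # VPI low 4 | VCI high 4
        (vci >> 4) & 0xFF,                           # VCI middle 8 bits
        (vci & 0x0F) << 4,                           # VCI low 4 | PT (3) | CLP (1)
        0,                                           # HEC (not computed in this sketch)
    ])


def parse_header(header):
    """Extract the (VPI, VCI) connection identifier from a UNI cell header."""
    b0, b1, b2, b3 = header[0], header[1], header[2], header[3]
    vpi = ((b0 & 0x0F) << 4) | (b1 >> 4)
    vci = ((b1 & 0x0F) << 12) | (b2 << 4) | (b3 >> 4)
    return vpi, vci


def count_cells(cells):
    """Count cells per (VPI, VCI), looking only at the header bytes."""
    counts = Counter()
    for cell in cells:
        counts[parse_header(cell[:HEADER_SIZE])] += 1
    return counts


# Three cells on the hypothetical connection VPI=1, VCI=32.
cells = [make_header(1, 32) + bytes(PAYLOAD_SIZE)] * 3
counts = count_cells(cells)
```

Because the log state is one counter per connection, the memory written per cell is independent of the payload size, which is the source of the bandwidth savings noted above.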
Software network management agents in our proposed system are built atop network probes and used to track event flows, as well as classify and report events of interest to NOCs. Common events reported to NOCs include alarms (such as delay or packet loss thresholds being exceeded for a service class), quality of service statistics such as the performance of particular IPv6 packet flows, and ATM connection blocking rates. When these types of events are detected by the agent, it notifies the NOC via a trap and/or performs a corresponding local action.
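As a rough illustration of threshold-based alarm detection, the following sketch checks one measurement for a service class against configured limits. The names Thresholds and check_service_class, the field names, and the units are assumptions made for this example and do not come from the proposed system; a real agent would send the resulting alarms to the NOC as traps or trigger a local action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical per-service-class limits."""
    max_delay_ms: float
    max_loss_ratio: float


def check_service_class(name, delay_ms, loss_ratio, limits):
    """Return a list of (class name, metric, observed value) alarms."""
    alarms = []
    if delay_ms > limits.max_delay_ms:
        alarms.append((name, "delay", delay_ms))
    if loss_ratio > limits.max_loss_ratio:
        alarms.append((name, "loss", loss_ratio))
    return alarms


# Example: a delay limit of 50 ms and a loss limit of 1%.
limits = Thresholds(max_delay_ms=50.0, max_loss_ratio=0.01)
alarms = check_service_class("video", delay_ms=72.5, loss_ratio=0.002, limits=limits)
# Only the delay threshold is exceeded, so one alarm is produced.
```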
In conventional low-speed networks, polling is traditionally used to monitor the current state of the network elements and to take corrective action when problems are detected. Selecting an appropriate polling interval is inherently a hard problem, and it is harder still in gigabit networks, where the tremendous volume of events can trigger an enormous number of state changes between polls. We therefore propose a highly flexible, scalable, and high-performance event-filtering mechanism for the software network management agents that effectively eliminates redundant management traffic. Using a dynamic trie-based filter fusion technique, we can reduce the work required to classify E events of size L through N event filters from O(E × L × N) to O(E × L), which is the minimum possible. Administrators and management applications can strategically install filters within agents to route events of interest to remote nodes (such as NOCs or other agents). Multiple agents and filters can thus be composed and/or arranged hierarchically to reduce unnecessary network traffic and enhance the scalability of event notification in a large-scale network.
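The effect of fusing filters into a trie can be illustrated with a minimal sketch. It assumes, for illustration only, that each event is a tuple of attribute values and that each filter is an exact-match prefix of such a tuple; the names FusedFilterTrie, install, and classify are hypothetical, and this simplified structure is not the filter fusion technique of the proposed system. All installed filters share one trie, so an event is classified in a single walk of at most L steps, regardless of how many filters N are installed.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)    # attribute value -> TrieNode
    filter_ids: list = field(default_factory=list)  # filters whose pattern ends here


class FusedFilterTrie:
    """All filters merged into one trie keyed by successive event attributes."""

    def __init__(self):
        self.root = TrieNode()

    def install(self, filter_id, pattern):
        """Add a filter; pattern is a tuple of attribute values to match as a prefix."""
        node = self.root
        for value in pattern:
            node = node.children.setdefault(value, TrieNode())
        node.filter_ids.append(filter_id)

    def classify(self, event):
        """Return the ids of all filters matching the event, in one pass over it."""
        matched = list(self.root.filter_ids)
        node = self.root
        for value in event:
            node = node.children.get(value)
            if node is None:
                break  # no installed filter shares this prefix
            matched.extend(node.filter_ids)
        return matched


trie = FusedFilterTrie()
trie.install("atm-all", ("atm",))
trie.install("atm-alarms", ("atm", "alarm"))
trie.install("ip-loss", ("ip", "alarm", "loss"))

matched = trie.classify(("atm", "alarm", "delay"))
# matched is ["atm-all", "atm-alarms"]; the "ip-loss" branch is never visited.
```

Testing each event against every filter separately would cost on the order of N × L comparisons per event; the shared trie replaces that with one dictionary lookup per attribute.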
Furthermore, the system's feedback control mechanisms will support network configuration management, ATM virtual circuit management, router-to-router link management, and application-level congestion management. The system will be evaluated experimentally on a testbed of three to five nodes, using a suite of multimedia traffic generator tools.