How I streamlined one company's network layout and improved performance
Many companies depend on 802.11b wireless networks to complement their wired LAN connectivity, but wireless networks can pose unique administration challenges that require special attention. Small flaws can quickly compound to bring a wireless network to a crawl. Exercising extreme care in configuring and managing your wireless network is crucial. The wireless ISP (WISP) UtahWISP learned this lesson the hard way.
When UtahWISP started operations in northern Utah, it was among the first to provide broadband Internet access to fringe areas and new residential developments that lacked other affordable broadband solutions. The company placed a pair of wireless access points (APs) on nearby towers and installed receivers on customer rooftops so that anyone within line-of-sight of a tower would have broadband Internet access.
However, by providing a service where no other was available, UtahWISP saw its network quickly stressed by rapid customer growth. When its customer base reached more than 200 users and 12 APs, some wireless customers began to experience sluggish browsing, lost packets, and slow ping times. And because UtahWISP was hardly utilizing the bandwidth leaving its network, it was clear this was a wireless problem.
Crowded wireless networks are an increasingly common problem, so I thought UtahWISP would be a perfect candidate for a network makeover. After gathering a little more background information about the performance concerns, I visited UtahWISP to see how such a network makeover might help. Here's the story of one company's complete network makeover—from getting a bird's-eye view of the symptoms and identifying potential causes to implementing solutions and finally solving the problem.
Getting the Big Picture
I knew the network was slow, but after a cursory examination, I didn't know exactly why. I found plenty of unused bandwidth, and performance problems didn't correlate with typical usage patterns. Furthermore, performance concerns didn't affect all users at all times. Some users regularly reported problems, but others experienced quick response times and fast download speeds. According to Dale Meredith, UtahWISP's IT manager, "It didn't make sense. It was frustrating because we were nowhere near our expectations of wireless capacity."
I had dealt with slow wireless networks in the past, but this one was particularly difficult. In UtahWISP's demanding wireless environment, many problems seemed to compound to have a significant impact on network performance. But was this network really different from any other office environment? Surely, radio interference and signals blocked by buildings and trees weren't terribly different from those blocked by the walls and furniture of an office environment. The only real difference was scale.
Before delving into the nitty-gritty, I wanted to step back and get the big network picture. I knew UtahWISP had 12 APs on two separate towers. Each tower served customers within a 10-mile radius and had a fast wireless backhaul link to the main office, at which point it connected to the WISP's main router.
To map the network, I used AdRem NetCrunch 3.1 Premium. NetCrunch not only got me started visualizing the network but also would give UtahWISP a long-term monitoring and reporting solution. Using NetCrunch's automatic-discovery feature, I built a visual map of the network. Although the result of the automatic discovery required some manual tweaking, it saved me a considerable amount of time. After some fine-tuning, I came up with an accurate representation of UtahWISP's network, which Figure 1 shows.
To monitor network performance, the company had been using a variety of tools, mostly Ping. When a customer called to complain of slow performance, UtahWISP's administrators opened a command prompt to ping that customer for a certain period and watch for problems. Unfortunately, the Windows Ping utility doesn't have much flexibility, and the continuous echo requests contributed to the user's problems by generating even more traffic on the network. Furthermore, Ping is hardly scalable and its output is difficult to visualize.
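To make the limits of eyeballing raw Ping output concrete, here is a minimal Python sketch that reduces a batch of round-trip times to a few summary numbers. The function name `summarize_ping` and the sample values are invented for illustration; they are not UtahWISP measurements or NetCrunch output.

```python
from statistics import mean

def summarize_ping(samples_ms):
    """Summarize round-trip times in milliseconds; None marks a timed-out probe."""
    replies = [s for s in samples_ms if s is not None]
    lost = len(samples_ms) - len(replies)
    return {
        "sent": len(samples_ms),
        "loss_pct": 100.0 * lost / len(samples_ms) if samples_ms else 0.0,
        "avg_ms": mean(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }

# Hypothetical probe results for one customer: two timeouts and one slow reply.
samples = [42.0, 45.0, None, 300.0, 41.0, None, 44.0, 48.0]
summary = summarize_ping(samples)
print(summary)
```

A summary like this can be computed from a handful of widely spaced probes, which avoids the extra load that a continuous ping puts on an already struggling link.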
NetCrunch offers better monitoring features, including the ability to monitor SNMP counters, Syslog data, Windows Performance Counters, and ping times. The software provides dozens of configuration parameters for customizing monitoring for specific networks. NetCrunch also provides powerful charting features for real-time monitoring of the aforementioned parameters.
My primary objective was to track down network performance problems. The NetCrunch network map, when we first saw it, clearly illustrated the existence of a problem: The map showed slow response times for some hosts, and the network seemed generally sluggish at times. I noticed that hosts would frequently go down, appearing as red icons on the map. These occurrences were probably the result of dropped packets, particularly considering the random nature of hosts' icons turning red.
I realized that UtahWISP's problem was deeper than I'd originally expected. Typically, you can track down wireless problems to one culprit, but this network was different. I could pinpoint no single major problem; apparently, a series of smaller problems was responsible for the slowdowns.
I would eventually need as much information as possible from the network, so I decided to install the Network Intelligence enVision HA Series log-collection device. This piece of hardware would let me aggregate logs from various devices across the network and gather all data in one place for analysis.
The Network Intelligence HA appliance, configured with the company's enVision software, is great for collecting, aggregating, and reporting on log data from a variety of platforms. The appliance has built-in templates for interpreting events coming from dozens of routers, firewalls, and other network devices.
The appliance is designed for high-speed log collection and can receive tens of thousands of events per second, so it was especially useful for this busy WISP environment. The appliance let me examine individual events as they came in and let me create aggregate views of the data through reports or custom queries. The appliance also let me create complex correlation rules that trigger alerts based on multiple events from multiple sources.
I configured the Cisco router to send its Syslog data to the Network Intelligence appliance. I also configured enVision to collect Windows event logs from the company's critical servers (e.g., Web, email, and DNS servers). Doing so let me correlate multiple events from multiple locations to get a better view of the network and have the necessary data to easily go back later and track down any problems.
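For readers unfamiliar with what a collector receives, each classic BSD-style Syslog message begins with a priority value in angle brackets that encodes the facility and severity as facility × 8 + severity. The Python sketch below decodes that field. The function name `parse_priority` and the sample message are made up for illustration, and this is not how enVision itself is implemented.

```python
import re

# The leading "<NNN>" field of a BSD-style Syslog message.
PRI_PATTERN = re.compile(r"^<(\d{1,3})>")

SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]

def parse_priority(line):
    """Return (facility_number, severity_name), or None if no priority field is present."""
    match = PRI_PATTERN.match(line)
    if match is None:
        return None
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, SEVERITIES[severity]

# Hypothetical message text; 189 = 23 * 8 + 5, i.e. facility local7, severity notice.
sample = "<189>example message from a router"
print(parse_priority(sample))
```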
After just a few minutes of collecting log data, I saw in the enVision alert monitor (which Figure 2 shows) at least one possible explanation for the performance problems: The network was flooded with virus, spyware, and peer-to-peer (P2P) file-sharing traffic. In a moment, I'll show you how we tackled these problems.
With log collection in place, I set out to learn more about how wireless network traffic travels through the air. I knew that the wireless traffic was half-duplex, which told me that congestion and packet collision might be a concern. In half-duplex networks, multiple hosts share the same medium, such as a network cable or, in this case, a wireless channel. Because only one host can transmit on the medium at a time, you need schemes to manage traffic and avoid packet collision. This process involves ensuring that other hosts aren't transmitting, detecting collisions if packets collide, and retransmitting packets if necessary. With many hosts on the same medium, more collisions and therefore more retransmissions will occur. Therefore, network efficiency on a half-duplex network is largely affected by the number of hosts and the amount of network traffic that those hosts generate.
In a half-duplex wired network, you use a shared hub. In a full-duplex wired network, you have a switched environment. UtahWISP had the equivalent of a giant hub with more than 200 users plugged into it.
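A toy model shows why host count matters so much on a shared medium. The Python sketch below assumes time is divided into slots and that each host independently transmits in a given slot with the same probability; a slot carries useful traffic only when exactly one host transmits. This is a deliberate simplification and not a model of real 802.11 channel access, and the host counts and probabilities are illustrative assumptions.

```python
def slot_outcomes(hosts, p_send):
    """Return (idle, success, collision) probabilities for one time slot.

    Assumes each of `hosts` stations transmits independently with
    probability `p_send`; two or more simultaneous senders collide.
    """
    idle = (1 - p_send) ** hosts
    success = hosts * p_send * (1 - p_send) ** (hosts - 1)
    collision = 1 - idle - success
    return idle, success, collision

# Compare a small segment with a 200-host shared channel at the same per-host load.
for hosts in (20, 200):
    idle, success, collision = slot_outcomes(hosts, 0.01)
    print(f"{hosts:3d} hosts: idle={idle:.3f} success={success:.3f} collision={collision:.3f}")
```

Under these assumptions, the share of slots lost to collisions grows much faster than the number of hosts, which matches the intuition behind the giant-hub comparison.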
The problem gets more complex. The radio links of wireless networks traverse a crowded, unreliable, and sometimes hostile spectrum. Radio frames might be lost and require retransmission. Frequent retransmissions might cause enough latency to affect TCP/IP packets in the payload. TCP/IP expects packet loss from congestion but doesn't expect loss from unreliable connections. It thinks that when a packet is lost, it's because the network is too crowded and busy. Consequently, it reacts by dropping its transmission window size before retransmitting packets, initiating congestion-control mechanisms, and backing off its retransmission timer—effectively slowing down to compensate for what it thinks is a crowded network. The result is less network congestion but slower performance.
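The effect of misread loss can be sketched with a heavily simplified additive-increase, multiplicative-decrease loop in Python. This is not a faithful TCP implementation: it ignores slow start, timers, and everything else in a real stack, and the function name `aimd` and its parameters are invented for illustration. It only shows how halving the window on every loss keeps throughput low when losses come from the radio link rather than from congestion.

```python
def aimd(loss_events, start=1, cap=64):
    """Track a congestion window (in segments) across round trips.

    `loss_events` is a sequence of booleans, one per round trip:
    True means a segment was lost in that round.
    """
    cwnd = start
    history = []
    for lost in loss_events:
        if lost:
            cwnd = max(1, cwnd // 2)   # back off as if the network were congested
        else:
            cwnd = min(cap, cwnd + 1)  # probe for more bandwidth
        history.append(cwnd)
    return history

clean_link = aimd([False] * 12)
lossy_link = aimd([False, False, True] * 4)  # hypothetical: one loss every third round
print(clean_link[-1], lossy_link[-1])
```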
Furthermore, if the radio link is poor, network performance will suffer much more than necessary. "Our hardware manufacturer brought out some equipment to test the airwaves," Dale said, "and saw more interference in the spectrum in our area than the manufacturer had seen in other larger metropolitan areas." This crowded spectrum, coupled with the inherent weaknesses of the technology, was a big problem to overcome. Some of these problems were insurmountable, but if we could eliminate any unnecessary traffic, I felt that we could certainly improve performance.
When I inspected the router, I found plenty of excess bandwidth. My theory was that perhaps it was not just the number of bytes sent on the network but also the number of packets. So, I determined to reduce the number of packets on the network through a strategy of reducing broadcast traffic on the network, securing the network to eliminate any unnecessary malicious traffic, and organizing and managing network resources.
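The packets-versus-bytes theory is easy to check with arithmetic. The Python sketch below computes how many packets per second it takes to carry the same bit rate at two packet sizes. The 1Mbps rate and the packet sizes are illustrative assumptions, and the calculation ignores framing overhead.

```python
def packets_per_second(bits_per_second, packet_bytes):
    """Packets needed each second to carry a given bit rate at a fixed packet size."""
    return bits_per_second / (packet_bytes * 8)

# The same 1Mbps of traffic as full-size packets and as tiny ones.
large = packets_per_second(1_000_000, 1500)
small = packets_per_second(1_000_000, 64)
print(f"1500-byte packets: {large:.1f}/s, 64-byte packets: {small:.1f}/s")
```

On a shared half-duplex channel, every packet has to win access to the medium, so a stream of small packets can tie up the air long before the byte counters at the router look busy.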
Reducing broadcast traffic. Perhaps the easiest task was to reduce broadcast traffic. Computers use broadcast traffic to locate other systems on the network and to advertise their own locations. A broadcast packet travels across the entire network, so at UtahWISP, one customer's system would send a packet to the tower, which would send it out to every other customer in the range, then send it back to the main office, which sent it back to the other tower and out to all the other customers in that range. Every broadcast packet unnecessarily hit every system on the network.
To fix this problem, I quickly restructured the network subnets and configured the routers to reduce the broadcast scope. Essentially, I broke the network down into smaller segments so that broadcast traffic didn't travel across the entire customer network. To further reduce the amount of broadcast traffic across the network, I enabled features on the switches to cache Address Resolution Protocol (ARP) entries and control ARP floods. Doing so didn't make a huge difference, but the performance improvement was nevertheless noticeable, with slightly faster download speeds on the wireless network.
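The subnetting idea can be sketched with Python's standard `ipaddress` module. The address block and prefix lengths below are hypothetical; the article doesn't give UtahWISP's actual addressing plan. The point is only that each smaller subnet has its own broadcast address, so a broadcast reaches the hosts in that segment instead of every customer.

```python
import ipaddress

def split_network(cidr, new_prefix):
    """Split one network into equal-size subnets with the given prefix length."""
    network = ipaddress.ip_network(cidr)
    return list(network.subnets(new_prefix=new_prefix))

# Hypothetical example: one flat /24 broken into eight /27 segments.
segments = split_network("10.10.0.0/24", 27)
for segment in segments:
    print(segment, "broadcast:", segment.broadcast_address,
          "addresses:", segment.num_addresses)
```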
Securing the network. Because I was working with a WISP, I didn't have control over customer systems to install hotfixes or personal firewalls, and I couldn't put too many restrictions on the type of traffic I blocked. Although I could recommend that UtahWISP send email messages to all its users, recommending that they secure their systems, I knew that many users would ignore the advice or would lack the skills or knowledge necessary to perform the work.
Fortunately, this WISP had a well-defined usage policy in its contract that forbade running Internet servers within the network and participating in P2P file-sharing networks. Therefore, UtahWISP could start blocking certain traffic such as incoming Web or email traffic. By doing so, the company could greatly reduce the exposure resulting from certain customers unknowingly providing access to spammers or intruders.
In the process of upgrading the network, the company's Intrusion Detection System (IDS) alerted me that some customer systems were rapidly scanning other systems on the network. However, the customer systems had no reason to be connecting to other customer systems. I recognized this activity as a common side effect of a worm infection. One of the most common ways for such malware to spread is through email, so I knew that if UtahWISP could prevent worms from entering the network, the chances of users clicking malicious attachments would drop sharply.
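The scanning pattern described here, one source touching many different destinations, can be sketched in a few lines of Python. This is a generic heuristic and not the detection logic of the IDS UtahWISP used; the function name `find_scanners`, the threshold, and the addresses are all hypothetical.

```python
from collections import defaultdict

def find_scanners(flows, threshold):
    """Return sources that contacted at least `threshold` distinct destinations.

    `flows` is an iterable of (source, destination) pairs taken from
    one observation window.
    """
    targets = defaultdict(set)
    for src, dst in flows:
        targets[src].add(dst)
    return sorted(src for src, dsts in targets.items() if len(dsts) >= threshold)

# Hypothetical window: one host sweeps a range of neighbors, another talks to one peer.
flows = [("10.10.0.5", f"10.10.0.{n}") for n in range(10, 60)]
flows += [("10.10.0.7", "10.10.0.1")] * 20
print(find_scanners(flows, threshold=25))
```

Counting distinct destinations rather than total connections keeps a busy but well-behaved host, such as one repeatedly contacting a single server, from being flagged.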
To combat worms and spam, I arranged for the implementation of two email products: Microsoft Exchange Server 2003 and NetIQ MailMarshal SMTP. Moving to Exchange 2003 was more than merely a security decision. The most important byproduct for UtahWISP was the tight Windows Server integration that let the company create one account for each customer, who would use the same account for Web services, disk storage, and email. Other important features were reliability, performance, and scalability. One feature that was particularly important to UtahWISP was the ability to provide Web-based email functionality through Outlook Web Access (OWA).
With Exchange in place, I worked on installing NetIQ's MailMarshal. MailMarshal is an email-content security product that can either integrate directly with Exchange or operate as a separate SMTP gateway. I chose to use the SMTP version because I wanted to keep the gateway separate from Exchange and because UtahWISP had no particular need for the integration features. Because MailMarshal is probably better suited to a corporate environment, in which management can more closely enforce policies, I had to configure it to be a little more conservative for a WISP by blocking only those email messages that clearly contained viruses and by blocking only the most obvious spam. Nonetheless, shortly after installation, I witnessed a nearly 30 percent reduction in email volume making it to the Exchange server. Every piece of spam that doesn't get past the gateway and every blocked virus reduces the number of packets traversing the wireless portion of the network.
So far, I had successfully limited many unnecessary packets on the network. Although the difference was significant, it still wasn't enough. Watching the NetCrunch network map, I still saw occasional ping timeouts across the network.
Managing network resources. The next step was to control the actual traffic traversing the network. Rather than install a firewall and completely block certain types of traffic, I chose instead to throttle this traffic. For that purpose, I chose Packeteer's PacketShaper 6500, a network traffic shaper that lets you set rules that give high priority to certain types of traffic and limit the bit rate of other types of traffic. For example, I could give Web browsing high priority but limit P2P traffic to dialup-comparable speeds. Furthermore, UtahWISP's customers had service plans that allowed for certain levels of bandwidth. Up to now, the company had enforced those limitations via the wireless APs, but with PacketShaper, UtahWISP could more accurately control bandwidth for each user and report on actual usage.
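Rate limiting of this kind is often explained with a token bucket, which the Python sketch below implements. This is a generic textbook technique chosen for illustration, not a description of how PacketShaper works internally. The class name, the 7,000-bytes-per-second rate (roughly a 56Kbps dial-up line), and the burst size are assumptions.

```python
class TokenBucket:
    """Allow traffic at a steady byte rate with a bounded burst."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0            # time of the previous check, in seconds

    def allow(self, packet_bytes, now):
        """Return True if the packet may be sent at time `now`, spending tokens."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False

# Hypothetical P2P class limited to about dial-up speed with a 3,000-byte burst.
p2p_limit = TokenBucket(rate_bytes_per_s=7000, burst_bytes=3000)
decisions = [p2p_limit.allow(1500, t) for t in (0.0, 0.0, 0.0, 0.1, 0.3)]
print(decisions)
```

Passing the clock in as an argument keeps the example deterministic; a real shaper would read a monotonic clock and would typically queue or delay packets rather than simply refuse them.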
I let the PacketShaper device collect statistics for a few hours, then looked at some of the charts. I knew UtahWISP experienced a lot of P2P traffic, but I had no idea how much until I actually saw it charted. In fact, the P2P bandwidth was so great that on the chart it dwarfed all other traffic combined. Furthermore, I was surprised to see just a few users taking up a large majority of the bandwidth. Some users had five or more P2P applications running at once. Having a bird's-eye view of the network was extremely helpful in determining where to go next. I set limitations on all P2P traffic and gave protocols such as HTTP and DNS high priority on the network. Finally, I put all customers in separate service classes based on the service plan they had purchased.
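The kind of per-user breakdown that exposes a few heavy users can be approximated from any set of usage records. The Python sketch below totals bytes per customer and reports each top user's share of all traffic. The function name `top_talkers`, the record format, and the figures are invented for illustration and don't represent PacketShaper's reports or UtahWISP's data.

```python
from collections import Counter

def top_talkers(records, n):
    """Return the top `n` users as (user, total_bytes, percent_of_all_traffic)."""
    totals = Counter()
    for user, nbytes in records:
        totals[user] += nbytes
    grand_total = sum(totals.values())
    return [
        (user, nbytes, 100.0 * nbytes / grand_total)
        for user, nbytes in totals.most_common(n)
    ]

# Hypothetical usage records: (customer id, bytes transferred).
records = [("cust-a", 700), ("cust-b", 100), ("cust-a", 100), ("cust-c", 100)]
for user, nbytes, share in top_talkers(records, 2):
    print(f"{user}: {nbytes} bytes ({share:.1f}% of traffic)")
```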
After setting all the rules, I enabled packet shaping and immediately saw an astonishing reduction in network traffic, as the PacketShaper chart in Figure 3 shows. With the P2P traffic out of the way, I could now identify other network problems that the P2P traffic previously dwarfed. For example, I noticed one customer with an unusually large number of outgoing email messages—indicative of a spammer or, more likely, a virus infection. UtahWISP's technical support personnel contacted the customer to help eliminate the problem.
The DMZ and Server Hardware
Although the initial problem was solved, I decided to go one step further by examining UtahWISP's demilitarized zone (DMZ). A DMZ is a special network segment isolated from both the Internet and the internal network. I was surprised to find that the company didn't have a DMZ and instead placed all critical servers and all office PCs on the same network that its customers used. If UtahWISP's servers were ever compromised, they could serve as excellent launching points for other attacks within the network. To build a DMZ, I implemented the Network Engines NS6300 ISA Server appliance.
The Network Engines appliance is a server with Windows 2003 and ISA Server 2004 pre-installed—everything was installed, hardened, and ready for basic configuration. I soon had a fully operational DMZ that was completely isolated from the rest of the network. I decided to create another isolated network for all office PCs. The appliance had six network ports on the front: one for the outside connection, one for appliance management, and four more for creating isolated networks.
With UtahWISP's DMZ servers now secure behind a firewall, I examined the servers themselves. The hardware was sufficient for the company's needs, but I knew that Web server performance would improve drastically if I upgraded the existing 256MB of RAM to 4GB, so I installed eight 512MB Micron Technologies modules to upgrade the critical servers. Although I didn't perform official before-and-after speed tests, I definitely noticed an overall speed improvement after the RAM upgrade.
The Server Rack
My goal was a complete network makeover, so I looked at UtahWISP's cluttered server rack. To manage its various servers, UtahWISP had an unkempt row of monitors, keyboards, and mouse devices next to the server rack. Throughout the rather frustrating upgrade process, I would frequently grab the wrong mouse or type on the wrong keyboard. Furthermore, standing at the server bench for extended periods was uncomfortable.
The solution to this problem was a Belkin OmniView 16-port KVM switch with 25' cables. Belkin uses a cable design that combines keyboard, mouse, and video for two servers on one cable. This design let me easily connect all the company's servers to the KVM switch and control everything from a single, comfortable desk location nearby.
From Belkin, I also acquired an OmniGuard 3200VA rackmount UPS to replace the four older UPS systems then in place. This one UPS provided as much power capacity as the old systems while using about 25 percent of the space. Colored network cables let me easily distinguish different network segments.
The Final Meeting
After weeks of working on the network, I sat down with UtahWISP's administrators for a final meeting to review the upgrade. The network administrator told me that network speeds had improved by as much as 50 percent, and customers who had been experiencing the most problems had noticed even greater improvements. Furthermore, with the new components in place, the network administrator was better able to track down specific problems. The network makeover was complete, and the customer was satisfied.
Mark Burnett (email@example.com) is an independent security consultant and author who specializes in Windows security. He is an IIS MVP and the author of Perfect Passwords and Hacking the Code (Syngress).