Troubleshooting can account for as much as 90 percent of a network administrator's job. No one likes putting out fires, but you don't always have a choice. Good troubleshooting skills enable you to respond quickly in a crisis situation and keep your network running smoothly. When you face a troubleshooting challenge, start by asking yourself basic questions. What has changed? Has this problem occurred before, and if so, when? Is the problem reproducible? Did the user do anything differently? Are other users experiencing the same problem?

Next, try to isolate the problem, "cutting it in half" with each step you take to get closer to its source. For example, if a workstation can't connect to the network, try to determine whether you're facing a networkwide problem or a workstation-specific problem. If you can quickly determine that the problem applies to the workstation only, you've eliminated a large share of the possible causes and are closer to isolating the problem. Even if you can't find a solution, isolating the problem will save a tremendous amount of time when you seek outside help.
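The halving strategy can be sketched as a binary search over the chain of components between the user and the resource. The following is only a minimal illustration: it assumes that the components can be listed in order and that a fault at one point breaks everything beyond it. The component names and the check function are hypothetical stand-ins for real tests such as a ping or a cable scan.

```python
from typing import Callable, Optional, Sequence


def first_broken(path: Sequence[str],
                 works_up_to: Callable[[str], bool]) -> Optional[str]:
    """Return the first component in `path` that fails the check, or None.

    Assumes every component before the fault passes and every component
    from the fault onward fails, which is what makes halving valid.
    """
    lo, hi = 0, len(path)          # the fault, if any, lies in path[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(path[mid]):
            lo = mid + 1           # fault is farther along the path
        else:
            hi = mid               # fault is here or earlier
    return path[lo] if lo < len(path) else None


# Hypothetical path from a workstation to a server.
PATH = ["workstation NIC", "patch cable", "cable run", "hub", "switch", "server"]
FAULT = "cable run"                # pretend this component is damaged


def upstream_of_fault(component: str) -> bool:
    """Simulated test: passes only for components before the fault."""
    return PATH.index(component) < PATH.index(FAULT)


print(first_broken(PATH, upstream_of_fault))   # prints "cable run"
```

With six components, the search needs at most three checks instead of six, which is the practical payoff of halving.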

To give you an idea of how this process works, I've gathered several troubleshooting scenarios, ranging from common but simple problems to more difficult challenges. You might run into similar situations in which you can apply some of the basic questions that I use to isolate the problems in these examples. For more information about the tools I use in the following scenarios, see the sidebar "Basic Troubleshooting Tools," page 56.

Problem: No Domain Server Is Available to Validate Passwords
You've undoubtedly encountered this problem: You sit down at your workstation and try to log on to the network, but you receive the dreaded "No domain server was available to validate your password" error message.

To troubleshoot this problem, you must determine whether the problem relates to the workstation, the network, or the server. Start by asking the following questions:

  • What has changed? Have you made any changes to your network that might have resulted in a problem? Did you add a new server, remove an existing server, make switch or hub changes, add or remove a domain controller (DC), or promote or demote a DC?
  • Are other workstations experiencing the problem?
  • Is the server up?

You discover that the workstation has been working as it should until now. No other workstations are experiencing the problem and the server is up, so you can safely presume that this problem is workstation-specific. Next, you need to determine where within the machine the problem lies. Your next questions are as follows:

  • Can the workstation ping the server?
  • Can the workstation obtain an IP address?

You can ping the server, but the ping times out on occasion, which indicates that you're experiencing intermittent communication between the server and workstation. From a command line, you type

ipconfig /renew

When you run this command multiple times, the workstation sometimes renews its IP address lease and sometimes doesn't, which confirms that communication between the server and workstation is intermittent. To find out whether the workstation or its location is at fault, you swap the workstation with a known-good workstation. The replacement doesn't work in the original workstation's location, and the original workstation connects to the network without problems from another location. Clearly, something is wrong with the original location's cable run or hub.
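The reasoning in this scenario can be written down as two small decision functions. This is a sketch rather than a tool: the function names and result labels are hypothetical, and the inputs are simply the outcomes you observed by hand.

```python
from typing import Sequence


def classify_link(results: Sequence[bool]) -> str:
    """Classify a series of ping or lease-renewal attempts.

    `results` holds True for each attempt that succeeded.
    """
    if not results:
        raise ValueError("need at least one attempt")
    successes = sum(results)
    if successes == len(results):
        return "healthy"
    if successes == 0:
        return "down"
    return "intermittent"


def swap_test(other_pc_works_here: bool, this_pc_works_elsewhere: bool) -> str:
    """Interpret the workstation swap: which side of the wall plate is bad?"""
    if not other_pc_works_here and this_pc_works_elsewhere:
        return "location (cable run or hub)"
    if other_pc_works_here and not this_pc_works_elsewhere:
        return "workstation"
    return "inconclusive"


# Outcomes matching the scenario: some pings time out, and the swap
# shows that the fault stays with the location.
print(classify_link([True, False, True, True]))
print(swap_test(other_pc_works_here=False, this_pc_works_elsewhere=True))
```

A single successful ping proves little when the fault is intermittent, which is why the classification looks at a series of attempts.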

You try connecting the cable run to a different hub but still can't connect to the network. You now know that the cable run is the culprit. You've isolated the problem. Further investigation reveals that a cable tie in the server room has been cut and the drop run has a severed pin six.

Problem: Windows Services Don't Start
You notice that when you restart a Windows 2000 Server machine, services that aren't set to start with the Local System Account fail to start. You must manually open the service, reenter the password, and start the service. Each time you reenter the password, you receive the message "<username> granted the logon as service right."

To troubleshoot this problem, start by asking the following questions:

  • What has changed? Did anyone make any changes on this server?
  • Did the services start in the past?
  • Are the username and password valid?

You investigate and discover that the server, a DC, was until recently a member of the Domain Controllers organizational unit (OU). The services started properly until the server was moved out of that OU. The username and password you used to start the services are valid. Upon further research, you discover that members of the Domain Controllers OU have specific rights, among them the right to log on as a service. The server lost this right when it moved out of the OU; you need to restore the right to the server.

To grant the right to the server, take the following steps:

  1. Start the Microsoft Management Console (MMC) Active Directory Users and Computers snap-in, then open the Domain Controllers OU's Properties dialog box.
  2. On the Group Policy tab, click Default Domain Controllers Policy, then click Edit. This step starts Group Policy Manager.
  3. Expand the Computer Configuration object, expand Windows Settings, then expand Security Settings. Expand Local Policies, then click User Rights Assignment.
  4. In the right-hand pane, right-click Log on as a service, then click Security.
  5. Add the user account used to start up the service to the policy, then click OK.

For more information about this procedure, see the Microsoft article "How to Troubleshoot Service Startup Problems" (http://support.microsoft.com/?kbid=259733).
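To find out which services depend on the Log on as a service right, you can check the account each service starts under. The sketch below parses text in the "KEY : value" layout that the sc qc command prints for a service; it assumes that the sc utility is available on the server, which might not be the case on every Windows 2000 installation. The sample service name and account are invented for illustration.

```python
from typing import Optional

# Illustrative excerpt only; the service and account are hypothetical.
SAMPLE = """\
SERVICE_NAME: BackupAgent
        START_TYPE         : 2   AUTO_START
        DISPLAY_NAME       : Backup Agent
        SERVICE_START_NAME : EXAMPLE\\svc-backup
"""


def start_account(sc_qc_output: str) -> Optional[str]:
    """Return the account named on the SERVICE_START_NAME line, if any."""
    for line in sc_qc_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "SERVICE_START_NAME":
            return value.strip()
    return None


def needs_logon_as_service(account: Optional[str]) -> bool:
    """Services under LocalSystem don't rely on the user right; others do."""
    return account is not None and account.lower() != "localsystem"


account = start_account(SAMPLE)
print(account, needs_logon_as_service(account))
```

Any account that this check flags is a candidate for step 5 of the procedure above.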

Problem: Inbound External Email Stops Working
You use Microsoft Exchange 2000 Server for internal and Internet mail. Your ISP went out of business suddenly, and you made a quick switch to a new ISP. Users have Internet access but aren't receiving external messages. Outgoing email appears to be working fine.

To begin the troubleshooting process, ask yourself the following question:

  • Was email working before the ISP switch?

To make sure that the Exchange server is working and that your firewall is configured properly, you telnet into the Exchange server from the Internet on port 25. (For more information about this procedure, see the Microsoft article "XFOR: Telnet to Port 25 of IMC to Test IMC Communication" at http://support.microsoft.com/?kbid=153119.) You can send a test message, so you conclude that the server and the firewall are working. The problem probably relates to the ISP switch. Ask yourself the following question:
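The first part of the Telnet test, connecting to port 25 and reading the server's greeting, can also be scripted. The sketch below assumes only that a working SMTP server answers a new connection with a reply that begins with the code 220; the host name in the usage note is a placeholder, and the function names are hypothetical. It doesn't send a test message, so it checks reachability only.

```python
import socket


def is_smtp_greeting(banner: str) -> bool:
    """True if the server's first line is an SMTP 220 (service ready) reply."""
    return banner.startswith("220")


def smtp_port_open(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Connect to `host` the way the Telnet test does and check the greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(512).decode("ascii", errors="replace")
    except OSError:
        # Refused, filtered, or timed out: the server or firewall is suspect.
        return False
    return is_smtp_greeting(banner)
```

Calling smtp_port_open("mail.example.com") from a machine outside your firewall returns True only if the connection succeeds and the server greets you, which mirrors the conclusion drawn from the manual test.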

  • Has the domain been transferred correctly to the new ISP?

You use the Nslookup utility to attempt to look up the mail exchanger (MX) record for your domain, but you find no entry. From this evidence, you conclude that when you switched ISPs, the company that handles your domain (e.g., Network Solutions, Register.com) didn't transfer the domain properly. You can contact the company, share the IP address for your MX record, and ask the company to properly transfer your domain to your new ISP. After the MX record propagates throughout the Internet, incoming email will resume.
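The MX lookup can be scripted by running Nslookup with its -type=mx option and scanning the output for mail exchanger lines. This sketch assumes the "MX preference = N, mail exchanger = host" wording that the Windows version of Nslookup prints; the domain and server names in the sample are placeholders.

```python
import re
import subprocess
from typing import List, Tuple

MX_LINE = re.compile(r"MX preference = (\d+), mail exchanger = (\S+)")

# Illustrative output only; names and addresses are placeholders.
SAMPLE = """\
Server:  ns.example.net
Address:  192.0.2.53

example.com\tMX preference = 10, mail exchanger = mail.example.com
"""


def parse_mx(nslookup_output: str) -> List[Tuple[int, str]]:
    """Return (preference, host) pairs, lowest preference first."""
    pairs = [(int(pref), host) for pref, host in MX_LINE.findall(nslookup_output)]
    return sorted(pairs)


def lookup_mx(domain: str) -> List[Tuple[int, str]]:
    """Run nslookup for the domain's MX records and parse the result."""
    done = subprocess.run(["nslookup", "-type=mx", domain],
                          capture_output=True, text=True, timeout=30)
    return parse_mx(done.stdout)


print(parse_mx(SAMPLE))   # prints [(10, 'mail.example.com')]
```

An empty list from lookup_mx corresponds to the situation in this scenario: no MX record is published, so outside mail servers have nowhere to deliver your messages.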

Problem: Servers Disappear from the Network
You realize that your Win2K Professional workstation can see your Win2K servers only sporadically. The servers seem to disappear from the network. To start the troubleshooting process, ask yourself the following questions:

  • Has this problem occurred in the past?
  • Are all workstations experiencing the problem?

You investigate and realize that this problem started after you upgraded your servers from Windows NT 4.0 to Win2K. All the workstations on your network are experiencing similar problems. You now need to determine whether this problem relates to the servers or the network.

You log on to a workstation, open a command line, and type

ping <server>

(or pathping <server>), first with the server's IP address and then with its name. You can ping by IP address but not by server name, so you probably have a name-resolution (DNS) problem. Next, you type

ipconfig /all

and notice that the workstation's DNS setting points to your ISP's DNS server. Win2K uses DNS as its primary means of name resolution, but the workstation is asking the ISP's DNS servers, which know nothing about your internal machines, to resolve the names of the Win2K servers. These queries eventually time out, and the servers seem to disappear from the network.

To fix this problem, you must make an internal Win2K DNS server the primary DNS server so that internal workstations query it for local servers. After verifying that DNS is installed and running properly on the Win2K server, you change the server's own DNS client setting to point to the server itself. Next, using the DNS manager, you verify that the DNS server isn't configured as a root server (i.e., that it has no "." zone, which would prevent it from using forwarders) and enable forwarders. By enabling forwarders, you can resolve any address that isn't local to your network. You enter the ISP's DNS servers into the Forwarders field. Finally, you reconfigure DHCP on the server to change the DNS servers from the ISP's to the Win2K server and renew the IP addresses on the workstations. Your network is now stable. For more information about configuring DNS in this environment, see the Microsoft article "HOW TO: Configure DNS for Internet Access in Windows 2000" (http://support.microsoft.com/?kbid=300202).
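The check that exposed this problem, whether a workstation's DNS servers sit inside your own network, is easy to automate once you've collected the addresses from the Ipconfig output. The sketch below assumes a hypothetical internal network of 192.168.1.0/24 and made-up server addresses; substitute your own values.

```python
import ipaddress
from typing import Iterable, List


def external_dns_servers(dns_servers: Iterable[str], internal_net: str) -> List[str]:
    """Return the configured DNS servers that lie outside the internal network."""
    net = ipaddress.ip_network(internal_net)
    return [s for s in dns_servers if ipaddress.ip_address(s) not in net]


# Hypothetical values: one ISP resolver and one internal Win2K DNS server.
configured = ["203.0.113.53", "192.168.1.5"]
print(external_dns_servers(configured, "192.168.1.0/24"))   # prints ['203.0.113.53']
```

Any address that this function returns on an internal workstation is a candidate for the Forwarders list on the Win2K DNS server rather than for the workstation's own DNS settings.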

Problem: Multiple WAN Lines on a LAN
You recently installed a LAN with two WAN connections in Los Angeles. One line goes to your private frame relay network, and the other line goes to the Internet for fault tolerance and performance. (Figure 1 shows the network configuration.) The Los Angeles users are having intermittent problems when attempting to connect to your server in New York. Ask yourself the following questions:

  • When do these problems occur?
  • What's the default gateway?

The problems occur intermittently. The DHCP server in Los Angeles hands out a default gateway of 192.168.1.11 (i.e., the firewall). Because all the computers on the Los Angeles LAN are experiencing the same problem, you most likely have a routing problem that affects the entire Los Angeles network.

A static route on your firewall routes 192.168.2.0 mask 255.255.255.0 to your router at 192.168.1.10. You use the Route Print command to verify this routing. The Los Angeles server can sometimes ping the New York server, but not always. You run Tracert and receive the results that Figure 2 shows. The results represent the path that the packets should follow. However, sometimes when you issue the Tracert command, the packets time out after the first hop (192.168.1.11). This information leads you to suspect that the firewall isn't reliably forwarding packets to the Cisco Systems router for 192.168.2.0 traffic.

You review the firewall log and realize that packets are intermittently denied forwarding to 192.168.1.10, even though a rule is in place to forward these packets. Firewalls vary, but most firewall vendors will discourage you from using your firewall to perform router functions. If the firewall falls prey to an attack, an intruder can gain extensive information about your WAN connections.

You reconfigure the network to use a default gateway of 192.168.1.10 (the router). You issue the command

ip route 0.0.0.0 0.0.0.0 192.168.1.11

to establish a default route on the router to push all traffic to the firewall. When users want to access the Internet, they can go to the router and out through the firewall.
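To see why this arrangement sends New York traffic over the private WAN and everything else to the firewall, it helps to model how a router picks a route: the most specific matching prefix wins, and the default route (0.0.0.0/0) matches only when nothing else does. The sketch below is a simplified model, not router configuration. The label for the frame relay next hop is a placeholder because the article doesn't give that address.

```python
import ipaddress
from typing import Sequence, Tuple

# Simplified view of the Los Angeles router after the change.
ROUTER_ROUTES = [
    ("192.168.2.0/24", "frame relay"),   # New York LAN (next hop is a placeholder)
    ("0.0.0.0/0", "192.168.1.11"),       # default route to the firewall
]


def next_hop(destination: str, routes: Sequence[Tuple[str, str]]) -> str:
    """Pick the next hop by longest-prefix match."""
    dest = ipaddress.ip_address(destination)
    matches = [(ipaddress.ip_network(prefix), hop)
               for prefix, hop in routes
               if dest in ipaddress.ip_network(prefix)]
    if not matches:
        raise LookupError(f"no route to {destination}")
    # The longest prefix is the most specific route.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(next_hop("192.168.2.25", ROUTER_ROUTES))   # prints "frame relay"
print(next_hop("198.51.100.7", ROUTER_ROUTES))   # prints "192.168.1.11"
```

The second lookup uses a documentation address to stand in for any Internet host; it matches only the default route and therefore goes to the firewall.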

How would a failure of the Los Angeles router (192.168.1.10) affect Internet access? How would you fix the problem if the frame relay network went down but the Internet connection stayed up? If the Los Angeles router went down, you'd lose the Internet connection: the workstations' default gateway is the router, so with the router down, no packets would reach the firewall. You can restore Internet access in Los Angeles by changing the DHCP default gateway to the firewall. Of course, the private WAN and Internet access will remain down at all other locations until you fix the Los Angeles router.

Problem: Workstations Drop Off the Network
Workstations on the fifth floor of your corporate offices can no longer see your server or connect to the Internet. The problem is intermittent. Ask yourself the following questions:

  • How long has this problem been occurring?
  • Has anything changed?

Using the Pathping utility, you note some packet loss errors. This problem appears to be isolated to the fifth floor.

You use a tone generator or a cable scanner, and you trace the network connections back to an Ethernet switch on the sixth floor. The fifth and sixth floors share this switch. You suspect a possible bad switch port and swap ports with a computer on the sixth floor, but the problem remains isolated on the fifth floor, so the switch is probably OK. You conclude that you should examine the physical space for clues.

You return to the fifth floor and notice a small five-port hub in one of the cubicles. Looking more closely, you notice four other small hubs daisy-chained together. You've discovered the problem. With 100Base-T Ethernet, a collision domain can contain only one Class I repeater hop (0.7 microsecond latency) or two Class II repeater hops (0.46 microsecond latency each), and a chain of five hubs far exceeds either limit. (For this reason, I discourage the use of small hubs on production networks.) You remove all the small hubs and run the drops directly to the switch on the sixth floor. Problem solved.
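The repeater rule is simple enough to express as a lookup. The following sketch encodes only the two limits stated above; the function name is hypothetical, and because the class of the small hubs isn't known, the example checks the chain of five against both classes.

```python
# Maximum repeater hops allowed for 100Base-T, by repeater class.
MAX_REPEATERS = {"I": 1, "II": 2}


def chain_allowed(repeater_class: str, hubs_in_path: int) -> bool:
    """True if a chain of this many hubs stays within the limit for its class."""
    return hubs_in_path <= MAX_REPEATERS[repeater_class]


# Five daisy-chained hubs, as found on the fifth floor.
for repeater_class in ("I", "II"):
    print(repeater_class, chain_allowed(repeater_class, 5))
```

Both checks print False, so the chain breaks the rule regardless of which class the hubs belong to.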

Don't assume that you can memorize solutions to common problems. Rather, approach each problem with an open mind and ask yourself simple questions to try to cut the problem in half at each turn. Remember, problem isolation is the key.