In this chapter, I’ll give you a checklist of performance and health items that may come in handy when you’re blindsided with a vague complaint about the network being slow. Even if the complaint comes from just one user, you’d be remiss not to tentatively investigate. This checklist will give you some clues as to whether you need to eliminate the network as a problem or investigate further. I emphasize that this checklist is not a troubleshooting guide. It’s only going to tell you if there’s a problem—not necessarily what that problem is or how to fix it.
In addition to providing a checklist, I’ll show you how to check each item. Some of this you’ve seen before, and some of it will be new. Keep in mind that you may have to work through this checklist on multiple devices in your network. It’s rare that a network problem will manifest itself in an obvious way on every router and switch. By the way, all the IOS commands I show you in this chapter, with one exception, will work on both routers and switches. I’ll point out that exception when we get to it.
Here’s the checklist:
What constitutes normal CPU usage varies based on the device, what role it plays in your network, and how busy your network is overall. The command show processes cpu history will display three graphs showing historical CPU usage for the last 60 seconds, 60 minutes, and 72 hours:
Switch1#show processes cpu history
                             11111
    555554444444444444445555577777444445555544444444444444455555
100
 90
 80
 70
 60
 50
 40
 30
 20
 10 *****               **********     *****               ***
   0....5....1....1....2....2....3....3....4....4....5....5....6
             0    5    0    5    0    5    0    5    0    5    0
               CPU% per second (last 60 seconds)
    1111111111111 111111111511111111111111111111 1 1111111111111
    775555355452474525525527355552355555155515556571555550511534
100
 90
 80
 70
 60
 50
 40
 30
 20 ****** ** *    * ** ** * ****  ***** *** *** *  ***** *  *
 10 ########################################*###*#*###########
   0....5....1....1....2....2....3....3....4....4....5....5....6
             0    5    0    5    0    5    0    5    0    5    0
               CPU% per minute (last 60 minutes)
              * = maximum CPU%   # = average CPU%
    564545555354545
    512522004969548
100
 90
 80
 70
 60 ** *      * * *
 50 ** * **** *** *
 40 ***************
 30 ***************
 20 ***************
 10 ###############
   0....5....1....1....2....2....3....3....4....4....5....5....6....6....7..
             0    5    0    5    0    5    0    5    0    5    0    5    0
                   CPU% per hour (last 72 hours)
              * = maximum CPU%   # = average CPU%
In the last two graphs, the hash or pound sign (#) indicates the average CPU usage, and the asterisk (*) indicates the maximum. On Switch1, the average has stayed at 10% or less for the past hour. Second-by-second fluctuations in CPU usage are normal, but what you want to look for is a high average CPU usage. If you see a sustained average CPU usage above 80%, it could indicate a problem.
Notice that on the 72-hour graph, the data stops prior to the 15-hour mark. This indicates the switch was turned off until about 15 hours ago.
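If you collect the average CPU readings yourself, the 80% rule of thumb is easy to check in a few lines of Python. This is only a rough sketch: the function name, the sample readings, and the choice of five consecutive readings as the meaning of "sustained" are my assumptions for illustration, not anything IOS defines.

```python
def sustained_high_cpu(averages, threshold=80, min_samples=5):
    """Return True if at least `min_samples` consecutive readings
    in `averages` are above `threshold` percent.

    `min_samples=5` is an assumed definition of "sustained".
    """
    run = 0
    for pct in averages:
        # Extend the current run of high readings, or reset it.
        run = run + 1 if pct > threshold else 0
        if run >= min_samples:
            return True
    return False


# Hypothetical per-minute averages, not taken from a real device.
quiet = [7, 7, 5, 5, 5, 5, 3, 5, 5, 4]
busy = [40, 85, 90, 88, 92, 95, 91, 60]

print(sustained_high_cpu(quiet))  # False
print(sustained_high_cpu(busy))   # True
```

A single spike resets nothing and triggers nothing, which matches the advice to ignore short fluctuations and watch the average.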
Generally, routers and switches should always be on. If one reboots unexpectedly, it’s almost always either human error or a device problem. You can figure out when the device last rebooted by using the show version command:
Switch1#show version
Cisco IOS Software, C3560 Software (C3560-IPSERVICESK9-M), Version 15.0(2)SE5, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2013 by Cisco Systems, Inc.
Compiled Fri 25-Oct-13 13:18 by prod_rel_team
ROM: Bootstrap program is C3560 boot loader
BOOTLDR: C3560 Boot Loader (C3560-HBOOT-M) Version 12.2(44)SE6, RELEASE SOFTWARE (fc1)
Switch1 uptime is 14 hours, 43 minutes
System returned to ROM by power-on
System image file is "flash:/c3560-ipservicesk9-mz.150-2.SE5.bin"
Switch1 rebooted almost 15 hours ago, which agrees perfectly with what the CPU graph shows. One thing that’s misleading, however, is the output System returned to ROM by power-on. This does not necessarily mean that someone pulled the power plug on the switch. IOS will display this message even if you do a soft reload using the reload command.
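If you save the show version output from several devices, a short script can pull out the uptime line so you can compare devices at a glance. The sketch below assumes the uptime is written as comma-separated units such as "14 hours, 43 minutes", possibly with years, weeks, and days in front. The function name and the unit table are mine, and a year is treated as 365 days.

```python
import re

# Minutes per unit; a year is assumed to be 365 days.
UNITS_IN_MINUTES = {
    "year": 525600,
    "week": 10080,
    "day": 1440,
    "hour": 60,
    "minute": 1,
}


def uptime_minutes(show_version_text):
    """Return the device uptime in minutes, or None if no uptime line is found."""
    match = re.search(r"uptime is (.+)", show_version_text)
    if match is None:
        return None
    total = 0
    for amount, unit in re.findall(r"(\d+) (year|week|day|hour|minute)s?", match.group(1)):
        total += int(amount) * UNITS_IN_MINUTES[unit]
    return total


sample = "Switch1 uptime is 14 hours, 43 minutes"
print(uptime_minutes(sample))  # 883
```

An uptime much smaller than you expect is your cue to find out who or what restarted the device.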
If only one user is complaining about the network, you may have a physical connectivity problem somewhere between the switch and the user. The command show interfaces counters errors is a reliable way of pinpointing such problems. The only downside is that it won’t give you any useful information on a router:
Switch1#show interfaces counters errors
Port      Align-Err  FCS-Err  Xmit-Err  Rcv-Err  UnderSize  OutDiscards
Fa0/1             0        0         0        0          0            0
Fa0/2             0        0         0        0          0            0
Fa0/3             0        0         0        0          0            0
Fa0/4             0        0         0        0          0            0
...
Port      Single-Col  Multi-Col  Late-Col  Excess-Col  Carri-Sen  Runts  Giants
Fa0/1              0          0         0           0          0      0       0
Fa0/2              0          0         0           0          0      0       0
Fa0/3              0          0         0           0          0      0       0
Fa0/4              0          0         0           0          0      0       0
If you run this command on one of your switches, you’ll see many more lines of text. I’ve truncated the output here. You don’t need to know what each of these items means. You just need to know that ideally they should all be 0. If any of them aren’t zero, it doesn’t necessarily indicate a problem, but any nonzero values should stay the same. If you see any of the numbers steadily increasing, you have a physical connectivity problem such as a damaged or defective cable, an improper punch-down to the patch panel or network jack, or even the wrong type of network cable.
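Spotting a steadily increasing counter means comparing two snapshots of the output taken some time apart. Here's a rough Python sketch of that comparison. It assumes each table starts with a header line beginning with Port and that every counter is a plain integer. The function names are mine, and the nonzero values in the sample are made up for illustration.

```python
def parse_error_counters(text):
    """Parse saved `show interfaces counters errors` output
    into {port: {counter_name: value}}."""
    counters = {}
    header = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Port":
            # A new table begins; remember its column names.
            header = fields[1:]
        elif (header and len(fields) == len(header) + 1
              and all(f.isdigit() for f in fields[1:])):
            values = [int(f) for f in fields[1:]]
            counters.setdefault(fields[0], {}).update(zip(header, values))
    return counters


def increasing_counters(before, after):
    """Return (port, counter, delta) for every counter that grew."""
    grown = []
    for port, values in after.items():
        for name, value in values.items():
            delta = value - before.get(port, {}).get(name, 0)
            if delta > 0:
                grown.append((port, name, delta))
    return grown


# Hypothetical snapshots taken a few minutes apart.
first = """Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Fa0/1 0 0 0 0 0 0
Fa0/2 0 3 0 3 0 0"""
second = """Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Fa0/1 0 0 0 0 0 0
Fa0/2 0 9 0 9 0 0"""

print(increasing_counters(parse_error_counters(first), parse_error_counters(second)))
# [('Fa0/2', 'FCS-Err', 6), ('Fa0/2', 'Rcv-Err', 6)]
```

A nonzero counter that doesn't change between snapshots produces no output, which is what you want: only the ports that are still accumulating errors show up.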
High ping times, especially over a WAN connection, can indicate a physical connectivity problem or a heavily loaded link. A ping response time nearing 100 ms doesn’t necessarily indicate a problem, but it does warrant further investigation. Inconsistent ping times can indicate a device, link, or routing protocol going up and down. Having a graph of ping times can be tremendously helpful in spotting patterns.
My favorite tool for graphing ping times on the fly is Colasoft Ping Tool (colasoft.com). Figure 21.1 illustrates what a series of normal, consistent pings looks like.
Visually, the graph isn’t very interesting. It’s mostly flat, with some spikes here and there. Most of the ping response times are 1 millisecond (ms), which is fantastic. Keep in mind that the figure illustrates ping times in my lab network, so the graph looks better than what you’d see in a real network.
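If you don't have a graphing tool handy, you can still summarize a batch of ping times with a short script. This is a rough sketch: the 100 ms figure comes from the rule of thumb above, but the 20 ms spread that I use to call a series "inconsistent" is an arbitrary assumption, as are the function name and the sample values. It also ignores lost pings.

```python
from statistics import mean, pstdev


def summarize_pings(times_ms, high_ms=100, jitter_ms=20):
    """Summarize ping response times given in milliseconds.

    `jitter_ms` is an assumed threshold on the standard deviation,
    not an established standard.
    """
    avg = mean(times_ms)
    spread = pstdev(times_ms)
    return {
        "average_ms": avg,
        "max_ms": max(times_ms),
        "high": avg >= high_ms,
        "inconsistent": spread >= jitter_ms,
    }


# Hypothetical samples: a quiet lab link and a troubled WAN link.
lab = [1, 1, 1, 2, 1, 1, 3, 1]
wan = [20, 250, 22, 300, 19, 280]

lab_summary = summarize_pings(lab)
wan_summary = summarize_pings(wan)
print(lab_summary["high"], lab_summary["inconsistent"])  # False False
print(wan_summary["high"], wan_summary["inconsistent"])  # True True
```

Neither flag proves there's a problem. They only tell you which links deserve a closer look.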
An IP route that repeatedly goes down and comes back up (flaps) without explanation always requires investigation. The easiest way to check for this is to run a show ip route:
Switch1#show ip route
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       ia - IS-IS inter area, * - candidate default, U - per-user static route
       o - ODR, P - periodic downloaded static route, H - NHRP, l - LISP
       + - replicated route, % - next hop override
Gateway of last resort is not set
      1.0.0.0/32 is subnetted, 1 subnets
C        1.1.1.1 is directly connected, Loopback1
      2.0.0.0/32 is subnetted, 1 subnets
O        2.2.2.2 [110/2] via 10.0.99.2, 00:10:17, FastEthernet0/24
      10.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
C        10.0.99.0/30 is directly connected, FastEthernet0/24
L        10.0.99.1/32 is directly connected, FastEthernet0/24
      172.31.0.0/24 is subnetted, 1 subnets
O        172.31.70.0 [110/2] via 10.0.99.2, 00:10:17, FastEthernet0/24
Notice that the two OSPF routes were installed in the routing table only 10 minutes ago. In a stable network, you should expect to see routes age in terms of days, not minutes. If you have routes that never seem to get very old, you can use a feature called route table profiling to track the number of IP routing table changes on a device. The global configuration command ip route profile enables the feature.
Once you issue this command, IOS will inspect the routing table every five seconds and record the number of changes. You can view the number of changes with a show ip route profile:
Switch1#show ip route profile
IP routing table change statistics:
Frequency of changes in a 5 second sampling interval
-----------------------------------------------------------
Change/   Fwd-path  Prefix    Nexthop   Pathcount  Prefix
interval  change    add       change    change     refresh
-----------------------------------------------------------
0         251       251       260       260        260
1         0         0         0         0          0
2         9         9         0         0          0
3         0         0         0         0          0
4         0         0         0         0          0
5         0         0         0         0          0
10        0         0         0         0          0
I must admit that I still have difficulty remembering how to interpret this table, so I’ll keep it simple. The number at the start of each row is how many changes to the IP routing table occurred during a single five-second sample, and the values in that row count the samples in which that many changes were seen. In a stable network there are no changes, so only the numbers in the 0 row should keep increasing, once every five seconds. The numbers in the other rows should not increase. If they do, routes are flapping and you’ll need to investigate.
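Because the rule boils down to "only the 0 row may grow," you can check it mechanically by saving the output twice and comparing the two copies. The Python sketch below assumes each data row is six whitespace-separated integers, as in the output above. The function names are mine, and the second snapshot is invented for illustration.

```python
def parse_route_profile(text):
    """Parse saved `show ip route profile` output into
    {changes_per_interval: [counter, ...]}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows are assumed to be exactly six integers.
        if len(fields) == 6 and all(f.isdigit() for f in fields):
            rows[int(fields[0])] = [int(f) for f in fields[1:]]
    return rows


def flapping_rows(before, after):
    """Return the nonzero change rows whose counters grew between snapshots."""
    flapping = []
    for changes, counts in after.items():
        old = before.get(changes, [0] * len(counts))
        if changes != 0 and any(n > o for n, o in zip(counts, old)):
            flapping.append(changes)
    return sorted(flapping)


# First snapshot uses the values shown above; the second is hypothetical.
first = """0 251 251 260 260 260
1 0 0 0 0 0
2 9 9 0 0 0"""
second = """0 300 300 312 312 312
1 0 0 0 0 0
2 12 12 0 0 0"""

print(flapping_rows(parse_route_profile(first), parse_route_profile(second)))  # [2]
```

An empty list means the routing table held steady between the two snapshots. Anything else tells you how many changes per sample the device is seeing.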
Refer to the commands in table 21.1 to complete the hands-on lab exercises.
Command | Configuration mode | Description |
---|---|---|
show processes cpu history | N/A | Displays historical CPU usage |
show version | N/A | Displays device uptime |
show interfaces counters errors | N/A | Displays interface errors |
show ip route | N/A | Displays the IP routing table |
ip route profile | Global | Enables IP routing table profiling |
show ip route profile | N/A | Displays the number of changes to the IP routing table |
As you read through this checklist, you may notice a theme: what’s normal depends on your network. As you get time, answer the following questions to create a baseline for each of your devices so you have a better idea of what normal looks like for your network. Be sure to do this when the network is humming along smoothly and nobody’s complaining.
Here are some things to record:
It can take some time to compile this information, so don’t feel like you have to do this in one sitting. You may never need it, but having it available can make performing a differential diagnosis that much easier.