Chapter 21. Performance and health checklist

In this chapter, I’ll give you a checklist of performance and health items that may come in handy when you’re blindsided by a vague complaint about the network being slow. Even if the complaint comes from just one user, you’d be remiss not to at least take a quick look. This checklist will give you some clues as to whether you can rule out the network as the problem or need to investigate further. I emphasize that this checklist is not a troubleshooting guide. It’s only going to tell you if there’s a problem, not necessarily what that problem is or how to fix it.

In addition to providing a checklist, I’ll show you how to check each item. Some of this you’ve seen before, and some of it will be new. Keep in mind that you may have to work through this checklist on multiple devices in your network. It’s rare that a network problem will manifest itself in an obvious way on every router and switch. By the way, all the IOS commands I show you in this chapter, with one exception, will work on both routers and switches. I’ll point out that exception when we get to it.

Here’s the checklist:

Is the CPU being overloaded?
What’s the system uptime?
Is there a damaged network cable or jack?
Are ping times unusually high or inconsistent?
Are routes flapping?

21.1. Is the CPU being overloaded?

What constitutes normal CPU usage varies based on the device, what role it plays in your network, and how busy your network is overall. The command show processes cpu history will display three graphs showing historical CPU usage for the last 60 seconds, 60 minutes, and 72 hours:

Switch1#show processes cpu history



                               11111
      555554444444444444445555577777444445555544444444444444455555
  100
   90
   80
   70
   60
   50
   40
   30
   20
   10 *****               **********     *****               ***
     0....5....1....1....2....2....3....3....4....4....5....5....6
               0    5    0    5    0    5    0    5    0    5    0
               CPU% per second (last 60 seconds)




      1111111111111 111111111511111111111111111111 1 1111111111111
      775555355452474525525527355552355555155515556571555550511534
  100
   90
   80
   70
   60
   50
   40
   30
   20 ****** ** *    * ** ** * ****  ***** *** *** *  ***** *  *
   10 ########################################*###*#*###########
     0....5....1....1....2....2....3....3....4....4....5....5....6
               0    5    0    5    0    5    0    5    0    5    0
               CPU% per minute (last 60 minutes)
              * = maximum CPU%   # = average CPU%

      564545555354545
      512522004969548
  100
   90
   80
   70
   60 ** *      * * *
   50 ** * **** *** *
   40 ***************
   30 ***************
   20 ***************
   10 ###############
     0....5....1....1....2....2....3....3....4....4....5....5....6....6....7..
               0    5    0    5    0    5    0    5    0    5    0    5    0
                   CPU% per hour (last 72 hours)
                  * = maximum CPU%   # = average CPU%

In the last two graphs, the hash or pound sign (#) indicates the average CPU usage and the asterisk (*) indicates the maximum. On Switch1, the average has been consistently 10% or less for the past hour. Second-by-second fluctuations in CPU usage are normal, but what you want to look for is a high average CPU usage. If you see a sustained average CPU usage above 80%, it could indicate a problem.

Notice that on the 72-hour graph, the data stops prior to the 15-hour mark. This indicates the switch last booted about 15 hours ago.
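
If you collect the per-minute averages yourself, the "sustained" part of that rule can be sketched in a few lines of Python. This is only an illustration: the function names, the five-sample minimum run, and the sample data are my assumptions, and the 80% threshold is the rule of thumb from the text, not a value IOS reports.

```python
def longest_run_above(samples, threshold=80):
    """Return the length of the longest consecutive run of samples above threshold."""
    longest = current = 0
    for value in samples:
        if value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_sustained_high(samples, threshold=80, min_run=5):
    """True if at least min_run consecutive samples exceed threshold."""
    return longest_run_above(samples, threshold) >= min_run


# Hypothetical per-minute averages: brief spikes versus a sustained plateau.
spiky = [10, 12, 95, 11, 9, 10, 90, 10]
busy = [10, 85, 88, 91, 87, 90, 86, 12]
print(is_sustained_high(spiky), is_sustained_high(busy))
```

Counting consecutive samples rather than averaging the whole window keeps a single short spike from looking like a problem.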

21.2. What’s the system uptime?

Generally, routers and switches should always be on. If one reboots unexpectedly, it’s almost always either human error or a device problem. You can figure out when the device last rebooted by using the show version command:

Switch1#show version
Cisco IOS Software, C3560 Software (C3560-IPSERVICESK9-M), Version 15.0(2)SE5, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2013 by Cisco Systems, Inc.
Compiled Fri 25-Oct-13 13:18 by prod_rel_team

ROM: Bootstrap program is C3560 boot loader
BOOTLDR: C3560 Boot Loader (C3560-HBOOT-M) Version 12.2(44)SE6, RELEASE SOFTWARE (fc1)

Switch1 uptime is 14 hours, 43 minutes
System returned to ROM by power-on
System image file is "flash:/c3560-ipservicesk9-mz.150-2.SE5.bin"

Switch1 rebooted almost 15 hours ago, which agrees perfectly with what the CPU graph shows. One thing that’s misleading, however, is the output System returned to ROM by power-on. This does not necessarily mean that someone pulled the power plug on the switch. IOS will display this message even if you do a soft reload using the reload command.
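
If you save show version output from several devices, a short Python sketch can pull out the uptime so you can compare them. The function name is hypothetical, and I'm assuming the uptime line always has the form "<hostname> uptime is ..." with units of years, weeks, days, hours, and minutes. A year is approximated as 365 days.

```python
import re
from datetime import timedelta

# Approximate unit lengths in days; a year is treated as 365 days here.
_UNIT_DAYS = {"year": 365, "week": 7, "day": 1}
_UNIT_PATTERN = re.compile(r"(\d+)\s+(year|week|day|hour|minute)s?")


def parse_uptime(show_version_text):
    """Return the uptime as a timedelta, or None if no uptime line is found."""
    match = re.search(r"^\S+ uptime is (.+)$", show_version_text, re.MULTILINE)
    if match is None:
        return None
    total = timedelta()
    for amount, unit in _UNIT_PATTERN.findall(match.group(1)):
        n = int(amount)
        if unit == "hour":
            total += timedelta(hours=n)
        elif unit == "minute":
            total += timedelta(minutes=n)
        else:
            total += timedelta(days=n * _UNIT_DAYS[unit])
    return total


sample = "Switch1 uptime is 14 hours, 43 minutes\nSystem returned to ROM by power-on"
print(parse_uptime(sample))
```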

21.3. Is there a damaged network cable or jack?

If only one user is complaining about the network, you may have a physical connectivity problem somewhere between the switch and the user. The command show interfaces counters errors is a reliable way of pinpointing such problems. This is the one command in this chapter that works only on switches; it won’t give you any useful information on a router:

Switch1#show interfaces counters errors

Port    Align-Err   FCS-Err   Xmit-Err    Rcv-Err  UnderSize  OutDiscards
Fa0/1           0         0          0          0          0            0
Fa0/2           0         0          0          0          0            0
Fa0/3           0         0          0          0          0            0
Fa0/4           0         0          0          0          0            0

...

Port  Single-Col  Multi-Col  Late-Col  Excess-Col  Carri-Sen  Runts  Giants
Fa0/1          0          0         0           0          0      0       0
Fa0/2          0          0         0           0          0      0       0
Fa0/3          0          0         0           0          0      0       0
Fa0/4          0          0         0           0          0      0       0

If you run this command on one of your switches, you’ll see many more lines of text. I’ve truncated the output here. You don’t need to know what each of these items means. You just need to know that ideally they should all be 0. If any of them aren’t zero, it doesn’t necessarily indicate a problem, but any nonzero values should stay the same. If you see any of the numbers steadily increasing, you have a physical connectivity problem such as a damaged or defective cable, an improper punch-down to the patch panel or network jack, or even the wrong type of network cable.
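
Because what matters is whether a counter is growing, it helps to capture the output twice and compare. The following Python sketch shows one way to do that, assuming the output is laid out as in the listing above: a header line starting with Port, followed by one row per interface. The function names and the sample snapshots are made up for illustration.

```python
def parse_error_counters(text):
    """Parse 'show interfaces counters errors' output into {port: {counter: value}}."""
    counters = {}
    headers = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Port":
            headers = fields[1:]  # a new block of counter columns starts here
        elif headers and len(fields) == len(headers) + 1:
            try:
                values = [int(f) for f in fields[1:]]
            except ValueError:
                continue  # not a counter row
            counters.setdefault(fields[0], {}).update(zip(headers, values))
    return counters


def increasing_counters(before, after):
    """Return {port: {counter: delta}} for counters that grew between snapshots."""
    growth = {}
    for port, after_values in after.items():
        for name, value in after_values.items():
            delta = value - before.get(port, {}).get(name, 0)
            if delta > 0:
                growth.setdefault(port, {})[name] = delta
    return growth


# Made-up snapshots for illustration; Fa0/2 accumulates FCS and receive errors.
BEFORE = """\
Port    Align-Err   FCS-Err   Xmit-Err    Rcv-Err  UnderSize  OutDiscards
Fa0/1           0         0          0          0          0            0
Fa0/2           0         3          0          3          0            0
"""
AFTER = """\
Port    Align-Err   FCS-Err   Xmit-Err    Rcv-Err  UnderSize  OutDiscards
Fa0/1           0         0          0          0          0            0
Fa0/2           0        17          0         17          0            0
"""
print(increasing_counters(parse_error_counters(BEFORE), parse_error_counters(AFTER)))
```

A port that shows up in the result is a good candidate for a cable, jack, or punch-down inspection.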

21.4. Are ping times unusually high or inconsistent?

High ping times, especially over a WAN connection, can indicate a physical connectivity problem or a heavily loaded link. A ping response time nearing 100 ms doesn’t necessarily indicate a problem, but it does warrant further investigation. Inconsistent ping times can indicate a device, link, or routing protocol going up and down. Having a graph of ping times can be tremendously helpful in spotting patterns.

My favorite tool for graphing ping times on the fly is Colasoft Ping Tool (colasoft.com). Figure 21.1 illustrates what a series of normal, consistent pings looks like.

Figure 21.1. Colasoft Ping Tool showing a series of pings

Visually, the graph isn’t very interesting. It’s mostly flat, with some spikes here and there. Most of the ping response times are 1 millisecond (ms), which is fantastic. Keep in mind that the figure illustrates ping times in my lab network, so the graph looks better than what you’d see in a real network.
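
If you'd rather work with numbers than a graph, a small Python sketch can summarize a series of ping times you've already collected. The function name and the dictionary keys are my own. The 100 ms figure is the rule of thumb from earlier in this section, and using the standard deviation as a measure of inconsistency is an assumption on my part.

```python
from statistics import mean, pstdev


def summarize_pings(rtts_ms, high_ms=100.0):
    """Summarize round-trip times in ms; a None entry stands for a lost ping."""
    replies = [r for r in rtts_ms if r is not None]
    summary = {"sent": len(rtts_ms), "lost": len(rtts_ms) - len(replies)}
    if replies:
        average = mean(replies)
        summary.update(
            avg_ms=average,
            max_ms=max(replies),
            spread_ms=pstdev(replies),  # larger spread means less consistent pings
            high=average >= high_ms,    # worth a closer look, not proof of a problem
        )
    return summary


# Hypothetical samples: a quiet LAN versus a WAN link with one lost ping.
print(summarize_pings([1, 1, 1, 2, 1]))
print(summarize_pings([100, 120, None, 140]))
```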

21.5. Are routes flapping?

An IP route that goes down and comes back up repeatedly without explanation, known as flapping, always requires investigation. The easiest way to check for this is to run show ip route:

Switch1#show ip route

Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       ia - IS-IS inter area, * - candidate default, U - per-user static route
       o - ODR, P - periodic downloaded static route, H - NHRP, l - LISP
       + - replicated route, % - next hop override

Gateway of last resort is not set

      1.0.0.0/32 is subnetted, 1 subnets
C        1.1.1.1 is directly connected, Loopback1
      2.0.0.0/32 is subnetted, 1 subnets
O        2.2.2.2 [110/2] via 10.0.99.2, 00:10:17, FastEthernet0/24
      10.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
C        10.0.99.0/30 is directly connected, FastEthernet0/24
L        10.0.99.1/32 is directly connected, FastEthernet0/24
      172.31.0.0/24 is subnetted, 1 subnets
O        172.31.70.0 [110/2] via 10.0.99.2, 00:10:17, FastEthernet0/24

Notice that the two OSPF routes were installed in the routing table only 10 minutes ago. In a stable network, you should expect to see routes age in terms of days, not minutes. If you have routes that never seem to get very old, you can use a feature called route table profiling to track the number of IP routing table changes on a device. The global configuration command ip route profile enables the feature.

Once you issue this command, IOS will inspect the routing table every five seconds and record the number of changes. You can view the number of changes with a show ip route profile:

Switch1#show ip route profile
IP routing table change statistics:
Frequency of changes in a 5 second sampling interval
-----------------------------------------------------------
Change/   Fwd-path  Prefix   Nexthop  Pathcount  Prefix
interval  change    add      change   change     refresh
-----------------------------------------------------------
0         251       251      260      260        260
1         0         0        0        0          0
2         9         9        0        0          0
3         0         0        0        0          0
4         0         0        0        0          0
5         0         0        0        0          0
10        0         0        0        0          0

I must admit that I still have difficulty remembering how to interpret this table, so I’ll keep it simple. Each row counts the five-second sampling intervals in which that many changes occurred, so the 0 row counts the intervals in which the IP routing table didn’t change at all. In a stable network, you should have no changes, so the numbers in the 0 row should increase every five seconds and the numbers in the other rows should not increase. If they do, routes are flapping and you’ll need to investigate.
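
To apply that rule without squinting at the table, you can capture the output twice and compare the nonzero rows. Here's a rough Python sketch of the comparison, assuming each data row has six numeric columns as in the listing above. The function names and the second snapshot are made up for illustration.

```python
def parse_route_profile(text):
    """Parse 'show ip route profile' rows into {changes_per_interval: [counts]}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and all(f.isdigit() for f in fields):
            rows[int(fields[0])] = [int(f) for f in fields[1:]]
    return rows


def flapping_buckets(before, after):
    """Return the nonzero rows whose counts grew between two snapshots."""
    grown = []
    for bucket, counts in sorted(after.items()):
        if bucket == 0:
            continue  # the 0 row is expected to grow in a stable network
        previous = before.get(bucket, [0] * len(counts))
        if any(now > then for now, then in zip(counts, previous)):
            grown.append(bucket)
    return grown


FIRST = """\
0         251       251      260      260        260
1         0         0        0        0          0
2         9         9        0        0          0
"""
# Hypothetical later snapshot: the 2 row has grown, so something changed.
LATER = """\
0         255       255      266      266        266
1         0         0        0        0          0
2         11        11       0        0          0
"""
print(flapping_buckets(parse_route_profile(FIRST), parse_route_profile(LATER)))
```

An empty list means only the 0 row moved between the two snapshots, which is what you want to see.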

21.6. Commands in this chapter

Refer to the commands in table 21.1 to complete the hands-on lab exercises.

Table 21.1. Commands used in this chapter

Command	Configuration mode	Description
show processes cpu history	N/A	Displays historical CPU usage
show version	N/A	Displays device uptime
show interfaces counters errors	N/A	Displays interface errors
show ip route	N/A	Displays the IP routing table
ip route profile	Global	Enables IP routing table profiling
show ip route profile	N/A	Displays the number of changes to the IP routing table

21.7. Hands-on lab

As you read through this checklist, you may notice a theme: what’s normal depends on your network. As you get time, answer the following questions to create a baseline for each of your devices so you have a better idea of what normal looks like for your network. Be sure to do this when the network is humming along smoothly and nobody’s complaining.

Here are some things to record:

What’s the average CPU usage for the past 24 hours?
What are the average ping times to network resources (servers) from different offices?
Are there any ports that always seem to have errors?
What’s the typical age of routes?
Are there any routers or switches that have been up for a surprisingly long or short time?

It can take some time to compile this information, so don’t feel like you have to do this in one sitting. You may never need it, but having it available can make performing a differential diagnosis that much easier.
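
For the route-age question, a small Python sketch can pick out learned routes that are younger than you'd expect. I'm assuming the age appears after the next-hop address, as in the show ip route listing earlier in the chapter, and that it's written either as hh:mm:ss or with w, d, and h suffixes (for example, 1d02h or 2w3d). The function names and the one-day default are my own choices.

```python
import re
from datetime import timedelta

_CLOCK = re.compile(r"(\d+):(\d{2}):(\d{2})")
_UNITS = re.compile(r"(?:(\d+)w)?(?:(\d+)d)?(?:(\d+)h)?")
_ROUTE_AGE = re.compile(r"via \S+, ([0-9a-z:]+),")


def parse_route_age(age):
    """Convert an age such as '00:10:17', '1d02h', or '2w3d' to a timedelta."""
    clock = _CLOCK.fullmatch(age)
    if clock:
        hours, minutes, seconds = (int(g) for g in clock.groups())
        return timedelta(hours=hours, minutes=minutes, seconds=seconds)
    units = _UNITS.fullmatch(age)
    if not age or units is None:
        raise ValueError(f"unrecognized route age: {age!r}")
    weeks, days, hours = (int(g or 0) for g in units.groups())
    return timedelta(weeks=weeks, days=days, hours=hours)


def young_routes(show_ip_route_text, min_age=timedelta(days=1)):
    """Return (line, age) pairs for learned routes younger than min_age."""
    young = []
    for line in show_ip_route_text.splitlines():
        found = _ROUTE_AGE.search(line)
        if found:
            age = parse_route_age(found.group(1))
            if age < min_age:
                young.append((line.strip(), age))
    return young


sample = """\
C        1.1.1.1 is directly connected, Loopback1
O        2.2.2.2 [110/2] via 10.0.99.2, 00:10:17, FastEthernet0/24
O        172.31.70.0 [110/2] via 10.0.99.2, 00:10:17, FastEthernet0/24
"""
for line, age in young_routes(sample):
    print(age, line)
```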