19.6. Using traceroute, tcptraceroute, and mtr to Pinpoint Network Problems

You're having problems reaching a particular host or network, and ping confirms there is a problem, but there are several routers between you and the problem, so you need to narrow it down further. How do you do this?

Use traceroute, tcptraceroute, or mtr.

traceroute is an old standby that works well on your local network. Here is a two-hop traceroute on a small LAN with at least two subnets:

	$ traceroute mailserver1
	traceroute to mailserver1.alrac.net (192.168.2.76), 30 hops max, 40 byte packets
	 1 pyramid.alrac.net (192.168.1.45) 3.605 ms 6.902 ms 9.165 ms
	 2 mailserver1.alrac.net (192.168.2.76) 3.010 ms 0.070 ms 0.068 ms

This shows that the path passes through a single router, pyramid. If you run traceroute to a host on your own subnet, it should show only one hop, as no routing is involved:

	$ traceroute uberpc
	traceroute to uberpc.alrac.net (192.168.1.77), 30 hops max, 40 byte packets
	 1  uberpc (192.168.1.77) 5.722 ms 0.075 ms 0.068 ms

traceroute may not work over the Internet because a lot of routers are programmed to ignore its UDP datagrams. If you see a lot of timeouts, try the -I option, which sends ICMP ECHO requests instead.
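If you want to automate the "try UDP first, then fall back to -I" routine, here is a minimal sketch. The function names and the threshold of three silent hops are made up for illustration, and it assumes a traceroute that prints `* * *` for a hop where all three probes time out. On many systems the -I option requires root privileges.

```shell
#!/bin/sh
# Count hops where all three probes timed out ("* * *").
# Reads traceroute output on stdin and prints a number.
count_silent_hops() {
    awk '$2 == "*" && $3 == "*" && $4 == "*" { n++ } END { print n + 0 }'
}

# Hypothetical helper: run a default (UDP) trace first, and retry with
# ICMP ECHO probes (-I) if more than max_silent hops never answered.
# The default threshold of 3 is an arbitrary assumption; adjust to taste.
trace_with_fallback() {
    host=$1
    max_silent=${2:-3}
    out=$(traceroute "$host" 2>&1)
    silent=$(printf '%s\n' "$out" | count_silent_hops)
    if [ "$silent" -gt "$max_silent" ]; then
        echo "UDP trace had $silent silent hops; retrying with -I" >&2
        traceroute -I "$host"    # often needs root
    else
        printf '%s\n' "$out"
    fi
}

# Example (not run here):
#   trace_with_fallback bratgrrl.com
```

A hop that stays silent with both probe types is not necessarily broken; it may simply be configured not to answer.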

You could also try tcptraceroute, which sends TCP packets and is therefore nearly nonignorable:

	$ tcptraceroute bratgrrl.com
	Selected device eth0, address 192.168.1.10, port 49422 for outgoing packets
	Tracing the path to bratgrrl.com (67.43.0.135) on TCP port 80 (www), 30 hops max
	 1  192.168.1.50  6.498 ms  0.345 ms  0.334 ms
	 2  gateway.foo.net (12.169.163.1)  23.381 ms 22.002 ms 23.047 ms
	 3  router.foo.net (12.169.174.1)  23.285 ms 23.434 ms 22.804 ms
	 4  12.100.100.201  54.091 ms  48.301 ms *
	 5  12.101.6.101  101.154 ms  100.027 ms  110.753 ms
	 6  tbr2.cgcil.ip.att.net (12.122.10.61)  104.155 ms 101.934 ms 101.387 ms
	 7  tbr2.dtrmi.ip.att.net (12.122.10.133)  108.611 ms 105.148 ms 108.538 ms
	 8  gar3.dtrmi.ip.att.net (12.123.139.141)  108.815 ms 116.832 ms 97.934 ms
	 9 * * *
	10 lw-core1-ge2.rtr.liquidweb.com (209.59.157.30)   116.363 ms  115.567 ms  149.428 ms
	11 lw-dc1-dist1-ge1.rtr.liquidweb.com (209.59.157.2)   129.055 ms  137.067 ms *
	12 host6.miwebdns6.com (67.43.0.135) [open]  130.926 ms 122.942 ms 125.739 ms

An excellent utility that combines ping and traceroute is mtr (My Traceroute). Use it to capture latency, packet loss, and problem-router statistics in one place. Here is an example that sends 100 rounds of probes (-c100), organizes the data in a report format (-r), and appends it to a text file:

	$ mtr -r -c100 oreilly.com >> mtr.txt

The file looks like this:

	HOST: xena                         Loss%   Snt   Last   Avg   Best   Wrst   StDev
	  1. pyramid.alrac.net              0.0%   100    0.4   0.5    0.3    6.8     0.7
	  2. gateway.foo.net                0.0%   100   23.5  23.1   21.6   29.8     1.0
	  3. router.foo.net                 0.0%   100   23.4  24.4   21.9   78.9     5.9
	  4. 12.222.222.201                 1.0%   100   52.8  57.9   44.5  127.3    10.3
	  5. 12.222.222.50                  4.0%   100   61.9  62.4   50.1  102.9     9.8
	  6. gbr1.st6wa.ip.att.net          1.0%   100   61.4  76.2   46.2  307.8    48.8
	  7. br1-a350s5.attga.ip.att.net    3.0%   100   57.2  60.0   44.4  107.1    11.6
	  8. so0-3-0-2488M.scr1.SFO1.gblx   1.0%   100   73.9  83.4   64.0  265.9    27.6
	  9. sonic-gw.customer.gblx.net     2.0%   100   72.6  79.9   69.3  119.5     7.5
	 10. 0.ge-0-1-0.gw.sr.sonic.net     2.0%   100   71.5  78.2   67.6  142.2     9.3
	 11. gig50.dist1-1.sr.sonic.net     0.0%   100   81.1  84.3   73.1  169.1    12.1
	 12. ora-demarc.customer.sonic.ne   5.0%   100   69.1  82.9   69.1  144.6    10.2
	 13. www.oreillynet.com             4.0%   100   75.4  81.0   69.8  119.1     7.0

This shows a reasonably clean run with low packet loss and low latency. When you're having problems, create a cron job to run mtr at regular intervals by using a command like this (using your own domain and filenames, of course):

	$ mtr -r -c100 oreillynet.com >> mtr.txt && date >> mtr.txt

This stores the results of every mtr run in a single file, with the date and time at the end of each entry.
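For a cron job, it can be handier to wrap the logging in a small script. The sketch below is one possible variant, not the only way to do it: it writes the timestamp before each run rather than after, and it logs even when the command fails. The function name, script path, log path, and 15-minute schedule are all hypothetical. Cron often runs with a short PATH, so you may need to give the full path to mtr.

```shell
#!/bin/sh
# append_run LOGFILE COMMAND [ARGS...]
# Appends a timestamp line, the command's output (stdout and stderr),
# and a blank separator line to LOGFILE.
append_run() {
    logfile=$1
    shift
    {
        date '+=== %Y-%m-%d %H:%M:%S ==='
        "$@"
        echo
    } >> "$logfile" 2>&1
}

# Example invocation (hypothetical log path):
#   append_run "$HOME/mtr.txt" mtr -r -c100 oreillynet.com
#
# Example crontab entry, assuming the function and the invocation above
# are saved as /usr/local/bin/mtr-log.sh:
#   */15 * * * * /usr/local/bin/mtr-log.sh
```

Remove the cron entry once you have collected enough data to make your case.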

You can watch mtr in real time like this:

	$ mtr -c100 oreillynet.com

You can skip DNS lookups with the -n switch.

If any of these tools consistently gets hung up at the same router, or if mtr consistently shows more than 5 percent packet loss and long transit times on the same router, then it's safe to say that particular router has a problem. If it's a router that you control, then for gosh sakes fix it. If it isn't, use dig or whois to find out who it belongs to, and nicely report the trouble to them.

Save your records so they can see the numbers with their own eyes.
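Once you have a file full of mtr reports, you can pull out the hops that exceed the 5 percent loss figure instead of eyeballing every run. This is a rough sketch with a made-up function name. It assumes the report layout shown earlier, where the first field is the hop number, the second the hostname, and the third the Loss% column. Check that against the output of your own mtr version before trusting it.

```shell
#!/bin/sh
# flag_lossy_hops [THRESHOLD]
# Reads one or more mtr reports on stdin and prints every hop whose
# Loss% is greater than THRESHOLD (default 5).
flag_lossy_hops() {
    awk -v max="${1:-5}" '
        $1 ~ /^[0-9]+\./ {            # hop lines start with "N."
            hop = $1
            sub(/\..*/, "", hop)      # keep only the hop number
            loss = $3 + 0             # "7.0%" converts to 7
            if (loss > max)
                printf "hop %s %s loss %.1f%%\n", hop, $2, loss
        }'
}

# Example (not run here):
#   flag_lossy_hops 5 < mtr.txt
```

A router that shows up in this list run after run is the one worth reporting; a single appearance may be a momentary blip.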

There are a lot of web sites that let you run various network tools, such as ping and traceroute, from their sites. This is a good way to get some additional information for comparison.

mtr can generate a lot of network traffic, so don't run it all the time.

tcptraceroute sends TCP SYN packets instead of UDP or ICMP ECHO packets. These are more likely to get through firewalls, and routers are far less likely to ignore them. When the destination host answers with a SYN/ACK, a TCP RST is sent back to tear down the connection, so the TCP three-way handshake is never completed. This is the same technique as the half-open (-sS) scan used by Nmap.
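tcptraceroute probes port 80 by default, as the sample output above shows, but it also accepts a destination port after the hostname. Tracing to the port of the service you actually care about can tell you whether a firewall along the way treats that traffic differently. Here is a small sketch that traces to several ports in turn. The function name and the example host are hypothetical, and tcptraceroute usually needs root privileges.

```shell
#!/bin/sh
# trace_tcp_ports HOST PORT...
# Runs tcptraceroute against HOST once per PORT, with a header line
# before each trace so the results are easy to tell apart.
trace_tcp_ports() {
    host=$1
    shift
    for port in "$@"; do
        echo "--- $host port $port ---"
        tcptraceroute "$host" "$port"
    done
}

# Example (not run here): compare the paths for SMTP and HTTPS.
#   trace_tcp_ports mailserver1.alrac.net 25 443
```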