Problem Diagnosis With Ping

The two most commonly used tools for diagnosing a network problem, and almost always the first ones reached for, are traceroute and ping. The results they return, however, are often misunderstood or interpreted in a way that leads to an incorrect conclusion.

Let's take the ping utility specifically. The common mistake is to assume that whatever result ping returns is due to the target of the ping. For example, if there is no ping response, the conclusion is that the site is down; if there is packet loss or long return times, the conclusion is that the target address has some problem. While either could be the case, far more often than not these are the wrong conclusions to draw.

The common causes of this misinterpretation are:

1. Ping sends a packet to the destination address that will typically traverse several other network points to get there. A problem at any one of those points can cause the ping query to go unanswered.
2. In many cases web sites and other servers sit behind firewalls, and many, if not most, firewalls block ping packets. So while web traffic may reach the site, ping packets may not.
3. The ping packet has a source (the system initiating the ping) as well as a destination. It may be that the source does not have a correct route to the destination, or that the destination does not have a correct return route to the source. This could be because of specific firewall rules, an error in the route tables somewhere along the data path, or a specific routing policy deliberately put in place to block access.

The traceroute command can help detect whether 1. or 3. is the cause of the problem, although it has its own issues; more on that later. A positive result from either telnet or tcptraceroute shows that the target is up and reachable, which points to 2. as the cause of the failed ping.

Telnet can be used to open a connection on any port, not just the default telnet port. A successful telnet connection where ping has failed is proof positive that ping packets are being blocked, typically by a firewall. Here is an example:

$ ping cisco.com

PING cisco.com (198.133.219.25) 56(84) bytes of data.

--- cisco.com ping statistics ---

6 packets transmitted, 0 received, 100% packet loss, time 5008ms

$ telnet cisco.com 80

Trying 198.133.219.25...

Connected to cisco.com.

Escape character is '^]'.

You can see that the ping packet failed, but that telnet to port 80 succeeded in connecting to the server.

So too with tcptraceroute on port 80:

$ tcptraceroute cisco.com 80

traceroute to cisco.com (198.133.219.25), 30 hops max, 40 byte packets

1 192.168.6.254 (192.168.6.254) 8.557 ms 10.624 ms *

....

15 cisco.com (198.133.219.25) 289.162 ms 237.972 ms 242.171 ms
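
The telnet check shown earlier can also be scripted. What follows is only a minimal sketch, assuming Python 3 and its standard socket module; the function name tcp_port_open and the 3 second default timeout are illustrative choices, not anything taken from the tools above. It does what "telnet host port" demonstrates: it tries to complete a TCP connection and reports whether that worked.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and completes the TCP
        # handshake, which is what a successful "telnet host port" shows.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

If ping to a host fails but tcp_port_open(host, 80) returns True, the host is reachable and the ping packets are most likely being filtered along the way. A False result on its own says little, since the port may simply be closed.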

Another common error when using ping is to assume that the results of just a few ping tests are indicative of the condition of a data path. They may be, but such a conclusion can only be relied upon with a statistically meaningful sample size. Also, to be truly accurate, the distribution over time of the packet responses that fall outside the acceptable level needs to be known.

For example, a single ping test of four packets where one packet is dropped cannot, in any meaningful way, be used to conclude that there is 25% packet loss on that circuit. Ten thousand pings over several hours with, say, 5% lost has far more meaning. However, consider a test run over 24 hours during which the target site was down for one hour. The 100% loss during that hour looks like a general packet loss of about 4% (one hour in 24) across the whole period.

It is therefore important to review the record of the ping test and see if the distribution of any packet loss is regular or confined to a specific period, before a real conclusion can be drawn.
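
Both points, sample size and distribution, can be put into rough numbers. The sketch below is one possible way of doing so, not a standard procedure: the Wilson score interval is my choice of method for estimating how uncertain a measured loss rate is, and the function names, the one hour window and the simulated data (one ping a minute for 24 hours, with hour 10 completely lost) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def wilson_interval(lost: int, sent: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for the true loss rate."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    p = lost / sent
    denom = 1 + z * z / sent
    centre = (p + z * z / (2 * sent)) / denom
    half = z * math.sqrt(p * (1 - p) / sent + z * z / (4 * sent * sent)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def loss_by_window(samples, window_seconds: int = 3600) -> dict[int, float]:
    """Group (timestamp_seconds, replied) samples into fixed windows and
    return the fraction of lost packets in each window."""
    sent = defaultdict(int)
    lost = defaultdict(int)
    for timestamp, replied in samples:
        window = int(timestamp // window_seconds)
        sent[window] += 1
        if not replied:
            lost[window] += 1
    return {w: lost[w] / sent[w] for w in sorted(sent)}


# Simulated record (assumption): one ping per minute for 24 hours,
# with every ping during hour 10 going unanswered.
samples = [(t, not (10 * 3600 <= t < 11 * 3600)) for t in range(0, 24 * 3600, 60)]
overall_loss = sum(1 for _, replied in samples if not replied) / len(samples)
per_hour = loss_by_window(samples)

print(wilson_interval(1, 4))       # very wide: four packets say almost nothing
print(wilson_interval(500, 10000)) # narrow: 5% of ten thousand is meaningful
print(overall_loss)                # about 4.2% averaged over the whole day
print(per_hour[10], per_hour[9])   # all the loss sits in a single hour
```

With one packet lost out of four, the interval runs from a few percent to well over half, which is another way of saying the test proves nothing. The per-hour figures show what the daily average hides: the loss is confined to one window rather than spread evenly.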

A third common error is to assume that whatever result is obtained is due to the target site. For example, say 5% packet loss was found when pinging 3com.com. This by no means indicates that the problem lies with that site; the problem could be at any of the points along the data path to that site, including the source (my own computer):

$ traceroute 3com.com

traceroute to 3com.com (192.136.34.41), 30 hops max, 40 byte packets

1 192.168.6.254 (192.168.6.254) 10.285 ms 13.316 ms 14.440 ms

2 129.1.233.220.exetel.com.au (220.233.1.129) 132.994 ms 135.387 ms 136.312 ms

3 241.0.233.220.exetel.com.au (220.233.0.241) 137.192 ms 141.296 ms 162.018 ms

4 10.0.1.1 (10.0.1.1) 168.530 ms 174.358 ms 176.908 ms

5 38.2.233.220.exetel.com.au (220.233.2.38) 177.729 ms 188.233 ms 189.122 ms

6 359-ge-0-0-0.GW5.SYD2.ALTER.NET (203.166.92.57) 197.691 ms 85.598 ms 156.625 ms

7 0.so-0-2-0.XR3.SYD2.ALTER.NET (210.80.33.189) 158.108 ms 159.430 ms 160.260 ms

8 0.so-4-3-0.IR1.LAX12.ALTER.NET (210.80.50.249) 305.124 ms 305.952 ms 306.775 ms

9 0.so-5-0-0.IL1.LAX9.ALTER.NET (152.63.48.65) 313.518 ms 321.047 ms 321.868 ms

10 0.so-5-0-0.XT1.SAC1.ALTER.NET (152.63.0.98) 405.111 ms 406.359 ms 407.241 ms

11 GigabitEthernet6-0-0.GW9.SAC1.ALTER.NET (152.63.55.73) 331.091 ms 337.600 ms 341.527 ms

12 eds-gw.customer.alter.net (63.114.61.154) 357.930 ms 287.765 ms 310.755 ms

13 205.141.209.3 (205.141.209.3) 311.606 ms 312.502 ms 313.587 ms

14 10.231.1.2 (10.231.1.2) 341.277 ms 342.101 ms 342.931 ms

15 205.141.209.133 (205.141.209.133) 344.380 ms 345.861 ms 346.689 ms

16 ip-192-136-34-41.ip.3com.com (192.136.34.41) 261.317 ms 266.998 ms 346.689 ms

You can clearly see the number of hops the data must traverse. In this case there is no evidence of any problem along the data path. But if the traceroute looked like this:

$ traceroute 3com.com

traceroute to 3com.com (192.136.34.41), 30 hops max, 40 byte packets

1 192.168.6.254 (192.168.6.254) 10.285 ms 13.316 ms 14.440 ms

2 129.1.233.220.exetel.com.au (220.233.1.129) 132.994 ms 135.387 ms 136.312 ms

3 241.0.233.220.exetel.com.au (220.233.0.241) 137.192 ms 141.296 ms 162.018 ms

4 10.0.1.1 (10.0.1.1) 168.530 ms 174.358 ms 176.908 ms

5 38.2.233.220.exetel.com.au (220.233.2.38) 177.729 ms 188.233 ms 189.122 ms

6 359-ge-0-0-0.GW5.SYD2.ALTER.NET (203.166.92.57) 197.691 ms 85.598 ms 156.625 ms

7 0.so-0-2-0.XR3.SYD2.ALTER.NET (210.80.33.189) 758.108 ms 759.430 ms *

8 0.so-4-3-0.IR1.LAX12.ALTER.NET (210.80.50.249) * * 806.775 ms

9 0.so-5-0-0.IL1.LAX9.ALTER.NET (152.63.48.65) 813.518 ms * 721.868 ms

10 0.so-5-0-0.XT1.SAC1.ALTER.NET (152.63.0.98) * 1406.359 ms 1007.241 ms

11 GigabitEthernet6-0-0.GW9.SAC1.ALTER.NET (152.63.55.73) 731.091 ms 737.600 ms 1341.527 ms

12 eds-gw.customer.alter.net (63.114.61.154) 357.930 ms * *

13 205.141.209.3 (205.141.209.3) 811.606 ms 812.502 ms 813.587 ms

14 10.231.1.2 (10.231.1.2) 741.277 ms 742.101 ms 1342.931 ms

15 205.141.209.133 (205.141.209.133) * * 746.689 ms

16 ip-192-136-34-41.ip.3com.com (192.136.34.41) 761.317 ms 866.998 ms *

It would be reasonable to conclude that some serious problem between hop 6 and hop 7 is causing the ping test to return its lossy result.
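
Reading the output by eye works for one trace, but the same reasoning can be sketched in code. This is a rough heuristic only, assuming traceroute output in the format shown above; the names parse_hops and first_degraded_hop and the 300 ms jump threshold are hypothetical choices of mine. It flags the first hop that shows a timeout (*) or whose best round trip time is far above that of the previous hop.

```python
import re
from typing import NamedTuple, Optional

# Matches round trip times such as "758.108 ms".
RTT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+ms\b")


class Hop(NamedTuple):
    number: int
    rtts_ms: list
    timeouts: int


def parse_hops(traceroute_output: str) -> list:
    """Turn traceroute text into a list of Hop records."""
    hops = []
    for line in traceroute_output.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].isdigit():
            continue  # skip the command, the header and blank lines
        rtts = [float(value) for value in RTT_PATTERN.findall(line)]
        hops.append(Hop(int(tokens[0]), rtts, tokens.count("*")))
    return hops


def first_degraded_hop(hops, jump_ms: float = 300.0) -> Optional[int]:
    """Return the number of the first hop with a timeout, or whose best RTT
    exceeds the previous hop's best RTT by more than jump_ms; else None."""
    previous_best = None
    for hop in hops:
        if hop.timeouts:
            return hop.number
        best = min(hop.rtts_ms) if hop.rtts_ms else None
        if previous_best is not None and best is not None:
            if best - previous_best > jump_ms:
                return hop.number
        if best is not None:
            previous_best = best
    return None
```

Fed the second trace above, first_degraded_hop(parse_hops(text)) returns 7, which matches the reading that the trouble starts between hop 6 and hop 7. A single * on an otherwise healthy trace will also trip it, so treat the result as a pointer for further testing rather than a verdict.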

To conclude, we can see that ping:

1. is a useful tool to indicate where a problem may be
2. should be used in combination with other tests to eliminate false positives
3. should not be used for small, isolated tests
4. is a good indicator of problems over statistically meaningful sample sizes

Broadband Internet

Tuesday, June 10, 2008
