Linux Utilities for Diagnostics

I spend a fair amount of time troubleshooting issues on Linux and other Unix and Unix-like systems. While there are dozens of utilities I use for diagnosing and resolving issues, I consistently employ a small set of tools to do quick, high-level checks of system health. These checks fall into three categories: disk utilization, memory and CPU utilization, and networking and connectivity. Triaging the health of the system in each of these categories allows me to quickly home in on where a problem may exist.

These utilities are usually available on all Linux systems. Most are available, or have analogues, on other Unix and Unix-like systems.

Disk Utilization

Generally, disk utilization is the first thing I check as a lack of free disk space spells certain doom for most user and kernel processes. I have seen more strange behavior from a lack of free disk space than anything else.

  • df reports filesystem disk space usage. This quickly allows me to see how much free space remains on each filesystem.

  • df -h displays, in human-readable format, the free space available on all mounted filesystems.

    
    $ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda        47G   26G   19G  58% /
    devtmpfs        4.0G   12K  4.0G   1% /dev
    none            802M  184K  802M   1% /run
    none            5.0M     0  5.0M   0% /run/lock
    none            4.0G     0  4.0G   0% /run/shm
    

  • du estimates file space usage. This allows me to pinpoint which files and directories are taking up large amounts of disk space so I can investigate further.
  • du -sh * summarizes, in human-readable format, the space utilized by all files/folders in the current directory.
    
    $ du -sh *
    18M     bundle
    8.6M    cached-copy
    444M    log
    4.0K    pids
    4.0K    system
    
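The human-readable sizes printed by du -sh are easy to read but awkward to sort, so when a directory has many entries it can help to rank them numerically instead. The following is a minimal sketch, assuming the input comes from du -sk (sizes in kilobytes); largest_entries is a hypothetical helper name, not part of du.

```shell
#!/bin/sh
# Sketch: rank entries by size, largest first.
# Reads `du -sk` output (size in KiB, a tab, then the name) on stdin.
# largest_entries is a hypothetical helper name used for illustration.
largest_entries() {
    count="${1:-5}"              # how many entries to show; 5 is an arbitrary default
    sort -rn | head -n "$count"  # numeric sort on the leading size column, descending
}
```

It could be used as du -sk * | largest_entries 3 to show the three largest entries in the current directory.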

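The df check can also be reduced to a quick filter that only prints filesystems that are nearly full. This is a rough sketch rather than a polished tool: fs_over_threshold is a hypothetical helper name, the default of 90 percent is an arbitrary assumption, and it assumes mount points contain no spaces.

```shell
#!/bin/sh
# Sketch: list filesystems whose usage is at or above a threshold.
# Reads df output on stdin (df -P is the safest choice because it keeps
# each filesystem on a single line). fs_over_threshold is a hypothetical name.
fs_over_threshold() {
    limit="${1:-90}"             # percent; 90 is an arbitrary default
    awk -v limit="$limit" '
        NR > 1 {                 # skip the header line
            pct = $5             # fifth column is the use percentage, e.g. "58%"
            sub(/%$/, "", pct)   # strip the trailing percent sign
            if (pct + 0 >= limit + 0)
                print $6, $5     # mount point and use percentage
        }'
}
```

For example, df -P | fs_over_threshold 80 would print only the mount points that are at least 80% full.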
Memory, CPU Utilization, and I/O

Running out of available memory is also a major cause of performance problems and strange behavior on systems. CPU utilization and I/O rates can quickly provide clues as to whether performance problems stem from bottlenecks internal to a given system or from external sources.

  • free reports the amount of free and used memory on the system. This provides immediate feedback on whether a system lacks free memory.
  • free -m displays, in megabytes, the amount of used and free physical and swap memory, and the amount of memory used for buffers/caching.
    
    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:          8014       6339       1674          0        136       3887
    -/+ buffers/cache:       2314       5699
    Swap:          511        153        358
    
  • vmstat reports on memory, swap, I/O, system activity, and CPU activity. This provides averages of various metrics since boot and can report continuously on current metrics. Analyzing the metrics can provide insight into what the system is doing at a given time (e.g. frequently swapping, waiting on I/O, etc.).
  • vmstat -S M 1 will print out the metrics once every second until halted, reporting memory figures in megabytes instead of the default kilobytes.
    
    $ vmstat -S M 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
    r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
    1  0     80     59     43   1791    0    0     0     0 1131 1057 15  2 83  0
    0  0     80     57     43   1791    0    0     8    96 1031  936 19  2 79  0
    0  0     80     60     43   1791    0    0    40    64 1666 1444  9  2 89  0
    0  0     80     60     43   1791    0    0     8     0  667  553  0  0 100  0
    1  0     80     57     43   1791   16    0    16   104  808  748 12  2 86  0
    0  0     80     59     43   1791    0    0    12  3028 1813 1723 44  5 50  0
    0  0     80     59     43   1791    0    0     0    56 1119 1066 17  1 81  0
    1  0     80     50     43   1791    0    0    68     0 1219 1024 25  4 71  0
    0  0     80     60     43   1791    0    0    52    68 1725 1435 12  1 86  0
    0  0     80     60     43   1791    0    0     8     0 2236 1699 35  5 60  0
    0  0     80     60     43   1791    0    0     0    68  163  209  0  0 99  0
    1  0     80     60     43   1791    0    0     0   140 1456 1379 22  3 74  0
    1  0     80     61     43   1791    0    0     0    56 1481 1242 24  4 72  0
    0  0     80     60     43   1791    0    0   356     0 1359  930 11  3 86  0
    0  0     80     60     43   1792    0    0   428     0 1619  992  2  1 97  0
    0  0     80     60     43   1792    0    0     8  2196  313  396  0  0 100  0
    0  0     80     60     43   1792    0    0     0     0  144  181  0  0 100  0
    
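When scanning many lines of vmstat output, the columns I care most about are si and so (swap in and out) and wa (time spent waiting on I/O). A small filter can point out the interesting samples. This is only a sketch: vmstat_flags is a hypothetical helper name, the 20 percent I/O wait threshold is an arbitrary assumption, and the column positions assume the layout shown above, where si, so, and wa are the 7th, 8th, and 16th fields.

```shell
#!/bin/sh
# Sketch: flag vmstat samples that show swapping or high I/O wait.
# Reads vmstat output on stdin. Column positions assume the layout
# r b swpd free buff cache si so bi bo in cs us sy id wa.
# vmstat_flags is a hypothetical helper name; the 20% threshold is arbitrary.
vmstat_flags() {
    awk '
        $1 !~ /^[0-9]+$/ { next }   # skip the two header lines
        $7 > 0 || $8 > 0 { print "swap activity: si=" $7 " so=" $8 }
        $16 >= 20        { print "high I/O wait: wa=" $16 "%" }
    '
}
```

Piping a few seconds of vmstat output through vmstat_flags prints nothing on a healthy system, which makes the occasional flagged sample stand out.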

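The free output deserves a second look, because memory used for buffers and cache can generally be reclaimed when applications need it. Adding free, buffers, and cached from the Mem: line of the sample above gives 1674 + 136 + 3887 = 5697 MB, which agrees with the 5699 shown on the -/+ buffers/cache line to within rounding. A minimal sketch of that arithmetic follows; mem_reclaimable is a hypothetical helper name, and it assumes the older column layout shown above (newer versions of free arrange the columns differently).

```shell
#!/bin/sh
# Sketch: estimate memory available to applications from `free -m` output.
# Assumes the older layout: Mem: total used free shared buffers cached.
# mem_reclaimable is a hypothetical helper name used for illustration.
mem_reclaimable() {
    # free ($4) + buffers ($6) + cached ($7) on the "Mem:" line
    awk '$1 == "Mem:" { print $4 + $6 + $7 }'
}
```

It could be used as free -m | mem_reclaimable on a system whose free prints separate buffers and cached columns.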
Networking and Connectivity

Network connectivity and routing issues are usually apparent. However, trying to determine the exact nature of or reason for the issue can be a bit more difficult.

  • ping sends an ICMP echo request to a host. This provides immediate confirmation of whether or not a remote host is accessible.
  • ping 8.8.8.8 will ping Google’s DNS servers, which usually indicates with a high degree of certainty whether or not Internet connectivity is available.
    
    $ ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
    64 bytes from 8.8.8.8: icmp_req=1 ttl=54 time=0.681 ms
    64 bytes from 8.8.8.8: icmp_req=2 ttl=54 time=0.679 ms
    64 bytes from 8.8.8.8: icmp_req=3 ttl=54 time=0.703 ms
    64 bytes from 8.8.8.8: icmp_req=4 ttl=54 time=0.703 ms
    64 bytes from 8.8.8.8: icmp_req=5 ttl=54 time=0.677 ms
    
  • mtr combines ping with traceroute and prints the route that packets take to a remote host, along with response times and loss percentages for each hop.

  • mtr -c 5 -r 8.8.8.8 will send five packets to Google’s DNS servers and report back the intermediate routers, with details about response times and packet loss along the way.

    
    $ mtr -c 5 -r 8.8.8.8
    HOST: localhost                 Loss%   Snt   Last   Avg  Best  Wrst StDev
    1.|-- router2-dal.linode.com     0.0%     5    0.9   0.7   0.6   0.9   0.2
    2.|-- ae2.car02.dllstx2.network  0.0%     5    0.3   6.3   0.3  30.5  13.5
    3.|-- po102.dsr01.dllstx2.netwo  0.0%     5    1.1   0.6   0.5   1.1   0.3
    4.|-- po21.dsr01.dllstx3.networ  0.0%     5    1.3   2.5   0.6   8.0   3.1
    5.|-- ae17.bbr02.eq01.dal03.net  0.0%     5    0.5   0.6   0.5   0.8   0.1
    6.|-- ae7.bbr01.eq01.dal03.netw  0.0%     5    0.5   0.6   0.5   0.7   0.1
    7.|-- 25.10.6132.ip4.static.sl-  0.0%     5    0.6   0.9   0.6   2.2   0.7
    8.|-- 216.239.50.89              0.0%     5    0.5   0.6   0.5   0.8   0.1
    9.|-- 64.233.174.69              0.0%     5    1.0   0.8   0.8   1.0   0.1
    10.|-- google-public-dns-a.googl  0.0%     5    0.8   0.8   0.7   0.8   0.0
    

  • netstat displays information about network connections, routing tables, and interfaces. While it is a sophisticated tool with many possible applications, it provides an easy way to display a few important bits of data:
  • netstat -nlp displays information about processes that are currently listening on a socket. Running it with sudo ensures the PID/Program name column is filled in for processes owned by other users.
    
    $ sudo netstat -nlp
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      2858/mysqld
    tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      2665/sshd
    tcp        0      0 0.0.0.0:25              0.0.0.0:*               LISTEN      3133/master
    tcp6       0      0 :::8080                 :::*                    LISTEN      3160/apache2
    tcp6       0      0 :::22                   :::*                    LISTEN      2665/sshd
    tcp6       0      0 :::25                   :::*                    LISTEN      3133/master
    tcp6       0      0 :::443                  :::*                    LISTEN      3160/apache2
    udp        0      0 0.0.0.0:68              0.0.0.0:*                           2633/dhclient3
    
  • netstat -rn displays the current routing table.
    
    $ netstat -rn
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
    0.0.0.0         173.255.206.1   0.0.0.0         UG        0 0          0 eth0
    173.255.206.0   0.0.0.0         255.255.255.0   U         0 0          0 eth0
    
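In the routing table above, the default route is the entry with a destination of 0.0.0.0 and the G (gateway) flag, so the default gateway is 173.255.206.1 on eth0. When all I want is that one fact, a small filter does the job. This is a sketch under the assumption that the input has the eight-column netstat -rn layout shown above; default_gateway is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the default gateway and its interface from `netstat -rn` output.
# Assumes the columns: Destination Gateway Genmask Flags MSS Window irtt Iface.
# default_gateway is a hypothetical helper name used for illustration.
default_gateway() {
    # The default route has destination 0.0.0.0 and the G (gateway) flag set
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2, $8 }'
}
```

It could be used as netstat -rn | default_gateway; no output suggests that no default route is configured.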

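For longer mtr reports, it can be handy to print only the hops that dropped packets. The following sketch assumes the report layout shown in the mtr example, where each hop line starts with a hop number and the third field is the loss percentage; mtr_lossy_hops is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list hops with non-zero packet loss from an `mtr -r` report.
# Assumes hop lines of the form "3.|-- hostname  0.0%  5  ...".
# mtr_lossy_hops is a hypothetical helper name used for illustration.
mtr_lossy_hops() {
    awk '
        $1 ~ /^[0-9]+\./ {           # hop lines start with a hop number and a dot
            loss = $3                # third field is the loss percentage, e.g. "0.0%"
            sub(/%$/, "", loss)      # strip the trailing percent sign
            if (loss + 0 > 0)
                print $2, $3         # host and loss percentage
        }'
}
```

It could be used as mtr -c 5 -r 8.8.8.8 | mtr_lossy_hops; for the report shown earlier it would print nothing, since every hop reports 0.0% loss.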
Conclusion

The examples above show some of the most common ways these utilities can be used to perform diagnostics on systems based on disk utilization, memory and CPU utilization, and network activity and connectivity. Some of these utilities (particularly netstat) are quite powerful, and could be used to display or diagnose much more than shown in the examples above. Past troubleshooting experience, and the specific histories of given systems, guide the particular ways that I deploy these tools to assist in the investigation and resolution of system issues.