Thursday, July 15, 2010

Server High Load Troubleshooting

The first command I run when I log in to the system is uptime:

$ uptime
18:30:35 up 365 days, 5:29, 2 users, load average: 1.37, 10.15, 8.10

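My load average is 1.37, 10.15, 8.10. These numbers represent my average system load over the last 1, 5 and 15 minutes, respectively. Technically speaking, the load average is the average number of processes that were either running on or waiting for a CPU during the last 1, 5 or 15 minutes. On Linux, processes blocked in uninterruptible sleep (typically waiting on disk I/O) are counted as well, which is why heavy I/O can also drive the load up.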
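If you want to pull these three numbers into a script, a minimal sketch in Python might look like the following. The function names parse_load_averages and load_per_cpu are hypothetical, and dividing the load by the number of CPUs is a common rule of thumb that I am assuming here rather than something uptime reports itself.

```python
import re


def parse_load_averages(uptime_line):
    """Return the 1-, 5- and 15-minute load averages from one line of uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_per_cpu(load, cpu_count):
    """Normalize a load value by the number of CPUs (a rule of thumb, not a hard limit)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / cpu_count


# Example line taken from the uptime output shown above.
sample = "18:30:35 up 365 days, 5:29, 2 users, load average: 1.37, 10.15, 8.10"
one, five, fifteen = parse_load_averages(sample)
print("1 min: %.2f, 5 min: %.2f, 15 min: %.2f" % (one, five, fifteen))
```

On a live Unix system, os.getloadavg() returns the same three values without any parsing, and os.cpu_count() gives a CPU count you could pass to load_per_cpu.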

$ top

If the first tool I use when I log in to a sluggish system is uptime, the second tool I use is top. The great thing about top is that it’s available for all major Linux systems, and it provides a lot of useful information in a single screen. top is quite a complex tool, with enough options to warrant its own article. For this column, I stick to how to interpret its output to diagnose high load.

CPU-Bound Load

CPU-bound load occurs when you have too many CPU-intensive processes running at once. Because each process needs CPU resources, they all must wait their turn. To check whether load is CPU-bound, check the CPU line in the top output:

Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st


Each of these percentages is the share of CPU time tied up doing a particular task. Again, you could spend an entire column on all of the output from top, so here are a few of these values and how to read them.

us: user CPU time. More often than not, when you have CPU-bound load, it’s due to a process
run by a user on the system, such as Apache, MySQL or maybe a shell script. If this percentage
is high, a user process such as those is a likely cause of the load.

sy: system CPU time. The system CPU time is the percentage of the CPU tied up by kernel and
other system processes. CPU-bound load should manifest as a high percentage of either user or
system CPU time.

id: CPU idle time. This is the percentage of the time that the CPU spends idle. The higher the
number here the better! In fact, if you see really high CPU idle time, it’s a good indication that
any high load is not CPU-bound.

wa: I/O wait. The I/O wait value shows the percentage of time the CPU spends waiting on I/O (typically disk I/O). If you have high load and this value is high, it’s likely the load is not CPU-bound but is due to either RAM issues or high disk I/O.
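To make the reading of these fields concrete, here is a rough Python sketch that parses a Cpu(s) line and applies the reasoning above. It assumes the older "11.4%us" style of top output shown in this post (newer versions of top format the line differently). The function names and both thresholds are hypothetical values I picked for illustration, not numbers that top defines.

```python
import re


def parse_cpu_line(line):
    """Map each field label (us, sy, id, wa, ...) to its percentage as a float."""
    fields = {}
    # Matches values such as "11.4%us" and also ".7%wa" (no leading zero).
    for value, label in re.findall(r"(\d*\.?\d+)%([a-z]+)", line):
        fields[label] = float(value)
    return fields


def classify_load(cpu, busy_threshold=50.0, wait_threshold=20.0):
    """Guess the kind of load from CPU percentages.

    The thresholds are illustrative assumptions; tune them for your own systems.
    """
    if cpu.get("wa", 0.0) >= wait_threshold:
        # High I/O wait: likely RAM issues or heavy disk I/O, not CPU.
        return "io-or-memory"
    if cpu.get("us", 0.0) + cpu.get("sy", 0.0) >= busy_threshold:
        # High user plus system time: likely CPU-bound.
        return "cpu-bound"
    return "inconclusive"


sample = "Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st"
cpu = parse_cpu_line(sample)
print(cpu["us"], cpu["sy"], cpu["id"], cpu["wa"])
print(classify_load(cpu))
```

With the sample line, user plus system time is well under the assumed 50% threshold and I/O wait is close to zero, so the sketch reports the result as inconclusive rather than forcing a diagnosis.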