Whatever monitoring solution you use, CPU utilization is important, but if you really want to know what is happening with your system, you should look at the load average. I will explain why below.
Current CPU utilization does not reflect the actual load of the system: a heavily loaded host does not necessarily show CPU usage close to or at 100%.
Furthermore, CPU utilization tends to generate a significant number of false alerts in monitoring tools, even when the alert threshold is set close to 100%. An explanation follows.
There is one way you can still benefit from this type of monitoring: receiving an alert when a server's CPU utilization stays at 100% over a 5-minute interval.
In that situation, someone should probably take a look at the server, even though this does not necessarily mean the system is overloaded. If you receive too many alerts, the interval can be increased.
What brings more value to the feedback we receive from our servers is the load average.
The load average consists of three values: averages over the last 1, 5 and 15 minutes.
The three load values are exponentially damped moving averages, computed continuously since the system booted; old samples decay at different speeds in each one (their weight drops by a factor of e after 1, 5 and 15 minutes respectively). Without going too deep, the important point is that the 1-minute load average does not cover only the last 60 seconds of activity, but it is not far off: about 63% of its weight comes from the last minute and about 37% from everything since startup. The same principle applies to the 5- and 15-minute values.
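The 63%/37% split can be sketched with a quick awk one-liner. This is illustrative only; the kernel uses fixed-point arithmetic internally, but it does sample roughly every 5 seconds:

```shell
# Illustrative sketch of the 1-minute exponentially damped average.
# Assumes 2 runnable tasks for a full minute, sampled every 5 seconds,
# starting from a load of 0.
awk 'BEGIN {
    e = exp(-5/60)                    # per-tick decay factor (1-minute average)
    load = 0
    for (t = 5; t <= 60; t += 5)
        load = load * e + 2 * (1 - e) # blend in the current task count (2)
    printf "%.2f\n", load             # ~1.26, i.e. ~63% of the way to 2
}'
```

After a full minute of constant load 2, the value has only reached about 2 × 0.63 = 1.26, which is exactly where the 63%-of-the-last-minute figure comes from.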
Each process using or waiting for the CPU adds 1 to the load number. On Linux, processes in uninterruptible sleep (usually waiting for disk activity) are also counted. This means you can have a heavily loaded (and probably unresponsive) server with CPU utilization near zero, and receive no alert at all.
One relevant example: a stalled NFS share can put the processes using it into uninterruptible sleep, increasing the load while CPU usage looks normal. So you would have zero alerts for a troubled system.
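You can see which tasks are feeding the load number yourself: in ps, runnable tasks show state R and uninterruptible sleepers show state D. A minimal sketch on a Linux box (the counts will obviously vary per system):

```shell
# Count tasks currently contributing to the load number: runnable (R)
# and uninterruptible sleep (D); "stat=" suppresses the header line.
ps -eo stat= | grep -c '^[RD]'

# Compare with the kernel's own view; the fourth field of /proc/loadavg
# is currently-runnable/total tasks.
cat /proc/loadavg
```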
Load average versus CPU threads
There is one more aspect relevant for load average monitoring: the number of CPUs.
A 1-minute load average of 6 does not have the same impact on a system with 4 processors as on a system with 10 processors.
So the load value alone does not give us the true status of the system; it must be weighed against the number of CPU threads.
From the example below, we can easily conclude that a load average of 33 is OK for this system, as we have 4 CPUs with 16 threads each:
user@gzlinux $ grep -E "cpu cores|siblings|physical id" /proc/cpuinfo | xargs -n 11 echo | sort | uniq
physical id : 0 siblings : 16 cpu cores : 8
physical id : 1 siblings : 16 cpu cores : 8
physical id : 2 siblings : 16 cpu cores : 8
physical id : 3 siblings : 16 cpu cores : 8
user@gzlinux $ cat /proc/cpuinfo | grep processor | wc -l
64
user@gzlinux $ cat /proc/loadavg
33.34 34.92 22.28 10/46253 386038
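That conclusion can be reduced to a single number by dividing the 1-minute load average by the count of logical CPUs. A minimal sketch, assuming a Linux system where nproc is available:

```shell
# Per-CPU load: 1-minute load average divided by the logical CPU count.
# Sustained values well above 1.0 mean tasks are queuing for the CPUs.
load=$(cut -d ' ' -f1 /proc/loadavg)
cpus=$(nproc)
awk -v l="$load" -v c="$cpus" 'BEGIN { printf "load per CPU: %.2f\n", l / c }'
```

On the example system above this gives 33.34 / 64 ≈ 0.52, comfortably below 1 per thread.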
Inspiration and further information about load average: