Does 100% utilization really mean the disk is the bottleneck?

The man page of iostat on Linux explains %util as follows.

%util:

Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

I have seen many cases where people misuse this metric to decide whether a system is running into a disk I/O bottleneck.

To explain what can go wrong with %util, let’s start with a fio benchmark run.

$ fio --ioengine=libaio --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --iodepth=128 --direct=1 --group_reporting --time_based --runtime=60 --numjobs=1 --name=fiojob1 --filename=/dev/nvme1n1
write: io=2555.8MB, bw=43615KB/s, iops=10903, runt= 60004msec

During the fio run, we can examine the iostat output to understand the disk utilization.
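
Assuming the sysstat package is installed, a report like the one below can be produced by running iostat with extended statistics, the device name used in this benchmark, and a one-second interval:

$ iostat -dx nvme1n1 1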

Device:   rrqm/s  wrqm/s  r/s  w/s      rkB/s  wkB/s    avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme1n1   0       0       0    10993    0      43972    8         30.13     2.74   0        2.74     0.09   100.08

The iostat output shows the disk utilization at 100%. Can we conclude that the system is running into a disk I/O bottleneck? Not yet!

The fio benchmark runs on an AWS EC2 instance with one EBS volume attached. The IOPS limit was set to 16000 when the EBS volume was created. Obviously, that ceiling has not been reached yet, since the achieved w/s is only 10993. So why is the disk utilization already 100%?
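
To put a number on the headroom, using the figures from this run, the volume is only at about 69% of its provisioned IOPS even though %util reports 100%:

$ awk 'BEGIN { printf "%.0f%% of provisioned IOPS\n", 10993 / 16000 * 100 }'
69% of provisioned IOPS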

The answer lies in parallelism.

The man page defines %util as the “percentage of elapsed time during which I/O requests were issued to the device” and then claims that “device saturation occurs when this value is close to 100%”. That second statement is no longer correct for modern SSDs.

For a traditional magnetic hard disk, which by the nature of its spinning platters and single head can only service one I/O at a time, the statement makes sense: 100% busy means no capacity left. A modern SSD, however, can service many I/Os in parallel. When the disk shows 100% busy, it only means that at least one request was in flight for the entire sampling period. It does not mean the SSD cannot accept additional I/Os while it is already 100% busy servicing requests.
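
To see why, look at where %util comes from. Roughly speaking, iostat derives it from the “time spent doing I/Os” counter in /proc/diskstats (the 13th field), which advances whenever the device has at least one request in flight, no matter whether that is 1 request or 128. A rough sketch of the calculation over a one-second interval, using the device name from this benchmark:

$ # field 13 of /proc/diskstats: total time (ms) the device had at least one I/O in flight
$ t1=$(awk '$3 == "nvme1n1" { print $13 }' /proc/diskstats); sleep 1
$ t2=$(awk '$3 == "nvme1n1" { print $13 }' /proc/diskstats)
$ echo "%util ~ $(( (t2 - t1) / 10 ))"   # busy ms / 1000 ms elapsed * 100

Because the counter advances at the same rate whether the queue depth is 1 or 128, %util pins at 100% long before a multi-queue NVMe device runs out of capacity.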

To push the disk to its real limit, we need a way to increase I/O parallelism. We can do that by raising numjobs from 1 to 2, or more.

numjobs=2:

$ fio --ioengine=libaio --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --iodepth=128 --direct=1 --group_reporting --time_based --runtime=60 --numjobs=2 --name=fiojob1 --filename=/dev/nvme1n1
write: io=3122.2MB, bw=53283KB/s, iops=13320, runt= 60003msec
Device:   rrqm/s  wrqm/s  r/s  w/s      rkB/s  wkB/s    avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme1n1   0       0       0    13412.2  0      53648.8  8         44.84     3.34   0        3.34     0.07   100

Now w/s has increased from 10993 to 13412. We can keep pushing toward the 16000 limit.

numjobs=4:

$ fio --ioengine=libaio --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --iodepth=128 --direct=1 --group_reporting --time_based --runtime=60 --numjobs=4 --name=fiojob1 --filename=/dev/nvme1n1
write: io=3798.2MB, bw=64817KB/s, iops=16204, runt= 60004msec
Device:   rrqm/s  wrqm/s  r/s  w/s      rkB/s  wkB/s    avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme1n1   0       0       0    16001.6  0      64006.4  8         50.15     3.13   0        3.13     0.06   100

At this point, the disk has reached its IOPS limit of 16000 and is finally saturated.
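
For cloud volumes with a known provisioned limit, a more reliable saturation check than %util is to compare the achieved IOPS (r/s + w/s) against that limit. A minimal sketch, assuming the older iostat column layout shown above (r/s and w/s in the 4th and 5th columns) and the 16000 IOPS limit from this example:

$ iostat -dx nvme1n1 5 2 | awk '$1 == "nvme1n1" { iops = $4 + $5 } END { printf "%.0f IOPS, %.0f%% of the provisioned 16000\n", iops, iops / 16000 * 100 }'

Notice also that avgqu-sz keeps climbing across the three runs (30 → 45 → 50) while %util is pinned at 100% throughout; the growing queue, not the utilization figure, is what signals that the device is approaching its real limit.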