The complete explanation for await, svctm and %util in iostat

iostat is a Linux I/O performance monitoring utility. It’s very commonly used to analyze device utilization.

/proc/diskstats

The statistics fields in iostat are calculated based on the I/O statistics of block devices in /proc/diskstats.

Each line in the /proc/diskstats file contains the following 14 fields (additional fields were added in kernel 4.18 and later).

 1  major number
 2  minor number
 3  device name
 4  reads completed successfully
 5  reads merged
 6  sectors read
 7  time spent reading (ms)
 8  writes completed
 9  writes merged
10  sectors written
11  time spent writing (ms)
12  I/Os currently in progress
13  time spent doing I/Os (ms)
14  weighted time spent doing I/Os (ms)

We will need to know these statistics later when we learn how iostat calculates its fields.
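
For instance, the counters that iostat consumes later in this post can be pulled out directly with a one-liner like the following sketch (the device name nvme1n1 is just a placeholder; the awk column positions assume the 14-field layout above, where field N is simply column N):

# Print reads/writes completed (fields 4 and 8), time spent reading/writing
# (fields 7 and 11), and time spent doing I/Os (field 13) for one device.
awk '$3 == "nvme1n1" {
    printf "rd_ios=%s rd_ticks=%s wr_ios=%s wr_ticks=%s ticks=%s\n",
           $4, $7, $8, $11, $13
}' /proc/diskstats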

iostat

Now, let’s use the fio load generator to benchmark an AWS EBS gp3 volume and examine the iostat report.

$ fio --ioengine=libaio --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --iodepth=128 --direct=1 --group_reporting --time_based --runtime=60 --numjobs=1 --name=fiojob1 --filename=/dev/nvme1n1
write: io=2555.8MB, bw=43615KB/s, iops=10903, runt= 60004msec
Device:  rrqm/s  wrqm/s  r/s  w/s      rkB/s  wkB/s    avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme1n1  0       0       0    10993    0      43972    8         30.13     2.74   0        2.74     0.09   100
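
Extended per-device statistics like the report above can be collected with something along the lines of the command below (the exact flags used for the original report are an assumption; this form requests extended statistics for one device, refreshed every second):

iostat -x nvme1n1 1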

As we can see above, it achieves 10903 iops and the disk is 100% utilized.

You may naturally conclude that the disk has hit a bottleneck since it’s 100% busy. Is this really true? Before we answer this question, let’s first understand how iostat computes its statistics fields.

How are the iostat fields calculated?

Since the basic fields, such as r/s and w/s, are very straightforward, we will mainly focus on the following three extended fields, because they are commonly used to identify a disk bottleneck.

  • await
  • svctm
  • %util

From the iostat source code, the total number of read and write I/Os is calculated as:

n_ios  = blkio.rd_ios + blkio.wr_ios;

To calculate the total time (ms) that I/O requests spent in the queue and being serviced:

n_ticks = blkio.rd_ticks + blkio.wr_ticks;

To calculate the average I/O wait time (ms), which iostat reports as await:

wait = n_ios ? n_ticks / n_ios : 0.0;
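
As a quick sanity check against the first run above: roughly 10993 writes completed per second with an await of 2.74 ms means the requests accumulated about 10993 × 2.74 ≈ 30,100 ms of wait time per second of wall clock, i.e. about 30 requests were outstanding on average, which matches the reported avgqu-sz of 30.13.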

To calculate the average I/O service time (ms), which iostat reports as svctm:

svc_t = n_ios ? blkio.ticks / n_ios : 0.0;

Note: blkio.ticks is derived from field 13, “time spent doing I/Os (ms)”, in /proc/diskstats.

To calculate the disk utilization (%util):

busy = 100.0 * blkio.ticks / deltams; /* percentage! */
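
Plugging in the numbers from the first run: with %util at 100%, blkio.ticks advances by roughly the whole reporting interval, about 1000 ms per second of wall clock. Dividing that by the ~10993 I/Os completed in the same second gives svc_t ≈ 1000 / 10993 ≈ 0.09 ms, which is exactly the svctm value in the report above.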

Two traps in iostat

With the above implementation, there are actually two traps when using these fields to identify a disk bottleneck:

  • svctm - average I/O service time(ms)
  • %util - disk utilization

Let’s try to increase the fio numjobs from 1 to 4.

$ fio --ioengine=libaio --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --iodepth=128 --direct=1 --group_reporting --time_based --runtime=60 --numjobs=4 --name=fiojob1 --filename=/dev/nvme1n1
write: io=3798.2MB, bw=64817KB/s, iops=16204, runt= 60004msec
Device:  rrqm/s  wrqm/s  r/s  w/s        rkB/s  wkB/s      avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme1n1  0       0       0    16001.6    0      64006.4    8         50.15     3.13   0        3.13     0.06   100

Comparing the 1-job and 4-job fio runs:

  • The w/s increases from 10903 to 16001 even though the disk utilization is 100% in both runs. This means that with one job, the disk was not yet fully saturated.
  • The average I/O service time (svctm) drops from 0.09 ms to 0.06 ms.

Taken literally, this says the disk could respond to a request in 0.09 ms under lighter load and in 0.06 ms under heavier load. That seems unlikely, and it’s not what iostat is supposed to tell us about the average disk service time.
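
The arithmetic from the previous section explains the paradox. Once the device is busy for the entire reporting interval, blkio.ticks is pinned at roughly the interval length, so svctm degenerates into interval ÷ number of I/Os: about 1000 / 10993 ≈ 0.09 ms for the 1-job run and 1000 / 16001 ≈ 0.06 ms for the 4-job run. svctm drops simply because more I/Os complete per second, not because the device got faster.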

And in the iostat man page, you may see this warning:

svctm
The average service time (in milliseconds) for I/O requests that were issued to the device. Warning! Do not trust this field any more. This field will be removed in a future sysstat version.

For a traditional spinning hard disk, I/O has to be serialized because the disk head must physically move from one platter location to another. That means only one I/O can be serviced at a time, and in that case svctm can reflect how quickly each I/O is serviced.

For a modern SSD, this no longer holds, because the disk can service multiple I/Os at once. Even if it’s 100% utilized, that only means the disk was busy servicing I/O requests during that period. It doesn’t necessarily mean the disk is already saturated.

In order to identify whether the disk is completely saturated (peaked), the only way is to offer it more work to do in parallel. In the above example, by increasing fio numjobs from 1 to 4, the w/s peaks at 16001, which aligns with the provisioned IOPS of the EBS gp3 volume.
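
One way to run this kind of probe is sketched below; it simply reuses the fio parameters from the runs above and steps up numjobs while you watch w/s in iostat (adjust the device name and job counts for your environment):

# Step up the offered parallelism; the device is truly saturated once w/s stops climbing.
for jobs in 1 2 4 8; do
    fio --ioengine=libaio --blocksize=4k --readwrite=write --filesize=10G \
        --end_fsync=1 --iodepth=128 --direct=1 --group_reporting --time_based \
        --runtime=60 --numjobs=$jobs --name=fiojob1 --filename=/dev/nvme1n1
done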

In short, both the svctm and %util fields in iostat can be misleading for modern SSD-based storage systems, and they should be used with extra care.
