Emulate a slow block device with dm-delay

You may need to emulate a slow block device when debugging a high-latency performance issue and you don’t have a slow disk at hand. The device mapper driver in Linux provides a solution for this: the “delay” target.

In this post, we will learn how to use the “delay” target to delay reads and writes to a block device.

In the following experiment, we add extra latency to an NVMe disk in order to emulate a slower disk.
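The delay target is provided by the dm_delay kernel module; dmsetup usually loads it on demand, but if you are unsure whether your kernel has it, you can load the module and confirm the target is registered:

# load the module (a no-op if it is built in or already loaded) and list registered targets
$ sudo modprobe dm-delay
$ sudo dmsetup targets | grep delay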

NVMe latency

First, we run an fio read test on the raw NVMe disk to understand the baseline latency.

[root@perf-vm2 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@perf-vm2 ~]# fio --blocksize=4k --ioengine=libaio --readwrite=read --filesize=5G --group_reporting --direct=1 --iodepth=128 --name=job1 --filename=/dev/sdc
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=944MiB/s,w=0KiB/s][r=242k,w=0 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=2966: Fri May 19 18:27:48 2023
   read: IOPS=245k, BW=956MiB/s (1002MB/s)(5120MiB/5356msec)
    slat (nsec): min=1465, max=465769, avg=2633.66, stdev=11980.49
    clat (usec): min=125, max=1941, avg=520.06, stdev=107.83
     lat (usec): min=145, max=1947, avg=522.76, stdev=107.17
    clat percentiles (usec):
     |  1.00th=[  306],  5.00th=[  429], 10.00th=[  478], 20.00th=[  486],
     | 30.00th=[  490], 40.00th=[  490], 50.00th=[  498], 60.00th=[  523],
     | 70.00th=[  529], 80.00th=[  529], 90.00th=[  570], 95.00th=[  676],
     | 99.00th=[  881], 99.50th=[ 1434], 99.90th=[ 1565], 99.95th=[ 1663],
     | 99.99th=[ 1811]
   bw (  KiB/s): min=848840, max=1047024, per=100.00%, avg=980152.50, stdev=56118.97, samples=10
   iops        : min=212210, max=261756, avg=245038.10, stdev=14029.75, samples=10
  lat (usec)   : 250=0.01%, 500=52.21%, 750=45.61%, 1000=1.31%
  lat (msec)   : 2=0.86%
  cpu          : usr=11.15%, sys=71.11%, ctx=9203, majf=0, minf=162
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1310720,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=956MiB/s (1002MB/s), 956MiB/s-956MiB/s (1002MB/s-1002MB/s), io=5120MiB (5369MB), run=5356-5356msec

Disk stats (read/write):
  sdc: ios=1261305/0, merge=0/0, ticks=317744/0, in_queue=317668, util=89.88%

While fio is running, we issue the command “iostat -ktdx 1 sdc” to monitor the latency. It shows that the read latency (r_await) is ~0.25ms.

[root@perf-vm2 ~]# iostat -ktdx 1 sdc
05/19/2023 06:27:46 PM
Device:         rrqm/s   wrqm/s       r/s     w/s    rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00 251770.00    0.00 1007080.00     0.00     8.00    64.04    0.25    0.25    0.00   0.00  99.00

Create a dm-delay target

Now we use the dmsetup utility to create a delay target layered on top of the raw NVMe disk. We specify the read delay as 50ms.

[root@perf-vm2 ~]# size=$(blockdev --getsize /dev/sdc)
[root@perf-vm2 ~]# echo $size
209715200

[root@perf-vm2 ~]# echo "0 $size delay /dev/sdc 0 50" | dmsetup create delayed
[root@perf-vm2 ~]# dmsetup table delayed
0 209715200 delay 8:32 0 50
[root@perf-vm2 ~]# ls -la /dev/mapper/ | grep delayed
lrwxrwxrwx  1 root root       7 May 19 18:37 delayed -> ../dm-3
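Each field in the table maps to the delay target’s parameters: the logical start sector (0), the length in sectors ($size), the target name (delay), and then the backing device, its offset in sectors, and the delay in milliseconds. In the dmsetup table output the backing device shows up as its major:minor pair, so 8:32 here is /dev/sdc, which you can confirm with lsblk:

$ lsblk -o NAME,MAJ:MIN /dev/sdc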

Check the latency of dm-delay

We can now run fio against the dm-delay device and check the latency it introduces while the job is running.

[root@perf-vm2 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@perf-vm2 ~]# fio --blocksize=4k --ioengine=libaio --readwrite=read --filesize=5G --group_reporting --direct=1 --iodepth=128 --name=job1 --filename=/dev/dm-3
job1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.7
Starting 1 process
^Cbs: 1 (f=1): [R(1)][4.3%][r=10.0MiB/s,w=0KiB/s][r=2562,w=0 IOPS][eta 08m:13s]
fio: terminating on signal 2
Jobs: 1 (f=1): [R(1)][4.5%][r=10.0MiB/s,w=0KiB/s][r=2562,w=0 IOPS][eta 08m:12s]
job1: (groupid=0, jobs=1): err= 0: pid=3138: Fri May 19 18:39:39 2023
   read: IOPS=2559, BW=9.00MiB/s (10.5MB/s)(229MiB/22905msec)
    slat (nsec): min=915, max=132881, avg=3115.23, stdev=3538.89
    clat (usec): min=49420, max=52624, avg=50006.18, stdev=201.19
     lat (usec): min=49424, max=52628, avg=50009.46, stdev=200.86
    clat percentiles (usec):
     |  1.00th=[49546],  5.00th=[49546], 10.00th=[49546], 20.00th=[50070],
     | 30.00th=[50070], 40.00th=[50070], 50.00th=[50070], 60.00th=[50070],
     | 70.00th=[50070], 80.00th=[50070], 90.00th=[50070], 95.00th=[50070],
     | 99.00th=[50594], 99.50th=[51119], 99.90th=[51119], 99.95th=[52691],
     | 99.99th=[52691]
   bw (  KiB/s): min= 9632, max=10520, per=99.81%, avg=10217.24, stdev=117.17, samples=45
   iops        : min= 2408, max= 2630, avg=2554.31, stdev=29.29, samples=45
  lat (msec)   : 50=52.82%, 100=47.18%
  cpu          : usr=0.11%, sys=1.20%, ctx=2973, majf=0, minf=161
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=58624,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=9.00MiB/s (10.5MB/s), 9.00MiB/s-9.00MiB/s (10.5MB/s-10.5MB/s), io=229MiB (240MB), run=22905-22905msec

Disk stats (read/write):
    dm-3: ios=58624/0, merge=0/0, ticks=2925349/0, in_queue=2930643, util=99.66%, aggrios=58624/0, aggrmerge=0/0, aggrticks=65/0, aggrin_queue=65, aggrutil=0.06%
  sdc: ios=58624/0, merge=0/0, ticks=65/0, in_queue=65, util=0.06%

We can see that the r_await time is 50ms on dm-3, which is the dm-delay target on top of sdc.

[root@perf-vm2 ~]# iostat -ktdx 1
05/19/2023 06:39:35 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00 2560.00    0.00 10240.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00 2560.00    0.00 10240.00     0.00     8.00   128.00   50.00   50.00    0.00   0.39 100.00

We can also specify a separate write latency (e.g. 100ms) by appending a second device/offset/delay triple, as below.
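Note that dmsetup create will not reuse an existing device name, so if the delayed device from the previous step is still present, remove it first (assuming nothing is using it):

$ sudo dmsetup remove delayed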

[root@perf-vm2 ~]# echo "0 $size delay /dev/sdc 0 50 /dev/sdc 0 100" | dmsetup create delayed

[root@perf-vm2 ~]# dmsetup table delayed
0 209715200 delay 8:32 0 50 8:32 0 100

[root@perf-vm2 ~]# ls -la /dev/mapper/| grep delayed
lrwxrwxrwx  1 root root       7 May 19 18:41 delayed -> ../dm-3

Then we run an fio write test against the dm-delay target.

[root@perf-vm2 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@perf-vm2 ~]# fio --blocksize=4k --ioengine=libaio --readwrite=write --filesize=5G --group_reporting --direct=1 --iodepth=128 --name=job1 --filename=/dev/dm-3
job1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.7
Starting 1 process
^Cbs: 1 (f=1): [W(1)][0.9%][r=0KiB/s,w=5125KiB/s][r=0,w=1281 IOPS][eta 17m:25s]
fio: terminating on signal 2

job1: (groupid=0, jobs=1): err= 0: pid=3210: Fri May 19 18:42:36 2023
  write: IOPS=1259, BW=5039KiB/s (5160kB/s)(53.0MiB/10772msec)
    slat (nsec): min=1771, max=49839k, avg=13004.20, stdev=591607.93
    clat (msec): min=99, max=174, avg=100.68, stdev= 6.72
     lat (msec): min=99, max=174, avg=100.69, stdev= 6.75
    clat percentiles (msec):
     |  1.00th=[  100],  5.00th=[  101], 10.00th=[  101], 20.00th=[  101],
     | 30.00th=[  101], 40.00th=[  101], 50.00th=[  101], 60.00th=[  101],
     | 70.00th=[  101], 80.00th=[  101], 90.00th=[  101], 95.00th=[  101],
     | 99.00th=[  102], 99.50th=[  169], 99.90th=[  174], 99.95th=[  174],
     | 99.99th=[  176]
   bw (  KiB/s): min= 3352, max= 5416, per=99.91%, avg=5033.62, stdev=396.81, samples=21
   iops        : min=  838, max= 1354, avg=1258.38, stdev=99.20, samples=21
  lat (msec)   : 100=52.92%, 250=47.08%
  cpu          : usr=0.29%, sys=0.74%, ctx=1123, majf=0, minf=30
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.5%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,13569,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=5039KiB/s (5160kB/s), 5039KiB/s-5039KiB/s (5160kB/s-5160kB/s), io=53.0MiB (55.6MB), run=10772-10772msec

Disk stats (read/write):
    dm-3: ios=41/13568, merge=0/0, ticks=2118/1344129, in_queue=1351987, util=100.00%, aggrios=42/13569, aggrmerge=0/0, aggrticks=18/25, aggrin_queue=43, aggrutil=0.20%
  sdc: ios=42/13569, merge=0/0, ticks=18/25, in_queue=43, util=0.20%

The write latency (w_await) on dm-3 is 100ms, as the iostat output below shows.

[root@perf-vm2 ~]# iostat -ktdx 1
05/19/2023 06:42:33 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00 1280.00     0.00  5120.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00 1280.00     0.00  5120.00     8.00   128.00  100.00    0.00  100.00   0.78 100.00

Suspending I/Os

The device mapper can also suspend and resume I/O to a target. While the device is suspended, incoming I/O is queued; the iostat output below shows the throughput on dm-3 dropping to zero during the suspend and recovering after the resume.

[root@perf-vm2 ~]# dmsetup suspend /dev/dm-3
[root@perf-vm2 ~]# dmsetup resume /dev/dm-3


[root@perf-vm2 ~]# iostat -ktdx 1
05/19/2023 06:43:55 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00 1073.00     0.00  4292.00     8.00     0.07    0.07    0.00    0.07   0.00   0.20
dm-3              0.00     0.00    0.00  945.00     0.00  3780.00     8.00    78.85   94.44    0.00   94.44   0.65  61.70

05/19/2023 06:43:56 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


05/19/2023 06:44:06 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00 1280.00     0.00  5120.00     8.00     0.08    0.06    0.00    0.06   0.02   2.80
dm-3              0.00     0.00    0.00 1280.00     0.00  5120.00     8.00   127.65  100.46    0.00  100.46   0.78 100.00
^C
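One practical use of suspend and resume is changing the delay without tearing the device down: load a new table into the inactive slot with dmsetup reload, then resume to swap it in. A sketch, assuming the delayed device and $size from earlier and a new 200ms read delay:

$ sudo dmsetup suspend delayed
$ sudo dmsetup reload delayed --table "0 $size delay /dev/sdc 0 200"
$ sudo dmsetup resume delayed
$ sudo dmsetup table delayed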

Create a ramdisk

If you want to emulate a fast disk without using NVMe, a ramdisk (a RAM-backed block device) can be used. Its size is capped by the amount of physical RAM.

# load the brd kernel module. rd_nr is the maximum number of ramdisks. rd_size is the ramdisk size in KB.
$ sudo modprobe brd rd_nr=1 rd_size=1048576

$ ls -l /dev/ram0
brw-rw---- 1 root disk 1, 0 Aug 24 20:00 /dev/ram0

$ sudo blockdev --getsize /dev/ram0 # Display the size in 512-byte sectors
2097152
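
The same dm-delay recipe works on top of the ramdisk, which lets you emulate a slow disk without any NVMe hardware at all. A sketch, assuming the /dev/ram0 created above and a 50ms read delay:

$ size=$(sudo blockdev --getsize /dev/ram0)
$ echo "0 $size delay /dev/ram0 0 50" | sudo dmsetup create delayed-ram
$ sudo dmsetup table delayed-ram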
