Optimizing NVMe performance with dm-crypt
I recently switched over from Windows 7 to Debian Linux on my main desktop, and along the way decided to set up full disk encryption with dm-crypt, since support seemed good, so why not? At the very least it would save me any worry if the drive died and I had to throw it out. In general I lean toward ‘performance’ rather than ‘paranoid’ settings when it comes to tweaks. I basically accepted Debian’s default options in the installer and went for it. I also ended up with LVM-on-dm-crypt for the flexibility of multiple partitions. The drive ended up looking like this (lsblk output):
nvme0n1 259:0 0 1.8T 0 disk
├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
├─nvme0n1p2 259:2 0 488M 0 part /boot
└─nvme0n1p3 259:3 0 1.8T 0 part
└─nvme1n1p3_crypt 254:1 0 1.8T 0 crypt
├─m2-root 254:3 0 32G 0 lvm /
├─m2-swap 254:6 0 32G 0 lvm [SWAP]
└─m2-home 254:7 0 1.5T 0 lvm /home
The problem came when I started running IO-heavy workloads, generally loading large AI models from disk. These only ran at about 1000 MB/s, far short of the 7450 MB/s promised on the box. I know Samsung probably exaggerates those numbers, but not by that much. Unfortunately I can’t remember what I got when testing under Windows 7.
I turned on all the performance options in /etc/crypttab, so the line for the NVMe ends up looking like:
nvme1n1p3_crypt UUID=21d6f689-0bcf-4c4e-ae81-0181c74bcd2f none luks,discard,no-read-workqueue,no-write-workqueue
Again, I have no saved benchmarks from before, but there wasn’t a noticeable improvement. Here’s what things looked like afterwards (from kdiskmark, 1 GiB test):
[Read]
Sequential 1 MiB (Q= 8, T= 1): 7096.145 MB/s [ 6929.8 IOPS] < 1145.94 us>
Sequential 1 MiB (Q= 1, T= 1): 1130.209 MB/s [ 1103.7 IOPS] < 905.32 us>
Random 4 KiB (Q= 32, T= 1): 734.142 MB/s [ 183535.6 IOPS] < 173.98 us>
Random 4 KiB (Q= 1, T= 1): 78.616 MB/s [ 19654.2 IOPS] < 50.46 us>
[Write]
Sequential 1 MiB (Q= 8, T= 1): 1299.046 MB/s [ 1268.6 IOPS] < 6146.53 us>
Sequential 1 MiB (Q= 1, T= 1): 1093.343 MB/s [ 1067.7 IOPS] < 854.56 us>
Random 4 KiB (Q= 32, T= 1): 413.172 MB/s [ 103293.0 IOPS] < 308.34 us>
Random 4 KiB (Q= 1, T= 1): 210.846 MB/s [ 52711.6 IOPS] < 18.76 us>
Note that the queue depth 8 sequential test actually comes pretty close to the advertised speed, but the Q=1 test is more like what I was seeing when running actual large loads from disk.
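KDiskMark is a frontend to fio, so if you want to reproduce that Q=1 sequential read test from a terminal, something like this should be roughly equivalent (the test file path is just a placeholder):
# fio --name=seqread --filename=/home/test/fio.tmp --rw=read --bs=1M --iodepth=1 --ioengine=libaio --direct=1 --size=1G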
For comparison I ran a (256 MiB) test on the /boot partition, which is outside the dm-crypt volume:
[Read]
Sequential 1 MiB (Q= 8, T= 1): 6863.149 MB/s [ 6702.3 IOPS] < 1164.45 us>
Sequential 1 MiB (Q= 1, T= 1): 3559.749 MB/s [ 3476.3 IOPS] < 285.60 us>
Random 4 KiB (Q= 32, T= 1): 971.108 MB/s [ 242777.2 IOPS] < 131.33 us>
Random 4 KiB (Q= 1, T= 1): 66.923 MB/s [ 16730.9 IOPS] < 59.35 us>
[Write]
Sequential 1 MiB (Q= 8, T= 1): 4557.689 MB/s [ 4450.9 IOPS] < 1631.63 us>
Sequential 1 MiB (Q= 1, T= 1): 2848.125 MB/s [ 2781.4 IOPS] < 250.41 us>
Random 4 KiB (Q= 32, T= 1): 280.971 MB/s [ 70242.7 IOPS] < 453.73 us>
Random 4 KiB (Q= 1, T= 1): 264.417 MB/s [ 66104.4 IOPS] < 14.64 us>
The read speeds are mostly the same, except that the important sequential Q=1 benchmark is over 3 times faster! I did a bit of looking around: Cloudflare has a good writeup on optimizing disk encryption speed, and I also found a useful Reddit post. It seems the sector size dm-crypt uses is important. It may be using 512-byte sectors even though the filesystem sitting on top (ext4 in this case) uses 4096-byte blocks, so we never issue I/O smaller than that anyway. Let’s check what size nvme1n1p3_crypt is using:
# cryptsetup luksDump /dev/nvme0n1p3
[...]
Data segments:
  0: crypt
        offset: 16777216 [bytes]
        length: (whole device)
        cipher: aes-xts-plain64
        sector: 512 [bytes]
Keyslots:
  0: luks2
        Key:        512 bits
        Priority:   normal
        Cipher:     aes-xts-plain64
        Cipher key: 512 bits
[...]
We’re using a sector size of 512 bytes. It’s also using AES-256 in XTS mode, which is why the key is listed at 512 bits: XTS uses two keys, so the effective AES key size is half the listed size.
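To confirm the ext4 block size on top, tune2fs can read it off one of the logical volumes (using /dev/mapper/m2-root from my layout above):
# tune2fs -l /dev/mapper/m2-root | grep 'Block size'
Block size:               4096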
I set up a test using my secondary NVMe. This one is older and slower (a Samsung 980 Pro vs. the 990), and it sits in the 2nd M.2 slot, which means it shares 4 PCIe lanes with the network interfaces, SATA devices, etc. instead of having its own dedicated 4 lanes to the CPU.
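If you’re curious, the negotiated PCIe link for each drive can be read from sysfs (adjust nvme1 to whichever controller you’re checking):
# cat /sys/class/nvme/nvme1/device/current_link_speed
# cat /sys/class/nvme/nvme1/device/current_link_width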
I formatted this guy using the above options, but with AES-128 to see if there was any appreciable difference. This is the sequence of commands; note that I’m reading the password from a file (which itself is on the encrypted /home drive).
# cryptsetup luksFormat --key-size=256 /dev/nvme1n1p2 < /my/password/file
# cryptsetup open /dev/nvme1n1p2 test_crypt < /my/password/file
# cryptsetup refresh --perf-no_read_workqueue --perf-no_write_workqueue --allow-discards test_crypt
# mkfs.ext4 /dev/mapper/test_crypt
# mount /dev/mapper/test_crypt /mnt/test/
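As a sanity check, cryptsetup status shows the sector size, key bits, and performance flags on the open mapping:
# cryptsetup status test_crypt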
Results:
[Read]
Sequential 1 MiB (Q= 1, T= 1): 1126.740 MB/s [ 1100.3 IOPS] < 908.13 us>
A little faster, maybe from the smaller key size, but largely the same as above. Let’s try reformatting it with a sector size of 4096 using the --sector-size=4096 option to cryptsetup luksFormat.
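Concretely, the reformat command looks like this (same placeholder password file as before):
# cryptsetup luksFormat --key-size=256 --sector-size=4096 /dev/nvme1n1p2 < /my/password/file
Results: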
[Read]
Sequential 1 MiB (Q= 1, T= 1): 1994.320 MB/s [ 1947.6 IOPS] < 512.92 us>
Nearly a 2x speed-up versus the original case! Obviously this is the way to go. For comparison, I also tried a sector size of 4096 with AES-256, to see how much of an effect the key size has:
[Read]
Sequential 1 MiB (Q= 1, T= 1): 1768.740 MB/s [ 1727.3 IOPS] < 586.19 us>
A little slower, but the larger sector size still wins even with AES-256.
So, let’s change the key size and sector size. Thanks to the wonderful cryptsetup, we can actually do this online, on the boot drive, while using it! It even tolerates interruptions (not that I wanted to test this, it’s scary…). The command we want:
# cryptsetup reencrypt --key-size=256 --sector-size=4096 /dev/nvme0n1p3
Enter passphrase for key slot 0:
Auto-detected active dm device 'nvme1n1p3_crypt' for data device /dev/nvme0n1p3.
Finished, time 51m22s, 1862 GiB written, speed 618.6 MiB/s
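Running luksDump again should now show the new parameters, e.g.:
# cryptsetup luksDump /dev/nvme0n1p3 | grep -E 'sector|Cipher key'
which I’d expect to report sector: 4096 [bytes] and Cipher key: 256 bits, per the dump format above.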
After an fstrim and a reboot, we end up with:
[Read]
Sequential 1 MiB (Q= 8, T= 1): 7251.646 MB/s [ 7081.7 IOPS] < 1121.95 us>
Sequential 1 MiB (Q= 1, T= 1): 1975.299 MB/s [ 1929.0 IOPS] < 519.47 us>
Random 4 KiB (Q= 32, T= 1): 743.511 MB/s [ 185877.8 IOPS] < 171.88 us>
Random 4 KiB (Q= 1, T= 1): 84.115 MB/s [ 21028.8 IOPS] < 47.20 us>
[Write]
Sequential 1 MiB (Q= 8, T= 1): 3065.347 MB/s [ 2993.5 IOPS] < 2532.55 us>
Sequential 1 MiB (Q= 1, T= 1): 2050.924 MB/s [ 2002.9 IOPS] < 413.80 us>
Random 4 KiB (Q= 32, T= 1): 477.777 MB/s [ 119444.3 IOPS] < 266.06 us>
Random 4 KiB (Q= 1, T= 1): 244.687 MB/s [ 61171.9 IOPS] < 15.65 us>
Alright, so not quite double, but an improvement all around, for free! Write performance also increased greatly. Still short of the unencrypted numbers, but pretty quick. The Q=8 read test is actually close to the roughly 8 GB/s limit of the PCIe 4.0 x4 link (16 GT/s per lane × 4 lanes with 128b/130b encoding ≈ 7.9 GB/s), so I call that pretty good.
In summary, if you’re using NVMe drives with dm-crypt, for max performance I would (see the consolidated example below):
- Make sure dm-crypt is using a sector size of 4096 bytes (or whatever block size your filesystem uses, if different)
- Use a key size of 256 bits (i.e. AES-128, since XTS splits the key)
- Enable TRIM through the dm-crypt volume with the --allow-discards flag to cryptsetup, or the discard option in /etc/crypttab
- Disable workqueues with the --perf-no_read_workqueue and --perf-no_write_workqueue flags, or the no-read-workqueue and no-write-workqueue options. YMMV with this one though, so test both (cryptsetup refresh can change it online)
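Putting that together for a fresh (non-root) volume, with the device, mapping name, and UUID as placeholders:
# cryptsetup luksFormat --key-size=256 --sector-size=4096 /dev/nvmeXn1pY
# cryptsetup open --allow-discards --perf-no_read_workqueue --perf-no_write_workqueue /dev/nvmeXn1pY my_crypt
or, for a volume opened at boot, the matching /etc/crypttab line:
my_crypt UUID=<volume-uuid> none luks,discard,no-read-workqueue,no-write-workqueue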
The security implications of the above are left as an exercise for the reader, but personally I’m OK with it.