Understanding NVMe Internal Queue Priority and Linux Support
NVMe (Non-Volatile Memory Express) is designed for high-performance SSDs connected via PCIe, with deep parallelism and a powerful queuing model. One of its less widely used features is internal queue priority, defined by the NVMe specification, which allows the controller to fetch commands from some submission queues ahead of others.
This blog explores the introduction of NVMe internal queue priority, its history, how to use it (when supported), the Linux kernel patches that attempted to expose this functionality, and the reasons why those efforts ultimately failed.
1. What Is NVMe Internal Queue Priority?
NVMe supports up to 65,535 I/O submission and completion queues, and the specification allows each submission queue to be assigned a priority level when it is created. The Create I/O Submission Queue command carries a Queue Priority (QPRIO) field (bits 02:01 of Command Dword 11) that indicates the priority class of the queue:
- 00b: Urgent
- 01b: High
- 10b: Medium
- 11b: Low
These priority values are meaningful only if the controller supports, and has been configured to use, the optional Weighted Round Robin with Urgent Priority Class (WRR) arbitration mechanism; under the default Round Robin arbitration they are ignored.
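As an illustration of how this maps onto the command format, the sketch below assembles the two command dwords of a Create I/O Submission Queue command for a chosen priority class, following the field layout described above. The helper and constants are local to this example and are not a kernel or library API.

    /* Illustrative only: encode QID/QSIZE and PC/QPRIO/CQID into CDW10/CDW11 of a
     * Create I/O Submission Queue command, per the layout described in the text.
     * Names below are local to this sketch. */
    #include <stdint.h>
    #include <stdio.h>

    enum { QPRIO_URGENT = 0, QPRIO_HIGH = 1, QPRIO_MEDIUM = 2, QPRIO_LOW = 3 };

    static void build_create_sq(uint16_t qid, uint16_t qsize_entries, uint16_t cqid,
                                unsigned int qprio, uint32_t *cdw10, uint32_t *cdw11)
    {
        *cdw10 = ((uint32_t)(qsize_entries - 1) << 16) | qid; /* QSIZE (0's based) | QID */
        *cdw11 = ((uint32_t)cqid << 16)   /* paired completion queue */
               | (qprio << 1)             /* QPRIO, bits 02:01 */
               | 0x1;                     /* PC: queue is physically contiguous */
    }

    int main(void)
    {
        uint32_t cdw10, cdw11;

        /* A hypothetical 1024-entry high-priority queue, paired with CQ 1. */
        build_create_sq(1, 1024, 1, QPRIO_HIGH, &cdw10, &cdw11);
        printf("CDW10=0x%08x CDW11=0x%08x\n", cdw10, cdw11);
        return 0;
    }

In practice the kernel driver issues this command while setting up its I/O queues, so choosing a QPRIO requires driver involvement rather than a user-space call.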
Several studies, such as Shashank Gugnani et al. (UCC 2018) and Joshi et al. (HotStorage 2017), highlight the performance impact and potential benefits of command prioritization on NVMe SSDs.
2. History of Queue Priority in NVMe
The queue priority field has existed in the NVMe specification from early versions, but hardware adoption and software support were limited. In particular:
- Most consumer NVMe SSDs default to Round Robin (RR) arbitration.
- Only some enterprise SSDs implement WRR arbitration, enabling them to honor queue priority.
The Linux kernel’s NVMe subsystem initially did not expose any mechanism to allow user-space or higher-level schedulers to utilize this feature.
Presentations such as LinuxFast Vault 2017 discussed the scalability limits of NVMe under contention and hinted at the usefulness of smarter queuing.
3. How to Use Queue Priority (If Your Device Supports It)
To check whether your NVMe controller supports WRR, inspect the Arbitration Mechanism Supported (AMS) field of the controller's CAP register, for example with:
nvme show-regs -H /dev/nvme0
Look for:
An AMS line in the decoded CAP output indicating that Weighted Round Robin with Urgent Priority Class is supported.
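If you prefer to decode the raw CAP value yourself (nvme show-regs prints it in hex), the small sketch below extracts the AMS bits. It assumes the CAP.AMS layout (bits 18:17) from the base specification and is illustrative only.

    /* Illustrative decoder for the Arbitration Mechanism Supported (AMS) bits
     * of the NVMe CAP register (bits 18:17). Round Robin is always supported
     * and is therefore not reported in AMS. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void print_ams(uint64_t cap)
    {
        unsigned int ams = (unsigned int)((cap >> 17) & 0x3);

        printf("Weighted Round Robin with Urgent Priority Class: %s\n",
               (ams & 0x1) ? "supported" : "not supported");
        printf("Vendor specific arbitration: %s\n",
               (ams & 0x2) ? "supported" : "not supported");
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <cap-register-value-in-hex>\n", argv[0]);
            return 1;
        }
        print_ams(strtoull(argv[1], NULL, 16));
        return 0;
    }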
You can configure the WRR weights and arbitration burst through the Arbitration feature (Feature ID 01h), for example:
nvme set-feature /dev/nvme0 --feature-id=1 --value=0x20080107
Here Dword 11 packs the high, medium, and low priority weights into its top three bytes and the arbitration burst into bits 02:00. Note that this does not switch the controller into WRR mode: the arbitration mechanism itself is selected through the CC.AMS field when the controller is enabled, which on Linux is done by the NVMe driver. There is also no built-in user-space method in Linux to assign priorities to the driver's I/O submission queues; that would require a kernel modification.
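Equivalently, the Arbitration feature can be programmed from a small C program through the kernel's admin passthrough ioctl (NVME_IOCTL_ADMIN_CMD). The sketch below uses /dev/nvme0 and arbitrary example weights, needs root privileges, and is an illustration rather than a hardened tool.

    /* Minimal sketch: issue Set Features (Arbitration, FID 01h) via the NVMe
     * admin passthrough ioctl. This programs the WRR weights and burst; it does
     * not select the arbitration mechanism (CC.AMS), which the driver sets when
     * enabling the controller. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme0", O_RDWR);   /* controller character device */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Arbitration feature, Dword 11: HPW[31:24] MPW[23:16] LPW[15:8] AB[2:0].
         * Weights are example values; the spec treats them as 0's based. */
        uint32_t hpw = 0x20, mpw = 0x08, lpw = 0x01, burst = 0x7; /* 111b: no burst limit */

        struct nvme_admin_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x09;                     /* Set Features */
        cmd.cdw10  = 0x01;                     /* Feature ID 01h: Arbitration */
        cmd.cdw11  = (hpw << 24) | (mpw << 16) | (lpw << 8) | burst;

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
            perror("NVME_IOCTL_ADMIN_CMD");
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }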
In experimental setups, such as those in Woo et al. (FAST '21), software-defined QoS using NVMe arbitration was evaluated by directly manipulating the arbitration fields for research purposes.
4. Linux Patches Attempting to Add Queue Priority Support
Several developers attempted to introduce internal queue priority into the Linux NVMe stack:
Patch (2018): Set Priority Bits in CDW0
- Author: Jianchao Wang
- Link: linux-nvme ML archive
- Goal: Allow per-command priority bits to be set from I/O context.
Patchset (2019–2020): Add Weighted Round Robin Queue Support
- Author: Weiping Zhang
- Patchset: “[PATCH v4 0/5] Add support Weighted Round Robin for blkcg and nvme”
- Features:
  - Four queue levels: urgent, high, medium, low
  - Module parameters (wrr_urgent_queues, etc.)
  - Controller arbitration settings adjusted dynamically
5. Why These Patches Failed
Despite the technical merit, both approaches were rejected or stalled in the kernel community. Here’s why:
1. Lack of Hardware Adoption
Most NVMe SSDs (especially consumer ones) do not implement WRR arbitration. Supporting queue priority would only help a small number of enterprise devices.
2. Kernel Complexity and Maintainability
Adding per-command priority meant changes deep in the block layer and NVMe driver. The patches added module parameters, queue allocation logic, and scheduling complexity.
3. Conflicts with Existing Schedulers
Linux I/O schedulers (BFQ, mq-deadline, etc.) may reorder or delay I/O, undermining command-level priority semantics. There was no clean integration.
4. Spec Limitations
The NVMe spec defines arbitration as a command fetch priority, not an I/O completion or QoS mechanism. It cannot guarantee service levels or isolation. This gap between what WRR can do and what users expect from "prioritization" was noted in research such as Joshi et al. (HotStorage '17).
5. Policy vs Mechanism Philosophy
Kernel maintainers generally avoid embedding policy in drivers. Command priority decisions are better suited for user space, but the interface to do this cleanly didn’t exist.
Conclusion
NVMe internal queue priority is a powerful but underused feature. It relies on hardware support for WRR arbitration and appropriate software hooks to be useful. While several attempts were made to support this in Linux, none were merged due to hardware limitations, integration complexity, and philosophical concerns about policy.
For now, if you want to take advantage of this feature:
- Use nvme-cli to configure arbitration settings (if your device supports WRR).
- Consider kernel customization if working with specific hardware in a performance-critical environment.
As SSDs evolve and if WRR gains traction in more controllers, future kernel versions may revisit this capability.