Understanding NVMe Internal Queue Priority and Linux Support
NVMe (Non-Volatile Memory Express) is designed for high-performance SSDs connected via PCIe, with deep parallelism and a powerful queuing model. One of its less widely used features is internal queue priority, defined by the NVMe specification, which allows the controller to fetch commands from some submission queues ahead of others.
This blog explores the introduction of NVMe internal queue priority, its history, how to use it (when supported), the Linux kernel patches that attempted to expose this functionality, and the reasons why those efforts ultimately failed.
1. What Is NVMe Internal Queue Priority?
NVMe supports up to 65,535 I/O submission and completion queues, and the specification allows each submission queue to be assigned a priority level when it is created. The Create I/O Submission Queue command carries a Queue Priority (QPRIO) field (bits 02:01 of Command Dword 11) that indicates the priority class of the queue:
- 00b: Urgent
- 01b: High
- 10b: Medium
- 11b: Low
These priority values are meaningful only if the controller supports, and has been configured to use, the optional Weighted Round Robin with Urgent Priority Class (WRR) arbitration mechanism; under the default Round Robin arbitration they are ignored.
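As an illustration of how this maps onto the command format, the sketch below assembles the two command dwords of a Create I/O Submission Queue command for a chosen priority class, following the field layout described above. The helper and constants are local to this example and are not a kernel or library API.

    /* Illustrative only: encode QID/QSIZE and PC/QPRIO/CQID into CDW10/CDW11 of a
     * Create I/O Submission Queue command, per the layout described in the text.
     * Names below are local to this sketch. */
    #include <stdint.h>
    #include <stdio.h>

    enum { QPRIO_URGENT = 0, QPRIO_HIGH = 1, QPRIO_MEDIUM = 2, QPRIO_LOW = 3 };

    static void build_create_sq(uint16_t qid, uint16_t qsize_entries, uint16_t cqid,
                                unsigned int qprio, uint32_t *cdw10, uint32_t *cdw11)
    {
        *cdw10 = ((uint32_t)(qsize_entries - 1) << 16) | qid; /* QSIZE (0's based) | QID */
        *cdw11 = ((uint32_t)cqid << 16)   /* paired completion queue */
               | (qprio << 1)             /* QPRIO, bits 02:01 */
               | 0x1;                     /* PC: queue is physically contiguous */
    }

    int main(void)
    {
        uint32_t cdw10, cdw11;

        /* A hypothetical 1024-entry high-priority queue, paired with CQ 1. */
        build_create_sq(1, 1024, 1, QPRIO_HIGH, &cdw10, &cdw11);
        printf("CDW10=0x%08x CDW11=0x%08x\n", cdw10, cdw11);
        return 0;
    }

In practice the kernel driver issues this command while setting up its I/O queues, so choosing a QPRIO requires driver involvement rather than a user-space call.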
Several studies, such as Shashank Gugnani et al. (UCC 2018) and Joshi et al. (HotStorage 2017), highlight the performance impact and potential benefits of command prioritization on NVMe SSDs.
2. History of Queue Priority in NVMe
The queue priority field has existed in the NVMe specification from early versions, but hardware adoption and software support were limited. In particular:
- Most consumer NVMe SSDs default to Round Robin (RR) arbitration.
- Only some enterprise SSDs implement WRR arbitration, enabling them to honor queue priority.
The Linux kernel’s NVMe subsystem initially did not expose any mechanism to allow user-space or higher-level schedulers to utilize this feature.
Presentations such as LinuxFast Vault 2017 discussed the scalability limits of NVMe under contention and hinted at the usefulness of smarter queuing.
3. How to Use Queue Priority (If Your Device Supports It)
To check whether your NVMe controller supports WRR, inspect the Arbitration Mechanism Supported (AMS) field of the controller's CAP register, for example with:
nvme show-regs -H /dev/nvme0
Look for:
An AMS line in the decoded CAP output indicating that Weighted Round Robin with Urgent Priority Class is supported.
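If you prefer to decode the raw CAP value yourself (nvme show-regs prints it in hex), the small sketch below extracts the AMS bits. It assumes the CAP.AMS layout (bits 18:17) from the base specification and is illustrative only.

    /* Illustrative decoder for the Arbitration Mechanism Supported (AMS) bits
     * of the NVMe CAP register (bits 18:17). Round Robin is always supported
     * and is therefore not reported in AMS. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void print_ams(uint64_t cap)
    {
        unsigned int ams = (unsigned int)((cap >> 17) & 0x3);

        printf("Weighted Round Robin with Urgent Priority Class: %s\n",
               (ams & 0x1) ? "supported" : "not supported");
        printf("Vendor specific arbitration: %s\n",
               (ams & 0x2) ? "supported" : "not supported");
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <cap-register-value-in-hex>\n", argv[0]);
            return 1;
        }
        print_ams(strtoull(argv[1], NULL, 16));
        return 0;
    }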
You can configure the WRR weights and arbitration burst through the Arbitration feature (Feature ID 01h), for example:
nvme set-feature /dev/nvme0 --feature-id=1 --value=0x20080107
Here Dword 11 packs the high, medium, and low priority weights into its top three bytes and the arbitration burst into bits 02:00. Note that this does not switch the controller into WRR mode: the arbitration mechanism itself is selected through the CC.AMS field when the controller is enabled, which on Linux is done by the NVMe driver. There is also no built-in user-space method in Linux to assign priorities to the driver's I/O submission queues; that would require a kernel modification.
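Equivalently, the Arbitration feature can be programmed from a small C program through the kernel's admin passthrough ioctl (NVME_IOCTL_ADMIN_CMD). The sketch below uses /dev/nvme0 and arbitrary example weights, needs root privileges, and is an illustration rather than a hardened tool.

    /* Minimal sketch: issue Set Features (Arbitration, FID 01h) via the NVMe
     * admin passthrough ioctl. This programs the WRR weights and burst; it does
     * not select the arbitration mechanism (CC.AMS), which the driver sets when
     * enabling the controller. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme0", O_RDWR);   /* controller character device */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Arbitration feature, Dword 11: HPW[31:24] MPW[23:16] LPW[15:8] AB[2:0].
         * Weights are example values; the spec treats them as 0's based. */
        uint32_t hpw = 0x20, mpw = 0x08, lpw = 0x01, burst = 0x7; /* 111b: no burst limit */

        struct nvme_admin_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x09;                     /* Set Features */
        cmd.cdw10  = 0x01;                     /* Feature ID 01h: Arbitration */
        cmd.cdw11  = (hpw << 24) | (mpw << 16) | (lpw << 8) | burst;

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
            perror("NVME_IOCTL_ADMIN_CMD");
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }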
In experimental setups, such as those in Woo et al. (FAST '21), software-defined QoS using NVMe arbitration was evaluated by directly manipulating the arbitration fields for research purposes.
4. Linux Patches Attempting to Add Queue Priority Support
Several developers attempted to introduce internal queue priority into the Linux NVMe stack:
Patch (2018): Set Priority Bits in CDW0
- Author: Jianchao Wang
- Link: linux-nvme ML archive
- Goal: Allow per-command priority bits to be set from I/O context.
Patchset (2019–2020): Add Weighted Round Robin Queue Support
- Author: Weiping Zhang
- Patchset: “[PATCH v4 0/5] Add support Weighted Round Robin for blkcg and nvme”
- Features:
  - Four queue levels: urgent, high, medium, low
  - Module parameters (wrr_urgent_queues, etc.)
  - Controller arbitration settings adjusted dynamically
5. Why These Patches Failed
Despite the technical merit, both approaches were rejected or stalled in the kernel community. Here’s why:
1. Lack of Hardware Adoption
Most NVMe SSDs (especially consumer ones) do not implement WRR arbitration. Supporting queue priority would only help a small number of enterprise devices.
2. Kernel Complexity and Maintainability
Adding per-command priority meant changes deep in the block layer and NVMe driver. The patches added module parameters, queue allocation logic, and scheduling complexity.
3. Conflicts with Existing Schedulers
Linux I/O schedulers (BFQ, mq-deadline, etc.) may reorder or delay I/O, undermining command-level priority semantics. There was no clean integration.
4. Spec Limitations
The NVMe spec defines arbitration as a command fetch priority, not an I/O completion or QoS mechanism. It cannot guarantee service levels or isolation. This gap between what WRR can do and what users expect from "prioritization" was noted in research such as Joshi et al. (HotStorage '17).
5. Policy vs Mechanism Philosophy
Kernel maintainers generally avoid embedding policy in drivers. Command priority decisions are better suited for user space, but the interface to do this cleanly didn’t exist.
Conclusion
NVMe internal queue priority is a powerful but underused feature. It relies on hardware support for WRR arbitration and appropriate software hooks to be useful. While several attempts were made to support this in Linux, none were merged due to hardware limitations, integration complexity, and philosophical concerns about policy.
For now, if you want to take advantage of this feature:
- Use nvme-cli to configure arbitration settings (if your device supports WRR).
- Consider kernel customization if working with specific hardware in a performance-critical environment.
As SSDs evolve and if WRR gains traction in more controllers, future kernel versions may revisit this capability.