NVMe Driver Reverse Engineering
Tracing ENODATA
- The NVMe status codes coming back from the drive are translated in the NVMe driver's host code (drivers/nvme/host/core.c) by nvme_error_status():
static blk_status_t nvme_error_status(u16 status)
{
	switch (status & 0x7ff) {
	case NVME_SC_SUCCESS:
		return BLK_STS_OK;
	case NVME_SC_CAP_EXCEEDED:
		return BLK_STS_NOSPC;
	case NVME_SC_LBA_RANGE:
	case NVME_SC_CMD_INTERRUPTED:
	case NVME_SC_NS_NOT_READY:
		return BLK_STS_TARGET;
	case NVME_SC_BAD_ATTRIBUTES:
	case NVME_SC_ONCS_NOT_SUPPORTED:
	case NVME_SC_INVALID_OPCODE:
	case NVME_SC_INVALID_FIELD:
	case NVME_SC_INVALID_NS:
		return BLK_STS_NOTSUPP;
	case NVME_SC_WRITE_FAULT:
	case NVME_SC_READ_ERROR:
	case NVME_SC_UNWRITTEN_BLOCK:
	case NVME_SC_ACCESS_DENIED:
	case NVME_SC_READ_ONLY:
	case NVME_SC_COMPARE_FAILED:
		return BLK_STS_MEDIUM;
	case NVME_SC_GUARD_CHECK:
	case NVME_SC_APPTAG_CHECK:
	case NVME_SC_REFTAG_CHECK:
	case NVME_SC_INVALID_PI:
		return BLK_STS_PROTECTION;
	case NVME_SC_RESERVATION_CONFLICT:
		return BLK_STS_RESV_CONFLICT;
	case NVME_SC_HOST_PATH_ERROR:
		return BLK_STS_TRANSPORT;
	case NVME_SC_ZONE_TOO_MANY_ACTIVE:
		return BLK_STS_ZONE_ACTIVE_RESOURCE;
	case NVME_SC_ZONE_TOO_MANY_OPEN:
		return BLK_STS_ZONE_OPEN_RESOURCE;
	default:
		return BLK_STS_IOERR;
	}
}
- This function is ultimately called when an NVMe I/O operation completes. nvme_end_req() processes the end of an NVMe request by handling the NVMe-specific status, translating it into a block layer status, and using that status to finalize the request in the block layer.
static inline void nvme_end_req(struct request *req)
{
	blk_status_t status = nvme_error_status(nvme_req(req)->status);

	if (unlikely(nvme_req(req)->status && !(req->rq_flags & RQF_QUIET)))
		nvme_log_error(req);
	nvme_end_req_zoned(req);
	nvme_trace_bio_complete(req);
	if (req->cmd_flags & REQ_NVME_MPATH)
		nvme_mpath_end_request(req);
	blk_mq_end_request(req, status);
}
- The mapping from block layer status codes to standard error numbers is done by blk_status_to_errno. Most of the syntax can be ignored; what matters is that it takes the block status derived from the raw NVMe status and looks up the corresponding errno:
int blk_status_to_errno(blk_status_t status)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
		return -EIO;
	return blk_errors[idx].errno;
}
EXPORT_SYMBOL_GPL(blk_status_to_errno);
Scratch Notes
NVMe Block Status Codes
static blk_status_t nvme_error_status(u16 status)
{
	switch (status & 0x7ff) {
	case NVME_SC_SUCCESS:
		return BLK_STS_OK;
	case NVME_SC_CAP_EXCEEDED:
		return BLK_STS_NOSPC;
	case NVME_SC_LBA_RANGE:
	case NVME_SC_CMD_INTERRUPTED:
	case NVME_SC_NS_NOT_READY:
		return BLK_STS_TARGET;
	case NVME_SC_BAD_ATTRIBUTES:
	case NVME_SC_ONCS_NOT_SUPPORTED:
	case NVME_SC_INVALID_OPCODE:
	case NVME_SC_INVALID_FIELD:
	case NVME_SC_INVALID_NS:
		return BLK_STS_NOTSUPP;
	case NVME_SC_WRITE_FAULT:
	case NVME_SC_READ_ERROR:
	case NVME_SC_UNWRITTEN_BLOCK:
	case NVME_SC_ACCESS_DENIED:
	case NVME_SC_READ_ONLY:
	case NVME_SC_COMPARE_FAILED:
		return BLK_STS_MEDIUM;
	case NVME_SC_GUARD_CHECK:
	case NVME_SC_APPTAG_CHECK:
	case NVME_SC_REFTAG_CHECK:
	case NVME_SC_INVALID_PI:
		return BLK_STS_PROTECTION;
	case NVME_SC_RESERVATION_CONFLICT:
		return BLK_STS_RESV_CONFLICT;
	case NVME_SC_HOST_PATH_ERROR:
		return BLK_STS_TRANSPORT;
	case NVME_SC_ZONE_TOO_MANY_ACTIVE:
		return BLK_STS_ZONE_ACTIVE_RESOURCE;
	case NVME_SC_ZONE_TOO_MANY_OPEN:
		return BLK_STS_ZONE_OPEN_RESOURCE;
	default:
		return BLK_STS_IOERR;
	}
}
The nvme_error_status function is designed to translate NVMe-specific status codes (status) into generic block layer status codes understood by the Linux kernel's block layer (blk_status_t). This translation is important for integrating NVMe devices into the Linux block layer, allowing NVMe storage to be managed like other block devices.
The function uses a switch statement to map various NVMe status codes (defined as NVME_SC_*) to corresponding Linux block layer status codes (BLK_STS_*). Here's a breakdown of what it's doing:
- NVME_SC_SUCCESS: If the NVMe status code indicates success (NVME_SC_SUCCESS), it translates this to the block layer's success status (BLK_STS_OK).
- Other NVMe status codes: For other NVMe status codes, the function maps them to appropriate block layer status codes based on the type of error. For example:
  - NVME_SC_CAP_EXCEEDED: Translated to BLK_STS_NOSPC, indicating no space left on the device.
  - NVME_SC_LBA_RANGE, NVME_SC_CMD_INTERRUPTED, NVME_SC_NS_NOT_READY: Translated to BLK_STS_TARGET, indicating a target device error.
  - NVME_SC_BAD_ATTRIBUTES, NVME_SC_ONCS_NOT_SUPPORTED, etc.: These imply an unsupported operation or invalid command and are mapped to BLK_STS_NOTSUPP.
  - NVME_SC_WRITE_FAULT, NVME_SC_READ_ERROR, etc.: These indicate medium errors (issues with the storage medium itself) and are translated to BLK_STS_MEDIUM.
  - And so on for the other cases.
- Default case: If the NVMe status code does not match any of the specific cases, the function defaults to returning BLK_STS_IOERR, a generic I/O error status.
Each case in the switch statement maps a specific NVMe status code (or a group of related codes) to a corresponding generic block layer status. This mapping is crucial for ensuring that NVMe devices behave consistently with other block devices in the Linux kernel and that higher-level systems and applications can appropriately interpret and handle errors originating from NVMe devices.
In summary, this function is integral to the NVMe driver in Linux, allowing NVMe-specific errors to be communicated up through the kernel's block layer in a standardized way.
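One detail worth calling out is the status & 0x7ff mask at the top of the switch: the driver keeps the 16-bit completion status with the CQE phase bit already stripped, so the low bits hold the Status Code and Status Code Type while higher bits carry flags such as Do Not Retry. Below is a small standalone sketch of that decoding; the bit layout is assumed from the NVMe specification and the helper is hypothetical, not driver code.
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decode the 16-bit status value the host driver keeps
 * for a completed command. Assumed layout (per the NVMe spec, with the CQE
 * phase bit already shifted out):
 *   bits  0-7   Status Code (SC)
 *   bits  8-10  Status Code Type (SCT)
 *   bits 11-12  Command Retry Delay (CRD)
 *   bit  13     More
 *   bit  14     Do Not Retry (DNR)
 * The "& 0x7ff" in nvme_error_status() keeps only SCT+SC and throws away the
 * CRD/More/DNR flags before the switch. */
static void decode_nvme_status(uint16_t status)
{
	unsigned int sc  = status & 0xff;        /* Status Code */
	unsigned int sct = (status >> 8) & 0x7;  /* Status Code Type */
	int dnr          = !!(status & 0x4000);  /* Do Not Retry flag */

	printf("sct=%u sc=0x%02x dnr=%d (switch key = 0x%03x)\n",
	       sct, sc, dnr, (unsigned int)(status & 0x7ff));
}

int main(void)
{
	decode_nvme_status(0x4002);	/* example: SCT 0, SC 0x02, DNR set */
	return 0;
}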
NVMe End Request
static inline void nvme_end_req(struct request *req)
{
	blk_status_t status = nvme_error_status(nvme_req(req)->status);

	if (unlikely(nvme_req(req)->status && !(req->rq_flags & RQF_QUIET)))
		nvme_log_error(req);
	nvme_end_req_zoned(req);
	nvme_trace_bio_complete(req);
	if (req->cmd_flags & REQ_NVME_MPATH)
		nvme_mpath_end_request(req);
	blk_mq_end_request(req, status);
}
The function nvme_end_req is part of the NVMe driver in the Linux kernel. It is responsible for finalizing an NVMe I/O request. Here's a step-by-step explanation of what this function is doing:
- Convert NVMe status to block layer status:
  blk_status_t status = nvme_error_status(nvme_req(req)->status);
  This line calls nvme_error_status with the NVMe request's status code. The nvme_error_status function translates NVMe-specific status codes (like NVME_SC_WRITE_FAULT) into generic block layer status codes (blk_status_t). This translation is necessary because the block layer of the kernel (which handles block devices like SSDs and HDDs) uses its own set of status codes.
- Log error if necessary:
  if (unlikely(nvme_req(req)->status && !(req->rq_flags & RQF_QUIET)))
  	nvme_log_error(req);
  This checks if there was an error in the NVMe request (i.e., the NVMe status is non-zero) and if the request is not flagged with RQF_QUIET (which suppresses error logging). If both conditions are met, it logs the error using nvme_log_error.
- Handle zoned requests:
  nvme_end_req_zoned(req);
  This calls a function to handle the end of a zoned request. NVMe Zoned Namespaces (ZNS) are a type of storage architecture on NVMe devices that can improve performance and endurance by dividing the storage space into zones.
- Trace bio completion:
  nvme_trace_bio_complete(req);
  This is for tracing/block layer accounting purposes. It's used by diagnostic and performance monitoring tools to track the completion of the block I/O (bio) request.
- Multipath handling:
  if (req->cmd_flags & REQ_NVME_MPATH)
  	nvme_mpath_end_request(req);
  This checks if the request is a multipath I/O request (using NVMe multipathing features). If it is, it calls nvme_mpath_end_request to handle the multipath-specific aspects of request completion.
- Finalize the request:
  blk_mq_end_request(req, status);
  Finally, it calls blk_mq_end_request, which marks the request as completed in the block layer. The status parameter is the translated block layer status code.
In summary, nvme_end_req is a comprehensive function that handles the finalization of NVMe I/O requests. It takes care of translating NVMe-specific status codes to generic block layer codes, logs errors if necessary, deals with zoned requests, handles multipathing, and ensures the request is properly completed in the block subsystem of the kernel.
The Call to NVMe End Request
void nvme_complete_rq(struct request *req)
{
	struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;

	trace_nvme_complete_rq(req);
	nvme_cleanup_cmd(req);

	/*
	 * Completions of long-running commands should not be able to
	 * defer sending of periodic keep alives, since the controller
	 * may have completed processing such commands a long time ago
	 * (arbitrarily close to command submission time).
	 * req->deadline - req->timeout is the command submission time
	 * in jiffies.
	 */
	if (ctrl->kas &&
	    req->deadline - req->timeout >= ctrl->ka_last_check_time)
		ctrl->comp_seen = true;

	switch (nvme_decide_disposition(req)) {
	case COMPLETE:
		nvme_end_req(req);
		return;
	case RETRY:
		nvme_retry_req(req);
		return;
	case FAILOVER:
		nvme_failover_req(req);
		return;
	case AUTHENTICATE:
#ifdef CONFIG_NVME_HOST_AUTH
		queue_work(nvme_wq, &ctrl->dhchap_auth_work);
		nvme_retry_req(req);
#else
		nvme_end_req(req);
#endif
		return;
	}
}
EXPORT_SYMBOL_GPL(nvme_complete_rq);
void nvme_complete_batch_req(struct request *req)
{
	trace_nvme_complete_rq(req);
	nvme_cleanup_cmd(req);
	nvme_end_req_zoned(req);
}
EXPORT_SYMBOL_GPL(nvme_complete_batch_req);
The provided code defines two functions, nvme_complete_rq and nvme_complete_batch_req, which are part of the NVMe (Non-Volatile Memory Express) driver in the Linux kernel. These functions are responsible for handling the completion of NVMe requests. Let's break down what each function does:
nvme_complete_rq function
This function is called to complete a single NVMe request.
- Trace and cleanup: trace_nvme_complete_rq(req) records a tracepoint for the completion of the request (tracepoints are used for debugging and performance analysis), and nvme_cleanup_cmd(req) cleans up the command associated with the request.
- Keep-alive check: NVMe controllers may drop a connection if they don't receive periodic keep-alive messages. ctrl->kas indicates keep-alive is enabled. If the command was submitted at or after the last keep-alive check (req->deadline - req->timeout is the submission time in jiffies), ctrl->comp_seen is set to true, recording that a recently submitted command has completed; as the comment in the code explains, completions of long-running commands submitted before that check are deliberately excluded so they cannot defer sending of the periodic keep-alive.
- Command disposition: switch (nvme_decide_disposition(req)) decides how to handle the request, which can be one of COMPLETE, RETRY, FAILOVER, or AUTHENTICATE (a simplified sketch of that decision logic follows this list).
  - COMPLETE: calls nvme_end_req(req) to finalize the request.
  - RETRY: calls nvme_retry_req(req) to retry the request.
  - FAILOVER: calls nvme_failover_req(req) to handle failover in case of an error.
  - AUTHENTICATE: on kernels with host authentication (CONFIG_NVME_HOST_AUTH), queues authentication work and retries the request; otherwise it ends the request.
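The decision logic itself lives in nvme_decide_disposition. Here is a rough sketch of the kind of checks involved, simplified and reconstructed from memory, so the exact conditions and helpers (blk_noretry_request, nvme_is_path_error, nvme_max_retries) should be treated as assumptions to verify against the actual source; the real function also covers the AUTHENTICATE case.
/* Simplified sketch, not the actual kernel source. */
static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
{
	if (likely(nvme_req(req)->status == 0))
		return COMPLETE;			/* success: just finish it */

	if (blk_noretry_request(req) ||
	    (nvme_req(req)->status & NVME_SC_DNR) ||	/* controller said Do Not Retry */
	    nvme_req(req)->retries >= nvme_max_retries)
		return COMPLETE;			/* give up and fail it up the stack */

	if ((req->cmd_flags & REQ_NVME_MPATH) &&
	    nvme_is_path_error(nvme_req(req)->status))
		return FAILOVER;			/* path error on a multipath request */

	return RETRY;
}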
nvme_complete_batch_req function
This function is designed for completing a batch of NVMe requests, particularly for zoned namespaces.
- Trace and cleanup: similar to nvme_complete_rq, it records a tracepoint and cleans up the command.
- Zoned request completion: calls nvme_end_req_zoned(req) to handle the specific requirements for completing a zoned request.
Block Layer Status Code
Found in block/blk-core.c
int blk_status_to_errno(blk_status_t status)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
		return -EIO;
	return blk_errors[idx].errno;
}
EXPORT_SYMBOL_GPL(blk_status_to_errno);
The blk_status_to_errno function converts a block layer status code (blk_status_t) to a standard Linux error number (errno). The implementation uses an array (blk_errors) to map these status codes to corresponding errno values. Here's a breakdown of how it works:
- Convert status to array index:
  int idx = (__force int)status; converts the blk_status_t status code to an integer idx. The __force annotation is used in the Linux kernel to tell the sparse type checker that this conversion between the bitwise blk_status_t type and a plain int is intentional.
- Boundary check:
  if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
  	return -EIO;
  This checks whether the index idx is within the bounds of the blk_errors array. The ARRAY_SIZE(blk_errors) macro returns the number of elements in the blk_errors array. WARN_ON_ONCE is a kernel macro that prints a warning message to the kernel log the first time the condition is true (here, if idx is out of bounds), which helps catch potential bugs during development. If idx is out of bounds, the function returns -EIO, a generic I/O error.
- Return the corresponding error number:
  return blk_errors[idx].errno;
  This accesses the blk_errors array at index idx and returns the errno value associated with that block status code. The blk_errors array is defined in the same file (blk-core.c) and contains the mapping of block status codes to their corresponding errno values.
- Export symbol:
  EXPORT_SYMBOL_GPL(blk_status_to_errno); makes blk_status_to_errno available to other kernel modules under the terms of the GPL.
In summary, blk_status_to_errno uses an array to map block status codes to errno values. The function ensures that the index is within the bounds of the array and then uses it to fetch and return the corresponding error number. This provides a flexible and maintainable way to handle the conversion, as changes to the mapping only require updating the blk_errors array rather than modifying the function logic.
Block Layer Status Code Mapping
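The blk_errors table is what closes the loop for the ENODATA trail. Below is an abbreviated sketch of the table in block/blk-core.c, reconstructed from memory and trimmed, so treat individual entries as assumptions that may differ between kernel versions; the key entry for this investigation is BLK_STS_MEDIUM mapping to -ENODATA.
/* Abbreviated sketch of blk_errors[] from block/blk-core.c (not verbatim). */
static const struct {
	int		errno;
	const char	*name;
} blk_errors[] = {
	[BLK_STS_OK]		= { 0,		"" },
	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
	[BLK_STS_NOSPC]		= { -ENOSPC,	"critical space allocation" },
	[BLK_STS_TARGET]	= { -EREMOTEIO,	"critical target" },
	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },	/* <-- ENODATA */
	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
	[BLK_STS_TRANSPORT]	= { -ENOLINK,	"recoverable transport" },
	/* ... more entries elided ... */
	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
};
So, assuming that entry, any NVMe status that nvme_error_status maps to BLK_STS_MEDIUM (write fault, read error, access denied, and so on) comes back as -ENODATA wherever the block status is converted to an errno.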
Summary
- nvme_complete_rq is a more comprehensive function handling individual NVMe requests, including various scenarios like retries, failovers, and special handling for long-running commands in the context of keep-alive signals.
- nvme_complete_batch_req is a simpler function used for completing a batch of requests, with specific handling for zoned namespaces.
These functions are critical to the NVMe driver's operation in Linux, ensuring that NVMe requests are correctly processed, completed, retried, or failed over as needed. They also handle the specific requirements of zoned namespaces, a feature that optimizes SSD performance and longevity.
NVMe Completion Described
The nvme_end_req function is typically called at the end of an NVMe I/O request's lifecycle in the Linux kernel's NVMe driver. It is invoked when an I/O request to an NVMe device has been processed and is ready to be completed. Here's a general overview of when and how it gets called:
- Completion of an I/O operation: NVMe I/O operations, such as read or write requests, are processed asynchronously. When an NVMe I/O operation completes (successfully or with an error), a completion routine is triggered. nvme_end_req is often called within such completion routines.
- From NVMe interrupt handlers: NVMe devices use interrupts to signal the completion of I/O operations. The NVMe driver has interrupt handlers to respond to these interrupts. When an interrupt indicates the completion of an I/O request, the interrupt handler will eventually call nvme_end_req to finalize the request.
- Completion path in the NVMe driver: Within the NVMe driver's code, after processing the response from the NVMe device (like handling the status of the command), the driver calls nvme_end_req to clean up the request and perform final steps like error logging, multipath handling, and notifying the block layer of the completion status.
- In response to errors or timeouts: If an error occurs during the processing of an NVMe command, or if a command times out, nvme_end_req is called as part of the error or timeout handling path to complete the request with the appropriate error status.
- NVMe queue processing: The NVMe driver uses request queues to manage I/O operations. As requests are processed and completed (either by the hardware or due to error conditions), nvme_end_req is called to properly close out these requests.
The exact call chain leading to nvme_end_req can be quite complex, as it involves asynchronous I/O processing, interrupt handling, and interactions with the Linux kernel's block layer. The function plays a crucial role in ensuring that each I/O request is correctly finalized, whether it completes successfully or encounters an error.
NVMe ENODATA
NOTE: THIS CODE IS ON THE NVMe TARGET (nvmet) SIDE, i.e. THE NVMe-oF TARGET, NOT THE HOST DRIVER.
- Started by looking for ENODATA; it only appears here:
inline u16 errno_to_nvme_status(struct nvmet_req *req, int errno)
{
	switch (errno) {
	case 0:
		return NVME_SC_SUCCESS;
	case -ENOSPC:
		req->error_loc = offsetof(struct nvme_rw_command, length);
		return NVME_SC_CAP_EXCEEDED | NVME_SC_DNR;
	case -EREMOTEIO:
		req->error_loc = offsetof(struct nvme_rw_command, slba);
		return NVME_SC_LBA_RANGE | NVME_SC_DNR;
	case -EOPNOTSUPP:
		req->error_loc = offsetof(struct nvme_common_command, opcode);
		switch (req->cmd->common.opcode) {
		case nvme_cmd_dsm:
		case nvme_cmd_write_zeroes:
			return NVME_SC_ONCS_NOT_SUPPORTED | NVME_SC_DNR;
		default:
			return NVME_SC_INVALID_OPCODE | NVME_SC_DNR;
		}
		break;
	case -ENODATA:
		req->error_loc = offsetof(struct nvme_rw_command, nsid);
		return NVME_SC_ACCESS_DENIED;
	case -EIO:
		fallthrough;
	default:
		req->error_loc = offsetof(struct nvme_common_command, opcode);
		return NVME_SC_INTERNAL | NVME_SC_DNR;
	}
}
This C function, errno_to_nvme_status, converts a given error number (errno) to an NVMe status code. It's designed to be used within the NVMe target subsystem, particularly for handling error conditions. The function takes two parameters: a pointer to a struct nvmet_req (which represents a request in the NVMe target subsystem) and an int representing the error number (errno). Here's a breakdown of its behavior:
- Switch statement on errno: The function uses a switch-case statement to handle different values of errno.
- Case 0 (success): If errno is 0 (no error), it returns NVME_SC_SUCCESS, indicating successful completion.
- Case -ENOSPC (no space left on device): For this error, it sets the error_loc field of the request to the offset of the length field in struct nvme_rw_command. It then returns NVME_SC_CAP_EXCEEDED | NVME_SC_DNR, indicating that the capacity is exceeded and the request should not be retried.
- Case -EREMOTEIO (remote I/O error): This sets error_loc to the offset of the slba field in struct nvme_rw_command and returns NVME_SC_LBA_RANGE | NVME_SC_DNR, indicating an LBA (Logical Block Addressing) range error and that the request should not be retried.
- Case -EOPNOTSUPP (operation not supported): The function sets error_loc to the offset of the opcode field in struct nvme_common_command. It then checks the opcode of the command: if it's nvme_cmd_dsm or nvme_cmd_write_zeroes, it returns NVME_SC_ONCS_NOT_SUPPORTED | NVME_SC_DNR, indicating that the operation is not supported. For other opcodes, it returns NVME_SC_INVALID_OPCODE | NVME_SC_DNR, indicating an invalid opcode.
- Case -ENODATA (no data): Sets error_loc to the offset of the nsid field in struct nvme_rw_command and returns NVME_SC_ACCESS_DENIED, indicating access is denied.
- Case -EIO (I/O error) and default: For an I/O error or any other unspecified error (default case), it sets error_loc to the offset of the opcode in struct nvme_common_command and returns NVME_SC_INTERNAL | NVME_SC_DNR, indicating an internal error and that the request should not be retried.
This function is specific to NVMe storage device error handling, translating system errors into NVMe-specific status codes, providing additional information about the location of the error in the command structure.
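Putting the target and host halves together gives the ENODATA round trip this investigation was chasing. The sketch below is a working hypothesis; the final BLK_STS_MEDIUM to -ENODATA step depends on the blk_errors entry sketched earlier.
/*
 * Assumed ENODATA round trip for NVMe-oF:
 *
 *   target: backend I/O fails with -ENODATA
 *     -> errno_to_nvme_status()       returns NVME_SC_ACCESS_DENIED
 *        (the status travels back over the fabric in the completion)
 *   host:   nvme_error_status(NVME_SC_ACCESS_DENIED)
 *     -> BLK_STS_MEDIUM
 *   block:  blk_status_to_errno(BLK_STS_MEDIUM)
 *     -> -ENODATA                     what the submitter of the I/O sees
 */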
Code Explanation
The code you've found is from the Linux kernel, specifically from the NVMe target subsystem (nvmet). This subsystem allows a machine to act as an NVMe target, meaning it can present NVMe storage devices over a network. The code implements the functionality needed to set up and manage an NVMe-over-Fabrics (NVMe-oF) target.
Here's a brief overview of key components of the code:
- Initialization and exit routines: The nvmet_init and nvmet_exit functions are responsible for initializing and cleaning up the NVMe target subsystem when the module is loaded or unloaded.
- Command processing: Functions like nvmet_req_init, nvmet_req_complete, and nvmet_parse_io_cmd handle the processing of NVMe commands received from remote hosts. They deal with command setup, execution, and completion.
- Error handling: The errno_to_nvme_status function, discussed above, maps Linux kernel error codes to NVMe status codes. This is important for correctly communicating errors back to NVMe clients.
- Subsystem management: The code includes functions for managing NVMe subsystems (nvmet_subsys), which represent groupings of NVMe namespaces (storage volumes) that can be accessed by remote hosts.
- Controller management: Functions for managing NVMe controllers (nvmet_ctrl) are present. In NVMe-oF, a controller is an entity through which a host connects and accesses namespaces.
- Asynchronous event notification: The code supports handling and sending asynchronous events to NVMe hosts, which is essential for notifying remote hosts about changes or issues in the target subsystem.
- ConfigFS integration: The NVMe target subsystem integrates with ConfigFS, a configuration file system in Linux, to allow dynamic configuration of NVMe targets.
- Queue and namespace handling: There are functions for handling NVMe queues (nvmet_sq and nvmet_cq) and namespaces (nvmet_ns), which are central to the operation of NVMe.
- Fabric transport handling: The code handles various NVMe-over-Fabrics transport methods, allowing NVMe commands to be sent and received over different types of networks.
This code is part of the kernel's implementation of an NVMe target. It is not strictly NVMe-oF specific; some parts are generic to NVMe targets, but it includes support for NVMe-oF as well. The NVMe target functionality enables a Linux system to provide NVMe storage over a network, making it a crucial component for implementing NVMe-oF services in data centers and cloud environments.
Zoned vs Not Zoned
The difference between zoned and non-zoned in the context of NVMe (Non-Volatile Memory Express) and the provided code snippets pertains to a specific type of storage management called Zoned Namespaces (ZNS).
Zoned Namespaces is a relatively new concept in NVMe storage, introduced to optimize the way data is written and managed, particularly for SSDs. Let's break down what each type of namespace means:
- Non-zoned (regular) NVMe namespaces:
  - In regular (non-zoned) NVMe namespaces, data can be written to any location on the drive without any specific restrictions. This is the traditional way SSDs operate.
  - You can perform typical read, write, and erase operations anywhere across the drive's storage space.
  - This approach is straightforward but can lead to inefficiencies in how data is stored and managed, especially as the drive starts to fill up and requires more frequent garbage collection and wear leveling.
- Zoned NVMe namespaces:
  - In ZNS (Zoned Namespaces), the storage space is divided into zones.
  - Each zone must be written sequentially, meaning new data must be appended to the end of the current data in the zone, unlike regular namespaces where data can be written randomly.
  - This sequential write requirement aligns better with the underlying characteristics of NAND flash memory, potentially increasing drive efficiency, lifespan, and performance.
  - ZNS can reduce the overhead of garbage collection and wear leveling, as each zone is written and erased as a unit.
Now, looking at your provided code:
- nvme_end_req_zoned function: This function is specifically for handling the completion of requests in a zoned namespace. If the request is a zone append operation (indicated by REQ_OP_ZONE_APPEND), it updates the sector number based on the result of the NVMe command. This is specific to ZNS, where the location the data is written to may be determined by the drive and needs to be communicated back to the host system (see the sketch after this list).
- nvme_end_req function: This is a more general function for handling the completion of any NVMe request (both zoned and non-zoned). It calls nvme_end_req_zoned as part of its process, so it correctly handles zoned requests when necessary. For non-zoned requests, or if zoned features are not enabled (CONFIG_BLK_DEV_ZONED not set), nvme_end_req handles the request completion without the ZNS-specific considerations.
In summary, the key difference is in how data is written and managed on the drive: zoned namespaces require sequential writes within zones, while non-zoned namespaces do not have this restriction. The code you provided shows how the NVMe driver in Linux handles these differences during the completion of I/O requests.
Enable NVMe Trace Events
See: https://github.com/linux-nvme/nvme-trace
To use, first enable nvme trace events. To enable all of them, you can do it like this:
echo 1 > /sys/kernel/debug/tracing/events/nvme/enable
This will likely fill the trace buffer rather quickly when you're tracing all IO for even a modest workload.
You can also filter for specific things. For example, if you want to trace only admin commands, you can do it like this:
echo "qid==0" > /sys/kernel/debug/tracing/events/nvme/filter
Once your trace is enabled and some commands have run, you can get the trace and pipe it to this parsing program like this:
cat /sys/kernel/debug/tracing/trace | ./nvme-trace.pl