Symptoms:
The hang occurs when an ABORT command times out. Virtual Machine can be unresponsive or just the IO task may be unresponsive.
Detailed log from /var/log/kern.log on the Ubuntu 1804 VM:
Ubuntu 1804.1: Linux eng26-dhcp20 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Mar 2 14:25:00 hostname kernel: [17427.420654] nvme nvme1: I/O 42 QID 2 timeout, aborting
Mar 2 14:25:00 hostname kernel: [17430.423909] nvme nvme1: Abort status: 0x0
Mar 2 14:25:30 hostname kernel: [17460.570298] nvme nvme1: I/O 42 QID 2 timeout, reset controller
Mar 2 14:26:01 hostname kernel: [17491.415229] nvme nvme1: I/O 42 QID 2 timeout, disable controller
Mar 2 14:28:30 hostname kernel: [17641.034980] INFO: task kworker/1:1H:304 blocked for more than 120 seconds.
Mar 2 14:28:30 hostname kernel: [17641.035003] Not tainted 4.15.0-20-generic #21-Ubuntu
Mar 2 14:28:30 hostname kernel: [17641.035018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 2 14:28:30 hostname kernel: [17641.035038] kworker/1:1H D 0 304 2 0x80000000
Mar 2 14:28:30 hostname kernel: [17641.035045] Workqueue: kblockd blk_mq_timeout_work
Mar 2 14:28:30 hostname kernel: [17641.035046] Call Trace:
Mar 2 14:28:30 hostname kernel: [17641.035051] __schedule+0x297/0x8b0
Mar 2 14:28:30 hostname kernel: [17641.035054] ? vprintk_emit+0x333/0x3a0
Mar 2 14:28:30 hostname kernel: [17641.035055] schedule+0x2c/0x80
Mar 2 14:28:30 hostname kernel: [17641.035057] schedule_preempt_disabled+0xe/0x10
Mar 2 14:28:30 hostname kernel: [17641.035058] __mutex_lock.isra.2+0x18c/0x4d0
Mar 2 14:28:30 hostname kernel: [17641.035061] ? ttwu_do_wakeup+0x1e/0x150
Mar 2 14:28:30 hostname kernel: [17641.035062] __mutex_lock_slowpath+0x13/0x20
Mar 2 14:28:30 hostname kernel: [17641.035063] ? __mutex_lock_slowpath+0x13/0x20
Mar 2 14:28:30 hostname kernel: [17641.035064] mutex_lock+0x2f/0x40
Mar 2 14:28:30 hostname kernel: [17641.035070] nvme_start_freeze+0x21/0x50 [nvme_core]
Mar 2 14:28:30 hostname kernel: [17641.035072] nvme_dev_disable+0x3f9/0x450 [nvme]
Mar 2 14:28:30 hostname kernel: [17641.035077] ? __dev_printk+0x3c/0x80
Mar 2 14:28:30 hostname kernel: [17641.035077] ? dev_warn+0x64/0x80
Mar 2 14:28:30 hostname kernel: [17641.035079] nvme_timeout+0x2de/0x310 [nvme]
Mar 2 14:28:30 hostname kernel: [17641.035081] ? __switch_to_asm+0x40/0x70
Mar 2 14:28:30 hostname kernel: [17641.035082] ? __switch_to_asm+0x34/0x70
Mar 2 14:28:30 hostname kernel: [17641.035083] ? __switch_to_asm+0x40/0x70
Mar 2 14:28:30 hostname kernel: [17641.035084] blk_mq_rq_timed_out+0x32/0x70
Mar 2 14:28:30 hostname kernel: [17641.035085] blk_mq_check_expired+0x53/0x60
Mar 2 14:28:30 hostname kernel: [17641.035086] bt_iter+0x4d/0x60
Mar 2 14:28:30 hostname kernel: [17641.035087] blk_mq_queue_tag_busy_iter+0x162/0x260
Mar 2 14:28:30 hostname kernel: [17641.035089] ? blk_mq_rq_timed_out+0x70/0x70
Mar 2 14:28:30 hostname kernel: [17641.035090] ? blk_mq_rq_timed_out+0x70/0x70
Mar 2 14:28:30 hostname kernel: [17641.035092] blk_mq_timeout_work+0x60/0x160
Mar 2 14:28:30 hostname kernel: [17641.035094] process_one_work+0x1de/0x410
Mar 2 14:28:30 hostname kernel: [17641.035095] worker_thread+0x32/0x410
Mar 2 14:28:30 hostname kernel: [17641.035097] kthread+0x121/0x140
Mar 2 14:28:30 hostname kernel: [17641.035098] ? process_one_work+0x410/0x410
Mar 2 14:28:30 hostname kernel: [17641.035100] ? kthread_create_worker_on_cpu+0x70/0x70
Mar 2 14:28:30 hostname kernel: [17641.035101] ret_from_fork+0x35/0x40
Mar 2 14:28:30 hostname kernel: [17641.035110] INFO: task kworker/u4:2:2105 blocked for more than 120 seconds.
Mar 2 14:28:30 hostname kernel: [17641.035129] Not tainted 4.15.0-20-generic #21-Ubuntu
Mar 2 14:28:30 hostname kernel: [17641.035143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 2 14:28:30 hostname kernel: [17641.035162] kworker/u4:2 D 0 2105 2 0x80000000
Mar 2 14:28:30 hostname kernel: [17641.035165] Workqueue: nvme-wq nvme_reset_work [nvme]
Mar 2 14:28:30 hostname kernel: [17641.035165] Call Trace:
Mar 2 14:28:30 hostname kernel: [17641.035167] __schedule+0x297/0x8b0
Mar 2 14:28:30 hostname kernel: [17641.035169] schedule+0x2c/0x80
Mar 2 14:28:30 hostname kernel: [17641.035170] blk_mq_freeze_queue_wait+0x4b/0xb0
Mar 2 14:28:30 hostname kernel: [17641.035172] ? wait_woken+0x80/0x80
Mar 2 14:28:30 hostname kernel: [17641.035175] nvme_wait_freeze+0x33/0x50 [nvme_core]
Mar 2 14:28:30 hostname kernel: [17641.035176] nvme_reset_work+0xb71/0x12e9 [nvme]
Mar 2 14:28:30 hostname kernel: [17641.035177] process_one_work+0x1de/0x410
Mar 2 14:28:30 hostname kernel: [17641.035178] worker_thread+0x32/0x410
Mar 2 14:28:30 hostname kernel: [17641.035180] kthread+0x121/0x140
Mar 2 14:28:30 hostname kernel: [17641.035181] ? process_one_work+0x410/0x410
Mar 2 14:28:30 hostname kernel: [17641.035182] ? kthread_create_worker_on_cpu+0x70/0x70
Mar 2 14:28:30 hostname kernel: [17641.035184] ? do_syscall_64+0x73/0x130
Mar 2 14:28:30 hostname kernel: [17641.035186] ? SyS_exit_group+0x14/0x20
Mar 2 14:28:30 hostname kernel: [17641.035187] ret_from_fork+0x35/0x40
(END)
=============================================================
We also see this on RHEL 8.2:
RHEL 8.3: Linux hostname 4.18.0-240.el8.x86_64 #1 SMP Wed Sep 23 05:13:10 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Detailed log from /var/log/kern.log
2022-02-25T19:27:02.333739+01:00 hostname kernel: [ 788.623401] nvme nvme2: I/O 72 QID 2 timeout, aborting
2022-02-25T19:27:02.333747+01:00 hostname kernel: [ 788.623516] nvme nvme2: Abort status: 0x0
2022-02-25T19:27:32.541729+01:00 hostname kernel: [ 818.831071] nvme nvme2: I/O 72 QID 2 timeout, reset controller
2022-02-25T19:27:32.546740+01:00 hostname kernel: [ 818.836274] WARNING: CPU: 0 PID: 934 at kernel/irq/chip.c:210 irq_startup+0xde/0xf0
2022-02-25T19:27:32.546745+01:00 hostname kernel: [ 818.836275] Modules linked in: 8021q garp mrp stp llc nfsv3 nfs fscache intel_rapl_msr intel_rapl_common nfit libnvdimm nft_reject_ipv4 kvm_intel nf_reject_ipv4 nft_reject nf_tables_set kvm nf_tables irqbypass nfnetlink vmw_vsock_vmci_transport vsock crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmw_balloon intel_rapl_perf pcspkr joydev vmw_vmci i2c_piix4 vfat fat nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c sr_mod cdrom sg ata_generic crc32c_intel serio_raw nvme drm_kms_helper nvme_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci ata_piix libata vmxnet3
2022-02-25T19:27:32.546746+01:00 hostname kernel: [ 818.836304] CPU: 0 PID: 934 Comm: kworker/0:1H Kdump: loaded Not tainted 4.18.0-193.70.1.el8_2.x86_64 #1
2022-02-25T19:27:32.546748+01:00 hostname kernel: [ 818.836305] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
2022-02-25T19:27:32.546749+01:00 hostname kernel: [ 818.836309] Workqueue: kblockd blk_mq_timeout_work
2022-02-25T19:27:32.546750+01:00 hostname kernel: [ 818.836310] RIP: 0010:irq_startup+0xde/0xf0
2022-02-25T19:27:32.546750+01:00 hostname kernel: [ 818.836312] Code: 31 f6 4c 89 ef e8 12 2f 00 00 85 c0 75 20 48 89 ee 31 d2 4c 89 ef e8 e1 d7 ff ff 48 89 df e8 b9 fe ff ff 89 c5 e9 59 ff ff ff <0f> 0b eb b9 0f 0b eb b5 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
2022-02-25T19:27:32.546752+01:00 hostname kernel: [ 818.836313] RSP: 0018:ffffbe09c2c2bbd8 EFLAGS: 00010002
2022-02-25T19:27:32.546752+01:00 hostname kernel: [ 818.836314] RAX: 0000000000000080 RBX: ffff997ba7725200 RCX: 0000000000000080
2022-02-25T19:27:32.546754+01:00 hostname kernel: [ 818.836314] RDX: 0000000000000001 RSI: ffffffffb560d980 RDI: ffff997ba7ced960
2022-02-25T19:27:32.546754+01:00 hostname kernel: [ 818.836315] RBP: ffff997ba7ced960 R08: 0000000000000000 R09: ffff997ba5ea0920
2022-02-25T19:27:32.546755+01:00 hostname kernel: [ 818.836315] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
2022-02-25T19:27:32.546756+01:00 hostname kernel: [ 818.836316] R13: 0000000000000001 R14: 0000000000000000 R15: ffff997ba92e0f00
2022-02-25T19:27:32.546756+01:00 hostname kernel: [ 818.836316] FS: 0000000000000000(0000) GS:ffff997badc00000(0000) knlGS:0000000000000000
2022-02-25T19:27:32.546757+01:00 hostname kernel: [ 818.836317] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-02-25T19:27:32.546757+01:00 hostname kernel: [ 818.836317] CR2: 00007fdc243de000 CR3: 00000003baa0a002 CR4: 00000000007606f0
2022-02-25T19:27:32.546758+01:00 hostname kernel: [ 818.836334] PKRU: 55555554
2022-02-25T19:27:32.546759+01:00 hostname kernel: [ 818.836334] Call Trace:
2022-02-25T19:27:32.546759+01:00 hostname kernel: [ 818.836339] enable_irq+0x46/0x90
2022-02-25T19:27:32.546760+01:00 hostname kernel: [ 818.836343] nvme_poll_irqdisable+0x2b7/0x370 [nvme]
2022-02-25T19:27:32.546760+01:00 hostname kernel: [ 818.836345] __nvme_disable_io_queues+0x1b2/0x1f0 [nvme]
2022-02-25T19:27:32.546761+01:00 hostname kernel: [ 818.836346] ? nvme_del_queue_end+0x20/0x20 [nvme]
2022-02-25T19:27:32.546761+01:00 hostname kernel: [ 818.836347] nvme_dev_disable+0x1ff/0x220 [nvme]
2022-02-25T19:27:32.546762+01:00 hostname kernel: [ 818.836348] nvme_timeout+0x2ca/0x320 [nvme]
2022-02-25T19:27:32.546763+01:00 hostname kernel: [ 818.836351] ? __switch_to_asm+0x41/0x70
2022-02-25T19:27:32.546763+01:00 hostname kernel: [ 818.836352] ? __switch_to_asm+0x35/0x70
2022-02-25T19:27:32.546763+01:00 hostname kernel: [ 818.836353] ? __switch_to_asm+0x41/0x70
2022-02-25T19:27:32.546764+01:00 hostname kernel: [ 818.836353] ? __switch_to_asm+0x35/0x70
2022-02-25T19:27:32.546764+01:00 hostname kernel: [ 818.836354] ? __switch_to_asm+0x41/0x70
2022-02-25T19:27:32.546765+01:00 hostname kernel: [ 818.836354] ? __switch_to_asm+0x35/0x70
2022-02-25T19:27:32.546766+01:00 hostname kernel: [ 818.836355] ? __switch_to_asm+0x41/0x70
2022-02-25T19:27:32.546766+01:00 hostname kernel: [ 818.836356] ? __switch_to_asm+0x35/0x70
2022-02-25T19:27:32.546767+01:00 hostname kernel: [ 818.836356] ? __switch_to_asm+0x41/0x70
2022-02-25T19:27:32.546767+01:00 hostname kernel: [ 818.836357] blk_mq_check_expired+0x11f/0x140
2022-02-25T19:27:32.546768+01:00 hostname kernel: [ 818.836360] blk_mq_queue_tag_busy_iter+0x19a/0x300
2022-02-25T19:27:32.546768+01:00 hostname kernel: [ 818.836361] ? blk_poll+0x350/0x350
2022-02-25T19:27:32.546769+01:00 hostname kernel: [ 818.836362] ? __switch_to_asm+0x41/0x70
2022-02-25T19:27:32.546770+01:00 hostname kernel: [ 818.836363] ? blk_poll+0x350/0x350
2022-02-25T19:27:32.546770+01:00 hostname kernel: [ 818.836365] ? release_thread+0x10/0x30
2022-02-25T19:27:32.546770+01:00 hostname kernel: [ 818.836366] blk_mq_timeout_work+0x54/0x130
2022-02-25T19:27:32.546771+01:00 hostname kernel: [ 818.836368] process_one_work+0x1a7/0x3b0
2022-02-25T19:27:32.546772+01:00 hostname kernel: [ 818.836369] worker_thread+0x30/0x390
2022-02-25T19:27:32.546773+01:00 hostname kernel: [ 818.836370] ? create_worker+0x1a0/0x1a0
2022-02-25T19:27:32.546774+01:00 hostname kernel: [ 818.836372] kthread+0x112/0x130
2022-02-25T19:27:32.546774+01:00 hostname kernel: [ 818.836374] ? kthread_flush_work_fn+0x10/0x10
2022-02-25T19:27:32.546774+01:00 hostname kernel: [ 818.836375] ret_from_fork+0x1f/0x40
2022-02-25T19:27:32.546775+01:00 hostname kernel: [ 818.836376] ---[ end trace 6500b1afc30fe12d ]---
Bugs in the earlier Linux NVME Drivers when using the vNVME new Retry Handler, under APD or with heavy IO load conditions can cause hangs/lockups in the Linux Guest.
Its observed there is Hang/Lockup problem in 4.18.xx or 4.15.xx Linux kernel versions.
The issue has been fixed in 7.0 P05 (u3f)