When we destroy a cm_id, we must purge associated events from the
event queue. If the cm_id is for a listen request, we also purge
corresponding pending connect requests. This requires destroying
the cm_ids associated with the connect requests by calling
rdma_destroy_id(). rdma_destroy_id() blocks until all outstanding
callbacks have completed.

The issue is that we hold file->mut while purging events from the
event queue. We also acquire file->mut in our event handler. Calling
rdma_destroy_id() while holding file->mut can therefore deadlock:
the event handler callback blocks waiting for file->mut, so it never
returns, and rdma_destroy_id() never completes.

Fix this by moving the events to be purged from the event queue onto
a temporary list. We can then release file->mut and call
rdma_destroy_id() without holding any locks.

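The pattern described above can be sketched in user space roughly as
follows. This is only an illustration, not the rdma_ucm code: struct
demo_file, struct demo_event, queue_event(), destroy_resource() and
purge_events() are hypothetical names, a pthread mutex stands in for
file->mut, and destroy_resource() stands in for rdma_destroy_id() by
taking the mutex itself, the way a pending event handler would.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the rdma_ucm structures. */
struct demo_event {
	struct demo_event *next;
	int ctx_id;			/* context that owns this event */
};

struct demo_file {
	pthread_mutex_t mut;		/* plays the role of file->mut */
	struct demo_event *head;	/* event queue */
};

static int destroyed_count;

/*
 * Stand-in for rdma_destroy_id(). It takes f->mut, simulating the wait
 * for an event handler that needs the same mutex. Calling this with
 * f->mut already held would deadlock.
 */
static void destroy_resource(struct demo_file *f, struct demo_event *ev)
{
	(void)ev;
	pthread_mutex_lock(&f->mut);
	destroyed_count++;
	pthread_mutex_unlock(&f->mut);
}

/* Add an event for ctx_id at the head of the queue. */
static void queue_event(struct demo_file *f, int ctx_id)
{
	struct demo_event *ev = calloc(1, sizeof(*ev));

	if (!ev)
		abort();
	ev->ctx_id = ctx_id;
	pthread_mutex_lock(&f->mut);
	ev->next = f->head;
	f->head = ev;
	pthread_mutex_unlock(&f->mut);
}

/* Purge all events of ctx_id; returns the number of events destroyed. */
static int purge_events(struct demo_file *f, int ctx_id)
{
	struct demo_event *purge = NULL, **pp, *ev;
	int n = 0;

	/* Step 1: under the lock, only unlink matching events. */
	pthread_mutex_lock(&f->mut);
	pp = &f->head;
	while ((ev = *pp) != NULL) {
		if (ev->ctx_id == ctx_id) {
			*pp = ev->next;		/* remove from the queue */
			ev->next = purge;	/* push onto the private list */
			purge = ev;
		} else {
			pp = &ev->next;
		}
	}
	pthread_mutex_unlock(&f->mut);

	/* Step 2: with no lock held, run the blocking destroy calls. */
	while (purge) {
		ev = purge;
		purge = ev->next;
		destroy_resource(f, ev);
		free(ev);
		n++;
	}
	return n;
}
```

Because the temporary list is private to the purging thread once the
mutex is dropped, walking it needs no locking, and the destroy step is
free to wait on handlers that take the mutex.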
Bug report by Or Gerlitz <ogerlitz@mellanox.com>:
[ INFO: possible circular locking dependency detected ]
3.3.0-rc5-00008-g79f1e43-dirty #34 Tainted: G I
tgtd/9018 is trying to acquire lock:
(&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0359a41>] rdma_destroy_id+0x33/0x1f0 [rdma_cm]
but task is already holding lock:
(&file->mut){+.+.+.}, at: [<ffffffffa02470fe>] ucma_free_ctx+0xb6/0x196 [rdma_ucm]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is: