From: Hefty, Sean
Date: Fri, 2 Mar 2012 00:01:19 +0000 (+0000)
Subject: RDMA/ucma: Fix AB-BA deadlock
X-Git-Tag: v3.4-rc1~169^2^2
X-Git-Url: http://git.openpandora.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=186834b5de69a89ba6cc846e7259451ced689b64;p=pandora-kernel.git

RDMA/ucma: Fix AB-BA deadlock

When we destroy a cm_id, we must purge associated events from the
event queue.  If the cm_id is for a listen request, we also purge
corresponding pending connect requests.  This requires destroying
the cm_id's associated with the connect requests by calling
rdma_destroy_id().  rdma_destroy_id() blocks until all outstanding
callbacks have completed.

The issue is that we hold file->mut while purging events from the
event queue.  We also acquire file->mut in our event handler.  Calling
rdma_destroy_id() while holding file->mut can lead to a deadlock,
since the event handler callback cannot acquire file->mut, which
prevents rdma_destroy_id() from completing.

Fix this by moving events to purge from the event queue to a temporary
list.  We can then release file->mut and call rdma_destroy_id()
outside of holding any locks.

Bug report by Or Gerlitz:

    [ INFO: possible circular locking dependency detected ]
    3.3.0-rc5-00008-g79f1e43-dirty #34 Tainted: G I

    tgtd/9018 is trying to acquire lock:
     (&id_priv->handler_mutex){+.+.+.}, at: [] rdma_destroy_id+0x33/0x1f0 [rdma_cm]

    but task is already holding lock:
     (&file->mut){+.+.+.}, at: [] ucma_free_ctx+0xb6/0x196 [rdma_ucm]

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&file->mut){+.+.+.}:
           [] lock_acquire+0xf0/0x116
           [] mutex_lock_nested+0x64/0x2e6
           [] ucma_event_handler+0x148/0x1dc [rdma_ucm]
           [] cma_ib_handler+0x1a7/0x1f7 [rdma_cm]
           [] cm_process_work+0x32/0x119 [ib_cm]
           [] cm_work_handler+0xfb8/0xfe5 [ib_cm]
           [] process_one_work+0x2bd/0x4a6
           [] worker_thread+0x1d6/0x350
           [] kthread+0x84/0x8c
           [] kernel_thread_helper+0x4/0x10

    -> #0 (&id_priv->handler_mutex){+.+.+.}:
           [] __lock_acquire+0x10d5/0x1752
           [] lock_acquire+0xf0/0x116
           [] mutex_lock_nested+0x64/0x2e6
           [] rdma_destroy_id+0x33/0x1f0 [rdma_cm]
           [] ucma_free_ctx+0x117/0x196 [rdma_ucm]
           [] ucma_close+0x77/0xb4 [rdma_ucm]
           [] fput+0x117/0x1cf
           [] filp_close+0x6d/0x78
           [] put_files_struct+0xbd/0x17d
           [] exit_files+0x46/0x4e
           [] do_exit+0x299/0x75d
           [] do_group_exit+0x7e/0xa9
           [] get_signal_to_deliver+0x536/0x555
           [] do_signal+0x39/0x634
           [] do_notify_resume+0x27/0x69
           [] retint_signal+0x46/0x83

    other info that might help us debug this:

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&file->mut);
                                   lock(&id_priv->handler_mutex);
                                   lock(&file->mut);
      lock(&id_priv->handler_mutex);

     *** DEADLOCK ***

    1 lock held by tgtd/9018:
     #0:  (&file->mut){+.+.+.}, at: [] ucma_free_ctx+0xb6/0x196 [rdma_ucm]

    stack backtrace:
    Pid: 9018, comm: tgtd Tainted: G I 3.3.0-rc5-00008-g79f1e43-dirty #34
    Call Trace:
     [] ? console_unlock+0x18e/0x207
     [] print_circular_bug+0x28e/0x29f
     [] __lock_acquire+0x10d5/0x1752
     [] lock_acquire+0xf0/0x116
     [] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm]
     [] mutex_lock_nested+0x64/0x2e6
     [] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm]
     [] ? trace_hardirqs_on_caller+0x11e/0x155
     [] ? trace_hardirqs_on+0xd/0xf
     [] rdma_destroy_id+0x33/0x1f0 [rdma_cm]
     [] ucma_free_ctx+0x117/0x196 [rdma_ucm]
     [] ucma_close+0x77/0xb4 [rdma_ucm]
     [] fput+0x117/0x1cf
     [] filp_close+0x6d/0x78
     [] put_files_struct+0xbd/0x17d
     [] ? put_files_struct+0x22/0x17d
     [] exit_files+0x46/0x4e
     [] do_exit+0x299/0x75d
     [] do_group_exit+0x7e/0xa9
     [] get_signal_to_deliver+0x536/0x555
     [] ? trace_hardirqs_on+0xd/0xf
     [] do_signal+0x39/0x634
     [] ? printk+0x3c/0x45
     [] ? trace_hardirqs_on_caller+0x11e/0x155
     [] ? trace_hardirqs_on+0xd/0xf
     [] ? _raw_spin_unlock_irq+0x2b/0x40
     [] ? set_current_blocked+0x44/0x49
     [] ? retint_signal+0x11/0x83
     [] do_notify_resume+0x27/0x69
     [] ? trace_hardirqs_on_thunk+0x3a/0x3f
     [] retint_signal+0x46/0x83

Signed-off-by: Sean Hefty
Signed-off-by: Roland Dreier
---
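
The fix described in the commit message (move the events to purge onto a temporary list, drop file->mut, then destroy outside any lock) can be illustrated with a small userspace sketch. This is not the kernel patch: it uses pthreads instead of kernel mutexes, and every name in it (struct event, struct file_ctx, destroy_id(), purge_events(), run_demo()) is a hypothetical stand-in chosen for illustration. destroy_id() only plays the role of rdma_destroy_id(), and the mut_held flag is bookkeeping that is meaningful only because the demo is single-threaded.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued event tied to a pending connect. */
struct event {
    struct event *next;
    int listen_id;   /* listener this pending connect request belongs to */
    int destroyed;   /* set once destroy_id() has run for this event */
};

/* Hypothetical stand-in for the per-file context. */
struct file_ctx {
    pthread_mutex_t mut;   /* plays the role of file->mut */
    int mut_held;          /* self-check flag; valid only single-threaded */
    struct event *events;  /* the event queue, singly linked */
};

static void file_lock(struct file_ctx *f)
{
    pthread_mutex_lock(&f->mut);
    f->mut_held = 1;
}

static void file_unlock(struct file_ctx *f)
{
    f->mut_held = 0;
    pthread_mutex_unlock(&f->mut);
}

/* Stand-in for rdma_destroy_id(): in the real code it waits for callbacks
 * that themselves take file->mut, so it must never run with the lock held. */
static void destroy_id(struct file_ctx *f, struct event *ev)
{
    assert(!f->mut_held);
    ev->destroyed = 1;
}

/* Purge every event for listen_id; returns how many were destroyed. */
static int purge_events(struct file_ctx *f, int listen_id)
{
    struct event *purge = NULL, **pp, *ev;
    int n = 0;

    /* Step 1: under the lock, only unlink matching events. */
    file_lock(f);
    pp = &f->events;
    while ((ev = *pp) != NULL) {
        if (ev->listen_id == listen_id) {
            *pp = ev->next;      /* remove from the event queue */
            ev->next = purge;    /* push onto the temporary list */
            purge = ev;
        } else {
            pp = &ev->next;
        }
    }
    file_unlock(f);

    /* Step 2: with no lock held, do the potentially blocking destroy. */
    while ((ev = purge) != NULL) {
        purge = ev->next;
        destroy_id(f, ev);
        n++;
    }
    return n;
}

/* Builds a three-event queue, purges listener 1, reports the count. */
int run_demo(void)
{
    struct event e[3] = {
        { .listen_id = 1 }, { .listen_id = 2 }, { .listen_id = 1 },
    };
    struct file_ctx f = { .events = &e[0] };
    int n;

    pthread_mutex_init(&f.mut, NULL);
    e[0].next = &e[1];
    e[1].next = &e[2];
    n = purge_events(&f, 1);
    pthread_mutex_destroy(&f.mut);

    /* Only the event for listener 2 should still be queued. */
    assert(f.events == &e[1] && e[1].next == NULL && !e[1].destroyed);
    return n;
}
```

Calling run_demo() purges the two events that belong to listener 1 and leaves the third queued. The point of the two-step shape is that the lock protects only the list manipulation, so a handler that needs the same lock can always make progress while the destroy step waits for it.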