pandora-kernel.git
9 years agoMerge branch 'for-jens' of http://evilpiepirate.org/git/linux-bcache into for-3.17...
Jens Axboe [Tue, 5 Aug 2014 17:02:05 +0000 (11:02 -0600)]
Merge branch 'for-jens' of evilpiepirate.org/git/linux-bcache into for-3.17/drivers

Kent writes:

Hey Jens, here's the pull request for 3.17 - typically late, but lots of
tasty fixes in this one.

9 years agobcache: Drop unneeded blk_sync_queue() calls
Kent Overstreet [Mon, 7 Jul 2014 20:03:36 +0000 (13:03 -0700)]
bcache: Drop unneeded blk_sync_queue() calls

blk_sync_queue() is only needed for the queue/block device we created (and
it's done by blk_cleanup_queue(), which we do call) - calling it for the
block devices we merely opened is pointless.

Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2

9 years agobcache: add mutex lock for bch_is_open
Jianjian Huo [Sun, 13 Jul 2014 16:08:59 +0000 (09:08 -0700)]
bcache: add mutex lock for bch_is_open

Since bch_is_open will iterate the linked lists bch_cache_sets and
uncached_devices, it needs bch_register_lock.
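
A minimal sketch of the locking shape (the actual change wraps the existing
call site; bch_is_open_cache()/bch_is_open_backing() are bcache's helpers):

    static bool bch_is_open(struct block_device *bdev)
    {
            bool ret;

            /* bch_cache_sets and uncached_devices are iterated here;
             * bch_register_lock keeps both lists stable meanwhile */
            mutex_lock(&bch_register_lock);
            ret = bch_is_open_cache(bdev) || bch_is_open_backing(bdev);
            mutex_unlock(&bch_register_lock);

            return ret;
    }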

Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
9 years agobcache: Correct printing of btree_gc_max_duration_ms
Surbhi Palande [Thu, 17 Apr 2014 19:07:04 +0000 (12:07 -0700)]
bcache: Correct printing of btree_gc_max_duration_ms

time_stats::btree_gc_max_duration_ms is not bit shifted by 8

Fixes BUG #138

Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
Signed-off-by: Surbhi Palande <sap@daterainc.com>
9 years agobcache: try to set b->parent properly
Slava Pestov [Sat, 12 Jul 2014 07:22:53 +0000 (00:22 -0700)]
bcache: try to set b->parent properly

bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket size
before; now it passes.

Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1

9 years agobcache: fix memory corruption in init error path
Slava Pestov [Thu, 19 Jun 2014 22:05:59 +0000 (15:05 -0700)]
bcache: fix memory corruption in init error path

If register_cache_set() failed, we would touch ca->set after
it had already been freed. Also, fix an assertion to catch
this.

Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d

9 years agobcache: fix crash with incomplete cache set
Slava Pestov [Fri, 11 Jul 2014 19:17:41 +0000 (12:17 -0700)]
bcache: fix crash with incomplete cache set

Change-Id: I6abde52afe917633480caaf4e2518f42a816d886

9 years agobcache: Fix more early shutdown bugs
Kent Overstreet [Thu, 12 Jun 2014 02:44:49 +0000 (19:44 -0700)]
bcache: Fix more early shutdown bugs

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: fix use-after-free in btree_gc_coalesce()
Slava Pestov [Sun, 13 Jul 2014 04:53:11 +0000 (21:53 -0700)]
bcache: fix use-after-free in btree_gc_coalesce()

If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
new_nodes[0] again. This was generating a lockdep warning. The fix is
to set new_nodes[0] to NULL, since the out_nocoalesce path safely
ignores NULL entries in the new_nodes array.

This regression was introduced in 2d7f9531.
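
A sketch of the fix's shape (simplified from btree_gc_coalesce(); the
cleanup calls in the real error path differ slightly):

    btree_node_free(new_nodes[0]);
    new_nodes[0] = NULL;                    /* poison the freed slot */
    ...
    out_nocoalesce:
            for (i = 0; i < nodes; i++)
                    if (new_nodes[i])       /* NULL entries are skipped */
                            btree_node_free(new_nodes[i]);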

Change-Id: I76564d7257800583214376b4bacf236cda90c89c

9 years agobcache: Fix an infinite loop in journal replay
Kent Overstreet [Mon, 2 Jun 2014 22:39:44 +0000 (15:39 -0700)]
bcache: Fix an infinite loop in journal replay

When running with multiple cache devices, if one of the devices had a
completely empty journal but we'd already found some journal entries on a
previous device, we'd go into an infinite loop.

Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: fix crash in bcache_btree_node_alloc_fail tracepoint
Slava Pestov [Fri, 23 May 2014 18:18:35 +0000 (11:18 -0700)]
bcache: fix crash in bcache_btree_node_alloc_fail tracepoint

'b' was NULL.

Change-Id: Icac0fd04afa2d23f213d96d51afd53374e6dd0c0

9 years agobcache: bcache_write tracepoint was crashing
Slava Pestov [Thu, 22 May 2014 19:14:24 +0000 (12:14 -0700)]
bcache: bcache_write tracepoint was crashing

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: fix typo in bch_bkey_equal_header
Slava Pestov [Tue, 1 Jul 2014 05:31:20 +0000 (22:31 -0700)]
bcache: fix typo in bch_bkey_equal_header

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: Allocate bounce buffers with GFP_NOWAIT
Kent Overstreet [Mon, 19 May 2014 15:55:40 +0000 (08:55 -0700)]
bcache: Allocate bounce buffers with GFP_NOWAIT

There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.

Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c

9 years agobcache: Make sure to pass GFP_WAIT to mempool_alloc()
Kent Overstreet [Mon, 19 May 2014 15:57:55 +0000 (08:57 -0700)]
bcache: Make sure to pass GFP_WAIT to mempool_alloc()

This was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback;
circuits must've gotten crossed when writing this code or something.
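
The distinction, sketched (not the actual call sites; in kernels of this
era GFP_NOIO includes __GFP_WAIT, and the fallback is hypothetical):

    /* may fail immediately - only safe where a fallback exists */
    obj = mempool_alloc(pool, GFP_NOWAIT);
    if (!obj)
            obj = use_fallback_path();

    /* with __GFP_WAIT set, mempool_alloc() may sleep but cannot fail */
    obj = mempool_alloc(pool, GFP_NOIO);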

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: fix uninterruptible sleep in writeback thread
Slava Pestov [Thu, 1 May 2014 20:48:57 +0000 (13:48 -0700)]
bcache: fix uninterruptible sleep in writeback thread

There were two issues here:

- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running

Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.
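
A sketch of the intended loop shape (simplified; not the exact bcache
writeback code, and dirty_data_pending()/write_some_dirty_data() are
hypothetical stand-ins):

    while (!kthread_should_stop()) {
            set_current_state(TASK_INTERRUPTIBLE);  /* was UNINTERRUPTIBLE */
            if (!dirty_data_pending(dc)) {
                    /* TASK_UNINTERRUPTIBLE here counts toward load
                     * average and can trigger hung-task warnings */
                    schedule();
                    continue;
            }
            __set_current_state(TASK_RUNNING);
            write_some_dirty_data(dc);
    }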

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: wait for buckets when allocating new btree root
Slava Pestov [Tue, 22 Apr 2014 01:23:12 +0000 (18:23 -0700)]
bcache: wait for buckets when allocating new btree root

Tested:
- sometimes bcache_tier test would hang on startup with a failure
  to allocate the btree root -- no longer seeing this

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: fix crash on shutdown in passthrough mode
Slava Pestov [Tue, 20 May 2014 19:20:28 +0000 (12:20 -0700)]
bcache: fix crash on shutdown in passthrough mode

We never started the writeback thread in this case, so don't stop it.

9 years agobcache: fix lockdep warnings on shutdown
Slava Pestov [Tue, 29 Apr 2014 22:39:27 +0000 (15:39 -0700)]
bcache: fix lockdep warnings on shutdown

9 years agobcache allocator: send discards with correct size
Slava Pestov [Tue, 22 Apr 2014 01:22:35 +0000 (18:22 -0700)]
bcache allocator: send discards with correct size

9 years agobcache: Fix to remove the rcu_sched stalls.
Surbhi Palande [Thu, 10 Apr 2014 23:09:51 +0000 (16:09 -0700)]
bcache: Fix to remove the rcu_sched stalls.

The while loop was executing infinitely.
This fix ends the loop gracefully.

Signed-off-by: Surbhi Palande <sap@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: Fix a journal replay bug
Kent Overstreet [Fri, 11 Apr 2014 00:58:49 +0000 (17:58 -0700)]
bcache: Fix a journal replay bug

Journal replay wasn't validating pointers with bch_extent_invalid() before
dereferencing them; fixed.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agobcache: Fix a bug when detaching
Kent Overstreet [Thu, 20 Mar 2014 00:49:37 +0000 (17:49 -0700)]
bcache: Fix a bug when detaching

After detaching a backing device from a cache set, a bit wasn't getting
reset meaning the second detach wouldn't work correctly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
9 years agodrbd: silence underflow warning in read_in_block()
Dan Carpenter [Tue, 6 May 2014 11:28:32 +0000 (14:28 +0300)]
drbd: silence underflow warning in read_in_block()

My static checker warns that "data_size" could be negative and underflow
the limit check.  The code looks suspicious but I don't know if it is a
real bug.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: implicitly truncate cpu-mask
Lars Ellenberg [Mon, 19 May 2014 07:52:02 +0000 (09:52 +0200)]
drbd: implicitly truncate cpu-mask

Don't error out with misleading "out of memory"
if the cpu-mask has more bits set than there are CPUs.
Just truncate to nr_cpu_ids implicitly.
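
For illustration only (not the actual drbd parsing code; NBITS and the
mask variables are hypothetical), truncation could look like:

    /* keep only the low nr_cpu_ids bits of the user-supplied mask */
    DECLARE_BITMAP(allowed, NBITS);

    bitmap_zero(allowed, NBITS);
    bitmap_set(allowed, 0, nr_cpu_ids);
    bitmap_and(new_cpu_mask, user_mask, allowed, NBITS);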

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: drop spurious parameters from _drbd_md_sync_page_io
Lars Ellenberg [Tue, 1 Apr 2014 22:24:18 +0000 (00:24 +0200)]
drbd: drop spurious parameters from _drbd_md_sync_page_io

size is always 4096,
page is always device->md_io.page.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: resync should only lock out specific ranges
Lars Ellenberg [Wed, 7 May 2014 20:41:28 +0000 (22:41 +0200)]
drbd: resync should only lock out specific ranges

During resync, if we need to block some specific incoming write because
of active resync requests to that same range, we potentially caused
*all* new application writes (to "cold" activity log extents) to block
until this one request has been processed.

Improve the do_submit() logic to
 * grab all incoming requests to some "incoming" list
 * process this list
   - move aside requests that are blocked by resync
   - prepare activity log transactions,
   - commit transactions and submit corresponding requests
   - if there are remaining requests that only wait for
     activity log extents to become free, stop the fast path
     (mark activity log as "starving")
   - iterate until no more requests are waiting for the activity log,
     but all potentially remaining requests are only blocked by resync
 * only then grab new incoming requests

That way, very busy IO on currently "hot" activity log extents cannot
starve scattered IO to "cold" extents. And blocked-by-resync requests
are processed once resync traffic on the affected region has ceased,
without blocking anything else.

The only blocking mode left is when we cannot start requests to "cold"
extents because all currently "hot" extents are actually used.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: add per device data_gen_id
Lars Ellenberg [Wed, 14 May 2014 18:33:05 +0000 (20:33 +0200)]
drbd: debugfs: add per device data_gen_id

The data generation identifiers used to be exposed via sysfs
at /sys/block/drbdX/drbd/meta_data/data_gen_id (out-of-tree),
for advanced policy scripting.
Bring that information over to debugfs.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: add per connection oldest requests
Lars Ellenberg [Wed, 14 May 2014 18:13:36 +0000 (20:13 +0200)]
drbd: debugfs: add per connection oldest requests

Information of former /sys/block/drbdX/drbd/oldest_requests
is already with higher detail in these files:
 debugfs/drbd/resource/$name/in_flight_summary,
 debugfs/drbd/resource/$name/volumes/$vnr/oldest_requests

This patch adds
 debugfs/drbd/resource/$name/connections/peer/oldest_requests

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: add version tag to debugfs files
Lars Ellenberg [Tue, 6 May 2014 13:05:23 +0000 (15:05 +0200)]
drbd: debugfs: add version tag to debugfs files

Make the first line of debugfs files a version number,
starting now with "v: 0".

If we change content or presentation, we will bump that.
Monitoring or diagnostic scripts that parse these files
can then easily know when they need to be reviewed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: add per volume oldest_requests
Lars Ellenberg [Thu, 8 May 2014 11:39:35 +0000 (13:39 +0200)]
drbd: debugfs: add per volume oldest_requests

Show oldest requests
 * pending master bio completion and,
 * if different, local disk bio completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: add callback_history
Lars Ellenberg [Tue, 6 May 2014 13:02:05 +0000 (15:02 +0200)]
drbd: debugfs: add callback_history

Add a per-connection worker thread callback_history
with timing details, call site and callback function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: Add in_flight_summary
Lars Ellenberg [Mon, 5 May 2014 21:05:47 +0000 (23:05 +0200)]
drbd: debugfs: Add in_flight_summary

* Add details about pending meta data operations to in_flight_summary.

* Report number of requests waiting for activity log transactions.

* Add timing details of peer_requests to in_flight_summary.

* FLUSH details
  DRBD divides the incoming request stream into "epochs",
  in which peers are allowed to re-order writes independently.

  These epochs are separated by P_BARRIER on the replication link.
  Such barrier packets, depending on configuration, may cause
  the receiving side to drain the lower level device request queues
  and call blkdev_issue_flush().

  This is known to be another major source of latency in DRBD.

  Track timing details of calls to blkdev_issue_flush(),
  and add them to in_flight_summary.

* data socket stats
  To be able to diagnose bottlenecks and root causes of "slow" IO on DRBD,
  it is useful to see network buffer stats along with the timing details of
  requests, peer requests, and meta data IO.

* Add pending bitmap IO timing details to in_flight_summary.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: deal with destructor racing with open of debugfs file
Lars Ellenberg [Mon, 5 May 2014 12:05:54 +0000 (12:05 +0000)]
drbd: debugfs: deal with destructor racing with open of debugfs file

Try to close the race between open() and debugfs_remove_recursive()
from inside an object destructor.
Once open succeeds, the object should stay around.
Open should not succeed if the object has already reached its destructor.

This may be overkill, but to make that happen, we check for existence of
a parent directory, "stale-ness" of "this" dentry, and serialize
kref_get_unless_zero() on the outermost object relevant for this file
with d_delete() on this dentry (using the parent's i_mutex).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: add in_flight_summary data
Lars Ellenberg [Fri, 2 May 2014 11:20:05 +0000 (13:20 +0200)]
drbd: debugfs: add in_flight_summary data

To help diagnose "high latency" or "hung" IO situations on DRBD,
present a per-resource summary of operations currently in progress.

First item is a list of oldest drbd_request objects
waiting for various things:
 * still being prepared
 * waiting for activity log transaction
 * waiting for local disk
 * waiting to be sent
 * waiting for peer acknowledgement ("receive ack", "write ack")
 * waiting for peer epoch acknowledgement ("barrier ack")

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: debugfs: add basic hierarchy
Lars Ellenberg [Fri, 2 May 2014 11:19:51 +0000 (13:19 +0200)]
drbd: debugfs: add basic hierarchy

Add new debugfs hierarchy /sys/kernel/debug/
  drbd/
    resources/
      $resource_name/connections/peer/$volume_number/
      $resource_name/volumes/$volume_number/
    minors/$minor_number -> ../resources/$resource_name/volumes/$volume_number/

Followup commits will populate this hierarchy with files containing
statistics, diagnostic information and some attribute data.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: track details of bitmap IO
Lars Ellenberg [Mon, 5 May 2014 22:44:59 +0000 (00:44 +0200)]
drbd: track details of bitmap IO

Track start and submit time of bitmap operations, and
add pending bitmap IO contexts to a new pending_bitmap_io list.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: register peer requests on read_ee early
Lars Ellenberg [Thu, 8 May 2014 08:08:05 +0000 (10:08 +0200)]
drbd: register peer requests on read_ee early

Initialize peer_request with timestamp and proper empty list head.
Add peer_request to list early, so debugfs can find this request and
report it as "preparing", even if we sleep before we actually submit it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: track timing details of peer_requests
Lars Ellenberg [Mon, 5 May 2014 21:42:24 +0000 (23:42 +0200)]
drbd: track timing details of peer_requests

To be able to present timing details in debugfs,
we need to track preparation/submit times of peer requests.

Track peer request flags early,
before they are put on the epoch_entry lists.

Waiting for activity log transactions may be a major latency factor.
We want to be able to present the peer_request state accurately in
debugfs, and what it is waiting for.

Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO.
Set it only *after* calling drbd_al_begin_io(),
clear it as soon as we call drbd_al_complete_io().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: improve throttling decisions of background resynchronisation
Lars Ellenberg [Fri, 20 Dec 2013 10:22:13 +0000 (11:22 +0100)]
drbd: improve throttling decisions of background resynchronisation

Background resynchronisation does some "side-stepping", or throttles
itself, if it detects application IO activity and the current resync
rate estimate is above the configured "c-min-rate".

What was not detected: if there is no application IO,
because it blocks on activity log transactions.

Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests,
and count non-zero as application IO activity.
This counter is exposed at proc_details level 2 and above.

Also make sure to release the currently locked resync extent
if we side-step due to such voluntary throttling.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: add caching oldest request pointers for replication stages
Lars Ellenberg [Fri, 22 Nov 2013 12:00:12 +0000 (13:00 +0100)]
drbd: add caching oldest request pointers for replication stages

A request that is to be shipped to the peer goes through a few stages:
- queued
- sent, waiting for ack
- ack received, waiting for "barrier ack", which is re-order epoch being
  closed on the peer by acknowledging a "cache flush" equivalent
  on the lower level device.

In the latter two stages, depending on protocol, we may have already
completed this request to the upper layers, so it won't be found anymore
on device->pending_master_completion[] lists.

Track the oldest request yet to be sent (req_next), the oldest not yet
acknowledged (req_ack_pending) and the oldest "still waiting for
something from the peer" (req_not_net_done), doing short list walks on
the transfer log to find the next pending one whenever such a request
makes progress.

Now we have a fast way to look up the oldest requests,
don't do a transfer log walk every time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: add lists to find oldest pending requests
Lars Ellenberg [Fri, 22 Nov 2013 11:52:03 +0000 (12:52 +0100)]
drbd: add lists to find oldest pending requests

Adding requests to per-device fifo lists as soon as possible after
allocating them leaves a simple list_first_entry_or_null() to find the
oldest request, regardless of what it is still waiting for.
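
The pattern, sketched with hypothetical member names:

    /* at allocation time - newest requests always go to the tail */
    list_add_tail(&req->pending_list, &device->pending_requests);

    /* finding the oldest is then O(1) */
    oldest = list_first_entry_or_null(&device->pending_requests,
                                      struct drbd_request, pending_list);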

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: gather detailed timing statistics for drbd_requests
Lars Ellenberg [Fri, 22 Nov 2013 11:32:01 +0000 (12:32 +0100)]
drbd: gather detailed timing statistics for drbd_requests

Record (in jiffies) how much time a request spends in which stages.
Followup commits will use and present this additional timing information
so we can better locate and tackle the root causes of latency spikes,
or present the backlog for asynchronous replication.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: track meta data IO intent, start and submit time
Lars Ellenberg [Tue, 1 Apr 2014 21:53:30 +0000 (23:53 +0200)]
drbd: track meta data IO intent, start and submit time

For diagnostic purposes, track intent, start time
and latest submit time of meta data IO.

Move separate members from struct drbd_device
into the embedded struct drbd_md_io.
s/md_io_(page|in_use)/md_io.\1/

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: fix drbd_destroy_device reference count updates
Lars Ellenberg [Wed, 14 May 2014 19:34:47 +0000 (21:34 +0200)]
drbd: fix drbd_destroy_device reference count updates

drbd_destroy_device means to give up reference counts
on the connection(s) reachable via the peer_device(s).

It must not do that by iterating via device->resource->connections,
resource and connections may have already been disassociated
by drbd_free_resource, and we'd leak connection refs.

Instead, iterate via device->peer_devices->connection.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: poison free'd device, resource and connection structs
Lars Ellenberg [Wed, 14 May 2014 19:35:21 +0000 (21:35 +0200)]
drbd: poison free'd device, resource and connection structs

Now that we have additional asynchronous kref_get/kref_put
via debugfs, make sure we catch access after free.

Poison struct drbd_device, drbd_connection and drbd_resource
before kfree() with 0xfd, 0xfc, and 0xf2, respectively.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: also keep track of trim -> zero-out fallback peer_requests
Lars Ellenberg [Wed, 23 Apr 2014 10:25:23 +0000 (12:25 +0200)]
drbd: also keep track of trim -> zero-out fallback peer_requests

To be able to find and present such zero-out fallback peer_requests
in debugfs, we add those to "active_ee" once that list has drained.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: consistently use list_add_tail for peer_request tracking
Lars Ellenberg [Wed, 23 Apr 2014 10:15:35 +0000 (12:15 +0200)]
drbd: consistently use list_add_tail for peer_request tracking

Keep the epoch entry lists (active_ee, read_ee, sync_ee, ...)
consistently "oldest first".  That way finding the oldest not yet
successfully processed request is simply list_first_entry_or_null.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: drop drbd_md_flush
Lars Ellenberg [Fri, 25 Apr 2014 11:27:50 +0000 (13:27 +0200)]
drbd: drop drbd_md_flush

The only user of drbd_md_flush was bm_rw(),
and it is always followed by either a drbd_md_sync()
or an al_write_transaction(), both of which, if so configured,
end up submitting a FLUSH|FUA request anyway.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: add drbd_queue_work_if_unqueued helper
Lars Ellenberg [Mon, 28 Apr 2014 09:43:21 +0000 (11:43 +0200)]
drbd: add drbd_queue_work_if_unqueued helper

We sometimes do
    if (list_empty(&w.list))
            drbd_queue_work(&q, &w.list);

Removal (list_del_init) may happen outside all locks, after all
pending work entries have been moved to an on-stack local work list.

For embedded (not dynamically allocated) work structs,
we must avoid re-adding until the struct really was removed.

Move that list_empty check inside the spin_lock(&q->q_lock)
within the helper function, and change to list_empty_careful().

This may have been the reason for a list_add corruption
inside drbd_queue_work().
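
A sketch of how such a helper can look (simplified; the real function's
context and locking flavor may differ):

    static void drbd_queue_work_if_unqueued(struct drbd_work_queue *q,
                                            struct drbd_work *w)
    {
            unsigned long flags;

            spin_lock_irqsave(&q->q_lock, flags);
            /* list_empty_careful(): tolerates a concurrent
             * list_del_init() done outside this lock */
            if (list_empty_careful(&w->list))
                    list_add_tail(&w->list, &q->q);
            spin_unlock_irqrestore(&q->q_lock, flags);
            wake_up(&q->q_wait);
    }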

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: drbd_rs_number_requests: fix unit mismatch in comparison
Lars Ellenberg [Tue, 22 Apr 2014 14:37:16 +0000 (16:37 +0200)]
drbd: drbd_rs_number_requests: fix unit mismatch in comparison

We try to limit the number of "in-flight" resync requests.
One condition for that is the amount of requested data should not exceed
half of what can be covered by our "max-buffers" setting.

However, we compared the number of 4k pages with the number of in-flight
512-byte sectors, so this extra throttle triggered much earlier than intended.
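
One way to compare like with like (a sketch of the fix's shape; the
variable names follow the description above, not necessarily the exact code):

    /* mxb counts 4k pages, rs_in_flight counts 512-byte sectors:
     * divide sectors by 8 before comparing */
    if (mxb - device->rs_in_flight / 8 < number)
            number = mxb - device->rs_in_flight / 8;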

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: cosmetic: change all printk(level, ...) to pr_<level>(...)
Lars Ellenberg [Thu, 27 Mar 2014 13:10:55 +0000 (14:10 +0100)]
drbd: cosmetic: change all printk(level, ...) to pr_<level>(...)

Cosmetic change only.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: clear CRASHED_PRIMARY only after successful resync
Lars Ellenberg [Thu, 27 Mar 2014 09:34:13 +0000 (10:34 +0100)]
drbd: clear CRASHED_PRIMARY only after successful resync

If we lost a disk during the first resync after a primary crash,
we could have prematurely cleared the CRASHED_PRIMARY flag.
Testing for C_CONNECTED is not what was meant there;
we need to test that both peers become D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: improve resync request throttling due to sendbuf size
Lars Ellenberg [Fri, 31 Jan 2014 13:55:12 +0000 (14:55 +0100)]
drbd: improve resync request throttling due to sendbuf size

If we throttle resync because the socket sendbuffer is filling up,
tell TCP about it, so it may expand the sendbuffer for us.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agoblock: Convert last uses of __FUNCTION__ to __func__
Joe Perches [Tue, 25 Mar 2014 19:35:05 +0000 (12:35 -0700)]
block: Convert last uses of __FUNCTION__ to __func__

Just about all of these have been converted to __func__,
so convert the last uses.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrivers/block: Use RCU_INIT_POINTER(x, NULL) in drbd/drbd_state.c
Monam Agarwal [Sat, 22 Mar 2014 20:32:29 +0000 (02:02 +0530)]
drivers/block: Use RCU_INIT_POINTER(x, NULL) in drbd/drbd_state.c

This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL).

rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
In the case of a NULL pointer, there is no structure to initialize,
so rcu_assign_pointer(p, NULL) can safely be converted to RCU_INIT_POINTER(p, NULL).
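
In code (a generic example, not a specific drbd_state.c hunk):

    /* publishing a fully initialized object needs the write barrier */
    rcu_assign_pointer(gp, new_obj);

    /* storing NULL has nothing to order, so the cheaper form suffices */
    RCU_INIT_POINTER(gp, NULL);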

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: short-circuit in maybe_pull_ahead
Lars Ellenberg [Thu, 20 Mar 2014 13:04:35 +0000 (14:04 +0100)]
drbd: short-circuit in maybe_pull_ahead

If we already "pulled ahead", we can short-circuit,
and avoid logging the same messages over and over again.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: application writes may set-in-sync in protocol != C
Lars Ellenberg [Thu, 20 Mar 2014 10:19:22 +0000 (11:19 +0100)]
drbd: application writes may set-in-sync in protocol != C

If "dirty" blocks are written to during resync,
that brings them in-sync.

By explicitly requesting write-acks during resync even in protocol != C,
we now can actually respect this.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: New net configuration option socket-check-timeout
Philipp Reisner [Tue, 18 Mar 2014 13:24:35 +0000 (14:24 +0100)]
drbd: New net configuration option socket-check-timeout

In setups involving a DRBD proxy and connections that experience a lot of
buffer-bloat, it might be necessary to set ping-timeout to an
unusually high value. By default DRBD uses the same value to wait until a newly
established TCP connection is stable. Since the DRBD proxy is usually located
in the same data center, such a long wait time may hinder DRBD's connect process.

In such setups, socket-check-timeout should be set to
at least the round-trip time between DRBD and the DRBD proxy, i.e. in most
cases to 1.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: Limit the time we are waiting for the first packet on an accepted socket
Philipp Reisner [Tue, 18 Mar 2014 13:40:13 +0000 (14:40 +0100)]
drbd: Limit the time we are waiting for the first packet on an accepted socket

Before the patch
'drbd: Keep the listening socket open while trying to connect to the peer'

the newly created socket inherited the receive timeout from the listen
socket. The listen socket had a receive timeout of connect-interval
+/- 30% random jitter.

The real issue is that after the mentioned patch we had no timeout at all.
Now use 4 times the ping-timeout.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: implement csums-after-crash-only
Lars Ellenberg [Tue, 18 Mar 2014 11:30:09 +0000 (12:30 +0100)]
drbd: implement csums-after-crash-only

Checksum based resync trades CPU cycles for network bandwidth,
in situations where we expect much of the to-be-resynced blocks
to be actually identical on both sides already.

In a "network hickup" scenario, it won't help:
all to-be-resynced blocks will typically be different.

The use case is for the resync of *potentially* different blocks
after crash recovery -- the crash recovery had marked larger areas
(those covered by the activity log) as need-to-be-resynced,
just in case. Most of those blocks will be identical.

This option makes it possible to configure checksum based resync,
but only actually use it for the first resync after primary crash.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: don't implicitly resize Diskless node beyond end of device
Lars Ellenberg [Tue, 18 Mar 2014 11:22:14 +0000 (12:22 +0100)]
drbd: don't implicitly resize Diskless node beyond end of device

During handshake, we compare backend sizes, and user set limits,
and agree on what device size we are going to expose.

We remember that last-agreed-size in our meta data.

But if we come up diskless, we have to accept what the peer
presents us with. We used to accept the peer's maximum potential
capacity (backend size), which is wrong and could lead to IO errors
due to access beyond the end of the device.

Instead, we need to accept the peer's current size.
Unless that is communicated as 0, in which case we
accept the backend size, or the user set limit, if set.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: fix bogus resync stats in /proc/drbd
Lars Ellenberg [Tue, 11 Mar 2014 12:47:55 +0000 (13:47 +0100)]
drbd: fix bogus resync stats in /proc/drbd

We intentionally do not serialize /proc/drbd access with
internal state changes or statistic updates.

Because of that, cat /proc/drbd may race with a resync just being
finished, still see the sync state, and find information about the
number of blocks still to go, but then find that the total number
of blocks within this resync has just been reset to 0
when accessing it.

This now produces bogus numbers in the resync speed estimates.

Fix by accessing all relevant data only once,
and fixing it up if "still to go" happens to be more than "total".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: Remove unnecessary/unused code
Andreas Gruenbacher [Thu, 27 Feb 2014 19:49:54 +0000 (20:49 +0100)]
drbd: Remove unnecessary/unused code

Get rid of dump_stack() debug statements.

There is no point whatsoever in registering and unregistering a reboot
notifier that doesn't do anything.

The intention was to switch to an "emergency read-only" mode,
so we won't have to resync the full activity log just because
we had been Primary before the reboot.

Once we have that implemented, we may re-introduce the reboot notifier.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: silence -Wmissing-prototypes warnings
Lars Ellenberg [Thu, 27 Feb 2014 08:46:18 +0000 (09:46 +0100)]
drbd: silence -Wmissing-prototypes warnings

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: drop wrong debugging aid
Lars Ellenberg [Wed, 26 Feb 2014 23:02:21 +0000 (00:02 +0100)]
drbd: drop wrong debugging aid

The textual representation of resync extents in /proc/drbd presented
with proc_details >= 3 was wrong, it used bitnumbers as bitmasks.

It was not particularly useful either, and I doubt anyone has even tried
to look at it in the last few years. Drop it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: get rid of drbd_queue_work_front
Lars Ellenberg [Tue, 11 Feb 2014 10:15:36 +0000 (11:15 +0100)]
drbd: get rid of drbd_queue_work_front

The last user was al_write_transaction, if called with "delegate",
and the last user to call it with "delegate = true" was the receiver
thread, which has no need to delegate but can call it directly.

Finally drop the delegate parameter, drop the extra
w_al_write_transaction callback, and drop drbd_queue_work_front.

Do not (yet) change dequeue_work_item to dequeue_work_batch, though.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: use drbd_device_post_work() in more places
Lars Ellenberg [Tue, 11 Feb 2014 08:47:58 +0000 (09:47 +0100)]
drbd: use drbd_device_post_work() in more places

This replaces the md_sync_work member of struct drbd_device
by a new MD_SYNC "work bit" in device->flags.

This replaces the resync_start_work member of struct drbd_device
by a new RS_START "work bit" in device->flags.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: make sure disk cleanup happens in worker context
Lars Ellenberg [Tue, 11 Feb 2014 08:30:49 +0000 (09:30 +0100)]
drbd: make sure disk cleanup happens in worker context

The recent fix to put_ldev() (correct ordering of access to local_cnt
and state.disk; memory barrier in __drbd_set_state) guarantees
that the cleanup happens exactly once.

However it does not yet guarantee that the cleanup happens from worker
context, the last put_ldev() may still happen from atomic context,
which must not happen: blkdev_put() may sleep.

Fix this by scheduling the cleanup to the worker instead,
using a couple more bits in device->flags and a new helper,
drbd_device_post_work().

Generalized the "resync progress" work to cover these new work bits.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: close race when detaching from disk
Lars Ellenberg [Tue, 11 Feb 2014 07:57:18 +0000 (08:57 +0100)]
drbd: close race when detaching from disk

BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: bd_release+0x21/0x70
Process drbd_w_t7146
Call Trace:
 close_bdev_exclusive
 drbd_free_ldev [drbd]
 drbd_ldev_destroy [drbd]
 w_after_state_ch [drbd]

Race probably went like this:
  state.disk = D_FAILED

... first one to hit zero during D_FAILED:
   put_ldev() /* ----------------> 0 */
     i = atomic_dec_return()
     if (i == 0)
       if (state.disk == D_FAILED)
         schedule_work(go_diskless)
                                /* 1 <------ */ get_ldev_if_state()
   go_diskless()
      do_some_pre_cleanup()                     corresponding put_ldev():
      force_state(D_DISKLESS)   /* 0 <------ */ i = atomic_dec_return()
                                                if (i == 0)
        atomic_inc() /* ---------> 1 */
        state.disk = D_DISKLESS
        schedule_work(after_state_ch)           /* execution pre-empted by IRQ ? */

   after_state_ch()
     put_ldev()
       i = atomic_dec_return()  /* 0 */
       if (i == 0)
         if (state.disk == D_DISKLESS)            if (state.disk == D_DISKLESS)
           drbd_ldev_destroy()                      drbd_ldev_destroy();

Trying to fix this by checking the disk state *before* the
atomic_dec_return(), which implies memory barriers, and by inserting
extra memory barriers around the state assignment in __drbd_set_state().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: explicitly submit meta data requests with REQ_NOIDLE
Lars Ellenberg [Tue, 11 Feb 2014 07:56:53 +0000 (08:56 +0100)]
drbd: explicitly submit meta data requests with REQ_NOIDLE

For some reason we have assumed NOIDLE was implied
by one of the other flags we set. It is not (anymore?).
Explicitly set REQ_NOIDLE for synchronous meta data updates,
or we can seriously starve random writes when using CFQ.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: move set_disk_ro() to after we persisted the new role
Lars Ellenberg [Wed, 5 Feb 2014 05:28:08 +0000 (06:28 +0100)]
drbd: move set_disk_ro() to after we persisted the new role

This probably does not have any real life impact,
but we should first persist any potentially new UUID
and other meta data flags, as well as our new role,
before we allow/disallow write access.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: trigger tcp_push_pending_frames() for PING and PING_ACK
Lars Ellenberg [Wed, 5 Feb 2014 05:13:53 +0000 (06:13 +0100)]
drbd: trigger tcp_push_pending_frames() for PING and PING_ACK

This should reduce latency for such in-DRBD-protocol "pings",
and may help reduce spurious disconnect/reconnect cycles due to
 "PingAck did not arrive in time."

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: re-add lost conf_mutex protection in drbd_set_role
Lars Ellenberg [Wed, 5 Feb 2014 05:17:01 +0000 (06:17 +0100)]
drbd: re-add lost conf_mutex protection in drbd_set_role

The conf_update mutex used to be held while clearing the
net_conf->discard_my_data flag inside drbd_set_role.

It was moved into drbd_adm_set_role with
    drbd: allow parallel promote/demote actions
but then replaced at that location by the newly introduced adm_mutex with
    drbd: Fix a potential deadlock in drbdsetup, introduce resource->adm_mutex

And I simply forgot to put it back in at the original location.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: stop the meta data sync timer before open coded meta data sync
Lars Ellenberg [Mon, 27 Jan 2014 15:04:14 +0000 (16:04 +0100)]
drbd: stop the meta data sync timer before open coded meta data sync

If we re-write all meta data due to resize, we have open-coded write-out
of our meta data super block. Stop the md_sync_timer, it would just
trigger scary but in this case spurious "timer expired" messages.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: fix resync finished detection
Lars Ellenberg [Mon, 27 Jan 2014 14:58:22 +0000 (15:58 +0100)]
drbd: fix resync finished detection

This fixes one recent regression,
and one long-standing bug.

The bug:
drbd_try_clear_on_disk_bm() assumed that all "count" bits have to be
accounted in the resync extent corresponding to the start sector.

Since we allow application requests to cross our "extent" boundaries,
this assumption is no longer true, resulting in possible misaccounting,
scary messages
("BAD! sector=12345s enr=6 rs_left=-7 rs_failed=0 count=58 cstate=..."),
and potentially, if the last bit to be cleared during resync would
reside in previously misaccounted resync extent, the resync would never
be recognized as finished, but would be "stalled" forever, even though
all blocks are in sync again and all bits have been cleared...

The regression was introduced by
    drbd: get rid of atomic update on disk bitmap works

For an "empty" resync (rs_total == 0), we must not "finish" the
resync on the SyncSource before the SyncTarget knows all relevant
information (sync uuid).  We need to wait for the full round-trip,
the SyncTarget will then explicitly notify us.

Also for normal, non-empty resyncs (rs_total > 0), the resync-finished
condition needs to be tested before the schedule() in wait_for_work, or
it is likely to be missed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: fix a race stopping the worker thread
Lars Ellenberg [Fri, 27 Dec 2013 16:17:25 +0000 (17:17 +0100)]
drbd: fix a race stopping the worker thread

We may implicitly call drbd_send() from inside wait_for_work(),
via maybe_send_barrier().

If the "stop" signal was send just before that, drbd_send() would call
flush_signals(), and we would run an unbounded schedule() afterwards.

Fix: check for thread_state == RUNNING before we schedule()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: get rid of atomic update on disk bitmap works
Lars Ellenberg [Fri, 20 Dec 2013 10:39:48 +0000 (11:39 +0100)]
drbd: get rid of atomic update on disk bitmap works

Just trigger the occasional lazy bitmap write-out during resync
from the central wait_for_work() helper.

Previously, during resync, bitmap pages would be written out separately,
synchronously, one at a time, at least 8 times each (every 512 bytes
worth of bitmap cleared).

Now we trigger "merge friendly" bulk write out of all cleared pages
every two seconds during resync, and once the resync is finished.
Most pages will be written out only once.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: allow write-ordering policy to be bumped up again
Lars Ellenberg [Fri, 20 Dec 2013 10:17:02 +0000 (11:17 +0100)]
drbd: allow write-ordering policy to be bumped up again

Previously, once you disabled flushes as a means of enforcing
write-ordering, you'd need to detach/re-attach to enable them again.

Allow drbdsetup disk-options to re-enable previously disabled
write-ordering policy options at runtime.

While at it, fix RCU usage in drbd_bump_write_ordering():
max_allowed_wo() uses rcu_dereference(), therefore it must
be called within rcu_read_lock()/rcu_read_unlock().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: refactor use of first_peer_device()
Lars Ellenberg [Fri, 22 Nov 2013 11:40:58 +0000 (12:40 +0100)]
drbd: refactor use of first_peer_device()

Reduce the number of calls to first_peer_device(). Instead, call
first_peer_device() just once to assign a local variable peer_device.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: reduce number of spinlock drop/re-acquire cycles
Lars Ellenberg [Wed, 4 Dec 2013 11:07:09 +0000 (12:07 +0100)]
drbd: reduce number of spinlock drop/re-acquire cycles

Instead of dropping and re-acquiring the spinlock around the submit,
just remember that we want to submit, and do that only once we have
dropped the spinlock for good.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: rename drbd_free_bc() to drbd_free_ldev()
Philipp Reisner [Fri, 22 Nov 2013 15:48:14 +0000 (16:48 +0100)]
drbd: rename drbd_free_bc() to drbd_free_ldev()

Since the member of drbd_device is called ldev.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: device->ldev is not guaranteed on a D_ATTACHING disk
Philipp Reisner [Fri, 22 Nov 2013 12:22:13 +0000 (13:22 +0100)]
drbd: device->ldev is not guaranteed on a D_ATTACHING disk

Some parts of the code assumed that get_ldev_if_state(device, D_ATTACHING)
is sufficient to access the ldev member of the device object. That was
wrong. ldev may not be there or might be freed at any time if the device
has a disk state of D_ATTACHING.

bm_rw()
  Documented that drbd_bm_read() is only called from drbd_adm_attach.
  drbd_bm_write() is only called when a reference is held, and it is
  documented that a caller has to hold a reference before calling
  drbd_bm_write()

drbd_bm_write_page()
  Use get_ldev() instead of get_ldev_if_state(device, D_ATTACHING)

drbd_bmio_set_n_write()
  No longer use get_ldev_if_state(device, D_ATTACHING). All callers
  hold a reference to ldev now.

drbd_bmio_clear_n_write()
  All callers were already holding a reference to ldev anyway. Remove the
  misleading get_ldev_if_state(device, D_ATTACHING)

drbd_reconsider_max_bio_size()
  Removed the get_ldev_if_state(device, D_ATTACHING). All callers
  now pass a struct drbd_backing_dev* when they have a proper
  reference, or a NULL pointer.
  Before this fix, the receiver could trigger a NULL pointer
  deref when in drbd_reconsider_max_bio_size()

drbd_bump_write_ordering()
  Used get_ldev_if_state(device, D_ATTACHING) with the wrong assumption.
  Remove it, and allow the caller to pass in a struct drbd_backing_dev*
  when the caller knows that accessing this bdev is safe.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agodrbd: Move write_ordering from connection to resource
Philipp Reisner [Fri, 22 Nov 2013 14:53:41 +0000 (15:53 +0100)]
drbd: Move write_ordering from connection to resource

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
9 years agoblock: virtio-blk: support multi virt queues per virtio-blk device
Ming Lei [Thu, 26 Jun 2014 09:41:48 +0000 (17:41 +0800)]
block: virtio-blk: support multi virt queues per virtio-blk device

First, this patch supports more than one virtual queue per virtio-blk
device.

Second, it maps each virtual queue to a blk-mq hardware queue.

With this approach, both scalability and performance can be improved.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoinclude/uapi/linux/virtio_blk.h: introduce feature of VIRTIO_BLK_F_MQ
Ming Lei [Thu, 26 Jun 2014 09:41:47 +0000 (17:41 +0800)]
include/uapi/linux/virtio_blk.h: introduce feature of VIRTIO_BLK_F_MQ

The current virtio-blk spec only supports one virtual queue for transferring
data between VM and host, and inside the VM all kinds of operations on
the virtual queue need to hold one lock, causing the problems below:

- bad scalability
- bad throughput

This patch introduces the feature flag VIRTIO_BLK_F_MQ so that more
than one virtual queue can be used by a virtio-blk device; the above
problems can then be solved or eased.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblock SG_IO: add SG_FLAG_Q_AT_HEAD flag
Douglas Gilbert [Tue, 1 Jul 2014 16:48:05 +0000 (10:48 -0600)]
block SG_IO: add SG_FLAG_Q_AT_HEAD flag

After the SG_IO ioctl was copied into the block layer and
later into the bsg driver, subtle differences emerged.

One difference is the way injected commands are queued through
the block layer (i.e. this is not SCSI device queueing nor SATA
NCQ). Summarizing:
  - SG_IO on block layer device: blk_exec*(at_head=false)
  - sg device SG_IO: at_head=true
  - bsg device SG_IO: at_head=true

Some time ago Boaz Harrosh introduced a sg v4 flag called
BSG_FLAG_Q_AT_TAIL to override the bsg driver default. A
recent patch titled: "sg: add SG_FLAG_Q_AT_TAIL flag"
allowed the sg driver default to be overridden. This patch
allows a SG_IO ioctl sent to a block layer device to have
its default overridden.
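
From user space, requesting head insertion through a block device's SG_IO
then looks roughly like this (illustrative fragment):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <scsi/sg.h>

    struct sg_io_hdr hdr;

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    /* ... set cmdp, cmd_len, dxferp, dxfer_len, etc. ... */
    hdr.flags |= SG_FLAG_Q_AT_HEAD;     /* override at_head=false default */
    if (ioctl(blk_fd, SG_IO, &hdr) < 0)
            perror("SG_IO");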

ChangeLog:
    - introduce SG_FLAG_Q_AT_HEAD flag in sg.h to cause
      commands that are injected via a block layer
      device SG_IO ioctl to set at_head=true
    - make comments clearer about queueing in sg.h since the
      header is used both by the sg device and block layer
      device implementations of the SG_IO ioctl.
    - introduce BSG_FLAG_Q_AT_HEAD in bsg.h for compatibility
      (it does nothing) and update comments.

Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblock: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
Akinobu Mita [Sun, 25 May 2014 12:43:34 +0000 (21:43 +0900)]
block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge

SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access a reserved
buffer in bytes as int type.  The value needs to be capped at the request
queue's max_sectors.  But integer overflow is not correctly handled in
the calculation when converting max_sectors from sectors to bytes.
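
A sketch of an overflow-safe conversion (helper name hypothetical, close
in spirit to the fix):

    static int max_sectors_bytes(struct request_queue *q)
    {
            unsigned int max_sectors = queue_max_sectors(q);

            /* cap before shifting so max_sectors << 9 can't overflow int */
            max_sectors = min_t(unsigned int, max_sectors, INT_MAX >> 9);

            return max_sectors << 9;
    }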

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblock: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
Akinobu Mita [Sun, 25 May 2014 12:43:33 +0000 (21:43 +0900)]
block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX

BLKSECTGET ioctl loads the request queue's max_sectors as an unsigned
short value into the argument pointer.  So if max_sectors is greater
than USHRT_MAX, the upper 16 bits are simply discarded.

In such a case, USHRT_MAX is preferable to the lower 16 bits of
max_sectors.
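
A sketch of the capped variant (not necessarily the exact hunk):

    /* clamp instead of silently truncating to the low 16 bits */
    unsigned short max_sectors = min_t(unsigned int,
                    queue_max_sectors(bdev_get_queue(bdev)), USHRT_MAX);

    return put_user(max_sectors, (unsigned short __user *)arg);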

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblock/partitions/efi.c: kerneldoc fixing
Fabian Frederick [Thu, 12 Jun 2014 18:26:01 +0000 (20:26 +0200)]
block/partitions/efi.c: kerneldoc fixing

Adding function documentation and fixing kerneldoc warnings
('field: description' uniformization).

Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblock/partitions/msdos.c: code clean-up
Fabian Frederick [Thu, 12 Jun 2014 18:16:57 +0000 (20:16 +0200)]
block/partitions/msdos.c: code clean-up

checkpatch fixing:
WARNING: Missing a blank line after declarations
WARNING: space prohibited between function name and open parenthesis '('
ERROR: spaces required around that '<' (ctx:VxV)

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblock/partitions/amiga.c: replace nolevel printk by pr_err
Fabian Frederick [Thu, 12 Jun 2014 18:04:52 +0000 (20:04 +0200)]
block/partitions/amiga.c: replace nolevel printk by pr_err

Also add a no-prefix pr_fmt to avoid any future default format update.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblock/partitions/aix.c: replace count*size kzalloc by kcalloc
Fabian Frederick [Thu, 12 Jun 2014 17:45:17 +0000 (19:45 +0200)]
block/partitions/aix.c: replace count*size kzalloc by kcalloc

kcalloc manages count*sizeof overflow.
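
That is (element type chosen for illustration):

    /* before: the multiplication can overflow if count is large */
    ptr = kzalloc(count * sizeof(struct lv_info), GFP_KERNEL);

    /* after: kcalloc() returns NULL if count * size would overflow */
    ptr = kcalloc(count, sizeof(struct lv_info), GFP_KERNEL);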

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agobio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
Gu Zheng [Tue, 1 Jul 2014 16:36:47 +0000 (10:36 -0600)]
bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload

Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from
Martin introduces the function bip_integrity_vecs(get the useful vectors)
to fix the issue about nr_vecs for inline integrity vectors that reported
by David Milburn.

But it seems that bip_integrity_vecs() will return the wrong number if the
bio is not based on any bio_set for some reason (bio->bi_pool == NULL),
because in that case bip_inline_vecs[0] is allocated directly.  So
here we add bip_max_vcnt to record the number of vector slots, and
clean up the function bip_integrity_vecs().

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblk-mq: use percpu_ref for mq usage count
Tejun Heo [Tue, 1 Jul 2014 16:34:38 +0000 (10:34 -0600)]
blk-mq: use percpu_ref for mq usage count

Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight.  The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.

This type of code has relatively high chance of subtle bugs which are
extremely difficult to trigger and it's way too hairy to be open coded
in blk-mq.  percpu_ref can serve the same purpose after the recent
changes.  This patch replaces the open-coded per-cpu usage counting
and draining mechanism with percpu_ref.

blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put.  blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero.  blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.
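
A simplified sketch of that pattern (condensed from the description above,
not the verbatim blk-mq code):

    static int blk_mq_queue_enter(struct request_queue *q)
    {
            if (percpu_ref_tryget_live(&q->mq_usage_counter))
                    return 0;
            /* ref was killed: the queue is frozen or dying */
            return wait_event_interruptible(q->mq_freeze_wq,
                            !q->mq_freeze_depth || blk_queue_dying(q));
    }

    static void blk_mq_queue_exit(struct request_queue *q)
    {
            percpu_ref_put(&q->mq_usage_counter);
    }

    void blk_mq_freeze_queue(struct request_queue *q)
    {
            percpu_ref_kill(&q->mq_usage_counter);
            wait_event(q->mq_freeze_wq,
                       percpu_ref_is_zero(&q->mq_usage_counter));
    }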

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
Tejun Heo [Tue, 1 Jul 2014 16:33:02 +0000 (10:33 -0600)]
blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()

Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
anything, and it's going to be simplified further.  Let's flatten it into
its caller.

This patch doesn't make any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblk-mq: decouple blk-mq freezing from generic bypassing
Tejun Heo [Tue, 1 Jul 2014 16:31:13 +0000 (10:31 -0600)]
blk-mq: decouple blk-mq freezing from generic bypassing

blk_mq freezing is entangled with generic bypassing which bypasses
blkcg and io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order.  This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO which may
lead to deadlock through memory allocation.

However, generic bypassing doesn't quite fit blk-mq.  blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypssing to
freezing, which blocks request processing and drains all the in-flight
ones.  This causes problems as bypassing assumes that request
processing is online.  blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.

Another oddity is that, except for during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones.  This shouldn't break
anything but again highlights that something isn't quite right here.

The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms.  The only intersecting purpose
that they serve is during queue cleanup.  Let's properly separate
blk-mq freezing from generic bypassing and simply use it where
necessary.

* request_queue->mq_freeze_depth is added and
  blk_mq_[un]freeze_queue() now operate on this counter instead of
  ->bypass_depth.  The replacement for QUEUE_FLAG_BYPASS isn't added
  but the counter is tested directly.  This will be further updated by
  later changes.

* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
  blk_mq_freeze_queue().  Queue cleanup path now calls
  blk_mq_freeze_queue() directly.

* blk_queue_enter()'s fast path condition is simplified to simply
  check @q->mq_freeze_depth.  Previously, the condition was

    !blk_queue_dying(q) &&
        (!blk_queue_bypass(q) || !blk_queue_init_done(q))

  mq_freeze_depth is incremented right after dying is set and
  blk_queue_init_done() exception isn't necessary as blk-mq doesn't
  start frozen, which only leaves the blk_queue_bypass() test which
  can be replaced by @q->mq_freeze_depth test.

This change simplifies the code and reduces confusion in the area.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblock, blk-mq: draining can't be skipped even if bypass_depth was non-zero
Tejun Heo [Tue, 1 Jul 2014 16:29:17 +0000 (10:29 -0600)]
block, blk-mq: draining can't be skipped even if bypass_depth was non-zero

Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
skip queue draining if bypass_depth was already above zero.  The
assumption is that the one which bumped the bypass_depth should have
performed draining already; however, there's nothing which prevents a
new instance of bypassing/freezing from starting before the previous
one finishes draining.  The current code may allow the later
bypassing/freezing instances to complete while there still are
in-flight requests which haven't finished draining.

Fix it by draining regardless of bypass_depth.  We still skip draining
from blk_queue_bypass_start() while the queue is initializing to avoid
introducing excessive delays during boot.  INIT_DONE setting is moved
above the initial blk_queue_bypass_end() so that bypassing attempts
can't slip in between.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoblk-mq: fix a memory ordering bug in blk_mq_queue_enter()
Tejun Heo [Wed, 18 Jun 2014 15:21:08 +0000 (11:21 -0400)]
blk-mq: fix a memory ordering bug in blk_mq_queue_enter()

blk-mq uses a percpu_counter to keep track of how many usages are in
flight.  The percpu_counter is drained while freezing to ensure that
no usage is left in-flight after freezing is complete.

blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism; unfortunately, it contains a subtle bug -
smp_wmb() in blk_mq_queue_enter() doesn't prevent the cpu from
fetching @q->bypass_depth before incrementing @q->mq_usage_counter, and
if freezing happens in between, the caller can slip through and freezing
can complete while there are active users.

Use smp_mb() instead so that bypass_depth and mq_usage_counter
modifications and tests are properly interlocked.
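
Schematically (simplified from blk_mq_queue_enter() of that era):

    __percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
    /* smp_wmb() only ordered stores; the full smp_mb() also orders the
     * increment above against the *load* of bypass_depth below */
    smp_mb();
    if (unlikely(blk_queue_bypass(q) || blk_queue_dying(q))) {
            /* back out and wait for the queue to unfreeze */
    }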

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
9 years agoMerge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu...
Jens Axboe [Tue, 1 Jul 2014 16:19:04 +0000 (10:19 -0600)]
Merge branch 'for-3.17' of git://git./linux/kernel/git/tj/percpu into for-3.17/core

Merge the percpu_ref changes from Tejun, he says they are stable now.