Andreas Gruenbacher [Thu, 9 Dec 2010 14:03:57 +0000 (15:03 +0100)]
 
drbd: Use the standard bool, true, and false keywords
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Thu, 9 Dec 2010 13:23:27 +0000 (14:23 +0100)]
 
drbd: This code is dead now
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Thu, 9 Dec 2010 13:02:35 +0000 (14:02 +0100)]
 
drbd: Another small enum drbd_state_rv cleanup
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Tue, 7 Dec 2010 23:39:32 +0000 (00:39 +0100)]
 
drbd: Be more explicit about functions that return an enum drbd_state_rv
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Wed, 8 Dec 2010 00:06:16 +0000 (01:06 +0100)]
 
drbd: Rename enum drbd_state_ret_codes to enum drbd_state_rv
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Wed, 8 Dec 2010 12:33:11 +0000 (13:33 +0100)]
 
drbd: Rename enum drbd_ret_codes to enum drbd_ret_code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Tue, 7 Dec 2010 09:43:29 +0000 (10:43 +0100)]
 
drbd: Get rid of unnecessary macros (2)
The FAULT_ACTIVE macro just wraps the drbd_insert_fault macro for no
apparent reason.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Tue, 7 Dec 2010 02:01:41 +0000 (03:01 +0100)]
 
drbd: Get rid of unnecessary macros (1)
This macro doesn't save much code, but makes things a lot harder to read.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Mon, 13 Dec 2010 16:48:19 +0000 (17:48 +0100)]
 
drbd: Rename drbd_make_request_26 to drbd_make_request
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Thu, 9 Dec 2010 15:23:43 +0000 (16:23 +0100)]
 
drbd: Remove left-over prototype
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Thu, 9 Dec 2010 15:08:46 +0000 (16:08 +0100)]
 
drbd: Make sure that drbd_send() has sent the right number of bytes
Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Lars Ellenberg [Thu, 9 Dec 2010 14:21:02 +0000 (15:21 +0100)]
 
drbd: fix incomplete error message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Andreas Gruenbacher [Wed, 8 Dec 2010 18:02:09 +0000 (19:02 +0100)]
 
drbd: Removed an unnecessary #undef
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Sun, 5 Dec 2010 13:11:14 +0000 (14:11 +0100)]
 
drbd: fix regression, we need to close drbd epochs during normal operation
commit 
e2041475e6ddb081734d161f6421977323f5a9b9
drbd: Starting with protocol 96 we can allow app-IO while receiving the bitmap
Contained a bad chunk that tried to optimize away drbd barriers during
bitmap exchange, but accidentally dropped them for normal mode as well.
Impact: depending on activity log size and access pattern, activity log
extents may not be recycled in time, causeing IO to block indefinetely.
Fix: skip drbd barriers only if there is no connection to send them on,
or the request being completed has not been on the network at all.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Fri, 3 Dec 2010 15:04:24 +0000 (16:04 +0100)]
 
drbd: Implemented the before-resync-source handler
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Fri, 3 Dec 2010 14:22:48 +0000 (15:22 +0100)]
 
drbd: --force option for disconnect
As the network connection can be lost at any time, a --force option
for disconnect is just a matter of completeness.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Wed, 24 Nov 2010 09:11:14 +0000 (10:11 +0100)]
 
drbd: add packet_type 27 (return_code_only) to netlink api
In case we ever should add an other packet type,
we must not reuse 27, as that currently used for
"empty" return code only replies.
Document it as such.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Wed, 24 Nov 2010 09:41:45 +0000 (10:41 +0100)]
 
drbd: use kzalloc and memset(,0,) to start with clean buffers in drbd_nl
Make sure we start with clean buffers to not accidentally send garbage
back to userspace. Note: has not been observed; but just in case.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Wed, 24 Nov 2010 09:37:35 +0000 (10:37 +0100)]
 
drbd: remove /proc/drbd before unregistering from netlink
There still exists a (theoretical) race on module unload, where
/proc/drbd may still exist, but the netlink callback has been
unregistered already, allowing drbdsetup to shout without listeners,
and get no reply.
Reorder remove_proc_entry and unregister of netlink callback.
drbdsetup first checks for existence of the proc entry,
and if that is missing, won't even try to contact the module.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Wed, 24 Nov 2010 09:33:02 +0000 (10:33 +0100)]
 
drbd: increase module count on /proc/drbd access
If someone holds /proc/drbd open, previously rmmod would
"succeed" in starting the unload, but then block on remove_proc_entry,
leading to a situation where the lsmod does not show drbd anymore,
but /proc/drbd being still there (but no longer accessible).
I'd rather have rmmod fail up front in this case.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Mon, 22 Nov 2010 14:49:17 +0000 (15:49 +0100)]
 
drbd: Removed 20 seconds upper bound for side-stepping
Given low-enough network bandwidth combined with a IO
pattern that hammers onto a single RS-extent, side-stepping
might be necessary for much longer times.
Changed the code to print a single informal message after
20 seconds, but it keeps on stepping aside forever.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Mon, 22 Nov 2010 13:18:47 +0000 (14:18 +0100)]
 
drbd: Becoming sync target may not happen out of < C_WF_REPORT_PARAMS
This patch is acutally a necessary addendum to the patch
"fix for spurious full sync (becoming sync target looked like invalidate)"
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Wed, 10 Nov 2010 11:08:37 +0000 (12:08 +0100)]
 
drbd: Starting with protocol 96 we can allow app-IO while receiving the bitmap
* C_STARTING_SYNC_S, C_STARTING_SYNC_T In these states the bitmap gets
  written to disk. Locking out of app-IO is done by using the
  drbd_queue_bitmap_io() and drbd_bitmap_io() functions these days.
  It is no longer necessary to lock out app-IO based on the connection
  state.
  App-IO that may come in after the BITMAP_IO flag got cleared before the
  state transition to C_SYNC_(SOURCE|TARGET) does not get mirrored, sets
  a bit in the local bitmap, that is already set, therefore changes nothing.
* C_WF_BITMAP_S In this state we send updates (P_OUT_OF_SYNC packets).
  With that we make sure they have the same number of bits when going
  into the C_SYNC_(SOURCE|TARGET) connection state.
* C_UNCONNECTED: The receiver starts, no need to lock out IO.
* C_DISCONNECTING: in drbd_disconnect() we had a wait_event()
  to wait until ap_bio_cnt reaches 0. Removed that.
* C_TIMEOUT, C_BROKEN_PIPE, C_NETWORK_FAILURE
  C_PROTOCOL_ERROR, C_TEAR_DOWN: Same as C_DISCONNECTING
* C_WF_REPORT_PARAMS: IO still possible since that is still
  like C_WF_CONNECTION.
And we do not need to send barriers in C_WF_BITMAP_S connection state.
Allow concurrent accesses to the bitmap when receiving the bitmap.
Everything gets ORed anyways.
A drbd_free_tl_hash() is in after_state_chg_work(). At that point
all the work items of the last connections must have been processed.
Introduced a call to drbd_free_tl_hash() into drbd_free_mdev()
for paranoia reasons.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Wed, 17 Nov 2010 15:54:36 +0000 (16:54 +0100)]
 
drbd: Improvements in sanitize_state()
The relevant change is that the state change to C_FW_BITMAP_S should
implicitly change pdsk to C_CONSISTENT. (Think of it as C_OUTDATED, only
without the guarantee that the peer has the outdated written to its
meta data)
At that opportunity I restructured the switch statement so that it
gets evaluated every time. (Has declarative character)
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Tue, 16 Nov 2010 14:30:44 +0000 (15:30 +0100)]
 
drbd: Fixed race condition in drbd_queue_bitmap_io
May only test for ap_bio_cnt == 0 under req_lock. It can increase
only under req_lock.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Wed, 17 Nov 2010 17:24:19 +0000 (18:24 +0100)]
 
drbd: Fixed inc_ap_bio()
The condition must be checked after perpare_to_wait(). The old
implementaion could loose wakeup events. Never observed in real
life.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Tue, 16 Nov 2010 09:07:53 +0000 (10:07 +0100)]
 
drbd: use test_and_set_bit() to decide if bm_io_work should be queued
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Tue, 9 Nov 2010 16:45:06 +0000 (17:45 +0100)]
 
drbd: Begin to account BIO processing time before inc_ap_bio()
Since inc_ap_bio() might sleep already
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Tue, 9 Nov 2010 12:59:41 +0000 (13:59 +0100)]
 
drbd: Implemented side-stepping in drbd_res_begin_io()
Before:
  drbd_rs_begin_io() locked app-IO out of an RS extent, and
  waited then until all previous app-IO in that area finished.
  (But not only until the disk-IO was finished but until the
   barrier/epoch ack came in for that == round trip time latency ++)
After:
  As soon as a new app-IO waits wants to start new IO on that
  RS extent, drbd_rs_begin_io() steps aside (clearing the
  BME_NO_WRITES flag again). It retries after 100ms.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Sun, 7 Nov 2010 17:02:56 +0000 (18:02 +0100)]
 
drbd: Make some functions static
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Sun, 7 Nov 2010 14:56:29 +0000 (15:56 +0100)]
 
drbd: Implemented priority inheritance for resync requests
We only issue resync requests if there is no significant application IO
going on. = Application IO has higher priority than resnyc IO.
If application IO can not be started because the resync process locked
an resync_lru entry, start the IO operations necessary to release the
lock ASAP.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Fri, 29 Oct 2010 10:44:20 +0000 (12:44 +0200)]
 
drbd: Do not cleanup resync LRU for the Ahead/Behind SyncSource/SyncTarget transitions
This one should be replaced with moving this cleanup to the
'right' position.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Wed, 27 Oct 2010 15:32:36 +0000 (17:32 +0200)]
 
drbd: When proxy's buffer drained off go into regular resync mode
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Wed, 27 Oct 2010 12:33:00 +0000 (14:33 +0200)]
 
drbd: New packet for Ahead/Behind mode: P_OUT_OF_SYNC
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Wed, 27 Oct 2010 10:21:30 +0000 (12:21 +0200)]
 
drbd: Implemented two new connection states Ahead/Behind
In this connection mode, the ahead node no longer replicates
application IO. The behind's disk becomes out dated.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Wed, 27 Oct 2010 09:12:07 +0000 (11:12 +0200)]
 
drbd: New configuration parameters for dealing with network congestion
net {
    on_congestion {block|pull-ahead|disconnect};
    congestion-fill {sectors};
    congestion-extents {al-extents};
}
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Philipp Reisner [Tue, 26 Oct 2010 14:02:27 +0000 (16:02 +0200)]
 
drbd: Track the numbers of sectors in flight
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Wed, 17 Nov 2010 21:25:03 +0000 (22:25 +0100)]
 
drbd: Renamed write_flags_to_bio() to wire_flags_to_bio()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Thu, 11 Nov 2010 21:41:04 +0000 (22:41 +0100)]
 
drbd: restore compatibility with 32bit kernels
With commit
drbd: further converge progress display of resync and online-verify
accidentally an u64/u64 div was introduced, causing an unresolvable
symbol __udivdi3 to be reference. Actually for that division, 32bit are
still suficient for now, so we can revert to unsigned long instead.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Thu, 11 Nov 2010 14:19:07 +0000 (15:19 +0100)]
 
drbd: properly use max_hw_sectors to limit the our bio size
To ease tracking of bios in some hash tables, we want it to
not cross certain boundaries (128k, used to be 32k).
We limit the maximum bio size using queue parameters.
Historically some defines and variables we use there have been named
max_segment_size, which was misguided. Rename them to max_bio_size,
and use [blk_]queue_max_hw_sectors where appropriate.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Thu, 11 Nov 2010 09:47:05 +0000 (10:47 +0100)]
 
drbd: debug: limit nelink-broadcast of request on digest mismatch to 32k
We used to be limited to 32k requests,
but have increased that limit to 128k now.
This part of the code can only deal with 32k,
it would scramble arbitrary pages for larger requests.
As it is used for debugging only anyways,
it is ok to simply truncate the dumped data here.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Wed, 10 Nov 2010 09:36:52 +0000 (10:36 +0100)]
 
drbd: detect modification of in-flight buffers
With data-integrity digest enabled, double-check on the sending side
for modifications by upper layers of buffers under write back,
so we can tell it appart from corruption on the "wire".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Tue, 9 Nov 2010 13:15:24 +0000 (14:15 +0100)]
 
drbd: further converge progress display of resync and online-verify
Show progressbar and ETA always, with proc_details >= 1 also show the
current sector position for both resync and online-verify on both nodes.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Tue, 9 Nov 2010 13:12:10 +0000 (14:12 +0100)]
 
drbd: fix potential wrap of 32bit oos:%lu display in /proc/drbd
When converting bits (4k resolution, still) to kB, we shift left.  If it
was a large number of bits on a 32bit box (>= 4 TiB storage), we may
wrap the 32bit unsigned long base type, resulting in incorrect display.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 09:05:47 +0000 (10:05 +0100)]
 
drbd: use the resync controller for online-verify requests as well
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 09:04:07 +0000 (10:04 +0100)]
 
drbd: factor out drbd_rs_number_requests
Preparation patch to be able to use the auto-throttling resync controller
for online-verify requests as well.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 08:55:18 +0000 (09:55 +0100)]
 
drbd: factor out drbd_rs_controller_reset
Preparation patch to be able to use the auto-throttling resync controller
for online-verify requests as well.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 08:52:46 +0000 (09:52 +0100)]
 
drbd: show progress bar and ETA for online-verify
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 08:48:01 +0000 (09:48 +0100)]
 
drbd: advance progress step marks for online-verify
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 08:23:37 +0000 (09:23 +0100)]
 
drbd: factor out advancement of resync marks for progress reporting
This is in preparation to unify progress reporting of
online-verify and resync requests.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 08:43:15 +0000 (09:43 +0100)]
 
drbd: initialize online-verify progress tracking on verify target
For partial (resumed) online verify, initialize the resync step marks
once we know what the online verify start sector is.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 08:39:06 +0000 (09:39 +0100)]
 
drbd: improve online-verify progress tracking
For a partial (resumed) online-verify, initialize rs_total not to total
bits, but to number of bits to check in this run, to match the meaning
rs_total has for actual resync.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Lars Ellenberg [Fri, 5 Nov 2010 08:56:33 +0000 (09:56 +0100)]
 
drbd: only reset online-verify start sector if verify completed
For network hickups during online-verify, on the next verify
triggered, we by default want to resume where it left off.
After any replication link interruption, there will be a (possibly
empty) resync.  Do not reset online-verify start sector if some resync
completed, that would defeats the purpose.
Only reset the start sector once a verify run is completed.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Jens Axboe [Thu, 10 Mar 2011 07:58:35 +0000 (08:58 +0100)]
 
Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core
Conflicts:
	block/blk-core.c
	block/blk-flush.c
	drivers/md/raid1.c
	drivers/md/raid10.c
	drivers/md/raid5.c
	fs/nilfs2/btnode.c
	fs/nilfs2/mdt.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Vivek Goyal [Wed, 9 Mar 2011 07:27:37 +0000 (08:27 +0100)]
 
blk-throttle: Use blk_plug in throttle dispatch
Use plug in throttle dispatch also as we are dispatching a bunch of
bios in throttle context and some of them might merge.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Wed, 9 Mar 2011 10:56:30 +0000 (11:56 +0100)]
 
block: kill off REQ_UNPLUG
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Thu, 3 Mar 2011 01:12:18 +0000 (20:12 -0500)]
 
aio: remove request submission batching
This should be useless now that we have on-stack plugging. So lets just
kill it.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Shaohua Li [Thu, 1 Jul 2010 05:55:01 +0000 (07:55 +0200)]
 
fs: make aio plug
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Tue, 22 Jun 2010 10:52:14 +0000 (12:52 +0200)]
 
fs: make mpage read/write_pages() plug
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Mon, 19 Apr 2010 08:04:38 +0000 (10:04 +0200)]
 
read-ahead: use plugging
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Thu, 24 Jun 2010 13:05:37 +0000 (15:05 +0200)]
 
fs: make generic file read/write functions plug
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Thu, 10 Mar 2011 07:52:07 +0000 (08:52 +0100)]
 
block: remove per-queue plugging
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Tue, 8 Mar 2011 12:19:51 +0000 (13:19 +0100)]
 
block: initial patch for on-stack per-task plugging
This patch adds support for creating a queuing context outside
of the queue itself. This enables us to batch up pieces of IO
before grabbing the block device queue lock and submitting them to
the IO scheduler.
The context is created on the stack of the process and assigned in
the task structure, so that we can auto-unplug it if we hit a schedule
event.
The current queue plugging happens implicitly if IO is submitted to
an empty device, yet callers have to remember to unplug that IO when
they are going to wait for it. This is an ugly API and has caused bugs
in the past. Additionally, it requires hacks in the vm (->sync_page()
callback) to handle that logic. By switching to an explicit plugging
scheme we make the API a lot nicer and can get rid of the ->sync_page()
hack in the vm.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Fri, 16 Apr 2010 19:13:15 +0000 (21:13 +0200)]
 
scsi: convert to blk_delay_queue()
It was always abuse to reuse the plugging infrastructure for this,
convert it to the (new) real API for delaying queueing a bit. A
default delay of 3 msec is defined, to match the previous
behaviour.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Fri, 16 Apr 2010 19:11:21 +0000 (21:11 +0200)]
 
ide-cd: convert to blk_delay_queue() for a short pause
It was always abuse to reuse the plugging infrastructure for this,
convert it to the (new) real API for delaying queueing a bit.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Acked-by: David S. Miller <davem@davemloft.net>
Jens Axboe [Wed, 2 Mar 2011 16:08:00 +0000 (11:08 -0500)]
 
block: add API for delaying work/request_fn a little bit
Currently we use plugging for that, but as plugging is going away,
we need an alternative mechanism.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Tejun Heo [Wed, 9 Mar 2011 18:54:29 +0000 (19:54 +0100)]
 
staging: Convert to bdops->check_events()
Convert two staging drivers - blkvsc_drv and cyasblkdev_block - from
->media_changed() to ->check_events().  The former always indicated
media changed while the latter always indicated media not changed.
Not sure what the drivers are trying to achieve but keep the original
behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
pktcdvd: Convert to bdops->check_events()
Convert from ->media_changed() to ->check_events().
pktcdvd needs to forward all event related operations to the
underlying device.  Forward ->check_events() instead of
->media_changed() and inherit disk->[async_]events.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Peter Osterlund <petero2@telia.com>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
umem: Drop dummy ->media_changed()
umem doesn't implement media changed detection and there's no need to
implement dummy callback anymore.  Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
s390/tape_block: Convert to bdops->check_events()
Convert from ->media_changed() to ->check_events().
s390/tape_block buffers media changed state and clears it on
revalidation.  It will behave correctly with kernel event polling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
i2o_block: Convert to bdops->check_events()
Convert from ->media_changed() to ->check_events().
i2o_block buffers media changed state and clears it after reporting.
It will behave correctly with kernel event polling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Markus Lidel <Markus.Lidel@shadowconnect.com>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
xsysace: Convert to bdops->check_events()
Convert from ->media_changed() to ->check_events().
xsysace buffers media changed state and clears it on revalidation.  It
will behave correctly with kernel event polling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
ub: Convert to bdops->check_events()
Convert from ->media_changed() to ->check_events().
ub buffers media changed state and clears it on revalidation.  It will
behave correctly with kernel event polling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
swim[3]: Convert to bdops->check_events()
Convert from ->media_changed() to ->check_events().
Both swim and swim3 buffer media changed state and clear it on
revalidation.  They will behave correctly with kernel event polling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Laurent Vivier <laurent@lvivier.info>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
dac960: Convert to bdops->check_events()
Convert from ->media_changed() to ->check_events().
DAC960 media change notification seems to be one way (once set, never
cleared) and will generate spurious events when polled once the
condition triggers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
paride: Convert to bdops->check_events()
Convert paride drivers from ->media_changed() to ->check_events().
pcd and pd buffer and clear events after reporting; however, pf
unconditionally reports MEDIA_CHANGE and will generate spurious events
when polled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Waugh <tim@cyberelk.net>
Tejun Heo [Wed, 9 Mar 2011 18:54:28 +0000 (19:54 +0100)]
 
gdrom,viocd: Convert to bdops->check_events()
Convert gdrom and viocd from ->media_changed() to ->check_events().
It's unclear how the conditions are cleared and it's possible that it
may generate spurious events when polled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:27 +0000 (19:54 +0100)]
 
floppy,{ami|ata}flop: Convert to bdops->check_events()
Convert the floppy drivers from ->media_changed() to ->check_events().
Both floppy and ataflop buffer media changed state bit and clear them
on revalidation and will behave correctly with kernel event polling.
I can't tell how amiflop clears its event and it's possible that it
may generate spurious events when polled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:27 +0000 (19:54 +0100)]
 
ide: Convert to bdops->check_events()
Convert ->media_changed() to the new ->check_events() method.  The
conversion is mostly mechanical.  The only notable change is that
cdrom now doesn't generate any event if @slot_nr isn't CDSL_CURRENT.
It used to return -EINVAL which would be treated as media changed.  As
media changer isn't supported anyway, this doesn't make any
difference.
This makes ide emit the standard disk events and allows kernel event
polling.  Currently, only MEDIA_CHANGE event is implemented.  Adding
support for EJECT_REQUEST shouldn't be difficult; however, given that
ide driver is already deprecated, it probably is best to leave it
alone.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-ide@vger.kernel.org
Tejun Heo [Wed, 9 Mar 2011 18:54:27 +0000 (19:54 +0100)]
 
block: Don't check events while open is in progress
Not all block drivers clear events immediately after reporting.  Some
do so in ->revalidate_disk() or other steps during ->open().  There is
a slim chance event poll may happen between the clearing event check
from check_disk_change() and the actual clearing of the events which
would result in spurious events.
Block event checks while block device open is in progress.  There is
no need to kick explicit event check afterwards as events are always
checked during open.
-v2: The original patch could have called disk_unblock_events() with
     an already released or %NULL @disk causing oops.  Fixed by making
     sure references are put after disk_unblock_events() is called.
     It also makes the error path of __blkdev_get() a bit simpler.
     This problem was reported by Jens.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:27 +0000 (19:54 +0100)]
 
block: Don't check events on close unless it was blocked
The block event mechanism currently always checks events when the
device is being closed regardless of the open mode.  The intention was
to allow detection of EJECT_REQUEST when a device is closed whether
disk event polling is enabled or not.
This is unnecessary as, for devices of interest, events are checked
from either userland or kernel and in the former case ->check_events()
is performed on open of each poll attempt anyway.  Furthermore, this
unconditional event check on close makes the code susceptible to event
loop if the block driver doesn't clear reported events correctly - an
event triggers userland to open and close the device which in turn
causes another event, rinse and repeat.
Check events on close only if it was blocked by excl write open.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Tejun Heo [Wed, 9 Mar 2011 18:54:27 +0000 (19:54 +0100)]
 
block: Don't implicitly trigger event check on disk_unblock_events()
Currently, disk_unblock_events() implicitly kick event check if the
block count reaches zero.  This behavior is not described in the
comment and hinders with future changes.  Make the unblocker
explicitly check events by calling disk_check_events() as necessary.
This patch doesn't cause any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Justin TerAvest [Tue, 8 Mar 2011 18:45:00 +0000 (19:45 +0100)]
 
blk-cgroup: Lower minimum weight from 100 to 10.
We've found that we still get good, useful isolation at weights this
low. I'd like to adjust the minimum so that any other changes can take
these values into account.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Martin K. Petersen [Tue, 8 Mar 2011 07:28:01 +0000 (08:28 +0100)]
 
block: biovec_slab vs. CONFIG_BLK_DEV_INTEGRITY
The block integrity subsystem no longer uses the bio_vec slabs so this
code can safely be compiled in.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Vivek Goyal [Mon, 7 Mar 2011 20:09:32 +0000 (21:09 +0100)]
 
blk-throttle: Some cleanups and race fixes in limit  update code
When throttle group limits are updated through cgroups, a thread is
woken up to process these updates. While reviewing that code, oleg noted
couple of race conditions existed in the code and he also suggested that
code can be simplified.
This patch fixes the races simplifies the code based on Oleg's suggestions:
	- Use xchg().
	- Introduced a common function throtl_update_blkio_group_common()
          which is shared now by all iops/bps update functions.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixed a merge issue, throtl_schedule_delayed_work() takes throtl_data
as the argument now, not the queue.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Vivek Goyal [Mon, 7 Mar 2011 20:05:14 +0000 (21:05 +0100)]
 
blk-throttle: process limit change only through one function
With the help of cgroup interface one can go and upate the bps/iops
limits of existing group. Once the limits are udpated, a thread is
woken up to see if some blocked group needs recalculation based on new
limits and needs to be requeued.
There was also a piece of code where I was checking for group limit
update when a fresh bio comes in. This patch gets rid of that piece of
code and keeps processing the limit change at one place
throtl_process_limit_change().  It just keeps the code simple and easy
to understand.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Mon, 7 Mar 2011 08:40:21 +0000 (09:40 +0100)]
 
Merge branch 'block-for-2.6.39-core' of ssh:///linux/kernel/git/tj/misc into for-2.6.39/core
Gui Jianfeng [Mon, 7 Mar 2011 08:28:09 +0000 (09:28 +0100)]
 
cfq-iosched: Fix update_vdisktime logic
The update_vdisktime logic is broken since commit
b54ce60eb7f61f8e314b8b241b0469eda3bb1d42, st->min_vdisktime never makes
a progress. Fix it.
Thanks Vivek for pointing it out.
Signed-off-by: Gui Jianfeng <guijianfen@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Shaohua Li [Mon, 7 Mar 2011 08:26:29 +0000 (09:26 +0100)]
 
cfq-iosched: give busy sync queue no dispatch limit
If there are a sync and an async queue and the sync queue's think time
is small, we can ignore the sync queue's dispatch quantum. Because the
sync queue will always preempt the async queue, we don't need to care
about async's latency.  This can fix a performance regression of
aiostress test, which is introduced by commit 
f8ae6e3eb825. The issue
should exist even without the commit, but the commit amplifies the
impact.
The initial post does the same optimization for RT queue too, but since
I have no real workload for it, Vivek suggests to drop it.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Mon, 7 Mar 2011 07:59:06 +0000 (08:59 +0100)]
 
cfq-iosched: fix race in cfq_set_request()
We need to hold the queue lock over the reference increment,
it's not atomic anymore.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Tejun Heo [Fri, 4 Mar 2011 18:09:02 +0000 (19:09 +0100)]
 
Merge branch 'for-linus' of ../linux-2.6-block into block-for-2.6.39/core
This merge creates two set of conflicts.  One is simple context
conflicts caused by removal of throtl_scheduled_delayed_work() in
for-linus and removal of throtl_shutdown_timer_wq() in
for-2.6.39/core.
The other is caused by commit 
255bb490c8 (block: blk-flush shouldn't
call directly into q->request_fn() __blk_run_queue()) in for-linus
crashing with FLUSH reimplementation in for-2.6.39/core.  The conflict
isn't trivial but the resolution is straight-forward.
* __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
  should be called with @force_kblockd set to %true.
* elv_insert() in blk_kick_flush() should use
  %ELEVATOR_INSERT_REQUEUE.
Both changes are to avoid invoking ->request_fn() directly from
request completion path and closely match the changes in the commit
255bb490c8.
Signed-off-by: Tejun Heo <tj@kernel.org>
Petr Uzel [Thu, 3 Mar 2011 16:48:50 +0000 (11:48 -0500)]
 
block: kill loop_mutex
Following steps lead to deadlock in kernel:
dd if=/dev/zero of=img bs=512 count=1000
losetup -f img
mkfs.ext2 /dev/loop0
mount -t ext2 -o loop /dev/loop0 mnt
umount mnt/
Stacktrace:
[<
c102ec04>] irq_exit+0x36/0x59
[<
c101502c>] smp_apic_timer_interrupt+0x6b/0x75
[<
c127f639>] apic_timer_interrupt+0x31/0x38
[<
c101df88>] mutex_spin_on_owner+0x54/0x5b
[<
fe2250e9>] lo_release+0x12/0x67 [loop]
[<
c10c4eae>] __blkdev_put+0x7c/0x10c
[<
c10a4da5>] fput+0xd5/0x1aa
[<
fe2250cf>] loop_clr_fd+0x1a9/0x1b1 [loop]
[<
fe225110>] lo_release+0x39/0x67 [loop]
[<
c10c4eae>] __blkdev_put+0x7c/0x10c
[<
c10a59d9>] deactivate_locked_super+0x17/0x36
[<
c10b6f37>] sys_umount+0x27e/0x2a5
[<
c10b6f69>] sys_oldumount+0xb/0xe
[<
c1002897>] sysenter_do_call+0x12/0x26
[<
ffffffff>] 0xffffffff
Regression since 
2a48fc0ab24241755dc9, which introduced the private
loop_mutex as part of the BKL removal process.
As per [1], the mutex can be safely removed.
[1] http://www.gossamer-threads.com/lists/linux/kernel/1341930
Addresses: https://bugzilla.novell.com/show_bug.cgi?id=669394
Addresses: https://bugzilla.kernel.org/show_bug.cgi?id=29172
Signed-off-by: Petr Uzel <petr.uzel@suse.cz>
Cc: stable@kernel.org
Reviewed-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Tao Ma [Thu, 3 Mar 2011 15:53:20 +0000 (10:53 -0500)]
 
blktrace: Remove blk_fill_rwbs_rq.
If we enable trace events to trace block actions, We use
blk_fill_rwbs_rq to analyze the corresponding actions
in request's cmd_flags, but we only choose the minor 2 bits
from it, so most of other flags(e.g, REQ_SYNC) are missing.
For example, with a sync write we get:
write_test-2409  [001]   160.013869: block_rq_insert: 3,64 W 0 () 258135 + =
8 [write_test]
Since now we have integrated the flags of both bio and request,
it is safe to pass rq->cmd_flags directly to blk_fill_rwbs and
blk_fill_rwbs_rq isn't needed any more.
With this patch, after a sync write we get:
write_test-2417  [000]   226.603878: block_rq_insert: 3,64 WS 0 () 258135 +=
 8 [write_test]
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Vivek Goyal [Thu, 3 Mar 2011 00:05:33 +0000 (19:05 -0500)]
 
block: Move blk_throtl_exit() call to blk_cleanup_queue()
Move blk_throtl_exit() in blk_cleanup_queue() as blk_throtl_exit() is
written in such a way that it needs queue lock. In blk_release_queue()
there is no gurantee that ->queue_lock is still around.
Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
one problem.
  https://lkml.org/lkml/2010/10/23/86
  And a quick fix moved blk_throtl_exit() to blk_release_queue().
        commit 
7ad58c028652753814054f4e3ac58f925e7343f4
        Author: Jens Axboe <jaxboe@fusionio.com>
        Date:   Sat Oct 23 20:40:26 2010 +0200
        block: fix use-after-free bug in blk throttle code
This patch reverts above change and does not try to shutdown the
throtl work in blk_sync_queue(). By avoiding call to
throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
the problem reported by Ingo.
blk_sync_queue() seems to be used only by md driver and it seems to be
using it to make sure q->unplug_fn is not called as md registers its
own unplug functions and it is about to free up the data structures
used by unplug_fn(). Block throttle does not call back into unplug_fn()
or into md. So there is no need to cancel blk throttle work.
In fact I think cancelling block throttle work is bad because it might
happen that some bios are throttled and scheduled to be dispatched later
with the help of pending work and if work is cancelled, these bios might
never be dispatched.
Block layer also uses blk_sync_queue() during blk_cleanup_queue() and
blk_release_queue() time. That should be safe as we are also calling
blk_throtl_exit() which should make sure all the throttling related
data structures are cleaned up.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Vivek Goyal [Thu, 3 Mar 2011 00:04:50 +0000 (19:04 -0500)]
 
loop: No need to initialize ->queue_lock explicitly before calling blk_cleanup_queue()
Now we initialize ->queue_lock at queue allocation time so driver does
not have to worry about initializing it before calling
blk_cleanup_queue().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Vivek Goyal [Thu, 3 Mar 2011 00:04:42 +0000 (19:04 -0500)]
 
block: Initialize ->queue_lock to internal lock at queue allocation time
There does not seem to be a clear convention whether q->queue_lock is
initialized or not when blk_cleanup_queue() is called. In the past it
was not necessary but now blk_throtl_exit() takes up queue lock by
default and needs queue lock to be available.
In fact elevator_exit() code also has similar requirement just that it
is less stringent in the sense that elevator_exit() is called only if
elevator is initialized.
Two problems have been noticed because of ambiguity about spin lock
status.
      - If a driver calls blk_alloc_queue() and then soon calls
        blk_cleanup_queue() almost immediately, (because some other
	driver structure allocation failed or some other error happened)
	then blk_throtl_exit() will run into issues as queue lock is not
	initialized. Loop driver ran into this issue recently and I
	noticed error paths in md driver too. Similar error paths should
	exist in other drivers too.
      - If some driver provided external spin lock and zapped the lock
        before blk_cleanup_queue(), then it can lead to issues.
So this patch initializes the default queue lock at queue allocation time.
block throttling code is one of the users of queue lock and it is
initialized at the queue allocation time, so it makes sense to
initialize ->queue_lock also to internal lock. A driver can overide that
lock later. This will take care of the issue where a driver does not have
to worry about initializing the queue lock to default before calling
blk_cleanup_queue()
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Liu Yuan [Wed, 2 Mar 2011 16:00:15 +0000 (11:00 -0500)]
 
block/genhd: Change some numerals into macros
Rename the numerals in the diskstats_show() into the macros.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Tejun Heo [Wed, 2 Mar 2011 13:48:06 +0000 (08:48 -0500)]
 
block: blk-flush shouldn't call directly into q->request_fn() __blk_run_queue()
blk-flush decomposes a flush into sequence of multiple requests.  On
completion of a request, the next one is queued; however, block layer
must not implicitly call into q->request_fn() directly from completion
path.  This makes the queue behave unexpectedly when seen from the
drivers and violates the assumption that q->request_fn() is called
with process context + queue_lock.
This patch makes blk-flush the following two changes to make sure
q->request_fn() is not called directly from request completion path.
- blk_flush_complete_seq_end_io() now asks __blk_run_queue() to always
  use kblockd instead of calling directly into q->request_fn().
- queue_next_fseq() uses ELEVATOR_INSERT_REQUEUE instead of
  ELEVATOR_INSERT_FRONT so that elv_insert() doesn't try to unplug the
  request queue directly.
Reported by Jan in the following threads.
 http://thread.gmane.org/gmane.linux.ide/48778
 http://thread.gmane.org/gmane.linux.ide/48786
stable: applicable to v2.6.37.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Tejun Heo [Wed, 2 Mar 2011 13:48:05 +0000 (08:48 -0500)]
 
block: add @force_kblockd to __blk_run_queue()
__blk_run_queue() automatically either calls q->request_fn() directly
or schedules kblockd depending on whether the function is recursed.
blk-flush implementation needs to be able to explicitly choose
kblockd.  Add @force_kblockd.
All the current users are converted to specify %false for the
parameter and this patch doesn't introduce any behavior change.
stable: This is prerequisite for fixing ide oops caused by the new
        blk-flush implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Justin TerAvest [Tue, 1 Mar 2011 20:05:08 +0000 (15:05 -0500)]
 
cfq-iosched: Always provide group isolation.
Effectively, make group_isolation=1 the default and remove the tunable.
The setting group_isolation=0 was because by default we idle on
sync-noidle tree and on fast devices, this can be very harmful for
throughput.
However, this problem can also be addressed by tuning slice_idle and
possibly group_idle on faster storage devices.
This change simplifies the CFQ code by removing the feature entirely.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>