Documentation/block/barrier.txt

   1 I/O Barriers
   2 ============
   3 Tejun Heo <htejun@gmail.com>, July 22 2005
   4
   5 I/O barrier requests are used to guarantee ordering around the barrier
   6 requests.  Unless you're crazy enough to use disk drives for
   7 implementing synchronization constructs (wow, sounds interesting...),
   8 the ordering is meaningful only for write requests for things like
   9 journal checkpoints.  All requests queued before a barrier request
  10 must be finished (made it to the physical medium) before the barrier
  11 request is started, and all requests queued after the barrier request
  12 must be started only after the barrier request is finished (again,
  13 made it to the physical medium).
  14
  15 In other words, I/O barrier requests have the following two properties.
  16
  17 1. Request ordering
  18
  19 Requests cannot pass the barrier request.  Preceding requests are
  20 processed before the barrier and following requests after.
  21
  22 Depending on what features a drive supports, this can be done in one
  23 of the following three ways.
  24
  25 i. For devices which have queue depth greater than 1 (TCQ devices) and
  26 support ordered tags, block layer can just issue the barrier as an
  27 ordered request and the lower level driver, controller and drive
  28 itself are responsible for making sure that the ordering constraint is
  29 met.  Most modern SCSI controllers/drives should support this.
  30
  31 NOTE: SCSI ordered tag isn't currently used due to limitation in the
  32       SCSI midlayer, see the following random notes section.
  33
  34 ii. For devices which have queue depth greater than 1 but don't
  35 support ordered tags, block layer ensures that the requests preceding
  36 a barrier request finishes before issuing the barrier request.  Also,
  37 it defers requests following the barrier until the barrier request is
  38 finished.  Older SCSI controllers/drives and SATA drives fall in this
  39 category.
  40
  41 iii. Devices which have queue depth of 1.  This is a degenerate case
  42 of ii.  Just keeping issue order suffices.  Ancient SCSI
  43 controllers/drives and IDE drives are in this category.
  44
  45 2. Forced flushing to physical medium
  46
  47 Again, if you're not gonna do synchronization with disk drives (dang,
  48 it sounds even more appealing now!), the reason you use I/O barriers
  49 is mainly to protect filesystem integrity when power failure or some
  50 other events abruptly stop the drive from operating and possibly make
  51 the drive lose data in its cache.  So, I/O barriers need to guarantee
  52 that requests actually get written to non-volatile medium in order.
  53
  54 There are four cases,
  55
  56 i. No write-back cache.  Keeping requests ordered is enough.
  57
  58 ii. Write-back cache but no flush operation.  There's no way to
  59 guarantee physical-medium commit order.  This kind of devices can't to
  60 I/O barriers.
  61
  62 iii. Write-back cache and flush operation but no FUA (forced unit
  63 access).  We need two cache flushes - before and after the barrier
  64 request.
  65
  66 iv. Write-back cache, flush operation and FUA.  We still need one
  67 flush to make sure requests preceding a barrier are written to medium,
  68 but post-barrier flush can be avoided by using FUA write on the
  69 barrier itself.
  70
  71
  72 How to support barrier requests in drivers
  73 ------------------------------------------
  74
  75 All barrier handling is done inside block layer proper.  All low level
  76 drivers have to are implementing its prepare_flush_fn and using one
  77 the following two functions to indicate what barrier type it supports
  78 and how to prepare flush requests.  Note that the term 'ordered' is
  79 used to indicate the whole sequence of performing barrier requests
  80 including draining and flushing.
  81
  82 typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);
  83
  84 int blk_queue_ordered(request_queue_t *q, unsigned ordered,
  85                       prepare_flush_fn *prepare_flush_fn,
  86                       unsigned gfp_mask);
  87
  88 int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
  89                              prepare_flush_fn *prepare_flush_fn,
  90                              unsigned gfp_mask);
  91
  92 The only difference between the two functions is whether or not the
  93 caller is holding q->queue_lock on entry.  The latter expects the
  94 caller is holding the lock.
  95
  96 @q                      : the queue in question
  97 @ordered                : the ordered mode the driver/device supports
  98 @prepare_flush_fn       : this function should prepare @rq such that it
  99                           flushes cache to physical medium when executed
 100 @gfp_mask               : gfp_mask used when allocating data structures
 101                           for ordered processing
 102
 103 For example, SCSI disk driver's prepare_flush_fn looks like the
 104 following.
 105
 106 static void sd_prepare_flush(request_queue_t *q, struct request *rq)
 107 {
 108         memset(rq->cmd, 0, sizeof(rq->cmd));
 109         rq->flags |= REQ_BLOCK_PC;
 110         rq->timeout = SD_TIMEOUT;
 111         rq->cmd[0] = SYNCHRONIZE_CACHE;
 112 }
 113
 114 The following seven ordered modes are supported.  The following table
 115 shows which mode should be used depending on what features a
 116 device/driver supports.  In the leftmost column of table,
 117 QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
 118
 119 The table is followed by description of each mode.  Note that in the
 120 descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
 121 used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
 122 preceding step must be complete before proceeding to the next step.
 123 '->' indicates that the next step can start as soon as the previous
 124 step is issued.
 125
 126             write-back cache    ordered tag     flush           FUA
 127 -----------------------------------------------------------------------
 128 NONE            yes/no          N/A             no              N/A
 129 DRAIN           no              no              N/A             N/A
 130 DRAIN_FLUSH     yes             no              yes             no
 131 DRAIN_FUA       yes             no              yes             yes
 132 TAG             no              yes             N/A             N/A
 133 TAG_FLUSH       yes             yes             yes             no
 134 TAG_FUA         yes             yes             yes             yes
 135
 136
 137 QUEUE_ORDERED_NONE
 138         I/O barriers are not needed and/or supported.
 139
 140         Sequence: N/A
 141
 142 QUEUE_ORDERED_DRAIN
 143         Requests are ordered by draining the request queue and cache
 144         flushing isn't needed.
 145
 146         Sequence: drain => barrier
 147
 148 QUEUE_ORDERED_DRAIN_FLUSH
 149         Requests are ordered by draining the request queue and both
 150         pre-barrier and post-barrier cache flushings are needed.
 151
 152         Sequence: drain => preflush => barrier => postflush
 153
 154 QUEUE_ORDERED_DRAIN_FUA
 155         Requests are ordered by draining the request queue and
 156         pre-barrier cache flushing is needed.  By using FUA on barrier
 157         request, post-barrier flushing can be skipped.
 158
 159         Sequence: drain => preflush => barrier
 160
 161 QUEUE_ORDERED_TAG
 162         Requests are ordered by ordered tag and cache flushing isn't
 163         needed.
 164
 165         Sequence: barrier
 166
 167 QUEUE_ORDERED_TAG_FLUSH
 168         Requests are ordered by ordered tag and both pre-barrier and
 169         post-barrier cache flushings are needed.
 170
 171         Sequence: preflush -> barrier -> postflush
 172
 173 QUEUE_ORDERED_TAG_FUA
 174         Requests are ordered by ordered tag and pre-barrier cache
 175         flushing is needed.  By using FUA on barrier request,
 176         post-barrier flushing can be skipped.
 177
 178         Sequence: preflush -> barrier
 179
 180
 181 Random notes/caveats
 182 --------------------
 183
 184 * SCSI layer currently can't use TAG ordering even if the drive,
 185 controller and driver support it.  The problem is that SCSI midlayer
 186 request dispatch function is not atomic.  It releases queue lock and
 187 switch to SCSI host lock during issue and it's possible and likely to
 188 happen in time that requests change their relative positions.  Once
 189 this problem is solved, TAG ordering can be enabled.
 190
 191 * Currently, no matter which ordered mode is used, there can be only
 192 one barrier request in progress.  All I/O barriers are held off by
 193 block layer until the previous I/O barrier is complete.  This doesn't
 194 make any difference for DRAIN ordered devices, but, for TAG ordered
 195 devices with very high command latency, passing multiple I/O barriers
 196 to low level *might* be helpful if they are very frequent.  Well, this
 197 certainly is a non-issue.  I'm writing this just to make clear that no
 198 two I/O barrier is ever passed to low-level driver.
 199
 200 * Completion order.  Requests in ordered sequence are issued in order
 201 but not required to finish in order.  Barrier implementation can
 202 handle out-of-order completion of ordered sequence.  IOW, the requests
 203 MUST be processed in order but the hardware/software completion paths
 204 are allowed to reorder completion notifications - eg. current SCSI
 205 midlayer doesn't preserve completion order during error handling.
 206
 207 * Requeueing order.  Low-level drivers are free to requeue any request
 208 after they removed it from the request queue with
 209 blkdev_dequeue_request().  As barrier sequence should be kept in order
 210 when requeued, generic elevator code takes care of putting requests in
 211 order around barrier.  See blk_ordered_req_seq() and
 212 ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
 213
 214 Note that block drivers must not requeue preceding requests while
 215 completing latter requests in an ordered sequence.  Currently, no
 216 error checking is done against this.
 217
 218 * Error handling.  Currently, block layer will report error to upper
 219 layer if any of requests in an ordered sequence fails.  Unfortunately,
 220 this doesn't seem to be enough.  Look at the following request flow.
 221 QUEUE_ORDERED_TAG_FLUSH is in use.
 222
 223  [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
 224                                           still in elevator
 225
 226 Let's say request [2], [3] are write requests to update file system
 227 metadata (journal or whatever) and [barrier] is used to mark that
 228 those updates are valid.  Consider the following sequence.
 229
 230  i.     Requests [0] ~ [post] leaves the request queue and enters
 231         low-level driver.
 232  ii.    After a while, unfortunately, something goes wrong and the
 233         drive fails [2].  Note that any of [0], [1] and [3] could have
 234         completed by this time, but [pre] couldn't have been finished
 235         as the drive must process it in order and it failed before
 236         processing that command.
 237  iii.   Error handling kicks in and determines that the error is
 238         unrecoverable and fails [2], and resumes operation.
 239  iv.    [pre] [barrier] [post] gets processed.
 240  v.     *BOOM* power fails
 241
 242 The problem here is that the barrier request is *supposed* to indicate
 243 that filesystem update requests [2] and [3] made it safely to the
 244 physical medium and, if the machine crashes after the barrier is
 245 written, filesystem recovery code can depend on that.  Sadly, that
 246 isn't true in this case anymore.  IOW, the success of a I/O barrier
 247 should also be dependent on success of some of the preceding requests,
 248 where only upper layer (filesystem) knows what 'some' is.
 249
 250 This can be solved by implementing a way to tell the block layer which
 251 requests affect the success of the following barrier request and
 252 making lower lever drivers to resume operation on error only after
 253 block layer tells it to do so.
 254
 255 As the probability of this happening is very low and the drive should
 256 be faulty, implementing the fix is probably an overkill.  But, still,
 257 it's there.
 258
 259 * In previous drafts of barrier implementation, there was fallback
 260 mechanism such that, if FUA or ordered TAG fails, less fancy ordered
 261 mode can be selected and the failed barrier request is retried
 262 automatically.  The rationale for this feature was that as FUA is
 263 pretty new in ATA world and ordered tag was never used widely, there
 264 could be devices which report to support those features but choke when
 265 actually given such requests.
 266
 267  This was removed for two reasons 1. it's an overkill 2. it's
 268 impossible to implement properly when TAG ordering is used as low
 269 level drivers resume after an error automatically.  If it's ever
 270 needed adding it back and modifying low level drivers accordingly
 271 shouldn't be difficult.