Documentation/pci-error-recovery.txt

                       PCI Error Recovery
                       ------------------
                         May 31, 2005

               Current document maintainer:
           Linas Vepstas <linas@austin.ibm.com>


Some PCI bus controllers are able to detect certain "hard" PCI errors
on the bus, such as parity errors on the data and address busses, as
well as SERR and PERR errors.  These chipsets are then able to disable
I/O to/from the affected device, so that, for example, a bad DMA
address doesn't end up corrupting system memory.  These same chipsets
are also able to reset the affected PCI device, and return it to
working condition.  This document describes a generic API for
performing error recovery.

The core idea is that after a PCI error has been detected, there must
be a way for the kernel to coordinate with all affected device drivers
so that the PCI card can be made operational again, possibly after
performing a full electrical #RST of the PCI card.  The API below
provides a generic way for device drivers to be notified of PCI
errors, and to be notified of, and respond to, a reset sequence.

Preliminary sketch of API, cut-n-pasted-n-modified email from
Ben Herrenschmidt, circa 5 April 2005.

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver. The absence of this pointer in pci_driver denotes a
"non-aware" driver; behaviour on these is platform dependent.
Platforms like ppc64 can try to simulate PCI hotplug remove/add.

The definition of "pci_error_token" is not covered here. It is based on
Seto's work on the synchronous error detection. We still need to define
functions for extracting information out of an opaque error token. This
is separate from this API.

This structure has the form:

struct pci_error_handlers
{
        int (*error_detected)(struct pci_dev *dev, pci_error_token error);
        int (*mmio_enabled)(struct pci_dev *dev);
        int (*resume)(struct pci_dev *dev);
        int (*link_reset)(struct pci_dev *dev);
        int (*slot_reset)(struct pci_dev *dev);
};

A driver doesn't have to implement all of these callbacks. The
only mandatory one is error_detected(). If a callback is not
implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then the
driver is assumed not to do any direct recovery and to require
a reset. If link_reset() is not implemented, the card is assumed
not to care about link resets, in which case, if recovery is supported,
the core can try recovery (but not slot_reset() unless it really did
reset the slot). If slot_reset() is not supported, link_reset() can
be called instead on a slot reset.
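
The rules about optional callbacks can be sketched in C as below. This
is only an illustration that compiles outside the kernel: struct pci_dev
and pci_error_token are stand-ins, and the CAP_* flags, handler_caps()
and the demo_* names are hypothetical, not part of the proposed API. It
also assumes that "direct recovery" needs both mmio_enabled() and
resume(), which is one reading of the paragraph above.

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles outside the kernel (assumptions). */
struct pci_dev { int id; };
typedef int pci_error_token;            /* opaque in the document */

struct pci_error_handlers {
        int (*error_detected)(struct pci_dev *dev, pci_error_token error);
        int (*mmio_enabled)(struct pci_dev *dev);
        int (*resume)(struct pci_dev *dev);
        int (*link_reset)(struct pci_dev *dev);
        int (*slot_reset)(struct pci_dev *dev);
};

/* Hypothetical capability flags derived from which callbacks are set. */
enum {
        CAP_DIRECT_RECOVERY = 1 << 0,   /* mmio_enabled() and resume() set */
        CAP_LINK_RESET      = 1 << 1,   /* link_reset() set */
        CAP_SLOT_RESET      = 1 << 2,   /* slot_reset() set */
};

/*
 * Returns a mask of CAP_* flags, or -1 for a "non-aware" driver
 * (no handler structure) or one missing the mandatory error_detected().
 */
static int handler_caps(const struct pci_error_handlers *h)
{
        int caps = 0;

        if (h == NULL || h->error_detected == NULL)
                return -1;
        if (h->mmio_enabled != NULL && h->resume != NULL)
                caps |= CAP_DIRECT_RECOVERY;
        if (h->link_reset != NULL)
                caps |= CAP_LINK_RESET;
        if (h->slot_reset != NULL)
                caps |= CAP_SLOT_RESET;
        return caps;
}

/* Example driver: only the mandatory callback plus slot_reset(). */
static int demo_error_detected(struct pci_dev *dev, pci_error_token error)
{
        (void)dev;
        (void)error;
        return 0;
}

static int demo_slot_reset(struct pci_dev *dev)
{
        (void)dev;
        return 0;
}

static const struct pci_error_handlers demo_handlers = {
        .error_detected = demo_error_detected,
        .slot_reset     = demo_slot_reset,
};
```

With these definitions, handler_caps(&demo_handlers) yields only
CAP_SLOT_RESET, so a core following this sketch would skip the
mmio_enabled() step for that driver and go straight to a reset.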

At first, the call will always be:

        1) error_detected()

        Error detected. This is sent once after an error has been detected. At
this point, the device might not be accessible anymore, depending on the
platform (the slot will be isolated on ppc64). The driver may already
have "noticed" the error because of a failing IO, but this is the proper
"synchronisation point", that is, it gives the driver a chance to
clean up, waiting for pending work (timers and the like) to
complete; it can take semaphores, schedule, etc... everything but touch
the device. Within this function and after it returns, the driver
shouldn't do any new IOs. Called in task context. This is sort of a
"quiesce" point. See the note about interrupts at the end of this doc.

        Result codes:
                - PCIERR_RESULT_CAN_RECOVER:
                  Driver returns this if it thinks it might be able to recover
                  the HW by just banging IOs or if it wants to be given
                  a chance to extract some diagnostic information (see
                  below).
                - PCIERR_RESULT_NEED_RESET:
                  Driver returns this if it thinks it can't recover unless the
                  slot is reset.
                - PCIERR_RESULT_DISCONNECT:
                  Return this if driver thinks it won't recover at all,
                  (this will detach the driver ? or just leave it
                  dangling ? to be decided)

So at this point, we have called error_detected() for all drivers
on the segment that had the error. On ppc64, the slot is isolated. What
happens now typically depends on the result from the drivers. If all
drivers on the segment/slot return PCIERR_RESULT_CAN_RECOVER, we would
re-enable IOs on the slot (or do nothing special if the platform doesn't
isolate slots) and call 2). If not and we can reset slots, we go to 4);
if neither, we have a dead slot. If it's a hotplug slot, we might
"simulate" reset by triggering HW unplug/replug though.
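
The decision taken after all error_detected() calls have returned might
be sketched as follows. The numeric values of the PCIERR_RESULT_* codes,
the next_step enum and the function name are assumptions made for
illustration; the hotplug "simulated reset" fallback is left out.

```c
#include <stddef.h>

enum pcierr_result {                    /* numeric values are assumptions */
        PCIERR_RESULT_CAN_RECOVER,
        PCIERR_RESULT_NEED_RESET,
        PCIERR_RESULT_DISCONNECT,
};

enum next_step {                        /* hypothetical names */
        STEP_MMIO_ENABLED,              /* go to 2) */
        STEP_SLOT_RESET,                /* go to 4) */
        STEP_DEAD_SLOT,                 /* nothing more can be done */
};

/*
 * results[] holds what each driver on the segment returned from
 * error_detected(); can_reset_slot says whether the platform is able
 * to reset the slot.
 */
static enum next_step after_error_detected(const enum pcierr_result *results,
                                           size_t n, int can_reset_slot)
{
        size_t i;
        int all_can_recover = 1;

        for (i = 0; i < n; i++)
                if (results[i] != PCIERR_RESULT_CAN_RECOVER)
                        all_can_recover = 0;

        if (all_can_recover)
                return STEP_MMIO_ENABLED;
        return can_reset_slot ? STEP_SLOT_RESET : STEP_DEAD_SLOT;
}
```

Note that a single driver answering anything other than
PCIERR_RESULT_CAN_RECOVER is enough to move the whole segment away from
step 2). An empty segment (n == 0) is treated as "all can recover" in
this sketch, which is a simplification.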

>>> Current ppc64 implementation assumes that a device driver will
>>> *not* schedule or semaphore in this routine; the current ppc64
>>> implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.)  This seems excessively
>>> complex and not worth implementing.

>>> The current ppc64 implementation doesn't much care if the device
>>> attempts i/o at this point, or not.  I/O's will fail, returning
>>> a value of 0xff on read, and writes will be dropped. If the device
>>> driver attempts more than 10K I/O's to a frozen adapter, the platform
>>> will assume that the device driver has gone into an infinite loop,
>>> and it will panic the kernel.
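
Given the note above, a driver that polls a register would do well to
bound its loops and to treat an all-ones read as a sign that the slot
may be frozen. A minimal sketch, assuming a one-byte status register;
POLL_LIMIT, the read hook and all function names are hypothetical, and a
register that can legitimately read 0xff would need a different check:

```c
#include <stdint.h>

#define POLL_LIMIT 100          /* assumed; far below the 10K threshold */

/* Hypothetical register-read hook; a real driver uses its own accessor. */
typedef uint8_t (*read_reg_fn)(void *ctx);

/*
 * Poll until all bits in ready_mask are set.
 * Returns 0 on success, -1 if a read returned 0xff (slot possibly
 * frozen), -2 if POLL_LIMIT reads went by without the device being ready.
 */
static int poll_ready(read_reg_fn read_reg, void *ctx, uint8_t ready_mask)
{
        int i;

        for (i = 0; i < POLL_LIMIT; i++) {
                uint8_t v = read_reg(ctx);

                if (v == 0xff)
                        return -1;
                if ((v & ready_mask) == ready_mask)
                        return 0;
        }
        return -2;
}

/* Stand-in for a frozen adapter: every read returns 0xff. */
static uint8_t frozen_read(void *ctx)
{
        (void)ctx;
        return 0xff;
}

/* Stand-in for a healthy device that becomes ready on the third read. */
static uint8_t counting_read(void *ctx)
{
        int *calls = ctx;

        return (++*calls >= 3) ? 0x01 : 0x00;
}
```

Against frozen_read() the loop gives up after a single read instead of
spinning, which keeps the driver well clear of the I/O count at which
the platform assumes an infinite loop.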

        2) mmio_enabled()

        This is the "early recovery" call. IOs are allowed again, but DMA is
not (hrm... to be discussed, I prefer not), with some restrictions. This
is NOT a callback for the driver to start operations again, only to
peek/poke at the device, extract diagnostic information, if any, and
possibly do things like trigger a device local reset or some such,
but not restart operations. This is sent if all drivers on a segment
agree that they can try to recover and no automatic link reset was
performed by the HW. If the platform can't just re-enable IOs without
a slot reset or a link reset, it doesn't call this callback and goes
directly to 3) or 4). All IOs should be done _synchronously_ from
within this callback; errors triggered by them will be returned via
the normal pci_check_whatever() api, and no new error_detected() callback
will be issued due to an error happening here. However, such an error
might cause IOs to be re-blocked for the whole segment, and thus
invalidate the recovery that other devices on the same segment might
have done, forcing the whole segment into one of the next states,
that is link reset or slot reset.

        Result codes:
                - PCIERR_RESULT_RECOVERED
                  Driver returns this if it thinks the device is fully
                  functional and thinks it is ready to start
                  normal driver operations again. There is no
                  guarantee that the driver will actually be
                  allowed to proceed, as another driver on the
                  same segment might have failed and thus triggered a
                  slot reset on platforms that support it.

                - PCIERR_RESULT_NEED_RESET
                  Driver returns this if it thinks the device is not
                  recoverable in its current state and it needs a slot
                  reset to proceed.

                - PCIERR_RESULT_DISCONNECT
                  Same as above. Total failure, no recovery even after
                  reset; driver dead. (To be defined more precisely)
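
A driver's mmio_enabled() callback might look roughly like the sketch
below: it only peeks at the device and maps what it sees to one of the
result codes, without restarting operations. struct pci_dev is a
stand-in here, and the status bits, demo_read_status() and the numeric
values of the result codes are all hypothetical.

```c
/* Stand-in device: a fake status register instead of real MMIO. */
struct pci_dev { unsigned int status; };

enum pcierr_result {                    /* numeric values are assumptions */
        PCIERR_RESULT_RECOVERED,
        PCIERR_RESULT_NEED_RESET,
        PCIERR_RESULT_DISCONNECT,
};

#define DEMO_STATUS_FATAL   0x1u        /* hypothetical: hardware is gone */
#define DEMO_STATUS_WEDGED  0x2u        /* hypothetical: needs a slot reset */

/* Synchronous read of the (fake) diagnostic status register. */
static unsigned int demo_read_status(struct pci_dev *dev)
{
        return dev->status;
}

/* Inspect the device only; normal operation restarts later, in resume(). */
static int demo_mmio_enabled(struct pci_dev *dev)
{
        unsigned int status = demo_read_status(dev);

        if (status & DEMO_STATUS_FATAL)
                return PCIERR_RESULT_DISCONNECT;
        if (status & DEMO_STATUS_WEDGED)
                return PCIERR_RESULT_NEED_RESET;
        return PCIERR_RESULT_RECOVERED;
}
```

Checking the fatal bit first means a device that reports both conditions
is given up on rather than reset, which is a design choice of this
sketch and not something the API prescribes.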

>>> The current ppc64 implementation does not implement this callback.

        3) link_reset()

        This is called after the link has been reset. This is typically
a PCI Express specific state at this point and is done whenever a
non-fatal error has been detected that can be "solved" by resetting
the link. This call informs the driver of the reset and the driver
should check if the device appears to be in working condition.
This function acts a bit like 2) mmio_enabled(), in that the driver
is not supposed to restart normal driver I/O operations right away.
Instead, it should just "probe" the device to check its recoverability
status. If all is right, then the core will call resume() once all
drivers have ack'd link_reset().

        Result codes:
                (identical to mmio_enabled)

>>> The current ppc64 implementation does not implement this callback.

        4) slot_reset()

        This is called after the slot has been soft or hard reset by the
platform.  A soft reset consists of asserting the adapter #RST line
and then restoring the PCI BARs and PCI configuration header. If the
platform supports PCI hotplug, then it might instead perform a hard
reset by toggling power on the slot off/on. This call gives drivers
the chance to re-initialize the hardware (re-download firmware, etc.),
but drivers shouldn't restart normal I/O processing operations at
this point.  (See note about interrupts; interrupts aren't guaranteed
to be delivered until the resume() callback has been called). If all
device drivers report success on this callback, the platform will call
resume() to complete the error handling and let the driver restart
normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset.  If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again.  If the device still can't
be recovered, there is nothing more that can be done;  the platform
will typically report a "permanent failure" in such a case.  The
device will be considered "dead" in this case.

        Result codes:
                - PCIERR_RESULT_DISCONNECT
                Same as above.
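
The soft-then-hard reset escalation described above could be sketched as
follows. The platform_*_reset() hooks, the fields of the stand-in
struct pci_dev and the numeric values of the result codes are
assumptions for illustration; a real platform would assert #RST or
toggle slot power where this sketch merely counts.

```c
/* Stand-in device that records what the "platform" did to it. */
struct pci_dev {
        int soft_resets;
        int hard_resets;
        int works_after_hard;   /* demo knob: does a power cycle fix it? */
};

enum pcierr_result {            /* numeric values are assumptions */
        PCIERR_RESULT_RECOVERED,
        PCIERR_RESULT_DISCONNECT,
};

/* Hypothetical platform hooks. */
static void platform_soft_reset(struct pci_dev *dev) { dev->soft_resets++; }
static void platform_hard_reset(struct pci_dev *dev) { dev->hard_resets++; }

/* Demo driver callback: the device only comes back after a power cycle. */
static int demo_slot_reset(struct pci_dev *dev)
{
        if (dev->hard_resets > 0 && dev->works_after_hard)
                return PCIERR_RESULT_RECOVERED;
        return PCIERR_RESULT_DISCONNECT;
}

/*
 * Returns 1 if the platform may go on to resume(), 0 if the device
 * has to be considered "dead".
 */
static int reset_with_escalation(struct pci_dev *dev,
                                 int (*slot_reset)(struct pci_dev *),
                                 int can_power_cycle)
{
        platform_soft_reset(dev);
        if (slot_reset(dev) == PCIERR_RESULT_RECOVERED)
                return 1;

        if (!can_power_cycle)
                return 0;

        platform_hard_reset(dev);
        return slot_reset(dev) == PCIERR_RESULT_RECOVERED;
}
```

The sketch handles a single driver; with several drivers on the slot,
the platform would have to collect the result of every slot_reset()
call before deciding whether to escalate.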

>>> The current ppc64 implementation does not try a power-cycle reset
>>> if the driver returned PCIERR_RESULT_DISCONNECT. However, it should.

        5) resume()

        This is called if all drivers on the segment have returned
PCIERR_RESULT_RECOVERED from one of the 3 previous callbacks.
That basically tells the driver to restart activity, that everything
is back and running. No result code is taken into account here. If
a new error happens, it will restart a new error handling process.

That's it. I think this covers all the possibilities. The way those
callbacks are called is platform policy. A platform with no slot reset
capability for example may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, there is a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)

After much thinking, I decided to leave that to the platform. That is,
the recovery API only specifies that:

 - There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the
resume() callback is sent, at which point interrupts are expected to be
fully operational.

 - There is no guarantee that interrupt delivery is stopped, that is, a
driver that gets an interrupt after detecting an error, or that detects
an error within the interrupt handler such that it prevents proper
ack'ing of the interrupt (and thus removal of the source) should just
return IRQ_NOTHANDLED. It's up to the platform to deal with that
condition, typically by masking the irq source during the duration of
the error handling. It is expected that the platform "knows" which
interrupts are routed to error-management capable slots and can deal
with temporarily disabling that irq number during error processing (this
isn't terribly complex). That means some IRQ latency for other devices
sharing the interrupt, but there is simply no other way. High end
platforms aren't supposed to share interrupts between many devices
anyway :)
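
The second point can be illustrated with a sketch of an interrupt
handler that backs off while recovery is in progress. The handler
signature, struct demo_dev, the enum holding IRQ_NOTHANDLED and the
all-ones check are simplifications and assumptions made so that the
example compiles on its own; they are not kernel definitions.

```c
#include <stdint.h>

/* Stand-in return codes; the text above names IRQ_NOTHANDLED. */
enum irq_result { IRQ_NOTHANDLED, IRQ_HANDLED };

struct demo_dev {
        int in_error_recovery;  /* set in error_detected(), cleared in resume() */
        uint32_t irq_status;    /* stand-in for an interrupt status register */
};

static enum irq_result demo_irq_handler(struct demo_dev *dev)
{
        uint32_t status;

        /* Error already detected: do not touch the device at all. */
        if (dev->in_error_recovery)
                return IRQ_NOTHANDLED;

        status = dev->irq_status;

        /* All-ones read: assume the slot was isolated under us. */
        if (status == 0xffffffffu)
                return IRQ_NOTHANDLED;

        /* Nothing pending: the interrupt came from a device sharing the line. */
        if (status == 0)
                return IRQ_NOTHANDLED;

        dev->irq_status = 0;    /* acknowledge the interrupt source */
        return IRQ_HANDLED;
}
```

In the two error cases the handler leaves the source un-acked on
purpose; as described above, it is then up to the platform to mask the
irq until error handling has finished.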


Revised: 31 May 2005 Linas Vepstas <linas@austin.ibm.com>