Documentation/filesystems/relayfs.txt

relayfs - a high-speed data relay filesystem
============================================

relayfs is a filesystem designed to provide an efficient mechanism for
tools and facilities to relay large and potentially sustained streams
of data from kernel space to user space.

The main abstraction of relayfs is the 'channel'.  A channel consists
of a set of per-cpu kernel buffers, each represented by a file in the
relayfs filesystem.  Kernel clients write into a channel using
efficient write functions which automatically log to the current cpu's
channel buffer.  User space applications mmap() the per-cpu files and
retrieve the data as it becomes available.

The format of the data logged into the channel buffers is completely
up to the relayfs client; relayfs does, however, provide hooks which
allow clients to impose some structure on the buffer data.  relayfs
does not implement any form of data filtering either - this is also
left to the client.  The purpose is to keep relayfs as simple as
possible.

This document provides an overview of the relayfs API.  The function
parameters are documented along with the functions in the filesystem
code - please see that for details.

Semantics
=========

Each relayfs channel has one buffer per CPU, and each buffer has one
or more sub-buffers.  Messages are written to the first sub-buffer
until it is too full to contain a new message, in which case the
message is written to the next sub-buffer (if available).  Messages
are never split across sub-buffers.  At this point, userspace can be
notified so it empties the first sub-buffer, while the kernel
continues writing to the next.

When a sub-buffer is full, the kernel knows how many bytes of it are
padding i.e. unused.  Userspace can use this knowledge to copy only
valid data.

After copying it, userspace can notify the kernel that a sub-buffer
has been consumed.

relayfs can operate in a mode where it overwrites data not yet
collected by userspace rather than waiting for userspace to consume
it.

relayfs itself does not provide for communication of this buffer state
(sub-buffer full, sub-buffer consumed, amount of padding) between
userspace and kernel, which allows the kernel side to remain simple
and does not impose a single interface on userspace.  It does provide
a set of examples and a separate helper though, described below.

klog and relay-apps example code
================================

relayfs itself is ready to use, but to make things easier, a couple of
simple utility functions and a set of examples are provided.

The relay-apps example tarball, available on the relayfs sourceforge
site, contains a set of self-contained examples, each consisting of a
pair of .c files containing boilerplate code for each of the user and
kernel sides of a relayfs application; combined, these two sets of
boilerplate code provide glue to easily stream data to disk, without
having to bother with mundane housekeeping chores.

The 'klog debugging functions' patch (klog.patch in the relay-apps
tarball) provides a couple of high-level logging functions to the
kernel which allow writing formatted text or raw data to a channel,
regardless of whether a channel to write into exists or not, or
whether relayfs is compiled into the kernel or is configured as a
module.  These functions allow you to put unconditional 'trace'
statements anywhere in the kernel or kernel modules; only when there
is a 'klog handler' registered will data actually be logged (see the
klog and kleak examples for details).

It is of course possible to use relayfs from scratch i.e. without
using any of the relay-apps example code or klog, but you'll have to
implement communication between userspace and kernel, allowing both to
convey the state of buffers (full, empty, amount of padding).

klog and the relay-apps examples can be found in the relay-apps
tarball on http://relayfs.sourceforge.net


The relayfs user space API
==========================

relayfs implements basic file operations for user space access to
relayfs channel buffer data.  Here are the file operations that are
available and some comments regarding their behavior:

open()   enables user to open an _existing_ buffer.

mmap()   results in channel buffer being mapped into the caller's
         memory space.  Note that you can't do a partial mmap - you
         must map the entire file, which is NRBUF * SUBBUFSIZE bytes
         (the number of sub-buffers times the sub-buffer size).

read()   read the contents of a channel buffer.  The bytes read are
         'consumed' by the reader i.e. they won't be available again
         to subsequent reads.  If the channel is being used in
         no-overwrite mode (the default), it can be read at any time
         even if there's an active kernel writer.  If the channel is
         being used in overwrite mode and there are active channel
         writers, results may be unpredictable - users should make
         sure that all logging to the channel has ended before using
         read() with overwrite mode.

poll()   POLLIN/POLLRDNORM/POLLERR supported.  User applications are
         notified when sub-buffer boundaries are crossed.

close()  decrements the channel buffer's refcount.  When the refcount
         reaches 0 i.e. when no process or kernel client has the buffer
         open, the channel buffer is freed.
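
As an illustration of the mmap() and poll() behavior listed above, here
is a minimal user space sketch.  It is not taken from relay-apps: the
helper names map_relay_fd() and wait_for_data() are made up for this
example, and it assumes the application already knows the n_subbufs and
subbuf_size values the kernel client passed to relay_open().

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a whole per-cpu relay file read-only (hypothetical helper).
 * Partial mappings are refused, so the length is always
 * n_subbufs * subbuf_size.  Returns the mapping or NULL on failure;
 * on success *len_out holds the length to pass to munmap().
 */
static void *map_relay_fd(int fd, size_t n_subbufs, size_t subbuf_size,
                          size_t *len_out)
{
        size_t len;
        void *map;

        assert(len_out != NULL);
        if (n_subbufs == 0 || subbuf_size == 0)
                return NULL;
        if (n_subbufs > SIZE_MAX / subbuf_size)
                return NULL;            /* length would overflow */
        len = n_subbufs * subbuf_size;

        map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return NULL;

        *len_out = len;
        return map;
}

/*
 * Wait until the file is readable, which for a relay file means a
 * sub-buffer boundary was crossed (hypothetical helper).
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
static int wait_for_data(int fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc < 0)
                return -1;
        if (rc == 0)
                return 0;
        return (pfd.revents & POLLERR) ? -1 : 1;
}
```

The fd would come from open() on one of the per-cpu files of a mounted
relayfs.  What the application does once wait_for_data() returns 1,
and how it tells the kernel side that a sub-buffer has been consumed,
is application-specific and not shown here.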


In order for a user application to make use of relayfs files, the
relayfs filesystem must be mounted.  For example,

        mount -t relayfs relayfs /mnt/relay

NOTE:   relayfs doesn't need to be mounted for kernel clients to create
        or use channels - it only needs to be mounted when user space
        applications need access to the buffer data.


The relayfs kernel API
======================

Here's a summary of the API relayfs provides to in-kernel clients:


  channel management functions:

    relay_open(base_filename, parent, subbuf_size, n_subbufs,
               callbacks)
    relay_close(chan)
    relay_flush(chan)
    relay_reset(chan)
    relayfs_create_dir(name, parent)
    relayfs_remove_dir(dentry)
    relayfs_create_file(name, parent, mode, fops, data)
    relayfs_remove_file(dentry)

  channel management typically called on instigation of userspace:

    relay_subbufs_consumed(chan, cpu, subbufs_consumed)

  write functions:

    relay_write(chan, data, length)
    __relay_write(chan, data, length)
    relay_reserve(chan, length)

  callbacks:

    subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
    buf_mapped(buf, filp)
    buf_unmapped(buf, filp)
    create_buf_file(filename, parent, mode, buf, is_global)
    remove_buf_file(dentry)

  helper functions:

    relay_buf_full(buf)
    subbuf_start_reserve(buf, length)


Creating a channel
------------------

relay_open() is used to create a channel, along with its per-cpu
channel buffers.  Each channel buffer will have an associated file
created for it in the relayfs filesystem, which can be opened and
mmapped from user space if desired.  The files are named
basename0...basenameN-1 where N is the number of online cpus, and by
default will be created in the root of the filesystem.  If you want a
directory structure to contain your relayfs files, you can create it
with relayfs_create_dir() and pass the parent directory to
relay_open().  Clients are responsible for cleaning up any directory
structure they create when the channel is closed - use
relayfs_remove_dir() for that.

The total size of each per-cpu buffer is calculated by multiplying the
number of sub-buffers by the sub-buffer size passed into relay_open().
The idea behind sub-buffers is that they're basically an extension of
double-buffering to N buffers, and they also allow applications to
easily implement random-access-on-buffer-boundary schemes, which can
be important for some high-volume applications.  The number and size
of sub-buffers are completely dependent on the application, and even
for the same application, different conditions will warrant different
values for these parameters at different times.  Typically, the right
values to use are best decided after some experimentation; in general,
though, it's safe to assume that having only 1 sub-buffer is a bad
idea - you're guaranteed to either overwrite data or lose events
depending on the channel mode being used.

Channel 'modes'
---------------

relayfs channels can be used in either of two modes - 'overwrite' or
'no-overwrite'.  The mode is entirely determined by the implementation
of the subbuf_start() callback, as described below.  In 'overwrite'
mode, also known as 'flight recorder' mode, writes continuously cycle
around the buffer and will never fail, but will unconditionally
overwrite old data regardless of whether it's actually been consumed.
In no-overwrite mode, writes will fail i.e. data will be lost, if the
number of unconsumed sub-buffers equals the total number of
sub-buffers in the channel.  It should be clear that if there is no
consumer or if the consumer can't consume sub-buffers fast enough,
data will be lost in either case; the only difference is whether data
is lost from the beginning or the end of a buffer.

As explained above, a relayfs channel is made up of one or more
per-cpu channel buffers, each implemented as a circular buffer
subdivided into one or more sub-buffers.  Messages are written into
the current sub-buffer of the channel's current per-cpu buffer via the
write functions described below.  Whenever a message can't fit into
the current sub-buffer, because there's no room left for it, the
client is notified via the subbuf_start() callback that a switch to a
new sub-buffer is about to occur.  The client uses this callback to 1)
initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually go ahead with the sub-buffer switch.
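
The switching behavior described above can be modelled in user space.
The following is only a sketch of the idea, not relayfs code: struct
model_buf and all of the model_* and *_start names are invented for
this example, the sizes are arbitrary, and the callback takes fewer
arguments than the real subbuf_start().  The two callbacks correspond
to the two channel modes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_SUBBUFS   4
#define SUBBUF_SIZE 16

struct model_buf;
typedef int (*subbuf_start_fn)(struct model_buf *buf, size_t prev_padding);

struct model_buf {
        unsigned char data[N_SUBBUFS * SUBBUF_SIZE];
        size_t offset;          /* write offset in the current sub-buffer */
        size_t produced;        /* sub-buffers finished by the writer */
        size_t consumed;        /* sub-buffers released by the reader */
        size_t last_padding;    /* unused tail of the last finished one */
        int stalled;            /* a switch was refused, retry it */
        subbuf_start_fn subbuf_start;
};

/* Full means every sub-buffer is finished but not yet consumed. */
static int model_buf_full(const struct model_buf *buf)
{
        return buf->produced - buf->consumed >= N_SUBBUFS;
}

/* 'no-overwrite': refuse the switch while the buffer is full. */
static int no_overwrite_start(struct model_buf *buf, size_t prev_padding)
{
        (void)prev_padding;
        return !model_buf_full(buf);
}

/* 'overwrite': always switch, clobbering unconsumed data. */
static int overwrite_start(struct model_buf *buf, size_t prev_padding)
{
        (void)buf;
        (void)prev_padding;
        return 1;
}

/* The reader reports that n sub-buffers have been consumed. */
static void model_consume(struct model_buf *buf, size_t n)
{
        buf->consumed += n;
}

/* Returns 1 if the message was stored, 0 if it was dropped. */
static int model_write(struct model_buf *buf, const void *msg, size_t len)
{
        assert(buf->subbuf_start != NULL);
        if (len > SUBBUF_SIZE)
                return 0;       /* messages are never split */

        if (buf->stalled || buf->offset + len > SUBBUF_SIZE) {
                if (!buf->stalled) {
                        /* Finish the current sub-buffer exactly once. */
                        buf->last_padding = SUBBUF_SIZE - buf->offset;
                        buf->produced++;
                }
                if (!buf->subbuf_start(buf, buf->last_padding)) {
                        buf->stalled = 1;
                        return 0;
                }
                buf->stalled = 0;
                buf->offset = 0;
        }

        memcpy(buf->data + (buf->produced % N_SUBBUFS) * SUBBUF_SIZE +
               buf->offset, msg, len);
        buf->offset += len;
        return 1;
}
```

With 10-byte messages each 16-byte sub-buffer holds one message and 6
bytes of padding.  In the no-overwrite model the fifth write is
dropped until model_consume() is called; in the overwrite model every
write succeeds and old sub-buffers are silently reused.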

To implement 'no-overwrite' mode, the kernel client would provide an
implementation of the subbuf_start() callback something like the
following:

static int subbuf_start(struct rchan_buf *buf,
                        void *subbuf,
                        void *prev_subbuf,
                        unsigned int prev_padding)
{
        /* Store the padding count in the finished sub-buffer's header. */
        if (prev_subbuf)
                *((unsigned *)prev_subbuf) = prev_padding;

        /* Every sub-buffer is still unconsumed: refuse the switch. */
        if (relay_buf_full(buf))
                return 0;

        /* Reserve header room in the new sub-buffer for its padding count. */
        subbuf_start_reserve(buf, sizeof(unsigned int));

        return 1;
}

If the current buffer is full i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
occur yet i.e. until the consumer has had a chance to read the current
set of ready sub-buffers.  For the relay_buf_full() function to make
sense, the consumer is responsible for notifying relayfs when
sub-buffers have been consumed via relay_subbufs_consumed().  Any
subsequent attempts to write into the buffer will again invoke the
subbuf_start() callback with the same parameters; only when the
consumer has consumed one or more of the ready sub-buffers will
relay_buf_full() return 0, in which case the buffer switch can
continue.

The implementation of the subbuf_start() callback for 'overwrite' mode
would be very similar:

static int subbuf_start(struct rchan_buf *buf,
                        void *subbuf,
                        void *prev_subbuf,
                        unsigned int prev_padding)
{
        /* Store the padding count in the finished sub-buffer's header. */
        if (prev_subbuf)
                *((unsigned *)prev_subbuf) = prev_padding;

        /* Reserve header room in the new sub-buffer for its padding count. */
        subbuf_start_reserve(buf, sizeof(unsigned int));

        /* Always switch, even if the next sub-buffer is unconsumed. */
        return 1;
}

In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur
unconditionally.  It's also meaningless for the client to use the
relay_subbufs_consumed() function in this mode, as it's never
consulted.

The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
implements the simplest possible 'no-overwrite' mode i.e. it does
nothing but return 0.

Header information can be reserved at the beginning of each sub-buffer
by calling the subbuf_start_reserve() helper function from within the
subbuf_start() callback.  This reserved area can be used to store
whatever information the client wants.  In the example above, room is
reserved in each sub-buffer to store the padding count for that
sub-buffer.  This is filled in for the previous sub-buffer in the
subbuf_start() implementation; the padding value for the previous
sub-buffer is passed into the subbuf_start() callback along with a
pointer to the previous sub-buffer, since the padding value isn't
known until a sub-buffer is filled.  The subbuf_start() callback is
also called for the first sub-buffer when the channel is opened, to
give the client a chance to reserve space in it.  In this case the
previous sub-buffer pointer passed into the callback will be NULL, so
the client should check the value of the prev_subbuf pointer before
writing into the previous sub-buffer.
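
On the reading side, a consumer that knows about this header can use
it to skip the padding.  The sketch below assumes exactly the layout
produced by the example callbacks above (an unsigned int padding count
at the start of each sub-buffer, with the padding itself at the end)
and that reader and writer agree on the size of unsigned int.  The
function name subbuf_payload() is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Locate the logged data in one finished sub-buffer.
 *
 * Assumed layout: [unsigned int padding][data ...][padding bytes]
 *
 * Returns the number of data bytes and stores their start in *payload.
 * Returns 0 with *payload set to NULL if the header is inconsistent.
 */
static size_t subbuf_payload(const unsigned char *subbuf,
                             size_t subbuf_size,
                             const unsigned char **payload)
{
        unsigned int padding;

        assert(subbuf_size >= sizeof(padding));

        /* memcpy() avoids assuming the sub-buffer is suitably aligned. */
        memcpy(&padding, subbuf, sizeof(padding));

        if (padding > subbuf_size - sizeof(padding)) {
                *payload = NULL;
                return 0;
        }

        *payload = subbuf + sizeof(padding);
        return subbuf_size - sizeof(padding) - padding;
}
```

A consumer would call this once per ready sub-buffer, copy the
returned range wherever it wants the data, and then tell the kernel
side that the sub-buffer has been consumed.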

Writing to a channel
--------------------

Kernel clients write data into the current cpu's channel buffer using
relay_write() or __relay_write().  relay_write() is the main logging
function - it uses local_irq_save() to protect the buffer and should
be used if you might be logging from interrupt context.  If you know
you'll never be logging from interrupt context, you can use
__relay_write(), which only disables preemption.  These functions
don't return a value, so you can't determine whether or not they
failed - the assumption is that you wouldn't want to check a return
value in the fast logging path anyway, and that they'll always succeed
unless the buffer is full and no-overwrite mode is being used, in
which case you can detect a failed write in the subbuf_start()
callback by calling the relay_buf_full() helper function.

relay_reserve() is used to reserve a slot in a channel buffer which
can be written to later.  This would typically be used in applications
that need to write directly into a channel buffer without having to
stage data in a temporary buffer beforehand.  Because the actual write
may not happen immediately after the slot is reserved, applications
using relay_reserve() can keep a count of the number of bytes actually
written, either in space reserved in the sub-buffers themselves or as
a separate array.  See the 'reserve' example in the relay-apps tarball
at http://relayfs.sourceforge.net for an example of how this can be
done.  Because the write is under control of the client and is
separated from the reserve, relay_reserve() doesn't protect the buffer
at all - it's up to the client to provide the appropriate
synchronization when using relay_reserve().
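
The 'separate array' approach to counting written bytes might look
something like the following user space model.  This is not the
relay-apps 'reserve' example and not relayfs code: struct
reserve_model and the model_* functions are invented here, sub-buffer
switching and locking are left out, and treating a sub-buffer as
complete once its committed count equals its reserved count is an
assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_SUBBUFS   2
#define SUBBUF_SIZE 32

struct reserve_model {
        unsigned char data[N_SUBBUFS * SUBBUF_SIZE];
        size_t cur;                     /* current sub-buffer index */
        size_t reserved[N_SUBBUFS];     /* bytes handed out as slots */
        size_t committed[N_SUBBUFS];    /* bytes actually written */
};

/* Hand out a slot of len bytes, or NULL if it does not fit. */
static void *model_reserve(struct reserve_model *m, size_t len)
{
        size_t off = m->reserved[m->cur];

        if (len > SUBBUF_SIZE - off)
                return NULL;
        m->reserved[m->cur] = off + len;
        return m->data + m->cur * SUBBUF_SIZE + off;
}

/* Record that len bytes of a previously reserved slot were written. */
static void model_commit(struct reserve_model *m, size_t subbuf, size_t len)
{
        assert(subbuf < N_SUBBUFS);
        assert(m->committed[subbuf] + len <= m->reserved[subbuf]);
        m->committed[subbuf] += len;
}

/* A sub-buffer is safe to read once every reserved byte is written. */
static int model_subbuf_complete(const struct reserve_model *m,
                                 size_t subbuf)
{
        return m->committed[subbuf] == m->reserved[subbuf];
}
```

Slots may be filled in any order; a reader of this model would simply
skip a sub-buffer for which model_subbuf_complete() still returns 0.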

Closing a channel
-----------------

The client calls relay_close() when it's finished using the channel.
The channel and its associated buffers are destroyed when there are no
longer any references to any of the channel buffers.  relay_flush()
forces a sub-buffer switch on all the channel buffers, and can be used
to finalize and process the last sub-buffers before the channel is
closed.

Creating non-relay files
------------------------

relay_open() automatically creates files in the relayfs filesystem to
represent the per-cpu kernel buffers; it's often useful for
applications to be able to create their own files alongside the relay
files in the relayfs filesystem as well e.g. 'control' files much like
those created in /proc or debugfs for similar purposes, used to
communicate control information between the kernel and user sides of a
relayfs application.  For this purpose the relayfs_create_file() and
relayfs_remove_file() API functions exist.  For relayfs_create_file(),
the caller passes in a set of user-defined file operations to be used
for the file and an optional void * to a user-specified data item,
which will be accessible via inode->u.generic_ip (see the relay-apps
tarball for examples).  The file_operations are a required parameter
to relayfs_create_file() and thus the semantics of these files are
completely defined by the caller.

See the relay-apps tarball at http://relayfs.sourceforge.net for
examples of how these non-relay files are meant to be used.

Creating relay files in other filesystems
-----------------------------------------

By default of course, relay_open() creates relay files in the relayfs
filesystem.  Because relay_file_operations is exported, however, it's
also possible to create and use relay files in other pseudo-filesystems
such as debugfs.

For this purpose, two callback functions are provided,
create_buf_file() and remove_buf_file().  create_buf_file() is called
once for each per-cpu buffer from relay_open() to allow the client to
create a file to be used to represent the corresponding buffer; if
this callback is not defined, the default implementation will create
and return a file in the relayfs filesystem to represent the buffer.
The callback should return the dentry of the file created to represent
the relay buffer.  Note that the parent directory passed to
relay_open() (and passed along to the callback), if specified, must
exist in the same filesystem the new relay file is created in.  If
create_buf_file() is defined, remove_buf_file() must also be defined;
it's responsible for deleting the file(s) created in create_buf_file()
and is called during relay_close().

The create_buf_file() implementation can also be defined in such a way
as to allow the creation of a single 'global' buffer instead of the
default per-cpu set.  This can be useful for applications interested
mainly in seeing the relative ordering of system-wide events without
the need to bother with saving explicit timestamps for the purpose of
merging/sorting per-cpu files in a postprocessing step.

To have relay_open() create a global buffer, the create_buf_file()
implementation should set the value of the is_global outparam to a
non-zero value in addition to creating the file that will be used to
represent the single buffer.  In the case of a global buffer,
create_buf_file() and remove_buf_file() will be called only once.  The
normal channel-writing functions e.g. relay_write() can still be used
- writes from any cpu will transparently end up in the global buffer -
but since it is a global buffer, callers should make sure they use the
proper locking for such a buffer, either by wrapping writes in a
spinlock, or by copying a write function from relayfs_fs.h and
creating a local version that internally does the proper locking.

See the 'exported-relayfile' examples in the relay-apps tarball for
examples of creating and using relay files in debugfs.

Misc
----

Some applications may want to keep a channel around and re-use it
rather than open and close a new channel for each use.  relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings.  It should however only be called when it's safe to
do so i.e. when the channel isn't currently being written to.

Finally, there are a couple of utility callbacks that can be used for
different purposes.  buf_mapped() is called whenever a channel buffer
is mmapped from user space and buf_unmapped() is called when it's
unmapped.  The client can use this notification to trigger actions
within the kernel application, such as enabling/disabling logging to
the channel.


Resources
=========

For news, example code, mailing list, etc. see the relayfs homepage:

    http://relayfs.sourceforge.net


Credits
=======

The ideas and specs for relayfs came about as a result of discussions
on tracing involving the following:

Michel Dagenais         <michel.dagenais@polymtl.ca>
Richard Moore           <richardj_moore@uk.ibm.com>
Bob Wisniewski          <bob@watson.ibm.com>
Karim Yaghmour          <karim@opersys.com>
Tom Zanussi             <zanussi@us.ibm.com>

Also thanks to Hubertus Franke for a lot of useful suggestions and bug
reports.