Copy avoidance in networked file systems
----------------------------------------

  Jose' Carlos Brustoloni 
  Bell Labs, Lucent Technologies 
  jcb@research.bell-labs.com


The point-to-point bandwidth of gigabit networks can surpass 
the main memory copy bandwidth of many current hosts. Therefore, 
researchers have been devoting considerable attention to the problem
of copy avoidance in network I/O. In particular, a recent study shows 
that copying can be avoided without modifying the semantics of existing
networking APIs [1].

In contrast, far less attention has recently been devoted to copy
avoidance in file I/O. This neglect may stem from several 
subtle misperceptions:
 
1) Disks are far slower than main memory (or gigabit networks).

   This is indeed true, but copy avoidance can still be worthwhile 
   in file I/O because:
   (a) copy avoidance can reduce CPU utilization and significantly 
       improve the throughput of file servers, which are often CPU-bound, and
   (b) caching can avoid physical disk I/O altogether, in which case file
       system speed is bounded by memory operations such as copying.

2) Many copy avoidance techniques for network I/O are not useful in file I/O. 

   Indeed, copy avoidance techniques for network I/O
   often exploit the fact that buffers are ephemeral, i.e., they are
   deallocated as soon as processing of the corresponding input or
   output request completes. In contrast, buffers used in file I/O
   are often cached. For example, emulated copy [1] is a copy avoidance
   scheme for network I/O that uses input alignment and page swapping
   on input and TCOW, a form of copy-on-write, on output. If used on
   file input, page swapping would corrupt the file system cache
   with the previous contents of the client input buffers.
   If used on file output, TCOW would allow cached output pages to 
   be corrupted because, after output completes, the output reference 
   is lost and therefore the pages can be overwritten or reused.

   This does not mean, however, that copy avoidance is unattainable
   in file I/O. Most systems also offer mapped file I/O,
   which allows file data to be passed between applications and the 
   operating system by page mapping and unmapping. Mapped files are a
   practical, already widely available solution for copy avoidance in file I/O.
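
   As an illustration only, the following Python sketch uses the standard
   mmap module to update a file through a shared mapping at a page-aligned
   offset. The helper name write_via_mapping is hypothetical. The slice
   assignment still copies from the Python bytes object into the mapped
   pages, so the sketch shows the mapped-file interface rather than a
   measured copy-free path.

```python
import mmap
import os

# Mapping offsets must be a multiple of this value (the page size on Linux).
PAGE = mmap.ALLOCATIONGRANULARITY

def write_via_mapping(path, offset, payload):
    """Store payload at a page-aligned file offset through a shared mapping.

    Hypothetical helper for illustration; the file at `path` must exist
    and `payload` must be non-empty.
    """
    if offset % PAGE != 0:
        raise ValueError("offset must be page-aligned for mapped file I/O")
    with open(path, "r+b") as f:
        end = offset + len(payload)
        # A mapping cannot reach past end of file, so grow the file first.
        if os.fstat(f.fileno()).st_size < end:
            f.truncate(end)
        # Default access on Unix is a shared, read/write mapping.
        with mmap.mmap(f.fileno(), len(payload), offset=offset) as m:
            m[:] = payload   # update the file's pages in place
            m.flush()        # push the dirty pages back to the file
```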

3) Copying between mapped files and network I/O buffers can be unavoidable 
   because of page alignment constraints.

   For example, in a networked file server, data may be received from
   the network for output to the file system. The data will usually be
   preceded by an application-layer header specifying the file and
   offset from the beginning of the file (for simplicity, let us assume
   that the offset is a multiple of the page size). This header can make 
   copy avoidance difficult because (a) the application must read the 
   header to determine the file and (b) the header may make the following data 
   arbitrarily aligned, whereas data must be page-aligned for mapped file I/O.

   However, I show that: 

   (a) If the network adapter supports system-aligned buffering 
       (early demultiplexing or buffer snap-off) [2], then the application 
       can peek at the header and, after decoding it, input the data 
       directly to the correct mapped file region, using emulated copy. 
       Data is passed between network and file system with copy avoidance 
       and without any modifications in existing APIs.

   (b) Even without such adapter support, copy avoidance is possible with
       header patching, a novel software optimization.

       Let h' be the preferred alignment for input from the network (usually
       equal to the length of any unstripped protocol headers below the 
       application layer), h be the length of the application-layer header, 
       and l be the data length (less than or equal to the network's 
       maximum transmission unit minus the lengths of headers at network 
       or higher layers). h' must be fixed and known by both sender and 
       receiver. In contrast, h and l may vary from packet to packet.
       Using header patching, the sender transmits the application-layer
       header followed by the data starting at file offset o + h' + h
       and of length l - h' - h, followed by data starting at file offset 
       o and of length h' + h (to achieve this out-of-order transmission, 
       the sender may use, e.g., Unix's writev call with a gather list).
       The receiver peeks at the first h bytes of the input
       (using, e.g., Unix's recv with the MSG_PEEK flag),
       decodes the application-layer header, and determines the address a
       corresponding to file offset o (a multiple of the page size) 
       in the correct mapped file region. The receiver then inputs l - h'
       bytes to address a + h', followed by h' + h bytes to address a. 
       This causes most of the data to be passed by page swapping, after which
       the data corresponding to offset o and of length h' + h is
       patched on top of the application- and lower-layer 
       headers at address a.  After patching, the input buffer starts at 
       the correct offset in the mapped file region and runs uninterrupted 
       for length l with the data in correct order, as illustrated by the 
       following figure.

                      +----+---+----------+----+
        Packet:       | h' | h |    d1    | d0 |
                      +----+---+----------+----+

        Pooled NW     +----+---+----------+      +----+--------------+
        buffers:      | h' | h |    d1    |      | d0 |              |
                      +----+---+----------+      +----+--------------+
                      |    |                       
              reverse | ^  |      ^                |
              copyout | |  |      | swap           |
                      |    |      v                |
        Mapped        +----+---+----------+        |
        file:         |    | h |    d1    |        |
                      +----+---+----------+        |
                      \--------/                   |
                           ^                       |
                           |                       |
                    patch  +-----------------------+
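
The byte ordering used by header patching can be checked with a small
Python sketch over a Unix socket pair. It assumes h' = 40 and a
hypothetical 16-byte application-layer header (file id, offset o,
length l). The names send_patched and recv_patched are illustrative, and
a bytearray stands in for the mapped file region. Python's recv_into
copies data, so the sketch demonstrates only the out-of-order send and
the two-step input, not page swapping itself.

```python
import os
import socket
import struct

PAGE = 4096
H_PRIME = 40                  # assumed fixed lower-layer header length h'
HDR = struct.Struct("!IQI")   # hypothetical header: file id, offset o, length l

def send_patched(sock, file_id, offset, data):
    """Sender: header, then d1 (data past h'+h), then d0 (first h'+h bytes)."""
    header = HDR.pack(file_id, offset, len(data))
    split = H_PRIME + HDR.size                    # h' + h
    if len(data) < split:
        raise ValueError("data shorter than h' + h")
    mv = memoryview(data)
    # The gather list reorders the data on the wire without first
    # assembling it in a separate user-space buffer.
    sent = os.writev(sock.fileno(), [header, mv[split:], mv[:split]])
    if sent != HDR.size + len(data):
        raise IOError("partial write (not handled in this sketch)")

def recv_exact_into(sock, view):
    """Fill `view` completely from a stream socket."""
    got = 0
    while got < len(view):
        n = sock.recv_into(view[got:])
        if n == 0:
            raise ConnectionError("peer closed early")
        got += n

def recv_patched(sock, region):
    """Receiver: `region` models the file from offset 0, so address a == o."""
    # Peek at the h-byte application-layer header without consuming it.
    peeked = sock.recv(HDR.size, socket.MSG_PEEK)
    if len(peeked) < HDR.size:
        raise IOError("header not fully available (not handled in this sketch)")
    file_id, offset, length = HDR.unpack(peeked)
    a = offset
    view = memoryview(region)
    # Input l - h' bytes (header followed by d1) at address a + h'.
    recv_exact_into(sock, view[a + H_PRIME : a + length])
    # Patch: input h' + h bytes (d0) at address a, overwriting the header.
    recv_exact_into(sock, view[a : a + H_PRIME + HDR.size])
    return file_id, offset, length
```

With a 4096-byte payload sent for offset 4096, bytes 4096 through 8191 of
the region hold the payload in its original order once recv_patched
returns.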

My experiments on the Credit Net ATM network at 512 Mbps show that 
copy avoidance can substantially improve the performance of networked 
file systems. Because of cache effects, copy avoidance benefits are 
synergistic: the greatest benefits are obtained when copying is avoided
on the entire end-to-end data path, including both network and file I/O.
Additionally, the experiments confirm each of the above claims.

References
----------

[1] J. Brustoloni and P. Steenkiste. ``Effects of Buffering
    Semantics on I/O Performance'', in Proc. OSDI'96,
    USENIX, Oct. 1996, pp. 277-291. Also available from
    http://www.cs.cmu.edu/~jcb/.

[2] J. Brustoloni and P. Steenkiste. ``Copy Emulation in
    Checksummed, Multiple-Packet Communication'', in 
    Proc. INFOCOM'97, IEEE, April 1997. Also available from
    http://www.cs.cmu.edu/~jcb/.

---------------------------------------------------------------------------

Work performed while at the School of Computer Science, 
Carnegie Mellon University.
To be presented at Gigabit Networking Workshop - GBN'98.