Scaling of End-to-End Latency with Network Transmission Rate

José Carlos Brustoloni and Peter Steenkiste
School of Computer Science, Carnegie Mellon University
jcb@cs.cmu.edu, prs@cs.cmu.edu

1. Introduction

In [1], we introduced a novel taxonomy and demonstrated several new optimizations for I/O data passing between applications and operating systems. We showed experimentally that the emulated share data passing scheme, which preserves the API of copy semantics, has the best performance in the taxonomy (including move semantics). Additionally, we showed that the emulated copy data passing scheme, which preserves not only the API but also the integrity guarantees of copy semantics and can therefore transparently replace it, performs almost as well as emulated share.

We modeled in [1] how end-to-end latency varies with data passing semantics and scales with CPU, network, and memory speeds. We validated our model on computers with various CPU and memory speeds connected by the CreditNet ATM network at 155 Mbps.

This work validates the scaling model of [1] with respect to network speeds. We report end-to-end latencies over the CreditNet ATM network at 513 Mbps and compare them with those predicted by the scaling model from the measurements at 155 Mbps [1].

2. Experimental results

Figure 1 shows the latency for communication between applications running on Micron P166 personal computers connected by the CreditNet ATM network at 513 Mbps raw transmission rate. The computers each have a 166 MHz Pentium CPU and 32 MB of main memory. We used the NetBSD 1.1 operating system augmented with an implementation of the Genie I/O framework [1], through which applications accessed the network. We report averages over five runs after a "warm-up" run.

Figure 1: End-to-end latency for AAL5 packets at 513 Mbps raw transmission rate, with early demultiplexing.

Table 1 shows the least-squares linear fits of the curves of Figure 1 (the actual latencies), along with the latencies estimated from Table 6 of [1] using the scaling model. The throughput for single 60 KB AAL5 packets predicted by the scaling model is 132, 348, or 384 Mbps for copy, emulated copy, or emulated share semantics, respectively. The corresponding measured throughputs were 136, 341, and 380 Mbps. (Note that the latter two are comparable to the main memory copy bandwidth of the computers used, 351 Mbps.)

---------------+---+--------------------------
Semantics      |E/A|   Early demultiplexing
---------------+---+--------------------------
Copy           | E |   0.0581 B + 141
               | A |   0.0561 B + 129
---------------+---+--------------------------
Emulated copy  | E |   0.0205 B + 153
               | A |   0.0212 B + 139
---------------+---+--------------------------
Emulated share | E |   0.0186 B + 137
               | A |   0.0192 B + 119
---------------+---+--------------------------
Table 1: Estimated (E) and actual (A) end-to-end latencies, in usec. B is the data length in bytes.
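
The relation between the linear fits of Table 1 and the single-packet throughputs quoted above can be sketched as follows. This is only an illustrative calculation, not code from the study: the names `FITS`, `latency_usec`, and `throughput_mbps` are hypothetical, and 60 KB is assumed to mean 60 x 1024 = 61440 bytes. Since latency is in usec, bits divided by usec gives Mbps directly.

```python
# Linear fits from Table 1: latency in usec = slope * B + intercept,
# where B is the data length in bytes. "E" = estimated, "A" = actual.
FITS = {
    "copy":           {"E": (0.0581, 141), "A": (0.0561, 129)},
    "emulated copy":  {"E": (0.0205, 153), "A": (0.0212, 139)},
    "emulated share": {"E": (0.0186, 137), "A": (0.0192, 119)},
}

# Assumption: "60 KB" means 60 * 1024 bytes.
PACKET_BYTES = 60 * 1024


def latency_usec(slope, intercept, nbytes):
    """End-to-end latency in usec given a Table 1 linear fit."""
    return slope * nbytes + intercept


def throughput_mbps(slope, intercept, nbytes):
    """Single-packet throughput; bits per usec equals Mbps."""
    return 8 * nbytes / latency_usec(slope, intercept, nbytes)


if __name__ == "__main__":
    for semantics, fits in FITS.items():
        est = throughput_mbps(*fits["E"], PACKET_BYTES)
        print(f"{semantics:15s} estimated {est:6.1f} Mbps")
```

Under the stated assumption about the packet size, the estimated (E) fits round to the 132, 348, and 384 Mbps figures given in the text.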

3. Related work

Our measured throughput for *single* 60 KB AAL5 packets using emulated copy, 341 Mbps, compares favorably with the cached volatile Fbuf throughput for the same data length using *multiple* UDP packets, 265 Mbps [4]. Our performance also compares favorably with the best ttcp performance, 265 Mbps, reported for the Solaris zero-copy TCP scheme on ATM cards with hardware checksumming [3]. Note that these comparisons are actually biased against emulated copy, because: (1) we used a slower network (513 vs. 622 Mbps); and (2) we report performance for single packets, whereas the results of both [4] and [3] benefit from the pipelining of a sliding-window protocol.

Also note that emulated copy is transparently compatible with the copy semantics of APIs such as those of Unix and Windows NT, for application buffers of arbitrary alignment, location, and length, whereas Fbufs implement entirely different semantics, and the Solaris scheme only works for application buffers that are page-aligned and whose length is a multiple of the page size.

In [2], we discuss the hardware support required for emulated copy in multiple-packet communication.

4. Conclusion

The good fit between estimated and actual latencies in Table 1 suggests that the scaling model of [1] is also accurate with respect to the effect of network speeds.
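
One rough way to quantify this fit is to compare the estimated (E) and actual (A) lines of Table 1 at a few data lengths. The sketch below is an illustration only, not the authors' procedure: `TABLE_1`, `fitted_latency_usec`, and `relative_error` are hypothetical names, and the sampled data lengths are arbitrary choices.

```python
# Table 1 fits as (slope in usec/byte, intercept in usec).
# "E" = estimated by the scaling model, "A" = actual (measured) fit.
TABLE_1 = {
    "copy":           {"E": (0.0581, 141), "A": (0.0561, 129)},
    "emulated copy":  {"E": (0.0205, 153), "A": (0.0212, 139)},
    "emulated share": {"E": (0.0186, 137), "A": (0.0192, 119)},
}


def fitted_latency_usec(semantics, kind, nbytes):
    """Latency in usec from the Table 1 line for the given semantics."""
    slope, intercept = TABLE_1[semantics][kind]
    return slope * nbytes + intercept


def relative_error(semantics, nbytes):
    """Signed deviation of the estimated latency from the actual one."""
    est = fitted_latency_usec(semantics, "E", nbytes)
    act = fitted_latency_usec(semantics, "A", nbytes)
    return (est - act) / act


if __name__ == "__main__":
    # Illustrative data lengths in bytes (an arbitrary choice).
    for nbytes in (4096, 16384, 61440):
        for semantics in TABLE_1:
            err = relative_error(semantics, nbytes)
            print(f"{nbytes:6d} B  {semantics:15s} {err:+.1%}")
```

At 61440 bytes, the estimated lines deviate from the actual ones by less than 5% for all three semantics. The deviation is larger for short packets, where the difference between the intercepts dominates.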

References

[1] J. Brustoloni and P. Steenkiste. "Effects of Buffering Semantics on I/O Performance", in Proc. OSDI'96, USENIX, Oct. 1996, pp. 277-291. Also available from http://www.cs.cmu.edu/~jcb

[2] J. Brustoloni and P. Steenkiste. "Copy Emulation in Checksummed, Multiple-Packet Communication", in Proc. INFOCOM'97, IEEE, April 1997. Also available from http://www.cs.cmu.edu/~jcb

[3] H. J. Chu. "Zero-Copy TCP in Solaris", in Proc. Winter Tech. Conf., USENIX, Jan. 1996.

[4] P. Druschel and L. Peterson. "Fbufs: A High-Bandwidth Cross-Domain Transfer Facility", in Proc. 14th SOSP, ACM, Dec. 1993, pp. 189-202.


Last updated 6 March 1997
James P.G. Sterbenz <jpgs@ieee.org>