Saturday, November 18, 2006

The cold hard truth about TCP/IP performance over the WAN

TCP is a transport protocol commonly used to move data electronically between servers and other devices. Many storage vendors are beginning to use TCP for replicating data between storage devices. However, they are finding that TCP as a transport has some basic limitations that cause many applications to perform poorly, especially over distance. TCP/IP performs adequately over short-distance LAN environments; however, it was not designed for transmission over Wide Area Networks (WANs). This article explores the challenges of TCP performance over the WAN and ways to mitigate those challenges with new data center appliances.

TCP Challenges

Window Size Limitations

Window size is the amount of data that the transport software allows to be outstanding (in flight) at any given point in time. The window a given bandwidth pipe can absorb, its bandwidth-delay product, is the link rate multiplied by the round-trip delay, or latency. A cross-country OC-3 link (approximately 60 ms for a total 6,000-mile round trip) creates an available data window of 155 Mbps X 60 ms = 1,163 Kbytes. A DS3 satellite connection (540 ms round trip) creates an available data window of 45 Mbps X 540 ms = 3,038 Kbytes.

When this is contrasted with standard and even enhanced versions of TCP, there is a very large gap between the available window and the window actually used. Most standard TCP implementations are limited to 64-Kbyte (65,535-byte) windows unless window scaling is enabled; a few enhanced TCP stacks can use windows of 512 Kbytes or larger. In either case, the result is a large amount of "dead air" and very inefficient bandwidth utilization, which means poor performance for applications that are typically mission-critical.
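The arithmetic behind these figures is easy to reproduce. The short Python sketch below is illustrative only; the link rates, latencies and window sizes are the examples used in this article.

# Back-of-the-envelope check of the window sizes quoted above.
# Bandwidth-delay product (BDP) = link rate x round-trip time.

def bdp_kbytes(rate_mbps, rtt_ms):
    """Bytes that must be in flight to keep the pipe full, in Kbytes."""
    bits_in_flight = rate_mbps * 1_000_000 * (rtt_ms / 1000.0)
    return bits_in_flight / 8 / 1000

for name, rate, rtt in [("OC-3 cross-country", 155, 60),
                        ("DS3 satellite", 45, 540)]:
    bdp = bdp_kbytes(rate, rtt)
    for window_kb in (64, 512):
        util = min(1.0, window_kb / bdp) * 100
        print(f"{name}: BDP = {bdp:,.0f} Kbytes, "
              f"a {window_kb}-Kbyte window fills about {util:.0f}% of the pipe")

Even the enhanced 512-Kbyte window fills less than half of the cross-country OC-3 pipe, and roughly a sixth of the satellite DS3.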

Slow Start by Design

TCP data transfers start slowly and ramp up to their maximum transfer rate, resulting in poor performance for short sessions. Slow start is designed to avoid congestion, on the assumption that large numbers of sessions will be competing for the bandwidth.

[GRAPHIC OMITTED]
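To give a feel for the cost, here is a minimal sketch assuming classic exponential slow start from a single segment, reusing the 60 ms cross-country round trip and the OC-3 bandwidth-delay product from the example above.

# Rough illustration of why slow start hurts short transfers: the
# congestion window (cwnd) starts at a few segments and roughly doubles
# each round trip until it reaches the pipe's capacity. Numbers are
# illustrative, not a model of any specific TCP stack.

MSS = 1460            # bytes per segment (typical Ethernet)
RTT_MS = 60           # cross-country round trip from the example above
BDP_BYTES = 1_163_000 # OC-3 bandwidth-delay product from the example

cwnd = 1 * MSS
rtts = 0
sent = 0
while cwnd < BDP_BYTES:
    sent += cwnd
    cwnd *= 2          # exponential growth phase of slow start
    rtts += 1

print(f"~{rtts} round trips ({rtts * RTT_MS} ms) before the window "
      f"first reaches the pipe size; only {sent / 1e6:.1f} MB sent so far")

Ten round trips at 60 ms each is more than half a second before the transfer even reaches full speed, which is a large fraction of many short sessions.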

Inefficient Error Recovery

During error recovery, standard TCP retransmits the stream from the lost segment onward, even though much of that data may already have arrived intact. High bit-error rates or packet loss therefore waste large amounts of bandwidth resending data that was already received successfully, and every resend pays the full latency of the path. Each retransmission is also subject to the slow-start penalty described above.
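As a rough illustration, the sketch below compares how much data is resent when everything after a lost segment is retransmitted versus when only the missing segment is; the window size and loss position are hypothetical.

# Illustration of the retransmission penalty described above: without
# selective acknowledgement, a single lost segment forces everything
# from that point in the stream to be resent. Figures are hypothetical.

WINDOW_SEGMENTS = 100          # segments in flight when the loss occurs
MSS = 1460                     # bytes per segment
LOST_INDEX = 10                # the 11th segment in the window is lost

# Go-back-N style recovery: resend the lost segment and everything after it.
go_back_n_bytes = (WINDOW_SEGMENTS - LOST_INDEX) * MSS

# Selective recovery: resend only the lost segment.
selective_bytes = 1 * MSS

print(f"go-back-N resends {go_back_n_bytes:,} bytes, "
      f"selective recovery resends {selective_bytes:,} bytes")
print(f"bytes resent that were already delivered: "
      f"{go_back_n_bytes - selective_bytes:,}")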

Packet Loss is Disruptive

Packet loss describes an error condition in which data packets appear to be transmitted correctly at one end of a connection, but never arrive at the other end. This is mainly due to:

* Poor network conditions damaging packets in transit.

* A router or switch deliberately dropping packets because of WAN congestion.

Packet loss is especially disruptive to applications that must move data within fixed windows of time. With ever more data to move on a regular basis, and backup windows that are not growing to match, packet loss can keep many organizations from meeting service-level agreements and production schedules.
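The effect is easy to quantify with the widely cited Mathis approximation for loss-limited TCP throughput. The loss rates in the sketch below are illustrative, and the round-trip time and segment size reuse the cross-country example from earlier.

# Rough estimate of how packet loss caps TCP throughput, using the
# well-known Mathis et al. approximation:
#   throughput <= (MSS / RTT) * (C / sqrt(loss_rate)),  with C ~ 1.22
# Link parameters reuse the cross-country example; loss rates are illustrative.

import math

MSS_BITS = 1460 * 8   # typical segment size, in bits
RTT_S = 0.060         # 60 ms cross-country round trip
C = 1.22

for loss in (0.0001, 0.001, 0.01):   # 0.01%, 0.1%, 1% packet loss
    cap_bps = (MSS_BITS / RTT_S) * (C / math.sqrt(loss))
    print(f"{loss:.2%} loss -> single-flow throughput capped near "
          f"{cap_bps / 1e6:.0f} Mbps")

Even a 0.01% loss rate caps a single standard TCP flow well below the capacity of the OC-3 link in the earlier example.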

The figure shows a standard TCP stream of data running over an OC-12 (622 Mbps).

Session Free-For-All is Not Free

Each TCP session is throttled independently and contends for network resources on its own, so many concurrent sessions can oversubscribe the link even while no individual session uses its share efficiently.

The net result of these issues is very poor bandwidth utilization. Typical bandwidth utilization for large data transfers over long-haul networks is less than 30%, and more often less than 10%. As fast as bandwidth costs are dropping, they are still not free.

How to Mitigate TCP/IP Performance Issues

Consider Using an IP Application Accelerator (Appliance)

Many new data center appliances are being used to optimize data delivery for IP applications. Some appliances mitigate performance issues simply by caching and/or compressing the data prior to transfer. Others can mitigate several of the TCP issues described above through a more capable architecture.

Whatever technology is used, it is important that the appliances can mitigate latency, compress the data, and shield the application from network disruptions. It is equally important that these new data center appliances are transparent to operations and provide the same transparency to the IP application.

Transport Protocol Conversion

Some data center appliances provide alternative transport delivery mechanisms between appliances. In doing so, they receive the optimized buffers from the local application and deliver them to the destination appliance for subsequent delivery to the remote application process. The alternative transport is responsible for maintaining acknowledgements of data buffers and resending buffers when required. It is also important to maintain a flow-control mechanism on each connection, so that the performance of each connection matches the available bandwidth and network capacity.
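To make the idea concrete, the following minimal Python sketch shows local TCP termination and relay between two appliance endpoints. The listener port, peer host name, and the use of plain TCP on the WAN-side leg are assumptions for illustration; a real appliance would substitute its own optimized, acknowledged, flow-controlled transport between the appliances.

# Minimal sketch of the local-termination idea described above: the
# appliance accepts the application's TCP connection on the LAN side,
# then relays the data to a peer appliance over a separate WAN-side
# connection that it controls (and that could be replaced by any
# reliable, flow-controlled transport). Host names and ports are
# hypothetical; this is not any vendor's implementation.

import socket
import threading

LISTEN_ADDR = ("0.0.0.0", 9000)        # where the local application connects
PEER_ADDR = ("peer-appliance", 9001)   # WAN-side peer appliance (hypothetical)

def pump(src, dst):
    """Copy bytes in one direction until either side closes."""
    try:
        while True:
            buf = src.recv(65536)
            if not buf:
                break
            dst.sendall(buf)           # blocking send provides back-pressure
    except OSError:
        pass
    finally:
        dst.close()

def handle(app_conn):
    """Terminate the application's TCP session locally, then relay to the peer."""
    wan = socket.create_connection(PEER_ADDR)
    threading.Thread(target=pump, args=(app_conn, wan), daemon=True).start()
    pump(wan, app_conn)

with socket.create_server(LISTEN_ADDR) as server:
    while True:
        conn, _ = server.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

Because the application's connection terminates at the local appliance, the application sees LAN-like latency and is shielded from what happens on the inter-appliance leg.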