A page collecting the gist of RFCs and general networking ideas.

Active Queue Management (AQM) by Matthew Macy

Active Queue Management is an effort to avoid the latency increases (and the resulting longer feedback loop) and bursty losses caused by naive tail drop in intermediate buffers. The concept was introduced, along with a discussion of the queue management algorithm RED (Random Early Detection), in RFC 2309. The most current RFC is 7567.

The usual mix of long high-throughput and short low-latency flows places conflicting demands on the queue occupancy of a switch: the former fill buffers to sustain throughput, while the latter need buffers to stay nearly empty for low delay.

RED:

Recommendations on Queue Management and Congestion Avoidance in the Internet
https://tools.ietf.org/html/rfc2309

IETF Recommendations Regarding Active Queue Management
https://tools.ietf.org/html/rfc7567

https://en.wikipedia.org/wiki/Active_queue_management<<BR>>
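The RED mechanism described in the RFCs above can be sketched as follows. This is a minimal illustration, not an implementation; the parameter values (min_th, max_th, max_p, the EWMA weight w) are illustrative defaults, not recommendations:

```python
def update_avg(avg, qlen, w=0.002):
    """EWMA of the instantaneous queue length, computed on each arrival."""
    return (1 - w) * avg + w * qlen

def red_mark_probability(avg, min_th=5.0, max_th=15.0, max_p=0.1):
    """Probability of dropping/marking an arriving packet.

    Below min_th nothing is dropped; between the thresholds the
    probability rises linearly to max_p; at or above max_th every
    arriving packet is dropped/marked.
    """
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)
```

The difficulty alluded to below is exactly in choosing min_th, max_th, and the EWMA weight for a given bandwidth-delay product.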

Explicit Congestion Notification (ECN) by Matthew Macy

At its core, ECN in TCP allows compliant routers to give compliant senders notification of "virtual drops": a congestion indicator that causes the sender to halve its congestion window. The sender need not wait for a retransmit timeout or repeated ACKs to learn of a congestion event, and the receiver avoids the latency induced by drop/retransmit. ECN relies on some form of AQM in the intermediate routers/switches to decide when to mark the CE (Congestion Encountered) bit in the IP header; it is then the receiver's responsibility to set the ECE (ECN-Echo) bit in the TCP header of the subsequent ACK. The receiver will continue to send ACKs marked with the ECE bit until it receives a packet with the CWR (Congestion Window Reduced) bit set. Note that although this last design decision makes ECN robust in the presence of ACK loss (the original version of ECN specifies that ACKs / SYNs / SYN-ACKs not be marked as ECN-capable and thus are not eligible for marking), it limits ECN to one congestion signal per RTT. As we'll see later, this leads to interoperability issues with DCTCP.
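The receiver side of that ECE/CWR echo loop can be sketched as follows (the class and method names are illustrative, not taken from any real stack):

```python
# Classic ECN echo loop (RFC 3168), receiver side: once a CE-marked
# segment arrives, every subsequent ACK carries ECE until a segment with
# CWR is seen. This makes the signal robust to ACK loss, but conveys at
# most one congestion indication per RTT.
class EcnReceiver:
    def __init__(self):
        self.ece = False  # should we echo congestion back to the sender?

    def on_segment(self, ce=False, cwr=False):
        if cwr:
            self.ece = False  # sender reduced its window; stop echoing
        if ce:
            self.ece = True   # a fresh CE mark (re)starts the echo
        return self.ece       # ECE bit for the ACK of this segment
```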

ECN is negotiated at connection time. In FreeBSD it is configured by the sysctl 'net.inet.tcp.ecn.enable':

0: Disable ECN.

1: Allow incoming connections to request ECN. Outgoing connections will request ECN.

2 (default): Allow incoming connections to request ECN. Outgoing connections will not request ECN.
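For example, switching a FreeBSD host from the accept-only default to actively requesting ECN (a hypothetical admin session):

```shell
# Inspect the current ECN policy
sysctl net.inet.tcp.ecn.enable
# Request ECN on outgoing connections as well as accepting it on incoming ones
sysctl net.inet.tcp.ecn.enable=1
```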

The last time a survey was done, 2.7% of hosts on the Internet would not respond to a SYN negotiating ECN. This isn't fatal, as subsequent SYNs will switch to not requesting ECN; it just adds the default RTO to connection establishment (3s in FreeBSD, 1s per RFC 6298, discussed later).

Linux has some common-sense configurability improvements. Its ECN knob ('net.ipv4.tcp_ecn') takes on _3_ values: 0) no request / no accept, 1) request / accept, 2) no request / accept. The default is (2), supporting ECN for those adventurous enough to request it. The route command can also specify ECN per subnet, in effect allowing servers / clients to use it only within a data center or between compliant data centers.
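A sketch of that per-subnet policy on Linux, assuming iproute2's per-route 'features ecn' attribute (addresses and device names here are illustrative):

```shell
# Globally: accept ECN when requested, but don't request it ourselves
sysctl net.ipv4.tcp_ecn=2
# Request ECN only for routes inside the data center
ip route change 10.0.0.0/8 via 10.0.0.1 dev eth0 features ecn
```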

ECN sees very little usage due to continued compatibility concerns. The difficulty of correctly tuning max_th and min_th in RED and many other AQM mechanisms is not specific to ECN, but since RED et al. are necessary to use ECN, they further add to the difficulties associated with its use.

Talks: More Accurate ECN Feedback in TCP (AccECN) - https://www.ietf.org/proceedings/90/slides/slides-90-tcpm-10.pdf

ECN is slow and does not report the extent of congestion, just its existence, and it lacks interoperability with DCTCP. A mechanism is needed for negotiating finer-grained, adaptive congestion notification.

RFCS:

A Proposal to add Explicit Congestion Notification (ECN) to IP
https://tools.ietf.org/html/rfc2481 Initial proposal.

The Addition of Explicit Congestion Notification (ECN) to IP
https://tools.ietf.org/html/rfc3168 Elaboration and further specification of how to tie it in to TCP.

Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets
https://tools.ietf.org/html/rfc5562 Sometimes referred to as ECN+. This extends ECN to SYN/ACK packets. Note that SYN packets are still not covered, being considered a potential security hole.

Problem Statement and Requirements for Increased Accuracy in Explicit Congestion Notification (ECN) Feedback (AccECN)
https://tools.ietf.org/html/rfc7560

Data Center Transmission Control Protocol (DCTCP) by Matthew Macy

This Microsoft- and Stanford-developed CC protocol uses simplified RED/ECN CE marking in the switch to provide fine-grained congestion notification to senders. RED is enabled in the switch, but with minth = maxth = K, where K is an empirically determined constant that is a function of bandwidth and of the desired trade-off between switch utilization and rate of convergence. Common values for K are 5 for 1 Gbps and 60 for 10 Gbps; the value for 40 Gbps is presumably on the order of 240. The sender's congestion window is scaled back once per RTT as a function of (#ECE / #segments in window) / 2. In the degenerate case of all segments being marked, the window is scaled back as on a loss in Reno. In the steady state, latencies are much lower than in Reno due to considerably reduced switch occupancy.
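On the switch side this degenerates into a single-threshold marker; a minimal sketch (the K values are the ones quoted above):

```python
# DCTCP switch behavior: RED with min_th = max_th = K degenerates into a
# simple threshold test; CE is marked on every packet that arrives while
# instantaneous queue occupancy exceeds K (e.g. K = 5 at 1 Gbps, 60 at
# 10 Gbps, per the text above).
def dctcp_ce_mark(queue_len, K):
    """Return True if the arriving packet should be CE-marked."""
    return queue_len > K
```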

There is currently no mechanism for negotiating CC protocols, and DCTCP's reliance on a continuous stream of ECE notifications is incompatible with standard ECN's repetition of the same ECE until a CWR is received. In effect, ECN support has to be successfully negotiated when establishing the connection, but the receiver has to instead provide one ECE per new CE seen.

RFC: Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
https://tools.ietf.org/pdf/draft-ietf-tcpm-dctcp-00.pdf

The window scaling constant is referred to as 'alpha'. Alpha=0 corresponds to no congestion, alpha=1 corresponds to a loss event in Reno or an ECE mark in standard ECN - resulting in a halving of the congestion window. 'g' is the feedback gain, 'M' is the fraction of bytes marked to bytes sent. Alpha and the congestion window 'cwnd' are calculated as follows:

alpha = alpha * (1 - g) + g * M

cwnd = cwnd * (1 - alpha/2)
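A sketch of this per-RTT update (g = 1/16 is a commonly cited value, used here only as an assumed default):

```python
def update_alpha(alpha, marked_bytes, sent_bytes, g=1.0 / 16):
    """alpha = alpha * (1 - g) + g * M, with M the fraction of marked bytes
    in the window just acknowledged."""
    m = marked_bytes / sent_bytes if sent_bytes else 0.0
    return alpha * (1 - g) + g * m

def update_cwnd(cwnd, alpha):
    """cwnd = cwnd * (1 - alpha / 2), applied once per RTT."""
    return cwnd * (1 - alpha / 2)

# With every byte marked, alpha converges to 1 and the window is halved,
# matching Reno's response to a loss; with no marks, alpha decays toward 0.
alpha = 0.0
for _ in range(200):
    alpha = update_alpha(alpha, marked_bytes=1000, sent_bytes=1000)
```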

To cope with delayed ACKs, DCTCP specifies the following state machine. CE refers to DCTCP.CE, a new Boolean TCP state variable, "DCTCP Congestion Encountered", which is initialized to false and stored in the Transmission Control Block (TCB).
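The state machine can be paraphrased as follows (a sketch; the class and attribute names are illustrative, not from the draft or any real stack): whenever the CE bit of an arriving segment differs from DCTCP.CE, any outstanding segments are acknowledged immediately with the old ECE value, then DCTCP.CE is flipped; otherwise normal delayed-ACK processing applies.

```python
class DctcpReceiver:
    def __init__(self, m=2):
        self.ce = False     # DCTCP.CE, kept in the TCB
        self.pending = 0    # segments received but not yet ACKed
        self.m = m          # delayed-ACK factor: ACK every m-th segment

    def on_segment(self, ce_marked):
        acks = []           # list of (kind, segments_covered, ece_bit)
        if ce_marked != self.ce:
            # CE state changed: immediately ACK everything outstanding with
            # the *old* ECE value so the sender can reconstruct the CE stream.
            if self.pending:
                acks.append(("ACK", self.pending, self.ce))
                self.pending = 0
            self.ce = ce_marked
        self.pending += 1
        if self.pending >= self.m:
            acks.append(("ACK", self.pending, self.ce))
            self.pending = 0
        return acks
```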

The clear implication of this is that if the ACK is delayed by more than m segments (as with differing delayed-ACK assumptions between peers, or dropped ACKs), the signal can underestimate the level of encountered congestion. None of the literature suggests that this has been a problem in practice.

[Section 3.4 of RFC] Handling of SYN, SYN-ACK, RST Packets

[Section 4] Implementation Issues - the implementation must choose a suitable estimation gain (feedback gain)

- the implementation must decide when to use DCTCP. DCTCP may not be

- It is RECOMMENDED that the implementation deal with loss episodes in

- To prevent incast throughput collapse, the minimum RTO (MinRTO) should be

- It is also RECOMMENDED that an implementation allow configuration of

- [RFC3168] forbids the ECN-marking of pure ACK packets, because of the

[Section 5] Deployment Issues - DCTCP and conventional TCP congestion control do not coexist well in

- Since DCTCP relies on congestion marking by the switches, DCTCP can

- DCTCP requires changes on both the sender and the receiver, so both

[Section 6] Known Issues

- DCTCP relies on the sender's ability to reconstruct the stream of CE

- The effect of packet drops on DCTCP under real world conditions has not been analyzed.

- Much like standard TCP, DCTCP is biased against flows with longer

Papers: Data Center TCP [DCTCP10]
http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf

The original DCTCP SIGCOMM paper by Stanford and Microsoft Research. It is very accessible even for those of us not well versed in CC protocols.

Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter [MORGANSTANLEY]
https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-judd.pdf

Real world experience deploying DCTCP on Linux at Morgan Stanley.

Per-packet latency in ms (TCP vs. DCTCP):

          TCP     DCTCP
Mean      4.01    0.0422
Median    4.06    0.0395
Maximum   4.20    0.0850
Minimum   3.32    0.0280
sigma     0.167   0.0106

Extensions to FreeBSD Datacenter TCP for Incremental Deployment Support [BSDCAN]
https://www.bsdcan.org/2015/schedule/attachments/315_dctcp-bsdcan2015-paper.pdf<<BR>>

Proposes a variant of DCTCP that can be deployed only on one endpoint of a connection, provided the peer is ECN-capable.
ODTCP changes:

DCTCP improvements:

Data Center TCP (DCTCP)
http://www.ietf.org/proceedings/80/slides/iccrg-3.pdf<<BR>> Case studies, workloads, latency and flow completion time of TCP vs DCTCP. Interesting set of slides worth skimming.

Analysis of DCTCP: Stability, Convergence, and Fairness [ADCTCP]
http://sedcl.stanford.edu/files/dctcp-analysis.pdf<<BR>> Follow-up mathematical analysis of DCTCP using a fluid model. Contains interesting graphs showing how the gain factor affects the convergence rate between two flows.

Using Data Center TCP (DCTCP) in the Internet [ADCTCP]
- http://www.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_GLOBECOM_14_40260.pdf<<BR>> Investigates what would be needed to deploy DCTCP incrementally outside the data center.

Incast Transmission Control Protocol (ICTCP)
In ICTCP the receiver plays a direct role in estimating the per-flow available bandwidth and actively re-sizes each connection's receive window accordingly.

- http://research.microsoft.com/pubs/141115/ictcp.pdf

Quantized Congestion Notification (QCN) by Matthew Macy

Congestion control in Ethernet. Introduced as part of the IEEE 802.1 standards-body discussions for Data Center Bridging (DCB), motivated by the needs of FCoE. The initial congestion control protocol was standardized as 802.1Qau. Unlike the single bit of congestion information per packet in TCP's ECN, QCN feedback carries 6 bits.

The algorithm is composed of two main parts: Switch or Control Point (CP) Dynamics and Rate Limiter or Reaction Point (RP) Dynamics.
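A sketch of the Control Point computation, based on the description in [Varma, ch. 8] (the variable names and the weight w are assumptions): the CP samples arriving frames, computes a feedback value from both queue offset and queue growth, and, when that value is negative, quantizes its magnitude into 6 bits and reflects it to the Reaction Point, which cuts its sending rate.

```python
def qcn_feedback(qlen, qlen_old, q_eq, w=2.0):
    """Fb = -(Q_off + w * Q_delta); only congestion (Fb < 0) is signaled."""
    q_off = qlen - q_eq          # how far occupancy is above the setpoint
    q_delta = qlen - qlen_old    # how fast the queue is growing
    fb = -(q_off + w * q_delta)
    if fb >= 0:
        return None              # no congestion: no feedback frame is sent
    return max(int(fb), -63)     # clamp to the 6-bit feedback field
```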

[Taken from "Internet Congestion Control" by Subir Varma, ch. 8]

TransportProtocols/tcp_rfc_notes (last edited 2020-04-03T00:48:52+0000 by KevinZheng)