Providing Reliability in Data Center Ethernet

QIN Dan
Department of Electrical and Electronic Engineering
The University of Hong Kong

May 4, 2019

ABSTRACT

With the vigorous development of various network technologies, data center networks have come into wide use. Ethernet has become the main infrastructure of the data center network because of its wide deployment, simplicity, low cost, and maturity. However, Ethernet cannot provide reliable transmission, and some applications in the data center network cannot rely on the Transmission Control Protocol (TCP) for reliable transmission. Therefore, it is necessary to provide a link-layer Congestion Notification (CN) mechanism for Data Center Ethernet. To support congestion notification, researchers have developed several approaches, such as the Backward Congestion Notification mechanism (BCN), the Forward Explicit Congestion Notification mechanism (FECN), the Enhanced Forward Explicit Congestion Notification mechanism (E-FECN), and the Quantized Congestion Notification mechanism (QCN). Each method has its own pros and cons. In this paper, we survey these four mechanisms for Data Center Ethernet and also propose a possible optimization of the Quantized Congestion Notification mechanism.

1. Introduction

Since its invention in the mid-1970s, Ethernet has continued to develop and grow and has maintained a leading position in the field of communication technology [1]. With the rise of the Internet, the degree of automation in workplaces has become increasingly high, and Ethernet has become increasingly common. Ethernet is now the most prevalent wired LAN technology; almost all organizations use it to connect the computers in their workplaces [2]. Ethernet is so successful because it was not only the first widely deployed high-speed local area network but is also simple and cost efficient compared with other technologies such as Token Ring, FDDI, and ATM [3]. Owing to these inherent advantages, Ethernet has become the main infrastructure of the data center network.

Traditional Ethernet was originally designed to provide best-effort communication in the local area network, so it cannot guarantee the delivery of data packets to the destination, which makes Ethernet an unreliable networking technology [4-7]. A reliable network needs to provide reliability for every data packet injected into it, and for Ethernet this property must be provided by upper-layer protocols [7-9]. In practice, Ethernet relies on the Transmission Control Protocol (TCP) to provide reliability [7, 10]. When network congestion occurs, switches discard data packets, and TCP infers that congestion has occurred by detecting these packet drops. The main reason Ethernet cannot provide reliability is that the path between an Ethernet source and destination consists of multiple hops, and the intermediate hops do not feed back their buffer occupancy in time [6-8]. When the buffer of an intermediate hop is full, the hop discards packets that arrive later, and the sender has no way of knowing this. Therefore, a mechanism that can provide reliability for Ethernet is needed. Some applications, such as the File Transfer Protocol (FTP), use TCP at the transport layer to provide reliable transmission. For them, discarding packets at the data link layer is acceptable because TCP retransmits the dropped packets.
However, some applications cannot rely on upper-layer protocols for reliability, such as real-time applications and storage traffic in Data Center Ethernet [2]. These applications require high throughput and low latency. To keep latency low and avoid large queuing delays, queue lengths should be kept small. In addition, packet loss is unacceptable because every packet is critical and retransmissions would introduce unacceptable delay. Therefore, a hop-by-hop flow control mechanism is needed to provide reliability for such applications [5, 7, 11]. Such a mechanism can provide hop-by-hop reliability regardless of the transport protocol. To this end, researchers have proposed a technology called Congestion Notification (CN). Congestion notification runs at the link layer. It detects the congestion level by monitoring the length of the output queues of switches in the Data Center Ethernet and pushes congestion messages to rate regulators running at the edge of the network. The rate regulators adjust their flow rates according to the congestion feedback received from the switches so as to avoid frame loss [12].

In this survey, we summarize and discuss the congestion notification mechanisms proposed for Data Center Ethernet, namely BCN [13, 14], FECN [15, 27], enhanced FECN (E-FECN) [16], and QCN [17].

The remainder of this paper is organized as follows. Section 2 introduces Data Center Ethernet and its congestion control. The Backward Congestion Notification mechanism (BCN), the Forward Explicit Congestion Notification mechanism (FECN), the Enhanced Forward Explicit Congestion Notification mechanism (E-FECN), and the Quantized Congestion Notification mechanism (QCN) are described in Sections 3, 4, 5 and 6, respectively. Section 7 analyzes the pros and cons of the four congestion notification mechanisms. Finally, conclusions and future work are discussed in Section 8.

2. Background

2.1 Data Center Ethernet

With the vigorous development of various network technologies in recent years, many companies have established large data centers composed of large numbers of storage devices, which can store, access, and process large amounts of data [18]. What plays a vital role in a large data center is the Data Center Ethernet that supports internal communication among a large number of computing resources. The Data Center Ethernet contains not only storage devices but also servers used to manage data and Ethernet switches that interconnect the devices. It supports not only online services (such as web search) but also distributed data-processing frameworks (such as MapReduce [20]) and distributed file systems (such as GFS). Because of its powerful capabilities, the Data Center Ethernet is widely used in the government, business, education, and financial sectors [19].

2.2 Congestion Control in Data Center Ethernet

Although the data center network is different from traditional local area networks and wide area networks, it still faces a series of problems. For example, the number of servers in the data center network is increasing at an exponential rate, which leads to problems in interconnection, congestion notification, and network robustness [21, 22]. In addition, various applications in the data center network have different traffic requirements, such as avoiding packet loss and providing low latency [1], which brings further challenges to the network.
More importantly, the data center network is equipped with a large number of end stations, which makes the network susceptible to link congestion. For example, frequent congestion of the output link of a core switch causes the loss and delay of a large number of data packets, thereby greatly reducing the throughput of the entire data center system. To achieve high reliability, data center networks originally chose Fibre Channel. However, due to the high cost of that technology, the IEEE 802.1 Standards Committee decided to use Ethernet as the data center network infrastructure [23, 24].

Although Ethernet was invented many years ago, it still lacks a congestion control mechanism. TCP is used to provide reliability for Ethernet: it infers the occurrence of congestion by detecting packet drops, but its retransmission mechanism increases delay [25]. At the same time, the reliability of applications using UDP cannot be guaranteed. Therefore, Data Center Ethernet must provide a congestion control mechanism at the data link layer. Depending on the network topology, each frame usually passes through multiple intermediate switches [1]. The output queues of the intermediate switches grow when network congestion occurs, and frames are discarded when the queue length exceeds a defined threshold. When a downstream switch lacks sufficient buffer resources to receive frames, a congestion control mechanism is needed to reduce frame loss in the network. To support such congestion control, researchers have developed several approaches, such as the Backward Congestion Notification mechanism (BCN) [13, 14], the Forward Explicit Congestion Notification mechanism (FECN) [15, 27], the Enhanced Forward Explicit Congestion Notification mechanism (E-FECN) [16], and the Quantized Congestion Notification mechanism (QCN) [17]. Among them, BCN and QCN perform flow control by sending feedback messages to upstream devices, whereas FECN and E-FECN send congestion messages to end stations for flow control.

3. Backward Congestion Notification Mechanism (BCN)

Bergamasco at Cisco first proposed the Backward Congestion Notification mechanism (BCN) [13, 14], which is also called Ethernet Congestion Management (ECM). BCN is a rate-based closed-loop feedback control mechanism. As shown in Fig. 1, the sources are equipped with rate regulators, which are used to adjust the rate of individual flows. The core switch, in which a congestion detector is integrated, determines whether there is congestion by monitoring the length of its output queue, which indicates buffer utilization. When congestion happens, the core switch sends the sources a BCN feedback message that requires them to adjust their rates. The sources update the rates of their regulators according to the BCN feedback messages received from the core switch. There are three main processes in the Backward Congestion Notification mechanism: Congestion Detection, Congestion Signaling, and Source Reaction.

A. Congestion Detection

BCN performs congestion detection by checking the buffer utilization level against two thresholds, Qeq and Qsc. Qeq denotes the equilibrium queue length and Qsc the severe congestion queue length [26]. The core switch samples incoming packets with probability Pm. After sampling, the switch calculates a congestion measure Fb to determine the congestion level on the link and sends different BCN feedback messages according to that level. The congestion measure Fb is calculated as follows:

Fb = -(Qoff(t) + W * Qdelta(t)) = (Qeq - q(t)) - W * (q_now - q_old)

Here, Qoff(t) = q(t) - Qeq is the queue offset at the sampling instant, W is a non-negative constant weight, and Qdelta(t) is the change in queue length since the previous sample, defined as the difference between the current queue length q_now and the queue length q_old at the last sampling. If Fb is positive, the sources should increase their rates; otherwise, the sources decrease their rates.

Fig -1: Backward Congestion Notification (BCN)
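As a concrete illustration, the following Python sketch shows how a congestion point might implement the sampling and Fb computation described above. The class structure, parameter values, and the simplified message-type decision (which ignores the tagged-packet cases discussed in the next subsection) are illustrative assumptions rather than part of the BCN specification.

```python
import random

class BCNCongestionDetector:
    """Illustrative BCN congestion-point sketch; parameter values and the
    simplified message-type decision are assumptions, not the BCN spec."""

    def __init__(self, q_eq=16, q_sc=64, w=2.0, p_m=0.01):
        self.q_eq = q_eq    # equilibrium queue length Qeq (frames)
        self.q_sc = q_sc    # severe congestion threshold Qsc (frames)
        self.w = w          # non-negative weight W on the queue-length change
        self.p_m = p_m      # sampling probability Pm
        self.q_old = 0      # queue length seen at the previous sample

    def on_packet_arrival(self, q_now):
        """Sample an arriving packet with probability Pm; if sampled, compute
        Fb and choose the feedback type (tagged-packet cases omitted)."""
        if random.random() >= self.p_m:
            return None                    # packet not sampled: no feedback
        # Fb = -(Qoff + W * Qdelta) = (Qeq - q_now) - W * (q_now - q_old)
        q_off = q_now - self.q_eq
        q_delta = q_now - self.q_old
        fb = -(q_off + self.w * q_delta)
        self.q_old = q_now
        if q_now > self.q_sc:
            return ("BCN_STOP", fb)        # severe congestion
        if q_now > self.q_eq:
            return ("BCN_NORMAL", fb)      # normal feedback carrying Fb
        return None                        # queue below equilibrium: no message
```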
B. Congestion Signaling

BCN has two types of signals: BCN normal messages and BCN STOP messages. The congestion measure Fb and the ID of the congested switch (CPID) are included in a BCN normal message. Note that, after receiving a BCN message, the source tags each packet it subsequently sends with the CPID. The different BCN messages, which use the 802.1Q tag format [13], are generated as follows:

If the packet is not tagged by the source:
- If q(t) < Qeq, the core switch does not send a BCN message.
- If Qeq < q(t) <= Qsc, the core switch sends a normal BCN message.
- If q(t) > Qsc, the core switch sends a BCN STOP message.

If the packet is tagged by the source:
- If q(t) < Qeq and the CPID field of the packet matches the CPID of the core switch, the switch sends a positive BCN message.
- If Qeq < q(t) <= Qsc, the core switch sends a normal BCN message.
- If q(t) > Qsc, the core switch sends a BCN STOP message.

After receiving a BCN STOP message, the source sets the rate of its rate regulator to 0, holds it there for a period of time, and then resumes at rate C/K, where K is the number of flows in the network and C is the capacity of the bottleneck link.

C. Source Reaction

When the source receives a normal BCN message, it calculates a new rate from Fb and adjusts the settings of its rate regulator. These adjustments follow the Additive Increase Multiplicative Decrease (AIMD) algorithm.

4. Forward Explicit Congestion Notification Mechanism (FECN)

The Forward Explicit Congestion Notification mechanism (FECN) is a closed-loop explicit rate feedback control scheme [15, 27]. As shown in Fig. 2, the sources use a rate-based load sensor to periodically detect congestion on the path from the source to the destination. For this purpose, the sources periodically send probe messages that contain a rate field. As a probe message travels from the source to the destination, each switch on the path checks its rate field. If the value of the rate field is greater than the available bandwidth, the switch lowers the value and advertises the same rate to all flows that pass through it, so all flows are treated fairly in the FECN mechanism. When the probe message finally returns from the destination to the source, the value in its rate field is the exact rate that the flow should follow, and the rate regulator of the source updates the sending rate according to this value. Because the source can adjust its rate only after the explicit rate returns, the FECN mechanism responds slowly, with a delay that depends on the round-trip time (RTT). Note that a Rate Discovery Tag (RDTag) is included in the probe message.

Fig -2: Forward Explicit Congestion Notification (FECN)
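The rate-discovery round trip just described can be summarized with the following Python sketch. The class names, the three-switch path, and the rate values are hypothetical; the sketch only illustrates how each switch clamps the probe's rate field and how the source adopts the returned value.

```python
from dataclasses import dataclass

@dataclass
class ProbeMessage:
    """FECN probe carrying a rate field (RDTag contents are omitted here)."""
    rate: float  # rate discovered so far along the path, in Gbit/s

class Switch:
    """Illustrative FECN switch behaviour: clamp the probe's rate field
    to the bandwidth this switch can currently offer the flow."""
    def __init__(self, available_rate):
        self.available_rate = available_rate

    def forward_probe(self, probe):
        if probe.rate > self.available_rate:
            probe.rate = self.available_rate   # advertise the lower rate
        return probe

class Source:
    """Illustrative FECN source: send a probe at the current rate and
    adopt whatever rate comes back after the round trip."""
    def __init__(self, initial_rate):
        self.rate = initial_rate

    def send_probe(self):
        return ProbeMessage(rate=self.rate)

    def on_probe_return(self, probe):
        self.rate = probe.rate                 # explicit rate from the path

# Example round trip over a hypothetical three-switch path (rates in Gbit/s).
source = Source(initial_rate=10.0)
path = [Switch(10.0), Switch(4.0), Switch(6.0)]
probe = source.send_probe()
for sw in path:                  # forward direction: each switch clamps the rate
    probe = sw.forward_probe(probe)
source.on_probe_return(probe)    # destination echoes the probe back unchanged
print(source.rate)               # -> 4.0, the bottleneck's advertised rate
```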
5. Enhanced Forward Explicit Congestion Notification Mechanism (E-FECN)

The Enhanced Forward Explicit Congestion Notification (E-FECN) mechanism is a hybrid congestion notification scheme [16]. As shown in Fig. 3, E-FECN uses not only the normal probe messages of FECN but also the BCN feedback messages of BCN. When severe congestion happens, that is, when the queue length is greater than the severe congestion threshold (Qsc), the switch sends BCN messages directly to the source [28]. This BCN message requires the rate regulator at the source to drop to a low initial rate.

The basic algorithm is as follows:

At the sources:
a. The sources start at full rate, and the rate drops after receiving BCN messages or FECN messages.
b. The rate regulators limit the rates of the congested flows.
c. If the source receives a FECN message, the flows are set to the rates indicated in the FECN message.

At the switches:
a. E-FECN performs the same operations as FECN.
b. The switches monitor the lengths of their queues.
c. If the queue length exceeds the severe congestion threshold Qsc, a BCN message is sent back to the source.

To maintain rate consistency, the switch sets the advertised rate sent in the FECN messages to the minimum value for all sources.

Fig -3: Enhanced Forward Explicit Congestion Notification (E-FECN)

6. Quantized Congestion Notification Mechanism (QCN)

The Quantized Congestion Notification mechanism (QCN) is very similar to BCN. It is also a closed-loop, rate-based congestion control scheme. The basic principle of QCN is to dynamically update the flow rate according to the state of the network. When network congestion occurs, the switch (congestion point) sends a feedback message containing congestion information to the source (reaction point) [18]. The scheme consists of two main parts: the Congestion Point and the Reaction Point.

As shown in Fig. 4, the congestion point detects congestion by checking the output queue of the switch, with the goal of keeping the queue size at an equilibrium value Qeq. The congestion point signals congestion to the source by sending a feedback message that reflects the congestion level at that point. The reaction point reduces the flow rate after receiving a feedback message from the congestion point. Unlike in BCN, it increases the rate on its own when it detects additional available bandwidth or needs to recover lost bandwidth. In fact, the QCN congestion point algorithm is almost the same as that of BCN. The only difference is that when there is no congestion in the network, the QCN congestion point does not send positive feedback to the source; the reaction point must decide by itself when to increase the rate.

The reaction point of QCN increases the rate through three phases: Fast Recovery (FR), Active Increase (AI), and Hyper-Active Increase (HAI). FR is used to quickly recover the rate that the source's rate regulator most recently gave up. AI mainly probes for additional bandwidth. HAI applies when multiple sessions at a congestion point have ended and the bandwidth they released can be consumed by other sessions.

Fig -4: Quantized Congestion Notification (QCN)
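To make the reaction-point behaviour more concrete, the following Python sketch shows one way the multiplicative decrease and the FR/AI/HAI increase phases could be implemented. The class name, gains (gd, r_ai, r_hai), and cycle counts are illustrative assumptions, not values taken from the QCN standard.

```python
class QCNReactionPoint:
    """Simplified QCN reaction-point sketch; gains, step sizes, and cycle
    counts are illustrative assumptions, not the 802.1Qau defaults."""

    FR_CYCLES = 5   # Fast Recovery cycles before moving to Active Increase
    AI_CYCLES = 5   # Active Increase cycles before Hyper-Active Increase

    def __init__(self, line_rate, gd=1 / 128, r_ai=0.05, r_hai=0.5):
        self.current_rate = line_rate   # CR: rate enforced by the rate regulator
        self.target_rate = line_rate    # TR: rate the regulator tries to recover
        self.gd = gd                    # multiplicative-decrease gain
        self.r_ai = r_ai                # Active Increase step (Gbit/s, assumed)
        self.r_hai = r_hai              # Hyper-Active Increase step (Gbit/s, assumed)
        self.cycle = 0                  # increase cycles since the last feedback

    def on_congestion_feedback(self, fb):
        """Negative feedback from the congestion point: remember the current
        rate as the recovery target and cut the rate in proportion to |Fb|."""
        self.target_rate = self.current_rate
        self.current_rate *= max(0.0, 1.0 - self.gd * abs(fb))
        self.cycle = 0

    def on_increase_cycle(self):
        """Byte counter or rate-increase timer expired: FR, then AI, then HAI."""
        self.cycle += 1
        if self.cycle > self.FR_CYCLES + self.AI_CYCLES:
            # Hyper-Active Increase: probe aggressively for released bandwidth.
            extra = self.cycle - self.FR_CYCLES - self.AI_CYCLES
            self.target_rate += self.r_hai * extra
        elif self.cycle > self.FR_CYCLES:
            # Active Increase: probe gently for additional bandwidth.
            self.target_rate += self.r_ai
        # In every phase the current rate moves halfway back to the target.
        self.current_rate = (self.current_rate + self.target_rate) / 2.0

# Hypothetical usage: a 10 Gbit/s source receives feedback, then recovers.
rp = QCNReactionPoint(line_rate=10.0)
rp.on_congestion_feedback(fb=-30)
rp.on_increase_cycle()
```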
7. Performance Analysis and Discussion

We discuss and analyze the advantages and disadvantages of the BCN, FECN, E-FECN, and QCN congestion notification mechanisms from the following aspects.

1. Fairness
BCN and QCN only send feedback messages to the source of the sampled packet, so they achieve only proportional fairness. In contrast, since the switches in FECN and E-FECN set all flows passing through them to the same rate and all sources receive the same feedback, FECN and E-FECN can reach a perfectly fair state.

2. Feedback Control
BCN and QCN use backward feedback control, that is, the core switch sends feedback messages to upstream devices. FECN uses forward feedback control, that is, the source sends probe messages toward the destination through the core switches. In addition to forward feedback control, E-FECN notifies the source directly to perform flow control when the core switch is severely congested.

3. Rate of Convergence to the Fair State
BCN converges to a fair state very slowly, because each BCN source calculates its new rate with the AIMD algorithm, which achieves fairness only in a long-term sense. In contrast, since all sources receive the same feedback, FECN and E-FECN can reach a perfectly fair state within a few round-trip times.

4. Overhead
The overhead of BCN is not only high but also unpredictable. The overhead of QCN is also unpredictable, but it is moderate compared with that of BCN, because QCN only sends negative feedback messages asking the source to reduce its sending rate. The overhead of FECN is low and predictable, because the payload of a FECN feedback message is only about 20 bytes and a FECN message is sent every 1 millisecond. The overhead of E-FECN is larger than that of FECN, because E-FECN additionally sends BCN messages to the source.

5. Congestion Regulation
The rate of the source rate regulator in BCN and QCN can be adjusted quickly, because feedback messages are sent directly from the congestion point to the source as soon as network congestion occurs. Rate adjustment in FECN is very slow, because a probe message sent by the source must complete a round trip before the source can react. Rate adjustment in E-FECN is intermediate, because it adds BCN feedback on top of FECN: when severe congestion occurs, a feedback message is sent directly from the congestion point to the source.

6. Throughput Oscillation
The throughput oscillations of BCN are large. QCN improves on BCN in this respect, because the source rate increase is governed jointly by a byte counter and a rate-increase timer. The throughput oscillations of FECN and E-FECN are small.

7. Fast Start
BCN, E-FECN, and QCN all support a fast start: the source starts at full rate, and its rate decreases after it receives a negative feedback message from the congestion point. In FECN, the source starts at a lower rate and gradually moves to the equilibrium rate as a series of probe messages return.

8. Link Disconnection
If a link in the network is suddenly disconnected, BCN, E-FECN, and QCN can use feedback messages to notify the source to reduce or stop the transmission of data packets. FECN lacks such feedback messages, and its probe messages may not be able to return to the source because of the failed link, so the source maintains its sending rate, resulting in packet loss.

8. Conclusion and Future Work

With the demand for high reliability and low latency in data centers, providing a congestion control mechanism has become an urgent problem to be solved in Data Center Ethernet.
This paper studies the congestion control problem of Data Center Ethernet. Backward congestion control, forward congestion control, and hybrid congestion control are currently the three main types of technologies. We discussed and analyzed the advantages and disadvantages of four congestion notification mechanisms: BCN, FECN, E-FECN, and QCN. In BCN, the source updates the rate of its regulator according to the BCN feedback message sent by the core switch. In FECN, the rate field of the probe message periodically sent by the source is modified on the way from the source to the destination; when the source receives the probe message returned from the destination, it updates its sending rate according to the value carried in the probe. E-FECN combines BCN and FECN: when severe congestion happens, the switch sends BCN messages directly to the source. QCN is very similar to BCN and also dynamically adjusts the flow rate according to the network state; unlike BCN, when there is no congestion in the network, the QCN congestion point does not notify the source to increase its sending rate.

Since QCN samples data packets randomly at the congestion point and only sends feedback messages to the source considered to be causing congestion, the sending rates of different flows may fail to converge to a fair allocation. We may address this problem as follows: first identify the flows whose sending rates exceed their fair share, then monitor each of these flows, and finally send congestion feedback to each such flow individually to ensure fairness.
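The paper only outlines this idea, so the following Python sketch is one possible interpretation under stated assumptions: the congestion point counts bytes per flow over a short interval, takes the fair share to be an equal split of the link capacity, and calls a per-flow feedback function for every flow above that share. All names, the measurement interval, and the fair-share definition are hypothetical.

```python
def per_flow_feedback(byte_counts, link_capacity, interval, feedback_fn):
    """Sketch of the proposed QCN refinement: instead of reacting only to
    randomly sampled packets, identify flows sending above their fair share
    over a measurement interval and send each of them its own feedback."""
    flows = list(byte_counts)
    if not flows:
        return
    fair_share = link_capacity / len(flows)      # equal share of the bottleneck (bit/s)
    for flow_id, nbytes in byte_counts.items():
        rate = nbytes * 8 / interval             # observed rate in bit/s
        if rate > fair_share:
            # Feed this flow's own congestion information back to its source,
            # here simply how far it is above the fair share.
            feedback_fn(flow_id, rate - fair_share)

# Hypothetical usage at a congestion point monitoring three flows for 10 ms.
counts = {"flow-A": 9_000_000, "flow-B": 2_000_000, "flow-C": 1_500_000}  # bytes
per_flow_feedback(counts, link_capacity=10e9, interval=0.01,
                  feedback_fn=lambda f, excess:
                      print(f, f"{excess / 1e9:.2f} Gbit/s over fair share"))
```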
"A survey on congestion notification algorithm in data centers." International Journal of Computer Applications 975 (2014): 8887. [13] Bergamasco, Davide. "Data center ethernet congestion management: Backward congestion notification." IEEE 802.1 Meeting. 2005. [14] Bergamasco, Davide, and Rong Pan. "Backward congestion notification version 2.0." IEEE 802.1 Meeting. 2005. [15] Jiang, Jinjing, Raj Jain, and Chakchai So-In. "An explicit rate control framework for lossless ethernet operation." 2008 ieee international conference on communications. IEEE, 2008. [16] So-In, Chakchai, Raj Jain, and Jinjing Jiang. "Enhanced forward explicit congestion notification (E-FECN) scheme for datacenter Ethernet networks." 2008 international symposium on performance evaluation of computer and telecommunication systems. IEEE, 2008. [17] Mliki, Hela, Lamia Chaari, and Lotfi Kamoun. "Modelling and evaluation of QCN using colored petri nets." Peer-to-Peer Networking and Applications 11.3 (2018): 486-503. [18] Devkota, Prajjwal, and AL Narasimha Reddy. "Performance of quantized congestion notification in TCP incast scenarios of data centers." 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems. IEEE, 2010. [19] Zhang, Yan, and Nirwan Ansari. "On architecture design, congestion notification, TCP incast and power consumption in data centers." IEEE Communications Surveys & Tutorials 15.1 (2012): 39-64. [20] Zhang, Yan. "Congestion control, energy efficiency and virtual machine placement for data centers." (2014). [21] Li, Weihe, et al. "Survey on Traffic Management in Data Center Network: From Link Layer to Application Layer." IEEE Access 9 (2021): 3842738456. [22] Xia, Wenfeng, et al. "A survey on data center networking (DCN): Infrastructure and operations." IEEE communications surveys & tutorials 19.1 (2016): 640-656. [23] Oβ«Χ³β¬Hanlon R, Data center Ethernet, TERENA workshop, 2006. [24] Cisco System Inc, Data Center Ethernet (DCE), www.cisco.com/go, 2008. [25] Gusat, Mitchell, Cyriel Minkenberg, and Gergely János Paljak. "Flow and congestion control for datacenter networks." RZ3742 2009 (2009). [26] Jiang, J., and Raj Jain. "Simulation Modelling of BCN V2. 0 Phase 1: Model Validation." IEEE 802.1 Congestion Group Meeting. 2006. [27] Jiang J, Jain R, So-In C. "Forward explicit congestion notification (FECN) for datacenter Ethernet networks." IEEE 802.1au congestion notification group meeting, 2007. 12 [28] Joglekar, Rohit P., and P. Game. "Managing congestion in data center network using congestion notification algorithms." International Research Journal of Engineering and Technology (IRJET) 3.5 (2016): 195-200.