Investigating Serial Attached SCSI (SAS) over TCP (tSAS)

Master's Project Report by Deepti Reddy

As part of the requirements for the degree of Master of Science in Computer Science
University of Colorado, Colorado Springs

Committee Members and Signatures

Approved by                                          Date

______________________________      _______________
Project Advisor: Dr. Edward Chow

______________________________      _______________
Member: Dr. Xiaobo Zhou

______________________________      _______________
Member: Dr. Chuan Yue

Acknowledgements

I would like to express my sincere thanks and appreciation to Dr. Chow. His support and encouragement as my advisor helped me learn a lot and fuelled my enthusiasm for this project. He has been an excellent professor, mentor and advisor throughout my Master's program at UCCS. Appreciation and thanks also to my committee members Dr. Zhou and Dr. Yue for their guidance and support. I would also like to thank Patricia Rea for helping me with all the required paperwork during my Master's program.

Contents

Investigating Serial Attached SCSI (SAS) over TCP (tSAS) ........ 5
1. Abstract ........ 5
2. Background on SCSI, iSCSI & SAS ........ 6
  2.1 SCSI (Small Computer Systems Interface) ........ 6
    2.1.1 SCSI Architecture Model ........ 7
    2.1.2 SCSI Command Descriptor Block ........ 8
    2.1.3 Typical SCSI IO Transfer ........ 10
    2.1.4 Limitations of SCSI ........ 11
  2.2 iSCSI (Internet Small Computer System Interface) ........ 12
    2.2.1 iSCSI Session and Phases ........ 12
    2.2.2 iSCSI PDU ........ 13
    2.2.3 Data Transfer between Initiator and Target(s) ........ 14
    2.2.4 Read/Write command sequence in iSCSI ........ 16
  2.3 Serial Attached SCSI (SAS) ........ 17
    Figure 2.3.0 – A typical SAS topology ........ 19
    2.3.1 Protocols used in SAS ........ 19
    2.3.2 Layers of the SAS Standard ........ 19
    2.3.3 SAS Ports ........ 21
    2.3.4 Primitives ........ 21
    2.3.5 SSP frame format ........ 22
    2.3.6 READ/WRITE command sequence ........ 26
3.0 tSAS (Ethernet SAS) ........ 28
  3.1 Goal, Motivation and Challenges of the Project ........ 28
  3.2 Project Implementation ........ 28
    3.2.0 tSAS Topology and Command flow sequence ........ 29
    3.2.1 Software and Hardware solutions for tSAS implementations ........ 33
    3.2.2 Primitives ........ 35
    3.2.3 Discovery ........ 38
    3.2.4 Task Management ........ 41
    3.2.5 tSAS mock application to compare with an iSCSI mock application ........ 41
  3.3 Performance evaluation ........ 44
    3.3.0 Measuring SAS performance using IOMeter in Windows and VDbench in Linux ........ 44
    3.3.1 Measuring iSCSI performance using IOMeter in Windows ........ 51
    3.3.2 Measuring tSAS performance using the client and server mock application written and comparing it to the iSCSI client/server mock application as well as to legacy SAS and legacy iSCSI ........ 56
4.0 Similar Work ........ 70
5.0 Future Direction ........ 70
6.0 Conclusion (Lessons learned) ........ 71
7.0 References ........ 71
8.0 Appendix ........ 75
  8.1 How to run the tSAS and iSCSI mock initiator (client) and target (server) application ........ 75
  8.2 How to run iSCSI Server and iSCSI Target Software ........ 78
  8.3 How to run LeCroy SAS Analyzer Software ........ 78
  8.4 WireShark to view the WireShark traces ........ 78
  8.5 VDBench for Linux ........ 78
  8.6 IOMeter for Windows ........ 79

Investigating Serial Attached SCSI (SAS) over TCP (tSAS)

Project directed by Professor Edward Chow

1. Abstract

Serial Attached SCSI (SAS) [1], the successor of SCSI, is rapidly gaining popularity in enterprise storage systems. SAS is more reliable, cheaper, faster and more scalable than its predecessor SCSI. One of the limiting features of SAS is distance: a single point-to-point SAS cable connection can cover only around 8 meters. To scale topologies to support a large number of devices beyond the native port count, expanders are used in SAS topologies [2]. With zoning [2] capabilities introduced in SAS2 expanders, SAS is gaining popularity in Storage Area Networks. With the growing demand for SAS in large topologies arises the need to investigate SAS over TCP (tSAS) to increase the distance and scalability of SAS. The iSCSI protocol [3] provides similar functionality today by sending SCSI commands over TCP. However, an iSCSI HBA cannot talk directly in-band to SAS drives and SAS expanders, which makes the iSCSI back-end less scalable than tSAS. The iSCSI specification is leveraged heavily for the design considerations for tSAS.

The goal of this project is to provide research results for a future industry specification for tSAS and iSCSI. The project involves understanding the iSCSI protocol as well as the SAS protocol and providing guidance on how tSAS can be designed. The project also involves investigating sending a set of SAS commands and responses over TCP/IP (tSAS) to address the scalability and distance limitations of legacy SAS. A client prototype application will be implemented to send/receive a small set of commands. A server prototype application will be implemented that receives a set of tSAS commands from the client and sends tSAS responses. The client application mocks a tSAS initiator while the server application mocks a tSAS target. The performance of tSAS will be compared to legacy SAS and iSCSI to determine the speed and scalability of tSAS. To compare fairly with legacy iSCSI, a client and server prototype that mock an iSCSI initiator and iSCSI target will also be implemented.

2. Background on SCSI, iSCSI & SAS

2.1 SCSI (Small Computer Systems Interface)

Since work first began in 1981 on an I/O technology that was later named the Small Computer System Interface, this set of standard electronic interfaces has evolved to keep pace with a storage industry that demands more performance, manageability, flexibility, and features for high-end desktop/server connectivity each year [4]. SCSI allows connectivity with up to seven devices on a narrow bus and 15 devices on a wide bus, plus the controller [5]. The SCSI protocol is an application-layer storage protocol. It is a standard for connecting peripherals to a computer via a standard hardware interface, using standard SCSI commands. The primary motivation for SCSI was to provide a way to logically address blocks. Logical addresses eliminate the need to physically address data blocks in terms of cylinder, head, and sector. The advantage of logical addressing is that it frees the host from having to know the physical organization of a drive [6][7][8][9]. The SCSI protocol currently in use is SCSI-3.
The SCSI standard defines the data transfer process over a SCSI bus, arbitration policies, and even device addressing [10]. Below is a snapshot of SCSI history:

Type/Bus                      | Approx. Speed | Mainly used for
SCSI-2 (8-bit narrow)         | 10 MB/Sec     | Scanners, Zip drives, CD-ROMs
UltraSCSI (8-bit narrow)      | 20 MB/Sec     | CD-Recorders, Tape Drives, DVD drives
Ultra Wide SCSI (16-bit wide) | 40 MB/Sec     | Lower end Hard Disk Drives
Ultra2 SCSI (16-bit wide)     | 80 MB/Sec     | Mid range Hard Disk Drives
Ultra-160 SCSI (16-bit wide)  | 160 MB/Sec    | High end Hard Disk Drives and Tape Drives
Ultra-320 SCSI (16-bit wide)  | 320 MB/Sec    | State-of-the-art Hard Disk Drives, RAID backup applications
Ultra-640 SCSI (16-bit wide)  | 640 MB/Sec    | High end Hard Disk Drives, RAID applications, Tape Drives

Figure 2.1.0 – Snapshot of SCSI History [10]

The SCSI protocol emerged as the predominant protocol inside host servers because of its well-standardized and clean message-based interface [11]. Moreover, in later years, SCSI supported command queuing at the storage devices and also allowed for overlapping commands [11]. In particular, since the storage was local to the server, the preferred SCSI transport was Parallel SCSI, where multiple storage devices were connected to the host server using a cable-based bus [11].

2.1.1 SCSI Architecture Model

The SCSI architecture model is a client-server model. The initiator (Host Bus Adapter) initiates commands and acts like the client, while the target (hard disk drive, tape drive, etc.) responds to commands initiated by the initiator and therefore acts as a server. Figures 2.1.1.0 and 2.1.1.1 show the SCSI architecture model [9][12].

Figure 2.1.1.0: SCSI Standards Architecture Model [9][12]

Figure 2.1.1.1: Basic SCSI Architecture [9]

2.1.2 SCSI Command Descriptor Block

Protocol Data Units (PDUs) are passed between the initiator and target to send commands between a client and server. A PDU in SCSI is known as a Command Descriptor Block (CDB). It is used to communicate a command from a SCSI application client to a SCSI device server. In other words, the CDB defines the operation to be performed by the server. A CDB may have a fixed length of up to 16 bytes or a variable length between 12 and 260 bytes. A typical 10-byte CDB format is shown below in Figure 2.1.2.0 [9][13][14].

Figure 2.1.2.0: 10-byte SCSI CDB

SCSI Common CDB Fields:

Operation Code: The first byte of the CDB consists of the operation code (opcode), which identifies the operation being requested by the CDB. The two main opcodes of interest for this project are the Read and Write opcodes. The opcode for a Read operation is 0x28 and the opcode for a Write operation is 0x2A [9][13][14].

Logical block address: The logical block addresses on a logical unit or within a volume/partition begin with block zero and are contiguous up to the last logical block of that logical unit or volume/partition [9][13][14].

Transfer length: The transfer length field specifies the amount of data to be transferred for each IO, usually as a number of blocks. Some commands use the transfer length to specify the requested number of bytes to be sent, as defined in the command description. A transfer length of zero implies that no data will be transferred for the particular command. A command without any data and simply a response (non-DATA command) will have the transfer length set to a value of zero [9][13][14][15].
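As a concrete illustration of these fields, the minimal C sketch below builds a 10-byte READ(10) CDB using the common READ(10) layout (opcode in byte 0, big-endian logical block address in bytes 2-5, big-endian transfer length in blocks in bytes 7-8). The function name is ours and the flag bits are simply left at zero.

/* Minimal sketch of a 10-byte READ(10) CDB builder, based on the fields
 * described above (opcode 0x28, logical block address, transfer length). */
#include <stdint.h>
#include <string.h>

void build_read10_cdb(uint8_t cdb[10], uint32_t lba, uint16_t num_blocks)
{
    memset(cdb, 0, 10);
    cdb[0] = 0x28;                       /* operation code: READ(10)               */
    cdb[2] = (uint8_t)(lba >> 24);       /* logical block address, big-endian      */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)(lba);
    cdb[7] = (uint8_t)(num_blocks >> 8); /* transfer length in blocks, big-endian  */
    cdb[8] = (uint8_t)(num_blocks);
}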
Logical Unit Numbers: The SCSI protocol defines how to address the various units to which the CDB is to be delivered. Each SCSI device (target) can be subdivided into one or more logical units (LUNs). A logical unit is simply a virtual controller that handles SCSI communications on behalf of storage devices in the target. Each logical unit has an address associated with it, which is referred to as the logical unit number. Each target must have at least one LUN. If only one LUN is present, it is assigned as LUN 0 [9][13][14][15].

For more details on these fields, please refer to the SCSI spec [12].

2.1.3 Typical SCSI IO Transfer

The three main phases of an IO transfer are the command phase, the data phase and the status phase. The initiator sends the command to a target. Data is then exchanged between the target and initiator. Finally, the target sends the status completion for the command to the initiator. Certain commands, known as non-DATA commands, do not have a data phase. Figure 2.1.3.0 shows a SCSI IO transfer for a non-data command while Figure 2.1.3.1 shows a SCSI IO transfer for a data command [7][8][9][10].

Figure 2.1.3.0: Non-Data Command Transfer [9]

Figure 2.1.3.1: Data I/O Operation [9]

2.1.4 Limitations of SCSI

Although the SCSI protocol has been used successfully for many years, it has limited capabilities in terms of the realization of storage networks due to the limitations of the SCSI bus [11]. As the need for storage and servers grew over the years, the limitations of SCSI as a technology became apparent [14]. First, the use of parallel cables limited the number of storage devices and the distance of the storage devices from the host server. The length of the bus limits the distance over which SCSI may operate (a maximum of around 25 meters) [9][14]. These limits imply that adding storage devices beyond the bus limit means purchasing another host server to attach the storage to [14]. Second, the concept of attaching storage to every host server in the topology means that the storage has to be managed on a per-host basis. This is a costly solution for centers with a large number of host servers. Finally, the technology doesn't allow for convenient sharing of storage between several host servers, nor does SCSI technology typically allow for easy addition or removal of storage without host server downtime [16]. Despite these limitations, the SCSI protocol is still of importance since it can be used with other protocols simply by replacing the SCSI bus with a different interconnect such as Fibre Channel, IP networks, etc. [9][16]. The availability of high-bandwidth, low-latency network interconnects such as Fibre Channel (FC) and Gigabit Ethernet (GbE), along with the complexities of managing dispersed islands of data storage, led to the development of Storage Area Networks (SANs) [16]. Lately, the Internet Protocol (IP) has been advocated as an alternative to transport SCSI traffic over long distances [11]. Proposals like iSCSI standardize the encapsulation of SCSI data in TCP/IP (Transmission Control Protocol/Internet Protocol) packets [11][17]. Once the data is in IP packets, it can be carried over a range of physical network connections. Today, GbE is widely used for local area networks (LANs) and campus networks [11].

2.2 iSCSI (Internet Small Computer System Interface)

The advantages of IP networks are clear. The presence of well-tested and established protocols like TCP/IP gives IP networks both wide-area connectivity and proven bandwidth-sharing capabilities.
The emergence of Gigabit Ethernet indicates that the bandwidth requirements of serving storage over a network should not be an issue [15]. The limitations of the SCSI bus, identified in the previous section, and the increased desire for IP storage led to the development of iSCSI. iSCSI was developed as an end-to-end protocol to enable transportation of storage I/O block data over IP networks, thus dispensing with the physical bus implementation as the transport mechanism [7][20][21]. iSCSI works by mapping SCSI functionality to the TCP/IP protocol. By utilizing TCP flow control, congestion control, segmentation mechanisms, IP addressing, and discovery mechanisms, iSCSI facilitates remote backup, storage, and data mirroring [7][20][22]. The iSCSI protocol standard defines, amongst other things, the way SCSI commands can be carried over the TCP/IP protocol [7][23].

2.2.1 iSCSI Session and Phases

Data is transferred between an initiator and target via an iSCSI session. An iSCSI session is a physical or logical link between an initiator and target that carries iSCSI PDUs over TCP/IP. The PDUs in turn carry SCSI commands and data in the form of SCSI CDBs [7][23]. There are four phases in a session, where the first phase, login, starts with the establishment of the first TCP connection [19]. The four phases are:

1) Initial login phase: In this phase, an initiator sends the name of the initiator and target, and specifies the authentication options. The target then responds with the authentication options the target selects [19].

2) Security authentication phase: This phase is used to exchange authentication information (ID, password, certificate, etc.) based on the agreed authentication methods, to make sure each party is actually talking to the intended party. The authentication can occur both ways, such that a target can authenticate an initiator and an initiator can also request authentication of the target. This phase is optional [19].

3) Operational negotiating phase: The operational negotiating phase is used to exchange certain operational parameters such as the protocol data unit (PDU) length and buffer size. This phase is also optional [19].

4) Full featured phase: This is the normal phase of an iSCSI session, where iSCSI commands and data messages are transferred between an initiator and target(s) [19].

2.2.2 iSCSI PDU

The iSCSI PDU is the equivalent of the SCSI CDB. It is used to encapsulate the SCSI CDB and any associated data. The general format of a PDU is shown in Figure 2.2.2.0. It is composed of a number of segments, one of which is the basic header segment (BHS). The BHS is mandatory and is the segment that is mostly used. The BHS segment layout is shown in Figure 2.2.2.1. It has a fixed length of 48 bytes. The Opcode, TotalAHSLength, and DataSegmentLength fields in the BHS are mandatory fields in all iSCSI PDUs. The Additional Header Segment (AHS) begins with 4-byte Type-Length-Value (TLV) information, which specifies the length of the actual AHS following the TLV. The Header and Data digests are optional values. The purpose of these fields is to protect the integrity and authenticity of the header and data. The digest types are negotiated during the login phase [9].

Figure 2.2.2.0 – iSCSI PDU Structure

Figure 2.2.2.1 – Basic Header Segment (BHS)
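To make the BHS layout concrete, the sketch below declares the 48-byte BHS as a packed C structure. Only the fields called out above (Opcode, TotalAHSLength, DataSegmentLength) and the initiator task tag are named individually; the remaining bytes are grouped as opcode-specific, following the general BHS format in the iSCSI specification. The type name is our own shorthand.

/* Minimal sketch of the 48-byte iSCSI Basic Header Segment (BHS). */
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  opcode;                  /* byte 0: I-bit + opcode                   */
    uint8_t  opcode_specific1[3];     /* bytes 1-3: opcode-specific flags         */
    uint8_t  total_ahs_length;        /* byte 4: length of all AHSs (4-byte words)*/
    uint8_t  data_segment_length[3];  /* bytes 5-7: 24-bit data segment length    */
    uint8_t  lun_or_opcode_specific[8];
    uint32_t initiator_task_tag;      /* bytes 16-19                              */
    uint8_t  opcode_specific2[28];    /* bytes 20-47                              */
} iscsi_bhs_t;                        /* sizeof(iscsi_bhs_t) == 48                */
#pragma pack(pop)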
2.2.3 Data Transfer between Initiator and Target(s)

Once the full feature phase of the normal session has been established, data can be exchanged between the initiator and the target(s). The normal session is used to allow transfer of data to/from the initiator and target. Let us assume that an application on the initiator wishes to perform storage I/O to/from the target. This can be broken down into two stages:

1. Progression of the SCSI command through the initiator, and
2. Progression of the SCSI command through the target.

To help in understanding the progression of the commands, the iSCSI protocol layering model is shown in Figure 2.2.3.0 [9].

Figure 2.2.3.0 – iSCSI protocol layering model

Progression of a SCSI Command through the Initiator

1. The user/kernel application on the initiator issues a system call for an I/O operation, which is sent to the SCSI layer.
2. On receipt at the SCSI layer, the system call is converted into a SCSI command and a CDB containing this information is constructed. The SCSI CDB is then passed to the iSCSI initiator protocol layer [9].
3. At the iSCSI protocol layer, the SCSI CDB and any SCSI data are encapsulated into a PDU and the PDU is forwarded to the TCP/IP layer [9].
4. At the TCP layer, a TCP header is added. The IP layer encapsulates the TCP segment by adding an IP header before the TCP header [9].
5. The IP datagram is passed to the Ethernet Data Link Layer, where it is framed with Ethernet headers and trailers. The resulting frame is finally placed on the network [9].

Progression of a SCSI Command through the Target

1. At the target, the Ethernet frame is stripped off at the Data Link Layer. The IP datagram is passed up to the TCP/IP layer [9].
2. The IP and TCP layers each check and strip off headers and pass the iSCSI PDU up to the iSCSI layer [9].
3. At the iSCSI layer, the SCSI CDB is extracted from the iSCSI PDU and passed along with the data to the SCSI layer [9].
4. Finally, the SCSI layer sends the SCSI request and data to the upper layer application [9].
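Step 2 of the initiator-side progression (wrapping the CDB into an iSCSI PDU) can be sketched in C. The example below fills in a SCSI Command PDU header carrying the READ(10) CDB built earlier; the field offsets follow the SCSI Command PDU layout in the iSCSI specification (RFC 3720), while the helper name and flag choices are illustrative assumptions rather than any particular implementation.

/* Sketch: a SCSI Command PDU header (one specialization of the 48-byte BHS). */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl */

#pragma pack(push, 1)
typedef struct {
    uint8_t  opcode;                 /* 0x01 = SCSI Command                  */
    uint8_t  flags;                  /* F/R/W bits and task attributes       */
    uint8_t  reserved[2];
    uint8_t  total_ahs_length;       /* no AHS in this sketch                */
    uint8_t  data_segment_length[3]; /* no immediate data in this sketch     */
    uint8_t  lun[8];
    uint32_t initiator_task_tag;
    uint32_t expected_data_length;   /* bytes expected on the Data-In phase  */
    uint32_t cmd_sn;
    uint32_t exp_stat_sn;
    uint8_t  cdb[16];                /* CDB, padded to 16 bytes              */
} iscsi_scsi_cmd_pdu_t;              /* 48 bytes                             */
#pragma pack(pop)

void build_read_pdu(iscsi_scsi_cmd_pdu_t *pdu, const uint8_t cdb10[10],
                    uint32_t expected_bytes, uint32_t task_tag, uint32_t cmd_sn)
{
    memset(pdu, 0, sizeof(*pdu));
    pdu->opcode = 0x01;                        /* SCSI Command PDU            */
    pdu->flags = 0x80 | 0x40;                  /* Final bit + Read bit        */
    pdu->expected_data_length = htonl(expected_bytes);
    pdu->initiator_task_tag = htonl(task_tag);
    pdu->cmd_sn = htonl(cmd_sn);
    memcpy(pdu->cdb, cdb10, 10);               /* remaining CDB bytes stay 0  */
}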
2.2.4 Read/Write command sequence in iSCSI

Read Operation Example

+------------------+-----------------------+----------------------+
|Initiator Function| PDU Type              | Target Function      |
+------------------+-----------------------+----------------------+
| Command request  |SCSI Command (READ)>>> |                      |
| (read)           |                       |                      |
+------------------+-----------------------+----------------------+
|                  |                       |Prepare Data Transfer |
+------------------+-----------------------+----------------------+
| Receive Data     | <<< SCSI Data-In      | Send Data            |
+------------------+-----------------------+----------------------+
| Receive Data     | <<< SCSI Data-In      | Send Data            |
+------------------+-----------------------+----------------------+
| Receive Data     | <<< SCSI Data-In      | Send Data            |
+------------------+-----------------------+----------------------+
|                  | <<< SCSI Response     |Send Status and Sense |
+------------------+-----------------------+----------------------+
| Command Complete |                       |                      |
+------------------+-----------------------+----------------------+

Figure 2.2.4.1 Read Operation Example [3]

Write Operation Example

+------------------+-----------------------+---------------------+
|Initiator Function| PDU Type              | Target Function     |
+------------------+-----------------------+---------------------+
| Command request  |SCSI Command (WRITE)>>>| Receive command     |
| (write)          |                       | and queue it        |
+------------------+-----------------------+---------------------+
|                  |                       | Process old commands|
+------------------+-----------------------+---------------------+
|                  |                       | Ready to process    |
|                  | <<< R2T               | WRITE command       |
+------------------+-----------------------+---------------------+
| Send Data        | SCSI Data-Out >>>     | Receive Data        |
+------------------+-----------------------+---------------------+
|                  | <<< R2T               | Ready for data      |
+------------------+-----------------------+---------------------+
|                  | <<< R2T               | Ready for data      |
+------------------+-----------------------+---------------------+
| Send Data        | SCSI Data-Out >>>     | Receive Data        |
+------------------+-----------------------+---------------------+
| Send Data        | SCSI Data-Out >>>     | Receive Data        |
+------------------+-----------------------+---------------------+
|                  | <<< SCSI Response     |Send Status and Sense|
+------------------+-----------------------+---------------------+
| Command Complete |                       |                     |
+------------------+-----------------------+---------------------+

Figure 2.2.4.2 Write Operation Example [3]

To learn more about the SCSI Command PDU, the Ready To Transfer (R2T) PDU, the SCSI Data-In PDU and the SCSI Data-Out PDU, please refer to the iSCSI specification [3].

2.3 Serial Attached SCSI (SAS)

SAS is the successor of SCSI technology and is becoming widespread as performance and addressability requirements exceed what legacy SCSI supports. SAS interfaces were initially introduced in 2004 at 3 Gb/s. Currently supporting 6 Gb/s and moving to 12 Gb/s by 2012, SAS interfaces have significantly increased the bandwidth available over legacy SCSI storage systems. Though Fibre Channel is more scalable, it is a costly solution for use in a SAN. Table 2.3.0 compares the SCSI, SAS and Fibre Channel technologies.
                  | SCSI           | SAS                                | Fibre Channel
Topology          | Parallel Bus   | Full Duplex                        | Full Duplex
Speed             | 3.2 Gbps       | 3 Gbps, 6 Gbps, moving to 12 Gbps  | 2 Gbps, 4 Gbps, moving to 8 Gbps
Distance          | 1 to 12 meters | 8 meters                           | 10 km
Devices           | SCSI only      | SAS & SATA                         | Fibre Channel only
Number of Targets | 14 devices     | 128 devices with 1 expander; >16,000 with cascaded expanders | 127 devices in a loop; switched fabric can go to millions of devices
Connectivity      | Single-port    | Dual-port                          | Dual-port
Drive Form Factor | 3.5"           | 2.5"                               | 3.5"
Cost              | Low            | Medium                             | High

Table 2.3.0 – Comparing SCSI, SAS and Fibre Channel

An initiator, also called a Host Bus Adapter (HBA) or controller, is used to send commands to SAS targets. SAS controller devices have a limited number of ports. A narrow port in SAS consists of a single phy [1]. Expander devices in a SAS domain facilitate communication between multiple SAS devices. Expanders have a typical port count of 12 to 36 ports, while SAS controllers have a typical port count of 4 to 16 ports. Expanders can be cascaded as well to increase scalability. One of the most significant SAS features is the transition from 3.5" drives to 2.5" drives, which helps reduce floor space and power consumption [2]. Another advantage of using SAS targets is that a SAS hard drive is dual-ported, providing a redundant path to each hard drive in case of an initiator/controller fail-over. Also, unlike SCSI, SAS employs a serial means of data transfer, like Fibre Channel [25]. Serial interfaces are known to reduce crosstalk and related signal integrity issues. Figure 2.3.0 shows an example of a typical SAS topology. SAS commands originate from the HBA driver and are eventually sent to the HBA. The SAS controller/HBA sends commands to the disk drives through the expander for expander-attached targets/drives. The target replies to the command through the expander. The expander simply acts like a switch: it routes commands to the appropriate target and routes the responses from a particular target back to the controller.

Figure 2.3.0 – A typical SAS topology

2.3.1 Protocols used in SAS

The three protocols used in SAS are the Serial Management Protocol, the Serial SCSI Protocol and the SATA Tunnel Protocol. The Serial Management Protocol (SMP) [1] is used to discover the SAS topology and to perform system management. The Serial SCSI Protocol (SSP) [1] is used to send SCSI commands to and receive responses from SAS targets. The SATA Tunnel Protocol (STP) [1] is used to communicate with SATA targets in a SAS topology.

2.3.2 Layers of the SAS Standard

Figure 2.3.2.0 shows the organization and layers of the SAS standard:

Figure 2.3.2.0 – Layers of the SAS Standard

As can be seen from Figure 2.3.2.0 above, the SAS physical layer consists of: a) passive interconnect (e.g., connectors and cable assemblies); and b) transmitter and receiver device electrical characteristics. The phy layer state machines interface between the link layer and the physical layer to keep track of dword synchronization [2]. The link layer defines primitives, address frames, and connections. Link layer state machines interface to the port layer and the phy layer and perform the identification and hard reset sequences, connection management, and SSP, STP, and SMP specific frame transmission and reception [2]. The port layer state machines interface with one or more SAS link layer state machines and one or more SSP, SMP, and STP transport layer state machines to establish port connections and disconnections. The port layer state machines also interpret or pass transmit data, receive data, commands, and confirmations between the link and transport layers. The transport layer defines frame formats. Transport layer state machines interface to the application layer and port layer and construct and parse frame contents [2].
The application layer defines SCSI, ATA, and management specific features [2].

2.3.3 SAS Ports

A port contains one or more phys. Ports in a device are associated with physical phys based on the identification sequence. A port is a wide port if there is more than one phy in the port. A port is a narrow port if there is only one phy in the port. In other words, a port contains a group of phys with the same SAS address, attached to another group of phys with the same SAS address [2]. Each device in the topology has a unique SAS address. For example, if an HBA is connected to expander A using phys 0, 1, 2 and 3 and to expander B using phys 4, 5, 6 and 7, then phys 0-3 of the HBA form one wide port and phys 4-7 form another wide port.

Figure 2.3.3.0 – Wide Ports in SAS

2.3.4 Primitives

Primitives are dwords mainly used to manage flow control. Some of the common primitives are:

1. ALIGN(s) – Used during speed negotiation of a link, rate matching of connections, etc.
2. AIP(s) (Arbitration In Progress) – Transmitted by an expander device after a connection request to indicate that the connection request is being processed and to convey the status of the connection request.
3. BREAK(s) – A phy aborts a connection request and breaks a connection by transmitting the BREAK primitive sequence.
4. CLOSE – A CLOSE primitive is used to close a connection.
5. OPEN ACCEPT – Indicates that a connection request has been accepted.
6. OPEN REJECT – These primitives indicate that a connection request has been rejected and also specify the reason for the rejection.
7. ACK – Acknowledges an SSP frame.
8. NAK – Negative acknowledgement of an SSP frame.
9. RRDY – Advertises SSP frame credit.
10. BROADCAST(s) – Used to notify SAS ports of events such as a change in topology [1].

To learn more about these and the other primitives, please refer to the SAS Specification.

2.3.5 SSP frame format

In this project, we primarily work with SSP Read/Write commands. A typical SSP frame format is shown below:

Figure 2.3.5.0 – SSP Frame Format

The Information Unit is a DATA frame, XFER_RDY frame, COMMAND frame, RESPONSE frame or TASK frame. For the SSP requests of interest for this project, the information unit is either a COMMAND frame, XFER_RDY frame, DATA frame or RESPONSE frame [2].

Command frame:

The COMMAND frame is sent by an SSP initiator port to request that a command be processed. The COMMAND frame consists of the logical unit number the command is intended for as well as the SCSI CDB that contains the type of command, transfer length, etc. [2].

Figure 2.3.5.1 – SSP Command Frame

XFER_RDY frame:

The XFER_RDY frame is sent by an SSP target port to request write data from the SSP initiator port during a write command or a bidirectional command [2].

Figure 2.3.5.2 – SSP XFER_RDY Frame

The REQUESTED OFFSET field contains the application client buffer offset of the segment of write data in the data-out buffer that the SSP initiator port may transmit to the logical unit using write DATA frames [2]. The WRITE DATA LENGTH field contains the number of bytes of write data the SSP initiator port may transmit to the logical unit using write DATA frames from the application client data-out buffer, starting at the requested offset [2].

DATA frame:

Figure 2.3.5.3 – SSP Data Frame

A typical DATA frame in SAS is limited to 1024 bytes (1K) [2].
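For reference, the common SSP frame header that precedes each of these information units can be sketched as a packed C structure. The layout below follows the 24-byte SSP frame header in the SAS standard, with the frame-type values for the information units named above; the struct and field names are our own shorthand, not taken from the specification text.

#include <stdint.h>

/* SSP frame type values carried in byte 0 of the frame header. */
enum ssp_frame_type {
    SSP_FRAME_DATA     = 0x01,
    SSP_FRAME_XFER_RDY = 0x05,
    SSP_FRAME_COMMAND  = 0x06,
    SSP_FRAME_RESPONSE = 0x07,
    SSP_FRAME_TASK     = 0x16
};

#pragma pack(push, 1)
typedef struct {
    uint8_t frame_type;                 /* one of enum ssp_frame_type           */
    uint8_t hashed_dest_addr[3];        /* hashed destination SAS address       */
    uint8_t reserved1;
    uint8_t hashed_src_addr[3];         /* hashed source SAS address            */
    uint8_t reserved2[2];
    uint8_t retransmit_flags;           /* retransmit / retry data frames bits  */
    uint8_t fill_bytes;                 /* number of fill bytes (low 2 bits)    */
    uint8_t reserved3[4];
    uint8_t tag[2];                     /* command/task tag                     */
    uint8_t target_port_transfer_tag[2];
    uint8_t data_offset[4];             /* byte offset for DATA frames          */
} ssp_frame_header_t;                   /* 24 bytes; the information unit follows */
#pragma pack(pop)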
Response Frame:

The RESPONSE frame is sent by an SSP target port in response to an SSP command from an initiator [2].

Figure 2.3.5.4 – SSP Response Frame

A successful write/read completion will not contain any sense data. In this project, we work with successful read/write completions and therefore sense data won't be returned by the target.

2.3.6 READ/WRITE command sequence

SSP Read Sequence [20]

Figure 2.3.6.0 – SSP Read Sequence

SSP Write Sequence [20]

Figure 2.3.6.1 – SSP Write Sequence

3.0 tSAS (Ethernet SAS)

3.1 Goal, Motivation and Challenges of the Project

The goal of this project is to investigate sending a set of SAS commands, data and responses over TCP/IP and to investigate, as well as possible, how tSAS performs against legacy iSCSI and legacy SAS. Since Ethernet contains its own physical layer, SAS over TCP (tSAS) eliminates the need for the SAS physical layer and thereby overcomes the distance limitation of the Serial Attached SCSI (SAS) physical layer interface, so that the SAS storage protocol may be used for communication between host systems and storage controllers in a Storage Area Network (SAN) [21]. SANs allow sharing of data storage over long distances and still permit centralized control and management [16]. More particularly, the SAN embodiments can comprise at least one host computer system and at least one storage controller that are physically separated by more than around 8 meters, which is the physical limitation of a SAS cable. An Ethernet fabric can connect the host computer system(s) and storage controller(s) [21]. The SAS storage protocol over TCP can also be used to communicate between storage controllers/hosts and SAS expanders, as explained later in this section. Using gigabit Ethernet (10G/40G/100G) [32], tSAS also overcomes the 6G and 12G link-rate limitations of SAS2 (6G) and SAS3 (12G) respectively.

As mentioned earlier in this paper, the main challenge of developing a tSAS client/server application is that there is no standard specification for tSAS. We leverage Michael Ko's patent [21] on SAS over Ethernet [27] to help us through the process of defining the tSAS protocol required for this project. Similar to iSCSI, TCP was chosen as the transport for tSAS. TCP has many features that are utilized by iSCSI, and the same features and reasoning are behind the choice of TCP for tSAS as well:

• TCP provides reliable in-order delivery of data.
• TCP provides automatic retransmission of data that was not acknowledged.
• TCP provides the necessary flow control and congestion control to avoid overloading a congested network.
• TCP works over a wide variety of physical media and inter-connect topologies. [23]

3.2 Project Implementation

3.2.0 tSAS Topology and Command flow sequence

Figure 3.2.0.0 below shows a typical usage of tSAS to expand the scalability, speed and distance of legacy SAS by using a tSAS HBA. In Figure 3.2.0.0, tSAS is the protocol of communication used between a remote tSAS HBA and a tSAS controller. The tSAS controller is connected to the back-end expander and drives using legacy SAS cables.

Figure 3.2.0.0 – Simple tSAS Topology

All SSP frames will be encapsulated in an Ethernet frame. Figure 3.2.0.1 shows an Ethernet frame with the SSP frame data encapsulated in it. The tSAS Header is the SSP Frame Header and the tSAS Data is the SSP Information Unit (refer to Figure 2.3.5.0 for the SSP frame format).

Figure 3.2.0.1 – tSAS header and data embedded in an Ethernet frame
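A minimal sketch of this encapsulation, under the assumption taken from the figure above that the TCP payload is simply the SSP frame header followed by the information unit, is shown below. The tsas_send_ssp_frame() helper is illustrative (it reuses the ssp_frame_header_t type sketched in section 2.3.5 and is not part of any real API); the SSP CRC is omitted because the Ethernet/TCP layers provide their own integrity checks, as discussed later.

#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical helper: send one SSP frame (header + information unit) as a
 * single tSAS message on an already-connected TCP socket. */
int tsas_send_ssp_frame(int sock,
                        const ssp_frame_header_t *hdr,  /* 24-byte header (2.3.5) */
                        const void *info_unit, size_t iu_len)
{
    /* tSAS header = SSP frame header, tSAS data = SSP information unit.
     * No SSP CRC is appended; TCP/Ethernet provide integrity protection. */
    struct iovec iov[2] = {
        { (void *)hdr,       sizeof(*hdr) },
        { (void *)info_unit, iu_len       }
    };
    ssize_t sent = writev(sock, iov, 2);
    return sent == (ssize_t)(sizeof(*hdr) + iu_len) ? 0 : -1;
}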
The back-end of tSAS is a tSAS HBA that can receive tSAS commands, strip off the TCP header and pass on the SAS command to the expander and drives. The back-end of tSAS talks in-band to the SAS expanders and drives. The remote tSAS initiator communicates with the tSAS target by sending tSAS commands. Figure 3.2.0.2 shows the typical SSP request/response read data flow. The tSAS SSP Request is initially sent by the tSAS Initiator to the tSAS Target over TCP. The tSAS Target strips off the TCP header and sends the SSP request, using the SAS Initiator block on the tSAS Target, to the SAS expander. The SAS expander sends the data frames and the SSP Response to the tSAS Target. Finally, the tSAS Target embeds the SSP data frames and response frame over TCP and sends the frames to the tSAS Initiator. A write (Figure 3.2.0.3) looks the same, with the tSAS SSP Request sent by the initiator, followed by the XFER_RDY (similar to the R2T in iSCSI) sent by the target, followed by the DATA sent by the initiator, and finally the tSAS response from the target.

Figure 3.2.0.2 – tSAS Read SSP Request & Response Sequence Diagram. This figure doesn't show all the SAS primitives exchanged on the SAS wire within a connection after the Open Accept.

Figure 3.2.0.3 – tSAS Write SSP Request & Response Sequence Diagram. This figure doesn't show all the SAS primitives exchanged on the SAS wire within a connection after the Open Accept.

SAS over Ethernet can also be used for a SAS controller to communicate with a SAS expander. In SAS1, expanders did not have support to receive SAS commands out-of-band. SAS1 controllers/HBAs would need to send commands to an expander in-band even for expander diagnosis and management. SAS HBAs/controllers have much more complex functionality than expanders. Diagnosing issues by sending commands in-band to expanders made it harder and more time-consuming to root-cause where the problem is in the SAS topology. Also, managing expanders in-band lacked the advantage of remotely managing expanders out-of-band over Ethernet. With the growing popularity of zoning, expander vendors have implemented support for a limited set of SMP zoning commands out-of-band via Ethernet in SAS2 [1]. A client management application is used to send a limited set of SMP commands out-of-band to the expander. The expander processes the commands and sends the SMP responses out-of-band to the management application. Figure 3.2.0.4 shows the communication between the client management application and the expander during an SMP command.

Figure 3.2.0.4 – SMP Request & Response Sequence Diagram. This figure doesn't show the SAS primitives exchanged on the SAS wire within a connection after the Open Accept.

This already existing functionality on a SAS expander can be leveraged to design the tSAS functionality on an expander to communicate via TCP with a SAS controller/HBA. Figure 3.2.0.5 shows a topology where the tSAS protocol is used for communication between the tSAS Controller and the back-end expander as well. Michael Ko's patent doesn't cover using tSAS to talk with expanders. However, expanders can also be designed to send commands/data/responses via TCP.

Figure 3.2.0.5 – Topology where tSAS is used to communicate with an expander
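The read flow in Figure 3.2.0.2 can be summarized, from the tSAS target's point of view, with the sketch below. It is only an illustration of the sequence described above under stated assumptions: the tsas_* helpers are the TCP-side sketches from the previous sections, and the sas_* functions are placeholders for whatever legacy SAS initiator interface the tSAS target uses on its back end; none of these names belong to a real API.

#include <stddef.h>
#include <stdint.h>

/* Placeholders (assumptions for illustration only); ssp_frame_header_t and
 * SSP_FRAME_RESPONSE are from the sketch in section 2.3.5. */
int    tsas_recv_ssp_frame(int sock, ssp_frame_header_t *hdr, void *iu, size_t *iu_len);
int    tsas_send_ssp_frame(int sock, const ssp_frame_header_t *hdr, const void *iu, size_t iu_len);
void   sas_submit_command(const ssp_frame_header_t *hdr, const void *iu, size_t iu_len);
size_t sas_recv_frame(ssp_frame_header_t *hdr, void *iu, size_t max_len);

/* Sketch of the tSAS target handling one READ, per Figure 3.2.0.2. */
void tsas_target_handle_read(int sock)
{
    ssp_frame_header_t hdr;
    uint8_t iu[1024];              /* SSP IUs in this flow are at most 1 KB */
    size_t iu_len;

    /* 1. Receive the tSAS SSP COMMAND frame over TCP. */
    tsas_recv_ssp_frame(sock, &hdr, iu, &iu_len);

    /* 2. Forward the SSP COMMAND in-band to the expander/drive. */
    sas_submit_command(&hdr, iu, iu_len);

    /* 3. Relay every SSP DATA frame, and finally the SSP RESPONSE frame,
     *    back to the tSAS initiator over the same TCP connection. */
    for (;;) {
        ssp_frame_header_t in_hdr;
        uint8_t in_iu[1024];
        size_t in_len = sas_recv_frame(&in_hdr, in_iu, sizeof(in_iu));

        tsas_send_ssp_frame(sock, &in_hdr, in_iu, in_len);
        if (in_hdr.frame_type == SSP_FRAME_RESPONSE)
            break;                 /* command complete */
    }
}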
3.2.1 Software and Hardware solutions for tSAS implementations

Similar to iSCSI, tSAS can be implemented in both hardware and software. This is one of the benefits of iSCSI and tSAS, since each organization can customize its SAN configuration based on the budget and performance needed [23][24].

Software-based tSAS solution: This solution is cheaper than a hardware-based tSAS solution since no extra hardware is needed to implement it. In this solution, all tSAS processing is done by the processor and TCP/IP operations are also executed by the CPU. The NIC is merely an interface to the network; this implementation requires a great deal of CPU cycles, hurting the overall performance of the system [23][24].

TCP/IP Offload Engine tSAS solution: As network infrastructures have reached Gigabit speeds, network resources are becoming more abundant and the bottleneck is moving from the network to the processor. Since TCP/IP processing requires a large portion of CPU cycles, a software tSAS implementation may be used along with specialized network interface cards with TCP offload engines (TOEs) on board. NICs with integrated TOEs have hardware built into the card that allows the TCP/IP processing to be done at the interface. This keeps the TCP/IP processing off the CPU, freeing the system processor to spend its resources on other applications [23][24].

Figure 3.2.1.0 – TCP/IP Offload Engine [23][24]

Hardware-based tSAS solution: In a hardware-based tSAS environment, the initiator and target machines contain a host bus adapter (HBA) that is responsible for both TCP/IP and tSAS processing. This frees the CPU from both TCP/IP and tSAS functions and dramatically increases performance in those settings where the CPU may be burdened with other tasks [23][24].

Figure 3.2.1.2 – tSAS implementations: software tSAS, software tSAS with TCP offload, and hardware tSAS with TCP offload [25]

3.2.2 Primitives

In the conventional SAS storage protocol, the SAS link layer uses a construct known as primitives. Primitives are special 8b/10b encoded characters that are used as frame delimiters, for out-of-band signaling, control sequencing, etc. Primitives were explained in section 2.3.4. These SAS primitives are defined to work in conjunction with the SAS physical layer [21]. As far as primitives go, ALIGN(s), OPEN REJECT(s), OPEN(s), CLOSE, DONE, BREAK, HARD RESET, NAK, RRDY, etc. can simply be ignored on the tSAS protocol side, since these are link-layer primitives required only on the SAS side. For example, if an IO on the SAS side times out or fails due to NAKs, BREAKs, OPEN timeouts or OPEN REJECTs, the IO will simply time out on the tSAS side to the tSAS initiator. The primitives of interest include the BROADCAST primitives, especially the BROADCAST (CHANGE) primitive, as this primitive tells an initiator that the topology has changed and that it should re-discover the topology using SMP commands. However, since, as discussed above, the SAS physical layer is unnecessary, an alternate means of conveying these SAS primitives is needed. In one embodiment, this can be accomplished by defining a SAS primitive to be encapsulated in an Ethernet frame [21].

The SAS trace below in Figure 3.2.2.0 shows the primitives exchanged on the wire between the initiator and target on a READ command. The lower panel shows primitives such as RRDYs, ACKs, DONEs and CLOSEs exchanged during a READ command sequence. These primitives are not required in the tSAS protocol. Please refer to Appendix Section 8.0 for information on SAS trace capturing.

Figure 3.2.2.0 – Primitives on the SAS side [25]
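One way to realize the "primitive encapsulated in an Ethernet frame" embodiment mentioned above is a small out-of-band notification message carried on the same TCP connection. The sketch below is purely illustrative: the frame-type value 0xF0 and the message layout are our assumptions, not part of the SAS standard or the patent. It shows how a BROADCAST (CHANGE) observed on the SAS side could be relayed to the remote tSAS initiator so that it re-runs SMP discovery.

#include <stdint.h>

/* Assumed tSAS-only message for relaying SAS primitives that still matter to
 * the remote initiator (e.g. BROADCAST (CHANGE)). Values are illustrative. */
#define TSAS_FRAME_PRIMITIVE_NOTIFY  0xF0   /* not a real SSP frame type */

enum tsas_primitive {
    TSAS_PRIM_BROADCAST_CHANGE = 1,  /* topology changed: re-run SMP discovery */
    TSAS_PRIM_BROADCAST_SES    = 2   /* enclosure event (example)              */
};

#pragma pack(push, 1)
typedef struct {
    uint8_t  frame_type;    /* TSAS_FRAME_PRIMITIVE_NOTIFY                */
    uint8_t  primitive;     /* enum tsas_primitive                        */
    uint8_t  reserved[2];
    uint32_t expander_phy;  /* phy on which the broadcast was observed    */
} tsas_primitive_notify_t;
#pragma pack(pop)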
3.2.3 Discovery

Discovery in tSAS will be similar to SAS and will be accomplished by sending Serial Management Protocol (SMP) commands over TCP to the initiators and expanders downstream to learn the topology. The SMP Request frame will be embedded in an Ethernet frame and sent to the expander/initiator. The expander/initiator will reply to the SMP Request by sending an SMP Response frame embedded in an Ethernet frame. Figure 3.2.0.4 in section 3.2.0 shows how the SMP commands are communicated in tSAS. For more information on SMP commands and Discovery, please refer to the SAS Specification [1]. For example, for a Discover List command, which is used to return information on attached devices/PHYs, the SMP Discover List request is sent by the initiator and the SMP Discover List response is returned via TCP.

Figure 3.2.3.0 – SMP Discover List Request Frame

Figure 3.2.3.1 – SMP Discover List Response Frame

Since Ethernet frames are assembled such that they include a cyclic redundancy check (CRC) for providing data protection, an SMP frame that is encapsulated in an Ethernet frame can rely on the data protection afforded by this same cyclic redundancy check [21]. In other words, the SAS-side CRC on the request and response SMP frames need not be transmitted.

Please refer to the SAS Specification [1] for information on these SMP commands.

3.2.4 Task Management

Similar to SAS and iSCSI, a TASK frame will be sent by the initiator to another initiator/expander in the topology. A Task Management request may be sent to manage a target. For instance, when IOs to a target fail, the host may request a Task Management Target Reset command in the hope that the target cooperates after being reset. A host may request a Task Management LUN Reset to reset an entire LUN and have all outstanding IOs to that LUN failed. To learn more about the various Task Management commands, please refer to the SAS [1] and iSCSI [3] specifications.

3.2.5 tSAS mock application to compare with an iSCSI mock application

For the purpose of investigating iSCSI vs. tSAS, a client application and a server application that communicate using iSCSI and tSAS were written. A tSAS client application sends read/write tSAS commands to the tSAS server application, which processes them and sends responses to the client. Similarly, an iSCSI client application sends certain read/write iSCSI commands to the iSCSI server application, which processes them and sends responses to the client. Commands are sent single-threaded such that the queue depth (number of outstanding commands) is one. The algorithm used for the tSAS application and the iSCSI application is similar, which helps us compare the two protocols.

Initially, the tSAS application was written such that each REQUEST, RESPONSE and DATA frame is encapsulated into an independent Ethernet frame. Revisiting the SSP format in Figure 2.3.5.0, the entire SSP frame excluding the CRC is encapsulated into an Ethernet frame. Since Ethernet frames are assembled such that they include a cyclic redundancy check (CRC) for providing data protection, a SAS/SSP frame that is encapsulated in an Ethernet frame can rely on the data protection afforded by this same cyclic redundancy check [21]. In SAS, each data frame is 1K in length, so in the initial design each Ethernet frame that carried a DATA frame carried only 1K of data. This made the time to complete an IO significantly higher than when the amount of data sent in each frame is not limited to 1K, so performance was poor. The application was then revised to send more than 1K of data in each frame by maxing out the data that can be stuffed into each Ethernet frame; each Ethernet frame can then contain more than just 1K of data.
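The revision amounts to coalescing consecutive SSP DATA payloads into one large TCP send instead of one send per 1K DATA frame. A minimal sketch of that idea is shown below; the batch size and function names are our own assumptions, and the real mock application may batch differently.

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define SAS_DATA_FRAME_SIZE 1024        /* each SAS DATA frame carries 1K      */
#define COALESCE_BUF_SIZE   (64 * 1024) /* assumed batch size for one send()   */

/* Instead of one send() per 1K DATA frame (initial design), accumulate DATA
 * payloads and flush them in large TCP writes (revised design). */
static uint8_t coalesce_buf[COALESCE_BUF_SIZE];
static size_t  coalesce_len;

static int flush_data(int sock)
{
    ssize_t sent = send(sock, coalesce_buf, coalesce_len, 0);
    int ok = (sent == (ssize_t)coalesce_len);
    coalesce_len = 0;
    return ok ? 0 : -1;
}

int queue_data_frame(int sock, const uint8_t data[SAS_DATA_FRAME_SIZE])
{
    if (coalesce_len + SAS_DATA_FRAME_SIZE > COALESCE_BUF_SIZE)
        if (flush_data(sock) != 0)
            return -1;
    memcpy(coalesce_buf + coalesce_len, data, SAS_DATA_FRAME_SIZE);
    coalesce_len += SAS_DATA_FRAME_SIZE;
    return 0;
}
/* Call flush_data(sock) once more after the last DATA frame of the IO. */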
The results from both implementations are shown below. The first set of numbers is from running the tSAS application when each REQUEST, RESPONSE and DATA frame is individually encapsulated into an Ethernet frame and sent across. The test bench used for this experiment is a Windows Server 2008 machine with the client and server application running, such that the client sends requests and the server replies to requests. A Netgear ProSafe 5-port Gigabit switch (model GS105) is used in between, such that the client and server auto-negotiate to 1 Gbps.

1 Gbps: READ

Transfer Length | Avg Time from Read Command to Completion (ms) | IOPS (I/Os per second)
1 KB            | 0.249   | 4016.064
2 KB            | 0.206   | 4854.368
4 KB            | 0.216   | 4629.62
8 KB            | 0.368   | 2717.391
16 KB           | 0.495   | 2020.20
32 KB           | 0.28    | 3571.428
64 KB           | 1.616   | 618.811
128 KB          | 2.711   | 368.867
256 KB          | 4.913   | 203.541
512 KB          | 5.954   | 167.954
1024 KB         | 7.681   | 130.191
2048 KB         | 16.111  | 62.069

Table 3.2.5.0 - Average Time from Read Command to Completion (milliseconds) in tSAS where each DATA frame is encapsulated in an Ethernet frame

Below are the numbers from running the tSAS application when each REQUEST, RESPONSE and DATA frame is encapsulated into an Ethernet frame and sent across, but where the DATA is not limited to 1K in each Ethernet frame. DATA frames are combined to use each Ethernet frame to maximum capacity.

Transfer Length | Avg Time from Read Command to Completion (ms) | IOPS (I/Os per second)
1 KB            | 0.199   | 5025.125
2 KB            | 0.114   | 8771.929
4 KB            | 0.280   | 3571.428
8 KB            | 0.258   | 3875.968
16 KB           | 0.174   | 5747.124
32 KB           | 0.455   | 2197.802
64 KB           | 0.828   | 1207.729
128 KB          | 1.418   | 705.218
256 KB          | 2.714   | 368.459
512 KB          | 3.000   | 333.333
1024 KB         | 3.756   | 266.240
2048 KB         | 7.854   | 127.323

Table 3.2.5.1 - Average Time from Read Command to Completion (milliseconds) in tSAS where each Ethernet frame containing SSP Data is used efficiently

Below are the numbers from running the iSCSI application when each REQUEST, RESPONSE and DATA frame is encapsulated into an Ethernet frame and sent across. In this implementation as well, the DATA is not limited to 1K in each Ethernet frame. The iSCSI protocol itself doesn't pack each piece of SCSI DATA into a separate Ethernet frame; it allows DATA to be combined such that more than a single DATA frame's worth is sent at a time. Therefore, in our implementation as well, DATA frames are combined to use each Ethernet frame to maximum capacity.
Transfer Length | Avg Time from Read Command to Completion (ms) | IOPS (I/Os per second)
1 KB            | 0.189   | 5291.005
2 KB            | 0.261   | 3831.417
4 KB            | 0.205   | 4878.048
8 KB            | 0.501   | 2996.007
16 KB           | 0.327   | 3058.104
32 KB           | 0.454   | 2202.643
64 KB           | 0.898   | 1113.585
128 KB          | 1.421   | 703.729
256 KB          | 3.311   | 302.023
512 KB          | 3.138   | 318.674
1024 KB         | 4.955   | 201.816
2048 KB         | 8.942   | 111.831

Table 3.2.5.2 - Average Time from Read Command to Completion (milliseconds) in iSCSI where each Ethernet frame containing SCSI Data is maxed out

As can be seen from Tables 3.2.5.0, 3.2.5.1 and 3.2.5.2:

1. A tSAS implementation where each DATA frame is encapsulated in a separate Ethernet frame is not an efficient implementation.
2. A tSAS implementation where more than just 1K of DATA (a single DATA frame) is encapsulated in an Ethernet frame is more efficient. This is comparable to the iSCSI implementations in the market as well as to the iSCSI client/server application written for this project. Therefore, for the rest of this project we use this tSAS implementation.

3.3 Performance evaluation

3.3.0 Measuring SAS performance using IOMeter in Windows and VDbench in Linux

3.3.0.1 SAS Performance using IOMeter

Iometer is an I/O subsystem measurement and characterization tool that can be used in both single and clustered systems [32]. Iometer is both a workload generator, as it performs I/O operations in order to stress the system being tested, and a measurement tool, as it examines and records the performance of its I/O operations and their impact on the system under test. It can be configured to emulate the disk or network I/O load of any program, or to generate entirely synthetic I/O loads. It can also generate and measure loads on single or multiple networked systems [32].

Iometer can be used to measure and characterize:
• Performance of network controllers.
• Performance of disk controllers.
• Bandwidth and latency capabilities of various buses.
• Network throughput to attached drive targets.
• Shared bus performance.
• System-level performance of a hard drive.
• System-level performance of a network [32].

Iometer consists of two programs, namely Iometer and Dynamo. Iometer is the controlling program. Using its graphical user interface, a user can configure the workload, set the operating parameters, and start and stop tests. Iometer tells Dynamo what to do, collects the resulting data, and summarizes the results into output files. Only one copy of Iometer should be running at a time; it is typically run on the server machine. Dynamo is the IO workload generator. It doesn't come with a user interface. At Iometer's command, Dynamo performs I/O operations, records the performance information and finally returns the data to Iometer [32]. In this project, IOMeter is used to measure the performance of a SAS topology/drive.

The test bench used to measure SAS performance via IOMeter is:

1. The Operating System used is Windows Server 2008.
2. The server used was a Super Micro server.
3. A SAS 6 Gbps HBA in a PCIe slot.
4. The HBA attached to the 6 Gbps SAS expander.
5. The 6G SAS expander attached downstream to a 6G SAS drive.
6. A LeCroy SAS Analyzer placed between the target and expander.
7. IOMeter was set to have a maximum number of outstanding IOs of 1. In other words, the queue depth is set to 1. This makes IOs single-threaded.
This option was used since the mock server and client iSCSI and tSAS applications also have a queue depth of 1.
8. For the maximum I/O rate (I/O operations per second), the Percent Read/Write Distribution was set to 100% Read while testing the read performance and to 100% Write while testing the write performance. The Percent Random/Sequential Distribution was set to 100% Sequential while testing both the read and write performance.
9. For measurements taken without an expander, the SAS drive was directly attached to the SAS analyzer and the SAS analyzer was attached to the HBA.

A SAS Protocol Analyzer can be used to capture SSP/STP/SATA traffic between various components in a SAS topology. For example, a SAS Protocol Analyzer can be placed between an initiator and an expander to capture the IO traffic between them. Similarly, a SAS Protocol Analyzer may be placed between drives and an expander, helping the user capture IO traffic between the drives and the expander. A capture using the SAS Protocol Analyzer is commonly known as a SAS trace.

Figure 3.3.0.1.0 – SAS Trace using LeCroy SAS Protocol Analyzer

Timings on READ and WRITE commands with transfer sizes of 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K and 2048K are captured. Tables 3.3.0.1.0 and 3.3.0.1.1 capture the performance of READs. Table 3.3.0.1.0 shows READ performance when a SAS drive is direct-attached to the HBA. Table 3.3.0.1.1 shows READ performance when a SAS drive is connected to the HBA via an expander.

Performance of READ10 command using Direct-Attached drive:

Transfer Length | IOMeter avg time, Read cmd to completion (ms), Direct-Attached | SAS-trace avg time, Read cmd to completion from the drive (ms) | Average IOPS (I/Os per Second)
1 KB     | 0.0644  | 0.0365  | 15527.950
2 KB     | 0.0768  | 0.0389  | 13020.833
4 KB     | 0.0800  | 0.0563  | 12500
8 KB     | 0.0916  | 0.0508  | 10917.030
16 KB    | 0.112   | 0.0675  | 8928.571
32 KB    | 0.219   | 0.180   | 4566.21
64 KB    | 0.438   | 0.376   | 2283.105
128 KB   | 0.861   | 0.788   | 1161.440
256 KB   | 1.706   | 1.579   | 586.166
512 KB   | 3.409   | 3.264   | 293.341
1024 KB  | 6.896   | 6.693   | 145.011
2048 KB  | 30.972  | 21.653  | 46.182

Table 3.3.0.1.0 – Direct-Attached SSP READ performance

In Table 3.3.0.1.0, the average time for the READ command to complete using IOMeter is the value calculated by IOMeter. The average time for the READ command to complete using the SAS analyzer is the time it takes for the drive to respond to the command once the HBA sends it. As can be seen, the drive is the bottleneck in this topology. The I/Os per second is not always the exact reciprocal of the average IO completion time, due to delays at the HBA, other hardware, etc. However, it is close enough, and in this project we assume that IOPS = 1000 ms / (average time in milliseconds for 1 IO to complete).
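As a quick sanity check of that assumption against the first row of Table 3.3.0.1.0: IOPS = 1000 / 0.0644 ms ≈ 15528, which matches the reported 15527.950 I/Os per second for 1 KB direct-attached reads. The same arithmetic reproduces the other IOPS columns in this section to within rounding.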
Transfer Length | IOMeter avg time, Read cmd to completion (ms), Expander-Attached | SAS-trace avg time, Read cmd to completion from the drive (ms) | Average IOPS (I/Os per second) | Avg time for Read completion without drive delay (ms)
1 KB     | 0.0649  | 0.0365  | 15408.320 | 0.0284
2 KB     | 0.0709  | 0.0389  | 14104.372 | 0.032
4 KB     | 0.0810  | 0.0563  | 12345.679 | 0.0247
8 KB     | 0.0840  | 0.0508  | 11904.761 | 0.0332
16 KB    | 0.113   | 0.0675  | 8849.557  | 0.0455
32 KB    | 0.225   | 0.180   | 4444.444  | 0.045
64 KB    | 0.416   | 0.376   | 2403.846  | 0.04
128 KB   | 0.872   | 0.788   | 1146.788  | 0.084
256 KB   | 1.716   | 1.579   | 582.750   | 0.137
512 KB   | 3.418   | 3.264   | 292.568   | 0.154
1024 KB  | 7.022   | 6.693   | 142.409   | 0.329
2048 KB  | 31.344  | 21.653  | 31.904    | 9.691

Table 3.3.0.1.1 – Expander-Attached READ performance

As can be seen from the above tables, the performance numbers on READ commands of various transfer lengths are very similar whether the SAS target is directly connected to the HBA or is behind an expander. In other words, the timing on the wire between the HBA and the expander is less than 1 millisecond for transfer sizes between 1K and 2048K. The HBA and the expander are generally designed such that hardware does most of the heavy lifting on the IO path/transfers.

Performance of WRITE10 command:

Timings on WRITE commands of sizes 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K and 2048K are captured below. Table 3.3.0.1.2 shows WRITE performance when a SAS drive is direct-attached to the HBA. Table 3.3.0.1.3 shows WRITE performance when a SAS drive is connected to the HBA via an expander.

Transfer Length | IOMeter avg time, Write cmd to completion (ms), Direct-Attached
1 KB     | 6.014
2 KB     | 6.020
4 KB     | 6.030
8 KB     | 6.059
16 KB    | 6.111
32 KB    | 6.216
64 KB    | 6.424
128 KB   | 6.836
256 KB   | 7.672
512 KB   | 9.338
1024 KB  | 12.824
2048 KB  | 37.346

Table 3.3.0.1.2 – Direct-Attached WRITE performance

Transfer Length | IOMeter avg time, Write cmd to completion (ms), Expander-Attached | SAS-trace avg time, Write cmd to completion from the drive (ms) | IOPS (I/Os per second) | Avg time for Write completion without drive delay (ms)
1 KB     | 6.012   | 5.957   | 166.334 | 0.055
2 KB     | 6.020   | 5.964   | 166.112 | 0.056
4 KB     | 6.032   | 5.990   | 165.782 | 0.042
8 KB     | 6.059   | 6.011   | 165.047 | 0.048
16 KB    | 6.110   | 6.054   | 163.666 | 0.056
32 KB    | 6.215   | 6.157   | 160.901 | 0.058
64 KB    | 6.424   | 6.378   | 155.666 | 0.046
128 KB   | 6.839   | 6.782   | 146.220 | 0.057
256 KB   | 7.672   | 7.573   | 130.344 | 0.099
512 KB   | 9.337   | 9.204   | 107.101 | 0.133
1024 KB  | 12.665  | 12.466  | 78.957  | 0.199
2048 KB  | 37.345  | 27.751  | 26.777  | 9.74

Table 3.3.0.1.3 – Expander-Attached WRITE performance

3.3.0.2 SAS Performance using VDBench in Linux

Vdbench is a disk and tape I/O workload generator that is used for testing and benchmarking of existing and future storage products. Vdbench generates a wide variety of controlled storage I/O workloads by allowing the user to set workload parameters such as I/O rate, transfer size, read and write percentages, and random or sequential access [37].

The test bench used to measure SAS performance via VDBench is:

1. The Operating System used is Red Hat Enterprise Linux 5.4.
2. The server used was a Super Micro server.
3. A SAS 6 Gbps HBA in a PCIe slot.
4. The HBA attached to the 6 Gbps SAS expander.
5. The 6G SAS expander attached downstream to a 6G SAS drive.
6. A LeCroy SAS Analyzer placed between the target and the expander.
7. VDBench was set to have a maximum of 1 outstanding IO. In other words, the queue depth is set to 1, which makes IOs single-threaded. This option was used since the mock server and client iSCSI and tSAS applications also have a queue depth of 1.
8. For the maximum I/O rate (I/O operations per second), the Percent Read/Write Distribution was set to 100% Read while testing the read performance and to 100% Write while testing the write performance. The Percent Random/Sequential Distribution was set to 100% Sequential for both the read and the write tests.
Please refer to the Appendix in Section 8 to learn more about VDBench and the scripts used.
Performance of the READ10 command using VDBench:
Transfer Length (KB) | Average Time from Read Command to Completion using VDBench (ms) | Average Time from Read Command to Completion on the SAS trace, from the drive (ms) | Average IOPS (I/Os per Second) | Average Time for READ Completion without the delay from the drive (ms)
1 KB | 0.0780 | 0.034 | 12820.51 | 0.044
2 KB | 0.0850 | 0.039 | 11764.705 | 0.046
4 KB | 0.100 | 0.056 | 10000 | 0.044
8 KB | 0.132 | 0.085 | 7575.75 | 0.047
16 KB | 0.264 | 0.215 | 3787.87 | 0.049
32 KB | 0.528 | 0.476 | 1893.94 | 0.052
64 KB | 1.058 | 0.998 | 945.18 | 0.060
128 KB | 2.117 | 2.051 | 472.366 | 0.066
256 KB | 4.235 | 4.162 | 236.127 | 0.073
512 KB | 8.572 | 8.367 | 116.658 | 0.205
1024 KB | 18.317 | 12.576 | 54.594 | 5.741
2048 KB | 34.323 | 20.983 | 29.135 | 13.34
Table 3.3.0.2.0 – SSP READ performance using VDBench
Performance of the WRITE10 command using VDBench:
Transfer Length (KB) | Average Time from Write Command to Completion using VDBench (ms) | Average Time from Write Command to Completion on the SAS trace, from the drive (ms) | Average IOPS (I/Os per Second) | Average Time for WRITE Completion without the delay from the drive (ms)
1 KB | 6.006 | 5.965 | 166.500 | 0.041
2 KB | 6.025 | 5.978 | 165.975 | 0.047
4 KB | 6.058 | 6.012 | 165.070 | 0.046
8 KB | 6.121 | 6.068 | 163.371 | 0.053
16 KB | 6.252 | 6.203 | 159.95 | 0.049
32 KB | 6.518 | 5.638 | 153.421 | 1.15
64 KB | 4.086 | 4.023 | 244.738 | 0.063
128 KB | 5.142 | 5.078 | 194.476 | 0.064
256 KB | 7.290 | 7.199 | 137.174 | 0.091
512 KB | 14.511 | 11.895 | 68.913 | 2.616
1024 KB | 23.083 | 19.738 | 43.321 | 3.345
2048 KB | 40.105 | 35.078 | 24.935 | 5.027
Table 3.3.0.2.1 – SSP WRITE performance using VDBench
The following conclusions can be drawn from the tests above using IOMeter and VDBench:
1. Looking at the performance numbers above, one notices that the performance drops drastically for a 2048K Read/Write as compared to a 1024K Read/Write. After analyzing the SAS traces collected for the transfer sizes of 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K and 2048K on Reads and Writes, one finds that up until the 1024K transfer size the HBA sends a single command to the target requesting all of the data in one IO. However, at transfer sizes of 2048K and higher, the HBA sends commands of varying transfer sizes to the target. In other words, a single IO does not fetch the entire 2048K of data on a read, and a single IO is not used to write 2048K of data to the drive.
Multiple IOs of smaller transfer sizes are used to read or write the 2048K of data from or to the disk, causing the performance to drop suddenly. This is most likely an optimization or limitation of the driver.
2. One also notices that READ performance is better than WRITE performance. This is expected, as the frames on READs are smaller, fewer frames are transmitted, and fewer handshakes occur on a READ command than on a WRITE command. Also, it takes a drive more time to write data than to read it from the disk. The IOMeter user guide likewise states: "For the maximum I/O rate (I/O operations per second), try changing the Transfer Request Size to 512 bytes, the Percent Read/Write Distribution to 100% Read, and the Percent Random/Sequential Distribution to 100% Sequential."
3. At smaller transfer sizes, the performance difference between adjacent transfer sizes is not very apparent. However, at larger transfer sizes (above 256K), the drop in performance and the increase in IO completion time are much more visible.
4. The results obtained via VDBench are slightly poorer than the results obtained via IOMeter. A different SAS drive was used for each test, and the SAS drive used during the VDBench testing performs worse than the SAS drive used during the IOMeter testing. Also, timings can vary because the OS and driver are different for Windows and Red Hat Linux.
Note: The SAS Analyzer traces, performance results, VDBench scripts, etc. are located in the SASAnalyzerTraces folder in the project folder where all the deliverables are located. Refer to the Appendix in Section 8.
3.3.1 Measuring iSCSI performance using IOMeter in Windows
The following measurements are taken on the following test bench:
1. An iSCSI software Initiator running on a Windows system. The StarWind iSCSI Initiator was used as the iSCSI Initiator. Please refer to the Appendix in Section 8 to learn more about the StarWind iSCSI Initiator.
2. An iSCSI software Target emulated on a Windows system. The KernSafe iSCSI Target was used to create an iSCSI target and talk to it. Please refer to the Appendix in Section 8 to learn more about the KernSafe iSCSI Target.
3. The iSCSI target was created using a SCSI USB flash drive.
4. The iSCSI Initiator and iSCSI Target systems are connected to each other via a NetGear ProSafe Gigabit Switch at a connection rate of 1 Gbps.
5. READs/WRITEs of transfer lengths/sizes 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K and 2048K are issued by the iSCSI Initiator.
6. A WireShark analyzer is also running on the Initiator system to view the data passed between the iSCSI Initiator and iSCSI Target. Please refer to the Appendix in Section 8 to learn more about the WireShark Network Protocol Analyzer.
7. IOMeter is used to view the performance at these transfer sizes.
8. The number of outstanding IOs (queue depth) is set to 1 in IOMeter.
9. Each READ test is set to 100% sequential READs. Each WRITE test is set to 100% sequential WRITEs.
1 Gbps
READ: Table 3.3.1.0 shows the iSCSI Read completion timings.
Transfer Length (KB) | Average Time from Read Command to Completion using IOMeter with the iSCSI Device (ms) | IOPS (I/Os per second)
1 KB | 1.208 | 827.810
2 KB | 1.423 | 702.740
4 KB | 2.377 | 845.308
8 KB | 2.252 | 444.049
16 KB | 3.251 | 307.597
32 KB | 4.550 | 219.780
64 KB | 5.683 | 175.963
128 KB | 14.640 | 68.306
256 KB | 28.505 | 35.081
512 KB | 164.172 | 6.091
1024 KB | 415.445 | 2.407
2048 KB | 913.563 | 1.094
Table 3.3.1.0 – iSCSI Read Completion Timings at 1 Gbps
WRITE: Table 3.3.1.1 shows the iSCSI Write completion timings.
Transfer Length (KB) | Average Time from Write Command to Completion using IOMeter with the iSCSI Device (ms) | IOPS (I/Os per second)
1 KB | 1.077 | 928.505
2 KB | 1.890 | 529.101
4 KB | 2.220 | 450.450
8 KB | 2.593 | 385.653
16 KB | 4.867 | 205.465
32 KB | 7.942 | 125.912
64 KB | 13.083 | 76.435
128 KB | 27.028 | 36.998
256 KB | 50.340 | 19.685
512 KB | 225.698 | 4.430
1024 KB | 593.711 | 1.684
2048 KB | 1059.284 | 0.944
Table 3.3.1.1 – iSCSI Write Completion Timings at 1 Gbps
The above timings include the delay at the USB flash drive. Since USB flash drives are slow, we then ran IOMeter on the machine connected to the USB drive to get the read/write timings when IOs are issued to the SCSI device directly. The following measurements are taken on the following test bench:
1. A virtual SCSI target (USB flash drive) was used as the SCSI target.
2. READs/WRITEs of transfer lengths/sizes 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K and 2048K are issued directly to the SCSI device.
3. IOMeter is used to benchmark the SCSI device.
READ: Table 3.3.1.2 shows the SCSI Read completion timings.
Transfer Length (KB) | Average Time from Read Command to Completion using IOMeter with the SCSI Device (ms) | IOPS (I/Os per second)
1 KB | 0.919 | 1088.139
2 KB | 1.073 | 931.966
4 KB | 1.194 | 837.520
8 KB | 1.453 | 688.231
16 KB | 1.984 | 504.032
32 KB | 3.448 | 290.023
64 KB | 4.455 | 224.466
128 KB | 7.044 | 141.964
256 KB | 13.205 | 75.728
512 KB | 25.885 | 38.632
1024 KB | 51.234 | 19.518
2048 KB | 102.571 | 9.749
Table 3.3.1.2 – SCSI Read Completion Timings
WRITE: Table 3.3.1.3 shows the SCSI Write completion timings.
Transfer Length (KB) | Average Time from Write Command to Completion using IOMeter with the SCSI Device (ms) | IOPS (I/Os per second)
1 KB | 0.679 | 1472.754
2 KB | 1.075 | 930.232
4 KB | 1.207 | 828.500
8 KB | 1.775 | 563.380
16 KB | 3.527 | 283.527
32 KB | 5.914 | 169.090
64 KB | 8.821 | 113.365
128 KB | 17.041 | 58.682
256 KB | 32.053 | 31.198
512 KB | 62.021 | 16.123
1024 KB | 120.193 | 8.319
2048 KB | 244.561 | 4.088
Table 3.3.1.3 – SCSI Write Completion Timings
To get the performance of iSCSI without the time it takes for IOs to complete at the SCSI target itself (the bottleneck), the SCSI completion timings measured via IOMeter are subtracted from the iSCSI completion timings measured via IOMeter. These results make it more feasible to compare the iSCSI numbers here to the mock client/server iSCSI application written for this project.
READ: Table 3.3.1.4 shows the iSCSI Read completion timings without the time it takes for IOs to complete at the SCSI target itself.
Transfer Length (KB) | Average Time from Read Command to Completion, without the time it takes for IOs to complete at the SCSI target (ms)
1 KB | 1.208 – 0.919 = 0.289
2 KB | 0.35
4 KB | 0.585
8 KB | 0.799
16 KB | 1.267
32 KB | 1.102
64 KB | 1.228
128 KB | 7.596
256 KB | 15.3
512 KB | 138.287
1024 KB | 264.211
2048 KB | 810.992
Table 3.3.1.4 – iSCSI Read Completion Timings without including the delay at the drive
WRITE: Table 3.3.1.5 shows the iSCSI Write completion timings without the time it takes for IOs to complete at the SCSI target itself.
Transfer Length (KB) | Average Time from Write Command to Completion, without the time it takes for IOs to complete at the SCSI target (ms)
1 KB | 0.398
2 KB | 0.815
4 KB | 1.013
8 KB | 0.818
16 KB | 1.34
32 KB | 2.028
64 KB | 4.262
128 KB | 9.987
256 KB | 18.287
512 KB | 163.677
1024 KB | 473.518
2048 KB | 814.723
Table 3.3.1.5 – iSCSI Write Completion Timings without including the delay at the drive
Note: The IOMeter data collected as well as the WireShark traces are located in the project folder where all the deliverables are located. Refer to the Appendix in Section 8.
3.3.2 Measuring tSAS performance using the client and server mock application written and comparing it to the iSCSI client/server mock application as well as to legacy SAS and legacy iSCSI
A. The tSAS performance was measured by running the client/server application written for this project. The test bench used to test the tSAS applications is two Windows 2008 Server systems connected through a NetGear switch at connection rates of 10 Mbps, 100 Mbps and 1 Gbps. One Windows machine runs the client application while the other runs the server application.
10 Mbps
READ:
Transfer Length (KB) | Average Time from Read Command to Completion using the iSCSI mock application (ms) | IOPS (I/Os per second)
1 KB | 2.786 | 358.937
2 KB | 5.968 | 167.560
4 KB | 7.541 | 132.608
8 KB | 11.002 | 90.892
16 KB | 18.258 | 54.770
32 KB | 175.630 | 5.693
64 KB | 197.788 | 5.055
128 KB | 255.342 | 3.916
256 KB | 601.288 | 1.663
512 KB | 741.555 | 1.348
1024 KB | 2228.483 (~2.228 sec) | 0.448
2048 KB | 3863.979 (~3.863 sec) | 0.259
Table 3.3.2.0 – READ Command Timings, iSCSI Mock app at 10 Mbps
Transfer Length (KB) | Average Time from Read Command to Completion using the tSAS mock application (ms) | IOPS (I/Os per second)
1 KB | 2.543 | 393.236
2 KB | 5.933 | 168.548
4 KB | 6.896 | 145.011
8 KB | 10.902 | 91.726
16 KB | 18.152 | 55.090
32 KB | 153.126 | 6.530
64 KB | 192.224 | 5.202
128 KB | 192.103 | 5.205
256 KB | 576.096 | 1.736
512 KB | 996.854 | 1.003
1024 KB | 1614.082 (~1.614 sec) | 0.619
2048 KB | 3615.275 (~3.615 sec) | 0.276
Table 3.3.2.1 – READ Command Timings, tSAS Mock app at 10 Mbps
Figure 3.3.2.0 – iSCSI vs tSAS Read Completion Time at 10 Mbps (completion time in milliseconds versus transfer size in kilobytes).
Looking at the chart above, tSAS performs better than iSCSI. One also observes that at small READ transfers, iSCSI and tSAS have similar performance at 10 Mbps. However, at larger transfer sizes, tSAS performs noticeably better than iSCSI.
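The averages in these tables were produced by the C mock applications described in section 3.2, run at a queue depth of 1. As a rough illustration of how such a measurement can be taken, the sketch below times a request/data/response exchange over a TCP socket on Windows. It is not the project's source code: the port number, the target IP address, the fixed-size stand-in frames, the IO count and the helper recv_all() are all hypothetical simplifications, and error handling is minimal.

```c
/* Minimal, hypothetical sketch of timing a mock READ over TCP at queue depth 1.
 * Frame layouts, port, address and sizes are stand-ins, not the project's own. */
#include <winsock2.h>
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#pragma comment(lib, "ws2_32.lib")

#define SERVER_PORT 5555            /* hypothetical listening port            */
#define TRANSFER_KB 64              /* transfer size under test               */
#define IO_COUNT    100             /* number of IOs averaged over            */

static void recv_all(SOCKET s, char *buf, int len)
{
    int got = 0;
    while (got < len) {             /* TCP is a stream: loop until 'len' read */
        int n = recv(s, buf + got, len - got, 0);
        if (n <= 0) { fprintf(stderr, "recv failed\n"); exit(1); }
        got += n;
    }
}

int main(void)
{
    WSADATA wsa;
    SOCKET s;
    struct sockaddr_in addr;
    LARGE_INTEGER freq, t0, t1;
    static char data[TRANSFER_KB * 1024];
    char request[64] = "READ";      /* stand-in for an SSP-style request frame */
    char response[64];              /* stand-in for the response frame         */
    double total_ms = 0.0;
    int i;

    WSAStartup(MAKEWORD(2, 2), &wsa);
    s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(SERVER_PORT);
    addr.sin_addr.s_addr = inet_addr("192.168.1.10");  /* hypothetical target IP */
    connect(s, (struct sockaddr *)&addr, sizeof(addr));

    QueryPerformanceFrequency(&freq);
    for (i = 0; i < IO_COUNT; i++) {
        QueryPerformanceCounter(&t0);
        send(s, request, sizeof(request), 0);          /* command              */
        recv_all(s, data, sizeof(data));               /* DATA                 */
        recv_all(s, response, sizeof(response));       /* RESPONSE             */
        QueryPerformanceCounter(&t1);
        total_ms += 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
    }
    printf("Average READ completion: %.3f ms\n", total_ms / IO_COUNT);
    closesocket(s);
    WSACleanup();
    return 0;
}
```

Averaging many single-threaded IOs this way mirrors what IOMeter and VDBench report at a queue depth of 1, which is why the mock-application numbers can be compared with the earlier tables.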
WRITE:
Transfer Length (KB) | Average Time from Write Command to Completion using the iSCSI mock application (ms) | IOPS (I/Os per second)
1 KB | 13.968 | 71.592
2 KB | 14.909 | 67.073
4 KB | 16.867 | 59.287
8 KB | 20.078 | 49.805
16 KB | 27.365 | 36.543
32 KB | 505.044 | 1.980
64 KB | 710.429 | 1.407
128 KB | 1572.559 (~1.572 sec) | 0.636
256 KB | 3380.042 (~3.380 sec) | 0.256
512 KB | 6886.112 (~6.886 sec) | 0.145
1024 KB | 14316.12 (~14.316 sec) | 0.0698
2048 KB | 19777.00 (~19.777 sec) | 0.0506
Table 3.3.2.2 – WRITE Command Timings, iSCSI Mock app at 10 Mbps
Transfer Length (KB) | Average Time from Write Command to Completion using the tSAS mock application (ms) | IOPS (I/Os per second)
1 KB | 4.892 | 204.415
2 KB | 5.849 | 170.969
4 KB | 8.054 | 91.089
8 KB | 10.979 | 91.083
16 KB | 17.984 | 55.605
32 KB | 233.573 | 4.281
64 KB | 614.819 | 1.626
128 KB | 1584.924 (~1.584 sec) | 0.631
256 KB | 3540.684 (~3.540 sec) | 0.282
512 KB | 6684.609 (~6.684 sec) | 0.149
1024 KB | 12456.77 (~12.456 sec) | 0.0803
2048 KB | 17728.38 (~17.728 sec) | 0.0564
Table 3.3.2.3 – WRITE Command Timings, tSAS Mock app at 10 Mbps
Figure 3.3.2.1 – iSCSI vs tSAS Write Completion Time at 10 Mbps (completion time in milliseconds versus transfer size in kilobytes).
Looking at the chart above, tSAS performs better than iSCSI. One also observes that at small WRITE transfers, iSCSI and tSAS have similar performance. However, at larger transfer sizes, tSAS performs noticeably better than iSCSI.
100 Mbps
READ:
Transfer Length (KB) | Average Time from Read Command to Completion using the iSCSI mock application (ms) | IOPS (I/Os per second)
1 KB | 1.996 | 501.002
2 KB | 2.692 | 371.471
4 KB | 2.579 | 387.747
8 KB | 3.093 | 323.310
16 KB | 3.802 | 263.092
32 KB | 15.001 | 66.662
64 KB | 17.193 | 58.163
128 KB | 35.913 | 27.845
256 KB | 82.172 | 12.169
512 KB | 115.905 | 8.627
1024 KB | 311.735 | 3.208
2048 KB | 577.684 | 1.731
Table 3.3.2.4 – READ Command Timings, iSCSI Mock app at 100 Mbps
Transfer Length (KB) | Average Time from Read Command to Completion using the tSAS mock application (ms) | IOPS (I/Os per second)
1 KB | 1.984 | 504.032
2 KB | 2.595 | 385.256
4 KB | 2.543 | 393.236
8 KB | 2.979 | 225.683
16 KB | 3.968 | 252.016
32 KB | 14.330 | 69.783
64 KB | 17.302 | 57.796
128 KB | 41.583 | 24.048
256 KB | 74.161 | 13.484
512 KB | 106.293 | 9.408
1024 KB | 251.228 | 3.980
2048 KB | 569.294 | 1.756
Table 3.3.2.5 – READ Command Timings, tSAS Mock app at 100 Mbps
Figure 3.3.2.2 – iSCSI vs tSAS Read Completion Time at 100 Mbps (completion time in milliseconds versus transfer length in kilobytes).
Looking at the chart above in Figure 3.3.2.2, tSAS performs better than iSCSI for all transfer sizes captured.
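The WRITE measurements above and below include one round trip that READs do not have: the initiator must wait for the target's transfer-ready frame (XFER_RDY in tSAS, R2T in iSCSI) before it can send data, as discussed in the conclusions later in this section. The fragment below is a hedged sketch of that initiator-side sequence; it assumes an already-connected Winsock socket (set up as in the READ sketch earlier) and uses hypothetical fixed-size stand-in frames rather than the project's actual frame formats, with partial-receive and error handling omitted.

```c
/* Illustrative WRITE sequence at queue depth 1 over an already-connected
 * socket. Frame sizes/contents are hypothetical stand-ins, not the project's
 * tSAS or iSCSI frame layouts; partial recv() and error handling omitted. */
#include <winsock2.h>
#include <windows.h>
#pragma comment(lib, "ws2_32.lib")

static double write_io_ms(SOCKET s, const char *payload, int payload_len)
{
    char request[64] = "WRITE";  /* command frame (stand-in)             */
    char xfer_rdy[64];           /* XFER_RDY / R2T from the target       */
    char response[64];           /* final RESPONSE / status frame        */
    LARGE_INTEGER freq, t0, t1;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    send(s, request, sizeof(request), 0);       /* 1. command                 */
    recv(s, xfer_rdy, sizeof(xfer_rdy), 0);     /* 2. wait for transfer-ready */
    send(s, payload, payload_len, 0);           /* 3. DATA                    */
    recv(s, response, sizeof(response), 0);     /* 4. wait for RESPONSE       */

    QueryPerformanceCounter(&t1);
    return 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
}
```

The extra wait in step 2 is the main reason the WRITE tables show higher completion times than the READ tables at the same transfer size and connection rate.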
Transfer Length (KB) | Average Time from Write Command to Completion using the iSCSI mock application (ms) | IOPS (I/Os per second)
1 KB | 14.559 | 68.686
2 KB | 15.203 | 65.776
4 KB | 14.716 | 67.953
8 KB | 15.030 | 66.533
16 KB | 22.011 | 45.431
32 KB | 25.735 | 38.857
64 KB | 55.918 | 17.883
128 KB | 110.481 | 9.051
256 KB | 193.932 | 5.156
512 KB | 272.651 | 3.667
1024 KB | 350.924 | 2.849
2048 KB | 772.876 | 1.294
Table 3.3.2.6 – WRITE Command Timings, iSCSI Mock app at 100 Mbps
Transfer Length (KB) | Average Time from Write Command to Completion using the tSAS mock application (ms) | IOPS (I/Os per second)
1 KB | 2.699 | 370.508
2 KB | 2.864 | 349.162
4 KB | 2.647 | 377.786
8 KB | 3.285 | 304.414
16 KB | 3.832 | 260.960
32 KB | 5.480 | 182.481
64 KB | 40.484 | 24.701
128 KB | 66.802 | 14.969
256 KB | 161.243 | 6.201
512 KB | 272.125 | 3.674
1024 KB | 394.083 | 2.537
2048 KB | 761.236 | 1.313
Table 3.3.2.7 – WRITE Command Timings, tSAS Mock app at 100 Mbps
Figure 3.3.2.3 – iSCSI vs tSAS Write Completion Time at 100 Mbps (completion time in milliseconds versus transfer length in kilobytes).
Looking at the chart above, tSAS performs better than iSCSI overall.
1 Gbps
Transfer Length (KB) | Average Time from Read Command to Completion using the iSCSI mock application (ms) | IOPS (I/Os per second)
1 KB | 1.999 | 500.250
2 KB | 1.231 | 812.347
4 KB | 1.227 | 814.995
8 KB | 1.436 | 696.378
16 KB | 1.338 | 747.384
32 KB | 1.795 | 557.103
64 KB | 2.401 | 416.493
128 KB | 4.264 | 234.521
256 KB | 7.072 | 141.402
512 KB | 12.395 | 80.677
1024 KB | 24.880 | 40.193
2048 KB | 44.383 | 22.531
Table 3.3.2.8 – READ Command Timings, iSCSI Mock app at 1000 Mbps (1 Gbps)
Transfer Length (KB) | Average Time from Read Command to Completion using the tSAS mock application (ms) | IOPS (I/Os per second)
1 KB | 1.976 | 506.073
2 KB | 1.507 | 663.570
4 KB | 1.695 | 589.970
8 KB | 1.251 | 799.360
16 KB | 1.247 | 801.924
32 KB | 1.708 | 585.480
64 KB | 2.627 | 380.662
128 KB | 4.467 | 223.863
256 KB | 7.755 | 128.949
512 KB | 13.054 | 76.605
1024 KB | 23.683 | 42.224
2048 KB | 40.627 | 24.614
Table 3.3.2.9 – READ Command Timings, tSAS Mock app at 1000 Mbps (1 Gbps)
Figure 3.3.2.4 – iSCSI vs tSAS Read Completion Time at 1000 Mbps (completion time in milliseconds versus transfer length in kilobytes).
Looking at the chart above, tSAS performs better than iSCSI. At smaller transfer sizes tSAS and iSCSI perform similarly, while at larger transfer sizes tSAS is visibly faster than iSCSI.
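To make the 1 Gbps comparison concrete, the short worked example below derives, from the 2048 KB rows of Tables 3.3.2.8 and 3.3.2.9, the IOPS values shown in the tables and the relative improvement of the tSAS mock application over the iSCSI mock application. It is arithmetic only; the two constants are copied directly from the tables above.

```c
/* Worked example: IOPS and relative tSAS improvement for the 2048 KB READ
 * at 1 Gbps, using the averages from Tables 3.3.2.8 and 3.3.2.9. */
#include <stdio.h>

int main(void)
{
    double iscsi_ms = 44.383;   /* 2048 KB READ, iSCSI mock application */
    double tsas_ms  = 40.627;   /* 2048 KB READ, tSAS mock application  */

    printf("iSCSI: %.3f IOPS\n", 1000.0 / iscsi_ms);         /* ~22.531 */
    printf("tSAS : %.3f IOPS\n", 1000.0 / tsas_ms);          /* ~24.614 */
    printf("tSAS completes the IO %.1f%% faster on average\n",
           100.0 * (iscsi_ms - tsas_ms) / iscsi_ms);         /* ~8.5%   */
    return 0;
}
```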
Transfer Length (KB) | Average Time from Write Command to Completion using the iSCSI mock application (ms) | IOPS (I/Os per second)
1 KB | 1.469 | 680.735
2 KB | 1.503 | 665.335
4 KB | 1.462 | 683.994
8 KB | 11.528 | 86.745
16 KB | 12.212 | 81.886
32 KB | 13.088 | 76.406
64 KB | 15.727 | 63.584
128 KB | 16.928 | 63.584
256 KB | 18.630 | 53.676
512 KB | 27.883 | 35.864
1024 KB | 48.535 | 20.603
2048 KB | 75.057 | 13.323
Table 3.3.2.10 – WRITE Command Timings, iSCSI Mock app at 1000 Mbps (1 Gbps)
Transfer Length (KB) | Average Time from Write Command to Completion using the tSAS mock application (ms) | IOPS (I/Os per second)
1 KB | 1.347 | 742.390
2 KB | 1.406 | 711.237
4 KB | 1.497 | 668.002
8 KB | 1.772 | 564.334
16 KB | 1.523 | 656.598
32 KB | 2.418 | 413.564
64 KB | 2.654 | 376.789
128 KB | 3.754 | 266.382
256 KB | 6.882 | 145.306
512 KB | 14.168 | 70.581
1024 KB | 27.538 | 36.313
2048 KB | 46.258 | 21.617
Table 3.3.2.11 – WRITE Command Timings, tSAS Mock app at 1000 Mbps (1 Gbps)
Figure 3.3.2.5 – iSCSI vs tSAS Write Completion Time at 1000 Mbps (completion time in milliseconds versus transfer size in kilobytes).
tSAS performs visibly better than iSCSI on both small and large transfer sizes for WRITEs at 1 Gbps.
From the data collected on the tSAS mock application and the iSCSI mock application, the following conclusions can be drawn:
1. tSAS performs better than iSCSI overall at all transfer sizes, regardless of the speed of the connection between the initiator and the target. The reason for this can be attributed to the fact that the REQUEST, TRANSFER READY (XFER_RDY in SAS) and RESPONSE frame sizes are smaller in tSAS than the REQUEST, TRANSFER READY (R2T) and RESPONSE frame sizes in iSCSI. In other words, the overhead in tSAS is smaller than the overhead in iSCSI.
2. At smaller transfer sizes, the performance of iSCSI and tSAS is very comparable, with tSAS performing slightly better than iSCSI. However, as transfer sizes get larger, tSAS performs noticeably better than iSCSI.
3. Overall, WRITE performance is poorer than READ performance in both tSAS and iSCSI. This can be attributed to the fact that there is more handshaking on WRITEs than on READs: on WRITEs, the initiator needs to wait for the transfer-ready (XFER_RDY or R2T) frame before sending data.
4. For better performance, it may be best to use smaller transfer sizes, since the WireShark traces collected show a higher error rate and more TCP retransmissions at larger transfer sizes.
B. Next, we will look at how tSAS performs at different connection speeds for a fixed transfer size.
The graph in Figure 3.3.2.6 compares tSAS READ performance at varying connection speeds for a 2K transfer size.
Figure 3.3.2.6 – tSAS READ Completion Time for a transfer size of 2K at 10 Mbps, 100 Mbps and 1 Gbps (completion time in milliseconds versus connection rate in Mbps).
The graph in Figure 3.3.2.7 compares tSAS READ performance at varying connection speeds for a 16K transfer size.
Figure 3.3.2.7 – tSAS READ Completion Time for a transfer size of 16K at 10 Mbps, 100 Mbps and 1 Gbps (completion time in milliseconds versus connection rate in Mbps).
The graph in Figure 3.3.2.8 compares tSAS READ performance at varying connection speeds for a 512K transfer size.
Figure 3.3.2.8 – tSAS READ Completion Time for a transfer size of 512K at 10 Mbps, 100 Mbps and 1 Gbps (completion time in milliseconds versus connection rate in Mbps).
As can be seen from the graphs above, performance improves drastically from 10 Mbps to 1 Gbps. With 40 Gbps and 100 Gbps Ethernet soon to be available [36], tSAS should be able to outperform SAS. From a performance analysis done by NetApp on 1 Gbps and 10 Gbps Ethernet server scalability [35], one can infer that 10 Gbps can perform 4.834 times better than 1 Gbps on the wire. Therefore, 40/100 Gigabit Ethernet is recommended for a tSAS solution to obtain faster speeds and better performance.
C. Next, we will compare tSAS to legacy iSCSI and legacy SAS.
Comparing tSAS results at 1 Gbps to legacy SAS by looking at performance numbers between the HBA and the expander: As mentioned in section 3.3.0.1, and looking at Tables 3.3.0.1.0 and 3.3.0.1.1, the delay between the HBA and the expander is on the order of microseconds (less than a millisecond for transfer sizes between 1K and 2048K). Comparing this to our tSAS mock application performance, we can easily see that tSAS is much slower than legacy SAS between a HBA and an expander. Since tSAS could be used between a HBA and an expander, this is a valid comparison of tSAS to legacy SAS. However, without a solution where tSAS is implemented in hardware by using a tSAS HBA, it is not fair to compare our tSAS results to legacy SAS between the HBA and the expander. Therefore, it is best to stick with the comparison of tSAS against the iSCSI mock application itself.
Comparing tSAS results at 1 Gbps to legacy iSCSI without the delay at the SCSI drive: It is not fair to compare the tSAS numbers with the iSCSI numbers we obtained using the StarWind iSCSI Initiator and the KernSafe iSCSI target (Tables 3.3.1.4 and 3.3.1.5). tSAS outperforms the numbers we obtained using legacy iSCSI; however, our tSAS implementation is not a full implementation of a tSAS software Initiator or Target. Therefore, it is best to stick with the comparison of tSAS against the iSCSI mock application itself. Comparing tSAS to legacy SAS and legacy iSCSI is left as future work for when a tSAS solution is implemented in a SAS HBA.
4.0 Similar Work
1. Michael Ko's patent on Serial Attached SCSI over Ethernet proposes a very similar solution to the tSAS solution provided in this project.
2. The iSCSI specification (SCSI over TCP) itself is similar to a tSAS solution (SAS over TCP). The iSCSI solution can be heavily leveraged for a tSAS solution.
3. The Fibre Channel over TCP/IP specification can also be leveraged to design and implement a tSAS solution [31].
5.0 Future Direction
1. The tSAS mock application can be run using a faster switch with a connection rate of 10 Gbps to get more data points.
2. The tSAS mock application can be designed to use piggybacking, where the SSP Read RESPONSE frame from the target is piggybacked on the last DATA frame sent by the target. Similarly, a DATA frame can be piggybacked with an SSP Write request. This may slightly improve READ and WRITE performance.
3. Jumbo frames can be used to increase the amount of DATA passed between the initiator and the target per Ethernet packet, improving the performance results.
4. Using an existing Generation 3 SAS HBA and expanders that have an Ethernet port, read/write commands can be implemented on the expander and the HBA such that they are sent via TCP. This can be used to further benchmark and assess the feasibility of tSAS. An embedded TCP/IP stack such as lwIP can be used to implement this [33].
5. The Storage Associations can be motivated with the results of this project to work on a tSAS specification.
6.0 Conclusion (Lessons learned)
Overall, tSAS is a viable solution. tSAS will be faster than a similar iSCSI implementation because the frame sizes (Request, Response and Transfer Ready) in tSAS are smaller than the corresponding frame sizes (Request, Response and Ready to Transfer) in iSCSI. In other words, the overhead in tSAS is smaller than the overhead in iSCSI. Also, in a tSAS topology the back-end will always be a legacy SAS drive, as opposed to iSCSI, where the back-end may be a SCSI drive that is much slower than a SAS drive. At smaller transfer sizes, the performance of a tSAS and an iSCSI solution may be very similar, with tSAS performing slightly better. However, at larger transfer sizes, tSAS should be the better solution, improving the overall performance of a storage system. For tSAS to outperform a typical SAS solution today, a HBA implementation of tSAS should be used to increase performance. A software solution of tSAS may not be a good choice if the aim is to beat the performance of legacy SAS. However, with 40G/100G Ethernet on the horizon [36], a software solution of tSAS can provide both good performance and a cheaper solution. tSAS can also make use of jumbo frames to increase performance. Purely from the standpoint of overcoming the distance limitation of legacy SAS, tSAS is an excellent solution, since it sends SAS packets over TCP.
7.0 References
[1] T10/1760-D Information Technology – Serial Attached SCSI – 2 (SAS-2), T10, 18 April 2009, available from http://www.t10.org/drafts.htm#SCSI3_SAS
[2] Harry Mason, Serial Attached SCSI Establishes its Position in the Enterprise, LSI Corporation, available from http://www.scsita.org/aboutscsi/sas/6GbpsSAS.pdf
[3] http://www.scsilibrary.com/
[4] http://www.scsifaq.org/scsifaq.html
[5] Kenneth Y. Yun; David L. Dill,
A High-Performance Asynchronous SCSI Controller, available from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=528789
[6] http://www.t10.org/scsi-3.htm
[7] Sarah Summers, Secure asymmetric iSCSI system for online storage, 2008, University of Colorado, Colorado Springs, available from http://www.cs.uccs.edu/~gsc/pub/master/sasummer/doc/
[8] SCSI Architecture Model - 5 (SAM-5), Revision 21, T10, 2011/05/12, available from http://www.t10.org/members/w_sam5.htm
[9] SCSI Primary Commands - 4 (SPC-4), Revision 31, T10, 2011/06/13, available from http://www.t10.org/members/w_spc4.htm
[10] Marc Farley, Storage Networking Fundamentals: An Introduction to Storage Devices, Subsystems, Applications, Management, and File Systems, Cisco Press, 2005, ISBN 1-587051621
[11] Huseyin Simitci; Chris Malakapalli; Vamsi Gunturu, Evaluation of SCSI Over TCP/IP and SCSI Over Fibre Channel Connections, XIOtech Corporation, available from http://www.computer.org/portal/web/csdl/abs/proceedings/hoti/2001/1357/00/13570087abs.htm
[12] Harry Mason, SCSI, the Industry Workhorse, Is Still Working Hard, Dec 2000, SCSI Trade Association, available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=889098&tag=1
[13] Mark S. Kolich, Basics of SCSI: Firmware Applications and Beyond, Computer Science Department, Loyola Marymount University, Los Angeles, available from http://mark.koli.ch/2008/10/25/CMSI499_MarkKolich_SCSIPaper.pdf
[14] Prasenjit Sarkar; Kaladhar Voruganti, IP Storage: The Challenge Ahead, IBM Almaden Research Center, San Jose, CA, available from http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8984
[15] Prasenjit Sarkar; Sandeep Uttamchandani; Kaladhar Voruganti, Storage over IP: When Does Hardware Support Help?, 2003, IBM Almaden Research Center, San Jose, California, available from http://dl.acm.org/citation.cfm?id=1090723
[16] A. Benner, "Fibre Channel: Gigabit Communications and I/O for Computer Networks", McGraw-Hill, 1996.
[17] InfiniBand Trade Association, available from http://www.infinibandta.org
[18] K. Voruganti; P. Sarkar, An Analysis of Three Gigabit Networking Protocols for Storage Area Networks, 20th IEEE International Performance, Computing, and Communications Conference, April 2001, available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=918661&tag=1
[19] Kalmath Meth; Julian Satran, Features of the iSCSI Protocol, August 2003, IBM Haifa Research Lab, available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1222720
[20] Yingping Lu; David H. C. Du, Performance Study of iSCSI-Based Storage Subsystems, IEEE Communications Magazine, August 2003, pp 76-82.
[21] Integration Scenarios for iSCSI and Fibre Channel, available from http://www.snia.org/forums/ipsf/programs/about/isci/iSCSI_FC_Integration_IPS.pdf
[22] Irina Gerasimov; Alexey Zhuravlev; Mikhail Pershin; Dennis V. Gerasimov, Design and Implementation of a Block Storage Multi-Protocol Converter, Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS '03), available from http://storageconference.org/2003/papers/26-Gerasimov-Design.pdf
[23] Internet Small Computer Systems Interface (iSCSI), http://www.ietf.org/rfc/rfc3720.txt
[24] Yingping Lu; David H. C. Du,
Performance Study of iSCSI-Based Storage Subsystems, University of Minnesota, Aug 2003, available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1222721
[25] Cai, Y.; Fang, L.; Ratemo, R.; Liu, J.; Gross, K.; Kozma, M., A test case for 3Gbps serial attached SCSI (SAS), Test Conference, 2005. Proceedings. ITC 2005. IEEE International, February 2006, available from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1584027
[26] Rob Eliot, Serial Attached SCSI, HP Industry Standard Servers, Server Storage Advanced Technology, 30 September 2003, available from http://www.scsita.org/sas_library/tutorials/SAS_General_overview_public.pdf
[27] Michael A. Ko, Layering Serial Attached Small Computer System Interface (SAS) over Ethernet, United States Patent Application 20080228897, 09/18/2008, available from http://www.faqs.org/patents/app/20080228897
[28] Mathew R. Murphy, iSCSI-based Storage Area Networks for Disaster Recovery Operations, The Florida State University, College of Engineering, 2005, available from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.8245
[29] "Increase Performance of Network-Intensive Applications with TCP/IP Offload Engines (TOEs)," Adaptec, Inc. White Paper, May 2003, available from http://www.probsolvesolutions.co.uk/solutions/white_papers/adaptec/NAC_TechPaper2.pdf
[30] IEEE P802.3ba 40Gb/s and 100Gb/s Ethernet Task Force, available from http://www.ieee802.org/3/ba/
[31] M. Rajagopal; E. Rodriguez; R. Weber, Fibre Channel Over TCP/IP, Network Working Group, July 2004, available from http://rsync.tools.ietf.org/html/rfc3821
[32] IOMeter Users Guide, Version 2003.12.16, available from http://www.iometer.org/doc/documents.html
[33] The lwIP TCP/IP stack, available from http://www.sics.se/~adam/lwip/
[34] 29West Messaging Performance on 10-Gigabit Ethernet, September 2008, available from http://www.cisco.com/web/strategy/docs/finance/29wMsgPerformOn10gigtEthernet.pdf
[35] 1Gbps and 10Gbps Ethernet Server Scalability, NetApp, available from http://partners.netapp.com/go/techontap/matl/downloads/redhatneterion_10g.pdf
[36] John D. Ambrosia, 40 Gigabit Ethernet and 100 Gigabit Ethernet: The Development of a Flexible Architecture, available from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4804384
[37] Henk Vandenbergh, VDBench Users Guide, Version 5.00, October 2008, available from http://iweb.dl.sourceforge.net/project/vdbench/vdbench/Vdbench%205.00/vdbench.pdf
8.0 Appendix
8.1 How to run the tSAS and iSCSI mock initiator (client) and target (server) application
Since the applications were written in C using Microsoft Visual Studio 2008 Professional Edition, you will need Visual Studio 2008 installed on the system where you would like to run these applications. Client.exe and Server.exe files are provided for tSAS and iSCSI in the code directory of this project. Please run the Server.exe and Client.exe programs for either iSCSI or tSAS after installing the tSAS and iSCSI client and server applications using the Windows Installer Packages provided for both mock applications:
1. You will see the following screens when you run the Server.exe and Client.exe files respectively.
2. Enter the IP address of your server/target to see the following output screens.
3. Select whether you would like to test READs or WRITEs along with the transfer size to see the output of the test results.
8.2 How to run iSCSI Server and iSCSI Target Software
1. The StarWind iSCSI Initiator was used for this project.
a.
You may download the StarWind iSCSI Initiator software for free from http://www.starwindsoftware.com/iscsi-initiator
b. After installing the software, please refer to the "Using as iSCSI Initiator" PDF file available at http://www.starwindsoftware.com/iscsi-initiator.
2. The KernSafe iSCSI Target was used to create an iSCSI Target.
a. You may download the iSCSI target software (KernSafe iStorage Server) from http://www.kernsafe.com/product/istorage-server.aspx.
b. After installing and running it, please click on Create Target to create a target and specify the type of target you would like to create as well as the security specifications.
8.3 How to run LeCroy SAS Analyzer Software
The LeCroy SAS Analyzer software can be downloaded from http://lecroy.ru/protocolanalyzer/protocolstandard.aspx?standardID=7
You can open the SAS Analyzer traces provided in the SAS Analyzer Traces folder with this software. Running the menu Report -> Statistical Report will give you the average completion time of IOs and other useful information. The SAS Analyzer traces are located in the project deliverable folder.
8.4 WireShark to view the WireShark traces
The WireShark Network Protocol Analyzer software can be downloaded from http://www.wireshark.org/
This software will let you capture and view the WireShark traces provided with this project. The WireShark traces are located in the project deliverable folder.
8.5 VDBench for Linux
VDBench can be downloaded from http://sourceforge.net/projects/vdbench/
After installing VDBench on Linux, you may use a script similar to the one below to run IOs and look at the performance results.
sd=s1,lun=/dev/sdb,align=4096,openflags=o_direct
wd=wd1,sd=(s1),xfersize=2048KB,seekpct=0,rdpct=0
rd=rd1,wd=wd1,iorate=max,forthreads=1,elapsed=300,interval=1
lun=/dev/sdb simply states the target you are testing. xfersize is used to change the transfer size [37]. seekpct=0 states that all IOs are sequential [37]. forthreads=1 states that the queue depth, or number of outstanding IOs, is 1 [37]. interval=1 will simply display/update the performance results on the screen every second [37]. For additional information on each field and on additional fields, please refer to the VDBench user guide [37].
8.6 IOMeter for Windows
IOMeter can be downloaded from http://www.iometer.org/
Please refer to the user guide at http://www.iometer.org/doc/documents.html to use IOMeter.