Linux TCP/IP Stack

TCP/IP vs. OSI model

  OSI layers                                     TCP/IP implementation
  7: Application, 6: Presentation, 5: Session    Process
                                                 Socket Layer
  4: Transport, 3: Network                       Protocol Layer (TCP/IP)
  2: Data Link, 1: Physical Layer                Interface Layer (Ethernet, etc.)

TCP/IP Stack Overview

  Layer                                       Output                     Input
  Process
  Socket Layer                                1: sosend(...)             5: recvfrom(...)
  Protocol Layer (TCP Layer)                  2: tcp_output(...)         4: tcp_input(...)
  Protocol Layer (IP Layer)                   3: ip_output(...)          3: ip_input(...)
  Interface Layer (Ethernet Device Driver)    4: ethernet_output(...)    2: ethernet_input(...)
  Physical Media

  Queues: Output Queue, Input Queue.

Process Layer to TCP Layer

  Process
    send(int socket, const char *buf, int length, int flags)

  Kernel, Socket Layer
    sendto(int socket, const char *data_buffer, int length, int flags, struct sockaddr *destination, int destination_length)   [uipc_syscalls.c]
    sendit(struct proc *p, int socket, struct msghdr *mp, int flags, int *return_size)   [uipc_syscalls.c]
    sosend(struct socket *s, struct mbuf *addr, struct uio *uio, struct mbuf *top, struct mbuf *control, int flags)   [uipc_socket.c]

  Kernel, TCP Layer
    tcp_usrreq(struct socket *s, int request, struct mbuf *m, struct mbuf *nam, struct mbuf *control)   [tcp_usrreq.c]
    tcp_output(struct tcpcb *tp)   [tcp_output.c]

Socket Layer - sendto: MBUF Chain

sendto() receives a data_buffer holding 150 bytes of data. The data is stored in a chain of two 128-byte mbufs:

  Field              First mbuf (packet header)    Second mbuf
  m_next             points to the second mbuf     NULL
  m_nextpkt          NULL                          NULL
  m_len              100                           50
  m_data             points to 100 bytes of data   points to 50 bytes of data
  m_type             MT_DATA                       MT_DATA
  m_flags            M_PKTHDR                      0
  m_pkthdr.len       150                           (not present)
  m_pkthdr.recvif    NULL                          (not present)
  Header size        28 bytes                      20 bytes
  Data               100 bytes                     50 bytes
  Unused space       0 bytes                       58 bytes

Socket Layer - sosend

sosend passes data and control information to the protocol layer.

  sosend(struct socket *s, struct mbuf *addr, struct uio *uio, struct mbuf *data_buffer, struct mbuf *control, int flags)

  1. Initialize a new memory buffer and variables to hold flags.
  2. Is there enough space in the send buffer? sbspace(s->sb_snd)
       no:  go back and check again.
       yes: continue.
  3. Copy data_buffer into an mbuf.
  4. int error = tcp_usrreq(s, flags, mbuf, addr, control)
  5. Test error.
       0: More buffers to send? If yes, go back to step 2. If no, continue.
       1: an error occurred; continue.
  6. Free the memory buffers received.
  7. Return the value of error to sendto().

TCP Layer - tcp_usrreq(struct socket *s, int request, struct mbuf *data_buffer, mbuf *nam, mbuf *control)

  1. Initialize the Internet protocol control block inp and the TCP control block tp, which store the information TCP needs.
  2. Convert the socket to an Internet protocol control block: inp = sotoinpcb(so)
  3. Convert the Internet protocol control block to a TCP control block: tp = intotcpcb(inp)
  4. If request is PRU_SEND: int error = tcp_output(tp)
  5. tcp_output returns error to tcp_usrreq().

TCP Layer (tcp_output.c) - tcp_output(struct tcpcb *tp)

Called by tcp_usrreq for one of the following reasons:
  To send the initial SYN
  To send a finished-sending message (FIN)
  To send data
  To send a window update after data has been received

tcp_output() functionality:

  1. Determine whether TCP can send a segment, depending on:
       flags in the data sent by the socket layer (to send an ACK, etc.)
       the size of the window advertised by the receiver
       the amount of data ready to send
       whether unacknowledged data already exists for the connection
  2. Calculate the amount of data to be sent, depending on:
       the size of the receiver's window
       the number of bytes in the send buffer
  3. Check for window shrink.
  4. Send a segment:
       Allocate a buffer for the TCP and IP headers from the header template.
       Copy the TCP and IP header template into the buffer to be sent.
       Fill in the fields of the TCP header.
       Decrement the number of buffers to be sent, so that the end can be checked.
       Set the sequence number and acknowledgement fields.
       Set three fields in the IP header: IP length, TTL and TOS.
       Pass the datagram to IP.

TCP Layer (tcp_output.c) - tcp_output(struct tcpcb *tp)

  struct socket *so = tp->t_inpcb->inp_socket

  Initialize a TCP header, tcp_header.

  idle is true if the maximum sequence number sent equals the oldest unacknowledged sequence number, that is, if no ACK is expected from the other end.
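The idle test and the send-length calculation (step 2 of the functionality list above) can be sketched as follows. This is only an illustration, not the kernel code: struct send_state and both function names are hypothetical stand-ins for the few fields of the TCP control block and socket send buffer that matter here, and the arithmetic (the smaller of the receiver's window and the congestion window, less the bytes already sent but not yet acknowledged) follows the 4.4BSD-style tcp_output that these slides appear to describe.

```c
#include <stdint.h>

/* Hypothetical, pared-down stand-in for the state tcp_output consults.
 * The real struct tcpcb and socket buffer hold far more than this. */
struct send_state {
    uint32_t snd_una;   /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;   /* next sequence number to send */
    uint32_t snd_max;   /* highest sequence number sent so far */
    uint32_t snd_wnd;   /* window advertised by the receiver, in bytes */
    uint32_t snd_cwnd;  /* congestion window, in bytes */
    uint32_t sb_cc;     /* bytes currently held in the send buffer */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Nonzero when everything sent so far has been acknowledged,
 * so no ACK is outstanding from the other end. */
int is_idle(const struct send_state *s)
{
    return s->snd_max == s->snd_una;
}

/* Number of bytes that may be sent now. A negative result means the
 * usable window is smaller than what is already in flight (the window
 * has shrunk). */
int64_t sendable_len(const struct send_state *s)
{
    uint32_t win = min_u32(s->snd_wnd, s->snd_cwnd);
    /* Bytes sent but not yet acknowledged; unsigned subtraction
     * stays correct across sequence-number wraparound. */
    uint32_t off = s->snd_nxt - s->snd_una;

    return (int64_t)min_u32(s->sb_cc, win) - (int64_t)off;
}
```

With 500 bytes buffered, 100 bytes in flight, and a usable window of 2048 bytes, sendable_len reports 400 bytes still to send; if the usable window drops to 50 bytes it reports -50, which is the window-shrink case from step 3.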
  int idle = (tp->snd_max == tp->snd_una)

  If idle is true, no acknowledgement is expected, so set the congestion window to one segment:
    tp->snd_cwnd = tp->t_maxseg;
  Otherwise, check the ACK flag.

TCP Layer - tcp_output(struct tcpcb *tp)

  off is the offset in bytes, from the beginning of the send buffer, of the first data byte to send. The first off bytes have already been sent and their acknowledgement is awaited.
    int off = tp->snd_nxt - tp->snd_una

  Determine the length of data that should be transmitted and the flags to be used. len is the minimum of the number of bytes in the send buffer and win (itself the minimum of the receiver's window and the congestion window), less off.
    len = min(so->so_snd.sb_cc, win) - off

  Determine flags such as TH_ACK, TH_FIN, TH_RST and TH_SYN:
    flags = tcp_outflags[tp->t_state]

  Decide whether a segment must be sent:
    tp->t_flags & TF_ACKNOW           true: send an acknowledgement
    flags & (TH_SYN | TH_RST)         true: send a SYN (sequence number) or a reset
    flags & TH_FIN                    true: finished sending

  Check flags to determine the type of message:
    window probe
    retransmission
    normal data transmission

  Allocate an mbuf for the TCP and IP headers, and for the data if possible:
    MGETHDR(m, M_DONTWAIT, MT_HEADER)
  M_DONTWAIT indicates that if no memory is available for the mbuf, the routine exits and returns an error instead of waiting.

  Length of data < 44 bytes? (100 - 40 - 16 = 44: the 100 data bytes of a packet header mbuf, less 40 bytes of TCP and IP headers and 16 bytes reserved for the link-layer header.)
    yes: Copy the data from the socket send buffer into the new packet header mbuf.
    no:  Create a new mbuf chain, copy the surplus data into it, and link it to the first mbuf.

  Pass the segment to IP:
    ip_output(m, tp->t_inpcb->inp_options, &tp->t_inpcb->inp_route, so->so_options & SO_DONTROUTE, 0)

ip_output.c

  ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags, struct ip_moptions *imo)

  1. Header initialization
  2. Route Selection
  3. Source address selection and Fragmentation

1. Header initialization

  Packet damaged?
    yes: ERROR. This check catches any errors made while higher layers added their headers. Most fields of the IP header are predefined by higher-layer protocols.
    no:  continue.

  if (flags & (IP_FORWARDING | IP_RAWOUTPUT))
    yes: Save the header length in hlen for the fragmentation algorithm.
    no:  Construct and initialize the IP header: set ip_v = 4, clear ip_off, and assign a unique identifier to ip_id. Length, offset, TTL, protocol, TOS etc. are set by higher layers.

  The value of flags decides what is to be done with the data:
  • IP_FORWARDING : Forward packet
  • IP_ROUTETOIF : Route directly to interface
  • IP_ALLOWBROADCAST : Allow broadcasting of packet
  • IP_RAWOUTPUT : Packet contains pre-constructed header

  If the packet has to be forwarded to another host, i.e. if the machine is acting as a router, ip_output must not modify the IP header. If the packet is not being forwarded and has to be sent to another host, ip_output initializes the IP header.

2. Route Selection

  A cached route may be provided to ip_output as an argument. UDP and TCP maintain a route cache associated with each socket.

  Verify the cached route against the destination address: if (cached_route == destination)
    yes: Find the interface on which the packet has to be placed. ifp points to the interface's ifnet structure.
    no:  Locate a route: call rtalloc(dst_ip) to locate a route to the destination, then find the interface on which the packet has to be placed. ifp points to the interface's ifnet structure. If rtalloc(dst_ip) fails to find a route, return a host unreachable error.

  Check whether the cached route leads to the correct destination. If no route has been provided, ip_output sets up a temporary route structure called iproute. If a cached route is provided, find the interface on which the frame has to be sent. If the packet is being routed, rtalloc locates a route to the address specified by dst.
  If rtalloc fails, an EHOSTUNREACH error is generated. If ip_forward called ip_output, the error is converted to an ICMP error. If a route is found, ifp is made to point to the ifnet structure for the interface. If the next hop is not the packet's final destination, dst is changed to point to the next-hop router.

3. Source address selection and Fragmentation

  Is a valid source address specified?
    no:  Select the IP address of the outgoing interface as the source address.
    yes: continue.

  The final section of ip_output ensures that the IP header has a valid source IP address. This could not be done earlier because the route had not yet been selected. If there is no source IP address, the IP address of the outgoing interface is used.

  Does the packet have to be fragmented?
    yes: Fragment the packet. Packets that exceed the MTU must be fragmented before they can be sent.
    no:  continue.

  In either case (fragmented or not) the checksum is computed (in_cksum). If no errors are found, the data is sent to the if_output function of the selected output interface.

Interface Layer (if_ethersubr.c)

  ether_output(struct ifnet *ifp, struct mbuf *mbuf, struct sockaddr *destination, struct rtentry *rt_entry)

  1. Verification
  2. Protocol-Specific Processing
  3. Frame Construction
  4. Interface Queuing

1. Verification

  Ethernet port up and running? ifp->if_flags & (IFF_UP | IFF_RUNNING)
    no:  senderr(ENETDOWN)
    yes: continue.

Interface Layer (if_ethersubr.c) - ether_output

  Function: Takes the data portion of an Ethernet frame, encapsulates it with a 14-byte header, and places it on the interface send queue.
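To make the 14-byte header concrete, here is a minimal sketch of the encapsulation, assuming the standard Ethernet layout of a 6-byte destination address, a 6-byte source address and a 2-byte frame type in network byte order. The function name ether_encapsulate and the use of a flat byte array are assumptions made for illustration; the kernel routine works on an mbuf chain instead. The constants are defined locally so that the example stands alone.

```c
#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN 6       /* bytes in an Ethernet (MAC) address */
#define ETHER_HDR_LEN  14      /* 6 (dst) + 6 (src) + 2 (type) */
#define ETHERTYPE_IP   0x0800  /* frame type value for IPv4 */

/* Hypothetical helper: writes the 14-byte header followed by the
 * payload into frame. Returns the total frame length, or 0 if the
 * buffer is too small to hold header plus payload. */
size_t ether_encapsulate(uint8_t *frame, size_t frame_cap,
                         const uint8_t dst[ETHER_ADDR_LEN],
                         const uint8_t src[ETHER_ADDR_LEN],
                         uint16_t type,
                         const uint8_t *payload, size_t payload_len)
{
    if (frame_cap < ETHER_HDR_LEN ||
        payload_len > frame_cap - ETHER_HDR_LEN)
        return 0;

    memcpy(frame, dst, ETHER_ADDR_LEN);                   /* bytes 0-5 */
    memcpy(frame + ETHER_ADDR_LEN, src, ETHER_ADDR_LEN);  /* bytes 6-11 */
    frame[12] = (uint8_t)(type >> 8);   /* type, high byte first */
    frame[13] = (uint8_t)(type & 0xff);
    memcpy(frame + ETHER_HDR_LEN, payload, payload_len);

    return ETHER_HDR_LEN + payload_len;
}
```

Writing the type field byte by byte avoids any dependence on host byte order or on structure padding, which is why the sketch does not overlay a C struct on the buffer.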
  Arguments:
    ifp points to the outgoing interface's ifnet structure
    mbuf is the data to be sent
    destination is the destination address
    rt_entry points to the routing entry

  Initialize the Ethernet header: struct eth_header *eh

Verification

  Ethernet port up and running? ifp->if_flags & (IFF_UP | IFF_RUNNING)
    no:  senderr(ENETDOWN)
    yes: continue.
  Route valid? rt_entry = rtalloc1(destination, 1)
    0: senderr(EHOSTUNREACH)
    1: continue.
  Next hop a gateway?
    1: rt = rt->rt_gwroute
    0: continue.
  Destination responding to ARP requests? rt->rt_flags & RTF_REJECT
    If not, do not send more packets, to avoid flooding.

Protocol-Specific Processing

  Functionality: Finds the Ethernet address corresponding to the IP address of the destination.

  destination->sa_family == AF_INET:
    Send an ARP broadcast to find the Ethernet address corresponding to the destination IP address.
    Use m_copy() to keep the packet until an acknowledgement is received.

Frame Preparation

  Make sure there is room for the 14-byte Ethernet header:
    M_PREPEND(m, sizeof(ethernet_header), M_DONTWAIT)
  Form the Ethernet header from the Ethernet frame type, the destination Ethernet (MAC) address (e.g. that of the default gateway for a host), and the unicast Ethernet address associated with the output interface.

Interface Queuing

  Is the output queue (if_snd) full?
    yes: Discard the frame, free the memory buffer, senderr(ENOBUFS).
    no:  Place the frame on the interface's send queue, if_snd, and call lestart(ifp).

Interface Layer (if_le.c) - lestart(struct ifnet *ifp)

  Function: Dequeues frames from the interface output queue and arranges for them to be transmitted by the Ethernet card.

  struct le_softc *le = &le_softc[ifp->if_unit]

  le->sc_if.if_flags & IFF_RUNNING
    0: return error.
    1: Copy the frame in the mbuf to the hardware buffer, then set the IFF_OACTIVE flag to indicate that the device is busy transmitting.
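The interface queuing and lestart steps above can be sketched together as a bounded queue plus a start routine. Every name in this example (struct frame, struct iface_sketch, iface_start, iface_enqueue, the FLAG_ constants) is hypothetical and chosen only for illustration; the kernel uses its own queue macros and mbufs. Returning ENETDOWN when the interface is not running is also an assumption, since the slide says only that an error is returned. Incrementing a counter stands in for copying a frame to the hardware buffer.

```c
#include <errno.h>
#include <stddef.h>

#define FLAG_RUNNING 0x1   /* stand-in for IFF_RUNNING */
#define FLAG_OACTIVE 0x2   /* stand-in for IFF_OACTIVE */

struct frame {
    struct frame *next;
    int len;
};

struct iface_sketch {
    int flags;
    struct frame *head, *tail;  /* send queue, the role if_snd plays */
    int qlen, qmax;             /* current and maximum queue length */
    int drops;                  /* frames rejected because the queue was full */
    int transmitted;            /* frames handed to the "hardware" */
};

/* Dequeue one frame and mark the device busy, as lestart is described
 * as doing. Returns 0 on success or when there is nothing to send. */
int iface_start(struct iface_sketch *ifp)
{
    struct frame *f;

    if (!(ifp->flags & FLAG_RUNNING))
        return ENETDOWN;          /* assumed error value */

    f = ifp->head;
    if (f == NULL)
        return 0;

    ifp->head = f->next;
    if (ifp->head == NULL)
        ifp->tail = NULL;
    ifp->qlen--;

    ifp->transmitted++;           /* stands in for the copy to hardware */
    ifp->flags |= FLAG_OACTIVE;   /* device is now busy transmitting */
    return 0;
}

/* Queue a frame for output. A full queue rejects the frame with
 * ENOBUFS; the caller keeps ownership of a rejected frame here,
 * whereas the kernel path frees it. */
int iface_enqueue(struct iface_sketch *ifp, struct frame *f)
{
    if (ifp->qlen >= ifp->qmax) {
        ifp->drops++;
        return ENOBUFS;
    }

    f->next = NULL;
    if (ifp->tail != NULL)
        ifp->tail->next = f;
    else
        ifp->head = f;
    ifp->tail = f;
    ifp->qlen++;

    /* Kick the device only if it is not already transmitting. */
    if (!(ifp->flags & FLAG_OACTIVE))
        return iface_start(ifp);
    return 0;
}
```

In this sketch nothing ever clears FLAG_OACTIVE; in a real driver that would happen when the hardware signals that transmission has completed, at which point the start routine would run again to drain the queue.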