Socket Layer COMS W6998 Spring 2010 Erich Nahum

Outline
- Sockets API Refresher
- Linux Sockets Architecture
- Interface between BSD sockets and AF_INET
- Interface between AF_INET and TCP/UDP
- Receive Path
- Send Path
BSD Socket API
- Originally developed at UC Berkeley for 4.2BSD (1983)
- Used by the vast majority of network-oriented programs
- Standard interface across operating systems
- Simple and well understood by programmers
User Space Socket API
- socket() / bind() / accept() / listen(): initialization, addressing and handshaking
- select() / poll() / epoll(): waiting for events
- send() / recv(): stream-oriented (e.g., TCP) Rx/Tx
- sendto() / recvfrom(): datagram-oriented (e.g., UDP) Rx/Tx
- close() / shutdown(): closing down an association
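Standard Socket Sequence
The ‘server’ application: socket() -> bind() -> listen() -> accept() -> read() -> write() -> close()
The ‘client’ application: socket() -> bind() (optional) -> connect() -> write() -> read() -> close()
- connect() / accept(): 3-way handshake
- client write() -> server read(): data flow to server
- server write() -> client read(): data flow to client
- close(): 4-way handshake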
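The sequence above can be sketched in C as a single process that plays both roles over the IPv4 loopback interface. This is an illustrative sketch, not code from the slides: the function name tcp_loopback_roundtrip and the "ping"/"pong" payloads are hypothetical, and the client skips the optional bind().

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if a "ping"/"pong" exchange over
 * loopback succeeds, -1 otherwise. Error handling is minimal. */
static int tcp_loopback_roundtrip(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    char buf[8];
    int ret = -1, lfd, cfd = -1, sfd = -1;

    /* Server: socket() -> bind() -> listen() */
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    /* Client: socket() -> connect(). The kernel completes the 3-way
     * handshake and queues the connection, so connect() can return
     * before the server calls accept(). */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* Server: accept() returns a new descriptor for this connection. */
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto out;

    /* Data flow to server, then data flow to client. */
    if (write(cfd, "ping", 4) != 4 || read(sfd, buf, sizeof(buf)) != 4 ||
        memcmp(buf, "ping", 4) != 0)
        goto out;
    if (write(sfd, "pong", 4) != 4 || read(cfd, buf, sizeof(buf)) != 4 ||
        memcmp(buf, "pong", 4) != 0)
        goto out;
    ret = 0;
out:
    /* close() each descriptor; for TCP this starts the FIN exchange. */
    if (sfd >= 0) close(sfd);
    if (cfd >= 0) close(cfd);
    close(lfd);
    return ret;
}
```

A real server would call accept() in a loop and typically hand each new descriptor to a worker; the single-process form is only there to show the call order.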
Socket() System Call
Creating a socket from user space is done by the socket() system call:
    int socket(int family, int type, int protocol);
- On success, a file descriptor for the new socket is returned (on failure, -1 is returned and errno is set).
- The open() system call for files also returns a file descriptor: the “everything is a file” Unix paradigm.
The first parameter, family, is also sometimes referred to as “domain”.
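As a hedged illustration of the “everything is a file” point, the sketch below uses socketpair() to obtain two connected AF_UNIX sockets and then treats them with generic file calls only. The helper name socket_is_a_file is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a socket descriptor behaves like a
 * file descriptor (fstat() sees a socket inode, write()/read() work). */
static int socket_is_a_file(void)
{
    int sv[2];
    struct stat st;
    char buf[4];
    int ok = 0;

    /* socketpair() creates two connected AF_UNIX sockets. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 0;

    if (fstat(sv[0], &st) == 0 && S_ISSOCK(st.st_mode) &&
        /* plain file I/O calls, no send()/recv() needed */
        write(sv[0], "abc", 3) == 3 &&
        read(sv[1], buf, sizeof(buf)) == 3 &&
        memcmp(buf, "abc", 3) == 0)
        ok = 1;

    close(sv[0]);   /* close() releases sockets just like files */
    close(sv[1]);
    return ok;
}
```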
Socket(): Family
- A family is a suite of protocols.
- Each family is a subdirectory of linux/net
  - E.g., linux/net/ipv4, linux/net/decnet, linux/net/packet
- IPv4: PF_INET
- IPv6: PF_INET6
- Packet sockets: PF_PACKET
  - Operate at the device driver layer.
  - The pcap library for Linux uses PF_PACKET sockets; pcap is used by sniffers such as tcpdump.
- Protocol Family == Address Family
  - PF_INET == AF_INET (in include/linux/socket.h)
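The PF/AF equivalence can be checked at compile time. This sketch assumes a Linux system with glibc-style headers, where each PF_* and AF_* pair is defined to the same value (AF_INET is 2 and AF_INET6 is 10 on Linux); the helper name inet_family_value is hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>

/* Each PF_* constant has the same value as its AF_* twin. */
_Static_assert(PF_INET == AF_INET, "PF_INET and AF_INET differ");
_Static_assert(PF_INET6 == AF_INET6, "PF_INET6 and AF_INET6 differ");
_Static_assert(PF_UNIX == AF_UNIX, "PF_UNIX and AF_UNIX differ");
_Static_assert(PF_PACKET == AF_PACKET, "PF_PACKET and AF_PACKET differ");

/* Hypothetical helper: the shared value used for IPv4. */
static int inet_family_value(void)
{
    return PF_INET;
}
```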
Address/Protocol Families
/* Supported address families. */
#define AF_UNSPEC       0
#define AF_UNIX         1   /* Unix domain sockets          */
#define AF_LOCAL        1   /* POSIX name for AF_UNIX       */
#define AF_INET         2   /* Internet IP Protocol         */
#define AF_AX25         3   /* Amateur Radio AX.25          */
#define AF_IPX          4   /* Novell IPX                   */
#define AF_APPLETALK    5   /* AppleTalk DDP                */
#define AF_NETROM       6   /* Amateur Radio NET/ROM        */
#define AF_BRIDGE       7   /* Multiprotocol bridge         */
#define AF_ATMPVC       8   /* ATM PVCs                     */
#define AF_X25          9   /* Reserved for X.25 project    */
#define AF_INET6        10  /* IP version 6                 */
#define AF_ROSE         11  /* Amateur Radio X.25 PLP       */
#define AF_DECnet       12  /* Reserved for DECnet project  */
#define AF_NETBEUI      13  /* Reserved for 802.2LLC project*/
#define AF_SECURITY     14  /* Security callback pseudo AF  */
#define AF_KEY          15  /* PF_KEY key management API    */
..
#define AF_ISDN         34  /* mISDN sockets                */
#define AF_PHONET       35  /* Phonet sockets               */
#define AF_IEEE802154   36  /* IEEE802154 sockets           */
#define AF_MAX          37  /* For now.. */
include/linux/socket.h
Socket(): Type
SOCK_STREAM and SOCK_DGRAM are the most commonly used types.
- SOCK_STREAM for TCP, SCTP
- SOCK_DGRAM for UDP
- SOCK_RAW for raw sockets
Some families support both SOCK_STREAM and SOCK_DGRAM; for example, Unix domain sockets (AF_UNIX).
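To make the AF_UNIX case concrete, here is a sketch (the helper name unix_pair_works is hypothetical) that creates a connected AF_UNIX pair of either type, reads the type back with the SO_TYPE socket option, and passes a short message across.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if an AF_UNIX pair of the given type
 * (SOCK_STREAM or SOCK_DGRAM) reports that type and carries "hi". */
static int unix_pair_works(int type)
{
    int sv[2], got_type = -1, ok;
    socklen_t len = sizeof(got_type);
    char buf[8];

    if (socketpair(AF_UNIX, type, 0, sv) < 0)
        return 0;
    ok = getsockopt(sv[0], SOL_SOCKET, SO_TYPE, &got_type, &len) == 0 &&
         got_type == type &&
         send(sv[0], "hi", 2, 0) == 2 &&
         recv(sv[1], buf, sizeof(buf), 0) == 2 &&
         memcmp(buf, "hi", 2) == 0;
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

With SOCK_DGRAM each send() is delivered as one message; with SOCK_STREAM the same bytes arrive as part of a byte stream, even though both go through the same family.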
Socket(): Protocol
- Protocol is the protocol number within a family.
- Internet protocols are assigned by IANA:
  - http://www.iana.org/assignments/protocol-numbers/
- For AF_INET, it’s usually 0, which selects the default protocol for the requested type (TCP for SOCK_STREAM, UDP for SOCK_DGRAM).
  - IPPROTO_IP is 0; see include/linux/in.h.
- For SCTP, protocol is IPPROTO_SCTP (132):
  - sockfd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
- For UDP-Lite, protocol is IPPROTO_UDPLITE (136).
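The effect of passing 0 can be observed from user space. This sketch relies on the Linux-specific SO_PROTOCOL socket option (available since Linux 2.6.32) to read back the protocol the kernel actually chose; the helper name default_protocol is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: creates socket(family, type, 0) and returns the
 * protocol the kernel selected, or -1 on failure. */
static int default_protocol(int family, int type)
{
    int fd = socket(family, type, 0);
    int proto = -1;
    socklen_t len = sizeof(proto);

    if (fd < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_PROTOCOL, &proto, &len) < 0)
        proto = -1;
    close(fd);
    return proto;
}
```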
Socket Layer Architecture
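[Figure: layered architecture. User space: Application. Kernel: the BSD Socket Layer (socket interface) sits above the protocol layers, organized by family: PF_INET (SOCK_STREAM over TCP, SOCK_DGRAM over UDP, both over IPv4), PF_PACKET and raw sockets (SOCK_RAW), PF_UNIX, PF_IPX (SOCK_DGRAM), and others. Below them is the Network Device Layer, with drivers for Ethernet (e.g., Intel E1000), Token Ring, PPP, SLIP, and FDDI, and then the hardware.]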
Key Concepts
- Function pointer tables (“ops”)
  - In-kernel interfaces for socket functions
  - Binding between BSD sockets and AF_XXX families
  - Binding between AF_INET and transports (TCP, UDP)
- Socket data structures
  - struct socket (BSD socket)
  - struct sock (protocol family socket, network state)
    - struct packet_sock (PF_PACKET)
    - struct inet_sock (PF_INET)
      - struct udp_sock
      - struct tcp_sock
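The nesting above works because each specialized structure embeds its “parent” as its first member, so a pointer to the outer object is also a valid pointer to the inner one. The sketch below models that with drastically reduced, hypothetical model_* structures; the real kernel structs have many more fields, and the casting helpers only mimic the style of accessors such as inet_sk() and udp_sk().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced layouts (not the kernel's). */
struct model_sock_common { unsigned short skc_family; unsigned char skc_state; };
struct model_sock        { struct model_sock_common sk_common; int sk_rcvbuf; };
struct model_inet_sock   { struct model_sock sk; unsigned short inet_sport; };
struct model_udp_sock    { struct model_inet_sock inet; int pending; };

/* Since the parent is the first member, downcasting is a plain pointer
 * cast, valid only when sk really is part of the larger object. */
static struct model_inet_sock *model_inet_sk(struct model_sock *sk)
{
    return (struct model_inet_sock *)sk;
}

static struct model_udp_sock *model_udp_sk(struct model_sock *sk)
{
    return (struct model_udp_sock *)sk;
}
```

Generic code can therefore pass a struct sock pointer around and let protocol-specific code recover its own view of the same object.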
Socket Data Structures
For every socket created by a user space application, there is a corresponding struct socket and struct sock in the kernel. The two are easily confused.
- struct socket: include/linux/net.h
  - Data common to the BSD socket layer
  - Has only 8 members
  - By convention, any variable “sock” refers to a struct socket
- struct sock: include/net/sock.h
  - Data common to the network protocol layer (i.e., AF_INET)
  - Has more than 30 members, and is one of the biggest structures in the networking stack
  - By convention, any variable “sk” refers to a struct sock
struct socket
struct socket {
	socket_state		state;		// SS_CONNECTING etc.
	short			type;		// SOCK_STREAM etc.
	unsigned long		flags;
	struct fasync_struct	*fasync_list;
	wait_queue_head_t	wait;		// tasks waiting
	struct file		*file;		// back ptr to the struct file
	struct sock		*sk;		// AF specific state
	const struct proto_ops	*ops;		// AF specific operations
};
include/linux/net.h
Socket State
typedef enum {
	SS_FREE = 0,		/* not allocated		*/
	SS_UNCONNECTED,		/* unconnected to any socket	*/
	SS_CONNECTING,		/* in process of connecting	*/
	SS_CONNECTED,		/* connected to socket		*/
	SS_DISCONNECTING	/* in process of disconnecting	*/
} socket_state;

These are BSD socket layer states, not layer 4 states (like TCP_ESTABLISHED or TCP_CLOSE); the layer 4 state is kept in struct sock.
include/linux/net.h
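The SS_* values are not visible from user space, but the layer 4 state is. As a Linux-specific sketch, the TCP_INFO socket option reports tcpi_state, which after listen() should be TCP_LISTEN (from glibc's netinet/tcp.h). The helper name tcp_state_after_listen is hypothetical; listen() on an unbound TCP socket makes the kernel pick a port automatically.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns the TCP (layer 4) state of a fresh
 * socket after listen(), as reported by TCP_INFO, or -1 on error. */
static int tcp_state_after_listen(void)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int state = -1;

    if (fd < 0)
        return -1;
    if (listen(fd, 1) == 0 &&
        getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
        state = info.tcpi_state;
    close(fd);
    return state;
}
```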
Socket Types
enum sock_type {
	SOCK_STREAM	= 1,
	SOCK_DGRAM	= 2,
	SOCK_RAW	= 3,
	SOCK_RDM	= 4,
	SOCK_SEQPACKET	= 5,
	SOCK_DCCP	= 6,
	SOCK_PACKET	= 10,
};
include/linux/net.h
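The user-space SOCK_* constants are expected to carry the same values as this kernel enum. A minimal check, assuming Linux on x86 or a similar architecture (MIPS is a known exception that swaps the SOCK_STREAM and SOCK_DGRAM values); the helper name is hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: 1 if the libc constants match the listing above. */
static int sock_types_match_kernel(void)
{
    return SOCK_STREAM == 1 && SOCK_DGRAM == 2 && SOCK_RAW == 3 &&
           SOCK_RDM == 4 && SOCK_SEQPACKET == 5 && SOCK_DCCP == 6 &&
           SOCK_PACKET == 10;
}
```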
Comment in include/net/sock.h
/*
* This structure really needs to be cleaned up.
* Most of it is for TCP, and not used by any of
* the other protocols.
*/
struct sock_common
/* minimal network layer representation of sockets */
struct sock_common {
	/*
	 * first fields are not copied in sock_copy()
	 */
	union {
		struct hlist_node	skc_node;		// main hash linkage for lookup
		struct hlist_nulls_node	skc_nulls_node;		// main hash for TCP/UDP
	};
	atomic_t		skc_refcnt;
	int			skc_tx_queue_mapping;	// tx queue for this connection
	union {
		unsigned int	skc_hash;		// hash value for lookup
		__u16		skc_u16hashes[2];
	};
	unsigned short		skc_family;		// network address family
	volatile unsigned char	skc_state;		// connection state
	unsigned char		skc_reuse;		// SO_REUSEADDR setting
	int			skc_bound_dev_if;	// bound if != 0
	union {
		struct hlist_node	skc_bind_node;		// bind hash linkage
		struct hlist_nulls_node	skc_portaddr_node;	// bind hash for UDP/Lite
	};
	struct proto		*skc_prot;		// protocol handlers in a net family
};
include/net/sock.h
BSD Socket <-> AF Interface
Main data structures
- struct net_proto_family
- struct proto_ops
Key function
- sock_register(struct net_proto_family *ops)
Each address family:
- Implements the struct net_proto_family.
- Calls the function sock_register() when the protocol family is initialized.
- Implements the struct proto_ops for binding the BSD socket layer and the protocol family layer.
net_proto_family
Describes each of the supported protocol families:
struct net_proto_family {
	int		family;
	int		(*create)(struct net *net, struct socket *sock,
				  int protocol, int kern);
	struct module	*owner;
};
Specifies the handler for socket creation: create() is called whenever a new socket of this family is created.
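The registration and dispatch can be modelled in user space. In this sketch every model_* name is hypothetical; model_sock_register() imitates the checks sock_register() performs (out-of-range family, already registered family), and model_socket_create() imitates the socket-creation path looking up the family table and calling its create() handler.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_NPROTO 37              /* mirrors AF_MAX in the listing above */

struct model_socket { int family; int type; };

struct model_proto_family {
    int family;
    int (*create)(struct model_socket *sock, int protocol);
};

static const struct model_proto_family *model_families[MODEL_NPROTO];

/* Like sock_register(): reject out-of-range or duplicate families. */
static int model_sock_register(const struct model_proto_family *ops)
{
    if (ops->family < 0 || ops->family >= MODEL_NPROTO)
        return -ENOBUFS;
    if (model_families[ops->family])
        return -EEXIST;
    model_families[ops->family] = ops;
    return 0;
}

/* BSD-layer side: find the family's table and call its create(). */
static int model_socket_create(int family, int type, int protocol,
                               struct model_socket *sock)
{
    const struct model_proto_family *pf;

    if (family < 0 || family >= MODEL_NPROTO)
        return -EAFNOSUPPORT;
    pf = model_families[family];
    if (!pf)
        return -EAFNOSUPPORT;
    sock->family = family;
    sock->type = type;
    return pf->create(sock, protocol);
}

/* Toy create() for an "inet" family that only accepts protocol 0. */
static int model_inet_create(struct model_socket *sock, int protocol)
{
    (void)sock;
    return protocol == 0 ? 0 : -EPROTONOSUPPORT;
}

static const struct model_proto_family model_inet_family = {
    .family = 2,                      /* AF_INET */
    .create = model_inet_create,
};
```

Until a family registers, socket creation for it fails; afterwards the BSD layer reaches the family only through the create pointer.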
INET and PACKET proto_family
static const struct net_proto_family inet_family_ops = {	/* af_inet.c */
	.family = PF_INET,
	.create = inet_create,
	.owner	= THIS_MODULE,
};

static const struct net_proto_family packet_family_ops = {	/* af_packet.c */
	.family = PF_PACKET,
	.create = packet_create,
	.owner	= THIS_MODULE,
};
proto_ops
- Defines the binding between the BSD socket layer and the address family (AF_*) layer.
- The proto_ops tables contain the functions exported by the AF socket layer to the BSD socket layer.
- Each table consists of the address family type and a set of pointers to socket operation routines specific to a particular address family.
struct proto_ops
struct proto_ops {
	/* argument lists omitted */
	int		family;
	struct module	*owner;
	int		(*release);
	int		(*bind);
	int		(*connect);
	int		(*socketpair);
	int		(*accept);
	int		(*getname);
	unsigned int	(*poll);
	int		(*ioctl);
	int		(*compat_ioctl);
	int		(*listen);
	int		(*shutdown);
	int		(*setsockopt);
	int		(*getsockopt);
	int		(*compat_setsockopt);
	int		(*compat_getsockopt);
	int		(*sendmsg);
	int		(*recvmsg);
	int		(*mmap);
	ssize_t		(*sendpage);
	ssize_t		(*splice_read);
};
include/linux/net.h
PF_PACKET proto_ops
static const struct proto_ops packet_ops = {
	.family =	PF_PACKET,
	.owner =	THIS_MODULE,
	.release =	packet_release,
	.bind =		packet_bind,
	.connect =	sock_no_connect,
	.socketpair =	sock_no_socketpair,
	.accept =	sock_no_accept,
	.getname =	packet_getname,
	.poll =		packet_poll,
	.ioctl =	packet_ioctl,
	.listen =	sock_no_listen,
	.shutdown =	sock_no_shutdown,
	.setsockopt =	packet_setsockopt,
	.getsockopt =	packet_getsockopt,
	.sendmsg =	packet_sendmsg,
	.recvmsg =	packet_recvmsg,
	.mmap =		packet_mmap,
	.sendpage =	sock_no_sendpage,
};
net/packet/af_packet.c
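The sock_no_* entries are generic stubs for operations a family does not support. The following user-space model (all model_* names hypothetical, argument lists reduced) shows how a BSD-layer entry point calls through sock->ops without knowing the family, and how a stub in the style of sock_no_connect() or sock_no_listen() turns an unsupported call into -EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>

struct model_socket;

/* Reduced stand-in for struct proto_ops. */
struct model_proto_ops {
    int family;
    int (*bind)(struct model_socket *sock, int port);
    int (*connect)(struct model_socket *sock, int port);
    int (*listen)(struct model_socket *sock, int backlog);
};

struct model_socket {
    const struct model_proto_ops *ops;   /* AF-specific operations */
    int bound_port;
};

/* Generic "not supported" stubs. */
static int model_no_connect(struct model_socket *s, int port)
{ (void)s; (void)port; return -EOPNOTSUPP; }
static int model_no_listen(struct model_socket *s, int backlog)
{ (void)s; (void)backlog; return -EOPNOTSUPP; }

static int model_packet_bind(struct model_socket *s, int port)
{ s->bound_port = port; return 0; }

/* Packet-like family: bind works, connect/listen are stubs. */
static const struct model_proto_ops model_packet_ops = {
    .family  = 17,                   /* PF_PACKET on Linux */
    .bind    = model_packet_bind,
    .connect = model_no_connect,
    .listen  = model_no_listen,
};

/* BSD-layer entry points just call through sock->ops. */
static int model_sys_bind(struct model_socket *s, int port)    { return s->ops->bind(s, port); }
static int model_sys_connect(struct model_socket *s, int port) { return s->ops->connect(s, port); }
static int model_sys_listen(struct model_socket *s, int n)     { return s->ops->listen(s, n); }
```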
PF_INET proto_ops
Field        | inet_stream_ops (TCP)  | inet_dgram_ops (UDP)   | inet_sockraw_ops (RAW)
.family      | PF_INET                | PF_INET                | PF_INET
.owner       | THIS_MODULE            | THIS_MODULE            | THIS_MODULE
.release     | inet_release           | inet_release           | inet_release
.bind        | inet_bind              | inet_bind              | inet_bind
.connect     | inet_stream_connect    | inet_dgram_connect     | inet_dgram_connect
.socketpair  | sock_no_socketpair     | sock_no_socketpair     | sock_no_socketpair
.accept      | inet_accept            | sock_no_accept         | sock_no_accept
.getname     | inet_getname           | inet_getname           | inet_getname
.poll        | tcp_poll               | udp_poll               | datagram_poll
.ioctl       | inet_ioctl             | inet_ioctl             | inet_ioctl
.listen      | inet_listen            | sock_no_listen         | sock_no_listen
.shutdown    | inet_shutdown          | inet_shutdown          | inet_shutdown
.setsockopt  | sock_common_setsockopt | sock_common_setsockopt | sock_common_setsockopt
.getsockopt  | sock_common_getsockopt | sock_common_getsockopt | sock_common_getsockopt
.sendmsg     | tcp_sendmsg            | inet_sendmsg           | inet_sendmsg
.recvmsg     | sock_common_recvmsg    | sock_common_recvmsg    | sock_common_recvmsg
.mmap        | sock_no_mmap           | sock_no_mmap           | sock_no_mmap
.sendpage    | tcp_sendpage           | inet_sendpage          | inet_sendpage
.splice_read | tcp_splice_read        | --                     | --
net/ipv4/af_inet.c
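The sock_no_listen entry for UDP is visible from user space: listen() on a TCP socket succeeds (inet_listen() binds an ephemeral port if needed), while on a UDP socket it fails with EOPNOTSUPP. A small sketch, assuming Linux; the helper name listen_errno is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: calls listen() on a fresh AF_INET socket of the
 * given type; returns 0 on success or the errno value on failure. */
static int listen_errno(int type)
{
    int fd = socket(AF_INET, type, 0);
    int err = 0;

    if (fd < 0)
        return errno;
    if (listen(fd, 1) < 0)
        err = errno;        /* for UDP: EOPNOTSUPP from sock_no_listen() */
    close(fd);
    return err;
}
```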
Outline
- Sockets API Refresher
- Linux Sockets Architecture
- Interface between BSD sockets and AF_INET
- Interface between AF_INET and TCP/UDP
  - Binding between IP and TCP/UDP (upcall)
  - Binding between AF_INET and TCP (downcall)
- Receive Path
- Send Path
AF_INET <-> Transport API
- inet_protos (array of struct net_protocol)
  - Interface between IP and the transport layer
  - Is the upcall binding from IP to transport
  - Method for demultiplexing IP packets to the proper transport
- struct proto
  - Defines the interface for individual protocols (TCP, UDP, etc.)
  - Is the downcall binding from AF_INET to transport
  - Transport-specific functions for the socket API
- struct inet_protosw
  - Describes the PF_INET protocols
  - Defines the different SOCK types for PF_INET
  - SOCK_STREAM (TCP), SOCK_DGRAM (UDP), SOCK_RAW
Recall IP’s inet_protos
[Figure: inet_protos[MAX_INET_PROTOS], an array of pointers to struct net_protocol indexed by IP protocol number. Each net_protocol holds handler, err_handler, gso_send_check, gso_segment, gro_receive, and gro_complete. UDP’s entry has handler = udp_rcv() and err_handler = udp_err(); IGMP’s entry has handler = igmp_rcv() and a NULL err_handler; unused slots are NULL.]
- Receive binding from the IP layer to the transport layer.
- inet_init() calls inet_add_protocol() to add each protocol to the table.
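The upcall table can be modelled as an array indexed by the IP header’s protocol field. In this sketch all model_* names are hypothetical and only the handler slot of struct net_protocol is kept; model_add_protocol() follows the idea of inet_add_protocol() (claim a free slot, fail if taken), and model_ip_deliver() stands in for the IP receive path’s demultiplexing step.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_INET_PROTOS 256

struct model_packet { unsigned char protocol; int delivered_to; };

/* Only the handler slot of struct net_protocol is modelled. */
struct model_net_protocol {
    int (*handler)(struct model_packet *pkt);
};

static const struct model_net_protocol *model_inet_protos[MODEL_MAX_INET_PROTOS];

/* Claim the slot for this protocol number; fail if already taken. */
static int model_add_protocol(const struct model_net_protocol *prot,
                              unsigned char protocol)
{
    unsigned int hash = protocol & (MODEL_MAX_INET_PROTOS - 1);

    if (model_inet_protos[hash])
        return -1;
    model_inet_protos[hash] = prot;
    return 0;
}

static int model_udp_rcv(struct model_packet *pkt)  { pkt->delivered_to = 17; return 0; }
static int model_igmp_rcv(struct model_packet *pkt) { pkt->delivered_to = 2;  return 0; }

static const struct model_net_protocol model_udp_protocol  = { .handler = model_udp_rcv };
static const struct model_net_protocol model_igmp_protocol = { .handler = model_igmp_rcv };

/* Upcall: pick the transport by protocol number; -1 if none is
 * registered (the real kernel would typically answer with an ICMP
 * protocol-unreachable message). */
static int model_ip_deliver(struct model_packet *pkt)
{
    const struct model_net_protocol *ipprot =
        model_inet_protos[pkt->protocol & (MODEL_MAX_INET_PROTOS - 1)];

    if (!ipprot)
        return -1;
    return ipprot->handler(pkt);
}
```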
struct proto
/* Networking protocol blocks we attach to sockets.
 * socket layer -> transport layer interface
 */
struct proto {
	/* argument lists omitted */
	void		(*close);
	int		(*connect);
	int		(*disconnect);
	struct sock *	(*accept);
	int		(*ioctl);
	int		(*init);
	void		(*destroy);
	void		(*shutdown);
	int		(*setsockopt);
	int		(*getsockopt);
	int		(*sendmsg);
	int		(*recvmsg);
	int		(*sendpage);
	int		(*bind);
	int		(*backlog_rcv);
	void		(*hash);
	void		(*unhash);
	int		(*get_port);
};
include/net/sock.h
udp_prot
struct proto udp_prot = {
	.name		   = "UDP",
	.owner		   = THIS_MODULE,
	.close		   = udp_lib_close,
	.connect	   = ip4_datagram_connect,
	.disconnect	   = udp_disconnect,
	.ioctl		   = udp_ioctl,
	.destroy	   = udp_destroy_sock,
	.setsockopt	   = udp_setsockopt,
	.getsockopt	   = udp_getsockopt,
	.sendmsg	   = udp_sendmsg,
	.recvmsg	   = udp_recvmsg,
	.sendpage	   = udp_sendpage,
	.backlog_rcv	   = __udp_queue_rcv_skb,
	.hash		   = udp_lib_hash,
	.unhash		   = udp_lib_unhash,
	.get_port	   = udp_v4_get_port,
	.memory_allocated  = &udp_memory_allocated,
	.sysctl_mem	   = sysctl_udp_mem,
	.sysctl_wmem	   = &sysctl_udp_wmem_min,
	.sysctl_rmem	   = &sysctl_udp_rmem_min,
	.obj_size	   = sizeof(struct udp_sock),
	.slab_flags	   = SLAB_DESTROY_BY_RCU,
	.h.udp_table	   = &udp_table,
#ifdef CONFIG_COMPAT
	.compat_setsockopt = compat_udp_setsockopt,
	.compat_getsockopt = compat_udp_getsockopt,
#endif
};
net/ipv4/udp.c
inet_protosw
static struct inet_protosw inetsw_array[] =
{
	{
		.type =       SOCK_STREAM,
		.protocol =   IPPROTO_TCP,
		.prot =       &tcp_prot,
		.ops =        &inet_stream_ops,
		.no_check =   0,
		.flags =      INET_PROTOSW_PERMANENT |
			      INET_PROTOSW_ICSK,
	},

	{
		.type =       SOCK_DGRAM,
		.protocol =   IPPROTO_UDP,
		.prot =       &udp_prot,
		.ops =        &inet_dgram_ops,
		.no_check =   UDP_CSUM_DEFAULT,
		.flags =      INET_PROTOSW_PERMANENT,
	},

	{
		.type =       SOCK_RAW,
		.protocol =   IPPROTO_IP,	/* wild card */
		.prot =       &raw_prot,
		.ops =        &inet_sockraw_ops,
		.no_check =   UDP_CSUM_DEFAULT,
		.flags =      INET_PROTOSW_REUSE,
	}
};
- On startup, inet_init() registers the TCP, UDP, and raw entries of inetsw_array[] into the per-type inetsw[] lists.
- Other protocols call inet_register_protosw().
- inet_unregister_protosw() will not remove protocols with INET_PROTOSW_PERMANENT set.
net/ipv4/af_inet.c
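When an AF_INET socket is created, the kernel searches these entries by type and then by protocol, with 0 (IPPROTO_IP) acting as a wildcard. The sketch below is a hypothetical user-space reduction of that matching (loosely modelled on the lookup in inet_create()); the model_* names are invented and a string stands in for the .prot/.ops pointers. Note that asking for protocol 0 on the wildcard raw entry is not a match.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

struct model_protosw {
    int type;
    int protocol;
    const char *name;          /* stands in for .prot / .ops */
};

static const struct model_protosw model_inetsw[] = {
    { SOCK_STREAM, IPPROTO_TCP, "tcp" },
    { SOCK_DGRAM,  IPPROTO_UDP, "udp" },
    { SOCK_RAW,    IPPROTO_IP,  "raw" },   /* wild card */
};

/* Returns the matching entry and stores the resolved protocol, or
 * returns NULL and stores -EPROTONOSUPPORT. */
static const struct model_protosw *model_lookup(int type, int protocol,
                                                int *resolved)
{
    size_t i;

    for (i = 0; i < sizeof(model_inetsw) / sizeof(model_inetsw[0]); i++) {
        const struct model_protosw *a = &model_inetsw[i];

        if (a->type != type)
            continue;
        if (protocol == a->protocol) {
            if (protocol != IPPROTO_IP) {  /* exact, non-wild match */
                *resolved = protocol;
                return a;
            }
            /* both sides are wildcards: not a match */
        } else if (protocol == IPPROTO_IP) {
            *resolved = a->protocol;       /* caller asked for default */
            return a;
        } else if (a->protocol == IPPROTO_IP) {
            *resolved = protocol;          /* entry accepts any protocol */
            return a;
        }
    }
    *resolved = -EPROTONOSUPPORT;
    return NULL;
}
```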
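Relationships
- struct socket (state, type, flags, fasync_list, wait, file, sk, ops)
  - sk points to the struct sock (sk_common, sk_lock, sk_backlog, ..., (*sk_prot_creator), sk_socket, sk_send_head, ...); sk_socket points back to the struct socket.
  - ops points to the PF_INET struct proto_ops in af_inet.c (inet_release, inet_bind, inet_accept, ...).
- struct sock begins with a struct sock_common (skc_node, skc_refcnt, skc_hash, ..., skc_prot, skc_net).
  - skc_prot points to the transport’s struct proto, e.g., for UDP: udp_lib_close, ip4_datagram_connect, udp_sendmsg, udp_recvmsg, ...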
Example: inet_accept()
int inet_accept(struct socket *sock, struct socket *newsock, int flags)
{
	struct sock *sk1 = sock->sk;
	int err = -EINVAL;
	/* Downcall into the transport's accept (for TCP, inet_csk_accept()). */
	struct sock *sk2 = sk1->sk_prot->accept(sk1, flags, &err);

	if (!sk2)
		goto do_err;

	lock_sock(sk2);

	/* The accepted connection must already be past the handshake. */
	WARN_ON(!((1 << sk2->sk_state) &
		  (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_CLOSE)));

	/* Attach the new struct sock to the new BSD struct socket. */
	sock_graft(sk2, newsock);

	newsock->state = SS_CONNECTED;
	err = 0;
	release_sock(sk2);
do_err:
	return err;
}
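The outcome of inet_accept() can be observed from user space: the descriptor returned by accept() refers to a connected socket, so getpeername() succeeds on it, while the listening socket has no peer and getpeername() fails with ENOTCONN. A sketch assuming Linux and IPv4 loopback; the helper name accepted_socket_is_connected is hypothetical.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the accepted socket has a peer and
 * the listening socket does not. */
static int accepted_socket_is_connected(void)
{
    struct sockaddr_in addr, peer;
    socklen_t len = sizeof(addr), plen = sizeof(peer);
    int lfd, cfd = -1, nfd = -1, ok = 0;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return 0;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* port 0: kernel picks */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* accept() -> inet_accept(): the new struct socket is grafted onto
     * the transport's struct sock and marked SS_CONNECTED. */
    nfd = accept(lfd, NULL, NULL);
    if (nfd < 0)
        goto out;

    ok = getpeername(nfd, (struct sockaddr *)&peer, &plen) == 0 &&
         getpeername(lfd, (struct sockaddr *)&peer, &plen) < 0 &&
         errno == ENOTCONN;
out:
    if (nfd >= 0) close(nfd);
    if (cfd >= 0) close(cfd);
    close(lfd);
    return ok;
}
```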