Linux Kernel Protocol Stack Analysis
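This document is based on linux-2.6.32. It is meant to help readers who already have some grounding in network protocols follow a packet's path through the protocol stack: roughly which steps a packet goes through between being received by the network card and being delivered to the application layer, and what each step does. The figures are placed at the end of the document.
Prerequisites for reading this document: basic C, C callback functions, basic UML modeling, the C++ object-oriented idea of encapsulation, and TCP/IP or general networking fundamentals.
This chapter is excerpted from Chapter 1 of [TCP/IP Illustrated, Volume 1].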
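Networking protocols are normally developed in layers, with each layer responsible for a different facet of the communication. A protocol suite, such as TCP/IP, is the combination of different protocols at various layers. TCP/IP is normally considered to be a four-layer system, as shown in Figure 1-1. Each layer has a different responsibility:
1) The link layer, sometimes called the data-link layer or network interface layer, normally includes the device driver in the operating system and the corresponding network interface card in the computer. Together they handle all the details of physically interfacing with the cable (or whatever other transmission medium is used).
2) The network layer, sometimes called the internet layer, handles the movement of packets around the network, for example the routing of packets. In the TCP/IP protocol suite, the network layer protocols are IP (Internet Protocol), ICMP (Internet Control Message Protocol), and IGMP (Internet Group Management Protocol).
3) The transport layer provides end-to-end communication for applications on two hosts. In the TCP/IP protocol suite there are two very different transport protocols: TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). TCP provides a highly reliable flow of data between two hosts. Its work includes dividing the data passed to it by the application into appropriately sized chunks for the network layer below, acknowledging received packets, and setting timeouts to make sure the other end acknowledges the packets that are sent. Because the transport layer provides this reliable end-to-end communication, the application layer can ignore all of these details. UDP, on the other hand, provides a much simpler service to the application layer. It just sends packets of data called datagrams from one host to another, with no guarantee that a datagram reaches the other end. Any required reliability must be provided by the application layer.
Each of these two transport protocols is used by different applications, as will be seen later.
4) The application layer handles the details of the particular application. Almost every TCP/IP implementation provides these common applications: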
• Telnet: remote login.
• FTP: the File Transfer Protocol.
• SMTP: the Simple Mail Transfer Protocol.
• SNMP: the Simple Network Management Protocol.
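The TCP/IP protocol suite contains many protocols. Figure 1-4 shows the additional protocols that the book discusses.
TCP and UDP are the two best-known transport layer protocols. Both use IP as the network layer protocol.
TCP provides a reliable transport layer service even though the IP service it uses is unreliable. Chapters 17 through 22 of the book look at the internal operation of TCP in detail, followed by some TCP applications: Telnet and Rlogin in Chapter 26, FTP in Chapter 27, and SMTP in Chapter 28. These applications are normally user processes.
UDP sends and receives datagrams for applications. A datagram is a unit of information (for example, a certain number of bytes specified by the sender) that travels from the sender to the receiver. Unlike TCP, however, UDP is unreliable: there is no guarantee that a datagram reaches its final destination intact. Chapter 11 of the book covers UDP, and Chapters 14 (DNS, the Domain Name System), 15 (TFTP, the Trivial File Transfer Protocol), and 16 (BOOTP, the Bootstrap Protocol) cover applications that use UDP. SNMP also uses UDP, but because it deals with many of the other protocols, the book leaves it until Chapter 25.
IP is the main protocol at the network layer and is used by both TCP and UDP. Every piece of TCP and UDP data that crosses an internet goes through the IP layer at both end systems and at every intermediate router. Figure 1-4 also shows an application accessing IP directly. This is rare but possible (some older routing protocols were implemented this way, and a new transport layer protocol could take the same approach). Chapter 3 covers IP, with some details deferred to later chapters where they are more relevant. Chapters 9 and 10 describe how IP performs routing.
ICMP is an adjunct to IP. The IP layer uses it to exchange error messages and other vital information with other hosts or routers.
Chapter 6 looks at ICMP in more detail. Although ICMP is used primarily by IP, an application can also access it. Two popular diagnostic tools, Ping and Traceroute (Chapters 7 and 8), both use ICMP.
IGMP is the Internet Group Management Protocol. It is used for multicasting a UDP datagram to multiple hosts. Chapter 12 describes the general properties of broadcasting (sending a UDP datagram to every host on a specified network) and multicasting, and Chapter 13 describes IGMP itself.
ARP (Address Resolution Protocol) and RARP (Reverse Address Resolution Protocol) are specialized protocols used with certain types of network interfaces (such as Ethernet and token ring) to convert between the addresses used by the IP layer and the addresses used by the network interface layer. Chapters 4 and 5 examine these two protocols.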
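Every interface on an internet must have a unique Internet address (also called an IP address). An IP address is 32 bits long. Instead of using a flat address space such as 1, 2, 3, and so on, IP addresses have structure. Figure 1-5 shows the five classes of Internet address formats.
These 32-bit addresses are normally written as four decimal numbers, one for each byte of the address. This is called dotted-decimal notation. For example, the author's system has the class B address 140.252.13.33.
The easiest way to tell the address classes apart is to look at the first decimal number. Figure 1-6 lists the range of each class, with the first decimal number shown in bold.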
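The rule of looking at the first decimal number can be sketched in a few lines of C. This is only an illustration of classful addressing as described above; the function names `ipv4_class` and `ipv4_to_dotted` are made up for this example, and addresses are assumed to be given in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Return 'A'..'E' for the classful category of an IPv4 address.
 * addr is in host byte order, e.g. 140.252.13.33 == 0x8CFC0D21. */
char ipv4_class(uint32_t addr)
{
    uint8_t first = (uint8_t)(addr >> 24);   /* first dotted-decimal number */

    if (first < 128) return 'A';   /* leading bit  0   : 0..127   */
    if (first < 192) return 'B';   /* leading bits 10  : 128..191 */
    if (first < 224) return 'C';   /* leading bits 110 : 192..223 */
    if (first < 240) return 'D';   /* leading bits 1110: 224..239 */
    return 'E';                    /* remaining range  : 240..255 */
}

/* Write the dotted-decimal form of addr into buf (at least 16 bytes).
 * Returns the number of characters written, not counting the NUL. */
int ipv4_to_dotted(uint32_t addr, char *buf, size_t len)
{
    assert(buf != NULL && len >= 16);
    return snprintf(buf, len, "%u.%u.%u.%u",
                    (unsigned)(addr >> 24) & 0xFFu,
                    (unsigned)(addr >> 16) & 0xFFu,
                    (unsigned)(addr >> 8) & 0xFFu,
                    (unsigned)addr & 0xFFu);
}
```

Called with 0x8CFC0D21, `ipv4_to_dotted` produces the example address from the text and `ipv4_class` reports class B, because 140 falls in the range 128 to 191.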
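It is worth repeating that a multihomed host has multiple IP addresses, one for each interface.
Because every interface on an internet must have a unique IP address, there must be an authority that allocates IP addresses to the networks connected to the Internet. That authority is the Internet Network Information Center, called the InterNIC. The InterNIC assigns only network IDs; the assignment of host IDs is left to the system administrator.
Registration services for the Internet (IP addresses and DNS domain names) used to be handled by the NIC, at nic.ddn.mil. On April 1, 1993, the InterNIC was created. Now the NIC handles registration requests only for the Defense Data Network, and all other Internet users use the InterNIC registration services at rs.internic.net.
The InterNIC actually has three parts: registration services (rs.internic.net), directory and database services (ds.internic.net), and information services (is.internic.net). See Exercise 1.8 for additional information on the InterNIC.
There are three types of IP addresses: unicast (destined for a single host), broadcast (destined for all hosts on a given network), and multicast (destined for all hosts in the same group). Chapters 12 and 13 look at broadcasting and multicasting in more detail.
Section 3.4 extends the description of IP addresses to subnetting, after IP routing has been introduced. Figure 3-9 shows several special-case IP addresses: host IDs and network IDs of all zero bits or all one bits.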
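When an application sends data using TCP, the data is sent down the protocol stack, through each layer, until it is sent as a stream of bits across the network. Each layer adds header information (and sometimes trailer information) to the data it receives, as shown in Figure 1-7. The unit of data that TCP passes to IP is called a TCP segment. The unit of data that IP passes to the network interface layer is called an IP datagram. The stream of bits that flows across the Ethernet is called a frame. The numbers below the frame header and trailer in Figure 1-7 are the typical sizes of those fields in bytes.
When an Ethernet frame is received at the destination host, the data starts its way up the protocol stack, and the headers added by each layer are removed along the way. Each protocol box examines the protocol identifier in its header to determine which upper-layer protocol receives the data. This process is called demultiplexing; Figure 1-8 shows how it takes place. [TCP/IP Illustrated, Volume 1]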
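The two demultiplexing decisions described above can be sketched directly on a raw frame buffer. This is a simplified user-space illustration, not kernel code: the `enum handler` type and `demux_frame` are hypothetical names, the byte offsets assume an untagged Ethernet frame carrying IPv4, and the constants are the well-known type and protocol numbers (0x0800 IP, 0x0806 ARP, 0x8035 RARP; 1 ICMP, 2 IGMP, 6 TCP, 17 UDP).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: which protocol module receives the data. */
enum handler { H_DROP, H_ARP, H_RARP, H_ICMP, H_IGMP, H_TCP, H_UDP };

enum handler demux_frame(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    if (len < 14)                 /* 6 + 6 address bytes + 2 type bytes */
        return H_DROP;

    /* Step 1: the frame type field, carried in big-endian byte order. */
    uint16_t type = (uint16_t)((frame[12] << 8) | frame[13]);

    switch (type) {
    case 0x0806: return H_ARP;
    case 0x8035: return H_RARP;
    case 0x0800: break;           /* IP: look one layer further up */
    default:     return H_DROP;
    }

    if (len < 14 + 20)            /* need at least a minimal IP header */
        return H_DROP;

    /* Step 2: the protocol field, byte 9 of the IP header. */
    switch (frame[14 + 9]) {
    case 1:  return H_ICMP;
    case 2:  return H_IGMP;
    case 6:  return H_TCP;
    case 17: return H_UDP;
    default: return H_DROP;
    }
}
```

A third step, choosing the application by TCP or UDP port number, would follow the same pattern and is left out to keep the sketch short.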
Describes the Ethernet header.
/*
* This is an Ethernet frame header.
*/
struct ethhdr {
unsigned char h_dest[ETH_ALEN]; /* destination eth addr */
unsigned char h_source[ETH_ALEN]; /* source ether addr */
__be16 h_proto; /* packet type ID field */
} __attribute__((packed));
/*
* These are the defined Ethernet Protocol ID's.
*/
#define ETH_P_LOOP 0x0060 /* Ethernet Loopback packet */
#define ETH_P_PUP 0x0200 /* Xerox PUP packet */
#define ETH_P_PUPAT 0x0201 /* Xerox PUP Addr Trans packet */
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
#define ETH_P_X25 0x0805 /* CCITT X.25 */
#define ETH_P_ARP 0x0806 /* Address Resolution packet */
#define ETH_P_BPQ 0x08FF /* G8BPQ AX.25 Ethernet Packet [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_IEEEPUP 0x0a00 /* Xerox IEEE802.3 PUP packet */
#define ETH_P_IEEEPUPAT 0x0a01 /* Xerox IEEE802.3 PUP Addr Trans packet */
#define ETH_P_DEC 0x6000 /* DEC Assigned proto */
#define ETH_P_DNA_DL 0x6001 /* DEC DNA Dump/Load */
#define ETH_P_DNA_RC 0x6002 /* DEC DNA Remote Console */
#define ETH_P_DNA_RT 0x6003 /* DEC DNA Routing */
#define ETH_P_LAT 0x6004 /* DEC LAT */
#define ETH_P_DIAG 0x6005 /* DEC Diagnostics */
#define ETH_P_CUST 0x6006 /* DEC Customer use */
#define ETH_P_SCA 0x6007 /* DEC Systems Comms Arch */
#define ETH_P_TEB 0x6558 /* Trans Ether Bridging */
#define ETH_P_RARP 0x8035 /* Reverse Addr Res packet */
#define ETH_P_ATALK 0x809B /* Appletalk DDP */
#define ETH_P_AARP 0x80F3 /* Appletalk AARP */
#define ETH_P_8021Q 0x8100 /* 802.1Q VLAN Extended Header */
#define ETH_P_IPX 0x8137 /* IPX over DIX */
#define ETH_P_IPV6 0x86DD /* IPv6 over bluebook */
#define ETH_P_PAUSE 0x8808 /* IEEE Pause frames. See 802.3 31B */
#define ETH_P_SLOW 0x8809 /* Slow Protocol. See 802.3ad 43B */
#define ETH_P_WCCP 0x883E /* Web-cache coordination protocol
* defined in draft-wilson-wrec-wccp-v2-00.txt*/
#define ETH_P_PPP_DISC 0x8863 /* PPPoE discovery messages */
#define ETH_P_PPP_SES 0x8864 /* PPPoE session messages */
#define ETH_P_MPLS_UC 0x8847 /* MPLS Unicast traffic */
#define ETH_P_MPLS_MC 0x8848 /* MPLS Multicast traffic */
#define ETH_P_ATMMPOA 0x884c /* MultiProtocol Over ATM */
#define ETH_P_ATMFATE 0x8884 /* Frame-based ATM Transport
* over Ethernet
*/
#define ETH_P_PAE 0x888E /* Port Access Entity (IEEE 802.1X) */
#define ETH_P_AOE 0x88A2 /* ATA over Ethernet */
#define ETH_P_TIPC 0x88CA /* TIPC */
#define ETH_P_1588 0x88F7 /* IEEE 1588 Timesync */
#define ETH_P_FCOE 0x8906 /* Fibre Channel over Ethernet */
#define ETH_P_TDLS 0x890D /* TDLS */
#define ETH_P_FIP 0x8914 /* FCoE Initialization Protocol */
#define ETH_P_EDSA 0xDADA /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID] */
#define ETH_P_AF_IUCV 0xFBFB /* IBM af_iucv [ NOT AN OFFICIALLY REGISTERED ID ]*/
Describes the IP header.
struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
__u8 ihl:4,
version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
__u8 version:4,
ihl:4;
#else
#error "Please fix <asm/byteorder.h>"
#endif
__u8 tos;
__be16 tot_len;
__be16 id;
__be16 frag_off;
__u8 ttl;
__u8 protocol;
__sum16 check;
__be32 saddr;
__be32 daddr;
/* The options start here. */
};
Describes the UDP header.
struct udphdr {
__be16 source;
__be16 dest;
__be16 len;
__sum16 check;
};
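To make the relationship between the IP and UDP headers concrete, the following user-space sketch locates the UDP header inside an IPv4 datagram and reads its four fields. It works on raw bytes rather than on the kernel structures, so that it does not depend on bitfield ordering: on the wire, the version is the high nibble and ihl the low nibble of the first byte. The names `udp_fields`, `read_be16`, and `parse_udp` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-byte-order view of the four udphdr fields. */
struct udp_fields {
    uint16_t source, dest, len, check;
};

/* Read a 16-bit big-endian (network byte order) value. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* ip points at the first byte of an IPv4 datagram of ip_len bytes.
 * Returns 0 and fills *out on success, -1 if this is not UDP over IPv4
 * or the buffer is too short. */
int parse_udp(const uint8_t *ip, size_t ip_len, struct udp_fields *out)
{
    assert(ip != NULL && out != NULL);
    if (ip_len < 20 || (ip[0] >> 4) != 4)      /* version nibble */
        return -1;

    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;   /* ihl counts 32-bit words */
    if (ihl < 20 || ip[9] != 17 || ip_len < ihl + 8)
        return -1;                             /* protocol 17 is UDP */

    const uint8_t *udp = ip + ihl;             /* IP options are skipped */
    out->source = read_be16(udp);
    out->dest   = read_be16(udp + 2);
    out->len    = read_be16(udp + 4);
    out->check  = read_be16(udp + 6);
    return 0;
}
```

The multiplication by 4 is the reason the IP options can simply "start here" after the fixed fields: ihl already accounts for them, so the transport header is always at offset ihl * 4.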
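The kernel protocol stack involves many intricately related data structures; only the source of the structures involved is pasted here. Source code and comments are set in highlighted 10-point type, and important members and methods are marked in bold 11-point type.
Figure 4-1: Layered structure of the kernel protocol stack
Physical device hardware: the actual physical devices. Corresponds to the physical layer.
Device agnostic interface: the device-independent layer. Corresponds to the link layer.
Network protocols: the network protocol layer. Corresponds to the IP layer and the transport layer.
Protocol agnostic interface: the protocol-independent layer. Adapts the system call layer and hides the details of the protocols.
System call interface: the system call layer. Provides system calls to the application layer and hides the details of socket operations.
BSD socket: the BSD socket layer. Provides a uniform interface for socket operations; closely tied to the socket structure.
Inet socket: the inet socket layer. The uniform interface for calling the IP-layer protocols; closely tied to the sock structure.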
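Describes the format of a message passed down from the application layer; it carries the user-space address, the message flags, and other important information.
/*
* As we do 4.4BSD message passing we use a 4.4BSD message passing
* system, not 4.3. Thus msg_accrights(len) are now missing. They
* belong in an obscure libc emulation or the bin.
*/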
struct msghdr {
void * msg_name; /* Socket name */
int msg_namelen; /* Length of name */
struct iovec* msg_iov; /* Data blocks */
__kernel_size_t msg_iovlen; /* Number of blocks */
void * msg_control; /* Per protocol magic (eg BSD file descriptor passing) */
__kernel_size_t msg_controllen; /* Length of cmsg list */
unsigned msg_flags;
};
Describes one user-space buffer: where it starts (iov_base) and how long it is (iov_len).
/*
* Berkeley style UIO structures - Alan Cox 1994.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
*/
struct iovec {
void __user *iov_base; /* BSD uses caddr_t (1003.1g requires void *) */
__kernel_size_t iov_len; /* Must be size_t (1003.1g) */
};
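From user space, the same pair of structures is what an application hands to `sendmsg()` and `recvmsg()`. The sketch below is an assumption-laden illustration rather than anything taken from the kernel source: it uses the POSIX user-space definitions of `struct msghdr` and `struct iovec` (which differ slightly from the kernel-internal ones shown above), a local `AF_UNIX` datagram socket pair, and a hypothetical helper name `gather_roundtrip`. It sends two separate buffers as one message and reads the message back into a single buffer.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send two buffers as one datagram, then receive it into out.
 * Returns the number of bytes received, or -1 on error. */
ssize_t gather_roundtrip(char *out, size_t out_len)
{
    assert(out != NULL);
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0)
        return -1;

    char part1[] = "hello, ";
    char part2[] = "stack";
    struct iovec tx[2] = {
        { .iov_base = part1, .iov_len = strlen(part1) },
        { .iov_base = part2, .iov_len = strlen(part2) },
    };

    struct msghdr msg;
    memset(&msg, 0, sizeof msg);   /* no name, no control data, no flags */
    msg.msg_iov = tx;              /* the "data blocks" */
    msg.msg_iovlen = 2;            /* the "number of blocks" */

    ssize_t n = sendmsg(fds[0], &msg, 0);
    if (n >= 0) {
        struct iovec rx = { .iov_base = out, .iov_len = out_len };
        memset(&msg, 0, sizeof msg);
        msg.msg_iov = &rx;
        msg.msg_iovlen = 1;
        n = recvmsg(fds[1], &msg, 0);
    }
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

The point of the iovec array is that the application never has to copy its pieces into one contiguous buffer; the gathering happens as the message is handed down.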
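Structure describing the attributes of an open file. Each file descriptor refers to one struct file (several descriptors can share the same one, for example after dup() or fork()).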
struct file {
/*
* fu_list becomes invalid after file_free is called and queued via
* fu_rcuhead for RCU freeing
*/
union {
struct list_head fu_list;
struct rcu_head fu_rcuhead;
} f_u;
struct path f_path;
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt
const struct file_operations *f_op;
spinlock_t f_lock; /* f_ep_links, f_flags, no IRQ */
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
loff_t f_pos;
struct fown_struct f_owner;
const struct cred *f_cred;
struct file_ra_state f_ra;
u64 f_version;
#ifdef CONFIG_SECURITY
void *f_security;
#endif
/* needed for tty driver, and maybe others */
void *private_data;
#ifdef CONFIG_EPOLL
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_head f_ep_links;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
#ifdef CONFIG_DEBUG_WRITECOUNT
unsigned long f_mnt_write_state;
#endif
};
The file-operations structure, covering read(), write(), open(), ioctl(), and so on.
/*
* NOTE:
* read, write, poll, fsync, readv, writev, unlocked_ioctl and compat_ioctl
* can be called without the big kernel lock held in all filesystems.
*/
struct file_operations {
struct module *owner;
loff_t (*llseek)(struct file *, loff_t, int);
ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read)(struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*aio_write)(struct kiocb *, const struct iovec *, unsigned long, loff_t);
int (*readdir)(struct file *, void *, filldir_t);
unsigned int (*poll)(struct file *, struct poll_table_struct *);
int (*ioctl)(struct inode *, struct file *, unsigned int, unsigned long);
long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
long (*compat_ioctl)(struct file *, unsigned int, unsigned long);
int (*mmap)(struct file *, struct vm_area_struct *);
int (*open)(struct inode *, struct file *);
int (*flush)(struct file *, fl_owner_t id);
int (*release)(struct inode *, struct file *);
int (*fsync)(struct file *, struct dentry *, int datasync);
int (*aio_fsync)(struct kiocb *, int datasync);
int (*fasync)(int, struct file *, int);
int (*lock)(struct file *, int, struct file_lock *);
ssize_t (*sendpage)(struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock)(struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
};
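A table of function pointers like this is how C code gets the effect of virtual methods: generic code calls through the table without knowing the concrete type behind it. The following is a much-reduced user-space sketch of that pattern. Every name in it (`demo_file`, `demo_file_ops`, `mem_ops`, `ro_ops`, and so on) is hypothetical, and the in-memory "file" only stands in for whatever a real implementation would back the operations with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_file;

/* Hypothetical, much-reduced analogue of struct file_operations. */
struct demo_file_ops {
    long (*read)(struct demo_file *f, char *buf, size_t len);
    long (*write)(struct demo_file *f, const char *buf, size_t len);
};

/* Hypothetical analogue of struct file: state plus a pointer to the table. */
struct demo_file {
    const struct demo_file_ops *f_op;
    char data[64];
    size_t used;
};

static long mem_read(struct demo_file *f, char *buf, size_t len)
{
    size_t n = len < f->used ? len : f->used;
    memcpy(buf, f->data, n);
    return (long)n;
}

static long mem_write(struct demo_file *f, const char *buf, size_t len)
{
    size_t n = len < sizeof f->data ? len : sizeof f->data;
    memcpy(f->data, buf, n);
    f->used = n;
    return (long)n;
}

/* A type that does not support an operation leaves its pointer NULL. */
const struct demo_file_ops mem_ops = { .read = mem_read, .write = mem_write };
const struct demo_file_ops ro_ops  = { .read = mem_read };

/* Generic entry points: they know nothing about the concrete file type. */
long demo_read(struct demo_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->f_op != NULL);
    if (f->f_op->read == NULL)
        return -1;                 /* operation not supported */
    return f->f_op->read(f, buf, len);
}

long demo_write(struct demo_file *f, const char *buf, size_t len)
{
    assert(f != NULL && f->f_op != NULL);
    if (f->f_op->write == NULL)
        return -1;                 /* operation not supported */
    return f->f_op->write(f, buf, len);
}
```

A socket fits into this scheme as just another kind of open file whose table points at socket-specific functions, which is why read() and write() also work on socket descriptors.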
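The BSD socket structure exposed to the application layer. It is protocol-independent, and its main role is to give the application layer a uniform set of socket operations. (BSD: Berkeley Software Distribution.)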
/**
* struct socket - general BSD socket
* @state: socket state (%SS_CONNECTED, etc)
* @type: socket type (%SOCK_STREAM, etc)
* @flags: socket flags (%SOCK_ASYNC_NOSPACE, etc)
* @ops: protocol specific socket operations
* @fasync_list: Asynchronous wake up list
* @file: File back pointer for gc
* @sk: internal networking protocol agnostic socket representation
* @wait: wait queue for several uses
*/
struct socket {
socket_state state;
kmemcheck_bitfield_begin(type);
short type;
kmemcheck_bitfield_end(type);
unsigned long flags;
/*
* Please keep fasync_list & wait fields in the same cache line
*/
struct fasync_struct *fasync_list;
wait_queue_head_t wait;
struct file *file;
struct sock *sk;
const struct proto_ops *ops;
};
typedef enum {
SS_FREE = 0, /* not allocated */
SS_UNCONNECTED, /* unconnected to any socket */
SS_CONNECTING, /* in process of connecting */
SS_CONNECTED, /* connected to socket */
SS_DISCONNECTING /* in process of disconnecting */
} socket_state;
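The network-layer sock (think of it as a C++ base class). It defines the protocol-independent operations and is the uniform structure at the network layer; the transport layer builds inet_sock (think of it as a C++ derived class) on top of it.
/**
* struct sock - network layer representation of sockets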
* @__sk_common:shared layout with inet_timewait_sock
* @sk_shutdown:mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
* @sk_userlocks:%SO_SNDBUF and %SO_RCVBUF settings
* @sk_lock: synchronizer
* @sk_rcvbuf:size of receive buffer in bytes
* @sk_sleep:sock wait queue
* @sk_dst_cache:destination cache
* @sk_dst_lock:destination cache lock
* @sk_policy:flow policy
* @sk_rmem_alloc:receive queue bytes committed
* @sk_receive_queue:incoming packets
* @sk_wmem_alloc:transmit queue bytes committed
* @sk_write_queue:Packet sending queue
* @sk_async_wait_queue:DMA copied packets
* @sk_omem_alloc:"o" is "option" or "other"
* @sk_wmem_queued:persistent queue size
* @sk_forward_alloc:space allocated forward
* @sk_allocation:allocation mode
* @sk_sndbuf:size of send buffer in bytes
* @sk_flags:%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @sk_no_check:%SO_NO_CHECK setting, whether or not checkup packets
* @sk_route_caps:route capabilities (e.g. %NETIF_F_TSO)
* @sk_gso_type:GSO type (e.g. %SKB_GSO_TCPV4)
* @sk_gso_max_size:Maximum GSO segment size to build
* @sk_lingertime:%SO_LINGER l_linger setting
* @sk_backlog:always used with the per-socket spinlock held
* @sk_callback_lock:used with the callbacks in the end of this struct
* @sk_error_queue:rarely used
* @sk_prot_creator:sk_prot of original sock creator (see ipv6_setsockopt,
* IPV6_ADDRFORM for instance)
* @sk_err:last error
* @sk_err_soft:errors that don't cause failure but are the cause of a
* persistent failure not just 'timed out'
* @sk_drops:raw/udp drops counter
* @sk_ack_backlog:current listen backlog
* @sk_max_ack_backlog:listen backlog set in listen()
* @sk_priority:%SO_PRIORITY setting
* @sk_type:socket type (%SOCK_STREAM, etc)
* @sk_protocol:which protocol this socket belongs in this network family
* @sk_peercred:%SO_PEERCRED setting
* @sk_rcvlowat:%SO_RCVLOWAT setting
* @sk_rcvtimeo:%SO_RCVTIMEO setting
* @sk_sndtimeo:%SO_SNDTIMEO setting
* @sk_filter:socket filtering instructions
* @sk_protinfo:private area, net family specific, when not using slab
* @sk_timer:sock cleanup timer
* @sk_stamp:time stamp of last packet received
* @sk_socket:Identd and reporting IO signals
* @sk_user_data:RPC layer private data
* @sk_sndmsg_page:cached page for sendmsg
* @sk_sndmsg_off:cached offset for sendmsg
* @sk_send_head:front of stuff to transmit
* @sk_security:used by security modules
* @sk_mark:generic packet mark
* @sk_write_pending:a write to stream socket waits to start
* @sk_state_change:callback to indicate change in the state of the sock
* @sk_data_ready:callback to indicate there is data to be processed
* @sk_write_space:callback to indicate there is bf sending space available
* @sk_error_report:callback to indicate errors (e.g. %MSG_ERRQUEUE)
* @sk_backlog_rcv:callback to process the backlog
* @sk_destruct:called at sock freeing time, i.e. when all refcnt == 0
*/
struct sock {
/*
* Now struct inet_timewait_sock also uses sock_common, so please just
* don't add nothing before this first member (__sk_common) --acme
*/
struct sock_common __sk_common;
#define sk_node __sk_common.skc_node
#define sk_nulls_node __sk_common.skc_nulls_node
#define sk_refcnt __sk_common.skc_refcnt
#define sk_copy_start __sk_common.skc_hash
#define sk_hash __sk_common.skc_hash
#define sk_family __sk_common.skc_family
#define sk_state __sk_common.skc_state
#define sk_reuse __sk_common.skc_reuse
#define sk_bound_dev_if __sk_common.skc_bound_dev_if
#define sk_bind_node __sk_common.skc_bind_node
#define sk_prot __sk_common.skc_prot
#define sk_net __sk_common.skc_net
kmemcheck_bitfield_begin(flags);
unsigned int sk_shutdown : 2,
sk_no_check :2,
sk_userlocks :4,
sk_protocol :8,
sk_type :16;
kmemcheck_bitfield_end(flags);
int sk_rcvbuf;
socket_lock_t sk_lock;
/*
* The backlog queue is special, it is always used with
* the per-socket spinlock held and requires low latency
* access. Therefore we special case it's implementation.
*/
struct {
struct sk_buff *head;
struct sk_buff *tail;
} sk_backlog;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
#ifdef CONFIG_XFRM
struct xfrm_policy *sk_policy[2];
#endif
rwlock_t sk_dst_lock;
atomic_t sk_rmem_alloc;
atomic_t sk_wmem_alloc;
atomic_t sk_omem_alloc;
int sk_sndbuf;
struct sk_buff_head sk_receive_queue;
struct sk_buff_head sk_write_queue;
#ifdef CONFIG_NET_DMA
struct sk_buff_head sk_async_wait_queue;
#endif
int sk_wmem_queued;
int sk_forward_alloc;
gfp_t sk_allocation;
int sk_route_caps;
int sk_gso_type;
unsigned int sk_gso_max_size;
int sk_rcvlowat;
unsigned long sk_flags;
unsigned long sk_lingertime;
struct sk_buff_head sk_error_queue;
struct proto *sk_prot_creator;
rwlock_t sk_callback_lock;
int sk_err,
sk_err_soft;
atomic_t sk_drops;
unsigned short sk_ack_backlog;
unsigned short sk_max_ack_backlog;
__u32 sk_priority;
struct ucred sk_peercred;
long sk_rcvtimeo;
long sk_sndtimeo;
struct sk_filter *sk_filter;
void *sk_protinfo;
struct timer_list sk_timer;
ktime_t sk_stamp;
struct socket *sk_socket;
void *sk_user_data;
struct page *sk_sndmsg_page;
struct sk_buff *sk_send_head;
__u32 sk_sndmsg_off;
int sk_write_pending;
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
__u32 sk_mark;
u32 sk_classid;
void (*sk_state_change)(struct sock*sk);
void (*sk_data_ready)(struct sock*sk,int bytes);
void (*sk_write_space)(struct sock*sk);
void (*sk_error_report)(struct sock*sk);
int (*sk_backlog_rcv)(struct sock*sk,
struct sk_buff*skb);
void (*sk_destruct)(struct sock*sk);
};
The minimal network-layer representation of a socket.
/**
* struct sock_common - minimal network layer representation of sockets
* @skc_node:main hash linkage for various protocol lookup tables
* @skc_nulls_node:main hash linkage for UDP/UDP-Lite protocol
* @skc_refcnt:reference count
* @skc_hash:hash value used with various protocol lookup tables
* @skc_family:network address family
* @skc_state:Connection state
* @skc_reuse:%SO_REUSEADDR setting
* @skc_bound_dev_if:bound device index if != 0
* @skc_bind_node:bind hash linkage for various protocol lookup tables
* @skc_prot:protocol handlers inside a network family
* @skc_net:reference to the network namespace of this socket
*
* This is the minimal network layer representation of sockets, the header
* for struct sock and struct inet_timewait_sock.
*/
struct sock_common {
/*
* first fields are not copied in sock_copy()
*/
union {
struct hlist_node skc_node;
struct hlist_nulls_node skc_nulls_node;
};
atomic_t skc_refcnt;
unsigned int skc_hash;
unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse;
int skc_bound_dev_if;
struct hlist_node skc_bind_node;
struct proto *skc_prot;
#ifdef CONFIG_NET_NS
struct net *skc_net;
#endif
};
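The inet_sock structure is an extension of sock. It is the common structure that represents transport-layer sockets of the inet protocol family on top of the network layer.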
/** struct inet_sock - representation of INET sockets
*
* @sk - ancestor class
* @pinet6 - pointer to IPv6 control block
* @daddr - Foreign IPv4 addr
* @rcv_saddr - Bound local IPv4 addr
* @dport - Destination port
* @num - Local port
* @saddr - Sending source
* @uc_ttl - Unicast TTL
* @sport - Source port
* @id - ID counter for DF pkts
* @tos - TOS
* @mc_ttl - Multicasting TTL
* @is_icsk - is this an inet_connection_sock?
* @mc_index - Multicast device index
* @mc_list - Group array
* @cork - info to build ip hdr on each ip frag while socket is corked
*/
struct inet_sock {
/* sk and pinet6 has to be the first two members of inet_sock */
struct sock sk;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
struct ipv6_pinfo *pinet6;
#endif
/* Socket demultiplex comparisons on incoming packets. */
__be32 daddr;
__be32 rcv_saddr;
__be16 dport;
__u16 num;
__be32 saddr;
__s16 uc_ttl;
__u16 cmsg_flags;
struct ip_options *opt;
__be16 sport;
__u16 id;
__u8 tos;
__u8 mc_ttl;
__u8 pmtudisc;
__u8 recverr:1,
is_icsk:1,
freebind:1,
hdrincl:1,
mc_loop:1,
transparent:1,
mc_all:1;
int mc_index;
__be32 mc_addr;
struct ip_mc_socklist *mc_list;
struct {
unsigned int flags;
unsigned int fragsize;
struct ip_options*opt;
struct dst_entry *dst;
int length; /* Total length of all frames */
__be32 addr;
struct flowi fl;
} cork;
};
The UDP-specific sock structure at the transport layer, an extension of inet_sock.
struct udp_sock {
/* inet_sock has to be the first member */
struct inet_sock inet;
int pending; /* Any pending frames ? */
unsigned int corkflag; /* Cork is required */
__u16 encap_type; /* Is this an Encapsulation socket? */
/*
* Following member retains the information to create a UDP header
* when the socket is uncorked.
*/
__u16 len; /* total length of pending frames */
/*
* Fields specific to UDP-Lite.
*/
__u16 pcslen;
__u16 pcrlen;
/* indicator bits used by pcflag: */
#define UDPLITE_BIT 0x1 /* set by udplite proto init function */
#define UDPLITE_SEND_CC 0x2 /* set via udplite setsockopt */
#define UDPLITE_RECV_CC 0x4 /* set via udplite setsockopt */
__u8 pcflag; /* marks socket as UDP-Lite if > 0 */
__u8 unused[3];
/*
* For encapsulation sockets.
*/
int (*encap_rcv)(struct sock*sk,struct sk_buff *skb);
};
The interface from the BSD socket layer to the inet_sock layer; it mainly operates on the socket structure.
struct proto_ops {
int family;
struct module *owner;
int (*release)(struct socket *sock);
int (*bind)(struct socket *sock,
struct sockaddr *myaddr,
int sockaddr_len);
int (*connect)(struct socket *sock,
struct sockaddr *vaddr,
int sockaddr_len, int flags);
int (*socketpair)(struct socket *sock1,
struct socket *sock2);
int (*accept)(struct socket *sock,
struct socket *newsock, int flags);
int (*getname)(struct socket *sock,
struct sockaddr *addr,
int *sockaddr_len, int peer);
unsigned int (*poll)(struct file *file, struct socket *sock,
struct poll_table_struct *wait);
int (*ioctl)(struct socket *sock, unsigned int cmd,
unsigned long arg);
int (*compat_ioctl)(struct socket *sock, unsigned int cmd,
unsigned long arg);
int (*listen)(struct socket *sock, int len);
int (*shutdown)(struct socket *sock, int flags);
int (*setsockopt)(struct socket *sock, int level,
int optname, char __user *optval, unsigned int optlen);
int (*getsockopt)(struct socket *sock, int level,
int optname, char __user *optval, int __user *optlen);
int (*compat_setsockopt)(struct socket *sock, int level,
int optname, char __user *optval, unsigned int optlen);
int (*compat_getsockopt)(struct socket *sock, int level,
int optname, char __user *optval, int __user *optlen);
int (*sendmsg)(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len);
int (*recvmsg)(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len,
int flags);
int (*mmap)(struct file *file, struct socket *sock,
struct vm_area_struct *vma);
ssize_t (*sendpage)(struct socket *sock, struct page *page,
int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
};
The uniform interface from the inet_sock layer to the transport layer; it mainly operates on the sock structure.
/* Networking protocol blocks we attach to sockets.
* socket layer -> transport layer interface
* transport -> network interface is defined by struct inet_proto
*/
struct proto {
void (*close)(struct sock*sk,
long timeout);
int (*connect)(struct sock*sk,
struct sockaddr*uaddr,
int addr_len);
int (*disconnect)(struct sock*sk,int flags);
struct sock * (*accept)(struct sock*sk,int flags,int*err);
int (*ioctl)(struct sock *sk, int cmd,
unsigned long arg);
int (*init)(struct sock *sk);
void (*destroy)(struct sock *sk);
void (*shutdown)(struct sock *sk, int how);
int (*setsockopt)(struct sock *sk, int level,
int optname, char __user *optval,
unsigned int optlen);
int (*getsockopt)(struct sock *sk, int level,
int optname, char __user *optval,
int __user *option);
#ifdef CONFIG_COMPAT
int (*compat_setsockopt)(struct sock *sk,
int level,
int optname, char __user *optval,
unsigned int optlen);
int (*compat_getsockopt)(struct sock *sk,
int level,
int optname, char __user *optval,
int __user *option);
#endif
int (*sendmsg)(struct kiocb*iocb,struct sock *sk,
struct msghdr*msg, size_t len);
int (*recvmsg)(struct kiocb*iocb,struct sock *sk,
struct msghdr*msg,
size_t len,int noblock,int flags,
int *addr_len);
int (*sendpage)(struct sock*sk,struct page *page,
int offset, size_t size,int flags);
int (*bind)(struct sock*sk,
struct sockaddr*uaddr,int addr_len);
int (*backlog_rcv)(struct sock*sk,
struct sk_buff*skb);
/* Keeping track of sk's, looking them up, and port selection methods. */
void (*hash)(struct sock *sk);
void (*unhash)(struct sock *sk);
int (*get_port)(struct sock *sk, unsigned short snum);
/* Keeping track of sockets in use */
#ifdef CONFIG_PROC_FS
unsigned int inuse_idx;
#endif
/* Memory pressure */
void (*enter_memory_pressure)(struct sock*sk);
atomic_t *memory_allocated; /* Current allocated memory. */
struct percpu_counter *sockets_allocated; /* Current number of sockets. */
/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
* All the __sk_mem_schedule() is of this nature: accounting
* is strict, actions are advisory and have some latency.
*/
int *memory_pressure;
int *sysctl_mem;
int *sysctl_wmem;
int *sysctl_rmem;
int max_header;
struct kmem_cache *slab;
unsigned int obj_size;
int slab_flags;
struct percpu_counter *orphan_count;
struct request_sock_ops *rsk_prot;
struct timewait_sock_ops *twsk_prot;
union {
struct inet_hashinfo *hashinfo;
struct udp_table *udp_table;
struct raw_hashinfo *raw_hash;
} h;
struct module *owner;
char name[32];
struct list_head node;
#ifdef SOCK_REFCNT_DEBUG
atomic_t socks;
#endif
};
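Used to identify and register a protocol family; common protocol families are IPv4 and IPv6.
Protocol family: a set of protocols that together provide some particular functionality.
struct net_proto_family {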
int family;
int (*create)(struct net*net,struct socket *sock,
int protocol,int kern);
struct module *owner;
};
The kernel declares a large number of protocol families, but not every one of them is actually supported.
/* Supported address families. */
#define AF_UNSPEC 0
#define AF_UNIX 1 /* Unix domain sockets */
#define AF_LOCAL 1 /* POSIX name for AF_UNIX */
#define AF_INET 2 /* Internet IP Protocol */
#define AF_AX25 3 /* Amateur Radio AX.25 */
#define AF_IPX 4 /* Novell IPX */
#define AF_APPLETALK 5 /* AppleTalk DDP */
#define AF_NETROM 6 /* Amateur Radio NET/ROM */
#define AF_BRIDGE 7 /* Multiprotocol bridge */
#define AF_ATMPVC 8 /* ATM PVCs */
#define AF_X25 9 /* Reserved for X.25 project */
#define AF_INET6 10 /* IP version 6 */
#define AF_ROSE 11 /* Amateur Radio X.25 PLP */
#define AF_DECnet 12 /* Reserved for DECnet project */
#define AF_NETBEUI 13 /* Reserved for 802.2LLC project */
#define AF_SECURITY 14 /* Security callback pseudo AF */
#define AF_KEY 15 /* PF_KEY key management API */
#define AF_NETLINK 16
#define AF_ROUTE AF_NETLINK /* Alias to emulate 4.4BSD */
#define AF_PACKET 17 /* Packet family */
#define AF_ASH 18 /* Ash */
#define AF_ECONET 19 /* Acorn Econet */
#define AF_ATMSVC 20 /* ATM SVCs */
#define AF_RDS 21 /* RDS sockets */
#define AF_SNA 22 /* Linux SNA Project (nutters!) */
#define AF_IRDA 23 /* IRDA sockets */
#define AF_PPPOX 24 /* PPPoX sockets */
#define AF_WANPIPE 25 /* Wanpipe API Sockets */
#define AF_LLC 26 /* Linux LLC */
#define AF_CAN 29 /* Controller Area Network */
#define AF_TIPC 30 /* TIPC sockets */
#define AF_BLUETOOTH 31 /* Bluetooth sockets */
#define AF_IUCV 32 /* IUCV sockets */
#define AF_RXRPC 33 /* RxRPC sockets */
#define AF_ISDN 34 /* mISDN sockets */
#define AF_PHONET 35 /* Phonet sockets */
#define AF_IEEE802154 36 /* IEEE802154 sockets */
#define AF_MAX 37 /* For now.. */
static const struct net_proto_family *net_families[NPROTO];
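The difference between a family being declared (it has an AF_ number) and being supported (something has registered a create function for it) can be sketched with a table indexed by family number. This is a simplified user-space model, not the kernel's registration code: all `demo_` names are hypothetical, locking and module reference counting are omitted, and the table size 37 simply mirrors AF_MAX from the list above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NPROTO 37              /* assumed table size, mirrors AF_MAX */

struct demo_socket { int family; int protocol; };

/* Hypothetical, reduced analogue of struct net_proto_family. */
struct demo_proto_family {
    int family;
    int (*create)(struct demo_socket *sock, int protocol);
};

static const struct demo_proto_family *demo_families[DEMO_NPROTO];

/* Register a family; fails if the slot is out of range or already taken. */
int demo_sock_register(const struct demo_proto_family *ops)
{
    assert(ops != NULL && ops->create != NULL);
    if (ops->family < 0 || ops->family >= DEMO_NPROTO)
        return -1;
    if (demo_families[ops->family] != NULL)
        return -1;
    demo_families[ops->family] = ops;
    return 0;
}

/* What creating a socket of a given family boils down to in this sketch. */
int demo_sock_create(int family, int protocol, struct demo_socket *sock)
{
    assert(sock != NULL);
    if (family < 0 || family >= DEMO_NPROTO || demo_families[family] == NULL)
        return -1;                  /* declared, but nothing registered */
    sock->family = family;
    return demo_families[family]->create(sock, protocol);
}

static int demo_inet_create(struct demo_socket *sock, int protocol)
{
    sock->protocol = protocol;      /* a real family would pick its ops here */
    return 0;
}

/* Family number 2 corresponds to AF_INET in the list above. */
const struct demo_proto_family demo_inet_family = {
    .family = 2,
    .create = demo_inet_create,
};
```

In this model, asking for a family whose slot is still empty fails immediately, which is the behaviour the text describes for declared but unsupported families.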
The kernel allocates one softnet_data structure for each CPU.
Each CPU therefore has its own queue for receiving packets.
/*
* Incoming packets are placed on per-cpu queues so that
* no locking is needed.
*/
struct softnet_data {
struct Qdisc *output_queue;
struct list_head poll_list;
struct sk_buff *completion_queue;
/* Elements below can be accessed between CPUs for RPS */
struct call_single_data csd ____cacheline_aligned_in_smp;
unsigned int input_queue_head;
struct sk_buff_head input_pkt_queue;
struct napi_struct backlog;
};
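Describes the attributes of a frame: the socket that owns it, its arrival time, the device it arrived on, the headers of each layer, the route entry for the next hop, the frame length, the checksum, and so on.
/**
* struct sk_buff - socket buffer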
* @next:Next buffer in list
* @prev:Previous buffer in list
* @sk:Socket we are owned by
* @tstamp:Time we arrived
* @dev:Device we arrived on/are leaving by
* @transport_header:Transport layer header
* @network_header:Network layer header
* @mac_header:Link layer header
* @_skb_dst:destination entry
* @sp:the security path, used for xfrm
* @cb:Control buffer. Free for use by every layer. Put private vars here
* @len:Length of actual data
* @data_len:Data length
* @mac_len:Length of link layer header
* @hdr_len:writable header length of cloned skb
* @csum:Checksum (must include start/offset pair)
* @csum_start:Offset from skb->head where checksumming should start
* @csum_offset:Offset from csum_start where checksum should be stored
* @local_df:allow local fragmentation
* @cloned:Head may be cloned (check refcnt to be sure)
* @nohdr:Payload reference only, must not modify header
* @pkt_type:Packet class
* @fclone:skbuff clone status
* @ip_summed:Driver fed us an IP checksum
* @priority:Packet queueing priority
* @users:User count - see {datagram,tcp}.c
* @protocol:Packet protocol from driver
* @truesize:Buffer size
* @head:Head of buffer
* @data:Data head pointer
* @tail:Tail pointer
* @end:End pointer
* @destructor:Destruct function
* @mark:Generic packet mark
* @nfct:Associated connection, if any
* @ipvs_property:skbuff is owned by ipvs
* @peeked:this packet has been seen already, so stats have been
* done for it, don't do them again
* @nf_trace:netfilter packet trace flag
* @nfctinfo:Relationship of this skb to the connection
* @nfct_reasm:netfilter conntrack re-assembly pointer
* @nf_bridge:Saved data about a bridged frame - see br_netfilter.c
* @iif:ifindex of device we arrived on
* @queue_mapping:Queue mapping for multiqueue devices
* @tc_index:Traffic control index
* @tc_verd:traffic control verdict
* @ndisc_nodetype:router type (from link layer)
* @dma_cookie:a cookie to one of several possible DMA operations
* done by skb DMA functions
* @secmark:security marking
* @vlan_tci:vlan tag control information
*/
struct sk_buff {
/* These two members must be first. */
struct sk_buff *next;
struct sk_buff *prev;
struct sock *sk;
ktime_t tstamp;
struct net_device*dev;
unsigned long _skb_dst;
#ifdef CONFIG_XFRM
struct sec_path *sp;
#endif
/*
* This is the control buffer. It is free to use for every
* layer. Please put your private variables there. If you
* want to keep them across layers you have to do a skb_clone()
* first. This is owned by whoever has the skb queued ATM.
*/
char cb[48];
unsigned int len,
data_len;
__u16 mac_len,
hdr_len;
union {
__wsum csum;
struct {
__u16 csum_start;
__u16 csum_offset;
};
};
__u32 priority;
kmemcheck_bitfield_begin(flags1);
__u8 local_df:1,
cloned:1,
ip_summed:2,
nohdr:1,
nfctinfo:3;
__u8 pkt_type:3,
fclone:2,
ipvs_property:1,
peeked:1,
nf_trace:1;
__be16 protocol:16;
kmemcheck_bitfield_end(flags1);
void (*destructor)(struct sk_buff*skb);
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
struct nf_conntrack *nfct;
struct sk_buff *nfct_reasm;
#endif
#ifdef CONFIG_BRIDGE_NETFILTER
struct nf_bridge_info *nf_bridge;
#endif
int iif;
#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
__u16 tc_verd; /* traffic control verdict */
#endif
#endif
kmemcheck_bitfield_begin(flags2);
__u16 queue_mapping:16;
#ifdef CONFIG_IPV6_NDISC_NODETYPE
__u8 ndisc_nodetype:2,
deliver_no_wcard:1;
#else
__u8 deliver_no_wcard:1;
#endif
#ifndef __GENKSYMS__
__u8 ooo_okay:1;
#endif
kmemcheck_bitfield_end(flags2);
/* 0/13 bit hole */
#ifdef CONFIG_NET_DMA
dma_cookie_t dma_cookie;
#endif
#ifdef CONFIG_NETWORK_SECMARK
__u32 secmark;
#endif
union {
__u32 mark;
__u32 dropcount;
};
__u16 vlan_tci;
#ifndef __GENKSYMS__
__u16 rxhash;
#endif
sk_buff_data_t transport_header;
sk_buff_data_t network_header;
sk_buff_data_t mac_header;
/* These elements must be at the end, see alloc_skb() for details. */
sk_buff_data_t tail;
sk_buff_data_t end;
unsigned char *head,
*data;
unsigned int truesize;
atomic_t users;
};
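The four buffer fields head, data, tail, and end are what let each layer add or strip its header without copying the payload. The sketch below models that with plain pointers. It is a user-space illustration only: the `demo_skb_*` names are hypothetical (chosen to echo the kernel's skb_reserve(), skb_put(), skb_push(), and skb_pull() helpers), and the real structure may store tail and end as offsets rather than pointers, as the sk_buff_data_t type above suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of the four buffer pointers in struct sk_buff. */
struct demo_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;              /* bytes between data and tail */
};

int demo_skb_init(struct demo_skb *skb, size_t size)
{
    assert(skb != NULL);
    skb->head = malloc(size);
    if (skb->head == NULL)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

/* Leave headroom for the headers that lower layers will prepend. */
void demo_skb_reserve(struct demo_skb *skb, size_t n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the caller should write them. */
unsigned char *demo_skb_put(struct demo_skb *skb, size_t n)
{
    assert((size_t)(skb->end - skb->tail) >= n);
    unsigned char *p = skb->tail;
    skb->tail += n;
    skb->len += (unsigned int)n;
    return p;
}

/* Prepend an n-byte header in the headroom (transmit direction). */
unsigned char *demo_skb_push(struct demo_skb *skb, size_t n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += (unsigned int)n;
    return skb->data;
}

/* Strip an n-byte header from the front (receive direction). */
unsigned char *demo_skb_pull(struct demo_skb *skb, size_t n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= (unsigned int)n;
    return skb->data;
}
```

On transmit, a sender would reserve 14 + 20 + 8 bytes of headroom, put the payload, and then push the UDP, IP, and Ethernet headers in turn; on receive, each layer pulls its own header before handing the buffer upward.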
The packet queue structure.
struct sk_buff_head {
/* These two members must be first. */
struct sk_buff *next;
struct sk_buff *prev;
__u32 qlen;
spinlock_t lock;
};
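The comment "These two members must be first" exists because the queue head is linked into the same circular doubly linked list as the buffers it holds. The following is a minimal user-space sketch of that idea, not kernel code: all `demo_*` names are hypothetical, locking is omitted, and instead of casting the head to a buffer pointer as the kernel does, it shares one small link struct between head and packets.

```c
#include <assert.h>
#include <stddef.h>

/* Link fields shared by the queue head and every packet (hypothetical). */
struct demo_links {
    struct demo_links *next;
    struct demo_links *prev;
};

/* Stand-in for struct sk_buff: the links come first, as in the kernel. */
struct demo_pkt {
    struct demo_links node;
    int id;
};

/* Stand-in for struct sk_buff_head, without the spinlock. */
struct demo_pkt_queue {
    struct demo_links head;
    unsigned int qlen;
};

/* An empty queue is a head that points at itself in both directions. */
void demo_queue_init(struct demo_pkt_queue *q)
{
    assert(q != NULL);
    q->head.next = q->head.prev = &q->head;
    q->qlen = 0;
}

/* Append a packet just before the head, i.e. at the tail of the queue. */
void demo_queue_tail(struct demo_pkt_queue *q, struct demo_pkt *p)
{
    struct demo_links *n = &p->node;

    n->next = &q->head;
    n->prev = q->head.prev;
    q->head.prev->next = n;
    q->head.prev = n;
    q->qlen++;
}

/* Remove and return the first packet, or NULL if the queue is empty. */
struct demo_pkt *demo_dequeue(struct demo_pkt_queue *q)
{
    struct demo_links *n = q->head.next;

    if (n == &q->head)
        return NULL;
    q->head.next = n->next;
    n->next->prev = &q->head;
    n->next = n->prev = NULL;
    q->qlen--;
    /* Valid because node is the first member of struct demo_pkt. */
    return (struct demo_pkt *)n;
}
```

Because the head and the packets share the same link layout, insertion and removal need no special case for an empty queue.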
This huge structure describes all of a network device's attributes, data, and related state.
/*
* The DEVICE structure.
* Actually, this whole structure is a big mistake. It mixes I/O
* data with strictly "high-level" data, and it has to know about
* almost every data structure used in the INET module.
*
* FIXME: cleanup struct net_device such that network protocol info
* moves out.
*/
struct net_device
{
/*
* This is the first field of the "visible" part of this structure
* (i.e. as seen by users in the "Space.c" file). It is the name
* of the interface.
*/
char name[IFNAMSIZ];
/* device name hash chain */
struct hlist_node name_hlist;
/* snmp alias */
char *ifalias;
/*
* I/O specific fields
* FIXME: Merge these and struct ifmap into one
*/
unsigned long mem_end; /* shared mem end */
unsigned long mem_start; /* shared mem start */
unsigned long base_addr; /* device I/O address */
unsigned int irq; /* device IRQ number */
/*
* Some hardware also needs these fields, but they are not
* part of the usual set specified in Space.c.
*/
unsigned char if_port; /* Selectable AUI,TP,..*/
unsigned char dma; /* DMA channel */
unsigned long state;
struct list_head dev_list;
struct list_head napi_list;
/* Net device features */
unsigned long features;
#define NETIF_F_SG 1 /* Scatter/gather IO. */
#define NETIF_F_IP_CSUM 2 /* Can checksum TCP/UDP over IPv4. */
#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopback. */
#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */
#define NETIF_F_IPV6_CSUM 16 /* Can checksum TCP/UDP over IPV6 */
#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */
#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */
#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */
#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
#define NETIF_F_GSO 2048 /* Enable software GSO. */
#define NETIF_F_LLTX 4096 /* LockLess TX - deprecated. Please */
/* do not use LLTX in new drivers */
#define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */
#define NETIF_F_GRO 16384 /* Generic receive offload */
#define NETIF_F_LRO 32768 /* large receive offload */
/* the GSO_MASK reserves bits 16 through 23 */
#define NETIF_F_FCOE_CRC (1 << 24) /* FCoE CRC32 */
#define NETIF_F_SCTP_CSUM (1 << 25) /* SCTP checksum offload */
#define NETIF_F_FCOE_MTU (1 << 26) /* Supports max FCoE MTU, 2158 bytes */
#define NETIF_F_NTUPLE (1 << 27) /* N-tuple filters supported */
#define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
#define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */
/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
#define NETIF_F_GSO_MASK 0x00ff0000
#define NETIF_F_TSO (SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
#define NETIF_F_UFO (SKB_GSO_UDP << NETIF_F_GSO_SHIFT)
#define NETIF_F_GSO_ROBUST (SKB_GSO_DODGY << NETIF_F_GSO_SHIFT)
#define NETIF_F_TSO_ECN (SKB_GSO_TCP_ECN << NETIF_F_GSO_SHIFT)
#define NETIF_F_TSO6 (SKB_GSO_TCPV6 << NETIF_F_GSO_SHIFT)
#define NETIF_F_FSO (SKB_GSO_FCOE << NETIF_F_GSO_SHIFT)
#define NETIF_F_ALL_TSO (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_TSO_ECN)
/* List of features with software fallbacks. */
#define NETIF_F_GSO_SOFTWARE (NETIF_F_TSO | NETIF_F_TSO_ECN | \
NETIF_F_TSO6 | NETIF_F_UFO)
#define NETIF_F_GEN_CSUM (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
#define NETIF_F_V4_CSUM (NETIF_F_GEN_CSUM | NETIF_F_IP_CSUM)
#define NETIF_F_V6_CSUM (NETIF_F_GEN_CSUM | NETIF_F_IPV6_CSUM)
#define NETIF_F_ALL_CSUM (NETIF_F_V4_CSUM | NETIF_F_V6_CSUM)
/*
* If one device supports one of these features, then enable them
* for all in netdev_increment_features.
*/
#define NETIF_F_ONE_FOR_ALL (NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ROBUST | \
NETIF_F_SG | NETIF_F_HIGHDMA | \
NETIF_F_FRAGLIST)
/* Interface index. Unique device identifier */
int ifindex;
int iflink;
struct net_device_stats stats;
#ifdef CONFIG_WIRELESS_EXT
/* List of functions to handle Wireless Extensions (instead of ioctl).
* See <net/iw_handler.h> for details. Jean II */
const struct iw_handler_def *wireless_handlers;
/* Instance data managed by the core of Wireless Extensions. */
struct iw_public_data *wireless_data;
#endif
/* Management operations */
const struct net_device_ops *netdev_ops;
const struct ethtool_ops *ethtool_ops;
/* Hardware header description */
const struct header_ops *header_ops;
unsigned int flags; /* interface flags (a la BSD) */
unsigned short gflags;
unsigned short priv_flags; /* Like 'flags' but invisible to userspace. */
unsigned short padded; /* How much padding added by alloc_netdev() */
unsigned char operstate; /* RFC2863 operstate */
unsigned char link_mode; /* mapping policy to operstate */
unsigned mtu; /* interface MTU value */
unsigned short type; /* interface hardware type */
unsigned short hard_header_len; /* hardware hdr length */
/* extra head- and tailroom the hardware may need, but not in all cases
* can this be guaranteed, especially tailroom. Some cases also use
* LL_MAX_HEADER instead to allocate the skb.
*/
unsigned short needed_headroom;
unsigned short needed_tailroom;
struct net_device *master; /* Pointer to master device of a group,
* which this device is member of.
*/
/* Interface address info. */
unsigned char perm_addr[MAX_ADDR_LEN]; /* permanent hw address */
unsigned char addr_assign_type; /* hw address assignment type */
unsigned char addr_len; /* hardware address length */
unsigned short dev_id; /* for shared network cards */
struct netdev_hw_addr_list uc; /* Secondary unicast
mac addresses */
int uc_promisc;
spinlock_t addr_list_lock;
struct dev_addr_list *mc_list; /* Multicast mac addresses */
int mc_count; /* Number of installed mcasts */
unsigned int promiscuity;
unsigned int allmulti;
/* Protocol specific pointers */
#ifdef CONFIG_NET_DSA
void *dsa_ptr; /* dsa specific data */
#endif
void *atalk_ptr; /* AppleTalk link */
void *ip_ptr; /* IPv4 specific data */
void *dn_ptr; /* DECnet specific data */
void *ip6_ptr; /* IPv6 specific data */
void *ec_ptr; /* Econet specific data */
void *ax25_ptr;/* AX.25 specific data
also used by openvswitch */
struct wireless_dev *ieee80211_ptr; /* IEEE 802.11 specific data,
assign before registering */
/*
* Cache line mostly used on receive path (including eth_type_trans())
*/
unsigned long last_rx; /* Time of last Rx */
/* Interface address info used in eth_type_trans() */
unsigned char *dev_addr; /* hw address, (before bcast
because most packets are
unicast) */
struct netdev_hw_addr_list dev_addrs; /* list of device
hw addresses */
unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */
struct netdev_queue rx_queue;
struct netdev_queue *_tx ____cacheline_aligned_in_smp;
/* Number of TX queues allocated at alloc_netdev_mq() time */
unsigned int num_tx_queues;
/* Number of TX queues currently active in device */
unsigned int real_num_tx_queues;
/* root qdisc from userspace point of view */
struct Qdisc *qdisc;
unsigned long tx_queue_len; /* Max frames per queue allowed */
spinlock_t tx_global_lock;
/*
* One part is mostly used on xmit path (device)
*/
/* These may be needed for future network-power-down code. */
/*
* trans_start here is expensive for high speed devices on SMP,
* please use netdev_queue->trans_start instead.
*/
unsigned long trans_start; /* Time (in jiffies) of last Tx */
int watchdog_timeo; /* used by dev_watchdog() */
struct timer_list watchdog_timer;
/* Number of references to this device */
atomic_t refcnt ____cacheline_aligned_in_smp;
/* delayed register/unregister */
struct list_head todo_list;
/* device index hash chain */
struct hlist_node index_hlist;
struct net_device *link_watch_next;
/* register/unregister state machine */
enum { NETREG_UNINITIALIZED=0,
NETREG_REGISTERED, /* completed register_netdevice */
NETREG_UNREGISTERING, /* called unregister_netdevice */
NETREG_UNREGISTERED, /* completed unregister todo */
NETREG_RELEASED, /* called free_netdev */
NETREG_DUMMY, /* dummy device for NAPI poll */
} reg_state;
/* Called from unregister, can be used to call free_netdev */
void (*destructor)(struct net_device *dev);
#ifdef CONFIG_NETPOLL
struct netpoll_info *npinfo;
#endif
#ifdef CONFIG_NET_NS
/* Network namespace this network device is inside */
struct net *nd_net;
#endif
/* mid-layer private */
void *ml_priv;
/* bridge stuff */
struct net_bridge_port *br_port;
/* macvlan */
struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port *garp_port;
/* class/net/name entry */
struct device dev;
/* space for optional statistics and wireless sysfs groups */
const struct attribute_group *sysfs_groups[3];
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
/* VLAN feature mask */
unsigned long vlan_features;
/* for setting kernel sock attribute on TCP connection setup */
#define GSO_MAX_SIZE 65536
unsigned int gso_max_size;
#ifdef CONFIG_DCB
/* Data Center Bridging netlink ops */
const struct dcbnl_rtnl_ops *dcbnl_ops;
#endif
#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
/* max exchange id for FCoE LRO by ddp */
unsigned int fcoe_ddp_xid;
#endif
};
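The `features` field of `struct net_device` is a plain bitmask, and the segmentation-offload flags are the `SKB_GSO_*` bits shifted left by `NETIF_F_GSO_SHIFT`. The sketch below shows how such a mask can be tested and decoded in user space. The `DEMO_*` names are hypothetical copies of the values in the listing, and the value `1 << 0` for the TCPv4 GSO type is an assumption about `SKB_GSO_TCPV4`, which the listing does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical copies of a few NETIF_F_* values from the listing. */
#define DEMO_F_SG         1UL
#define DEMO_F_IP_CSUM    2UL
#define DEMO_F_NO_CSUM    4UL
#define DEMO_F_HW_CSUM    8UL
#define DEMO_F_GSO_SHIFT  16
#define DEMO_F_GSO_MASK   0x00ff0000UL

/* Assumed value of SKB_GSO_TCPV4; not shown in the listing. */
#define DEMO_GSO_TCPV4    (1UL << 0)

#define DEMO_F_TSO        (DEMO_GSO_TCPV4 << DEMO_F_GSO_SHIFT)
#define DEMO_F_GEN_CSUM   (DEMO_F_NO_CSUM | DEMO_F_HW_CSUM)
#define DEMO_F_V4_CSUM    (DEMO_F_GEN_CSUM | DEMO_F_IP_CSUM)

/* True if any flag says IPv4 checksums need not be computed in software. */
bool demo_can_csum_ipv4(unsigned long features)
{
    return (features & DEMO_F_V4_CSUM) != 0;
}

/* Extract the GSO type bits (bits 16..23) back into SKB_GSO_* form. */
unsigned long demo_gso_types(unsigned long features)
{
    assert(DEMO_F_GSO_MASK == (0xffUL << DEMO_F_GSO_SHIFT));
    return (features & DEMO_F_GSO_MASK) >> DEMO_F_GSO_SHIFT;
}
```

For example, a device advertising `DEMO_F_SG | DEMO_F_IP_CSUM | DEMO_F_TSO` passes the checksum test, and `demo_gso_types()` returns `DEMO_GSO_TCPV4` for it.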
Interface for registering the socket layer's call operations with the IP layer
/* This is used to register socket interfaces for IP protocols. */
struct inet_protosw {
struct list_head list;
/* These two fields form the lookup key. */
unsigned short type; /* This is the 2nd argument to socket(2). */
unsigned short protocol; /* This is the L4 protocol number. */
struct proto *prot;
const struct proto_ops *ops;
char no_check; /* checksum on rcv/xmit/none? */
unsigned char flags; /* See INET_PROTOSW_* below. */
};
All of the operations through which the socket layer calls into the IP layer are registered in this array.
/* Upon startup we insert all the elements in inetsw_array[] into
* the linked list inetsw.
*/
static struct inet_protosw inetsw_array[] =
{
{
.type = SOCK_STREAM,
.protocol= IPPROTO_TCP,
.prot = &tcp_prot,
.ops = &inet_stream_ops,
.no_check= 0,
.flags = INET_PROTOSW_PERMANENT |
INET_PROTOSW_ICSK,
},
{
.type = SOCK_DGRAM,
.protocol = IPPROTO_UDP,
.prot = &udp_prot,
.ops = &inet_dgram_ops,
.no_check= UDP_CSUM_DEFAULT,
.flags = INET_PROTOSW_PERMANENT,
},
{
.type = SOCK_DGRAM,
.protocol= IPPROTO_ICMP,
.prot = &ping_prot,
.ops = &inet_dgram_ops,
.no_check= UDP_CSUM_DEFAULT,
.flags = INET_PROTOSW_REUSE,
},
{
.type= SOCK_RAW,
.protocol= IPPROTO_IP, /* wild card */
.prot= &raw_prot,
.ops = &inet_sockraw_ops,
.no_check= UDP_CSUM_DEFAULT,
.flags= INET_PROTOSW_REUSE,
}
};
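The `type` and `protocol` fields form the lookup key, and `IPPROTO_IP` (0) acts as a wildcard on either side. The sketch below models that matching in user space. It is loosely based on my reading of the loop in `inet_create()` and is not the kernel's code; the `demo_*` names and the flat array (instead of per-type linked lists) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_SOCK_STREAM = 1, DEMO_SOCK_DGRAM = 2, DEMO_SOCK_RAW = 3 };
enum { DEMO_IPPROTO_IP = 0, DEMO_IPPROTO_ICMP = 1,
       DEMO_IPPROTO_TCP = 6, DEMO_IPPROTO_UDP = 17 };

/* Hypothetical, reduced form of struct inet_protosw. */
struct demo_protosw {
    int type;
    int protocol;
    const char *name;
};

/* Same four entries, in the same order, as inetsw_array[] in the listing. */
const struct demo_protosw demo_inetsw[] = {
    { DEMO_SOCK_STREAM, DEMO_IPPROTO_TCP,  "tcp"  },
    { DEMO_SOCK_DGRAM,  DEMO_IPPROTO_UDP,  "udp"  },
    { DEMO_SOCK_DGRAM,  DEMO_IPPROTO_ICMP, "ping" },
    { DEMO_SOCK_RAW,    DEMO_IPPROTO_IP,   "raw"  },
};

/* Find the entry serving socket(AF_INET, type, protocol), or NULL. */
const struct demo_protosw *demo_lookup(int type, int protocol)
{
    size_t n = sizeof(demo_inetsw) / sizeof(demo_inetsw[0]);
    size_t i;

    assert(type > 0);
    for (i = 0; i < n; i++) {
        const struct demo_protosw *e = &demo_inetsw[i];

        if (e->type != type)
            continue;
        if (protocol == e->protocol) {
            /* Exact match counts only for a real protocol number. */
            if (protocol != DEMO_IPPROTO_IP)
                return e;
            continue;
        }
        /* Caller passed 0: take the first entry of this type. */
        if (protocol == DEMO_IPPROTO_IP)
            return e;
        /* Table entry is a wildcard: it accepts any protocol. */
        if (e->protocol == DEMO_IPPROTO_IP)
            return e;
    }
    return NULL;
}
```

With this table, `demo_lookup(DEMO_SOCK_DGRAM, 0)` resolves to the UDP entry, and a raw socket with any nonzero protocol number resolves to the wildcard raw entry.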
Socket types
/**
* enum sock_type - Socket types
* @SOCK_STREAM: stream (connection) socket
* @SOCK_DGRAM: datagram (conn.less) socket
* @SOCK_RAW: raw socket
* @SOCK_RDM: reliably-delivered message
* @SOCK_SEQPACKET: sequential packet socket
* @SOCK_DCCP: Datagram Congestion Control Protocol socket
* @SOCK_PACKET: linux specific way of getting packets at the dev level.
* For writing rarp and other similar things on the user level.
*
* When adding some new socket type please
* grep ARCH_HAS_SOCKET_TYPE include/asm-* /socket.h, at least MIPS
* overrides this enum for binary compat reasons.
*/
enum sock_type {
SOCK_STREAM =1,
SOCK_DGRAM = 2,
SOCK_RAW =3,
SOCK_RDM =4,
SOCK_SEQPACKET =5,
SOCK_DCCP =6,
SOCK_PACKET =10,
};
Transport-layer protocol type IDs
/* Standard well-defined IP protocols. */
enum {
IPPROTO_IP = 0, /* Dummy protocol for TCP */
IPPROTO_ICMP = 1, /* Internet Control Message Protocol */
IPPROTO_IGMP = 2, /* Internet Group Management Protocol */
IPPROTO_IPIP = 4, /* IPIP tunnels (older KA9Q tunnels use 94) */
IPPROTO_TCP = 6, /* Transmission Control Protocol */
IPPROTO_EGP = 8, /* Exterior Gateway Protocol */
IPPROTO_PUP = 12, /* PUP protocol */
IPPROTO_UDP = 17, /* User Datagram Protocol */
IPPROTO_IDP = 22, /* XNS IDP protocol */
IPPROTO_DCCP = 33, /* Datagram Congestion Control Protocol */
IPPROTO_RSVP = 46, /* RSVP protocol */
IPPROTO_GRE = 47, /* Cisco GRE tunnels (rfc 1701,1702) */
IPPROTO_IPV6 = 41, /* IPv6-in-IPv4 tunnelling */
IPPROTO_ESP = 50, /* Encapsulation Security Payload protocol */
IPPROTO_AH = 51, /* Authentication Header protocol */
IPPROTO_BEETPH = 94, /* IP option pseudo header for BEET */
IPPROTO_PIM = 103, /* Protocol Independent Multicast */
IPPROTO_COMP = 108, /* Compression Header protocol */
IPPROTO_SCTP = 132, /* Stream Control Transport Protocol */
IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828) */
IPPROTO_RAW = 255, /* Raw IP packets */
IPPROTO_MAX
};
/* The inetsw table contains everything that inet_create needs to
* build a new socket.
*/
static struct list_head inetsw[SOCK_MAX];
Interface through which transport-layer protocols register their receive handlers with the IP layer
/* This is used to register protocols. */
struct net_protocol {
int (*handler)(struct sk_buff *skb);
void (*err_handler)(struct sk_buff *skb, u32 info);
int (*gso_send_check)(struct sk_buff *skb);
struct sk_buff *(*gso_segment)(struct sk_buff *skb,
int features);
struct sk_buff **(*gro_receive)(struct sk_buff **head,
struct sk_buff *skb);
int (*gro_complete)(struct sk_buff *skb);
unsigned int no_policy:1,
netns_ok:1;
};
Example: the interface that UDP registers with the IP layer
static const struct net_protocol udp_protocol = {
.handler = udp_rcv,
.err_handler= udp_err,
.gso_send_check= udp4_ufo_send_check,
.gso_segment= udp4_ufo_fragment,
.no_policy = 1,
.netns_ok = 1,
};
All of the IP layer's receive handlers are registered in this array.
extern const struct net_protocol *inet_protos[MAX_INET_PROTOS];
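The `inet_protos[]` array is indexed by the IP protocol number, so delivering a packet to its transport protocol is a table lookup followed by a call through the `handler` pointer. The sketch below models only that mechanism in user space. The `demo_*` names are hypothetical, the table size of 256 is an assumption about `MAX_INET_PROTOS`, and the packet type is a stand-in for `struct sk_buff`.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_INET_PROTOS 256   /* assumed table size */

/* Minimal stand-in for struct sk_buff. */
struct demo_pkt {
    unsigned char protocol;  /* protocol field from the IP header */
    int len;                 /* payload length */
};

typedef int (*demo_handler_t)(struct demo_pkt *pkt);

/* One slot per protocol number; zero-initialized, so all slots are free. */
demo_handler_t demo_protos[DEMO_MAX_INET_PROTOS];

/* Register a receive handler; fails with -1 if the slot is taken. */
int demo_add_protocol(demo_handler_t handler, unsigned char protocol)
{
    unsigned int hash = protocol & (DEMO_MAX_INET_PROTOS - 1);

    assert(handler != NULL);
    if (demo_protos[hash] != NULL)
        return -1;
    demo_protos[hash] = handler;
    return 0;
}

/* Hand a packet to whichever handler owns its protocol number. */
int demo_deliver(struct demo_pkt *pkt)
{
    demo_handler_t handler =
        demo_protos[pkt->protocol & (DEMO_MAX_INET_PROTOS - 1)];

    if (handler == NULL)
        return -1;           /* no transport protocol registered */
    return handler(pkt);
}

/* Toy "UDP" handler: reports how many bytes it accepted. */
int demo_udp_rcv(struct demo_pkt *pkt)
{
    return pkt->len;
}
```

Registering `demo_udp_rcv` under protocol number 17 and then delivering a packet whose `protocol` is 17 calls the handler; a packet for an unregistered protocol is rejected.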
Structure describing an Ethernet packet type, including the Ethernet frame type and the packet handling functions.
struct packet_type {
__be16 type; /* This is really htons(ether_type). */
struct net_device *dev; /* NULL is wildcarded here */
int (*func) (struct sk_buff*,
struct net_device*,
struct packet_type*,
struct net_device*);
struct sk_buff *(*gso_segment)(struct sk_buff*skb,
int features);
int (*gso_send_check)(struct sk_buff*skb);
struct sk_buff **(*gro_receive)(struct sk_buff**head,
struct sk_buff*skb);
int (*gro_complete)(struct sk_buff*skb);
void *af_packet_priv;
struct list_head list;
};
The packet handling interface that the IP protocol registers with the link layer.
/*
* IP protocol layer initialiser
*/
static struct packet_type ip_packet_type = {
.type = cpu_to_be16(ETH_P_IP),
.func = ip_rcv,
.gso_send_check= inet_gso_send_check,
.gso_segment= inet_gso_segment,
.gro_receive= inet_gro_receive,
.gro_complete= inet_gro_complete,
};
/*
* These are the defined Ethernet Protocol ID's.
*/
#define ETH_P_LOOP 0x0060 /* Ethernet Loopback packet */
#define ETH_P_PUP 0x0200 /* Xerox PUP packet */
#define ETH_P_PUPAT 0x0201 /* Xerox PUP Addr Trans packet */
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
#define ETH_P_X25 0x0805 /* CCITT X.25 */
#define ETH_P_ARP 0x0806 /* Address Resolution packet */
#define ETH_P_BPQ 0x08FF /* G8BPQ AX.25 Ethernet Packet [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_IEEEPUP 0x0a00 /* Xerox IEEE802.3 PUP packet */
#define ETH_P_IEEEPUPAT 0x0a01 /* Xerox IEEE802.3 PUP Addr Trans packet */
#define ETH_P_DEC 0x6000 /* DEC Assigned proto */
#define ETH_P_DNA_DL 0x6001 /* DEC DNA Dump/Load */
#define ETH_P_DNA_RC 0x6002 /* DEC DNA Remote Console */
#define ETH_P_DNA_RT 0x6003 /* DEC DNA Routing */
#define ETH_P_LAT 0x6004 /* DEC LAT */
#define ETH_P_DIAG 0x6005 /* DEC Diagnostics */
#define ETH_P_CUST 0x6006 /* DEC Customer use */
#define ETH_P_SCA 0x6007 /* DEC Systems Comms Arch */
#define ETH_P_TEB 0x6558 /* Trans Ether Bridging */
#define ETH_P_RARP 0x8035 /* Reverse Addr Res packet */
#define ETH_P_ATALK 0x809B /* Appletalk DDP */
#define ETH_P_AARP 0x80F3 /* Appletalk AARP */
#define ETH_P_8021Q 0x8100 /* 802.1Q VLAN Extended Header */
#define ETH_P_IPX 0x8137 /* IPX over DIX */
#define ETH_P_IPV6 0x86DD /* IPv6 over bluebook */
#define ETH_P_PAUSE 0x8808 /* IEEE Pause frames. See 802.3 31B */
#define ETH_P_SLOW 0x8809 /* Slow Protocol. See 802.3ad 43B */
#define ETH_P_WCCP 0x883E /* Web-cache coordination protocol
* defined in draft-wilson-wrec-wccp-v2-00.txt*/
#define ETH_P_PPP_DISC 0x8863 /* PPPoE discovery messages */
#define ETH_P_PPP_SES 0x8864 /* PPPoE session messages */
#define ETH_P_MPLS_UC 0x8847 /* MPLS Unicast traffic */
#define ETH_P_MPLS_MC 0x8848 /* MPLS Multicast traffic */
#define ETH_P_ATMMPOA 0x884c /* MultiProtocol Over ATM */
#define ETH_P_ATMFATE 0x8884 /* Frame-based ATM Transport
* over Ethernet
*/
#define ETH_P_PAE 0x888E /* Port Access Entity (IEEE 802.1X) */
#define ETH_P_AOE 0x88A2 /* ATA over Ethernet */
#define ETH_P_TIPC 0x88CA /* TIPC */
#define ETH_P_1588 0x88F7 /* IEEE 1588 Timesync */
#define ETH_P_FCOE 0x8906 /* Fibre Channel over Ethernet */
#define ETH_P_TDLS 0x890D /* TDLS */
#define ETH_P_FIP 0x8914 /* FCoE Initialization Protocol */
#define ETH_P_EDSA 0xDADA /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID] */
#define ETH_P_AF_IUCV 0xFBFB /* IBM af_iucv [ NOT AN OFFICIALLY REGISTERED ID ]*/
The sets of operations that the network layer registers with the link layer are stored in this data structure.
static struct list_head ptype_base[PTYPE_HASH_SIZE];
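Because `ptype_base[]` has far fewer buckets than there are EtherType values, several protocols can share a bucket, and the receive path has to compare the full type before calling a handler. The sketch below illustrates that with a chained hash table in user space. All `demo_*` names are hypothetical, the bucket count of 16 is an assumption about `PTYPE_HASH_SIZE`, and types are kept in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PTYPE_HASH_SIZE 16                    /* assumed bucket count */
#define DEMO_PTYPE_HASH_MASK (DEMO_PTYPE_HASH_SIZE - 1)
#define DEMO_ETH_P_IP    0x0800
#define DEMO_ETH_P_8021Q 0x8100

/* Reduced stand-in for struct packet_type, chained per bucket. */
struct demo_packet_type {
    unsigned short type;                      /* EtherType, host order */
    int (*func)(unsigned short type, int len);
    struct demo_packet_type *next;
};

struct demo_packet_type *demo_ptype_base[DEMO_PTYPE_HASH_SIZE];

int demo_ip_count;
int demo_vlan_count;

int demo_ip_rcv(unsigned short type, int len)
{
    (void)type; (void)len;
    return ++demo_ip_count;
}

int demo_vlan_rcv(unsigned short type, int len)
{
    (void)type; (void)len;
    return ++demo_vlan_count;
}

/* Rough analogue of dev_add_pack(): push onto the bucket's chain. */
void demo_dev_add_pack(struct demo_packet_type *pt)
{
    unsigned int hash = pt->type & DEMO_PTYPE_HASH_MASK;

    assert(pt->func != NULL);
    pt->next = demo_ptype_base[hash];
    demo_ptype_base[hash] = pt;
}

/* Deliver a frame to every handler registered for exactly this type. */
int demo_netif_receive(unsigned short type, int len)
{
    struct demo_packet_type *pt;
    int delivered = 0;

    for (pt = demo_ptype_base[type & DEMO_PTYPE_HASH_MASK]; pt; pt = pt->next) {
        if (pt->type != type)
            continue;          /* same bucket, different protocol */
        pt->func(type, len);
        delivered++;
    }
    return delivered;
}
```

Here 0x0800 and 0x8100 both hash to bucket 0, so the type comparison inside the loop is what keeps an IP frame away from the VLAN handler.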
Routing table structure. Each struct rtable describes one complete route entry.
struct rtable {
union
{
struct dst_entry dst;
} u;
/* Cache lookup keys */
struct flowi fl;
struct in_device *idev;
int rt_genid;
unsigned rt_flags;
__u16 rt_type;
__be32 rt_dst; /* Path destination */
__be32 rt_src; /* Path source */
int rt_iif;
/* Info on neighbour */
__be32 rt_gateway;
/* Miscellaneous cached information */
__be32 rt_spec_dst; /* RFC1122 specific destination */
struct inet_peer *peer; /* long-living peer info */
};
Route cache
/*
* Route cache.
*/
/* The locking scheme is rather straight forward:
*
* 1) Read-Copy Update protects the buckets of the central route hash.
* 2) Only writers remove entries, and they hold the lock
* as they look at rtable reference counts.
* 3) Only readers acquire references to rtable entries,
* they do so with atomic increments and with the
* lock held.
*/
struct rt_hash_bucket {
struct rtable *chain;
};
The packet destination interface. It describes key routing information such as whether a packet is kept or passed on and what its next hop is.
/* Each dst_entry has reference count and sits in some parent list(s).
* When it is removed from parent list, it is "freed" (dst_free).
* After this it enters dead state (dst->obsolete > 0) and if its refcnt
* is zero, it can be destroyed immediately, otherwise it is added
* to gc list and garbage collector periodically checks the refcnt.
*/
struct dst_entry
{
struct rcu_head rcu_head;
struct dst_entry *child;
struct net_device *dev;
short error;
short obsolete;
int flags;
#define DST_HOST 1
#define DST_NOXFRM 2
#define DST_NOPOLICY 4
#define DST_NOHASH 8
unsigned long expires;
unsigned short header_len; /* more space at head required */
unsigned short trailer_len; /* space to reserve at tail */
unsigned int rate_tokens;
unsigned long rate_last; /* rate limiting for ICMP */
struct dst_entry *path;
struct neighbour *neighbour;
struct hh_cache *hh;
#ifdef CONFIG_XFRM
struct xfrm_state *xfrm;
#else
void *__pad1;
#endif
int (*input)(struct sk_buff*);
int (*output)(struct sk_buff*);
struct dst_ops *ops;
/* This Red Hat kABI workaround will shift tclassid 32 bit, while we
* still keep the original size of dst_entry and assures alignment
* (see further down).
*/
#ifdef __GENKSYMS__
u32 metrics[RTAX_MAX_ORIG];
#else
u32 metrics[RTAX_MAX];
#endif
#ifdef CONFIG_NET_CLS_ROUTE
__u32 tclassid;
#else
__u32 __pad2;
#endif
/*
* Align __refcnt to a 64 bytes alignment
* (L1_CACHE_SIZE would be too much)
*/
/* Red Hat kABI workaround to assure aligning __refcnt, while
* consuming 32 bit of padding for our metrics expansion above.
* On 32bit archs no padding remains.
*/
#ifdef __GENKSYMS__
#ifdef CONFIG_64BIT
long __pad_to_align_refcnt[2];
#else
long __pad_to_align_refcnt[1];
#endif
#else /* __GENKSYMS__ */
#ifdef CONFIG_64BIT
u32 __pad_hole_in_struct;
long __pad_to_align_refcnt[1];
#endif
#endif /*__GENKSYMS__ */
/*
* __refcnt wants to be on a different cacheline from
* input/output/ops or performance tanks badly
*/
atomic_t __refcnt; /* client references */
int __use;
unsigned long lastuse;
union {
struct dst_entry*next;
struct rtable *rt_next;
struct rt6_info *rt6_next;
struct dn_route *dn_next;
};
};
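The `input` and `output` function pointers are what make a `dst_entry` decide a packet's fate: the routing lookup attaches an entry to the packet, and later code simply calls through the pointer. The sketch below shows that pattern for the receive side in user space. Every `demo_*` name is hypothetical, and the single-address "routing decision" is a deliberate simplification.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_LOCAL_DELIVER = 1, DEMO_FORWARD = 2 };

struct demo_skb;

/* Reduced stand-in for struct dst_entry: just the input hook. */
struct demo_dst {
    int (*input)(struct demo_skb *skb);
};

/* Reduced stand-in for struct sk_buff. */
struct demo_skb {
    unsigned int daddr;        /* destination IPv4 address */
    struct demo_dst *dst;      /* set by the routing lookup */
};

int demo_local_deliver(struct demo_skb *skb)
{
    (void)skb;
    return DEMO_LOCAL_DELIVER; /* packet goes up to the transport layer */
}

int demo_forward(struct demo_skb *skb)
{
    (void)skb;
    return DEMO_FORWARD;       /* packet is sent on toward the next hop */
}

struct demo_dst demo_dst_local = { demo_local_deliver };
struct demo_dst demo_dst_fwd = { demo_forward };

/* Toy routing lookup: our own address is local, everything else forwards. */
void demo_route_input(struct demo_skb *skb, unsigned int local_addr)
{
    assert(skb != NULL);
    skb->dst = (skb->daddr == local_addr) ? &demo_dst_local : &demo_dst_fwd;
}

/* The caller does not care which path was chosen; it just calls the hook. */
int demo_dst_input(struct demo_skb *skb)
{
    return skb->dst->input(skb);
}
```

The point of the indirection is that the code calling `demo_dst_input()` stays the same whether the packet ends up delivered locally or forwarded.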
Structure used for NAPI scheduling
NAPI: NAPI is a technique that Linux uses to improve network processing efficiency. Its core idea is not to read every packet from an interrupt. Instead, an interrupt first wakes up the receive service, which then polls for data through a poll method. NAPI is well suited to handling high-rate streams of short packets.
/*
* Structure for NAPI scheduling similar to tasklet but with weighting
*/
struct napi_struct {
/* The poll_list must only be managed by the entity which
* changes the state of the NAPI_STATE_SCHED bit. This means
* whoever atomically sets that bit can add this napi_struct
* to the per-cpu poll_list, and whoever clears that bit
* can remove from the list right before clearing the bit.
*/
struct list_head poll_list;
unsigned long state;
int weight;
int (*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
spinlock_t poll_lock;
int poll_owner;
#endif
unsigned int gro_count;
struct net_device *dev;
struct list_head dev_list;
struct sk_buff *gro_list;
struct sk_buff *skb;
};
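The interplay between the interrupt, the `weight`, and the `poll` callback can be modelled in a few lines. The sketch below is a single-threaded user-space simulation, not driver code: the `demo_*` names are hypothetical, the receive ring is reduced to a counter of pending packets, and the scheduled flag stands in for the `NAPI_STATE_SCHED` bit.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct napi_struct plus a toy device. */
struct demo_napi {
    int weight;        /* max packets handled per poll call */
    int scheduled;     /* stands in for the NAPI_STATE_SCHED bit */
    int irq_enabled;   /* device RX interrupt state */
    int rx_pending;    /* packets waiting in the device ring */
    int (*poll)(struct demo_napi *napi, int budget);
};

/* Poll callback: drain up to budget packets, then decide what is next. */
int demo_poll(struct demo_napi *napi, int budget)
{
    int work = 0;

    while (work < budget && napi->rx_pending > 0) {
        napi->rx_pending--;    /* "receive" one packet */
        work++;
    }
    if (work < budget) {
        /* Ring is empty: leave polling mode, re-enable the interrupt. */
        napi->scheduled = 0;
        napi->irq_enabled = 1;
    }
    return work;
}

/* Interrupt handler: mask further RX interrupts and schedule polling. */
void demo_irq(struct demo_napi *napi)
{
    if (!napi->scheduled) {
        napi->irq_enabled = 0;
        napi->scheduled = 1;
    }
}

/* Softirq-like loop: keep polling while the instance stays scheduled. */
int demo_net_rx_action(struct demo_napi *napi)
{
    int rounds = 0;

    assert(napi->poll != NULL);
    while (napi->scheduled) {
        napi->poll(napi, napi->weight);
        rounds++;
    }
    return rounds;
}
```

With a weight of 64 and 150 pending packets, one interrupt leads to three poll rounds (64, 64, and 22 packets), after which the interrupt is enabled again. A burst therefore costs one interrupt instead of 150.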
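Figure 2: Data structures
After the kernel finishes decompressing itself, it enters kernel startup. This begins in arch/mips/kernel/head.S, which is responsible for initializing the data area (BSS), the interrupt descriptor table (IDT), the segment descriptor table (GDT), the page tables, and the registers. The file defines the kernel entry function kernel_entry(), which is architecture-specific assembly code. It first initializes the kernel stack segment in preparation for creating the first process in the system, then uses a loop to zero the uninitialized data segment of the kernel image, and finally jumps to start_kernel(), which initializes the hardware-related code and completes the setup of the Linux core environment.
start_kernel() is defined in init/main.c, and this is where real kernel initialization begins. start_kernel() calls a series of initialization functions that set up the various parts of the kernel itself, such as interrupts, memory management, process management, signals, and file systems, with the goal of establishing a basically complete Linux core environment.
The main functions in start_kernel() and their call relationships are as follows: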
start_kernel
    setup_arch
        cpu_probe
        prom_init
        cpu_report
        arch_mem_init
        resource_init
    sched_init
    init_IRQ
    proc_root_init
    mm_init
    console_init
    rest_init
        kernel_init
            do_basic_setup
                init_tmpfs
                driver_init
                do_initcalls
            init_post
        cpu_idle
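sock_init: initializes the sk_buff SLAB cache and registers the socket file system.
net_inuse_init: allocates a cache for each CPU.
proto_init: creates the protocols file under /proc/net and registers the associated file operations.
net_dev_init: sets up the netdevice data structures under /proc/sys and registers the network transmit and receive softirqs (NET_TX_SOFTIRQ and NET_RX_SOFTIRQ).
It initializes a packet receive queue (softnet_data) for each CPU together with the receive callback, registers the loopback operations, and registers the default network device operations. (Driver layer)
inet_init: registers the socket creation method for the INET protocol family and the basic receive handlers for the TCP, UDP, ICMP, and IGMP interfaces, and creates the proc files for the IPv4 protocol family.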
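This function is the main registration function of the protocol stack:
1. rc = proto_register(&udp_prot, 1); registers the UDP protocol at the inet layer and allocates a slab cache for it.
2. (void)sock_register(&inet_family_ops); registers the operations of the inet protocol family (mainly the creation of inet sockets) in static const struct net_proto_family *net_families[NPROTO];. (Inet socket layer)
3. inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0 registers the transport-layer UDP operations in extern const struct net_protocol *inet_protos[MAX_INET_PROTOS];. (Network layer)
4. static struct list_head inetsw[SOCK_MAX]; for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r) INIT_LIST_HEAD(r); initializes the socket type array. This is an array of linked lists: each element is a list that links the protocols and operation sets that use the same socket type.
5. for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
a) inet_register_protosw(q);
registers each protocol's call operations with the sock layer. (BSD socket layer and inet socket layer)
6. arp_init(); starts ARP protocol support.
7. ip_init(); starts IP protocol support.
8. udp_init(); starts UDP protocol support.
9. dev_add_pack(&ip_packet_type); registers the IP protocol's operations in ptype_base[PTYPE_HASH_SIZE];. (Protocol-independent layer)
10. System call layer: the system call interfaces provided in socket.c.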
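This chapter mainly covers the socket creation flow and how the parameters are passed along in fd = socket(family, type, protocol);, as well as how the data structures are organized in memory after creation.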
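Figure 3: socket creation flow
Using UDP as an example
Figure: send/receive flow
Using UDP as an example
Figure: kernel packet receive flow
Using UDP as an example
Figure: application-layer packet receive flow
Using UDP as an example
Figure: UDP send flow
This document gives only a rough analysis of the protocol stack flow, and many of the technical ideas involved cannot be conveyed here. For a deeper understanding, first see the protocol stack articles by CSDN blogger yming0221 at http://blog.csdn.net/column/details/linux-kernel-net.html, or read the Linux kernel protocol stack source code directly.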
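1. TCP/IP Illustrated, Volume 1
2. Blog: http://blog.csdn.net/column/details/linux-kernel-net.html
Figure 1: Initialization flow
Figure 2: Layered data structures
Figure 3: socket creation flow
Figure 4: Send/receive flow
Figure 5: Detailed kernel packet receive flow (interrupt-driven receive)
Figure 6: Application-layer packet receive flow
Figure 7: UDP send flow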
Original article: https://www.cnblogs.com/liuhongru/p/11412363.html