The Real Story behind SO_RCVLOWAT
=================================

While studying the Interprocess Communication chapter of "The Design and
Implementation of the FreeBSD Operating System", I came across the
following passage:

    "A low watermark also is present in each socket data buffer. The low
    watermark allows applications to control data flow by specifying a
    minimum number of bytes required to satisfy a reception request, with
    a default of 1 byte and a maximum of the high watermark."

So what exactly are these watermarks? And how can they be used in the
real world? They are nothing other than the SO_RCVLOWAT and SO_SNDLOWAT
options, which can be set on any socket through the setsockopt(2)
function of the socket API. Used with care, they allow some nifty
optimization of the number of system calls a program makes. However,
there is one problem with them. Quoting the Linux manual pages:

socket(7):

    "SO_RCVLOWAT and SO_SNDLOWAT
        Specify the minimum number of bytes in the buffer until the
        socket layer will pass the data to the protocol (SO_SNDLOWAT) or
        the user on receiving (SO_RCVLOWAT). These two values are
        initialized to 1. SO_SNDLOWAT is not changeable on Linux
        (setsockopt(2) fails with the error ENOPROTOOPT). SO_RCVLOWAT is
        changeable only since Linux 2.4. The select(2) and poll(2) system
        calls currently do not respect the SO_RCVLOWAT setting on Linux,
        and mark a socket readable when even a single byte of data is
        available. A subsequent read from the socket will block until
        SO_RCVLOWAT bytes are available."

FreeBSD is more flexible in that regard and lets us do whatever we please
with both options.

Here is the proof, from the Linux kernel source:

/usr/src/linux/net/core/sock.c

    int sock_setsockopt(struct socket *sock, int level, int optname,
                        char __user *optval, int optlen)
    {
        /* ............. */
        case SO_RCVLOWAT:
            if (val < 0)
                val = INT_MAX;
            sk->sk_rcvlowat = val ? : 1;
            break;
        /* .............
         */
        /* We implement the SO_SNDLOWAT etc to not be settable (1003.1g 5.3) */
    }

    int sock_getsockopt(struct socket *sock, int level, int optname,
                        char __user *optval, int __user *optlen)
    {
        /* ............. */
        case SO_RCVLOWAT:
            v.val = sk->sk_rcvlowat;
            break;
        case SO_SNDLOWAT:
            v.val = 1;
            break;
        /* ............. */
    }

We can still get our hands dirty with socket system calls other than
select(2) and poll(2), however.

So what does SO_RCVLOWAT do behind the scenes? It delays the copying of
the sk_buff (Linux) or mbuf (FreeBSD) data referenced by the socket data
structure in kernel space to the user space application buffer until the
kernel buffer holds *at least* N bytes, where N is the integer that
optval points to in the setsockopt() call that sets SO_RCVLOWAT. This
means that the receiving system call stays blocked until then. If the
socket uses non-blocking I/O, the call instead returns immediately with
an error if the buffer does not yet hold the specified amount.

Here is a small program demonstrating this little feature:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>         /* read() */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define PORT 10000

    int main(int argc, char *argv[])
    {
        int listenfd, connfd;
        struct sockaddr_in own, rem;
        int opt;
        socklen_t size;

        /* both arguments are required: argv[2] is used further down */
        if (argc < 3) {
            printf("usage: %s SO_RCVLOWAT-argument bufsize\n", argv[0]);
            exit(EXIT_FAILURE);
        }

        opt = strtol(argv[1], NULL, 10);

        own.sin_family = AF_INET;
        own.sin_addr.s_addr = htonl(INADDR_ANY);
        own.sin_port = htons(PORT);
        memset(&own.sin_zero, 0, sizeof(own.sin_zero));

        if ((listenfd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
            perror("socket");
            exit(EXIT_FAILURE);
        }

        if (bind(listenfd, (struct sockaddr *) &own, sizeof(own)) < 0) {
            perror("bind");
            exit(EXIT_FAILURE);
        }

        if (listen(listenfd, 20) < 0) {
            perror("listen");
            exit(EXIT_FAILURE);
        }

        /* set SO_RCVLOWAT on the listening socket; the run shown below
         * relies on the accepted socket picking the value up from it */
        socklen_t optlen = sizeof(opt);
        if (setsockopt(listenfd, SOL_SOCKET, SO_RCVLOWAT, &opt, optlen) < 0) {
            perror("setsockopt");
            exit(EXIT_FAILURE);
        }

        size = sizeof(rem);
        connfd = accept(listenfd,
                        (struct sockaddr *) &rem, &size);
        if (connfd < 0) {
            perror("accept");
            exit(EXIT_FAILURE);
        }
        fprintf(stderr, "connfd: %d\n", connfd);

        int bufsize = strtol(argv[2], NULL, 10);
        if (bufsize < 2) {
            fprintf(stderr, "bufsize must be at least 2\n");
            exit(EXIT_FAILURE);
        }

        char *buf = malloc(bufsize * sizeof(char));
        if (buf == NULL) {
            perror("malloc");
            exit(EXIT_FAILURE);
        }
        memset(buf, '\0', bufsize);

        /* read at most bufsize-1 bytes so buf stays NUL-terminated */
        int bytes = read(connfd, buf, bufsize - 1);
        if (bytes < 0)
            perror("read");

        fprintf(stderr, "bytes read: %d\n", bytes);
        fprintf(stderr, "buf: %s\n", buf);

        exit(EXIT_SUCCESS);
    }

Tx marks the successive points in time of the actions in the two
terminals running our two programs (serv and netcat):

    T0 ==
    ithilgore@fitz sockets $ ./serv 10 30
    connfd: 4

    T1 ==
    ithilgore@fitz sockets $ nc 127.0.0.1 10000
    Testing

    T2 ==
    ithilgore@fitz sockets $ ./serv 10 30
    connfd: 4

    T3 ==
    ithilgore@fitz sockets $ nc 127.0.0.1 10000
    Testing
    More Data
    ithilgore@fitz sockets $

    T4 ==
    ithilgore@fitz sockets $ ./serv 10 30
    connfd: 4
    bytes read: 18
    buf: Testing
    More Data
    ithilgore@fitz sockets $

So read(2) returned only after the second chunk of data ("More Data") was
sent, which pushed the buffered total past the 10 bytes we set as the
SO_RCVLOWAT option argument: "Testing" plus its newline is 8 bytes, and
"More Data" plus its newline brings the total to 18. This behaviour could
be useful in applications that want to avoid the overhead of repeatedly
copying small amounts of data from kernel space to user space through
many calls to read(2) or its equivalents (recv(2) etc.).
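
Since the two systems treat these options differently, it can be handy to ask the running kernel directly what it accepts. What follows is only a minimal sketch under stated assumptions: the helper names set_rcvlowat, get_rcvlowat and try_set_sndlowat are made up for this illustration and are not part of any API; only setsockopt(2) and getsockopt(2) themselves are real.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: request a receive low watermark of n bytes.
 * Returns 0 on success, -1 (with errno set) on failure. */
int set_rcvlowat(int fd, int n)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &n, sizeof(n));
}

/* Hypothetical helper: read the receive low watermark back.
 * Returns the value the kernel reports, or -1 on failure. */
int get_rcvlowat(int fd)
{
    int n = 0;
    socklen_t len = sizeof(n);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &n, &len) < 0)
        return -1;
    return n;
}

/* Hypothetical helper: try to change the send low watermark.
 * Returns 0 if the kernel accepted the call, otherwise the errno value.
 * Per the socket(7) text quoted earlier, Linux is expected to refuse
 * with ENOPROTOOPT; other systems may accept the value. */
int try_set_sndlowat(int fd, int n)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDLOWAT, &n, sizeof(n)) < 0)
        return errno;
    return 0;
}
```

Calling set_rcvlowat(fd, 10) on a freshly created stream socket and then printing get_rcvlowat(fd) shows whether the kernel kept the requested value, while the return value of try_set_sndlowat(fd, 10) tells the two behaviours described above apart without reading any kernel source.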