Lab 11 - Networking
Lab goals:
- Learn how to use the sockets networking API
- Implement an echo client program and server program using sockets
- Modify the echo server program to use process concurrency
Computer Networking Background
The most commonly used method for communicating across
computer networks such as the Internet today is the TCP/IP
protocol family. IP (the Internet Protocol) provides the
ability to send packets between computers on the network,
but this communication is not guaranteed to be reliable:
packets may be garbled, lost, or reordered. A packet
consists of a control header, carrying information such as
the network addresses of the source and destination
computers, and a payload of up to some limited number of
data bytes. TCP (the Transmission Control Protocol) uses
the services of IP to provide reliable (lossless) byte
stream communication from one computer to another. In
Unix, the TCP/IP protocols are accessed via a mix of
standard Unix I/O and the sockets API.
IPv4 Addresses
Throughout this lab, any mention of IP addresses refers to IPv4 addresses. IPv4 (IP version 4) is what is widely used in the Internet today, whereas IPv6 (IP version 6) is gradually being deployed in the Internet to supplement and replace IPv4.
IP addresses (again, here, IPv4 addresses) are represented in the sockets API using the following structure:
/* Internet address structure */
struct in_addr {
    unsigned int s_addr; /* network byte order (big-endian) */
};
When using this structure, it is important to remember
that the data in it is in network byte order. As we have
previously discussed in Lab 4 (Advanced I/O in C),
different machines store the bytes of a multi-byte data
type, like an integer, in different byte orders. Some
computers use little-endian order, and others use
big-endian order. To allow machines with different byte
orders to communicate over the network, the byte order for
all information as it goes over any IP network has been
standardized as big-endian order. Therefore, every
multi-byte field of a packet header, including the IP
addresses, is stored and transmitted in network byte order.
The C library provides some handy functions, declared in
the header file <arpa/inet.h>, for handling byte-order
conversions. You've seen some of these before in the
Linking assignment:
- uint32_t htonl(uint32_t hostlong): Convert a 32-bit value from the local host byte order to network byte order.
- uint16_t htons(uint16_t hostshort): Convert a 16-bit value from the local host byte order to network byte order.
- uint32_t ntohl(uint32_t netlong): Convert a 32-bit value from network byte order to the local host byte order.
- uint16_t ntohs(uint16_t netshort): Convert a 16-bit value from network byte order to the local host byte order.
If the local machine's byte order happens to be the same
as network byte order (i.e., big-endian order), these
functions do nothing, but otherwise, they byte swap
the value to properly convert it to/from network byte
order.
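The effect of these conversions can be made visible by looking at the individual bytes of a value in memory. The following is only a small sketch; the helper names bytes_in_memory() and show_byte_order() are made up for this illustration and are not part of the lab code.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the four bytes of value into out[] in the order in which they
 * sit in memory (lowest address first).
 */
void
bytes_in_memory(uint32_t value, unsigned char out[4])
{
    memcpy(out, &value, sizeof(value));
}

/* Print the in-memory layout of value before and after htonl(). */
void
show_byte_order(uint32_t value)
{
    unsigned char host[4], net[4];

    bytes_in_memory(value, host);
    bytes_in_memory(htonl(value), net);
    printf("host order:    %02x %02x %02x %02x\n",
        host[0], host[1], host[2], host[3]);
    printf("network order: %02x %02x %02x %02x\n",
        net[0], net[1], net[2], net[3]);
}
```

Calling show_byte_order(0x802A211E) on a little-endian machine prints 1e 21 2a 80 for host order and 80 2a 21 1e for network order; on a big-endian machine both lines show 80 2a 21 1e.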
Although the size of an IP address is 32 bits, it isn't
usually written (e.g., for us humans) as a single 32-bit
number. Instead, each of the 4 bytes in the address is
represented in decimal separately, in the order from the
most significant byte to least significant byte, with
periods (referred to as dots) separating them. For
example, the IP address 2150244638 (decimal) or 0x802A211E
(hex) is usually written as 128.42.33.30. This format is
referred to as the dotted decimal representation of the
address.
The C library provides the following functions for converting an IP address into and out of dotted decimal representation:
- int inet_pton(AF_INET, const char *src, void *dst): Convert the dotted decimal IP address string at src to an IP address in network byte order at dst.
- const char *inet_ntop(AF_INET, const void *src, char *dst, socklen_t size): Convert the IP address in network byte order at src to a dotted decimal string at dst.
As always, check the manual (man) pages for details on
using these functions and on the error conditions that they
can return. To use these functions, you must include the
header file <arpa/inet.h>.
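As a rough illustration of the two conversion functions, the sketch below wraps each one in a small helper. The names parse_ipv4() and format_ipv4() are hypothetical and chosen only for this example.

```c
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/*
 * Parse the dotted decimal string text into *addr (network byte
 * order).  Returns 0 on success, or -1 if text is not a valid IPv4
 * address.
 */
int
parse_ipv4(const char *text, struct in_addr *addr)
{
    /* inet_pton() returns 1 on success and 0 for a malformed string. */
    return (inet_pton(AF_INET, text, addr) == 1 ? 0 : -1);
}

/*
 * Format *addr as a dotted decimal string in buf, which should hold at
 * least INET_ADDRSTRLEN bytes.  Returns buf, or NULL on error.
 */
const char *
format_ipv4(const struct in_addr *addr, char *buf, socklen_t size)
{
    return (inet_ntop(AF_INET, addr, buf, size));
}
```

After parse_ipv4("128.42.33.30", &addr) succeeds, ntohl(addr.s_addr) is 0x802A211E, matching the example address above.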
The Domain Name System (DNS)
IP addresses are difficult for humans to remember, so the Domain Name System (DNS) is used to provide a mapping between easy-to-remember textual domain names (sometimes simply called hostnames) and IP addresses. A domain name consists of a hierarchical sequence of textual labels that are separated by dots, with the right-most label identifying the top-level domain for the name. For example, rice.edu, cs.rice.edu, and www.cs.rice.edu are all examples of different domain names that belong to the edu top-level domain. The name rice.edu refers to the rice subdomain of edu, the name cs.rice.edu refers to the cs subdomain of rice.edu, and the name www.cs.rice.edu refers to the www subdomain of cs.rice.edu. Other top-level domain names include com, org, and net.
The DNS is maintained as a huge worldwide distributed
database implemented on a set of DNS servers, where each
server handles requests for all names in a zone of
the DNS name space. A zone is a subtree of the DNS name
space that is managed by a single entity, pruned of
lower-down subtrees for which management has been delegated
to a different entity and which thus form a different zone.
Each zone is required to have replicated servers. For
example, the rice.edu domain is served by two different
replicated DNS servers operated by Rice, and is also
supported by two backup servers operated by Purdue
University and one operated by Southern Methodist
University. The domain name essentially identifies the path
from the root of the hierarchy (reading from right to left
in the domain name), and the data is stored by the servers
for that zone as a set of resource records (RRs)
associated with that name. The IPv4 address for a name is
stored as an A resource record, whereas the IPv6
address for the same name is stored as an AAAA
(usually called quad A) resource record.
The following image (from Wikipedia) depicts this organization of the DNS hierarchy into zones, domain names, and resource records:
The C library provides a set of functions for querying the DNS for host address information, represented by the following structure:
struct addrinfo {
    int ai_flags;             /* flags for getaddrinfo */
    int ai_family;            /* address type (AF_INET or AF_INET6) */
    int ai_socktype;          /* the socket type */
    int ai_protocol;          /* the type of protocol */
    socklen_t ai_addrlen;     /* length of ai_addr */
    struct sockaddr *ai_addr; /* pointer to a sockaddr struct */
    char *ai_canonname;       /* the canonical name */
    struct addrinfo *ai_next; /* pointer to the next addrinfo struct */
};
And the C library provides the following functions to perform these types of DNS queries:
- int getaddrinfo(const char *node, const char *service, const struct addrinfo *hints, struct addrinfo **res): Query the DNS using a domain name or IP address specified by node. On success, this function allocates and initializes a linked list of addrinfo structures, one for each network address that matches node and service, subject to any restrictions imposed by hints, and returns a pointer to the start of the list in res. The items in the linked list are linked by the ai_next field. In its simplest usage, to just look up an IP address for a given name, you should specify both service and hints as NULL. Returns 0 on success, or an error code (see gai_strerror() below) in case of any error.
- void freeaddrinfo(struct addrinfo *res): Free the memory that was allocated for the dynamically allocated linked list res by a previous call to getaddrinfo().
- int getnameinfo(const struct sockaddr *sa, socklen_t salen, char *host, socklen_t hostlen, char *serv, socklen_t servlen, int flags): Query the DNS using a sockaddr struct. It is essentially the inverse of getaddrinfo(). Returns 0 on success, or an error code (see gai_strerror() below) in case of any error.
- const char *gai_strerror(int errcode): Return a human readable string, suitable for error reporting, for error codes returned by getaddrinfo() or getnameinfo().
To use these functions, you must include both
<sys/socket.h> and <netdb.h>.
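The simplest usage described above (service and hints both NULL) might look roughly like the following sketch. The function name lookup_ipv4() and its choice of EAI_FAIL as the "no IPv4 address found" result are assumptions made for this example only.

```c
#define _POSIX_C_SOURCE 200809L /* expose getaddrinfo() under strict -std= modes */
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>

/*
 * Look up name and store the dotted decimal form of its first IPv4
 * address in buf.  Returns 0 on success; otherwise returns a nonzero
 * getaddrinfo() error code (EAI_FAIL if no IPv4 address was found).
 */
int
lookup_ipv4(const char *name, char *buf, socklen_t size)
{
    struct addrinfo *list, *ai;
    int error, result = EAI_FAIL;

    /* Simplest usage: service and hints are both NULL. */
    error = getaddrinfo(name, NULL, NULL, &list);
    if (error != 0) {
        fprintf(stderr, "getaddrinfo(%s): %s\n", name,
            gai_strerror(error));
        return (error);
    }
    /* Walk the linked list via ai_next until an IPv4 entry is found. */
    for (ai = list; ai != NULL; ai = ai->ai_next) {
        if (ai->ai_family != AF_INET)
            continue; /* skip IPv6 results */
        struct sockaddr_in *sin = (struct sockaddr_in *)ai->ai_addr;
        if (inet_ntop(AF_INET, &sin->sin_addr, buf, size) != NULL) {
            result = 0;
            break;
        }
    }
    /* The list was allocated by getaddrinfo(), so it must be freed. */
    freeaddrinfo(list);
    return (result);
}
```

A caller would pass a buffer of at least INET_ADDRSTRLEN bytes, for example lookup_ipv4("www.cs.rice.edu", buf, sizeof(buf)).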
Aside: Some older documents, including previous editions of
the textbook for this class, refer to the functions
gethostbyname() and gethostbyaddr(). However, these
functions have been deprecated, for two reasons: they are
not thread-safe, and some implementations of these
functions do not support IPv6. You should avoid using these
deprecated functions and should use only getaddrinfo() and
its related functions above instead.
Connection Endpoint Addresses
As described above, TCP provides a reliable (lossless), point-to-point byte stream connection between two computers such as between a client and a server. A socket represents one of the two endpoints of the connection. Every TCP socket has a unique combination of IP address and port number, where the port number is a 16-bit integer. The port number can either be assigned automatically or be a well-known port number used by convention to represent a particular service, like port 80 for HTTP.
Sockets are used in Unix as the endpoint for more than
just TCP connections, and at the time when the sockets API
was originally developed, the C language did not yet
support generic (i.e., void *) pointers. So, the following
generic socket structure was meant to be able to represent
the endpoint addresses for arbitrary communication channels
(not just for TCP/IP), with the sa_data member providing
space for the particular addressing information needed,
depending on the value of sa_family:
struct sockaddr {
    unsigned short sa_family; /* protocol family */
    char sa_data[14];         /* address data */
};
The TCP/IP-specific version of this generic connection
endpoint address structure, a struct sockaddr_in, is
defined as follows, with the sa_data member above replaced
with the addressing information needed for TCP/IP
sockets:
struct sockaddr_in {
    unsigned short sin_family; /* address family (always AF_INET) */
    unsigned short sin_port;   /* port num in network byte order */
    struct in_addr sin_addr;   /* IP addr in network byte order */
    unsigned char sin_zero[8]; /* pad to sizeof(struct sockaddr) */
};
The proper way to pass a struct sockaddr_in pointer as an
argument to a function that expects a struct sockaddr
pointer is to fill in the struct sockaddr_in variable and
then cast the address of that variable to a
struct sockaddr * when using it as the argument to the
function.
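The fill-then-cast pattern can be sketched as follows. Both function names are invented for this example; endpoint_port() merely stands in for a sockets API function, such as connect() or bind(), that takes the generic struct sockaddr pointer.

```c
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill in *sin for the dotted decimal IPv4 address ip and the port
 * number port (given in host byte order).  Returns 0 on success, or
 * -1 if ip is not a valid address.
 */
int
fill_endpoint(struct sockaddr_in *sin, const char *ip, uint16_t port)
{
    memset(sin, 0, sizeof(*sin)); /* also clears sin_zero */
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);
    return (inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1);
}

/*
 * Stand-in for a function that accepts the generic address type: it
 * checks sa_family before viewing the address as a struct sockaddr_in.
 * Returns the port in host byte order, or -1 for a non-IPv4 address.
 */
int
endpoint_port(const struct sockaddr *sa)
{
    if (sa->sa_family != AF_INET)
        return (-1);
    return (ntohs(((const struct sockaddr_in *)sa)->sin_port));
}
```

The cast happens at the call site: after fill_endpoint(&sin, "128.42.33.30", 80), the call endpoint_port((struct sockaddr *)&sin) returns 80.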
More on getaddrinfo()
As noted above, in the simplest usage of getaddrinfo(), to
just look up an IP address for a given name, you should
specify the service and hints arguments for getaddrinfo()
both as NULL.
However, getaddrinfo() is more powerful than that. It can
provide you with all of the argument values needed to
actually open a TCP connection to that computer (see the
description of the socket() and connect() system calls for
the simple echo client below).
In particular, the service argument can specify the server
port number to connect to, either as a "%d"-formatted
character string giving the port number or as the name of
the service to which you would like to connect. For
example, port number 80 is defined as the default port
number used by web servers. On a call to getaddrinfo(),
this port can be specified with the service argument either
as "80" or as "http" or "www", and getaddrinfo() will then
initialize the sin_port field of the struct sockaddr_in
within the struct addrinfo (when the struct sockaddr is
viewed as a struct sockaddr_in) to the value 80, stored in
network byte order.
The hints argument to getaddrinfo() can be used to specify
many other fields or behaviors as well. For example, the
ai_socktype field can be set to SOCK_STREAM to indicate a
TCP connection. And the ai_flags field can be set to the
bitwise OR of different flags to control the results of
getaddrinfo(). For this lab, the two relevant flags are:
- AI_NUMERICSERV: indicates that the service argument gives the port number as a numeric "%d"-formatted character string rather than the name of the intended service.
- AI_ADDRCONFIG: indicates that getaddrinfo() should return IP addresses for node that are supported by the requesting (local) computer; IPv4 addresses are returned only if the requesting computer supports IPv4, and IPv6 addresses are returned only if the requesting computer supports IPv6.
Finally, as noted above, getaddrinfo() may return more than
one struct addrinfo, since any computer on the Internet may
have more than one IP address. To fully make use of these
multiple addresses, the requesting computer, when
attempting to connect to the intended server, may try a
connection attempt with each of those multiple addresses,
until one of these connection attempts succeeds (or until
all addresses have been attempted).
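Putting the hints argument and the try-each-address loop together, one possible shape is sketched below. The function name connect_to_host() and its flags parameter are assumptions for this example, not the interface required by the lab; the caller supplies the ai_flags value.

```c
#define _POSIX_C_SOURCE 200809L /* expose getaddrinfo() under strict -std= modes */
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Connect to host on the port named by the string port, trying each
 * address returned by getaddrinfo() in turn.  flags is stored in
 * hints.ai_flags.  Returns a connected socket descriptor, or -1 if
 * the lookup failed or no address could be reached.
 */
int
connect_to_host(const char *host, const char *port, int flags)
{
    struct addrinfo hints, *list, *ai;
    int error, fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* IPv4 only, as in this lab */
    hints.ai_socktype = SOCK_STREAM; /* TCP */
    hints.ai_flags = flags;
    error = getaddrinfo(host, port, &hints, &list);
    if (error != 0) {
        fprintf(stderr, "getaddrinfo(%s, %s): %s\n", host, port,
            gai_strerror(error));
        return (-1);
    }
    for (ai = list; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd == -1)
            continue; /* try the next address */
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;    /* success: keep this descriptor */
        close(fd);    /* this address failed; try the next one */
        fd = -1;
    }
    freeaddrinfo(list);
    return (fd);
}
```

With the flags discussed above, a call might read connect_to_host(name, "80", AI_NUMERICSERV | AI_ADDRCONFIG). Note that getaddrinfo() fills in ai_addr and ai_addrlen, so no cast or sizeof() is needed in the connect() call.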
Overview of the Lab Exercises
The exercises in this lab will be similar to those in
Lab 6 (Advanced Linked Lists), in which we asked you to
fill in code to make a working hash table. This time, you
will create the client program and the server program for
a network echo service: the echo client will connect to
the echo server and send some text, and the server will
then reply by sending the same text back to the client.
To build these programs, use the Unix command:
make
You can also build either the echo server or echo client programs individually using either of the following two Unix commands, respectively:
make echoclient
make echoserver
The initial source code for the echo client program is
located in echoclient.c in your repository, and the initial
source code for the echo server program is located in
echoserver.c.
A Simple Echo Client
The anatomy of the client side of a typical TCP
connection is simple. First, the client program should
create a socket with a call to socket(), which has the
following prototype:
int socket(int domain, int type, int protocol);
The argument domain should be AF_INET for IP (i.e., IPv4)
connections or AF_INET6 for IPv6 connections; use AF_INET
for domain for this lab. The argument type should be
SOCK_STREAM for TCP connections, and the argument protocol
should be 0 for TCP connections. This call returns a file
descriptor that refers to the new socket.
Next, the client program should pass the file descriptor
returned from socket(), together with a properly filled in
struct sockaddr_in, to the connect() function, which has
the following prototype:
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
Remember that the sockets interface was meant to support
many different types of communication channels, not just
TCP/IP. The addrlen argument is used to allow for
differently-sized socket addresses. For TCP/IP
communication, you should use a struct sockaddr_in as the
variable for the socket address, and addrlen should be the
sizeof() for that struct sockaddr_in socket address
variable. Assuming that a server process is running and
listening on the requested address, the TCP connection will
have been established upon return from connect().
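The two client-side steps, socket() followed by connect() with a hand-filled struct sockaddr_in, could be sketched like this. The function name connect_ipv4() is hypothetical, and the sketch assumes the server's address is already known in dotted decimal form rather than obtained from getaddrinfo().

```c
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a TCP connection to the dotted decimal IPv4 address ip on port
 * (given in host byte order).  Returns the connected descriptor, or
 * -1 on any error.
 */
int
connect_ipv4(const char *ip, uint16_t port)
{
    struct sockaddr_in server;
    int fd;

    /* Fill in the server's endpoint address in network byte order. */
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &server.sin_addr) != 1)
        return (-1); /* not a valid IPv4 address */

    /* Step 1: create an IPv4 TCP socket. */
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return (-1);

    /* Step 2: connect, casting to the generic address type. */
    if (connect(fd, (struct sockaddr *)&server, sizeof(server)) == -1) {
        close(fd);
        return (-1);
    }
    return (fd);
}
```

Once connect_ipv4() returns a descriptor, the client can use ordinary read() and write() calls on it and should close() it when finished.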
A Simple Echo Server
The server side of a TCP connection is more complex than the client side.
Just like with the client, the server program should
first call socket() to create a TCP socket, which requires
the AF_INET (or AF_INET6) and SOCK_STREAM options (for this
lab, you will use AF_INET to create an IPv4 TCP socket):
int socket(int domain, int type, int protocol);
Next, a struct sockaddr_in must be filled in to describe
what interface the server will be listening on. Recall the
definition of a struct sockaddr_in:
struct sockaddr_in {
    unsigned short sin_family; /* AF_INET for this lab */
    unsigned short sin_port;   /* htons(port number) */
    struct in_addr sin_addr;   /* htonl(INADDR_ANY) */
    unsigned char sin_zero[8]; /* unused */
};
If this computer has multiple network interfaces
(generally, that is, multiple different network hardware
interfaces to different networks), and the server should
accept connections on only one of them, then the sin_addr
field should be set to the IP address of this computer on
that particular network interface (each network interface
on the computer will have a different IP address). More
commonly, if the server is willing to accept connections
from any network interface on this computer, then sin_addr
should be set to htonl(INADDR_ANY). As described above, the
struct sockaddr_in requires the address to be in network
byte order, so we need to use htonl() here. The port number
identifies the port on which this server will be listening
for connections; it must likewise be stored in network byte
order, using htons().
The function bind() should then be used to set the address
of the socket to the address specified in the
struct sockaddr_in. The prototype for bind() is
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
Here, sockfd should be the file descriptor created by the
call to socket() (above), and addr should be the address of
the struct sockaddr_in variable that was filled in above
(cast to be a struct sockaddr *). The argument addrlen
should be sizeof() for that struct sockaddr_in variable.
Once the address of the socket is set, the server should
use listen() to tell the operating system kernel that the
socket is ready to accept connections from clients (i.e.,
to begin listening for connections). The prototype for
listen() is
int listen(int sockfd, int backlog);
The backlog argument to listen() is the maximum number of
pending connections that should be supported. Additional
connection requests beyond this limit may be rejected by
the operating system. Note that this is a limit on pending
connections (i.e., those that have not yet been accepted,
as described below); it is not a limit on
total connections.
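The server-side setup steps described so far (socket(), filling in the struct sockaddr_in, bind(), and listen()) might be combined as in the sketch below. The function name open_listenfd() and its parameters are choices made for this example; the lab's starter code may organize these steps differently.

```c
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a TCP socket that listens on port (host byte order) on every
 * network interface, allowing up to backlog pending connections.
 * Returns the listening descriptor, or -1 on any error.
 */
int
open_listenfd(uint16_t port, int backlog)
{
    struct sockaddr_in addr;
    int fd;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return (-1);

    /* Describe the local endpoint, in network byte order. */
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    /* Attach that address to the socket, then start listening. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(fd, backlog) == -1) {
        close(fd);
        return (-1);
    }
    return (fd);
}
```

Passing 0 as the port asks the operating system to pick an unused port, which can then be read back with getsockname(); a real echo server would instead pass the port number given on its command line.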
It is important to note the distinction between
listening file descriptors and connected file descriptors.
A listening file descriptor is one that was created by
socket() and then passed to listen(); it is the endpoint
for new client connection requests. The server program
should create a listening descriptor only once during the
lifetime of the program. A connected file descriptor is the
endpoint for one specific connection between a specific
client and this server. A new connected file descriptor is
created every time the server accepts a new connection
request from a client. The reason for this distinction is
so that concurrent servers can simultaneously communicate
with many clients.
A new connection request from a client is accepted with
the accept() function. The accept() function has the
following prototype:
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
The sockfd argument should be a valid listening file
descriptor. The accept() function returns a connected file
descriptor for the newly created connection, and if addr is
not NULL, then accept() also sets the contents of the
buffer at addr to be the socket address of the accepted
client. In that case, the integer at the address given by
addrlen must be set to the size of the buffer at addr
before the call, and accept() will have set it to the size
of the resulting addr structure on return.
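One way to accept a single connection and echo its data is sketched below. The names echo_stream() and serve_one_client() are hypothetical, and the sketch ignores interrupted system calls (EINTR) for brevity.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy everything read from connfd back to connfd until the client
 * closes its end.  Returns the number of bytes echoed, or -1 on error.
 */
ssize_t
echo_stream(int connfd)
{
    char buf[4096];
    ssize_t n, off, w, total = 0;

    while ((n = read(connfd, buf, sizeof(buf))) > 0) {
        /* write() may accept fewer bytes than requested, so loop. */
        for (off = 0; off < n; off += w) {
            w = write(connfd, buf + off, (size_t)(n - off));
            if (w == -1)
                return (-1);
        }
        total += n;
    }
    return (n == -1 ? -1 : total);
}

/*
 * Accept one connection on listenfd, report the client's address,
 * echo its data, and close the connected descriptor.  Returns the
 * number of bytes echoed, or -1 on error.
 */
ssize_t
serve_one_client(int listenfd)
{
    struct sockaddr_in client;
    socklen_t clientlen = sizeof(client); /* in: buffer size */
    char ip[INET_ADDRSTRLEN];
    ssize_t total;
    int connfd;

    connfd = accept(listenfd, (struct sockaddr *)&client, &clientlen);
    if (connfd == -1)
        return (-1);
    if (inet_ntop(AF_INET, &client.sin_addr, ip, sizeof(ip)) != NULL)
        printf("connection from %s:%d\n", ip, ntohs(client.sin_port));
    total = echo_stream(connfd);
    close(connfd); /* the listening descriptor stays open */
    return (total);
}
```

A sequential server would simply call serve_one_client(listenfd) in a loop, which is why it can handle only one client at a time.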
Server Process Concurrency
Your current echoserver handles only one request at a time, but servers on the Internet generally need to be able to handle many concurrent requests. For example, Wikipedia averages around 100,000 requests (or more) per second. Wikipedia would be unusable if it did not handle requests concurrently. In order to handle that many requests, Wikipedia needs concurrency on each server as well as concurrency across its many (300) servers. In this lab, we will focus on concurrency within a single server.
One simple form of concurrency for a server is process concurrency. In process concurrency, each different request is handled by a distinct process. In this lab, we will explore a common process concurrency technique often called fork-after-accept.
In the fork-after-accept technique, a single
server process accepts all incoming connections. After a
connection is made, the server forks a copy of itself to
handle the new request, and the original process (the
parent) returns to accepting other new incoming
connections. This technique leverages the behavior of
fork() in copying the parent's address space to the new
child's address space. In addition, all open files and
sockets are shared by the parent and new child process
after the fork().
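A rough sketch of fork-after-accept follows. All function names are invented for this example. Because the descriptors are shared after fork(), the sketch has each process close the copy it does not need. It also reaps exited children with a non-blocking waitpid() after each accept; this is a simplification, and a SIGCHLD handler is a common alternative.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <unistd.h>

/* Child's work: echo everything received on connfd until end of file. */
void
echo_until_eof(int connfd)
{
    char buf[4096];
    ssize_t n, off, w;

    while ((n = read(connfd, buf, sizeof(buf))) > 0) {
        for (off = 0; off < n; off += w) {
            w = write(connfd, buf + off, (size_t)(n - off));
            if (w == -1)
                return;
        }
    }
}

/*
 * Accept one connection and fork a child to serve it.  Returns the
 * child's pid in the parent, or -1 if accept() or fork() failed.
 */
pid_t
accept_and_fork(int listenfd)
{
    struct sockaddr_in client;
    socklen_t clientlen = sizeof(client);
    pid_t pid;
    int connfd;

    connfd = accept(listenfd, (struct sockaddr *)&client, &clientlen);
    if (connfd == -1)
        return (-1);
    pid = fork();
    if (pid == 0) {
        /* Child: it never accepts, so drop its copy of listenfd. */
        close(listenfd);
        echo_until_eof(connfd);
        close(connfd);
        _exit(0);
    }
    /*
     * Parent (or failed fork): close this copy of connfd.  After a
     * successful fork, the connection itself stays open until the
     * child closes its copy as well.
     */
    close(connfd);
    return (pid);
}

/*
 * Server loop: one child per connection.  Stops at the first accept()
 * or fork() failure; a real server would inspect errno and retry
 * transient errors.
 */
void
serve_with_fork(int listenfd)
{
    while (accept_and_fork(listenfd) != -1) {
        /* Reap children that have already exited, without blocking. */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            continue;
    }
}
```

If the parent forgot to close connfd, the client would never see end of file, and the parent would eventually run out of file descriptors.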
GitHub Repository for This Lab
To obtain your private repo for this lab, please point your browser to this link for the starter code for the lab. Follow the same steps as for previous labs and assignments to create your repository on GitHub and then to clone it onto CLEAR. The directory for your repository for this lab will be
lab-11-networking-name
where name is your GitHub userid.
Submission
Be sure to git push the appropriate C source files for this lab before 11:55 PM tonight, to get credit for this lab.