Lab 4 - Advanced I/O in C

Lab goals:

Introduce advanced C input/output functions and how they are used.

Formatted I/O

We have used printf() in previous labs and examples, but we have not yet covered all of its features. In this lab, we will start with a deeper look at printf() and will also introduce the related Standard I/O library function scanf():

    int printf(const char *format, ...);

    int fprintf(FILE *stream, const char *format, ...);

    int scanf(const char *format, ...);

    int fscanf(FILE *stream, const char *format, ...);

Included above are also the related functions fprintf() and fscanf(). Calling printf() always outputs to stdout, whereas fprintf() outputs to the specified stream, and calling scanf() always reads from stdin, whereas fscanf() reads from the specified stream.

Below is a simple, but buggy, example that uses printf() and scanf() (this code is also available as printf_bug.c in your repo for this lab):

     #include <stdio.h>

     int
     main(void)
     {
             int cnt, i1, i2;

             printf("Enter two integers: ");
     
             cnt = scanf("%d %d", i1, i2);
             if (cnt == EOF) {
                     fprintf(stderr, "Error during scanf.\n");
                     return (1);  /* non-zero for error */
             } else if (cnt < 2) {
                     fprintf(stderr, "scanf matched %d input items instead of 2.\n",
                         cnt);
                     return (2);  /* non-zero for error */
             }

             printf("\nThe product of %d and %d is %d.\n", i1, i2, i1 * i2);

             return (0);  /* no error */
     }

Key points:

The first argument of printf() and scanf() is a format string. In a format string, the percent sign (%) acts as an escape character to mark the placement of one of the arguments following the format string. The character(s) following the percent sign indicate the output (for printf()) or input (for scanf()) display format for the argument. The common formats are: %d (for an integer formatted in decimal), %u (for an unsigned decimal integer), %x (for an unsigned integer formatted in hexadecimal), %f (for a floating-point number), %c (for a single character), %s (for a string), and %p (for a pointer).
The escape sequences in the format string can also include modifiers, e.g., to indicate the minimum number of characters to display. For example, the format string %02x will print out a hex number with a minimum field width of two digits, printing leading zeroes as necessary ("zero padding"). See man 3 printf for more information on modifiers. Remember that section 3 of the manual documents library procedures, whereas section 1 documents commands; what happens if you use just man printf instead of man 3 printf?
The values of i1 and i2 above are undefined before the call to scanf(), but afterwards they have the inputted values. For scanf(), the arguments following the format string must be pointers to space that has been allocated! (See also the class notes on allocation.)

String I/O

Recall the echo_bug.c program from Lab 2. It is also possible to do I/O — and to echo — with strings instead of characters, using the following functions in the Standard I/O library:

    char *gets(char *s);

    char *fgets(char *s, int size, FILE *stream);

    int puts(const char *s);

    int fputs(const char *s, FILE *stream);

Here is a simple, but buggy, example (see echostr_bug.c in your repo):

     #include <stdio.h>

     int
     main(void)
     {
             char input[10];

             while (gets(input) != NULL) {
                     puts(input);
             }

             return (0);  /* no error */
     }

Key points:

The functions gets() and fgets() return s on success, or return NULL on any error or on end of file on the input stream.
In the code above, if the length of the input string is greater than 9 characters, where do the extra characters go? They are written past the end of the array, overwriting whatever is stored in the adjacent memory! Never use gets(). Note the warning message that cc reports when you try to use it!
Calling gets(input) is (almost) equivalent to scanf("%s", input), and it suffers from the same buffer overflow bug. Never use scanf("%s", input). Instead, use scanf("%9s", input) for a maximum number of 9 characters (for example), or use scanf("%ms", &p) where p is a char * to have scanf() allocate a buffer of sufficient size (you must remember to then free(p) yourself!). Note that the m modifier is a POSIX extension and is not part of standard C. See man scanf for more information.
The function fgets() is the file version of gets(), except that fgets() also includes a maximum length argument that makes it safe to use: it reads at most size - 1 characters and always terminates the string with '\0'. Unlike gets(), fgets() also stores the newline character in the buffer if it reads one. Calling fgets() is the preferred way to read a string in C!

Byte and Word I/O

When processing large data sets, it is more efficient to handle the data as raw binary data rather than as character strings. For example, representing the number 1234567 as a character string requires 7 bytes (really 8 bytes, with the '\0' character to terminate this string), whereas representing it as an integer (raw binary data) requires only 4 bytes, assuming a 4-byte int. The following functions in the Standard I/O library are useful for I/O on such raw binary data:

    int getw(FILE *stream);

    int putw(int w, FILE *stream);

    size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);

    size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);

Here is a simple, but buggy, example (this code is also available as sumints_bug.c in your repo):

     #include <stdio.h>

     int
     main(int argc, char *argv[])
     {
             FILE *input_file, *output_file;
             int error, number, sum = 0;
             char *input_filename, *output_filename = "SUM.bin";
     
             /* Filename must be the only argument. */
             if (argc == 2)
                     input_filename = argv[argc - 1];
             else {
                     fprintf(stderr, "Wrong number of arguments.\n");
                     return (1);
             }
     
             input_file = fopen(input_filename, "r");
             if (input_file == NULL) {
                     fprintf(stderr, "Can't open %s.\n", input_filename);
                     return (1);  /* non-zero for error */
             }
      
             output_file = fopen(output_filename, "w");
             if (output_file == NULL) {
                     fclose(input_file);
                     fprintf(stderr, "Can't open %s.\n", output_filename);
                     return (1);  /* non-zero for error */
             }
      
             while ((number = getw(input_file)) != EOF)
                     sum += number;
             
             printf("The sum is %d.\n", sum);
      
             if (putw(sum, output_file) == EOF) {
                     fprintf(stderr, "Unable to write sum.\n");
                     error = 1;
             } else
                     error = 0;
      
             fclose(input_file);
             fclose(output_file);
      
             return (error);
     }

These functions can be used to effectively serialize your data, so that you can save the data to a file and restore it later. However, note that this serialization is actually machine-dependent, unlike the machine-independent serialization you may be used to from languages like Java. Therefore, one important caveat when dealing with binary data in C is that the data will be interpreted differently depending on the endianness (little-endian vs. big-endian format) of the host computer. The terms little-endian and big-endian refer to the two different, incompatible ways of laying out the bytes of a multi-byte value (such as an int) in memory. (The use of the word endian in describing this issue comes originally from Jonathan Swift's story Gulliver's Travels; see this classic paper, if you are curious about how we got from there to here.)

As an example, consider how a 4-byte integer is represented. On computers using little-endian format, such as the Intel x86, the least significant byte of the integer comes first in memory (i.e., of the 4 bytes of memory used to store the integer, the byte with the lowest address stores the least significant byte of the integer, and the other 3 bytes follow in order at successively higher addresses). On computers using big-endian format, such as most SPARC processors, the most significant byte of the integer comes first in memory (i.e., of the 4 bytes, the byte with the lowest address stores the most significant byte of the integer). This difference in byte ordering means that you would need to byte swap the values in your data if you write it on a computer of one endianness and then read it on a computer of the opposite endianness.

Also, note that the function getw() returns either the integer that was read or the value EOF. However, as described in the previous I/O lab, the value EOF is traditionally just the integer -1. So, from the return value alone, it is impossible to tell if getw() actually read and is returning the integer -1 or instead is returning the value EOF. The correct approach when using getw() is to call feof() and ferror() after it returns to check for end of file or an error.

However, the functions getw() and putw() are deprecated, and using the functions fread() and fwrite() instead is preferred.

Seeking to a New Position within a Stream

When you first open a stream, the current position within that stream is at the beginning of the file. If you read or write n bytes on that stream, the current position on the stream advances by n bytes. You can thus read or write the stream sequentially by simply making repeated read or write calls on the stream.

But reading through an entire large file just to access data that is near the end of the file would be inefficient. Similarly, for example, it would be inefficient to close a file and re-open it just to go back to the beginning of the file and be able to read the first parts of the file again. To make moving around within an open file more efficient and convenient, the Standard I/O library provides the function fseek() to allow you to explicitly move around within the stream by modifying the stream's current position in the file, seeking in the stream to a specified position:

    int fseek(FILE *stream, long offset, int whence);

The whence parameter should be one of SEEK_SET, SEEK_CUR, or SEEK_END to indicate that the offset parameter is relative to, respectively, the beginning of the file, the current position in the file for that stream, or the end of the file. See man fseek.

GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to the starter code for the lab at:

https://classroom.github.com/a/V7d1IpFb

Follow the same steps as for previous labs and assignments to create your repository on GitHub and to then clone it onto CLEAR. The directory for your repository for this lab will be

lab-4-advanced-i-o-in-c-name

where name is your GitHub userid.

Submission

Again, be sure to git push the appropriate C source files for this lab before 11:55 PM tonight to get credit for this lab.

COMP 321: Introduction to Computer Systems