Fundamentals of Binary Input/Output in C

Previously, we examined some of the operations which C allows you to do with files in the form of ASCII text, and how the use of these operations could greatly increase the flexibility of the programs that you can write in C. With these functions, we can take our first steps towards writing a text editor, a file concatenator, a database program or many other useful utilities. However, being able to deal only with ASCII text has its limitations; we are unable, for instance, to operate on image, sound or formatted text files.

In order to expand our programs further, we must delve into the world of the binary file operations which exist in the C programming language. As with the plain-text operations, the binary operations are defined in the <stdio.h> header file. The fopen() function includes special modes for reading from or writing to a file in binary mode; a call using one of them looks something like this:

fp = fopen("foo.bin", "rb");

Note the addition of the “b” tag to the mode specification. On some operating systems, notably Windows, this is required to stop the library from translating line endings and other special characters as it would for a plain-text file, a translation which would corrupt data handled by the block input/output operations such as fread() and fwrite(). In a POSIX-compliant system (such as Unix, Linux, Mac OS X, et cetera), the tag has no effect, but for maximum cross-compatibility, it should be included anyway.

While we’re discussing opening files, there is a function included in the C standard library which closes and reopens a file, allowing one to change the read-write permissions which the file is opened with. This function is named freopen(), and is called in the following fashion:

freopen("foo.file", "r+b", fp);

where “foo.file” is the name of the file in question, the mode is as in fopen() and fp is the file pointer or stream which the file is to be associated with. This can be used to change the permissions of a file from read-only to read-write, to append, to write-only or to change the access from plain-text mode to binary mode. It is more traditionally used to change the streams that the standard input/output streams are associated with, particularly in systems where stdin, stdout and stderr cannot be closed manually.
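
As a rough sketch of the first use described above, the helper below switches a stream that is already open on a file to read-write binary mode. The name reopen_for_update() is an assumption made for illustration; it is not part of the standard library.

```c
#include <stdio.h>

/* Hypothetical helper: switch an already-open stream on `path` to
 * read-write binary mode. Returns the stream on success, or NULL on
 * failure, in which case the original stream has already been closed. */
FILE *reopen_for_update(const char *path, FILE *fp)
{
    /* freopen() closes whatever fp referred to, then opens `path` with
     * the new mode and attaches it to the same FILE object. */
    return freopen(path, "r+b", fp);
}
```

A caller would typically write fp = reopen_for_update("foo.file", fp); and check the result against NULL before using the stream again.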

Now that we have a file open in binary mode, we may perform some binary operations on it. The fread() and fwrite() functions take more arguments than the plain-text functions we have seen so far, but these functions have greater flexibility as a result. We will begin with fread(), which takes in a number of items in binary mode. It is called in the following fashion:

fread(buffer, size_z, number, fp);

where buffer represents the variable or array where the things read by the function are stored, size_z represents the size of the objects to be taken in (for example, sizeof(char) for single bytes, or sizeof(short) for 2-byte short integers), number represents the number of objects to be read in, and fp represents the file pointer or stream which the data is to be read from.
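
One detail worth knowing is that fread() returns the number of complete objects it managed to read, which may be fewer than the number requested. The sketch below wraps that check in a helper; the name read_ints() and its interface are assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: read up to `max` ints from a binary file into
 * `out`. Returns the number of complete ints actually read, which is
 * less than `max` if the file is short, cannot be opened, or a read
 * error occurs. */
size_t read_ints(const char *path, int *out, size_t max)
{
    FILE *fp = fopen(path, "rb");
    size_t got;

    if (fp == NULL) {
        return 0;
    }

    /* fread() returns a count of whole items, not a count of bytes. */
    got = fread(out, sizeof(int), max, fp);
    fclose(fp);
    return got;
}
```

Comparing the returned count against the number of values we expected lets a program notice a truncated file instead of printing whatever happened to be in the array.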

The following simple program takes in a number of integers from a file with the following hexadecimal contents, written in the Intel x86-compatible little-endian format:

0A 00 00 00 14 00 00 00 1E 00 00 00 28 00 00 00 32 00 00 00
37 00 00 00 3C 00 00 00 41 00 00 00 46 00 00 00 4B 00 00 00

If read into a text editor like Emacs, this file is represented by the following meaningless string:

^@^@^@^T^@^@^@^^^@^@^@(^@^@^@2^@^@^@7^@^@^@<^@^@^@A^@^@^@F^@^@^@K^@^@^@

However, we can read this into a C program in binary format and get some meaning out of it.

#include <stdio.h>

int main(void)
{
    FILE *fp;
    int bar[10];
    int i;

    fp = fopen("foo.bin", "rb");
    if (fp == NULL) {
        puts("Error: Input file cannot be read");
        return -1;
    }
    else {
        fread(bar, sizeof(int), 10, fp);
        for (i = 0; i < 10; i++) {
            printf("%d ", bar[i]);
        }
        putchar('\n');
    }
    fclose(fp);
    return 0;
}

This program prints out the following:

10 20 30 40 50 55 60 65 70 75

In any machine with 32-bit int variables and little-endian byte order, the same results will apply. However, this code isn’t particularly machine-portable, and will give different results on computers that store integers differently. For instance, running this code on a big-endian machine, such as a seventh-generation games console like the Xbox 360 with its IBM PowerPC-derived processor, will give a radically different result from the one we get with x86 processors. This has to be borne in mind when using binary files in different computers.
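
One common way around this, sketched below, is to read the file a byte at a time and assemble each value explicitly, so that the result no longer depends on how the host machine lays out an int. The function names decode_u32_le() and read_u32_le() are assumptions made for illustration, and the sketch assumes the file holds unsigned 32-bit little-endian values like the ones shown above.

```c
#include <stdio.h>

/* Combine four bytes, stored least significant byte first, into one
 * value. unsigned long is used because it is at least 32 bits wide. */
unsigned long decode_u32_le(const unsigned char b[4])
{
    return (unsigned long) b[0]
         | ((unsigned long) b[1] << 8)
         | ((unsigned long) b[2] << 16)
         | ((unsigned long) b[3] << 24);
}

/* Hypothetical helper: read one 32-bit little-endian value from fp.
 * Returns 1 on success, or 0 if four bytes could not be read. */
int read_u32_le(FILE *fp, unsigned long *out)
{
    unsigned char b[4];

    if (fread(b, 1, 4, fp) != 4) {
        return 0;
    }
    *out = decode_u32_le(b);
    return 1;
}
```

Calling read_u32_le() in a loop on foo.bin should yield the same values on a big-endian machine as on an x86 one, at the cost of a little extra work per value.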

Just as we may wish to both read and write plain-text files using C’s functions, we may wish to do the same with binary files. We may perform these writing operations using the fwrite() function, which takes similar arguments to fread(). A call to this function is illustrated below:

fwrite(buffer, size_z, number, fp);

where again, buffer refers to the variable or array where the values are stored, size_z refers to the size of the values, number refers to the number of values to be written, and fp refers to the file pointer or stream where the binary is to be written. The following program calculates the first ten powers of two, stores them in an integer array and writes these values to a file.

#include <stdio.h>

int main(void)
{
    FILE *fp;
    int powers_two[10];
    int i;

    powers_two[0] = 1;
    for (i = 1; i < 10; i++) {
        powers_two[i] = powers_two[i-1] * 2;
    }

    fp = fopen("bar.bin", "wb");
    if (fp == NULL) {
        puts("Error: Output file cannot be opened");
        return -1;
    } else {
        fwrite(powers_two, sizeof(int), 10, fp);
    }
    fclose(fp);
    return 0;
}

In my little-endian AMD x86_64 machine, the binary values written to the bar.bin file have the following hexadecimal values:

01 00 00 00
02 00 00 00
04 00 00 00
08 00 00 00
10 00 00 00
20 00 00 00
40 00 00 00
80 00 00 00
00 01 00 00
00 02 00 00

Converted into decimal, we get the values 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512, as expected. Again, these values are almost meaningless in ASCII; most of the bit arrangements correspond to control characters, and the string of characters is of little interest. We can, however, read in the binary data as we did before and perform calculations on the values as standard integers.
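
Like fread(), fwrite() returns the number of complete objects it handled, and a careful program checks it. The sketch below does so in a helper; the name save_ints() and its return convention are assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: write `n` ints to `path` in the machine's native
 * layout. Returns 0 on success, or -1 if the file could not be opened
 * or not all of the values reached it. */
int save_ints(const char *path, const int *vals, size_t n)
{
    FILE *fp = fopen(path, "wb");
    size_t written;

    if (fp == NULL) {
        return -1;
    }

    written = fwrite(vals, sizeof(int), n, fp);

    /* fclose() flushes any buffered data, so its result matters too. */
    if (fclose(fp) != 0 || written != n) {
        return -1;
    }
    return 0;
}
```

With this in place, the powers-of-two program could call save_ints("bar.bin", powers_two, 10) and report an error if the result is not zero.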

Now that we have our basic block input/output functions, we can start investigating some of the other file access functions which we can use on our file pointers. There are a considerable number of these functions defined in <stdio.h>, not all of which are of immediate interest, but which have their own utility in a C program.

One of the functions which is of interest is the rewind() function, which returns the file position to the beginning of the file, clearing the end-of-file and error flags in the process. One might liken this to rewinding a tape; this would also be of use when reading a sound file in a music player, which is one type of program which works on binary files. The rewind() function is illustrated below, reading in integers from the bar.bin file defined in the last program:

#include <stdio.h>

int main(void)
{
    FILE *fp;
    int foo[10];
    int i;

    if ((fp = fopen("bar.bin", "rb")) == NULL) {
        puts("Error: Input file invalid");
        return -1;
    } else {
        for (i = 0; i < 10; i+=5) {
            fread(&foo[i], sizeof(int), 5, fp);
            rewind(fp);
        }
    }

    fclose(fp);

    for (i = 0; i < 10; i++) {
        printf("%d ", foo[i]);
    }
    putchar('\n');

    return 0;
}

This program opens the bar.bin file in binary format, then reads values into the foo array five integers (twenty bytes on this machine) at a time. Note the use of the & address-of operator: because five integers are read at a time, we need to pass the address of the specific element of the array at which each read should begin; otherwise, we would end up with the first five elements of the array being written to twice, and the others being left uninitialised. When we run this program, we get the following results:

1 2 4 8 16 1 2 4 8 16

The first five powers of two have been read into the array twice, with the file rewinding on each occasion. If we were to read into a larger array, each further pass through the loop would again start reading from the first value in bar.bin, 01 00 00 00.
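
The other effect of rewind() mentioned earlier, clearing the end-of-file flag, can be checked with feof(). The following is a minimal sketch; the helper name rewind_clears_eof() is an assumption made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: read fp through to its end, then rewind it.
 * Returns 1 if the end-of-file flag was set before rewind() and clear
 * afterwards, and 0 otherwise. */
int rewind_clears_eof(FILE *fp)
{
    int was_set;

    while (fgetc(fp) != EOF) {
        /* Discard everything up to the end of the file. */
    }

    was_set = (feof(fp) != 0);
    rewind(fp);
    return was_set && feof(fp) == 0;
}
```

After the helper returns, the stream is positioned at the start of the file again and can be read as if it had just been opened.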

Given that we have a function which rewinds the file fully, we might want to prove to ourselves explicitly that the rewind() function has actually rewound the file position back to the start, and to find the file position before the rewind() function is called. To do this, we can use the ftell() function, a function which returns a long integer value which tells us the current file position. We can expand our previous program with calls to the ftell() function contained within our for loop:

#include <stdio.h>

int main(void)
{
    FILE *fp;
    int foo[10];
    int i;

    if ((fp = fopen("bar.bin", "rb")) == NULL) {
        puts("Error: Input file invalid");
        return -1;
    } else {
        for (i = 0; i < 10; i+=5) {
            fread(&foo[i], sizeof(int), 5, fp);
            printf("%ld\n", ftell(fp)); /* Calling ftell(), which returns a long */
            rewind(fp);
            printf("%ld\n", ftell(fp)); /* And calling it again */
        }
    }

    fclose(fp);

    for (i = 0; i < 10; i++) {
        printf("%d ", foo[i]);
    }
    putchar('\n');

    return 0;
}

The results from this program are:

20
0
20
0
1 2 4 8 16 1 2 4 8 16

Using this, we can see that the file position was at the twenty-first (noting that as usual in C, we start counting from zero) byte or the start of the sixth integer position in the file before the rewind() function was called, and that the file position returns to the first byte after the rewind() function is called. We have verified that the rewind() function does, in fact, work as we expect it to.

It is clear, though, that the rewind() function is rather limited in its scope; it can only return to the beginning of the file. A music player that could only rewind to the start of a song would be considered terribly limited, and just as with a modern digital music player, we can seek out a specific byte within a file and operate from that position. This is where the fseek() function comes into play; it works somewhat like rewind(), but with a lot more flexibility. fseek() is called in the following fashion:

fseek(fp, offset, origin)

where fp is the file pointer for which we wish to change the file position, offset is the number of bytes by which we want to move the file position, and origin is the point from which that offset is measured: the defined value SEEK_SET for the beginning of the file, SEEK_CUR for the current file position or SEEK_END for the end of the file. For this function, we might want to define a larger binary input file, so that we can see the full extent of the fseek() function. The following set of hexadecimal values was saved to the baz.bin file:

01 01 00 00 01 02 00 00 01 04 00 00 01 08 00 00 01 10 00 00 01 20 00 00 01 40
00 00 01 80 00 00 01 00 01 00 01 00 02 00 01 00 04 00 01 00 08 00 01 00 10 00
01 00 20 00 01 00 40 00 01 00 80 00 01 00 00 01 01 00 00 02 01 00 00 04 01 00
00 08 01 00 00 10 01 00 00 20 01 00 00 40 01 00 00 80 02 01 00 00 02 02 00 00
02 04 00 00 02 08 00 00 02 10 00 00 02 20 00 00 02 40 00 00 02 80 00 00 02 00
01 00 02 00 02 00 02 00 04 00 02 00 08 00 02 00 10 00 02 00 20 00 02 00 40 00
02 00 80 00 02 00 00 01 02 00 00 02 02 00 00 04 02 00 00 08 02 00 00 10 02 00
00 20 02 00 00 40 02 00 00 80

The following program declares an integer array of size 48, and moves the file position to i bytes past the start of the file every time the for loop runs, so that successive four-byte reads overlap one another.

#include <stdio.h>
#define ARRAYSIZE 48

int main(void)
{
    FILE *fp;
    int values[ARRAYSIZE];
    int i, j;

    if ((fp = fopen("baz.bin", "rb")) == NULL) {
        puts("Error: Input file invalid");
        return -1;
    } else {
        for (i = 0; i < ARRAYSIZE; i++) {
            fread(&values[i], sizeof(int), 1, fp);
            fseek(fp, i, SEEK_SET);
        }
    }

    fclose(fp);

    for (i = 0; i < ARRAYSIZE; i++) {
        printf("%d ", values[i]);
    }
    putchar('\n');

    return 0;
}

This returns the following:

257 257 16777217 33619968 131328 513 16777218 67174400 262400 1025 16777220
134283264 524544 2049 16777224 268500992 1048832 4097 16777232 536936448
2097408 8193 16777248 1073807360 4194560 16385 16777280 -2147418112 8388864
32769 16777344 65536 16777472 65537 16777472 65537 33554688 131073 16777728
65538 67109120 262145 16778240 65540 134217984 524289 16779264 65544

All of these values are somewhat closely related to powers of two, as the series of values which we placed into the baz.bin file would suggest. Nevertheless, this is not an incredibly exciting program, nor are the programs we defined before. We are simply working on a set of integers, but binary files can be more interesting.
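
A more typical use of fseek() is to jump straight to a given record in a file of fixed-size records. The sketch below does this for a file of ints, and also uses SEEK_END with ftell() to count the records. Both helper names are assumptions made for illustration, and it is worth noting that the C standard does not strictly guarantee that SEEK_END is meaningful on a binary stream, although it works on the common desktop systems.

```c
#include <stdio.h>

/* Hypothetical helper: fetch the n-th int (counting from zero) from a
 * binary stream. Returns 1 on success, or 0 if the seek or read fails. */
int read_nth_int(FILE *fp, long n, int *out)
{
    if (n < 0 || fseek(fp, n * (long) sizeof(int), SEEK_SET) != 0) {
        return 0;
    }
    return fread(out, sizeof(int), 1, fp) == 1;
}

/* Hypothetical helper: count the ints in a stream by seeking to its
 * end and asking for the position. Returns -1 on failure. The file
 * position is left at the end of the file. */
long count_ints(FILE *fp)
{
    long bytes;

    if (fseek(fp, 0L, SEEK_END) != 0) {
        return -1;
    }
    bytes = ftell(fp);
    if (bytes < 0) {
        return -1;
    }
    return bytes / (long) sizeof(int);
}
```

Applied to the bar.bin file from earlier, read_nth_int(fp, 3, &value) would be expected to store 8 in value, the fourth power of two in the file.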

Executable files are a sort of binary file, although with modern computer architectures – and for that matter, even for older, more consistent architectures – working on these executables in unadorned hexadecimal is difficult. Other types of files that are in binary format are the likes of MP3 music files, MPEG movies and various formats of image files.

One of the simplest image formats is the uncompressed BMP bitmap format defined by Microsoft. The mandatory components of a BMP file are the 14-byte file header, which stores general information about the file, a DIB header of variable size which contains more detailed information about the bitmap image, and a pixel array, which consists of blocks of bytes which encode the red, green and blue colour values of the pixels, along with an optional transparency (or alpha) value.
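
To give a feel for the 14-byte header, here is a minimal sketch which pulls out two of its fields. It assumes the commonly documented layout: the signature “BM” in bytes 0 to 1, the size of the whole file in bytes 2 to 5 and the offset of the pixel array in bytes 10 to 13, with the numbers stored little-endian. The struct and function names are assumptions made for illustration.

```c
/* The fields of the 14-byte BMP file header which this sketch extracts. */
struct bmp_file_header {
    unsigned long file_size;    /* Bytes 2-5: size of the whole file */
    unsigned long pixel_offset; /* Bytes 10-13: start of the pixel array */
};

/* Assemble a 32-bit little-endian value from four consecutive bytes. */
static unsigned long le32(const unsigned char *b)
{
    return (unsigned long) b[0]
         | ((unsigned long) b[1] << 8)
         | ((unsigned long) b[2] << 16)
         | ((unsigned long) b[3] << 24);
}

/* Hypothetical helper: returns 1 and fills *out if `raw` starts with the
 * "BM" signature; returns 0 and leaves *out untouched otherwise. */
int parse_bmp_header(const unsigned char raw[14], struct bmp_file_header *out)
{
    if (raw[0] != 'B' || raw[1] != 'M') {
        return 0;
    }
    out->file_size = le32(raw + 2);
    out->pixel_offset = le32(raw + 10);
    return 1;
}
```

For a file with a 40-byte DIB header and no colour table, the pixel offset would be expected to come out as 54, that is, 14 plus 40.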

Using this information, we can write a simple application which reads in a BMP file, then does something with this file, such as inverting the values of the colours. The application then writes this information to a separate file, including the header and DIB header. The following program takes an input file, which has been defined in this instance as having a 40-byte DIB header corresponding to one of the seven BMP header types that exist. The filename of the input file can be taken in as a command-line argument, or defined within the program itself.

/* Image manipulation: Colour inverter */
/* Based on the work of Leo Tilson @ Dublin Institute of Technology,
   modifications made by Richard Kiernan */

#include <stdio.h>
#include <stdlib.h>
#define FILENAME_LEN 2560

int main(int argc, char **argv)
{
    FILE *input_image;
    FILE *output_image;

    unsigned char header[14]; /* Stores the header of the file */
    unsigned char info[40]; /* Stores the DIB header of the file */
    unsigned char *image; /* Once malloc() is called, stores the image data
                           * for the file */

    unsigned int num_read; /* For error checking */

    unsigned int width; /* All derived from the DIB header */
    unsigned int row_length;
    unsigned int height;
    unsigned int im_size;

    unsigned int row; /* Two counter variables */
    unsigned int pixel;

    unsigned char blue; /* Stores the blue byte for each pixel */
    unsigned char green; /* Stores the green byte for each pixel */
    unsigned char red; /* Stores the red byte for each pixel */

    unsigned char *lineptr; /* We'll use this to go through each pixel */

    char filename[FILENAME_LEN];

    if (argc > 2) {
        printf("Usage: inverter [filename]\n");
        exit(1);
    } else if (argc == 1) {
        printf("Please enter a filename: ");
        /* The width of 2559 leaves room for the terminating '\0' */
        if (scanf("%2559s", filename) != 1)
            exit(1);
    }

    input_image = fopen(argc == 2 ? argv[1] : filename, "rb");
    output_image = fopen("output.bmp", "wb");

    if ((input_image == NULL) || (output_image == NULL)) {
        printf("Error: Failed to open an image file\n");
    } else {
        printf("Image opened successfully\n");

        /* Retrieve the header from the input image */
        num_read = fread(header, sizeof(char), 14, input_image);
        if (num_read != 14)
            exit(2); /* Not a valid BMP file! No point in continuing. */

        /* Retrieve the bigger info block from the input image */
        num_read = fread(info, sizeof(char), 40, input_image);
        if (num_read != 40)
            exit(3); /* Not the type of BMP file we're looking for. */

        /* Using the info block to tell us necessary information about the
         * file, like width, height, et cetera. The 256u constants keep the
         * arithmetic unsigned, so a large top byte cannot overflow an int. */
        width = info[4] + info[5] * 256u + info[6] * 256u * 256u
            + info[7] * 256u * 256u * 256u;
        height = info[8] + info[9] * 256u + info[10] * 256u * 256u
            + info[11] * 256u * 256u * 256u;
        im_size = info[20] + info[21] * 256u + info[22] * 256u * 256u
            + info[23] * 256u * 256u * 256u;

        if (height == 0 || im_size == 0)
            exit(3); /* Not the type of BMP file we're looking for. */

        row_length = im_size / height;

        /* Now, allocate memory for the image and read it in. */

        image = malloc(im_size);
        if (image == NULL)
            exit(4); /* Out of memory; no point in continuing */

        num_read = fread(image, sizeof(char), im_size, input_image);
        if (num_read != im_size)
            exit(5); /* Some sort of error in reading in the file. */

        /* Now to invert the image's colours */
        for (row = 0; row < height; row++) {
            /* Define the pointer position for this row */
            /* This is the first blue byte on the row */
            lineptr = image + row * row_length;

            for (pixel = 0; pixel < width; pixel++) {           
                /* Define the current colours */
                blue = *lineptr;
                green = *(lineptr + 1);
                red = *(lineptr + 2);

                /* Invert the colours */
                *lineptr = 255 - blue ;
                *(lineptr + 1) = 255 - green;
                *(lineptr + 2) = 255 - red;

                lineptr += 3; /* Go to next pixel */
            }
        }

        /* Write this to the output file */
        fwrite(header, sizeof(char), 14, output_image);
        fwrite(info, sizeof(char), 40, output_image);
        fwrite(image, sizeof(char), im_size, output_image);

        free(image);
        fclose(input_image);
        fclose(output_image);
    }

    return 0;
}

It is important to note that this is not the most elegant way to write this program, nor is it the fastest or the most robust. It assumes, for instance, that the image stores 24 bits per pixel, that the image-size field of the DIB header has been filled in, and that the pixel array immediately follows the two headers. However, this does demonstrate the sort of program which can be written as soon as you have access to operations that work on files at a binary level.

Using this program as a framework, we could also perform such tasks as changing the colours to greyscale, saturating the colours or other functions that we might see in a fully-featured image manipulation program. Similarly, knowing the details of a music file format would allow us to write programs which could manipulate that type of file.
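
As a sketch of the greyscale idea, the function below could stand in for the inner loop of the inverter. It uses a plain average of the three channels, which is an assumption chosen for simplicity; image editors generally weight the channels differently. The name row_to_grey() is likewise an assumption made for illustration.

```c
/* Hypothetical helper: convert one row of 24-bit pixels, stored as
 * blue, green, red bytes, to grey in place by averaging the channels. */
void row_to_grey(unsigned char *row, unsigned int width)
{
    unsigned int pixel;

    for (pixel = 0; pixel < width; pixel++) {
        unsigned char *p = row + pixel * 3;
        unsigned char grey = (unsigned char) ((p[0] + p[1] + p[2]) / 3);

        p[0] = grey; /* Blue */
        p[1] = grey; /* Green */
        p[2] = grey; /* Red */
    }
}
```

In the inverter, this would be called once per row as row_to_grey(image + row * row_length, width), with the rest of the program left as it is.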

A final function which is of interest in the context of operations on files is the remove() function. As the name suggests, it is used to remove a file from storage; strictly speaking, it removes the given filename, and the data itself is freed once no other filenames are linked to the file. This function allows us to create a simple program along the lines of the Unix rm command, which we will call rm_file.

#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc == 2) {
        if (remove(argv[1]) != 0) {
            printf("Error: File %s could not be deleted\n", argv[1]);
        } else {
            printf("File %s successfully deleted\n", argv[1]);
        }
    } else {
        printf("Usage: rm_file [filename]\n");
    }
    return 0;
}
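
The program above can only say that the deletion failed, not why. On POSIX systems, remove() sets errno when it fails, so a slightly more helpful version might look like the sketch below. The C standard itself does not promise that errno is set, so the message should be treated as a hint, and the name delete_file() is an assumption made for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: delete `path`, printing the reason to stderr if
 * the deletion fails. Returns 0 on success, or -1 on failure. */
int delete_file(const char *path)
{
    if (remove(path) != 0) {
        /* errno is set by remove() on POSIX systems; elsewhere the
         * text printed here may not be meaningful. */
        fprintf(stderr, "Could not delete %s: %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}
```

Calling delete_file() on a file which does not exist should, on a POSIX system, print a message along the lines of “No such file or directory” and return -1.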

As persistent data is such an important idea in computer programming, it is useful to have a series of functions defined in the language you are using that easily operate on that persistent data, and between plain-text and binary input/output operations in C, you have a set of functions which have been tried and tested. These functions allow us to build basic applications, like our basic “text editor”, or the colour inverter program defined above, that can form parts of a more fully-featured system.
