The C Preprocessor

One of the peculiar things about the C programming language is that so many commonly occurring elements are not actually part of the language, per se. All of the functions in the standard library are actually extensions to C, additional parts which give us the input/output, mathematical and utility features which make C powerful. All of these are contained in a set of header files and binaries which are added to programs during the compilation process.

Another extension to C is the C preprocessor, and it is this that gives us the ability to extend the language to perform functions. The C preprocessor is a sort of computer language of its own sort, and while it is not Turing-complete, it is useful enough for the purposes for which it is called. The C preprocessor reads through a C source file, replacing various statements which are important to the preprocessor to ones which are important to the C compiler.

It is somewhat difficult to explain why the C preprocessor is important, but I will attempt to do so with a brief segue into the history of computer languages. Early high-level programming languages, such as Fortran and COBOL, were notable for being able to do one set of tasks very well and most others not so well at all. In some cases, this led to deficiencies which would be considered ghastly today; ALGOL 58 and 60, for instance, did not define any input/output operations, and any I/O routines would be completely machine-dependent.

In the later 1960s, language designers attempted to create new languages which would be suitable for multi-purpose applications. However, these languages, which included PL/I and ALGOL 68, were designed by committees who were made up of conflicting personalities, many of which were desperate to see their pet features included in the languages. As the complexity of the languages grew, so did the complexity of developing an efficient compiler. As computing resources were vastly smaller than they are now, these languages were only suitable for mainframe computers, and then not even efficiently.

Therefore, these language experiments tended to fail. PL/I has some residual support by being supported by IBM, but it is moribund outside of the confines of IBM machines; ALGOL 68 is dead and buried. When C came around, Dennis Ritchie was aiming to create a language which both implemented enough features in order to build an operating system and its applications, while being able to run efficiently on a much less powerful computer than those for which PL/I was designed.

The solution was to create a system in which only the subset of the functions that were required for a specific program would be implemented, rather than the full set. This made compilation of C more efficient, as the compiler generally only had to be concerned with a small number of functions at once. The method chosen to do this was to use the C preprocessor to keep the function definitions of most functions outside of the base language; when C was standardised in 1989 by the ANSI committee, and in 1990 by the ISO, all functions were taken out of the base language and put into header files.

Now that the history lesson is over, we can continue on to the operations of the preprocessor. As mentioned above, the preprocessor scans a C source file – or, in some circumstances, another source file; Brian Kernighan famously developed RATFOR to add similar features to Fortran as in C – and looks for statements that are important to it. It then replaces them with statements that are important to the C compiler or whatever other system the preprocessor is being used for.

The most fundamental operation of the preprocessor is #include. This operation looks for a file which is defined at a path included in the #include directive, then inserts its entire contents into the source file in place of the #include directive. The file’s contents might themselves contain C preprocessor statements, as is common in C header files, so the preprocessor goes through those and acts upon them appropriately.

One of the most common invocations of the #include directive is the following:

#include <stdio.h>

This directive locates the file, stdio.h, and places its contents into a source file. The use of angle brackets around the filename indicates that it is stored in a directory whose path is known to the C compiler, and which is defined as the standard storage path for header files. stdio.h itself contains several preprocessor statements, including #define and #include statements, which are resolved by the preprocessor appropriately.

Let’s define a simple program which can be used to test this. The program will be the standard “hello, world” program as defined in The C Programming Language (Brian Kernighan & Dennis Ritchie, 2nd Edition).

#include <stdio.h>

main()
{
    printf("hello, world\n");
}

Now, we can see some of the results when this is passed through the C preprocessor:

typedef long unsigned int size_t;
typedef unsigned char __u_char;
typedef unsigned short int __u_short;
typedef unsigned int __u_int;
typedef unsigned long int __u_long;
typedef signed char __int8_t;
typedef unsigned char __uint8_t;
typedef signed short int __int16_t;
typedef unsigned short int __uint16_t;

...

struct _IO_FILE {
  int _flags;
  char* _IO_read_ptr;
  char* _IO_read_end;
  char* _IO_read_base;
  char* _IO_write_base;
  char* _IO_write_ptr;
  char* _IO_write_end;
  char* _IO_buf_base;
  char* _IO_buf_end;
  char *_IO_save_base;
  char *_IO_backup_base;
  char *_IO_save_end;
  struct _IO_marker *_markers;
  struct _IO_FILE *_chain;
  int _fileno;
  int _flags2;
  __off_t _old_offset;
  unsigned short _cur_column;
  signed char _vtable_offset;
  char _shortbuf[1];
  _IO_lock_t *_lock;
  __off64_t _offset;
  void *__pad1;
  void *__pad2;
  void *__pad3;
  void *__pad4;
  size_t __pad5;
  int _mode;
  char _unused2[15 * sizeof (int) - 4 * sizeof (void *) - sizeof (size_t)];
};

...

extern int fprintf (FILE *__restrict __stream,
      __const char *__restrict __format, ...);
extern int printf (__const char *__restrict __format, ...);
extern int sprintf (char *__restrict __s,
      __const char *__restrict __format, ...) __attribute__ ((__nothrow__));
extern int vfprintf (FILE *__restrict __s, __const char *__restrict __format,
       __gnuc_va_list __arg);
extern int vprintf (__const char *__restrict __format, __gnuc_va_list __arg);
extern int vsprintf (char *__restrict __s, __const char *__restrict __format,
       __gnuc_va_list __arg) __attribute__ ((__nothrow__));

...

main()
{
    printf("hello, world\n");
}

Most of the file has been truncated, but as we can see, the stdio.h header contains typedef declarations for various types, structure definitions including the above one for a FILE type as used in the file input/output routines, and function definitions. By being able to call this file from elsewhere, we save ourselves a lot of time and work from having to copy all of these definitions into our program manually.

While the above definition works for the standard header files, the location of the standard header files is restricted to read-only operations for non-administration users in many operating systems. There is, therefore, another way to specify the location of a source file, which may be an absolute path or relative to the working directory. A set of definitions of this type, using a relative and then an absolute definition, are shown below.

#include "foo.h"
#include "/home/jrandom/bar.h"

The operation of these preprocessor statements is similar to that of the one used for stdio.h; the major difference is in where the files are located. Instead of checking the standard directory for header files, the first definition checks the same directory as the source file for a header file named foo.h, while the second checks the absolute path leading to the /home/jrandom directory for a file named bar.h.

As it is common practice in C programming to leave #define statements, function prototypes and structure definitions in separate header files, this allows us to create our own header files without having to access the standard directory for header files.

The other particularly common invocation of preprocessor statements is the #define statement. The #define statement has two parts, an identifier and a token sequence. The preprocessor changes all instances of the identifier for the token sequence. This is useful for defining more legible names throughout the source code, particularly for so-called “magic numbers” whose purpose is not obvious from observation. A few examples of how this may be used are shown below:

#define MAX_FILENAME 255 /* Defines the maximum length of a filename path */
#define DIB_HEADER_SIZE 40 /* Defines the size of a BMP DIB header in bytes */
#define FOO_STRING "foobarbazquux"

In most cases, the #define tag is simply used to provide effective macros for obscure or complex definitions, but there is another sort of functionality which the #define statement can be used for. The #define statement can be used to define a macro with arguments, which is an effective way of creating shorthand for a piece of simple code which one doesn’t want to consistently repeat, but for which one doesn’t want the overhead of a function. An example of this is shown below:

#define SQUARE(x) (x) * (x)

We might see this definition invoked in a program like so:

#include 
#define SQUARE(x) (x) * (x)

int main(void)
{
    int a;

    printf("Enter an integer: ");
    scanf("%d", &a);
    printf("The square of %d is %d\n", a, SQUARE(a));
    return 0;
}

When this function is called, the SQUARE(a) invocation is replaced by (a) * (a). Note the brackets around the arguments in the macro; these are imperative for preserving the appropriate order of operations. Let’s say that we were to define SQUARE(x) as the following:

#define SQUARE(x) x * x

and then call it with the following code:

SQUARE(5 + 4)

This would expand out to the following:

5 + 4 * 5 + 4

As multiplication precedes addition, the multiplication in the middle would be performed first, with the multiplication of 4 and 5, giving 20, and then the flanking additions would be performed, giving an answer of 29. This is quite short from the 81 that we would expect from the square of 9. Therefore, it is important to appropriately define your macros in accordance with the expected order of operations.

Macros can have more than one argument, such as the following definition for a macro to find the larger of two numbers:

#define max(a, b) (a) > (b) ? (a) : (b)

Having defined something, we may want to undefine it further down the source file, possibly to prevent interference with certain operations, or to ensure that something is a function rather than a macro. For instance, in the standard libraries for low-power embedded platforms, getchar() and putchar() may be defined as macros in order to prevent the overhead of a function. In order to undefine something, we use #undef. The following code would undefine the SQUARE and max operation which we defined above:

#undef SQUARE
#undef max

Beyond the realms of #include and #define lie the conditional preprocessor directives. The first set of these are used to check whether something has already been defined, while the other set are used to check whether a C statement is true or false. We’ll discuss the definition-related directives first.

#ifdef is used to check if something has already been defined, while #ifndef is used to check whether something has not been defined. In professional code, this is regularly used to check the operating system and other details about the system which the program is to be compiled for, as the elementary operations which make up basic routines differ on different systems. We can also check if something is defined using the “defined” operator; this is useful if we want to continue checking after an #ifdef or #ifndef statement which was not satisfied.

Let’s say that we had a piece of source code which we needed to maintain on Windows, Mac OS X and Linux. Various bits of the source code might not apply to one or more of those operating systems. We could therefore hide the bits of source code that don’t apply to the current operating system using the following:

#ifdef _WIN32
#include 
#include 

#elif defined MACOSX
#include 
#include 

#elif defined LINUX
#include 
#include 

#endif

Note the use of #endif to close our set of conditional directives. This is part of the remainder of the conditional directives. #if checks if a C statement is true, and proceeds if it is, #elif is used to check another alternative if the preceding condition was not satisfied, #else is a universal alternative if none of the preceding conditions were satisfied, while #endif closes a block of conditional preprocessor statements. These operations work very similarly to the if…else if…else statements in C. The following example checks whether we are compiling for a 32-bit or 64-bit system:

#if !(defined __LP64__ || defined __LLP64__) || defined _WIN32 && \
    !defined _WIN64
/* we are compiling for a 32-bit system */
#else
/* we are compiling for a 64-bit system */
#endif

In this code, we’re looking for a definition of __LP64__ or __LLP64__, which define data models for 64-bit processors, to be false, or a definition of _WIN32, which defines a Windows software platform, to be true without a corresponding definition of _WIN64, which defines a 64-bit version of Windows, to be true. If this is true, the program is compiled for a 32-bit system, which will have different machine instructions to the 64-bit system.

While there are some other details of the preprocessor to discuss, they are best left to external reading. To conclude, there are a number of predefined macros in the C preprocessor, such as __LINE__, which calculates the line number, and __FILE__, which determines the filename. The C preprocessor can be somewhat obscure, but it gives the C language a great deal of flexibility – the sort of flexibility that sees its use on everything from microcontrollers to supercomputers.

Basics of Structures in C

With this post, we get to one of the most useful elements of C: Structures. While structures didn’t exist in some of the early versions of C, they were soon added in as the developers of C and the Unix operating system saw how useful they could be for extending the utility of their pre-existing programs, and indeed, developing programs outside of the field of system programming. The C data structure is, along with functions, one of the elements that gives it extra utility compared to the programming languages that came before, like FORTRAN and BASIC.

A structure is, speaking succinctly, a way of wrapping multiple variables up into one “package”. To demonstrate why you might want to do this, we could say that you’re designing a simple data processing application like a rudimentary stock control system for a small shop. This stock control system works on a number of different variables, such as char arrays for storing the names of the products, float variables for storing the cost and sell price of these products, and int variables for storing the number of products left in stock. Without structures, we have to declare all of these variables discretely, as illustrated below:

char *name[100];

int *number;

double *cost;

double *price;

The reason we have declared all of these as pointers is because we will want to declare arrays of these variables, but unless you know exactly how many product types are in your database, you won’t know how much memory to allocate, and so, you will have to use malloc(), calloc() or realloc() to create arrays of the appropriate size.

The problem we have here is that none of the elements of these arrays has any particular connection to the corresponding elements in the other arrays. Let’s say you wanted to sort the products into alphabetical order. You’d perform your sorting algorithm on the name[] array, which would work perfectly well, but if you forgot to appropriately change around the other three arrays, you’d end up with a situation where all of the arrays were knocked out of sync. This demonstrates how fragile this stock control system would be – and how inefficient, as you’d have to appropriately change the positions of three other arrays in addition to the name[] array!

What’s more, this particular stock control program wouldn’t work very well when it came to passing the elements into a function. In order to perform a stock reordering function, what we’d have to do is pass four separate variables into the function. Let’s say that we just wanted to reorder a single product, which might use a function with a prototype like this:

int reorder(char **name, int *number, double *cost, double *price);

Pretty messy, don’t you think? One of the preferable things about structures is that they allow you to send a single variable into this function. The function prototype can be rewritten in the following fashion:

int reorder(struct item *product);

This is a lot cleaner and easier to handle. There’s a single variable in the arguments of type “pointer to struct item“, which contains all of the discrete variables which we had before. I’ll explain more about pointers to structures later. First of all, we’re going to leave our stock control system for a few moments and look at a simpler structure declaration.

One useful application that C can and regularly has been used for is graphical manipulation. In a raster graphical format, such as BMP, JPEG, GIF or PNG, each pixel has an x and a y coordinate. In the light of what we’ve seen above with the lack of robustness with discrete variables, we’re going to want to create a “package” – or structure – to hold both components of a pixel location together. We’ll create a structure type for doing this:

struct pixel {
    int x;
    int y;
};

What we’ve just created is a structure declaration, which essentially says, “Let’s create a new type, called struct pixel, whose elements will be an int variable named x, and an int variable named y“. This structure declaration now allows us to create variables of type struct pixel in other parts of the program.

Once we have the structure declaration, we create an element of this structure type like this:

struct pixel pixel_1;

The element pixel_1 is now created, just as we’d create an int, char or float variable. It follows some of the same rules as any of these other variables, in that it fills a certain amount of space (in this case, as it contains two int variables, its size will be sizeof(2 * int)). However, because it contains more than one variable itself, changing the values of these internal variables involves a different step than we’ve taken before.

To do this, we introduce the structure member operator, “.” (a single full stop). Let’s illustrate how this works by setting the value of the x coordinate in pixel_1 to 150:

pixel_1.x = 150;

The structure member operator is very high in the order of precedence in C; it has a higher precedence than the pointer dereference operator, and equivalent precedence to brackets and square brackets.

We’ll put this all together in a simple program:

#include <stdio.h>

struct pixel {
    int x;
    int y;
};

int main(void)
{
    /* We'll create a variable of type struct pixel */
    struct pixel pixel_1;

    /* Set the elements of pixel_1 /*
    pixel1.x = 200;
    pixel1.y = 350;
    return 0;
}

The elements of a structure can be any legal variable, such as int, char, float and double. Notably, a structure, once it has been declared, is a legal variable. This means that structures are themselves legal elements of a structure – as long as they are structures of a different type. Extending our metaphor of a graphics program, we can use these so-called nested structures to create a representation of a rectangle. Our simple representation of a rectangle will consist of two pixels representing the origin and the diagonally opposite corner:

struct rect {
    struct pixel pt_1;
    struct pixel pt_2;
};

Indeed, the dimensions of a standard computer screen can be represented by a rectangular structure. We will illustrate how we would use our nested structure here for a 1920×1200 screen:

struct rect screen;

screen.pt_1.x = 0;
screen.pt_1.y = 0;
screen.pt_2.x = 1919;
screen.pt_2.y = 1199;

Unfortunately, without graphical functions to act upon these structures, we can’t do much with our pixel and rectangle structures. We return, therefore, to the stock control program described earlier. Having seen some simple structures, we can now design a structure which will represent a data entry.

struct item {
  char name[100];
  int number;
  float cost;
  float price;
};

This structure contains the variables for which we would have had to declare separate arrays in the previous program. In using a structure, we have packaged all of these variables together, giving them a mutual association. We’ll demonstrate how this may be used, given a database containing the following, which we can save to a file:

Bread 12 1.65 2.25
Milk 23 1.40 1.80
Eggs 20 1.60 2.15
Cheese 11 1.85 2.50
Bacon 6 2.65 3.15

Now, we write a program to take these values in and to print them out. This program will take a single additional command-line argument, which will be the file name of the database that we defined above:

#include <stdio.h>

struct item {
    char name[100];
    int number;
    float cost;
    float price;
};

int main(int argc, char *argv[])
{
    struct item item_array[5];
    FILE *ifp;
    int i;

    if (argc != 2) {
	printf("Usage: basic_database <filename>\n");
	return -1;
    } else if ((ifp = fopen(argv[1], "r")) == NULL) {
	printf("Error: Could not open file %s\n", argv[1]);
	return -2;
    } else {
	for (i = 0; i <= 5; i++) {
	    fscanf(ifp, "%s%d%f%f", item_array[i].name,
	    &item_array[i].number,
	    &item_array[i].price, &item_array[i].cost);
	}
	fclose(ifp);
    }

    for (i = 0; i <= 4; i++) {
	printf("%s %5d %8.2f %8.2f\n", item_array[i].name, item_array[i].number,
	item_array[i].price, item_array[i].cost);
    }
    return 0;
}

When we call it with our data from before, we get the following result:

Bread 12 1.65 2.25 
Milk 23 1.40 1.80 
Eggs 20 1.60 2.15 
Cheese 11 1.85 2.50 
Bacon 6 2.65 3.15

So, we’ve defined the framework for a simple database. We could do things with these values, such as searching through them to find the item with the highest price for the consumer:

float highest_price = 0;
int highest_price_pos = 0;

for (i = 0; i < 5; i++) {
    if (item_array[i].price > highest_price) {
	highest_price = item_array[i].price;
	highest_price_pos = i;
    }
}

A problem arises, though, when we look at this in greater detail. What happens if the database – as is very likely – contains more than five items? We don’t want to have to recompile our program every time we get a new item to sell, so we do what we would do if we required additional variables of other types: We use dynamic memory allocation. In order to do this, we will require a pointer to our structure type, but before I change our program, we need to discuss pointers to structures.

Pointers to structures operate similarly to pointers to other variables, in that they store memory addresses to structures. They are also declared similarly, with the following example being a pointer to type struct item:

struct item *item_pointer;

However, the order of precedence in C creates a bit of a problem when it comes to using the traditional dereference operator to access the contents of a pointer to a structure. The traditional syntax for accessing data in the elements of a pointer to a structure looks peculiar and confusing:

item_pointer = item_array[3];

(*item_pointer).price = 2.50;

Note that we have to use brackets around the dereference operator and the thing it’s dereferencing. The order of precedence for the structure member operator is higher than that of the dereference operator, and therefore we have to use brackets, which have the same precedence as the structure member operator, in order to allow the pointer to access its memory address before the structure member is called.

Luckily, there is a separate operator used specifically for allowing a pointer to a structure to access its elements without having the potential source of confusion caused by the dereference operator in this circumstance. The structure pointer operator, -> (a minus sign, followed by a greater-than sign), has the same precedence as the structure member operator, and reduces the ambiguity existing with the traditional syntax.

Having discussed that, we can modify our program from before to dynamically allocate memory based on the number of elements contained in the list. We will use malloc() to initially allocate memory for a single element of type struct item *, then use fscanf() to read in the values into the elements of this structure. For each line that fscanf() doesn’t read in the EOF character, the program increases the size of the dynamically allocated memory by the size of the struct item type using the realloc() function. Then, the elements of the structure array are read out as above.

#include <stdio.h>
#include <stdlib.h>

struct item {
    char name[100];
    int number;
    float cost;
    float price;
};

int main(int argc, char *argv[])
{
    struct item *item_array;
    FILE *ifp;
    int i;
    int array_elements = 0;
    int el_size;

    /* Store the size of a single element */
    el_size = sizeof(struct item);

    /* Start off by declaring a single element */
    item_array = (struct item *) malloc (el_size);

    if (argc != 2) {
	printf("Usage: basic_database <filename>\n");
	return -1;
    } else if ((ifp = fopen(argv[1], "r")) == NULL) {
	printf("Error: Could not open file %s\n", argv[1]);
	return -2;
    } else {
	/* Keep going until fscanf hits the EOF character */
	while (fscanf(ifp, "%s%d%f%f", item_array[array_elements].name,
		      &item_array[array_elements].number,
		      &item_array[array_elements].cost,
		      &item_array[array_elements].price) != EOF) {
	    /* Reallocate memory, then increment the elements counter */
	    /* NOTE: We've got a bodge in here. The memory allocated will
	     * always be one element more than we need. */
	    item_array = (struct item *)
		realloc(item_array, (el_size * ++array_elements) + el_size);
	    if (item_array == NULL) {
		printf("Error: Cannot allocate memory\n");
		return -1;
	    }
	}
	fclose(ifp);
    }
    for (i = 0; i < array_elements; i++) {
	printf("%s %d %.2f %.2f\n", item_array[i].name, item_array[i].number,
	       item_array[i].cost, item_array[i].price);
    }

    free(item_array);
    return 0;
}

Fundamentals of String Manipulation in C: Part 2

Having discussed most of the important functions relating to strings before, there are only a few others of particular note. We saw gets() previously, which acts like scanf(“%s”, string), and while it performs the job it’s asked to do, it is regarded as somewhat dangerous as it is prone to buffer overflow. A function does exist in the C standard library in <stdio.h> which is somewhat safer.

fgets() is an equivalent function to gets() designed to work on file input. Unlike gets(), you can specify a maximum number of characters to be taken in, which mitigates the buffer overflow that can occur with gets(). fgets() is called with three arguments: the string where input will be stored; the maximum number of characters, including the null character and either stdin, the reference of the standard input stream, or a pointer to a variable of the type FILE. File pointers will be discussed later; for now, we are only interested in stdin.

An example of the use of fgets() is illustrated below:

#include <stdio.h>

int main(void)
{
    char string[30];
    printf("Please enter a string: ");
    fgets(string, 30, stdin);
    puts(string);
    return 0;
}

One peculiar difference between fgets() and gets() is that fgets() does not remove newline characters from its input, while gets() does. This is to be noted when trying to concatenate two strings with strcat() which have been entered from the standard input stream with fgets().

Previously, we also saw the strcmp() function for comparing two strings. A similar function, strstr() (for string string) can be used to find an instance of a string of smaller or equal size within another string. It takes two arguments, both of them strings, and returns a pointer to the first instance of the string being searched for in the string being searched. The following program demonstrates strstr() in action.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char string[] = "yellow dinosaurs eat snow reluctantly";
    char *p;
    int index;

    /* Looking for the location of "eat" in string[] */
    p = strstr(string, "eat");
    /* Finding the element within the array where "eat" begins */
    index = p - string;
    printf("The string \"eat\" begins at index %d of string[]\n");

  return 0;
}

This returns the following:

The string "eat" begins at index 16 of string[]

Searching for multiple instances of strings using strstr() requires a slight modification of our program. We can do this by creating an index, or a point in the array where the last instance of the string was found, and call the strstr() function from the next contiguous point in memory (i.e. the pointer string + index + 1). This example will benefit from the following illustration:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char long_string[] = "The C programming language was invented in the \n"
    "early 1970s by the computer scientist, Dennis Ritchie, who was then \n"
    "working at Bell Labs in New Jersey, which had just removed its \n"
    "support from the Multics project.\n";
    int index = -1, count = 0, i;
    char *p;

    printf("%s", long_string);

    for (i = 0; i < strlen(long_string); i += index) {
	p = strstr(long_string + index + 1, "in");
	if ((index = p - long_string) >= strlen(long_string) || p == NULL)
	break;
	++count;
    }
    printf("\nThe string \"in\" has been located in the string %d times\n", count);
    return 0;
}

This example returns the following:

The C programming language was invented in the 
early 1970s by the computer scientist, Dennis Ritchie, who was then 
working at Bell Labs in New Jersey, which had just removed its 
support from the Multics project.

The string "in" has been located in the string 5 times

One thing to notice about this program is that the counter variable is not incremented by 1 on every loop, but instead by the value of index; this ensures that the loop continues only as long as there are still instances of the string being searched for to be counted.

In the last tutorial, We demonstrated atoi(), a function which converts a string consisting of numeral digits into a decimal integer. It was mentioned at the time that other functions of the same type exist. atof(), for instance, converts a string consisting of numeral digits, exponents and at most one radix point into a double; atol() operates like atoi(), except that it returns a long int. On most modern compilers, atol() works exactly like atoi(), but on older compilers with 2-byte ints, the two functions work differently.

The implementation of simple versions of these two functions is discussed in The C Programming Language (Kernighan & Ritchie, 2nd Edition, 1988). Other functions of this type with more flexibility exist, like strtod(), strtol() and strtoul(), the operations of which can be found in any good C reference material.

Fundamentals of String Manipulation in C: Part 1

We’ve already seen arrays, which are chains of contiguous pointers to variables, and we’ve seen how we use them. There are arrays available of all data types, including integers of all lengths, floats, doubles and chars. A char array could be used to hold a series of characters corresponding to a word or sentence, as such:

#include <stdio.h>

int main(void)
{
    char word[] = {'f', 'o', 'o', 'b', 'a', 'r'};
    int i;
    for (i = 0; i < 6; i++)
	printf("%c", word[i]);
    printf("\n");
    return 0; /* Recall that main() returns this as an error code */
}

The array is defined with a series of characters, six in total, and the for loop goes through each character in sequence and prints it to the screen. This is, however, a cumbersome procedure and requires knowledge of the length of each individual character array. Because the manipulation of text in C is so common, there is a simpler, less cumbersome way of declaring and using arrays of characters, as illustrated below:

#include <stdio.h>

int main(void)
{
    /* Note that there are no curly brackets, and quotation marks are
    * used. */
    char word[] = "foobar";
    int i;
    for (i = 0; i < 6; i++)
	printf("%c", word[i]);
    printf("\n");
    return 0;
}

The format of the character array in this example is known as a string. Indeed, we’ve seen them before many times without explicitly referring to them as such; the printf() function takes arguments which are strings. Indeed, we can change the printf() arguments in this example to make it easier and less cumbersome:

#include <stdio.h>

int main(void)
{
    char word[] = "foobar";
    printf("%s\n", word);
    return 0;
}

This is much simpler to read, less dependent on knowing the length of the string and less error-prone. The question is, however, “How does printf() know when to stop printing?” The answer is that the word[] string isn’t just declared with the six characters in “foobar”, but also with a so-called null character, which is denoted using “”. Internally, printf() goes through the characters in the string until it reaches the null character, whereby the function terminates and returns.

There is another function in <stdio.h> which acts like printf() with a %s format specifier, and which is easier to remember and use. puts() takes a single argument, a character array, and prints it to the screen with a newline character at the end. This is illustrated below:

#include <stdio.h>

int main(void)
{
    char word[] = "foobar";
    puts(foobar);
    return 0;
}

As you can see, this is a cleaner way of printing strings to the screen than printf(), albeit without the flexibility. String printing is commonplace in text-based systems with C programming, and so puts() is often useful.

Now that we have a string, we might want to add it to another string. This can be done by declaring a larger character array which is large enough to accommodate all of the characters of both strings, and then copying the contents of both strings together. This can be illustrated with the following example:

#include <stdio.h>

int main(void)
{
    char word_1[] = "foobar";
    char word_2[] = "baz";
    char combined[10];
    int i = 0; j = 0;

    /* Continue going through the string until '' is reached */
    while (word_1[i] != '')
    {
	combined[i] = word_1[i];
	i++; /* Incrementing i goes to the next slot in memory for
	* both arrays */
    }
    /* i is now equal to the first empty element in combined */

    /* Starting again with a new counter variable */
    while (word_2[i] != '')
    {
	combined[i+j] = word_2[j];
	j++;
    }
    combined[i+j] = ''; 
    printf("%s\n", combined);
    return 0;
}

This will print out a string with the contents “foobarbaz”. The problem here, though, is that as above, with the definition of the character array using individual characters, is that this is complicated, cumbersome and error-prone. For this reason, several functions have been defined in the C standard library under the header file, <string.h>, which more cleanly deal with the manipulation of strings. The following program demonstrates some of them:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char word_1[] = "foobar";
    char word_2[] = "baz";
    char combined[10];

    strcpy(combined, word_1);
    strcat(combined, word_2);
    puts(combined);
    return 0;
}

As you can see, this is a far cleaner way of defining these operations, without any need for explicit counters or anything else which would make the program more error-prone. Both strcpy() and strcat()take their arguments in the following form:
strcpy(destination, source);
strcat(destination, source);

strcpy() (standing for string copy) is used to copy the contents of one string to another, including the ” null character. strcat() (standing for string concatenate) is used to append the contents of one string onto the end of another; the word “concatenate” is merely a fancy way of saying “link

together”, in this case referring to joining two pieces of text together.

There are a number of other useful functions defined in <string.h> which are of note. strcmp() takes two strings and compares the characters in them. If the characters are exactly the same, the function returns 0; otherwise, it returns a positive or negative value depending on the difference in ASCII values between the first character that disagrees in both strings. strlen() returns the length of a string. Both functions are illustrated below:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char word_1[]="hello";
    char word_2[]="world";
    char word_3[]="concatenate";

    /* Note the different values returned each time. */
    printf("The difference between word_1 and word_2 is %d\n",
    strcmp(word_1, word_2));

    printf("The difference between word_2 and word_1 is %d\n",
    strcmp(word_2, word_1));

    printf("The length of word_3 is %d\n", strlen(word_3));

    return 0;
}

This program returns the following results:

The difference between word_1 and word_2 is -15
The difference between word_2 and word_1 is 15
The length of word_3 is 11

This is consistent with the difference between ‘h’ and ‘w’ in the ASCII table; this is also consistent with the length of the word “concatenate”. These functions can therefore be used usefully when comparing text, which is a common task in the C programming language.

Using scanf(), we can write our own text into a character array as a string; however, this does not have any security against the characters overflowing the string and causing a segmentation fault. The use of scanf() to write into a string is illustrated below:

#include <stdio.h>

int main(void)
{
    char string[21]; /* 20 characters plus the null character. */

    /* Note that we don't use the & dereference operator here. */
    scanf("%s", string);
    printf("%s", string);
    return 0;
}

This will work successfully unless the characters entered are greater than 20; this will cause a segmentation fault and cause the program to halt unexpectedly. As with printf() and puts(), there is a function defined for entering characters specifically into a string called gets(). gets() is, however, considered dangerous and replacements are available in many extended C standard libraries; however, gets() is defined in ISO C90 and C99 and is available on all implementations of Standard C. It is illustrated below:

#include <stdio.h>

int main(void)
{
    char string[21];
    gets(string);
    puts(string);
    return 0;
}

Another thing to note is that strings, unlike character arrays defined with discrete characters, are not modifiable. Assignment statements cannot be used on a string or its elements. A function must therefore be used on the string in order to modify it.

Finally, there are a number of functions which are defined in <stdlib.h> which can be used to convert a string of characters with numerical values (i.e. ‘0’ to ‘9’) into integers, floats or doubles. atoi(), for example, (standing for ASCII to integer) converts a string consisting of such values, e.g. “1234” into an integer value. Other similar functions are also contained in <stdlib.h> and would be described in a reference of the C standard library (such as Appendix B of The C Programming Language (Kernighan & Ritchie, 2nd Edition, 1988)).

Bitwise Operators in C – A Short Overview

The logical operators, logical AND (&&), logical OR (||) and logical NOT (!) are some of the most fundamental in the C programming language, and their use is typically demonstrated almost as soon as one has seen the if statement. Their use allows multiple statement evaluations within a single if, while, do … while or for loop, which is clearly very useful.

C has another set of operators which do Boolean evaluations and operations, but at a bitwise level. The so-called bitwise operators require two integer variables (i.e. variables of type char, short, int, long int, or in ISO C99 and the GNU extensions to C, long long int). Each bit in both of the integer variables is then checked, and the Boolean operation corresponding to the bitwise operator is then performed.

It is more difficult to explain why we might use these operations than it is to demonstrate them, so I will begin by demonstrating these bitwise operators. There are six bitwise operators: bitwise AND, bitwise OR, bitwise XOR (exclusive-or), the one’s complement operator (corresponding to bitwise NOT), left shift and right shift.

The Boolean AND operation is performed between two binary operands, whereby the values of both are checked. Each of these binary operands can have, as the name suggests, one of two values: 0 or 1 (or on and off, or in a computer, high voltage and low voltage). In the AND operation, if both binary operands are 1, the result is 1; otherwise, the result is 0. This is straightforward enough, and the results are illustrated here in a truth table using the symbol for the C bitwise AND operator, which is &, a single ampersand:

& |0 1
------
0 |0 0
  |
1 |0 1

This operation is performed in the C language like so:

#include <stdio.h>

int main(void)
{
    char i = 75;
    char j = 25;
    printf("i & j == %d\n", i & j); /* Using the & operator here */
    return 0;
}

This returns the following:

i & j == 9

This result will make more sense if we look at the binary calculations involved in this calculation. 75 in base 10 corresponds to 1001011 in base 2, while 25 in base 10 corresponds to 11001 in base 2. We will pad these values to eight bits to correspond to the eight-bit nature of a char variable in C. Having done this, we can then illustrate the bitwise operator using these binary operands:

  01001011
& 00011001
----------
  00001001
      *  *
in base 10: 8 + 1 = 9

Note that the bits in the result become 1 only if both of the corresponding bits in the operands are 1. This corresponds exactly to the definition of the Boolean AND operator that we wrote above in the truth table, and therefore, we get the expected answer.

Another of these operators is the bitwise OR operator. In this circumstance, the bits of the binary operands are checked, and if one or both of the values correspond to 1, the corresponding bit in the result will be 1. The truth table for this operation looks like this:

| |0 1
------
0 |0 1
  |
1 |1 1

The symbol for the bitwise OR is |, a single pipe symbol. The bitwise OR operator is demonstrated below:

#include <stdio.h>

int main(void)
{
    char i = 75;
    char j = 25;
    printf("i | j == %d\n", i | j); /* Using the OR operator here */
    return 0;
}

The result for this program is:

i | j == 91

Again, this answer can be better explained by showing the operation in binary.

  01001011
| 00011001
----------
  01011011
   * ** **
in base 10: 64 + 16 + 8 + 2 + 1 = 91

Again, we can see that the OR operator works as expected: The result has bits with value 1 in each of the places where either or both of the operands have bits with value 1. The next of the binary bitwise operators, the XOR (or exclusive-or) operator, works very similarly to the previous two operators, and especially like the OR operator. The difference between OR and XOR is that while the result for each bit operated upon by bitwise OR will equal 1 if one or both of the bits in the operands equals 1, the result of bitwise XOR will only equal 1 if one, but -not both- of the bits in the operands equals 1. The truth table for XOR looks like this:

^ | 0 1
-------
0 | 0 1
  |
1 | 1 0

We will use the same format that we have previously used to demonstrate the bitwise XOR operator.

#include <stdio.h>

int main(void)
{
char i = 75;
char j = 25;
printf(“i ^ j == %d\n”, i ^ j);
return 0;
}

The result from this program is:

i ^ j == 82

The binary calculations corresponding to this operation are as follows:

  01001011
^ 00011001
----------
  01010010
   * *  *
In base 10: 64 + 16 + 2 = 82

Once again, the operation has come up with the expected result. These three operators correspond to three of the standard Boolean operations. The fourth operator, the one’s complement operator, corresponds to the NOT Boolean operation. Unlike the previous three operators, the one’s complement operator is not a binary operator, but instead a unary operator, working on a single operand, and examines each bit in the operand. If the value of a bit in the operand is 1, the one’s complement operator changes it to a 0 and vice-versa. The truth table for this operation is as follows:

~ | 0 1
-------
R | 1 0

Using our previously defined values for the i and j variables, we can demonstrate the one’s complement operator below:

#include <stdio.h>

int main(void)
{
    char i = 75;
    char j = 25;
    printf("~ i == %d\n", ~i);
    printf("~ j == %d\n", ~j);
    return 0;
}

This program executes with the following results:

~ i == -76
~ j == -26

Before we do the binary proof for both of these values, note that the values are the negative value of the number minus one. This results from the format of the char variable in the C standard library of the computer this was written on; the char variable in this computer, an x86_64 AMD Athlon 64 X2 running on 64-bit Linux, is a signed integer value, and the most significant bit has been toggled with both of these operations, as will be shown below.

The reason that the values for both of these operations are equal to the negative value of the number -minus one- is because while this operation takes the one’s complement of the integral operand, the mathematics in C, as with most programming languages and computers, works on the two’s complement system, which can be explained by the use of a one’s complement operation on a positive variable and adding 1 to that value. (In fact, the x86 architecture has a machine instruction directly corresponding to two’s complement negation, and therefore, the explanation is merely metaphorical.)

Let’s do the binary operations on both of these values.

~ 01001011
----------
  10110100
  * ** *

~ 00011001
----------
  11100110
  *** **

As mentioned above, the most significant bit has been toggled in both of these cases. With a signed variable, this makes both variables correspond to negative values, while with unsigned values, the values would become the maximum values that the variables can have (in the case of an unsigned char, 255) minus the previous value. The details of the mathematics behind these operations is beyond the scope of this tutorial.

That leaves two bitwise operators to cover, and both of them are very similar. The left and right shift operators, which are represented by the respective symbols << and >>, are both binary operators, with the first operand being an integral variable or constant which is to be operated on, and the second operand is the number of bits to shift the first operand. Given the following binary value, 00011000, which corresponds to 24 in base 10, and which we will represent as a char variable, we can see the results of left-shifting and right-shifting by a number of places:

00011000 << 1 == 00110000
00011000 >> 1 == 00001100
00011000 << 2 == 01100000
00011000 >> 2 == 00000110
00011000 << 3 == 11000000
00011000 >> 3 == 00000011

Bit-shifting this value by up to three places in either direction causes no problems. In the case of left-shifting, the bits on the right which have been vacated have been replaced by zeroes; correspondingly, with right-shifting, the bits on the left that were vacated have been replaced by zeroes. The problem arises when we try to bit-shift this value by four or more places in either direction. Let’s see what happens if we try to do that:

00011000 << 4 == 0000000110000000 /* Promotion to short int */
00011000 >> 4 == 00000001 /* A bit has moved out of range */

Our bit-shifting operation has pushed a bit out of range in both occasions, replacing the vacated bit with a zero. In the case of the left-shifting operation, the result is correct, as the value is promoted to a short integer, filling 16 bits rather than 8. The problem is with the right shift, where the value is shifted out of range. Performing another right-shift on the second result by one more place will move the remaining bit out of range, giving a result of zero, as illustrated below:

10000000 << 1 == 0000001100000000 /* This is OK */
00000001 >> 1 == 00000000 /* Zero! Probably not what we want! */

This result is most likely not what we want, but it is defined behaviour for the shift operators in C. There are further rules for the operation of the shift operators defined in the ISO C standard, but instead of spending too long discussing this one set of operators, let’s write a program to show the results of the operations we’ve just shown above:

#include <stdio.h>
int main(void)
{
    unsigned char k = 24;
    int count;
    for (count = 1; count <= 5; count++) {
        printf("24 << %d == %d\n", count, k << count);
        printf("24 >> %d == %d\n", count, k >> count);
    }
    return 0;
}

The result from this program is the following:

24 << 1 == 48
24 >> 1 == 12
24 << 2 == 96
24 >> 2 == 6
24 << 3 == 192
24 >> 3 == 3
24 << 4 == 384
24 >> 4 == 1
24 << 5 == 768
24 >> 5 == 0

Notice that each time we perform a shift operation, the value of the number is equivalent to a multiplication or division by two. The question is why we would use these operators rather than simply multiplying or dividing a value by two. The answer is that unlike a multiplication or division procedure, which is often very complex for a machine and requires several machine instructions to perform, a processor has a shifting unit built into the machine, making it quicker and less complex to use a single machine instruction for shifting rather than five or six for multiplying or dividing.

Similarly, we can answer the question posed above about why we would use the other bitwise operators by noting their potential efficiency. In fact, these operations are typically used for programming device drivers, microcontrollers and other systems where low-level access is either necessary or very desirable. As well as this, bitwise operators are often used to simulate several Boolean “flags” inside a single integral variable, creating up to eight “flags”, for instance, in a single unsigned char variable.

The use of these techniques creates highly idiomatic code which is often tied to a limited set of hardware, but which can run very efficiently. These techniques are therefore often foregone for programs which are not heavily dependent on exact timing or efficient code, such as applications for home computers, and it is difficult to demonstrate a non-trivial application of these operations.

The following piece of source code from Quake III Arena demonstrates one of the idiomatic pieces of code where bitwise operations can really come in handy. The function described here is an application of the so-called “fast inverse square root” procedure, which manages to return a relatively accurate approximation to the inverse of a square root several times faster than previously-defined implementations, and was used for generating light rays in the game.

/* Original source: id Tech 3 game engine */
/* Source code released under the GNU General Public License in 2005. */

float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;

x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration,
// this can be removed
return y;
}

The bitwise right-shift operation is demonstrated in the line of code aptly commented with the words, “what the fuck?”. In order to actually understand the code, one has to know various principles of mathematics, the IEEE 754 floating point specification, et cetera, which are difficult to understand and even more difficult to explain. All that really needs to be understood is that the number in y is arbitrarily treated as an integer, subtracted from the “magic number” in the “what the fuck?” line and then changed back into a floating point number. This could hardly be described as an intuitive operation, but it worked.

While that example demonstrated what could be done usefully with bitwise operations, it is not a terribly illuminating example unless you understand all of the principles illustrated by it. In an attempt to demonstrate some sort of understandable use of bitwise operators, we can look at the following example, which is not incredibly useful, but which can be explained more easily.

As we have previously established, text is generally stored in a computer in the ASCII, EBCDIC or Unicode systems, which are ways of encoding a character in a binary format so that a computer can read and store them. Knowing this, we can extract the individual bits from a character and print them to the screen.

#include <stdio.h>
#define CHAR_BITS 8

/* Function prototype */
int expt(int n, int power);

int main(void)
{
    char bits[CHAR_BITS]; /* Used to store the values of each bit */
    char sample='T'; /* A sample ASCII character */
    int i;

    for (i = 0; i < CHAR_BITS; i++) {
        /* Define each element of bits[] to be the corresponding bit */
        bits[i] = (sample & expt(2, 7-i)) >> 7 - i;
    }

    for (i = 0; i < CHAR_BITS; i++) {
        /* Print each bit */
        printf("%d ", bits[i]);
    }
    printf("\n");
    return 0;
}

int expt(int n, int power)
{
    int result = 1;
    for (power; power > 0; power--) {
        result *= n;
    }
    return result;
}

The result of this operation is:

0 1 0 1 0 1 0 0

Converting this value into base 10, we get 64 + 16 + 4 = 84, which is 65 + 19. ‘T’ is the twentieth letter in the alphabet, and as ‘A’ is defined in ASCII as 65 in base 10, we get the expected answer.

The program works in the following manner: An array of char variables is defined, followed by a sample character (in this case, ‘T’). A for loop counts through each variable in the array, assigning a value using the bitwise AND operator which will be equal to the corresponding bit in the ASCII binary representation for the character. The resultant number is then bit-shifted a number of places so that the answer for that element in the character array will be 1 if the bit is present in the ASCII code for the sample character, and 0 if it is not.

Fundamentals of Binary Input/Output in C

Previously, we examined some of the operations which C allows you to do with files in the form of ASCII text, and how the use of these operations could greatly increase the flexibility of the programs that you can write in C. With these functions, we can take our first steps towards writing a text editor, a file concatenator, a database program or many other useful utilities. However, being able to deal only with ASCII text has its limitations; we are unable, for instance, to operate on image, sound or formatted text files.

In order to expand our programs further, we must delve into the world of the binary file operations which exist in the C programming language. As with the plain-text operations, the binary operations are defined in the <stdio.h> header file. The fopen() function includes special modes for reading from or writing to a file in binary mode, which looks something like this:

fp = fopen("foo.bin", "rb");

Note the addition of the “b” tag to the mode specification. This is required in some operating systems to act upon the file with operations other than the plain-text operations we discussed before, opening up the block input/output operations such as fread() and fwrite(). In a POSIX-compliant system (such as Unix, Linux, Mac OS X, et cetera), this is not required, but for maximum cross-compatibility, it should be included anyway.

While we’re discussing opening files, there is a function included in the C standard library which closes and reopens a file, allowing one to change the read-write permissions which the file is opened with. This function is named freopen(), and is called in the following fashion:

freopen("foo.file", "r+b", fp);

where “foo.file” is the name of the file in question, the mode is as in fopen() and fp is the file pointer or stream which the file is to be associated with. This can be used to change the permissions of a file from read-only to read-write, to append, to write-only or to change the access from plain-text mode to binary mode. It is more traditionally used to change the streams that the standard input/output streams are associated with, particularly in systems where stdin, stdout and stderr cannot be closed manually.

Now that we have a file open in binary mode, we may perform some binary operations on it. The fread() and fwrite() functions are more complicated to declare than the plain-text functions we have seen so far, but these functions have greater flexibility as a result. We will begin with fread(), which takes in a number of items in binary mode. It is declared in the following fashion:

fread(buffer, size_z, number, fp);

where buffer represents the variable or array where the things read by the function are stored, size_z represents the size of the objects to be taken in (for example, sizeof(char) for single bytes, or sizeof(short) for 2-byte short integers), number represents the number of objects to be read in, and fp represents the file pointer or stream which the data is to be read from.

The following simple program takes in a number of integers from a file with the following hexadecimal contents, written in the Intel x86-compatible little-endian format:

0A 00 00 00 14 00 00 00 1E 00 00 00 28 00 00 00 32 00 00 00
37 00 00 00 3C 00 00 00 41 00 00 00 46 00 00 00 4B 00 00 00

If read into a text editor like Emacs, this file is represented by the following meaningless string:

^@^@^@^T^@^@^@^^^@^@^@(^@^@^@2^@^@^@7^@^@^@<^@^@^@A^@^@^@F^@^@^@K^@^@^@

However, we can read this into a C program in binary format and get some meaning out of it.

#include <stdio.h>

int main(void)
{
    FILE *fp;
    int bar[10];
    int i;

    fp = fopen("foo.bin", "rb");
    if (fp == NULL) {
        puts("Error: Input file cannot be read");
        return -1;
    }
    else {
        fread(bar, sizeof(int), 10, fp);
        for (i = 0; i < 10; i++) {
            printf("%d ", bar[i]);
        }
        putchar('\n');
    }
    fclose(fp);
    return 0;
}

This program prints out the following:

10 20 30 40 50 55 60 65 70 75

In any machine with 32-bit int variables and little-endian bit organisation, the same results will apply. However, this code isn’t particularly machine-portable, and will have different results in different computers. For instance, using this code in a computer which uses the IBM POWER architecture, such as a seventh-generation games console like the Xbox 360, will have a radically different result than the one we get with x86 processors. This has to be borne in mind when using binary files in different computers.

Just as we may wish to both read and write plain-text files using C’s functions, we may wish to do the same with binary files. We may perform these writing operations using the fwrite() function, which takes similar arguments to fread(). The declaration of this function is illustrated below:

fwrite(buffer, size_z, number, fp)

where again, buffer refers to the variable or array where the values are stored, size_z refers to the size of the values, number refers to the number of values to be written, and fp refers to the file pointer or stream where the binary is to be written. The following program calculates the first ten powers of two, stores them in an integer array and writes these values to a file.

#include <stdio.h>

int main(void)
{
    FILE *fp;
    int powers_two[10];
    int i;

    powers_two[0] = 1;
    for (i = 1; i < 10; i++) {
        powers_two[i] = powers_two[i-1] * 2;
    }

    fp = fopen("bar.bin", "wb");
    if (fp == NULL) {
        puts("Error: Input file invalid");
        return -1;
    } else {
        fwrite(powers_two, sizeof(int), 10, fp);
    }
    fclose(fp);
    return 0;
}

In my little-endian AMD x86_64 machine, the binary values written to the bar.bin file have the following hexadecimal values:

01 00 00 00
02 00 00 00
04 00 00 00
08 00 00 00
10 00 00 00
20 00 00 00
40 00 00 00
80 00 00 00
00 01 00 00
00 02 00 00

Converted into decimal, we get the values 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512, as expected. Again, these values are almost meaningless in ASCII; most of the bit arrangements correspond to control characters, and the string of characters is of little interest. We can, however, read in the binary data as we did before and perform calculations on the values as standard integers.

Now that we have our basic block input/output functions, we can start investigating some of the other file access functions which we can use on our file pointers. There are a considerable number of these functions defined in <stdio.h>, not all of which are of immediate interest, but which have their own utility in a C program.

One of the functions which is of interest is the rewind() function, which returns the file position to the beginning of the file, clearing the end-of-file and error flags in the process. One might liken this to rewinding a tape; this would also be of use when reading a sound file in a music player, which is one type of program which works on binary files. The rewind() function is illustrated below, reading in integers from the bar.bin file defined in the last program:

#include <stdio.h>

int main(void)
{
    FILE *fp;
    int foo[10];
    int i;

    if ((fp = fopen("bar.bin", "rb")) == NULL) {
        puts("Error: Input file invalid");
        return -1;
    } else {
        for (i = 0; i < 10; i+=5) {
            fread(&foo[i], sizeof(int), 5, fp);
            rewind(fp);
        }
    }

    fclose(fp);

    for (i = 0; i < 10; i++) {
        printf("%d ", foo[i]);
    }
    putchar('\n');

    return 0;
}

This program opens the bar.bin file in binary format, then starts reading values into the foo array one set of four bytes at a time as before. Note the use of the & reference operator; because five integers are read at a time, we need to reference the specific element of the array which we wish to read into, otherwise, we will end up with the first five elements of the array being read into twice, and the others being undeclared. When we run this program, we get the following results:

1 2 4 8 16 1 2 4 8 16

The first five powers of two have been read in to the array twice, with the file rewinding on each occasion. If we were to read into a larger array, the function would start reading from the first value in bar.bin, 01 00 00 00.

Given that we have a function which rewinds the file fully, we might want to prove to ourselves explicitly that the rewind() function has actually rewinded the file position back to the start, and to find the file position before the rewind() function is called. To do this, we can use the ftell() function, a function which returns a long integer value which tells us the current file position. We can expand our previous program with calls to the ftell() function contained within our for loop:

#include <stdio.h>

int main(void)
{
    FILE *fp;
    int foo[10];
    int i;

    if ((fp = fopen("bar.bin", "rb")) == NULL) {
        puts("Error: Input file invalid");
        return -1;
    } else {
        for (i = 0; i < 10; i+=5) {
            fread(&foo[i], sizeof(int), 5, fp);
            printf("%d\n", ftell(fp)); /* Calling ftell() */
            rewind(fp);
            printf("%d\n", ftell(fp)); /* And calling it again */
        }
    }

    fclose(fp);

    for (i = 0; i < 10; i++) {
        printf("%d ", foo[i]);
    }
    putchar('\n');

    return 0;
}

The results from this program are:

20
0
20
0
1 2 4 8 16 1 2 4 8 16

Using this, we can see that the file position was at the twenty-first (noting that as usual in C, we start counting from zero) byte or the start of the sixth integer position in the file before the rewind() function was called, and that the file position returns to the first byte after the rewind() function is called. We have verified that the rewind() function does, in fact, work as we expect it to.

It is clear, though, that the rewind() function is rather limited in its scope; it can only return to the beginning of the file. A music player that could only rewind to the start of a song would be considered terribly limited, and just as with a modern digital music player, we can seek out a specific byte within a file and operate from that position. This is where the fseek() function comes in to play, which works somewhat like rewind(), but with a lot more flexibility. fseek() is called in the following fashion:

fseek(fp, offset, origin)

where fp is the file pointer for which we wish to change the file position, offset is the number of bytes we want to move the file position, given a certain origin, which is equal to the defined value SEEK_SET for the beginning of the file, SEEK_CUR for the current file position and SEEK_END for the end of the file. For this function, we might want to define a larger binary input file, so that we can see the full extent of the fseek() function. The following set of hexadecimal values was saved to the baz.bin file:

01 01 00 00 01 02 00 00 01 04 00 00 01 08 00 00 01 10 00 00 01 20 00 00 01 40
00 00 01 80 00 00 01 00 01 00 01 00 02 00 01 00 04 00 01 00 08 00 01 00 10 00
01 00 20 00 01 00 40 00 01 00 80 00 01 00 00 01 01 00 00 02 01 00 00 04 01 00
00 08 01 00 00 10 01 00 00 20 01 00 00 40 01 00 00 80 02 01 00 00 02 02 00 00
02 04 00 00 02 08 00 00 02 10 00 00 02 20 00 00 02 40 00 00 02 80 00 00 02 00
01 00 02 00 02 00 02 00 04 00 02 00 08 00 02 00 10 00 02 00 20 00 02 00 40 00
02 00 80 00 02 00 00 01 02 00 00 02 02 00 00 04 02 00 00 08 02 00 00 10 02 00
00 20 02 00 00 40 02 00 00 80

The following program declares an integer array of size 48, and moves the file pointer to i bytes past the start of the array every time the for loop runs.

#include <stdio.h>
#define ARRAYSIZE 48

int main(void)
{
    FILE *fp;
    int values[ARRAYSIZE];
    int i, j;

    if ((fp = fopen("baz.bin", "rb")) == NULL) {
        puts("Error: Input file invalid");
        return -1;
    } else {
        for (i = 0; i < ARRAYSIZE; i++) {
            fread(&values[i], sizeof(int), 1, fp);
            fseek(fp, i, SEEK_SET);
        }
    }

    for (i = 0; i < ARRAYSIZE; i++) {
        printf("%d ", values[i]);
    }
    putchar('\n');

    return 0;
}

This returns the following:

257 257 16777217 33619968 131328 513 16777218 67174400 262400 1025 16777220
134283264 524544 2049 16777224 268500992 1048832 4097 16777232 536936448
2097408 8193 16777248 1073807360 4194560 16385 16777280 -2147418112 8388864
32769 16777344 65536 16777472 65537 16777472 65537 33554688 131073 16777728
65538 67109120 262145 16778240 65540 134217984 524289 16779264 65544

All of these values are somewhat closely related to powers of two, as the series of values which we placed into the bar.bin file would suggest. Nevertheless, this is not an incredibly exciting program, nor are the programs we defined before. We are simply working on a set of integers, but binary files can be more interesting.

Executable files are a sort of binary file, although with modern computer architectures – and for that matter, even for older, more consistent architectures – working on these executables in unadorned hexadecimal is difficult. Other types of files that are in binary format are the likes of MP3 music files, MPEG movies and various formats of image files.

One of the simplest image formats is the uncompressed BMP bitmap format defined by Microsoft. The mandatory components of a BMP file are the 14-byte header, which stores general information about the file, a DIB header of various size which contains more detailed information about the bitmap image, and a pixel array, which consists of blocks of bytes which encode the red, green and blue colour values of the pixels, along with an optional transparency (or alpha) value.

Using this information, we can write a simple application which reads in a BMP file, then does something with this file, such as inverting the values of the colours. The application then writes this information to a separate file, including the header and DIB header. The following program takes an input file, which has been defined in this instance as having a 40-byte DIB header corresponding to one of the seven BMP header types that exist. The filename of the input file can be taken in as a command-line argument, or defined within the program itself.

/* Image manipulation: Colour inverter */
/* Based on the work of Leo Tilson @ Dublin Institute of Technology,
   modifications made by Richard Kiernan */

#include <stdio.h>
#include <stdlib.h>
#define CHAR_MAX 2560

int main(int argc, char **argv)
{
    FILE *input_image;
    FILE *output_image;

    unsigned char header[14]; /* Stores the header of the file */
    unsigned char info[40]; /* Stores the DIB header of the file */
    unsigned char *image; /* Once malloc() is called, stores the image data
                           * for the file */

    unsigned int num_read; /* For error checking */

    unsigned int width; /* All derived from the DIB header */
    unsigned int row_length;
    unsigned int height;
    unsigned int im_size; 

    unsigned int row; /* Two counter variables */
    unsigned int pixel; 

    unsigned char blue; /* Stores the blue byte for each pixel */
    unsigned char green; /* Stores the green byte for each pixel */
    unsigned char red; /* Stores the red byte for each pixel */

    unsigned char *lineptr; /* We'll use this to go through each pixel */

    char filename[CHAR_MAX];

    if (argc > 2) {
        printf("Usage: inverter [filename]\n");
        exit(1);
    } else if (argc == 1) {
        printf("Please enter a filename: ");
        scanf("%s", filename);
    }

    input_image = fopen(argc == 2 ? argv[1] : filename, "rb");
    output_image = fopen("output.bmp", "wb");

    if ((input_image == NULL) || (output_image == NULL)) {
        printf("Error: Failed to open an image file\n");
    } else {
        printf("Image opened successfully\n");

        /* Retrieve the header from the input image */
        num_read = fread(header, sizeof(char), 14, input_image);
        if (num_read != 14)
            exit(2); /* Not a valid BMP file! No point in continuing. */

        /* Retrieve the bigger info block from the input image */
        num_read = fread(info, sizeof(char), 40, input_image);
        if (num_read != 40)
            exit(3); /* Not the type of BMP file we're looking for. */

        /* Using the info block to tell us necessary information about the
         * file, like width, height, et cetera. */
        width = info[4] + info[5] * 256 + info[6] * 256 * 256 + info[7] * 256
            * 256 * 256;
        height = info[8] + info[9] * 256 + info[10] * 256 * 256 + info[11] * 256
            * 256 * 256;
        im_size = info[20] + info[21] * 256 + info[22] * 256 * 256 + info[23]
            * 256 * 256 * 256;

        row_length = im_size / height;

        /* Now, allocate memory for the image and read it in. */

        image = (char *) malloc (im_size);
        if (image == NULL)
            exit(4); /* Wrong size; no point in continuing */

        num_read = fread(image, sizeof(char), im_size, input_image);
        if (num_read != im_size)
            exit(5); /* Some sort of error in reading in the file. */

        /* Now to invert the image's colours */
        for (row = 0; row < height; row++) {
            /* Define the pointer position for this row */
            /* This is the first blue byte on the row */
            lineptr = image + row * row_length;

            for (pixel = 0; pixel < width; pixel++) {           
                /* Define the current colours */
                blue = *lineptr;
                green = *(lineptr + 1);
                red = *(lineptr + 2);

                /* Invert the colours */
                *lineptr = 255 - blue ;
                *(lineptr + 1) = 255 - green;
                *(lineptr + 2) = 255 - red;

                lineptr += 3 /* Go to next pixel */
            }
        }

        /* Write this to the output file */
        fwrite(header, sizeof(char), 14, output_image);
        fwrite(info, sizeof(char), 40, output_image);
        fwrite(image, sizeof(char), im_size, output_image);

        free(image);
        fclose(input_image);
        fclose(output_image);
    }
}

It is important to note that this is not the most elegant way to write this program, nor is it the fastest. This program could be improved by using pointers rather than the array notation used here, for instance. However, this does demonstrate the sort of program which can be written as soon as you have access to operations that work on files at a binary level.

Using this program as a framework, we could also perform such tasks as changing the colours to greyscale, saturating the colours or other functions that we might see in a fully-featured image manipulation program. Similarly, knowing the details of a music file format would allow us to write programs which could manipulate that type of file.

A final function which is of interest in the context of operations on files is the remove() function. As the name suggests, it is used to remove a file from storage as long as no other filenames are linked to the file. This function allows us to create a simple program along the lines of the Unix rm command, which we will call rm_file.

#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc == 2) {
        if (remove(argc[1]) != 0) {
            printf("Error: File %s could not be deleted\n", argv[1]);
        } else {
            printf("File %s successfully deleted\n", argv[1]);
        }
    } else {
        printf("Usage: rm_file [filename]);
    }
    return 0;
}

As persistent data is such an important idea in computer programming, it is useful to have a series of functions defined in the language you are using that easily operate on that persistent data, and between plain-text and binary input/output operations in C, you have a set of functions which have been tried and tested. These functions allow us to build basic applications, like our basic “text editor”, or the colour inverter program defined above, that can form parts of a more fully-featured system.

Fundamentals of ASCII File Input/Output in C

Since the beginning of digital computing, people have seen the necessity for persistent data which can be quickly loaded into or saved from a computer. Konrad Zuse’s Z3 mechanical computer from 1940 supported instruction loading from punched 35mm film stock; similarly, Charles Babbage’s unbuilt Analytical Engine was to support punch cards similar to those used in Jacquard looms.

When C was first developed in Bell Labs, it was designed with the purpose of being a systems programming language, a more portable and friendly alternative to assembly language. File input and output features were therefore an important part of the language, and are dealt with in a consistent fashion by a series of functions contained within the C Standard Library. In the ISO C90 and C99 standards for C, the file input/output functions are contained within <stdio.h>.

The most elementary operations which one may wish to do with files are to open them, allowing them to be read from and written to, and to close them when those operations have completed. In order to do this, we need some way of representing a file. For this purpose, C provides us with a special type of structure called a file pointer, denoted with the type FILE *. The specifics of the FILE * type are defined in <stdio.h>, but understanding them is not important for understanding file input/output operations. It merely suffices to say that the type facilitates these operations.

Once we have a file pointer within our file, we can use it to open a file. The function used for this task is fopen(), which is declared in the following fashion:

fp = fopen("foo.file", "r");

where fp is the name of the file pointer, the first argument of fopen() is a string containing the name of the file to be opened and the second argument is the mode, a string containing directives to the operating system regarding what level of access is to be allowed to the file. In this case, the mode is “r“, allowing read-only access.

If the file can be opened, the file pointer is updated and the program may continue. However, if a file cannot be opened, e.g. because the user lacks permissions for the file, the fopen() function returns the value NULL, which can be used to set up error reporting and functions. This will be demonstrated below.

As most operating systems have a limit on the number of files that can be open at any one time, and as we may want to clear a buffer which a certain function is operating on, it is a good idea to close a file when we are done with it. Just as C provides fopen() for the purposes of opening a file, it provides the function fclose() for closing a file, which is initialised in the following manner:

fclose(fp);

where fp is again the name of the file pointer. fclose() is automatically called on every file pointer which is still open at the end of a program; it is still a good idea to call it manually on each file as a matter of habit.

The use of both the fopen() and fclose() functions is illustrated below:

#include <stdio.h>

int main(void)
{
    FILE *fp;

    /* Checking for errors! */
    if ((fp = fopen("foo.txt", "r")) == NULL) {
        printf("Error: File cannot be opened\n");
        return 1;
    } else {
        printf("File successfully opened\n");
        fclose(fp);
    }
    return 0;
}

Now that we have these elementary operations, we will want to actually manipulate the files we have opened. Two simple operations which act character by character, like getchar() and putchar(), are fputc() and fgetc(), which insert a character into a file and retrieve a character from a file respectively. These functions can be used to copy the contents of one file to another, for instance:

#include <stdio.h>
#define FILENAME 256

int main(void)
{
    FILE *ifp, *ofp;
    char c;
    char input[FILENAME], output[FILENAME];

    printf("Enter the name of the input file: ");
    scanf("%s", input);

    if ((ifp = fopen(input, "r")) == NULL) {
        puts("Error: input file invalid");
        return -1;
    }

    printf("Enter the name of the output file: ");
    scanf("%s", output);

    if ((ofp = fopen(output, "w")) == NULL) {
        puts("Error: output file invalid");
        return -1;
    }

    while ((c = fgetc(ifp)) != EOF) {
        fputc(c, ofp);
    }
    fclose(ifp);
    fclose(ofp);

    return 0;
}

This set of functions has its uses, but is somewhat limited in its scope. As well as this, only a single character is read or written at a time, which was an issue which we addressed with string functions previously. As well as this, all of the input and output performed by the fputc() and fgetc() functions involves ASCII text.

We will address the issue of strings first. Just as there are functions in the C standard library for getting strings from the standard input and printing them to the standard output, there are functions for getting and printing strings to and from files. fgets(), a function which we briefly saw when dealing with strings previously, and fputs() are equivalents to the gets() and puts() functions found in <string.h>.

We can illustrate a function which gets the contents of a file and prints it to the screen. Several programs which perform functions similar to this exist in Unix. The closest to this program is the cat program called with a single input file; we will therefore call this program “meow“.

#include <stdio.h>
#define MAXCHARS 81

int main(int argc, char **argv)
{
    FILE *ifp;
    char line[MAXCHARS];

    if (argc != 2) {
        puts("Usage: meow ");
        return -1;
    } else if ((ifp = fopen(argv[1], "r")) == NULL) {
        puts("Error: Input file invalid");
        return -2;
    } else {
        while ((fgets(line, MAXCHARS, ifp)) != NULL) {
            fputs(line, stdout);
        }
    fclose(ifp);
    }
    return 0;
}

Note the use of fputs() in this program rather than puts() or printf(). puts() prints another newline character for every line, which does not print the statements faithfully as they are contained in the input files; printf() similarly contains problems with printing files with percent signs, as these are of significance to the printf() function. The fputs() function prints the contents of all text files more faithfully than either of the other functions we may use.

fputs() isn’t just good for printing file contents to the screen; it is also useful for printing strings to files. With this function, we could implement a very simple text editor, similar in concept to the ed text editor which was once the standard Unix line editor. Our editor, which I will call “eddie“, does not have anywhere near the complexity or feature set of even the ed text editor, yet it will serve adequately as an example. Indeed, it is not really accurate to compare our text editor with ed; it is closer to the functionality of the cat program called with the < operator.

#include <stdio.h>
#define MAXCHARS 81

int main(int argc, char *argv[])
{
    FILE *ofp;
    char line[MAXCHARS];

    if (argc != 2) {
        puts("Usage: eddie ");
        return -1;
    } else if ((ofp = fopen(argv[1], "w")) == NULL) {
        puts("Error: Output file invalid");
        return -1;
    } else {
        while ((fgets(line, 81, stdin)) != NULL && strcmp(line, ".\n")) {
            fputs(line, ofp);
        }
    fclose(ofp);
    }
    return 0;
}

Note the strcmp() comparison made in the while loop. A similar comparison is made in the editing mode of the actual ed text editor, checking if the string being entered is a full stop followed by a newline character. This combination of characters is relatively rare in text editing, whether it be in writing source code or in writing human-language texts, and serves as an adequate delineation character to distinguish the end of a text file.

Now that the file string functions have been demonstrated, we can move onto formatted input. I mentioned above that one of the limitations of the fgets() and fputs() functions was that they could only work with ASCII text input. Often, we want to take in the contents of a file in an integer or floating-point form, allowing us to create the likes of database programs.

This is where the fprintf() and fscanf() functions come into play. As the names suggest, these functions are equivalent to the printf() and scanf() functions used for the standard output and input strings. They allow us to address incoming or outgoing data in numerical (and Boolean, in C99) form, as well as ASCII characters. This gives us a greater deal of flexibility, particularly when it comes to data processing applications and the like.

The following functions from a Little Man Simulator whose source code is available under the GNU General Public License demonstrate the loading of a file into memory and saving it to a file. While these functions are regrettably not particularly illuminating outside of their original context, they do adequately demonstrate the use of the fscanf() and fprintf() functions.

/* Function: load_file
Prompts the user to enter a filename, then loads the contents of this file
into the program as instructions/data contents of the mailbox array.
Arguments: mailbox[], an array of short integers.
Returns: a short integer equal to -1 or 0. */
short load_file (short mailbox[])
{
    FILE *fp;
    short i = 0;
    int digits;
    char filename[FILE_LENGTH];

    printf("Enter a valid filename: ");
    scanf("%s", filename);

    /* If file won't open, i.e. *fp == NULL, exit with error state; otherwise,
    open file and use fscanf to enter the digits into the mailboxes. */
    if ((fp = fopen(filename, "r")) == NULL)
    return -1;
    else {
        while (i < MAILBOXES && (fscanf(fp, "%d", &digits)) != EOF) {
            mailbox[i] = digits;
            ++i;
        }
    }
    fclose(fp);
    return 0;
}

In this first function, a filename is obtained, which is used for the operation of the fopen() function. The contents of the file, which contains a set of integers which have no greater than three significant digits, are taken and each integer is placed into a variable named “digits”, before being placed into a corresponding mailbox slot. This continues until the i variable exceeds the number of mailboxes (typically 100 in a standard Little Man implementation) or until the end-of-file character is encountered.

/* Function: save_file
Prompts the user to enter a filename, then saves the instructions/data
contents of the mailbox array as contents of this file.
Arguments: mailbox[], an array of short integers.
Returns: a short integer equal to -1 or 0. */
short save_file (short mailbox[])
{
    FILE *fp;
    short i;
    char filename[FILE_LENGTH];

    printf("Enter a valid filename: ");
    scanf("%s", filename);

    /* If file won't open, i.e. *fp == NULL, exit with error state; otherwise,
    open file and save the contents of memory to it. */
    if((fp = fopen(filename, "w+")) == NULL)
        return -1;
    else {
        for (i = 0; i < MAILBOXES; i++) {
            fprintf(fp, "%d\n", mailbox[i]);
        }
    }
    fclose(fp);
    return 0;
}

This corresponding function takes the contents of memory and for each mailbox slot, prints an integer of at most three significant digits to the file specified by the user. Empty mailbox slots are printed as zeroes to the output file.

It should be noted that all of these functions only operate on ASCII text files, even with the formatted input. Binary file input/output will be explained subsequently, but the functions are contained within <stdio.h> just as with the ASCII file input/output functions.