The C Preprocessor

One of the peculiar things about the C programming language is that so many commonly occurring elements are not actually part of the language, per se. All of the functions in the standard library are actually extensions to C, additional parts which give us the input/output, mathematical and utility features which make C powerful. All of these are contained in a set of header files and binaries which are added to programs during the compilation process.

Another extension to C is the C preprocessor, and it is this that gives us the ability to pull those header files into a program and to extend the language in other ways. The C preprocessor is a small language in its own right, and while it is not Turing-complete, it is useful enough for the purposes for which it is called. The C preprocessor reads through a C source file, replacing the directives which are meaningful to the preprocessor with text which is meaningful to the C compiler.

It is somewhat difficult to explain why the C preprocessor is important, but I will attempt to do so with a brief segue into the history of computer languages. Early high-level programming languages, such as Fortran and COBOL, were notable for being able to do one set of tasks very well and most others not so well at all. In some cases, this led to deficiencies which would be considered ghastly today; ALGOL 58 and 60, for instance, did not define any input/output operations, and any I/O routines would be completely machine-dependent.

In the later 1960s, language designers attempted to create new languages which would be suitable for multi-purpose applications. However, these languages, which included PL/I and ALGOL 68, were designed by committees made up of conflicting personalities, many of whom were desperate to see their pet features included in the languages. As the complexity of the languages grew, so did the complexity of developing an efficient compiler. As computing resources were vastly smaller than they are now, these languages were only suitable for mainframe computers, and even there they did not run efficiently.

Therefore, these language experiments tended to fail. PL/I survives because IBM still supports it, but it is moribund outside of the confines of IBM machines; ALGOL 68 is dead and buried. When C came around, Dennis Ritchie was aiming to create a language which implemented enough features to build an operating system and its applications, while still being able to run efficiently on a much less powerful computer than those for which PL/I was designed.

The solution was to create a system in which only the subset of functions required by a specific program would be brought in, rather than the full set. This made compilation of C more efficient, as the compiler generally only had to be concerned with a small number of functions at once. The method chosen to do this was to use the C preprocessor to keep the declarations of most functions outside of the base language; when C was standardised in 1989 by the ANSI committee, and in 1990 by the ISO, all of the library functions were declared in header files rather than built into the base language.

Now that the history lesson is over, we can continue on to the operations of the preprocessor. As mentioned above, the preprocessor scans a C source file and looks for directives that are important to it. It then replaces them with text that is important to the C compiler, or to whatever other system the preprocessor is being used for. (The same idea has been applied to other languages; Brian Kernighan famously developed RATFOR, a preprocessor which gave Fortran control structures similar to those of C.)

The most fundamental operation of the preprocessor is #include. This directive looks for the file named in it, then inserts that file’s entire contents into the source file in place of the #include directive. The file’s contents might themselves contain C preprocessor directives, as is common in C header files, so the preprocessor goes through those and acts upon them appropriately.

One of the most common invocations of the #include directive is the following:

#include <stdio.h>

This directive locates the file stdio.h and places its contents into the source file. The use of angle brackets around the filename indicates that it is stored in a directory whose path is known to the C compiler, and which is defined as the standard storage path for header files. stdio.h itself contains several preprocessor directives, including #define and #include directives, which are resolved by the preprocessor appropriately.

Let’s define a simple program which can be used to test this. The program will be the standard “hello, world” program as defined in The C Programming Language (Brian Kernighan & Dennis Ritchie, 2nd Edition).

#include <stdio.h>

main()
{
    printf("hello, world\n");
}

Now, we can see some of the results when this is passed through the C preprocessor (with GCC, for instance, by running gcc -E on the source file):

typedef long unsigned int size_t;
typedef unsigned char __u_char;
typedef unsigned short int __u_short;
typedef unsigned int __u_int;
typedef unsigned long int __u_long;
typedef signed char __int8_t;
typedef unsigned char __uint8_t;
typedef signed short int __int16_t;
typedef unsigned short int __uint16_t;

...

struct _IO_FILE {
  int _flags;
  char* _IO_read_ptr;
  char* _IO_read_end;
  char* _IO_read_base;
  char* _IO_write_base;
  char* _IO_write_ptr;
  char* _IO_write_end;
  char* _IO_buf_base;
  char* _IO_buf_end;
  char *_IO_save_base;
  char *_IO_backup_base;
  char *_IO_save_end;
  struct _IO_marker *_markers;
  struct _IO_FILE *_chain;
  int _fileno;
  int _flags2;
  __off_t _old_offset;
  unsigned short _cur_column;
  signed char _vtable_offset;
  char _shortbuf[1];
  _IO_lock_t *_lock;
  __off64_t _offset;
  void *__pad1;
  void *__pad2;
  void *__pad3;
  void *__pad4;
  size_t __pad5;
  int _mode;
  char _unused2[15 * sizeof (int) - 4 * sizeof (void *) - sizeof (size_t)];
};

...

extern int fprintf (FILE *__restrict __stream,
      __const char *__restrict __format, ...);
extern int printf (__const char *__restrict __format, ...);
extern int sprintf (char *__restrict __s,
      __const char *__restrict __format, ...) __attribute__ ((__nothrow__));
extern int vfprintf (FILE *__restrict __s, __const char *__restrict __format,
       __gnuc_va_list __arg);
extern int vprintf (__const char *__restrict __format, __gnuc_va_list __arg);
extern int vsprintf (char *__restrict __s, __const char *__restrict __format,
       __gnuc_va_list __arg) __attribute__ ((__nothrow__));

...

main()
{
    printf("hello, world\n");
}

Most of the output has been truncated, but as we can see, the stdio.h header contains typedef declarations for various types, structure definitions including the above one for the FILE type used in the file input/output routines, and function declarations. By being able to include this file from elsewhere, we save ourselves the time and work of copying all of these declarations into our program manually.

While the above form works for the standard header files, the location of the standard header files is restricted to read-only operations for non-administrator users in many operating systems. There is, therefore, another way to specify the location of a header file, which may be an absolute path or a path relative to the directory containing the source file. A pair of directives of this type, using a relative and then an absolute path, is shown below.

#include "foo.h"
#include "/home/jrandom/bar.h"

The operation of these preprocessor directives is similar to that of the one used for stdio.h; the major difference is in where the files are located. Instead of checking the standard directory for header files, the first directive checks the same directory as the source file for a header file named foo.h, while the second follows the absolute path to the /home/jrandom directory and looks there for a file named bar.h. If a quoted file cannot be found in this way, the preprocessor falls back to searching for it as though it had been written in angle brackets.

As it is common practice in C programming to leave #define statements, function prototypes and structure definitions in separate header files, this allows us to create our own header files without having to access the standard directory for header files.

The other particularly common preprocessor directive is #define. The #define directive has two parts, an identifier and a token sequence. The preprocessor replaces every later instance of the identifier with the token sequence. This is useful for defining more legible names throughout the source code, particularly for so-called “magic numbers” whose purpose is not obvious from observation. A few examples of how this may be used are shown below:

#define MAX_FILENAME 255 /* Defines the maximum length of a filename path */
#define DIB_HEADER_SIZE 40 /* Defines the size of a BMP DIB header in bytes */
#define FOO_STRING "foobarbazquux"

In most cases, the #define directive is simply used to provide readable names for obscure or complex values, but there is another sort of functionality which it can be used for. The #define directive can be used to define a macro with arguments, which is an effective way of creating shorthand for a piece of simple code which one doesn’t want to repeat constantly, but for which one doesn’t want the overhead of a function call. An example of this is shown below:

#define SQUARE(x) ((x) * (x))

We might see this definition invoked in a program like so:

#include <stdio.h>
#define SQUARE(x) ((x) * (x))

int main(void)
{
    int a;

    printf("Enter an integer: ");
    if (scanf("%d", &a) != 1) /* stop if the input was not an integer */
        return 1;
    printf("The square of %d is %d\n", a, SQUARE(a));
    return 0;
}

When this program is preprocessed, the SQUARE(a) invocation is replaced by ((a) * (a)). Note the brackets around each argument and around the expansion as a whole; these are imperative for preserving the appropriate order of operations. Let’s say that we were to define SQUARE(x) as the following:

#define SQUARE(x) x * x

and then call it with the following code:

SQUARE(5 + 4)

This would expand out to the following:

5 + 4 * 5 + 4

As multiplication takes precedence over addition, the multiplication in the middle would be performed first, with the multiplication of 4 and 5 giving 20, and then the flanking additions would be performed, giving an answer of 29. This is well short of the 81 that we would expect from the square of 9. Therefore, it is important to bracket your macros appropriately in accordance with the expected order of operations.

Macros can have more than one argument, such as the following definition for a macro to find the larger of two numbers:

#define max(a, b) ((a) > (b) ? (a) : (b))

Having defined something, we may want to undefine it further down the source file, possibly to prevent interference with certain operations, or to ensure that something is a function rather than a macro. For instance, in the standard libraries for low-power embedded platforms, getchar() and putchar() may be defined as macros in order to avoid the overhead of a function call. In order to undefine something, we use #undef. The following code would undefine the SQUARE and max macros which we defined above:

#undef SQUARE
#undef max

Beyond the realms of #include and #define lie the conditional preprocessor directives. The first set of these is used to check whether something has already been defined, while the other set is used to check whether a constant expression is true or false. We’ll discuss the definition-related directives first.

#ifdef is used to check if something has already been defined, while #ifndef is used to check whether something has not been defined. In professional code, this is regularly used to check the operating system and other details about the system which the program is to be compiled for, as the elementary operations which make up basic routines differ on different systems. We can also check if something is defined using the “defined” operator; this is useful if we want to continue checking after an #ifdef or #ifndef statement which was not satisfied.

Let’s say that we had a piece of source code which we needed to maintain on Windows, Mac OS X and Linux. Various bits of the source code might not apply to one or more of those operating systems. We could therefore hide the bits of source code that don’t apply to the current operating system using the following:

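#ifdef _WIN32
/* #include directives for the Windows-specific headers */

#elif defined MACOSX
/* #include directives for the Mac OS X-specific headers */

#elif defined LINUX
/* #include directives for the Linux-specific headers */

#endif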

Note the use of #endif to close our set of conditional directives; it belongs to the second set of conditional directives. #if checks whether a constant expression is true (that is, nonzero) and proceeds if it is; #elif checks another alternative if the preceding condition was not satisfied; #else is a universal alternative if none of the preceding conditions were satisfied; and #endif closes a block of conditional preprocessor directives. These operations work very similarly to the if…else if…else statements in C. The following example checks whether we are compiling for a 32-bit or 64-bit system:

#if !(defined __LP64__ || defined __LLP64__) || defined _WIN32 && \
    !defined _WIN64
/* we are compiling for a 32-bit system */
#else
/* we are compiling for a 64-bit system */
#endif

In this code, the first branch is taken in either of two cases: when neither __LP64__ nor __LLP64__, which identify data models for 64-bit processors, is defined; or when _WIN32, which identifies a Windows platform, is defined without a corresponding definition of _WIN64, which identifies a 64-bit version of Windows. In either case, the program is treated as being compiled for a 32-bit system, which will have different machine instructions to a 64-bit system. Macros like these are predefined by particular compilers rather than by the C language itself, so it is worth checking which of them your compiler actually provides.

While there are some other details of the preprocessor to discuss, they are best left to external reading. To conclude, there are a number of predefined macros in the C preprocessor, such as __LINE__, which expands to the current line number, and __FILE__, which expands to the name of the current source file. The C preprocessor can be somewhat obscure, but it gives the C language a great deal of flexibility, the sort of flexibility that sees its use on everything from microcontrollers to supercomputers.
