Fundamentals of String Manipulation in C: Part 2

Having discussed most of the important string functions in Part 1, only a few others of particular note remain. We saw gets() previously, which reads a whole line of input into a string (unlike scanf("%s", string), which stops at the first whitespace character). While it performs the job it’s asked to do, it is regarded as dangerous because it has no way of knowing how large the destination array is, and so is prone to buffer overflow. A safer function exists in the C standard library in <stdio.h>.

fgets() is the counterpart of gets() designed to work on file input. Unlike gets(), it lets you specify a maximum number of characters to be read, which prevents the buffer overflow that can occur with gets(). fgets() is called with three arguments: the string where input will be stored; the maximum number of characters to store, including the null character; and either stdin, which refers to the standard input stream, or another pointer to a variable of the type FILE. File pointers will be discussed later; for now, we are only interested in stdin.

An example of the use of fgets() is illustrated below:

#include <stdio.h>

int main(void)
{
    char string[30];
    printf("Please enter a string: ");
    fgets(string, 30, stdin);
    puts(string);
    return 0;
}

One notable difference between fgets() and gets() is that fgets() keeps the newline character at the end of its input (when the whole line fits in the array), while gets() discards it. Bear this in mind when using strcat() to concatenate two strings that were read from the standard input stream with fgets(): the first string will still end in a newline.
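A common way of dealing with the leftover newline is to overwrite it with the null character. The following is a minimal sketch; strip_newline() is a hypothetical helper name, not a standard library function, and it relies on strcspn() from <string.h>.

```c
#include <string.h>

/* Hypothetical helper (not part of the standard library): removes the
 * first newline left in a buffer by fgets(), if there is one. */
void strip_newline(char *s)
{
    /* strcspn() returns the number of characters before the first '\n',
     * or the length of the string if no '\n' is present, so this either
     * replaces the newline or rewrites the existing null character. */
    s[strcspn(s, "\n")] = '\0';
}
```

Calling strip_newline(string) directly after fgets(string, 30, stdin) leaves a string that can be passed to strcat() without an unwanted line break in the middle.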

Previously, we also saw the strcmp() function for comparing two strings. A related function, strstr() (for “string string”), can be used to find an instance of one string within another string. It takes two arguments, the string to be searched followed by the string to search for, and returns a pointer to the first instance of the second string within the first, or NULL if there is no such instance. The following program demonstrates strstr() in action.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char string[] = "yellow dinosaurs eat snow reluctantly";
    char *p;
    int index;

    /* Looking for the location of "eat" in string[] */
    p = strstr(string, "eat");
    /* Finding the element within the array where "eat" begins */
    index = (int)(p - string);
    printf("The string \"eat\" begins at index %d of string[]\n", index);

    return 0;
}

This returns the following:

The string "eat" begins at index 17 of string[]
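The program above assumes that the search succeeds. strstr() returns NULL when the string being searched for is absent, and an index must not be computed from a NULL pointer. A minimal sketch of a checked version follows; find_index() is a hypothetical helper name chosen for illustration.

```c
#include <string.h>

/* Hypothetical helper: returns the index at which needle first appears
 * in haystack, or -1 if it does not appear at all. */
int find_index(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);

    if (p == NULL)      /* not found: there is no index to compute */
        return -1;
    return (int)(p - haystack);
}
```

A caller can then test for -1 before using the result as an array index.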

Searching for multiple instances of a string using strstr() requires a slight modification of our program. We keep an index, the point in the array where the last instance of the string was found, and call strstr() again starting from the next character in memory (i.e. the pointer long_string + index + 1). The following program illustrates this:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char long_string[] = "The C programming language was invented in the \n"
    "early 1970s by the computer scientist, Dennis Ritchie, who was then \n"
    "working at Bell Labs in New Jersey, which had just removed its \n"
    "support from the Multics project.\n";
    int index = -1, count = 0;
    char *p;

    printf("%s", long_string);

    /* Resume each search one character past the previous match, and stop
     * once strstr() returns NULL. */
    while ((p = strstr(long_string + index + 1, "in")) != NULL) {
        index = (int)(p - long_string);
        ++count;
    }
    printf("\nThe string \"in\" has been located in the string %d times\n", count);
    return 0;
}

This example returns the following:

The C programming language was invented in the 
early 1970s by the computer scientist, Dennis Ritchie, who was then 
working at Bell Labs in New Jersey, which had just removed its 
support from the Multics project.

The string "in" has been located in the string 5 times

One thing to notice about this program is that each search starts one character past the previous match (long_string + index + 1) rather than at the beginning of the array; otherwise strstr() would find the same instance again and again. The loop ends when strstr() returns NULL, which means that there are no more instances of the string left to be counted.
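The same counting technique can be packaged as a reusable function. This is a sketch only; count_occurrences() is a hypothetical name, and the choice to count overlapping matches (by advancing one character at a time) and to return 0 for an empty search string are assumptions made for this example.

```c
#include <string.h>

/* Hypothetical helper: counts how many times word appears in text,
 * including overlapping instances. */
int count_occurrences(const char *text, const char *word)
{
    int count = 0;
    const char *p = text;

    if (*word == '\0')  /* an empty search string matches everywhere */
        return 0;

    while ((p = strstr(p, word)) != NULL) {
        ++count;
        ++p;            /* resume one character past this match */
    }
    return count;
}
```

With this in place, the loop in main() reduces to a single call such as count_occurrences(long_string, "in").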

In the last tutorial, we demonstrated atoi(), a function which converts a string consisting of decimal digits into an integer. It was mentioned at the time that other functions of the same type exist. atof(), for instance, converts a string consisting of digits, an optional exponent and at most one radix point into a double; atol() operates like atoi(), except that it returns a long int. On platforms where int and long int are the same size, atol() gives the same results as atoi(); where long int is wider, as on many 64-bit systems or on older compilers with 2-byte ints, atol() can convert values that are too large for atoi().

The implementation of simple versions of atoi() and atof() is discussed in The C Programming Language (Kernighan & Ritchie, 2nd Edition, 1988). Other functions of this type with more flexibility exist, like strtod(), strtol() and strtoul(), the operations of which can be found in any good C reference material.
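As an illustration of that extra flexibility, strtol() reports where the conversion stopped and whether the value was out of range, neither of which atoi() can do. The following is a minimal sketch; parse_long() is a hypothetical helper, and rejecting trailing characters is a design choice made for this example rather than something strtol() requires.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: converts s to a long in base 10. Returns 1 and
 * stores the value in *out on success, or returns 0 on failure. */
int parse_long(const char *s, long *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(s, &end, 10);

    if (end == s)           /* no digits were found at all */
        return 0;
    if (*end != '\0')       /* characters left over after the number */
        return 0;
    if (errno == ERANGE)    /* value does not fit in a long */
        return 0;

    *out = value;
    return 1;
}
```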


Fundamentals of String Manipulation in C: Part 1

We’ve already seen arrays, which are contiguous sequences of variables of the same type, and we’ve seen how we use them. Arrays can be declared for all data types, including integers of all lengths, floats, doubles and chars. A char array could be used to hold a series of characters corresponding to a word or sentence, as such:

#include <stdio.h>

int main(void)
{
    char word[] = {'f', 'o', 'o', 'b', 'a', 'r'};
    int i;
    for (i = 0; i < 6; i++)
	printf("%c", word[i]);
    printf("\n");
    return 0; /* Recall that main() returns this as an error code */
}

The array is defined with a series of characters, six in total, and the for loop goes through each character in sequence and prints it to the screen. This is, however, a cumbersome procedure and requires knowledge of the length of each individual character array. Because the manipulation of text in C is so common, there is a simpler, less cumbersome way of declaring and using arrays of characters, as illustrated below:

#include <stdio.h>

int main(void)
{
    /* Note that there are no curly brackets, and quotation marks are
     * used. */
    char word[] = "foobar";
    int i;
    for (i = 0; i < 6; i++)
	printf("%c", word[i]);
    printf("\n");
    return 0;
}

The format of the character array in this example is known as a string. We’ve seen strings many times before without explicitly referring to them as such; the first argument of printf(), for instance, is a string. In fact, we can change the printf() calls in this example to make the program simpler:

#include <stdio.h>

int main(void)
{
    char word[] = "foobar";
    printf("%s\n", word);
    return 0;
}

This is much simpler to read, less dependent on knowing the length of the string and less error-prone. The question is, however, “How does printf() know when to stop printing?” The answer is that the word[] string isn’t just declared with the six characters in “foobar”, but also with a so-called null character, which is written as '\0'. Internally, printf() goes through the characters in the string until it reaches the null character, at which point it stops printing.
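To make the role of the null character concrete, here is a sketch of how a function can find the end of a string by walking along it until '\0' is reached. count_chars() is a hypothetical name used for illustration; it is not how any particular library implements printf().

```c
#include <stddef.h>

/* Hypothetical helper: counts the characters in s that come before the
 * terminating null character. */
size_t count_chars(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')    /* stop at the null character */
        ++n;
    return n;
}
```

For the array declared as char word[] = "foobar", count_chars(word) gives 6, while sizeof word is 7, because the array also holds the null character.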

There is another function in <stdio.h> which acts like printf() with a %s format specifier, and which is easier to remember and use. puts() takes a single argument, a character array, and prints it to the screen with a newline character at the end. This is illustrated below:

#include <stdio.h>

int main(void)
{
    char word[] = "foobar";
    puts(word);
    return 0;
}

As you can see, this is a cleaner way of printing strings to the screen than printf(), albeit without the flexibility. String printing is commonplace in text-based systems with C programming, and so puts() is often useful.

Now that we have a string, we might want to add it to another string. This can be done by declaring a larger character array which is large enough to accommodate all of the characters of both strings, and then copying the contents of both strings together. This can be illustrated with the following example:

#include <stdio.h>

int main(void)
{
    char word_1[] = "foobar";
    char word_2[] = "baz";
    char combined[10];
    int i = 0, j = 0;

    /* Continue going through the string until '\0' is reached */
    while (word_1[i] != '\0')
    {
        combined[i] = word_1[i];
        i++; /* Incrementing i goes to the next slot in memory for
              * both arrays */
    }
    /* i is now equal to the first empty element in combined */

    /* Starting again with a new counter variable */
    while (word_2[j] != '\0')
    {
        combined[i+j] = word_2[j];
        j++;
    }
    combined[i+j] = '\0';
    printf("%s\n", combined);
    return 0;
}

This will print out a string with the contents “foobarbaz”. The problem here, as with defining the character array from individual characters above, is that this is complicated, cumbersome and error-prone. For this reason, several functions have been defined in the C standard library under the header file <string.h>, which deal more cleanly with the manipulation of strings. The following program demonstrates some of them:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char word_1[] = "foobar";
    char word_2[] = "baz";
    char combined[10];

    strcpy(combined, word_1);
    strcat(combined, word_2);
    puts(combined);
    return 0;
}

As you can see, this is a far cleaner way of defining these operations, without any need for explicit counters or anything else which would make the program more error-prone. Both strcpy() and strcat() take their arguments in the following form:
strcpy(destination, source);
strcat(destination, source);

strcpy() (standing for string copy) is used to copy the contents of one string to another, including the '\0' null character. strcat() (standing for string concatenate) is used to append the contents of one string onto the end of another; the word “concatenate” is merely a fancy way of saying “link together”, in this case referring to joining two pieces of text together.
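Neither strcpy() nor strcat() checks whether the destination array is large enough, so that check is left to the programmer. The following is a minimal sketch of one way to do it; join_strings() and its parameters are hypothetical names introduced for this example.

```c
#include <string.h>

/* Hypothetical helper: joins first and second into dest, an array with
 * room for size chars. Returns 1 on success, or 0 (leaving dest
 * untouched) if the result would not fit. */
int join_strings(char *dest, size_t size,
                 const char *first, const char *second)
{
    /* Both strings plus one slot for the null character must fit. */
    if (strlen(first) + strlen(second) + 1 > size)
        return 0;

    strcpy(dest, first);
    strcat(dest, second);
    return 1;
}
```

With char combined[10], the call join_strings(combined, sizeof combined, "foobar", "baz") succeeds, since “foobarbaz” needs nine characters plus the null character.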

There are a number of other useful functions defined in <string.h> which are of note. strcmp() takes two strings and compares the characters in them. If the characters are exactly the same, the function returns 0; otherwise, it returns a negative or positive value depending on whether the first character that differs has a lower or higher value in the first string than in the second. Many implementations return the difference between the ASCII values of those two characters, but only the sign is guaranteed. strlen() returns the length of a string, not counting the null character. Both functions are illustrated below:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char word_1[]="hello";
    char word_2[]="world";
    char word_3[]="concatenate";

    /* Note the different values returned each time. */
    printf("The difference between word_1 and word_2 is %d\n",
    strcmp(word_1, word_2));

    printf("The difference between word_2 and word_1 is %d\n",
    strcmp(word_2, word_1));

    /* strlen() returns a size_t, which is printed with %zu */
    printf("The length of word_3 is %zu\n", strlen(word_3));

    return 0;
}

This program returns the following results:

The difference between word_1 and word_2 is -15
The difference between word_2 and word_1 is 15
The length of word_3 is 11

This is consistent with the difference between 'h' and 'w' in the ASCII table (104 - 119 = -15), and with the length of the word “concatenate”. Note, however, that the C standard only guarantees the sign of the value returned by strcmp(); some implementations simply return -1 and 1 here. These functions are therefore useful when comparing text, which is a common task in C programs.
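Because only the sign of the value returned by strcmp() is guaranteed by the C standard, portable code tests whether the result is less than, equal to or greater than zero rather than relying on a particular number. A small sketch follows; compare_sign() is a hypothetical helper written for this example.

```c
#include <string.h>

/* Hypothetical helper: reduces the result of strcmp() to -1, 0 or 1 so
 * that it is the same on every implementation. */
int compare_sign(const char *a, const char *b)
{
    int r = strcmp(a, b);

    /* (r > 0) and (r < 0) each evaluate to 1 or 0 */
    return (r > 0) - (r < 0);
}
```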

Using scanf(), we can write our own text into a character array as a string; however, scanf() with a plain %s offers no protection against the input being longer than the array, in which case the extra characters are written past the end of the array. The use of scanf() to write into a string is illustrated below:

#include <stdio.h>

int main(void)
{
    char string[21]; /* 20 characters plus the null character. */

    /* Note that we don't use the & address-of operator here: the array
     * name already refers to the address of its first element. */
    scanf("%s", string);
    printf("%s", string);
    return 0;
}

This will work successfully unless more than 20 characters are entered; in that case the array overflows, which is undefined behavior and typically corrupts memory or causes a segmentation fault that halts the program unexpectedly. As with printf() and puts(), there is a function defined specifically for reading characters into a string, called gets(). gets() is, however, considered dangerous because it cannot be told the size of the array; it is defined in ISO C90 and C99 but was removed from the standard in C11, and safer replacements should be preferred. It is illustrated below:

#include <stdio.h>

int main(void)
{
    char string[21];
    gets(string);
    puts(string);
    return 0;
}
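One way of limiting how much the scanf() family will store is to put a maximum field width in the format specifier, e.g. %20s for an array of 21 chars. The sketch below uses sscanf(), which reads from a string instead of the standard input stream, purely so that the behavior is easy to check; read_word() is a hypothetical helper, and the same %20s specifier can be used with scanf().

```c
#include <stdio.h>

/* Hypothetical helper: copies the first word of input into word, which
 * must have room for 21 chars. Returns 1 if a word was read, else 0. */
int read_word(const char *input, char word[21])
{
    /* %20s stores at most 20 characters followed by the null character,
     * so the array cannot overflow. */
    return sscanf(input, "%20s", word) == 1;
}
```

Input longer than 20 characters is cut short rather than written past the end of the array.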

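Another thing to note is that an array cannot be assigned to as a whole once it has been declared: a statement such as word = "baz"; will not compile, so a function such as strcpy() must be used to change its contents. Individual elements of a char array initialized from a string can still be assigned (e.g. word[0] = 'g';), but a string literal that is reached through a pointer, as in char *p = "foobar";, must not be modified.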

Finally, there are a number of functions defined in <stdlib.h> which can be used to convert a string of characters with numerical values (i.e. '0' to '9') into integers, floats or doubles. atoi(), for example, (standing for ASCII to integer) converts a string consisting of such values, e.g. “1234”, into an integer value. Other similar functions are also contained in <stdlib.h> and are described in any reference to the C standard library (such as Appendix B of The C Programming Language (Kernighan & Ritchie, 2nd Edition, 1988)).
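As a brief sketch of atoi() in use, the function below converts two strings of digits and adds the resulting integers. add_numeric_strings() is a hypothetical name chosen for illustration, and the sketch assumes that both strings hold valid numbers small enough to fit in an int, since atoi() gives no way of detecting errors.

```c
#include <stdlib.h>

/* Hypothetical helper: converts two strings of digits to integers with
 * atoi() and returns their sum. Assumes both strings are valid. */
int add_numeric_strings(const char *a, const char *b)
{
    return atoi(a) + atoi(b);
}
```

Calling add_numeric_strings("1234", "66") gives the integer 1300, whereas joining the two strings with strcat() would give the text “123466”.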