0x05 - String Theory
Text in computers
I hope I won’t scare you when I tell you that for computers, text is actually just a bunch of numbers. Look at an arbitrary sentence: it’s just made up of letters, spaces and some punctuation. But computers only understand numbers, so to work with text, every character has a corresponding number assigned to it. Which number a character gets depends on the encoding used.
Encoding
The simplest and also one of the oldest encodings is ASCII, which defines only 128 characters, numbered from 0 to 127. For example, the letter A is represented by the number 65 and the digit 0 has the decimal value 48 in ASCII. From the previous chapters, we should know that C++ has a special type char for storing characters. On most platforms, char is 8 bits wide and stores values from -128 to 127 (whether char is signed is up to the implementation, so on some platforms it stores 0 to 255 instead), which is enough for ASCII characters either way. If we wanted, we could even store our characters in an int, but on most platforms int can store numbers ranging from around -2 billion to a little over 2 billion, which is overkill for storing numbers from 0 to 127. So while char is also an integer type, it’s mostly used for storing characters.
The following table shows some of the characters and their corresponding integer values in ASCII encoding:
Character | Value in ASCII |
---|---|
'A' | 65 |
'C' | 67 |
'Z' | 90 |
' ' (space) | 32 |
'0' | 48 |
'1' | 49 |
'9' | 57 |
'\n' (new line) | 10 |
'a' | 97 |
'c' | 99 |
'z' | 122 |
'\0' (end of string) | 0 |
So, how would we create a variable holding the letter A? We could assign the corresponding ASCII value like this: char c = 65; which would work, but it’s far from the best way to do it, because the reader has to know that 65 means A. Let’s look for an easier way to initialize our characters.
Literals
In C++, we have something called literals. Literals are constant values written directly in the source code; the program cannot change them. For example, we have:
- Integer literals:
  - 12 - int
  - 12U - unsigned int
- Floating point literals:
  - 3.14f - float
  - 3.141592654 - double
- Character literals:
  - 'a' - char
  - L'a' - wchar_t
- String literals:
  - "Hello, World" - const char[]
  - L"Hello, World" - const wchar_t[]
From this list, we can see that to create a constant of type char, we can use apostrophes (''). So, a better way to create a variable initialized with A is simply: char c = 'A';
#include <cstdio>
int main() {
    char c = 'A';
    std::printf("Printing a single character: %c\n", c);
    return 0;
}
This code prints: Printing a single character: A
Words
Now we know how to store a single character, but how would we go about storing a word? Or even multiple words? A word or a sentence is just a bunch of characters placed one after another. Consider the word Xertz, which we can break down into individual characters like this: 'X', 'e', 'r', 't', 'z'.
In the previous chapter, we learned that arrays are used to store multiple elements next to each other. And because we’re dealing with characters, let’s create an array of characters:
char charArray[5];
charArray[0] = 'X';
charArray[1] = 'e';
charArray[2] = 'r';
charArray[3] = 't';
charArray[4] = 'z';
In the previous chapter, I already explained that this type of initialization isn’t the best. Let’s try aggregate initialization:
char charArray[5] = { 'X', 'e', 'r', 't', 'z' };
That’s a bit better, but still not the best we can do. This is where literals come into play again. Quoted text ("text") gives us an array of const char, which is exactly what we need now:
const char charArray[6] = "Xertz";
Note that this time, even though our word only has 5 characters, the length of the array is not 5 but 6. To understand why, I have to explain one important peculiarity of text (known as a string in computer science). As I’ve already explained, a string is just a bunch of numbers stored next to each other. The problem is that the computer needs some way to know where the string ends. For example, when printing a string, the printing function prints one character at a time until it reaches a special character that tells it to stop. That special character is written as the escape sequence '\0'. It has the value 0 and denotes the end of the string. When you write a string literal such as "Xertz", this '\0' character is automatically added at the end for us. So in the example above, we actually need 6 characters to store the word Xertz, even though the word only has 5 letters. If we wanted, we could also initialize the array by hand, which would look like this:
char charArray[6];
charArray[0] = 'X';
charArray[1] = 'e';
charArray[2] = 'r';
charArray[3] = 't';
charArray[4] = 'z';
charArray[5] = '\0';
But we don’t have to bother because the C++ compiler can do it for us and all we have to do is to put the text in quotation marks. In the previous chapter, I also told you that the compiler can calculate the length of the array for us if we initialize it during declaration. It can also do it with string literals:
const char charArray[] = "Xertz";
Let’s see a practical example:
#include <cstdio>
int main() {
    const char name[] = "Marek";
    std::printf("My name is %s.\nAnd I really like C++! :-)\n", name);
    return 0;
}
Which prints:
My name is Marek.
And I really like C++! :-)
Formatting text
Looking at all the code examples, you should be asking: what is up with all these %d, %c and %s in the std::printf function calls? The std::printf() function has a formatting feature for inserting variables into our text. That means you can use certain specifiers as placeholders to say that you want to print a variable of a certain type. The std::printf documentation lists all of these specifiers, but I will only go through the most important ones:
- %d is used to print int types
- %f is used to print floating point types (float and double)
- %c is used to print char types
- %s is used to print char array types (strings)
The placeholders are replaced in the same order as the variables are passed to the function as additional arguments, separated by commas (,). For example:
int x = 17;
int y = 10;
std::printf("The result of %d - %d is %d\n", x, y, x - y);
Will print:
The result of 17 - 10 is 7
Exercise
Using variables, try writing a program that prints information about yourself on multiple lines, for example your name, age, address and occupation or hobbies.