EasyC++06, char type with IO acceleration

Hello, I’m Liang Tang.

Today is part 6 of the EasyC++ series: the char type and I/O acceleration.

Click to jump to the GitHub repository. Stars and PRs are welcome~

The type char

char is short for character. As the name suggests, the char type is designed specifically to store characters.

It is very convenient for computers to store numbers; you only need to convert them to binary. Storing characters is a bit trickier, and is usually done by assigning each character a numeric code. This is why char is essentially another integer type: it stores the numeric encoding of a character.

A char typically has eight binary bits, i.e. one byte, so in theory it can represent 256 distinct values. That is enough to cover the letters, punctuation marks, and digits commonly used on a computer, which are defined by ASCII.

ASCII is the American Standard Code for Information Interchange. It is a computer coding system that contains all the English letters as well as punctuation marks and some special characters. There are 128 characters in the entire table, just enough to be stored in a char.

If you look at the table below, Dec stands for numbers and Char stands for characters.

The digit 0 is numbered 48, the lowercase letter a is numbered 97, and the uppercase letter A is numbered 65.

When we assign a character to a char variable, what gets stored is the character's number in the ASCII table. Similarly, when we print a character using %c, the symbol corresponding to the code stored in the char is looked up and displayed.

Since characters are stored as numbers in C++, we can add and subtract them.

Such as:

char c = 'a';
cout << ++c << endl;

This prints b. Besides incrementing, we can also decrement:

char c = 'b';
cout << --c << endl;

This prints a.

Alternatively, we can subtract two char values. A common example is converting a digit character to an int.

char c = '1';
int num = c - '0';

So num holds the integer 1.

As another example, we can check which range a char falls in using the comparison operators:

char c = '1';
if (c >= '0' && c <= '9') {
    cout << "c is a number" << endl;
}

getchar, putchar, cin.get, cout.put

getchar and putchar are character I/O functions inherited from C; they read and write a single character.

getchar and putchar are more efficient than scanf and printf because the data type is known to be a character, so no format string has to be parsed.

For this reason, in competitive programming getchar is often used instead of scanf to read input when every bit of performance counts.

Below is a function that uses getchar to read an int, for reference. It is a well-known trick, but an unorthodox one, and it is not recommended for ordinary code.

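#include <cstdio>

// Read one int from stdin with getchar, skipping any non-digit characters before it.
void read(int &x) {
    int f = 1;          // sign: 1 or -1
    x = 0;
    int s = getchar();  // int rather than char so that EOF can be detected
    while (s != EOF && (s < '0' || s > '9')) {
        if (s == '-') {
            f = -1;
        }
        s = getchar();  // always advance, otherwise a non-digit would loop forever
    }
    while (s >= '0' && s <= '9') {
        x = x * 10 + (s - '0');
        s = getchar();
    }
    x *= f;
}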

cin.get and cout.put are similar to getchar and putchar, but they are C++ features. The following example shows how they are used; they are somewhat more verbose.

char c;
cin.get(c);
cout.put(c);

Chinese input and output

I hesitated for a long time over whether to include this section, because I have no real experience with it; until now I have only used C++ for solving practice problems. In the end I decided to write it, because the topic matters to many readers, especially those who want to build C++ projects. My knowledge here is limited and I have pieced this together from several sources, so if anything is wrong, please point it out~

In fact, C++ can read and output Chinese directly without any problem.

For example, the following code should work perfectly:

string str;
cin >> str;
cout << str << endl;
cout << str.length() << endl;

If we enter two Chinese characters, why is the length in the final output 6? Because I ran this code on a Mac, which uses UTF-8 encoding by default. In UTF-8 a Chinese character usually takes 3 bytes, and the length of a C++ string is counted in bytes, so the length of two characters is 6.

If we write Chinese in the source code, for example:

string str = "中文";
cout << str << endl;

This can cause problems. The most common one is that the encoding used to save the source file differs from the encoding of the runtime environment. For example, the IDE saves files as UTF-8 by default, but the terminal defaults to GBK (common on Windows). The output is then garbled.

One solution is to use wchar_t, the wide version of char. Its size is implementation-defined: 2 bytes on Windows and typically 4 bytes on Linux and macOS. It can be used to store Unicode-encoded characters:

const wchar_t* str = L"中文";

We put the L prefix before the string literal, which tells the compiler that this is a wide-character string; when it is output, it has to be converted according to the locale.

A locale is defined by the language the computer user uses, the country or region where the computer is located, and local cultural conventions. You can think of a locale as a set of environment variables. A locale value has the format language_area.charset: language is the language, such as English or Chinese; area is the region where the language is used, such as the United States or mainland China; charset is the character set encoding, such as UTF-8 or GBK.

These environment variables affect date formatting, number formatting, currency formatting, character handling, and more. On Linux, open a terminal and run the locale command to see the locale the current system uses.

The locale results have 12 categories, and I found the table on the Internet:

LANG provides the default for any category that is not set, and most programs use LANGUAGE to choose the interface language. LC_ALL sets all categories at once and takes priority over any category set individually, while LANG has the lowest priority.

cin and cout are streams of char, so they are not suitable for wchar_t. We should use wcin and wcout instead. wcout uses the C locale by default, so we first need to set its locale to match the locale of the runtime environment.

There are roughly the following ways to set it:

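#include <codecvt>
#include <cwchar>
#include <iostream>
#include <locale>
using namespace std;

int main() {
    const wchar_t* str = L"中文";

    // Pick ONE of the following ways to set the locale of wcout.
    // Some standard libraries may also need the global locale set, e.g. locale::global(locale(""));

    // Option 1: use the default (system) locale of the runtime environment
    locale loc("");
    wcout.imbue(loc);

    // Option 2: name the locale explicitly, using what the locale command printed
    // locale loc("en_US.UTF-8");
    // wcout.imbue(loc);

    // Option 3: use a standard conversion facet (deprecated since C++17)
    // locale utf8(locale(), new codecvt_utf8_utf16<wchar_t>);
    // wcout.imbue(utf8);

    wcout << str << endl;
    wcout << wcslen(str) << endl;  // stay on wcout: mixing cout and wcout on one stream is not reliable
    return 0;
}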

We can use wcslen to get the length of a wide-character string; here it outputs 2 instead of 6.

Encoding in C++ is a big topic. Since it rarely comes up when solving practice problems, we have only scratched the surface here. If you need more, you can research it further yourself.

References:

Internationalization of C language

C++ Primer (6th edition)
