This is day 11 of my participation in the August More Text Challenge.

In this vast sea of people, thank you for the second you spend here. I hope my article is helpful to you!

May you always keep your passion and go on to see the mountains and the seas!

Advanced I/O streams

Yesterday we learned about buffered streams, one family of advanced streams. In the larger system of I/O streams, there are more advanced streams waiting for us to unlock.

So without further ado, today we are going to look at another of these advanced streams: conversion streams.

Conversion streams

Earlier we saw that reading Chinese text with a byte stream produces garbled characters. Why does that happen, and what is an encoding? In the end we used a character stream to handle the text, so can we convert between the two kinds of stream? Let’s look at where conversion streams come from.

Converting between bytes and characters is done by encoding and decoding according to certain rules. So why do garbled characters appear? Let’s see.

Character encoding and decoding

All information in a computer is stored as binary numbers. The characters we see on screen, such as digits, English letters, punctuation marks and Chinese characters, are the result of converting those binary numbers. Storing characters in the computer according to some rule is called encoding. Conversely, parsing the stored binary numbers according to some rule and displaying them is called decoding. For example, if text is stored according to rule A and parsed according to rule A, the correct characters are displayed. If it is stored according to rule A but parsed according to rule B, the result is garbled characters.

We can use the Java we learn to explain:

String(byte[] bytes, String charsetName): decodes a byte array into a String using the specified character set
byte[] getBytes(String charsetName): encodes a String into a byte array using the specified character set
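As a minimal sketch of the "rule A versus rule B" idea, the following example encodes a string with one character set and decodes it with another. The class name EncodeDecodeDemo and the helper recode are made up for illustration, and the sketch uses the Charset overloads of the two methods rather than the charsetName overloads listed above, so that no checked exception has to be handled. It assumes the JDK in use provides the GBK charset, which standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Assumption: GBK is available; Charset.forName throws if it is not.
    static final Charset GBK = Charset.forName("GBK");

    // Encode text into bytes with one charset, then decode those bytes with another.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the source file encoding does not matter.
        String text = "\u4e2d\u56fd";

        // Rule A for both steps: the text survives the round trip.
        System.out.println(text.equals(recode(text, GBK, GBK)));                    // true

        // Rule A for encoding, rule B for decoding: the text comes back garbled.
        System.out.println(text.equals(recode(text, GBK, StandardCharsets.UTF_8))); // false
    }
}
```

The second call does not throw. The decoder silently substitutes replacement characters for bytes it cannot interpret, which is exactly why garbled text tends to go unnoticed until someone looks at the output.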

To make this easier to understand: I do not know whether you have seen the TV series Lurk starring Sun Honglei, but even if you have not, you know what a spy drama is. Suppose you and I are spies lurking in the enemy camp and we need to communicate. Would you just say it out loud, or send a letter in plain words saying "meet me on the rooftop tonight"? I promise that with a teammate like that, it would all be over in less than a day and we would both be finished. So we cannot do that. We have to transform the message according to agreed rules, so that even if the enemy gets hold of the letter, they are confused, wonder what on earth it is, and can do nothing with it. We may not manage to meet that night, but at least we stay alive and safe.

  • In this case, you and I need rules that we both understand, written down as a table. We call such a rule a character encoding: a set of rules mapping the characters of a natural language to binary numbers.

    The table itself is the conversion rule we refer to, called the character set (code table): the correspondence between written characters and the computer’s binary numbers.

  • When you write the letter, you take something you and I understand and, following the rules we agreed on, turn it into something nobody else understands. That process is encoding.

  • When I receive your letter, I parse text I cannot read directly, following the same rules. That process is decoding.

  • If you are drunk when you write to me and, instead of playing by our agreed rules, write according to some other rule, and I then parse the letter with our rules, I end up staring at it with a big question mark on my face. That is what we call garbled text.

So let’s look at the rules, the character set:

  • Character set (Charset): also called a code table. It is the collection of all characters a system supports, including national scripts, punctuation marks, graphic symbols, digits and so on.

    To store and recognize the symbols of a character set accurately, the computer needs a character encoding, and each character set must have at least one. Common character sets include ASCII, the GBK character set and Unicode. Once the encoding is known, the character set it belongs to is determined as well, so the encoding is ultimately what we care about.

    The following list of character sets is the most complete one I found on the Internet. If you want to know more, you can look it up on Baidu. After all, we practice Baidu-oriented programming. Hehe.

    • The ASCII character set
      • ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet, used to display modern English. It mainly includes control characters (carriage return, backspace, line feed, etc.) and displayable characters (upper- and lower-case English letters, Arabic numerals, and Western symbols).
      • The basic ASCII character set uses 7 bits to represent one character, for a total of 128 characters. The extended ASCII character set uses 8 bits to represent one character, for a total of 256 characters, which makes room for common European characters.
    • The ISO-8859-1 character set
      • The Latin code table, also known as Latin-1, used to display languages used in Europe, including Dutch, Danish, German, Italian, Spanish, etc.
      • ISO-8859-1 uses a single-byte encoding and is compatible with ASCII.
    • The GBxxx character sets
      • GB stands for "national standard"; these are character sets designed to display Chinese characters.
      • GB2312: the Simplified Chinese code table. A byte less than 127 keeps its original ASCII meaning, but two bytes greater than 127 joined together represent a single Chinese character. This combination covers more than 7,000 characters, of which 6,763 are simplified Chinese characters; the rest include mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation and letters that already existed in ASCII were given new two-byte codes. These are called "full-width" characters, while the original ones below 127 are called "half-width" characters.
      • GBK: the most commonly used Chinese code table. It is an extension of the GB2312 standard, uses a double-byte encoding scheme, and contains 21,003 Chinese characters. It is fully compatible with GB2312 and also supports traditional Chinese characters and the Chinese characters used in Japanese and Korean.
      • GB18030: the latest Chinese code table. It contains 70,244 Chinese characters and uses a multi-byte encoding in which each character takes 1, 2 or 4 bytes. It supports the scripts of China’s ethnic minorities, as well as traditional Chinese characters and the Chinese characters used in Japanese and Korean.
    • The Unicode character set
      • The Unicode encoding system is designed to express any character in any language. It is an industry standard, sometimes translated as "unified code" or "universal code".
      • It uses up to four bytes to represent each letter, symbol, or ideograph. There are three encoding schemes: UTF-8, UTF-16 and UTF-32. UTF-8 is the most commonly used.
      • UTF-8 can represent any character in the Unicode standard and is the preferred encoding for e-mail, web pages, and other applications that store or transmit text. The Internet Engineering Task Force (IETF) requires all Internet protocols to support UTF-8. So when we develop web applications, we use UTF-8. It uses one to four bytes per character, according to these rules:
        1. The 128 US-ASCII characters need only one byte.
        2. Latin letters with diacritics and characters from similar alphabets need two bytes.
        3. Most commonly used characters (including Chinese) need three bytes.
        4. The remaining, rarely used Unicode supplementary characters need four bytes.
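The byte counts in the list above can be checked directly. This is a small sketch, not part of the original article: the class Utf8LengthDemo and its two helpers are illustrative names, the sample characters are written as Unicode escapes so the result does not depend on the source file encoding, and GBK is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as GBK (assumed to be available).
    static int gbkLength(String s) {
        return s.getBytes(Charset.forName("GBK")).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: a US-ASCII letter
        System.out.println(utf8Length("\u00e9"));       // 2: Latin small letter e with acute
        System.out.println(utf8Length("\u4e2d"));       // 3: a common Chinese character
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: supplementary character U+1F600 (a surrogate pair in Java)
        System.out.println(gbkLength("\u4e2d"));        // 2: the same Chinese character in GBK
    }
}
```

The last line is the interesting one for this article: the same Chinese character is three bytes in UTF-8 but two bytes in GBK, so bytes written under one rule cannot be read correctly under the other.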

The encoding problem

We get garbled text when reading the file because our editor, IDEA, defaults to UTF-8: programs run from it decode with UTF-8, so a file that is not UTF-8 is read wrongly. Files created in IDEA are normally UTF-8, so reading and writing them causes no problems. But a file created with Notepad on Windows is saved as ANSI by default, which means it follows the system’s default encoding, and on a Chinese-language Windows that is actually GBK. So our file is in GBK but is read as UTF-8, and the result is naturally garbled.
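To see which character set a program actually falls back on, the following sketch prints the JVM default. DefaultCharsetDemo and defaultCharsetName are illustrative names. Charset.defaultCharset() and the file.encoding system property are standard, but the values printed depend on the JDK version, the operating system and how the IDE launches the program, so no particular output is assumed here.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Name of the charset used by APIs that are not given one explicitly,
    // such as the FileReader(String) constructor.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // System property that commonly reflects or overrides the default.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Running this once from IDEA and once from a plain command prompt is a quick way to find out whether the two environments agree. On JDK 18 and later the default is UTF-8 unless it is overridden.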

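Code demonstrating the garbled output:

import java.io.FileReader;
import java.io.IOException;

public class ReaderTest {
    public static void main(String[] args) throws IOException {
        // FileReader decodes with the default character set (UTF-8 here)
        FileReader fr = new FileReader("E:\\demo\\China.txt");

        int ch;
        // Read one character at a time until the end of the file
        while ((ch = fr.read()) != -1) {
            System.out.println((char) ch);
        }
        fr.close();
    }
}

Program execution result:
� see �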

Can you make out anything at all? If you say you can, I think you are amazing.

So how do we solve the garbled-text problem, that is, the encoding problem? It is time to bring out the conversion streams. With them, garbled text stops being a problem.

InputStreamReader

InputStreamReader: a subclass of Reader that is the bridge from byte streams to character streams. It reads bytes and decodes them into characters using a specified character set. The character set can be specified by name, or the platform’s default character set can be used (for programs run from IDEA, this is typically the encoding the IDE is configured with).

1. Constructors

  • InputStreamReader(InputStream in): creates a character stream that uses the default character set.

  • InputStreamReader(InputStream in, String charsetName): creates a character stream that uses the specified character set.

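    Here is a demonstration:

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;

    public class IpsrTest {
        public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
            // Uses the default character set
            InputStreamReader isr = new InputStreamReader(new FileInputStream("e:\\demo\\China.txt"));

            // Uses the GBK character set, specified by name
            InputStreamReader isr2 = new InputStreamReader(new FileInputStream("e:\\demo\\China.txt"), "GBK");
        }
    }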
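Besides the two constructors listed above, InputStreamReader also has an overload that takes a Charset object instead of a name, which avoids UnsupportedEncodingException. The sketch below uses that overload together with try-with-resources. CharsetOverloadDemo and readAll are illustrative names, and a ByteArrayInputStream stands in for a real file so the example runs anywhere.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetOverloadDemo {
    // Decode everything in the byte stream using the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader, and with it the wrapped stream.
        try (Reader reader = new InputStreamReader(in, charset)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "\u4e2d\u56fd"; // two Chinese characters
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String decoded = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        System.out.println(original.equals(decoded)); // true
    }
}
```

Passing a constant such as StandardCharsets.UTF_8 also protects against typos in the charset name, since a misspelled constant fails at compile time rather than at run time.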

2. Solving garbled characters

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadTest2 {
    public static void main(String[] args) throws IOException {
        String fileName = "E:\\demo\\China.txt";

        // Create a conversion stream with the default character set
        InputStreamReader isr = new InputStreamReader(new FileInputStream(fileName));

        // Create a conversion stream, specifying the character set
        InputStreamReader isr2 = new InputStreamReader(new FileInputStream(fileName), "GBK");

        int ch;
        // Read with the default character set: garbled
        while ((ch = isr.read()) != -1) {
            System.out.print((char) ch);
        }

        isr.close();

        // Read with the specified character set: correct
        while ((ch = isr2.read()) != -1) {
            System.out.print((char) ch);
        }

        isr2.close();
    }
}

Program execution result:
� � China (see � China
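The example above depends on a GBK-encoded file already existing at E:\demo\China.txt. As a self-contained variant, the sketch below first creates a temporary file containing GBK bytes and then reads it twice, once as UTF-8 and once as GBK. GbkFileDemo, writeGbkTempFile and read are illustrative names, the temporary file is only a stand-in for China.txt, and GBK is assumed to be available in the JDK.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class GbkFileDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Create a temporary file whose content is the given text encoded as GBK.
    static Path writeGbkTempFile(String text) throws IOException {
        Path file = Files.createTempFile("china-gbk", ".txt");
        Files.write(file, text.getBytes(GBK));
        return file;
    }

    // Read the whole file through a conversion stream using the given charset.
    static String read(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(file.toFile()), charset)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4e2d\u56fd"; // two Chinese characters
        Path file = writeGbkTempFile(text);
        try {
            // Wrong charset: the bytes are decoded, but not into the original text.
            System.out.println("UTF-8 matches: " + text.equals(read(file, StandardCharsets.UTF_8))); // false
            // Matching charset: the original text comes back.
            System.out.println("GBK matches:   " + text.equals(read(file, GBK)));                    // true
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```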

Doesn’t that solve the garbled-text problem nicely? Mom no longer has to worry that I cannot read my files. Where there is a conversion stream for reading, there is of course one for writing. Let’s take a look.

OutputStreamWriter

OutputStreamWriter: a subclass of Writer that is the bridge from character streams to byte streams. It encodes characters into bytes using a specified character set. The character set can be specified by name, or the platform’s default character set can be used (for programs run from IDEA, this is typically the encoding the IDE is configured with).

1. Constructors

  • OutputStreamWriter(OutputStream out): creates a character stream that uses the default character set.

  • OutputStreamWriter(OutputStream out, String charsetName): creates a character stream that uses the specified character set.

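    Here is a demonstration:

    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.UnsupportedEncodingException;

    public class WriterTest {
        public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
            // Uses the default character set
            OutputStreamWriter osr = new OutputStreamWriter(new FileOutputStream("e:\\demo\\ChinaOut.txt"));
            // Uses the GBK character set, specified by name
            OutputStreamWriter osr2 = new OutputStreamWriter(new FileOutputStream("e:\\demo\\ChinaOut.txt"), "GBK");
        }
    }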

2. Writing data in a specified encoding

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class WriterTest2 {
    public static void main(String[] args) throws IOException {
        // Define the file path
        String fileName = "E:\\demo\\ChinaOut.txt";
        // Create the stream object with the default character set (UTF-8 here)
        OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(fileName));
        // Write the data; close() flushes the buffered characters to the file
        osw.write("North Star");
        osw.close();

        String fileName2 = "E:\\demo\\ChinaOut2.txt";
        // Create the stream object, specifying the GBK encoding
        OutputStreamWriter osw2 = new OutputStreamWriter(new FileOutputStream(fileName2), "GBK");
        // Write the data
        osw2.write("I'm called.");
        osw2.close();
    }
}

Program execution result: ChinaOut.txt contains "North Star" and ChinaOut2.txt contains "I'm called."

Note: if you open the file that was written with the UTF-8 character set in Notepad, its format is shown as UTF-8; the file that was written with the GBK character set is shown as ANSI.
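The difference between the two output files is easiest to see in the bytes themselves. This sketch, which is not from the original article, writes the same text through an OutputStreamWriter into memory instead of a file and compares the sizes. EncodedSizeDemo and encode are illustrative names, and GBK is assumed to be available in the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Write the text through a conversion stream and return the bytes it produced.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the writer flushes any buffered characters into the byte stream.
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4e2d\u56fd"; // two Chinese characters

        System.out.println(encode(text, StandardCharsets.UTF_8).length);  // 6: three bytes per character
        System.out.println(encode(text, Charset.forName("GBK")).length);  // 4: two bytes per character
    }
}
```

For text that contains only ASCII characters the two encodings produce identical bytes, so the difference only shows up once the text includes characters such as Chinese.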

Conclusion

I believe you now have a reasonable understanding of conversion streams, one of the advanced streams in I/O. Look forward to the next chapter on another advanced stream: print streams!

Of course, there are many more streams waiting for us to explore together. See you in the next chapter!

That is all for today. Good night! This article is over, but I am not done: I will keep working hard and keep writing. There is plenty of time ahead, so let’s take it slowly!

Thank you for reading this far! May you stay young and have no regrets!

Note: if you find any mistakes or have suggestions, please leave a comment! If this article helped you, I would be grateful for a follow. Thank you very much!