Hello everyone, my name is 711. I have worked in the web crawler industry for 4 years and have written many crawlers, large and small: some for work, some out of personal interest, and some as small side projects to earn extra money. But that is beside the point. Today I am sharing my first crawler article on Zhihu: a crawler, written in C#, for the State Administration of Taxation's invoice inspection platform, a site that is notoriously difficult to scrape. My most-used language is actually Python, and I will share some crawlers written in Python in the future, so please stay tuned.
This article relies on the following languages and tools:
JavaScript
C#
Fiddler
Chrome
Basic knowledge of encryption and decryption
The first step is to visit the home page of the State Administration of Taxation's invoice inspection platform. The purpose of this step is to obtain cookies and store them in a C# CookieContainer object so they can be reused in later requests.
The page looks like this:
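In C#, the CookieContainer does this bookkeeping automatically. To show what is actually being stored and replayed, here is a minimal sketch of the same idea in JavaScript (the other language used in this article). The CookieJar class and the sample header values are hypothetical, and the sketch ignores attributes such as Path, Domain, and expiry that a real cookie container honors.

```javascript
// Minimal cookie jar: remembers name=value pairs from Set-Cookie headers
// and replays them as a single Cookie request header.
class CookieJar {
  constructor() {
    this.cookies = new Map();
  }

  // Store one Set-Cookie header value, e.g. "JSESSIONID=abc; Path=/; HttpOnly".
  store(setCookieHeader) {
    const pair = setCookieHeader.split(";")[0]; // attributes after ";" are ignored here
    const eq = pair.indexOf("=");
    if (eq < 1) return; // not a name=value pair
    this.cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
  }

  // Build the Cookie header to send with later requests.
  header() {
    return [...this.cookies]
      .map(([name, value]) => `${name}=${value}`)
      .join("; ");
  }
}

const jar = new CookieJar();
jar.store("JSESSIONID=abc123; Path=/; HttpOnly"); // hypothetical values
jar.store("route=node7; Path=/");
console.log(jar.header()); // JSESSIONID=abc123; route=node7
```

Every later request (captcha, form submission) would send jar.header() as its Cookie header, which is what reusing the CookieContainer achieves in C#.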
Note: the State Administration of Taxation's website uses a self-signed certificate, so its root certificate has to be installed before the first invoice check. Apparently they either do not trust commercial CAs or are simply unwilling to pay for a CA-signed certificate.
OK, download and install the root certificate.
The code is a bit messy, and the editor loses the formatting when I paste it, so I can only post screenshots. Please bear with me.
The second step is to get the image verification code (captcha). Note that the request for the captcha must pass an fpdm parameter (the invoice code).
OK, we have the image verification code. Here is one example. There are several types of captcha:
The verification code endpoint returns JSON, wrapped in a jQuery JSONP callback, as follows:
jQuery1102035659710036181713_1508466648376({“key1″:”iVBORw0KGgoAAAANSUhEUgAAAFoAAAAjCAIAAACb54pcAAAJnklEQVR42tVZCVRTZxZO iUUQtchSUBYFQaqgiKisAgKyDDUYwQKiLMNWZJNVZREEqhYHEUotCC61tiNTW7FTrdPpiK0OYyt6KKd6HMQRdRxLRVQSQoCA8738GB8JCYSlIueenJuX/713 73e/+93/PRgP7zx4hczE/rcRnKWW+49hrmRMkDzbcw6O3vLPlIzyCmMGh3UfewRnRUUHjgkQss07l0X/mlxjO1o4/ns07OUyopDbIeNXVZW5YwIc4xXti7Ey j2BDfa/vB4fDRk9rQgGhbR40ymzX8G6OkB0Zqf8T+W3f1jSUlEjms9Y0LdD4yjgBERVuN+K0bRSPDnNly56P5GsWbk6eYJ5Jn5JSR2yCVClpbD5cfjAtMSUs KHRnTt7Z6tNDAnFpU27l2ogD7PDP/GOvJ+4ZNJqOP27jvps9nKySb68aj94Rh+Np+cFO/0A4HclpAIXPWvPw9n3JJCtKy9axfZZqO+0rKAQi7q5uC03NTn/x lTRGlLJCXmcyVxkvXD1/yTJdQ6aCQqK9p2Q0An3jXlWNZ5Mm9RiZ8Z28eYEJ7dsrZET/77mF4wLHB4c24/PxV99Q+T/PB7g8YzAef/OdJBwfHzgMaiTFbX43 PLL23AUc+eVy/Spn1z8fOSYCoiT/B+K0ba+YOU21LjYf/pmQNHDkQfp+TxPzHOUIeijczQWd7HA4nJQinl9M93JngY7BMyaTq2f0hZXLicAE2JcbEptSirAm Szl7ZAnXG9RK+8kn1YGCY/dxE0ovav7J9/RqbWwmPkWQhCTA8fTQUUk46v91hclkaqhrpKdsNTQwrLtwCQdbbt9fMcfkSXal2G2qNyYl2HnAac0qV35d0est CxCkOa3YRt94QKeEpAkM5nc5sjpCt7w4vr3C+LXXWCrT3Q0XuM9bZKVnBJatW2jVJpM10uxHZ52rRbOHZsejq790OTq1NtygtCNtm8BwLlXk4g8BR3vJfrKG VWZBR2RX7ntgBOgAUC4KGx41V58y9Sw9GaEd9o3a7eEPB0ipKk/5T+o+oBZt7YrGGbAyu7J3hibu2PP8eJeNG98zIEBZpXuhVTM7LMDcFpwqWxMGNDNWrpGd +W+ZZRCpNIe3I5c7y9kst+51Obs+qr0M/8mnVQgI9rDp7tMDh+HwomOxgM/26QwKbb3WSEckPjqWwWBEWbmQa328Llpr6huSdfs6OBUJED9lhddOd78drr5Y CUcyIJ5vVPciG6p3YvN7Z+ojgCAms09JuXp9/NTJSmgWkKs+fpe+qobYiSgDIJ7/5qzAxXagnq/ZcsQD6HGX/d6hcsDB2VMERlAcudLQq64hMDDsSNn6uPo0 gQZAcAoKO9dvaC8tw/FHl+tFSgmBRGTkQmgEKOVf1sfTr35/W+n5iMw6UxaiPOYXQxhkPlP/VFAyOmjQgLrsPPhu78Dh/yGwy9qV5x8bqGfUEZwKfqHRcBwc ActQBuCeX2xDzjrkE7l41uwsZ7aljkHwEgewb6sj67uwdPwE5s6ZoQmyyCGl3MwcKEXPYotuewe0zJOqL5E8CAI4ulzdeCHUDh0HIWyda1/oH0SxcVMLnFmz lxhraCNJ+qUBDYiDWiEU/IR6ElnJd3sHNKb3SDtNbtApEZHpxOk2px4u4mzd8ck2XYp5hIRvJP3pXHgG4GjJ+IjejwAajt8im+MBcbgvjsBIh06eNKl4ddBw 4Si+ZYpPCoXjJ/oboekuAILTp6jYq6bWJlTK7mUr+yYrtWeVp9Vr4kzHN42QJIqAOEBOdCnK8kLA43fZzZ5HRomdFUUZcAdUh4MBQQLFBIE69Gpo92rrkbPO 
7zuFvrBmfA6ABHpGAuEVQi0dSSfiLHAQfoyNGwSVrtm48g+R20kzuhiZQXRRBlRrvRq1AZ+rrhVgbjuqZxYoBQWBrT3f1ZcKPXkvqNElLBTdUDoyNWCQfZSO +GgNxLTNyRs+iQ9iURu9A18LPAOI3PaYLuWk74dGPFNQIHfhBcT1mJhTUyY4VaBr2GO2nHQH/Y53t3wwRXEywUhkV+N2AmXiADWsof8KAfYUXnbkcPCFs6AT HBMWAW2MoLE1oJ8MeYcoQOrJVzCFlJ3IJzoWgwbyBlaj89HeJBkcJ4XttnTkCLennd6huDiY0jdNlUzZLisXbDr4Lj50OBoSduMTioDGhCrRI4GOiG4Nw64X bBLt9NBoKIZsIAa8/jlbXnrC+n36VpIvLGz/jiAik5p/pkvpJUKRseOmj1VUrJKmLBAIjFXon4nmTGCBhgJlAB9Q679seMbffHX692BxO6FKHULCowB9KtP7 pkwlX0VweC+wJLcrensjPXliZJbDoKaM53+kudA40B04Fyy2sH8tPWOXLYaC+NswSbSIvFOWWYb2BhzcTbmiX8OXrWxM3it2CjpCbLLgOQWCjz4XxQctHDBH VnhhT8EZeClQErcTzOrfL220sC8PqCEOOI/ZCR+VJ0okMhEXIGEoDECEoEK5yATEremEkvVycFDycKN39LxlwY3Jw+RDcAha9BOojuFPX3xtcwGCAIclt6SJ dZsQE/QF4RLtEMfdwx9EgGp2W9h3skK6lzigcfqmz+AKWwOGfRQZIoASQxrdR2LApvYarXkLvTYQB8tAW/ACawibUCR0KBl8Q78rldZOnNR92BEhOMTKET4p 0FUDwgkSomLYXKL+YEHD8wTktuxKaCeFhaUjDA5HODWJYRMFJYKDJNFouClGNe6l+4Yauka0TLT7BBcIHISeKIOpli6eJD6pPCLrFUQKbwg4ZBv2UZAr3Bif P8Xkib+8iGaJHdH6+q7kRQKi+rewVj8dkHYj6A6ZGpAeYAEf0qipMh1wiAYZDLtPiAu6CfIkakmAAiDQKXjCOnPyr0O/Sa9aeZKc+fcw6g3C1uojYtFcCrF9 ue/vUGFMVmCB+iNhQhM8xdGnqR+b6iBHg/nYyMMw0UQZVvnXAIgHt+4N9x8LYxj6jxbHxgMRbDTRknjSET0TSLMPOT9LSzXs83O/NxxjaLu/d5L3FBl5hiS2 vQR2jMb01zoMf7GX0fVhAiGXjRkc9+qe0L+qBev/Dgj2v4viVEgmVt3oI3lQILiIT7uf33v1mmXEfTEh2CHNTqb3SR78lXNH2nrdlkEOfhZ9QuTnJBUPMzfP h5rjCwfvxsUJyIievTeHn/CdJvZLY0dbbsVEbo1B4Dg1rXn8sJjGV27NKx9ypeqc1WMOxPkS7si142aT+nggsqBF52KZfP8QUbhdKy8QO64flTdtm6pvpcKR pd8+TgRZ12BVfjJswraGpP0fB8ROfuUjMiwAAAAASUVORK5CYII=”,”key2″:”2017-10-20 10:38:54″,”key3″:”cab180f9f851b8e7802dc1e5cf275413″,”key4″:”01″})
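Before the fields can be used, the jQuery callback wrapper has to be stripped so that the payload parses as ordinary JSON. Here is a minimal JavaScript sketch of that step. The parseJsonp helper is my own name, and the sample is a shortened copy of the response above, with key1 cut down to a few bytes so it stays readable.

```javascript
// Strip a JSONP wrapper such as  callbackName({...})  and parse the JSON inside.
function parseJsonp(text) {
  const open = text.indexOf("(");
  const close = text.lastIndexOf(")");
  if (open === -1 || close <= open) {
    throw new Error("not a JSONP response");
  }
  return JSON.parse(text.slice(open + 1, close));
}

// Shortened sample in the same shape as the response above.
const sample =
  'jQuery1102035659710036181713_1508466648376({"key1":"iVBORw0KGgo=",' +
  '"key2":"2017-10-20 10:38:54","key3":"cab180f9f851b8e7802dc1e5cf275413","key4":"01"})';

const captcha = parseJsonp(sample);
console.log(captcha.key4); // 01
console.log(captcha.key2); // 2017-10-20 10:38:54
```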
By analyzing the page's JS, we found that different key4 values correspond to different types of captcha, so we can process the image in C# and display only the characters in the required color.
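To make the color filtering concrete, here is a rough JavaScript sketch that works on raw RGBA pixel data. Everything specific in it is an assumption on my part: this article does not list which key4 value means which color, so the table below (01 red, 02 yellow, 03 blue, anything else left untouched) and the color thresholds are illustrative only and should be checked against the page's JS.

```javascript
// ASSUMPTION: mapping from key4 to the color of the characters to keep.
const KEY4_COLORS = { "01": "red", "02": "yellow", "03": "blue" };

// ASSUMPTION: rough thresholds for classifying a pixel by its RGB value.
function pixelColor(r, g, b) {
  if (r > 150 && g < 100 && b < 100) return "red";
  if (r > 150 && g > 150 && b < 100) return "yellow";
  if (b > 150 && r < 100 && g < 100) return "blue";
  return "other";
}

// rgba is a flat array [r, g, b, a, r, g, b, a, ...].
// Pixels that are not the wanted color are painted white.
function keepOnlyColor(rgba, key4) {
  const wanted = KEY4_COLORS[key4];
  const out = Uint8ClampedArray.from(rgba);
  if (!wanted) return out; // unknown key4: leave the image untouched
  for (let i = 0; i < out.length; i += 4) {
    if (pixelColor(out[i], out[i + 1], out[i + 2]) !== wanted) {
      out[i] = out[i + 1] = out[i + 2] = 255;
    }
  }
  return out;
}

// Four pixels: red, yellow, blue, black.
const pixels = [
  220, 30, 30, 255,
  230, 220, 40, 255,
  20, 40, 210, 255,
  0, 0, 0, 255,
];
const redOnly = keepOnlyColor(pixels, "01");
console.log(Array.from(redOnly).join(" "));
```

With key4 set to "01", only the first pixel keeps its color and the other three turn white, which is the effect of showing only the characters in the required color.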
The key1 value is the base64-encoded captcha image, which decodes to:
That solves the problem of obtaining the verification code. All that remains is to fill in the remaining form parameters (Figure 1) with the values from the actual invoice.
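Decoding key1 takes only a few lines. The JavaScript sketch below decodes the base64 text, checks for the 8-byte PNG signature, and reads the image size from the PNG header. The decodeCaptcha helper is my own name, and the sample string is just the first 44 characters of key1 from the response above, which is enough to cover the header.

```javascript
// Decode the base64 captcha (key1) into raw bytes and confirm it is a PNG.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

function decodeCaptcha(key1) {
  // Node's base64 decoder skips whitespace, so stray spaces or line breaks are harmless.
  const bytes = Buffer.from(key1, "base64");
  if (bytes.length < 24 || !bytes.subarray(0, 8).equals(PNG_SIGNATURE)) {
    throw new Error("key1 did not decode to a PNG image");
  }
  // In a PNG, width and height are big-endian 32-bit integers at offsets 16 and 20.
  return {
    bytes,
    width: bytes.readUInt32BE(16),
    height: bytes.readUInt32BE(20),
  };
}

// First 44 characters of key1 from the response above (header only, not a full image).
const header = decodeCaptcha("iVBORw0KGgoAAAANSUhEUgAAAFoAAAAjCAIAAACb54pc");
console.log(`${header.width}x${header.height}`); // 90x35
```

Writing the full decoded bytes to a .png file gives the captcha image itself.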
The third step is to submit the form. Be aware that the URL for submitting the form varies from province to province, so finding the right one is the key. Apart from that, this step is not difficult: just fill in the fields the form asks for.
The form looks like this:
It’s been a long road, and we finally see a milestone.
But here is the catch: the data that comes back is what makes this site truly troublesome for a crawler.
Let me give you an example.
The original page looks like this:
And this is what the response looks like:
{
"key1": "001",
"key2": "N▽ 8th floor, 24 Qilin Middle Street, Shatai Road, Baiyun District, Guangzhou 87095513, 10, 20170820, 91220101556397833M, Guangzhou Bank Sha Tai South Branch 800205874708016, 91340421MA2NE7XFXF, North of Nongshui Road, Chengguan Town, Fengtai County, Huainan City 17729909400, 79799616160577875799, Agricultural Bank 6228482019200386470, 1422.64, 1271.67, 661526257466, Fengtai Guangyi Drugstore, Guangdong Kangaido Drugstore Chain Co., LTD. -316.03",
"key3": "Shu – wei twenty – bismuth citrate acid mind rafter and acid – ray mind, / for/butyl rubber/sac 0 /. / / * 2 / g / 1/4 / grain of mind / / box and █ 0.2 g * 14 / box █ box █ 17 █ █ █ 2 █ 4.01 23.58 11.79 del Leon – hin mind flat – sulfur and acid – sand – butyl – amine/alcohol/smoking/into / / agent/gas/fog – 1-0-0 – mu/g / * – 2-0-0 / press / █ 100 mu g * 200 press █ box █ 17 █ █ █ 10 █ del 20.84 to 122.58-12.258 ridge – even mind flowers and clear mind blast/rubber/sac / 0 /. / / * – 2-3-5 – g – 0.35 g * 24 █ 4 – grain █ box █ 17 █ █ █ 10 █ del 15.79 times 92.88-9.288 – resistant force sixty male/with/injection/agent/(/ new/bales / / / – 1-5 of 15 ml/m/l / █ █ box █ 15 █ 0 0 █ █ 7.30066667 109.51 █ del Dan mind mei twenty left and acetylene mind’s pregnant – ketone/im/solution / / 5 / m / 1 /. / / * / g / 1-1.5 mg / █ * 1 pills █ box █ 0 █ █ 14.8232 741.16 █ del – born – yuan (us $50 █ 0 -s/raw/yuan (us $/ / raw/bacteria/impact/agent/(/ son/child / -) – 5 / g / 1 /. / / * – 2-6 – bags – █ 1.5 g * 26 bag █ box █ 17 █ █ █ 1 █ 20.81 122.39 122.39 del – resistant’s sixty male/with/injection/agent/(/ new/bales / / / Mind 1 and 0 / m/l / █ 10 ml █ box █ 17 █ █ 28.506 142.53 █ 5 █ del 9-9-9-24.23 – the feeling of taking mind spirit – grain – 1-0 – g / * / 9 / bag / █ 10 g * 9 bags █ box █ 17 █ █ █ 50 █ 65.29 384.04 7.6808 ",
"key4": " ",
"key5": "var fpxx=fpdm+'≡'+fphm+'≡'+swjgmc+'≡'+jsonData.key2+'≡'+yzmSj",
"key6": "var result={\"template\":0,\"fplx\":fplx,\"fpxx\":fpxx,\"hwxx\":hwxx,\"jmbz\":jmbz,\"sort\":jmsort}",
"key7": "fa367b974b0b9f01f4dbe9f48283ef7e",
"key8": "068c64196a8ae94ed7f7c9953254a7ac",
"key9": "31891b7b47c110f7f582b0864b55634c",
"key10": "WPX4IOUjwAHDBWJT9vaWaWR+A1LfXTDdCBsZ0FBUKv50T6s4XcNHy3uDuNARnktp",
"key11": "dc1de",
"key12": ""
}
What??? What on earth is this? Garbled text mixed in with correct data, and even some JS code...
Take your time, stay calm, and analyze it carefully. That brings us to the fourth step.
Step 4: Restore the obfuscated and encrypted data
By analyzing the page's JS, we found that the key3 data in the JSON is split into pieces and scrambled according to a specific rule each time.
In addition, the sorting rule for the scrambled data is itself encrypted with AES. In other words, we first have to decrypt the sorting rule, and then use it to put the scrambled data back in its original order. This is where the ability to execute JS from C# comes in handy.
As for how to execute JS from C#, source code for that is easy to find online.
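To illustrate the two stages (decrypt the sorting rule, then undo the shuffle), here is a self-contained JavaScript sketch. It is not the site's actual algorithm. The AES mode (aes-128-cbc), the key, the IV, and the rule format ("2_0_1" meaning that piece 0 of the scrambled data belongs at position 2, and so on) are all assumptions made up for the demonstration; the real values have to be read from the page's JS.

```javascript
const crypto = require("crypto");

// ASSUMPTION: hypothetical key and IV; the real ones come from the page's JS.
const KEY = Buffer.from("0123456789abcdef", "utf8"); // 16 bytes -> AES-128
const IV = Buffer.alloc(16, 0);

// Stage 1: decrypt the base64-encoded sorting rule.
function decryptRule(cipherBase64) {
  const decipher = crypto.createDecipheriv("aes-128-cbc", KEY, IV);
  return Buffer.concat([
    decipher.update(Buffer.from(cipherBase64, "base64")),
    decipher.final(),
  ]).toString("utf8");
}

// Stage 2: undo the shuffle.
// ASSUMPTION: rule "2_0_1" means shuffled[0] goes to index 2, shuffled[1] to 0, ...
function restoreOrder(shuffled, rule) {
  const positions = rule.split("_").map(Number);
  if (positions.length !== shuffled.length) {
    throw new Error("rule length does not match data length");
  }
  const restored = new Array(shuffled.length);
  positions.forEach((originalIndex, i) => {
    restored[originalIndex] = shuffled[i];
  });
  return restored;
}

// Demo only: encrypt a rule ourselves so there is something to decrypt.
function encryptRule(rule) {
  const cipher = crypto.createCipheriv("aes-128-cbc", KEY, IV);
  return Buffer.concat([cipher.update(rule, "utf8"), cipher.final()]).toString("base64");
}

const encrypted = encryptRule("2_0_1");
const rule = decryptRule(encrypted);
const restored = restoreOrder(["unit", "name", "spec"], rule);
console.log(rule);     // 2_0_1
console.log(restored); // [ 'name', 'spec', 'unit' ]
```

The order of operations is the point here: nothing in the scrambled data can be trusted until the rule has been decrypted, which is why the decryption has to come first.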
In the figure above, the extracted JS is executed both to restore the order of the data and to decrypt it. It is worth noting that the remarks field is encrypted separately and must be decrypted with care using the JS.
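The idea of running a statement taken from the response can be sketched with Node's built-in vm module. This is only an illustration of the principle, not the C# setup I used: the runSnippet helper is my own name, the statement is assumed to be a plain assignment that joins fields with '≡' (as in key5), and all the input values are hypothetical.

```javascript
const vm = require("vm");

// Run a JS statement taken from the response with the variables it expects,
// then return the value of one variable it defines.
// Note: vm is not a security sandbox; only run code you have inspected.
function runSnippet(code, variables, resultName) {
  const context = vm.createContext({ ...variables });
  // The value of the last expression in the script is returned.
  return vm.runInContext(`${code};\n${resultName}`, context);
}

// ASSUMPTION: a statement of this form, with hypothetical input values.
const snippet = "var fpxx=fpdm+'≡'+fphm+'≡'+swjgmc+'≡'+jsonData.key2+'≡'+yzmSj";
const fpxx = runSnippet(
  snippet,
  {
    fpdm: "1234567890",
    fphm: "00000001",
    swjgmc: "demo",
    jsonData: { key2: "field-a" },
    yzmSj: "2017-10-20 10:38:54",
  },
  "fpxx"
);
console.log(fpxx.split("≡").length); // 5
```

Executing the site's own statements this way means the crawler does not have to reimplement whatever joining or decrypting logic the page changes between requests.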
Finally, decrypt the data, restore its order, and clean it up:
The final result looks like this:
Done.