1. A simple test
In class, it took several attempts to find the simple interface behind Baidu Translate. In fact, we can find it right away by opening the XHR filter in the Network panel.
Step 1: Open fanyi.baidu.com in the browser, press F12, and select Network > XHR.
Step 2: Type "job". The server is requested continuously as you type: one request for "j", one for "jo", and one for "job", as shown in the following figure:
Step 3: Click the third request to see its URL, request method, and request data.
Step 4: Let's start writing the crawler.
Now that we have seen how easy it is to get the parameters this interface needs to return translation results, we write a crawler using the translation of "job" as an example.
import requests
import json


def baidu_fanyi(kw):
    """
    :param kw: the text to translate
    :return: None
    """
    # target URL
    url = "https://fanyi.baidu.com/sug"
    data = {"kw": kw}
    # send the request
    response = requests.post(url, data=data)
    json_data = response.json()
    # print(json_data)
    data_list = json_data["data"]
    for dict_data in data_list:
        print(dict_data)


if __name__ == '__main__':
    baidu_fanyi("job")
The results are as follows:
{'k': 'job', 'v': 'n. Professional; a job; Responsibilities; Work done as a unit; Valet business vi. '}
{'k': 'Jobo', 'v': '[name] [of Ecuador] warhol'}
{'k': 'jobs', 'v': 'n. Professional; A piece of work (plural noun); Responsibilities; (as a processing unit) homework'}
{'k': 'Jobs', 'v': '[name] Steve Jobs'}
{'k': 'Jobs', 'v': 'abbr. Job Opportunities and Basic Skills'}
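The request and response seen in the Network panel can also be explored offline. The following is a minimal sketch using only the standard library: it builds, but does not send, the POST request for each prefix typed, and turns a response of the shape printed above into a dictionary. The helper names (build_sug_request, typed_prefixes, parse_sug) and the sample response are illustrative assumptions, not part of the original crawler.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SUG_URL = "https://fanyi.baidu.com/sug"  # endpoint used by the crawler above


def build_sug_request(kw):
    """Build (but do not send) the POST request the page issues for `kw`."""
    body = urlencode({"kw": kw}).encode("utf-8")
    return Request(SUG_URL, data=body, method="POST")


def typed_prefixes(word):
    """Return the successive inputs that are requested while `word` is typed."""
    return [word[:i] for i in range(1, len(word) + 1)]


def parse_sug(json_data):
    """Map each suggested word ('k') to its definition ('v')."""
    return {item["k"]: item["v"] for item in json_data.get("data", [])}


if __name__ == "__main__":
    for prefix in typed_prefixes("job"):
        req = build_sug_request(prefix)
        print(req.get_method(), req.full_url, req.data)

    # Hypothetical response body with the same shape as the output above.
    sample = json.loads('{"data": [{"k": "job", "v": "n. a job"}]}')
    print(parse_sug(sample))
```

Returning a dictionary instead of printing each entry makes the result easier to reuse in later steps.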
2. Cracking the parameters of the encrypted translation interface
Did the example above feel too simple? Don't worry! The request parameters of the interface below are much harder to work out.
The most difficult parameters in this interface are sign and token.
1. Crack sign
Step 1: Press Ctrl+F to open the search box, search for "sign", and find the file index_e36080d.js in which sign appears.
Step 2: Since it is a JS file, we open index_e36080d.js in Sources.
Step 3: We paste index_e36080d.js into PyCharm and search for sign.
1. At first, the search turns up a lot of irrelevant matches.
2. As we keep searching, we find sign: f(n).
3. Hold Ctrl and left-click f to jump to the place shown below.
4. There we see f = t("translation:widget/translate/input/pGrab"). Evidently translation:widget/translate/input/pGrab is defined as a kind of module path, and the t function presumably loads the method registered under that path. We then search for translation:widget/translate/input/pGrab in PyCharm and find the function.
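When the editor search returns too many irrelevant hits, a few lines of Python can narrow them down. The sketch below lists only whole-word matches of a keyword together with some surrounding context. The helper name find_keyword and the sample string are assumptions for illustration; in practice the input would be the text of the downloaded index_e36080d.js.

```python
import re


def find_keyword(js_source, keyword, context=30):
    """Return (offset, snippet) for each whole-word match of `keyword`."""
    hits = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, js_source):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(js_source))
        hits.append((match.start(), js_source[start:end]))
    return hits


if __name__ == "__main__":
    # Hypothetical fragment standing in for the real JS file.
    sample = "var design = 1; var data = {query: n, sign: f(n), token: x};"
    for offset, snippet in find_keyword(sample, "sign"):
        print(offset, snippet)
```

The word boundaries skip identifiers such as design that merely contain the keyword.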
Step 4: We execute this JavaScript function.
Since many students do not have a Node.js environment, today we will use the PyExecJS package to execute the JS code.
1. Install the PyExecJS package (pip install PyExecJS).
2. Test the function we found.
(1) Create a sign.js file and paste the found JS code into it.
function n(r, o) {
    for (var t = 0; t < o.length - 2; t += 3) {
        var a = o.charAt(t + 2);
        a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
        a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
        r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
    }
    return r
}

function e(r) {
    var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
    if (null === o) {
        var t = r.length;
        t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
    } else {
        for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
            "" !== e[C] && f.push.apply(f, a(e[C].split(""))),
            C !== h - 1 && f.push(o[C]);
        var g = f.length;
        g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
    }
    var u = void 0,
        l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
    u = null !== i ? i : (i = window[l] || "") || "";
    for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
        var A = r.charCodeAt(v);
        128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
        S[c++] = A >> 18 | 240,
        S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
        S[c++] = A >> 6 & 63 | 128),
        S[c++] = 63 & A | 128)
    }
    for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
        p += S[b],
        p = n(p, F);
    return p = n(p, D),
    p ^= s,
    0 > p && (p = (2147483647 & p) + 2147483648),
    p %= 1e6,
    p.toString() + "." + (p ^ m)
}
(2) Create a new param_find.py and start testing the JS code. We do not yet know whether function e or function n generates sign, but we can find out in Sources.
A. Set a breakpoint at sign: f(n), then refresh the page.
B. Hover the mouse pointer over f and a function tooltip appears.
C. Click it to jump to function e, then click the debug button to start stepping through.
D. It is easy to see that function e generates sign. Its argument shows up as "job", so the argument is the text to translate. With that settled, we can write Python code to execute the JS function.
import execjs


def read_js(path):
    # return the JS source as a single string
    with open(path, "r") as f:
        return f.read()


# compile sign.js and call its function e with the text to translate
res = execjs.compile(read_js("sign.js")).call("e", "job")
print(res)
Running this reports that i is not defined, so we need to supply the value 320305.131321201 for i.
E. Debugging the JS code shows that i should be 320305.131321201. We find i = window[l], and l, built on the previous line, is clearly the fixed string "gtk".
Note that the value of i does not change when other words are translated.
F. We add this line of code to sign.js
var i = "320305.131321201";
G. Run param_find.py again to get 231901.486124, which matches the sign value 231901.486124 seen in the browser request.
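As an alternative to running the JS through PyExecJS, the logic of functions n and e can be ported to Python. The sketch below is a hand-written port, not code from the site: it decodes the String.fromCharCode sequences into their literal strings, emulates JavaScript's 32-bit integer operators, and takes the gtk value found above as a default argument. The function names are illustrative, and the port assumes the input contains no unpaired surrogate characters.

```python
def from_char_codes(*codes):
    """Python counterpart of chained String.fromCharCode(...) calls."""
    return "".join(chr(c) for c in codes)


GTK_KEY = from_char_codes(103, 116, 107)                      # "gtk"
OPS_F = from_char_codes(43, 45, 97, 94, 43, 54)               # "+-a^+6"
OPS_D = from_char_codes(43, 45, 51, 94, 43, 98, 43, 45, 102)  # "+-3^+b+-f"


def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, like JS bitwise operators."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def mix(r, ops):
    """Port of the JS function n: apply the 3-character operations in `ops`."""
    for t in range(0, len(ops) - 2, 3):
        c = ops[t + 2]
        shift = ord(c) - 87 if c >= "a" else int(c)
        if ops[t + 1] == "+":
            a = (r & 0xFFFFFFFF) >> shift   # JS r >>> shift
        else:
            a = to_int32(r << shift)        # JS r << shift
        if ops[t] == "+":
            r = to_int32(r + a)             # JS r + a & 4294967295
        else:
            r = to_int32(r ^ a)             # JS r ^ a
    return r


def baidu_sign(text, gtk="320305.131321201"):
    """Port of the JS function e: compute sign for `text`."""
    if len(text) > 30:
        mid = len(text) // 2
        text = text[:10] + text[mid - 5:mid + 5] + text[-10:]
    m_str, s_str = gtk.split(".")
    m, s = int(m_str), int(s_str)
    p = m
    for byte in text.encode("utf-8"):
        p = mix(p + byte, OPS_F)
    p = mix(p, OPS_D)
    p = to_int32(p ^ s)
    if p < 0:
        p = (p & 0x7FFFFFFF) + 0x80000000
    p %= 1000000
    return f"{p}.{to_int32(p ^ m)}"


if __name__ == "__main__":
    print(GTK_KEY, OPS_F, OPS_D)
    print(baidu_sign("job"))
```

For "job" with the gtk value above, this should print the same 231901.486124 that param_find.py produces. A pure-Python port removes the dependency on a JS runtime, at the cost of having to be updated by hand if the site changes its JS.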
2. Crack the token
Following the same approach as above, we first search for token in the Network panel and work from there.
Step 1: Find which files contain token.
If the token is empty, it is because the Baidu server did not receive the BAIDUID cookie on the first visit to the site, so verification by the translation interface fails. Refresh the page to obtain a valid token.
1. Get the file that contains the token. It turns out to be the HTML page itself, which we can fetch by requesting fanyi.baidu.com/.
2. Extract the token value from the inline JS in that page with the re module.
The specific code is as follows:
import re
import requests

url = "https://fanyi.baidu.com"
headers = {
    # Cookie header copied from the browser (shortened here; paste the full
    # Cookie header from your own session)
    "Cookie": "BIDUPSID=EE1FBAB64E978CA7E15A21204784E059; PSTM=1574385249; BAIDUID=8FE349D493E4028413DDDC33C39D13B2:FG=1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
}
html = requests.get(url, headers=headers).text
# print(html)
# collect the contents of every inline <script> block
res = re.findall("<script>(.*?)</script>", html, re.S)
print(res[2])
# the token sits between "token:" and "systime:" in the third script block
res1 = re.findall("token:(.*)systime:", res[2], re.S)
token = res1[0].strip().strip("'").strip("',")
print(token)
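The empty-token case described earlier can be handled by fetching the page again when the first response carries no token. The following is a minimal sketch under stated assumptions: the page embeds the token as token: '...' in single quotes, and the caller passes in a function that returns the page HTML. The names extract_token and fetch_token and the sample pages are hypothetical.

```python
import re

# Assumes the page embeds the token as  token: '<value>'
TOKEN_RE = re.compile(r"token:\s*'([^']*)'")


def extract_token(html):
    """Return the token embedded in the page, or '' if none is found."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else ""


def fetch_token(get_html, max_tries=2):
    """Call get_html() until the returned page carries a non-empty token."""
    for _ in range(max_tries):
        token = extract_token(get_html())
        if token:
            return token
    raise RuntimeError("token is still empty; check that the BAIDUID cookie is sent")


if __name__ == "__main__":
    # Hypothetical pages: the first visit yields an empty token, the second a real one.
    pages = iter([
        "<script>window.common = {token: '', systime: '1'}</script>",
        "<script>window.common = {token: 'abc123', systime: '2'}</script>",
    ])
    print(fetch_token(lambda: next(pages)))
    # With requests, a Session keeps the cookies set by the first response:
    #   session = requests.Session()
    #   fetch_token(lambda: session.get("https://fanyi.baidu.com").text)
```

Passing the fetch function in as an argument keeps the retry logic testable without touching the network.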
At this point we have cracked Baidu Translate's encrypted request parameters.
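With sign and token in hand, the remaining work is to put them into the form data of the translation request. The sketch below only assembles and encodes the form; it does not send anything. The endpoint URL and the field names from, to, and query are assumptions that should be checked against the request shown in the Network panel, and the token value is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm it in the Network panel before relying on it.
TRANSLATE_URL = "https://fanyi.baidu.com/v2transapi"


def build_translate_form(query, sign, token, src="en", dst="zh"):
    """Assemble the form fields for one translation request."""
    return {
        "from": src,     # assumed field name
        "to": dst,       # assumed field name
        "query": query,  # assumed field name
        "sign": sign,
        "token": token,
    }


if __name__ == "__main__":
    # "abc123" is a placeholder, not a real token.
    form = build_translate_form("job", "231901.486124", "abc123")
    print(TRANSLATE_URL)
    print(urlencode(form))
```

The same headers used to fetch the token, including the Cookie, would need to accompany the real request.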
3. Processing the JSON data
Parse the JSON data to get the results we need.
The processing method is as follows:
def processing_json(self):
    """Read the saved JSON file and print the results we need."""
    with open(f"{self.query}.json", "r", encoding="utf-8") as f:
        res = f.read()
    json_data = json.loads(res)
    # print(json_data)
    # print each synonym group's label and its first synonym
    for _json in json_data["dict_result"]["synonym"][0]["synonyms"]:
        print(_json["bx"])
        res = _json["syn"].get("d")[0]
        if isinstance(res, dict):
            print(res["text"])
        else:
            print(res)
    # join the example-sentence tags into one string
    explain = json_data["liju_result"]["tag"]
    explains = "-".join(explain)
    print(explains)
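The same traversal can be written as plain functions that return values instead of printing them, which makes it easy to try on a small sample. The following sketch follows the key names used in the method above; the function names and the sample data are made up for illustration and are not real output from the interface.

```python
def first_synonym_texts(json_data):
    """Collect (label, first synonym) pairs from dict_result -> synonym."""
    pairs = []
    groups = json_data["dict_result"]["synonym"][0]["synonyms"]
    for group in groups:
        first = group["syn"].get("d")[0]
        # an entry is either a dict with a "text" field or a plain string
        text = first["text"] if isinstance(first, dict) else first
        pairs.append((group["bx"], text))
    return pairs


def joined_tags(json_data, sep="-"):
    """Join the liju_result tags into one string."""
    return sep.join(json_data["liju_result"]["tag"])


if __name__ == "__main__":
    # Hypothetical data with the same key layout as the saved JSON file.
    sample = {
        "dict_result": {"synonym": [{"synonyms": [
            {"bx": "label A", "syn": {"d": [{"text": "work"}]}},
            {"bx": "label B", "syn": {"d": ["task"]}},
        ]}]},
        "liju_result": {"tag": ["tag1", "tag2"]},
    }
    print(first_synonym_texts(sample))
    print(joined_tags(sample))
```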