Dying to Survive is a Chinese comedy film directed by Wen Muye and starring Xu Zheng, Wang Chuanjun, Zhou Yiwei, Tan Zhuo, Zhang Yu and Yang Xinming. It was released on July 6, 2018.

Before its official release, large-scale preview screenings had already built up considerable buzz and a strong reputation. As of the early morning of July 9, the scores stood at: Douban 9.0, Maoyan 9.7, Taopiaopiao 9.5, and Mtime 8.8.

Why mention these sites? Because today’s analysis draws on more than 5,000 short comments collected from them, and working from real data is more convincing.



Put the numbers together and a pattern emerges: the five-star recommendation rate is strikingly high. The living conditions are real, the emotions are real, the dilemmas are real; even the heroine is a genuinely aging beauty, wrinkles and all. That authenticity is what creates the immersive experience. On the surface the film is about medicine; underneath, it is about life.

Medicine can cure disease, but it cannot cure life. The film faces the suffering and dignity of people at the bottom of Chinese society head-on, and it does not dodge the hard questions posed by the social system and commercial law. That willingness to dig into Chinese reality is the key to the film, and the core of its public resonance.



Sobering yet hopeful, the film has the potential to become the most talked-about release of 2018. That may be why it is the first film in 16 years to open at 9.0 on Douban.

Today we use those 5,000+ comments to analyze which regions, and which audiences, love the movie.

Cheng Yong starts out as a peddler of Indian “god oil,” scraping out a tolerable living. Then his father needs urgent surgery for a hemangioma, the hospital wants money, his ex-wife plans to take their son abroad, and the oil business barely covers the utility bills. He needs money at every turn.



A mysterious man, Lü Shouyi, seeks out Cheng Yong and asks him to help buy a drug from India. Lü has leukemia, which requires long-term treatment with an anti-cancer drug.



The legitimate drug, “Swiss Glenin,” is so expensive that ordinary people cannot afford a steady supply. India produces a generic, “Indian Glenin,” at one-twentieth the price, but in China the generic is banned: get caught smuggling it and you bear legal responsibility.

Driven by the enormous stakes, Si Hui, the pastor, and Yellow Hair appear in turn; a five-person drug-selling team comes together, and Cheng Yong becomes a “drug dealer.”

For the patients it is a chance to live. They present Cheng Yong with a banner and, from then on, call him the “god of medicine.”

Then the buying scheme runs into trouble: Zhang Changlin, a peddler of fake medicine, appears and threatens Cheng Yong. Afraid of being caught, Cheng Yong formally disbands the group.

Cheng Yong opens a factory. Lü Shouyi dies and Zhang Changlin goes on the run, pushing Cheng Yong through his first transformation: with many patients left without medicine, he goes back to India and rebuilds the team.



Zhang Changlin is caught when the police crack down on fake-drug dealers, and the police locate Cheng Yong’s den. Yellow Hair dies covering for Cheng Yong, completing his second transformation.

Cheng Yong keeps buying the Indian drug at a loss and sends his son abroad, until one night he is caught by the police mid-sale. He gets out three years later, and the world outside has changed.

The real-world significance of Dying to Survive is greater than the film itself. Many commenters sound proud; everyone shares the same dream, one in which Chinese films finally dare to tell the truth.

By the early morning of July 9, the cumulative box office had passed 1.3 billion yuan, with the film taking nearly 84% of that day’s box office.



Which regions contribute more to the box office?



As the animated map shows, the biggest contributors are still Beijing, Shanghai and Guangzhou, though second-tier cities also pull their weight at the box office.

Everyone fears old age, illness and death; everyone fears hardship; everyone has to do something to make a living; everyone yearns for truth and kindness… It is these moments, taken together, that keep the sensational “God of Medicine” from feeling detached from reality.

“Leader, I beg you, stop going after the ‘fake medicine.’ Don’t we who take it know whether it’s fake or not?”

“I took the legal medicine for three years. I lost my house and my family. Now we finally have cheap medicine, and you say it is ‘fake medicine.’ If we don’t take it, we’ll just die.”



Dying to Survive touches a pain point shared by everyone. Who can guarantee that you and your family will never fall ill?

When serious illness strikes, the medical bills, often tens of thousands of yuan or more, are simply beyond what ordinary people can afford. It is no exaggeration to say that one person’s illness can bring down an entire family.

Back to the technology: how we got the data

Douban has clamped down on crawling since October last year: only 500 comments per film are exposed, and IP addresses are rate-limited, roughly 40 requests per minute during the day and 60 per minute at night, before being blocked.
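As a side note, a crawler that wants to stay under such a budget can simply space out its requests. Here is a minimal sketch (my own illustration, not part of the original toolchain; fetch_throttled and its parameters are hypothetical):

import time

import requests

HEADER = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'}


def fetch_throttled(urls, per_minute=40):
    # Space requests evenly so the per-minute budget is never exceeded
    interval = 60.0 / per_minute
    pages = []
    for url in urls:
        pages.append(requests.get(url, headers=HEADER).text)
        time.sleep(interval)
    return pages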

import json
import random
import time
import urllib.request

import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0',
    'Connection': 'keep-alive',
}

cookies = ('v=3; iuuid=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; '
           'webp=true; ci=1%2C%E5%8C%97%E4%BA%AC; '
           '__guid=26581345.3954606544145667000.1530879049181.8303; '
           '_lxsdk_cuid=1646f808301c8-0a4e19f5421593-5d4e211f-100200-1646f808302c8; '
           '_lxsdk=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; '
           'monitor_count=1; _lxsdk_s=16472ee89ec-de2-f91-ed0%7C%7C5; '
           '__mta=189118996.1530879050545.1530936763555.1530937843742.18')

# Turn the raw cookie string into a dict for requests
cookie = {}
for line in cookies.split('; '):
    name, value = line.strip().split('=', 1)
    cookie[name] = value


def html_prase(url):
    return requests.get(url).content


for i in range(1, 100):
    print('Crawling page %s' % i)
    try:
        url = ('http://m.maoyan.com/mmdb/comments/movie/1200486.json'
               '?_v_=yes&offset=%s&' % (i * 15))
        print(url)
        # Pull a fresh proxy from a local proxy-pool service
        proxy = html_prase('http://172.17.0.29:5010/get/').decode('utf-8')
        html = requests.get(url=url, cookies=cookie, headers=header,
                            proxies={'http': 'http://{}'.format(proxy)}).content
        data = json.loads(html.decode('utf-8'))['cmts']
        for item in data:
            comment = item['content']
            date = item['time'].split(' ')[0]
            rate = item['score']
            city = item['cityName']
            img = item['avatarurl']
            print(date, rate, comment, city)
            with open('maoyan_08.txt', 'a', encoding='utf-8') as f:
                f.write(date + ',' + str(rate) + ',' + comment + ',' + city + '\n')
            # Save the commenter's avatar image as well
            if img:
                with open('C:\\Users\\My\\Desktop\\yaoshen\\img\\' + img.split('/')[-1], 'wb') as pic:
                    pic.write(urllib.request.urlopen(img).read())
    except Exception:
        continue
    time.sleep(5 + float(random.randint(1, 100)) / 20)
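For reference, the fields the loop reads imply a response shaped roughly like the fragment below. This is inferred from the code, not an official Maoyan schema, and the values are invented:

# Illustrative only: inferred from the fields accessed above
sample_response = {
    "cmts": [
        {
            "content": "很感人的一部电影",                # the comment text
            "time": "2018-07-08 21:15:32",                # the date is the part before the space
            "score": 5,                                   # star rating
            "cityName": "北京",                           # used for the regional analysis
            "avatarurl": "http://example.com/avatar.jpg"  # hypothetical URL
        }
    ]
}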

Another route: AnyProxy + JS + Python + MonkeyRunner. With this combination you can crawl static websites, app traffic, and dynamic sites whose data is rendered by JavaScript.



For installation and usage, refer to:

Official Github: github.com/alibaba/any…

JS code:

var logMap = {};
var fs = require('fs');
var iconv = require('iconv-lite');
var logger = fs.createWriteStream('./urlLog.log', {
  flags: 'a' // 'a' means appending (old data will be preserved)
});

function logPageFile(url) {
  if (!logMap[url]) {
    logMap[url] = true;
    logger.write(url + '\r\n');
  }
}

function postData(post_data, path, cb) {
  // // Build the post string from an object
  // var post_data = JSON.stringify({
  //   'data': data
  // });
  // An object of options to indicate where to post to
  var post_options = {
    host: '127.0.0.1',
    port: '9999',
    path: '/' + path,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(post_data)
    }
  };
  var http = require('http');
  // Set up the request
  var post_req = http.request(post_options, function (res) {
    res.setEncoding('utf8');
    res.on('data', cb);
  });
  logger.write('request post data 1\r\n');
  // post the data
  post_req.write(post_data);
  logger.write('request post data 2\r\n');
  post_req.end();
}

module.exports = {
  summary: 'a rule to modify response',
  * beforeSendResponse(requestDetail, responseDetail) {
    if (/movie\/1200486/i.test(requestDetail.url)) {
      logger.write('matched: ' + requestDetail.url + '\r\n');
      if (responseDetail.response.toString() !== "") {
        logger.write(responseDetail.response.body.toString());
        var post_data = JSON.stringify({
          'url': requestDetail.url,
          'body': responseDetail.response.body.toString()
        });
        logger.write("post comment to server -- ext");
        postData(post_data, 'douban_comment', function (chunk) {});
      }
    }
  }
};
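In short: the rule inspects every response passing through the proxy, and whenever the URL matches movie/1200486 (the film’s Maoyan ID) it POSTs the response body to the local service at 127.0.0.1:9999/douban_comment, which is exactly the endpoint exposed by the Python service below.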

Load JS code using AnyProxy:

anyproxy -i --rule wxrule.js
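Here -i asks AnyProxy to intercept (decrypt) HTTPS traffic, which requires its root certificate to be installed and trusted on the device, and --rule loads the rule module above (saved here as wxrule.js).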

The service code:

#!/usr/bin/env python3
import asyncio
import json
import re
import textwrap
import threading
import time
from subprocess import call

import requests
from lxml import html
from aiohttp.web import Application, Response, StreamResponse, run_app

from mysqlmgr import MysqlMgr      # local helper module
from mongomgr import MongoManager  # local helper module

STATE_RUNNING = 1
STATE_IN_TRANSACTION = 2

running_state = 0
run_swipe = True
last_history_time = time.clock()

# mysql_mgr, mongo_mgr and bizs are assumed to be initialized elsewhere,
# via the local helper modules imported above


# A thread to save data to the database in the background
def insert_to_database(biz, msglist):
    try:
        for msg in msglist:
            print(biz)
            print(msg['comm_msg_info']['id'])
            mongo_mgr.enqueue_data(msg['comm_msg_info']['id'], biz, msg)
    except Exception as e:
        print(e)


def save_data(biz, msglist_str):
    save_thread = threading.Thread(target=insert_to_database, args=(biz, msglist_str,))
    save_thread.setDaemon(True)
    save_thread.start()


def swipe_for_next_page():
    # Keep swiping the phone screen via adb so new content loads
    while run_swipe:
        time.sleep(5)
        if time.clock() - last_history_time > 120:
            if running_state == STATE_RUNNING:
                reenter()
            continue
        call(["adb", "shell", "input", "swipe", "400", "1000", "400", "200"])


def reenter():
    call(["adb", "shell", "input", "swipe", "0", "400", "400", "400"])
    time.sleep(2)
    # Tap "enter history messages"; the x/y coordinates differ from phone to phone
    call(["adb", "shell", "input", "tap", "200", "100"])
    time.sleep(2)


header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0',
          'Connection': 'keep-alive'}


def html_prase(url):
    r = requests.get(url, headers=header).content
    return html.fromstring(r)


async def report_url(request):
    resp = StreamResponse()
    data = await request.json()
    url = data['url']
    biz = re.findall(r'__biz=(.*?)\&', url)
    if len(biz) == 0:
        await resp.prepare(request)
        return resp
    biz = biz[0]
    print('----------------\r\n' + biz + '\r\n----------------\r\n')
    mysql_mgr.enqueue_biz(biz, '')
    bizs.add(biz)
    biz = biz.encode('utf8')
    resp.content_type = 'text/plain'
    await resp.prepare(request)
    resp.write(biz)
    await resp.write_eof()
    return resp


async def intro(request):
    txt = textwrap.dedent("""\
        Type {url}/hello/John {url}/simple or {url}/change_body
        in browser url bar
    """).format(url='127.0.0.1:8080')
    binary = txt.encode('utf8')
    resp = StreamResponse()
    resp.content_length = len(binary)
    resp.content_type = 'text/plain'
    await resp.prepare(request)
    resp.write(binary)
    return resp


async def simple(request):
    return Response(text="Simple answer")


async def change_body(request):
    resp = Response()
    resp.body = b"Body changed"
    resp.content_type = 'text/plain'
    return resp


async def app_douban_comment(request):
    resp = StreamResponse()
    data = await request.json()
    global running_state
    global last_history_time
    msg_data = json.loads(data['body'])['data']['cts']
    for item in msg_data:
        comment = item['ce'].strip().replace('\n', '')
        rate = item['cr']
        print(comment, rate)
        with open('date_rate_comment_sg.txt', 'a', encoding='utf-8') as f:
            f.write('2018-07-06' + ',' + str(rate) + ',' + comment + '\n')
    last_history_time = time.clock()
    resp.content_type = 'text/plain'
    await resp.prepare(request)
    await resp.write_eof()
    return resp


async def init(loop):
    app = Application()
    app.router.add_get('/', intro)
    app.router.add_post('/url', report_url)
    app.router.add_post('/douban_comment', app_douban_comment)
    return app


def start_swipe_thread():
    try:
        # Daemon thread so the main thread can exit on Ctrl-C
        t = threading.Thread(target=swipe_for_next_page, name='swipe')
        t.setDaemon(True)
        t.start()
    except Exception:
        print("Error: unable to start thread")


loop = asyncio.get_event_loop()
app = loop.run_until_complete(init(loop))
run_app(app, host='127.0.0.1', port=9999)
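To sanity-check the service without the proxy in the loop, you can POST a fabricated payload by hand. A minimal sketch (my own; the field names mirror what app_douban_comment reads, and the values are made up):

import json

import requests

# The AnyProxy rule sends an {'url', 'body'} envelope; 'body' holds
# JSON whose data.cts items carry 'ce' (comment) and 'cr' (rating).
fake_body = json.dumps({'data': {'cts': [{'ce': '测试评论', 'cr': 5}]}})
payload = {'url': 'http://example.com/movie/1200486', 'body': fake_body}

resp = requests.post('http://127.0.0.1:9999/douban_comment', json=payload)
print(resp.status_code)  # 200 means a line was appended to date_rate_comment_sg.txt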

This is sample code; in actual use it will need fine-tuning. The hardest part of getting the Maoyan data was finding the data interface of the Maoyan app.

After a lot of digging, I found it:

m.maoyan.com/mmdb/commen…
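If you need to do the same, one practical approach is the one this article already uses for Douban: route the phone’s traffic through a capture proxy such as AnyProxy and watch for JSON responses while browsing comments in the app.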

For how the interface is used, look directly at the code; finding the Taopiaopiao data interface is left for you to try.

import json
import random
import time
import urllib.request

import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0',
    'Connection': 'keep-alive',
}

cookies = ('v=3; iuuid=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; '
           'webp=true; ci=1%2C%E5%8C%97%E4%BA%AC; '
           '__guid=26581345.3954606544145667000.1530879049181.8303; '
           '_lxsdk_cuid=1646f808301c8-0a4e19f5421593-5d4e211f-100200-1646f808302c8; '
           '_lxsdk=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; '
           'monitor_count=1; _lxsdk_s=16472ee89ec-de2-f91-ed0%7C%7C5; '
           '__mta=189118996.1530879050545.1530936763555.1530937843742.18')

cookie = {}
for line in cookies.split('; '):
    name, value = line.strip().split('=', 1)
    cookie[name] = value

for i in range(1, 100):
    print('Crawling page %s' % i)
    try:
        # The startTime parameter pins the query to a fixed moment,
        # so paging with offset walks back through the comments
        url = ('http://m.maoyan.com/mmdb/comments/movie/1200486.json?_v_=yes&offset=%s&' % (i * 15)
               + 'startTime=2018-07-01%2012%3A30%3A42')
        print(url)
        html = requests.get(url=url, cookies=cookie, headers=header).content
        data = json.loads(html.decode('utf-8'))['cmts']
        for item in data:
            comment = item['content']
            date = item['time'].split(' ')[0]
            rate = item['score']
            city = item['cityName']
            img = item['avatarurl']
            print(date, rate, comment, city)
            with open('maoyan_08.txt', 'a', encoding='utf-8') as f:
                f.write(date + ',' + str(rate) + ',' + comment + ',' + city + '\n')
            if img:
                with open('C:\\Users\\My\\Desktop\\yaoshen\\img\\' + img.split('/')[-1], 'wb') as pic:
                    pic.write(urllib.request.urlopen(img).read())
    except Exception:
        break
    time.sleep(5 + float(random.randint(1, 100)) / 20)

Dynamic map display code:

from pyecharts import Geo
from pyecharts import Style

# Collect the city column from the crawled comment file
# (lines are date,score,comment,city, so city is the 4th field)
city = []
with open('maoyan.txt', mode='r', encoding='utf-8') as f:
    rows = f.readlines()
for row in rows:
    if len(row.split(',')) == 4:
        city.append(row.split(',')[3].replace('\n', ''))


def all_list(arr):
    result = {}
    for i in set(arr):
        result[i] = arr.count(i)
    return result


data = []
for item in all_list(city):
    data.append((item, all_list(city)[item]))

style = Style(
    title_color="#fff",
    title_pos="center",
    width=1200,
    height=600,
    background_color='#404a59')

# The chart title string was lost in extraction; fill in as desired
geo = Geo("", **style.init_style)
attr, value = geo.cast(data)
geo.add("", attr, value, visual_range=[0, 100],
        visual_text_color="#fff", is_legend_show=False,
        symbol_size=20, is_visualmap=True,
        tooltip_formatter='{b}',
        label_emphasis_textsize=15,
        label_emphasis_pos='right')
geo.render()
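A small design note: all_list recounts the whole list once per distinct city, which is quadratic, and it is then called again for every key when building data. The standard library’s collections.Counter does the same tally in a single pass; a drop-in alternative:

from collections import Counter

# One pass over the list; .items() yields the same (city, count) pairs
data = list(Counter(city).items())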

Code for the daily comment volume:

from pyecharts import EffectScatter
from pyecharts import Style

style = Style(
    title_color="#191970",
    title_pos="left",
    width=900,
    height=450,
    background_color='#F8F8FF')

# The chart title string was lost in extraction; fill in as desired.
# x is the day of July, y is the number of comments collected that day.
es = EffectScatter("", **style.init_style)
es.add("", [1], [270], symbol_size=20, effect_scale=4, effect_period=5, symbol="pin")
es.add("", [2], [606], symbol_size=20, effect_scale=4, effect_period=5, symbol="pin")
es.add("", [3], [542], symbol_size=20, effect_scale=4, effect_period=5, symbol="pin")
es.add("", [4], [550], symbol_size=20, effect_scale=4, effect_period=5, symbol="pin")
es.add("", [5], [656], symbol_size=20, effect_scale=4, effect_period=5, symbol="pin")
es.add("", [6], [850], symbol_size=20, effect_scale=4, effect_period=5, symbol="pin")
es.add("", [7], [993], symbol_size=20, effect_scale=4, effect_period=5, symbol="pin")
es.add("", [8], [903], symbol_size=20, effect_scale=4, effect_period=5, symbol="pin")
es.render()

Code for the star-rating theme river:

from pyecharts import Style
from pyecharts import ThemeRiver

# [date, count, rating bucket]; the bucket labels were originally in Chinese
data = [
    ['2018/07/08', 802, 'five stars'], ['2018/07/08', 28, 'four stars'],
    ['2018/07/08', 9, 'three stars'], ['2018/07/08', 8, 'two stars'],
    ['2018/07/08', 4, 'one star'],
    ['2018/07/07', 802, 'five stars'], ['2018/07/07', 166, 'four stars'],
    ['2018/07/07', 17, 'three stars'], ['2018/07/07', 0, 'two stars'],
    ['2018/07/07', 8, 'one star'],
    ['2018/07/06', 667, 'five stars'], ['2018/07/06', 156, 'four stars'],
    ['2018/07/06', 13, 'three stars'], ['2018/07/06', 10, 'two stars'],
    ['2018/07/06', 4, 'one star'],
    ['2018/07/05', 567, 'five stars'], ['2018/07/05', 76, 'four stars'],
    ['2018/07/05', 13, 'three stars'], ['2018/07/05', 0, 'two stars'],
    ['2018/07/05', 0, 'one star'],
    ['2018/07/04', 467, 'five stars'], ['2018/07/04', 67, 'four stars'],
    ['2018/07/04', 16, 'three stars'], ['2018/07/04', 0, 'two stars'],
    ['2018/07/04', 0, 'one star'],
    ['2018/07/03', 478, 'five stars'], ['2018/07/03', 56, 'four stars'],
    ['2018/07/03', 8, 'three stars'], ['2018/07/03', 0, 'two stars'],
    ['2018/07/03', 0, 'one star'],
    ['2018/07/02', 531, 'five stars'], ['2018/07/02', 67, 'four stars'],
    ['2018/07/02', 8, 'three stars'], ['2018/07/02', 0, 'two stars'],
    ['2018/07/02', 0, 'one star'],
    ['2018/07/01', 213, 'five stars'], ['2018/07/01', 45, 'four stars'],
    ['2018/07/01', 5, 'three stars'], ['2018/07/01', 1, 'two stars'],
    ['2018/07/01', 1, 'one star'],
]

style = Style(
    title_color="#191970",
    title_pos="left",
    width=1200,
    height=600,
    background_color='#F8F8FF')

# The chart title string was lost in extraction; fill in as desired
tr = ThemeRiver("", **style.init_style)
tr.add(['five stars', 'four stars', 'three stars', 'two stars', 'one star'],
       data, is_label_show=True)
tr.render()
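The triples above are typed in by hand. Assuming the comment file written by the crawler (maoyan_08.txt, lines of date,score,comment,city), they could instead be derived, and summing across buckets per date also reproduces the daily totals plotted earlier. A sketch under those assumptions, with a hypothetical star_label helper that buckets by truncation:

from collections import Counter


def star_label(score):
    # Bucket a numeric Maoyan score (possibly fractional) into whole stars
    names = {5: 'five stars', 4: 'four stars', 3: 'three stars',
             2: 'two stars', 1: 'one star'}
    return names.get(int(float(score)), 'one star')


counts = Counter()
with open('maoyan_08.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split(',')
        if len(parts) < 4:
            continue
        # date and score are the first two fields, so commas inside the
        # comment text cannot misalign them
        date, score = parts[0], parts[1]
        counts[(date, star_label(score))] += 1

# ThemeRiver expects [date, count, series]; adjust the date format if needed
data = [[date, n, star] for (date, star), n in sorted(counts.items())]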

Word cloud:

import jieba
import matplotlib.pyplot as plt
from os import path
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


def make_worldcloud(file_path):
    text_from_file_with_apath = open(file_path, 'r', encoding='UTF-8').read()
    wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=False)
    wl_space_split = " ".join(wordlist_after_jieba)
    print(wl_space_split)
    backgroud_Image = plt.imread('./1.jpg')
    print('Loading the word cloud text...')
    stopwords = STOPWORDS.copy()
    # Filter out high-frequency but uninformative words (the comments are Chinese)
    stopwords.add("哈哈")  # "haha"
    stopwords.add("电影")  # "movie"
    stopwords.add("真实")  # "real"
    stopwords.add("中国")  # "China"
    stopwords.add("没有")  # "not have"
    stopwords.add("可以")  # "can"
    stopwords.add("一个")  # "a / one"
    wc = WordCloud(
        width=1024,
        height=768,
        background_color='white',    # background color
        mask=backgroud_Image,        # shape mask taken from the background image
        font_path='E:\\simsun.ttf',  # a Chinese font is required, or boxes appear
        max_words=600,               # maximum number of words to show
        stopwords=stopwords,         # the stopword set built above
        max_font_size=400,           # maximum font size
        random_state=50,             # number of random layout states
    )
    wc.generate_from_text(wl_space_split)              # load the text
    img_colors = ImageColorGenerator(backgroud_Image)
    wc.recolor(color_func=img_colors)                  # recolor to the image palette
    plt.imshow(wc)                                     # show the word cloud
    plt.axis('off')                                    # hide the x and y axes
    plt.show()
    d = path.dirname(__file__)
    wc.to_file(path.join(d, "h11.jpg"))                # save to file
    print('Word cloud generated successfully!')


make_worldcloud('cloud.txt')

Code for stitching the avatar images into a collage:

import os
from math import sqrt

from PIL import Image

path = 'C:\\Users\\My\\Desktop\\yaoshen\\img\\'
pathList = []
for item in os.listdir(path):
    imgPath = os.path.join(path, item)
    pathList.append(imgPath)

total = len(pathList)    # total number of avatar images
line = int(sqrt(total))  # side length of the square collage, in images
NewImage = Image.new('RGB', (128 * line, 128 * line))
x = y = 0
for item in pathList:
    try:
        img = Image.open(item)
        img = img.resize((128, 128), Image.ANTIALIAS)
        NewImage.paste(img, (x * 128, y * 128))
        x += 1
    except IOError:
        # Skip unreadable files without advancing the grid position
        print("row %d, column %d failed to read! IOError: %s" % (y, x, item))
    if x == line:
        x = 0
        y += 1
    if (x + line * y) == line * line:
        break
NewImage.save(path + "final.jpg")

This article comes from “Data THU”, a partner of the cloud community. Follow “Data THU” for more.