Recently, our chief temp worker did something very interesting: he wrote his own crawler to scrape the Ricequant online community data and ran a series of analyses on it. Reading along, you can also get a first look at how a Python crawler works.

The original post follows.

Recently, I finished reading *Python for Data Analysis* and was frustrated that I had nowhere to apply these newly learned dragon-slaying skills, so I set my sights on the Ricequant community: the open community data is sitting right there. It took three days to crawl, store, clean and organize the data, then analyze, summarize and visualize it, and finally write this article (don’t ask me why I’m so inefficient; I’ll explain later).

### Crawling and saving

Facing a site with no anti-crawler mechanism (though, to be fair, the community page source is not that easy to parse), I stumbled my way through scraping all the data with the Requests and BS4 libraries: the title, URL, posting time, last-edit time, number of comments, page views, likes, whether a backtest is shared, whether research is shared, and the number of times each share was cloned.

Here’s the idea. The community home page lists everything across 170+ pages. Requesting each listing page with Requests and then parsing it with BS4 (brute force where necessary; just eyeball the HTML yourself) gets most of the data. The posting time and edit time, however, require entering each individual topic post and digging them out there.

import requests
from bs4 import BeautifulSoup

list_tiezi_liulan_zan = []  # comment, view and like counts, collected in groups of three

r = requests.get(url + str(num))             # url is the listing-page base URL, num the page number
soup = BeautifulSoup(r.text, 'html.parser')  # parse the page
body = soup.body

# comment count, view count and like count all share this class, three per topic
human_readable_number = body.find_all('span', attrs={'class': 'human-readable-number'})
for item in human_readable_number:
    list_tiezi_liulan_zan.append(item.get_text())
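
The posting time and edit time mentioned above have to come from the individual topic pages. Below is a minimal sketch of that step, assuming a plain Requests + BeautifulSoup fetch; the `<time>` selector is a placeholder I made up, so inspect the real page markup before relying on it:

```python
import requests
from bs4 import BeautifulSoup

def fetch_post_times(post_url):
    """Fetch one topic page and pull out its posting and last-edit times.
    The <time> selector is hypothetical -- check the actual page source."""
    r = requests.get(post_url, timeout=10)
    soup = BeautifulSoup(r.text, 'html.parser')
    times = [t.get_text(strip=True) for t in soup.find_all('time')]
    post_time = times[0] if times else None
    edit_time = times[1] if len(times) > 1 else None
    return post_time, edit_time
```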

For storage I just used CSV format (3,389 rows × 10 columns). As for a proper database, I’ll get to that after I’ve learned a bit of MongoDB.

import csv

def save(file_name, data_list):
    # newline='' stops the csv module from inserting blank lines on Windows
    with open(file_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for data in data_list:
            writer.writerow(data)
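
Since the sketch above writes no header row, reading the file back for the later steps could look like this; the file name and column names are mine, chosen to match the ten fields listed earlier:

```python
import pandas as pd

columns = ['title', 'url', 'post_time', 'edit_time', 'comments',
           'views', 'likes', 'has_backtest', 'has_research', 'clones']

# parse_dates turns the two time columns into proper Timestamps up front
data = pd.read_csv('community.csv', names=columns,
                   parse_dates=['post_time', 'edit_time'])
```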

### Cleaning up

Cleaning and organizing the crawled data is another thankless task. Deleting useless rows, unifying data formats, and reprocessing fields are all brain-draining chores, but looking at the tidied-up data afterwards does bring a sense of accomplishment.
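
To make that concrete, here is roughly the kind of cleanup I mean, sketched in pandas with the hypothetical column names from above; the exact rules depend on what the raw crawl actually looks like:

```python
# Drop rows that failed to parse, plus exact duplicates from re-crawled pages
data = data.dropna(subset=['title', 'post_time']).drop_duplicates()

# Counts scraped as text like '1.2k' need converting to real numbers
def to_number(s):
    s = str(s).strip().lower()
    return int(float(s[:-1]) * 1000) if s.endswith('k') else int(float(s))

for col in ['comments', 'views', 'likes', 'clones']:
    data[col] = data[col].fillna(0).apply(to_number)

# The two yes/no columns become real booleans
for col in ['has_backtest', 'has_research']:
    data[col] = data[col].astype(str).str.lower().isin(['true', '1', 'yes'])
```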

### Analysis and summary

Apart from the DataFrame merging, which involves resample, groupby, and handling of the time index itself, it’s all routine work (though I still have to reach for the book every now and then, ho ho ho).
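
For reference, the merging and time-index part is roughly this shape; `listing_df` and `times_df` are hypothetical names for the listing-page data and the per-post times:

```python
import pandas as pd

# Merge the listing-page fields with the per-post times on the shared URL
df = pd.merge(listing_df, times_df, on='url', how='left')

# resample and friends want a DatetimeIndex
df['post_time'] = pd.to_datetime(df['post_time'])
posts = df.set_index('post_time').sort_index()

# e.g. average views for posts with and without a research share
print(posts.groupby('has_research')['views'].mean())
```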

### Data visualization

A total of 3,389 topics were collected, of which 19% include a strategy share and 10% a research share. Each thread was viewed 1,777 times on average (median 633) and drew 5.5 comments. Among the posts with a strategy or research share, each share was cloned 73.3 times on average (note that the clone counts in the chart below are averaged over the total number of topic posts instead).
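
Those figures fall out of a few one-liners once the data is clean; a sketch using the hypothetical columns above:

```python
print('strategy shares:', data['has_backtest'].mean())   # fraction of topics, ~19% here
print('research shares:', data['has_research'].mean())   # ~10%
print('views mean / median:', data['views'].mean(), data['views'].median())
print('comments per topic:', data['comments'].mean())

# Clones are only meaningful for topics that actually share something
shared = data[data['has_backtest'] | data['has_research']]
print('clones per share:', shared['clones'].mean())
```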

Two posts in the community have been viewed more than 100,000 times: “What do you want Ricequant to do?” and “[High yield, low drawdown][Stop loss][Sharpe ratio 4.0] Improved small-cap stock strategy”; every post in the top 50 has more than 15,000 views:

The top 50 posts are as follows:

As for clone counts, it has to be said that the community has plenty of gurus whose shares rack up hundreds or even thousands of clones. The top 3 are “[Futures] Research on cross-variety arbitrage of commodity futures — steady!”, “[High yield, low drawdown][Stop loss][Sharpe ratio 4.0] Improved small-cap stock strategy”, and “The Graham Number value method”. The ranking is shown below:
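
The rankings themselves are just a sort; for example, with the same hypothetical columns:

```python
# Top 50 posts by page views
top_views = data.nlargest(50, 'views')[['title', 'views']]

# Top 3 shares by clone count
top_clones = data.nlargest(3, 'clones')[['title', 'clones']]
print(top_clones.to_string(index=False))
```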


### Word cloud

I used WordCloud and jieba to segment the post titles into words; my skills aren’t good enough to make a really pretty picture, so this will have to do:

import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Stitch every post title into one long string
text = ''
for item in data['title']:
    text = text + item

# Full-mode segmentation, then join the words with spaces for WordCloud
wordlist_after_jieba = jieba.cut(text, cut_all=True)
wl_space_split = " ".join(wordlist_after_jieba)

wordcloud = WordCloud(font_path="WawaTC-Regular.otf", background_color="black",
                      width=2000, height=2000, margin=2).generate(wl_space_split)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()


Now take a look at community activity aggregated by week, where “comment” is the number of comments and “discussion” the number of topics:
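
A sketch of that weekly aggregation, assuming the `posts` frame with a DatetimeIndex set up earlier; note the comment series here counts comments by the week their topic was posted, since the crawl only has per-topic totals:

```python
import pandas as pd
import matplotlib.pyplot as plt

weekly = pd.DataFrame({
    'discussion': posts['title'].resample('W').count(),  # topics started each week
    'comment': posts['comments'].resample('W').sum(),    # comments on those topics
})

weekly.plot(figsize=(12, 5), title='Community activity per week')
plt.show()
```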





### Easter egg

Having crawled every post’s posting time, I also tallied the time distribution of posting in the community:
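
If that chart is an hour-of-day breakdown, it is just a group count on the hour of the posting timestamp, roughly:

```python
import matplotlib.pyplot as plt

# How many posts were started in each hour of the day
by_hour = data['post_time'].dt.hour.value_counts().sort_index()
by_hour.plot(kind='bar', title='Posts by hour of day')
plt.xlabel('hour of day')
plt.ylabel('number of posts')
plt.show()
```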

### Conclusion

Three days of work can be read through in about three minutes, which feels like a pretty poor return when you think about it. Still, from crawling and storing to cleaning, organizing, analyzing, and all the way to visualization, Python can handle it all, and it’s fun.