Preface: This article has two target readers. The first is someone who knows nothing about programming and wants to know what programming is all about. The second is someone who has programming experience but does not know Python, and who wants to learn Python or needs the features shown in this article. PS: Long text with many images; readers on mobile data, enter with care. PPS: To go straight to the programming part, read the first half of chapter 3 and everything from chapter 4 onward.





_(:з」∠)_ Hello everyone, it's a new day. Have you been learning programming on Jianshu today?

_(:з」∠)_ Have you been drooling over, or longing after, a few authors you have silently admired?

_(:з」∠)_ Have you ever felt so depressed on your way to writing and programming that you even began to doubt your life?

_(:з」∠)_ Hmm? Why am I lying down, you ask? Emmmm, I think this angle might inspire some amazing wild ideas (nao dong)…





It has been almost a month since I began writing on Jianshu. I have seen many well-known authors share their writing experience, but when it comes to my own writing, my pen behaves like epilepsy and my prose like paralysis (the truth is that both are paralyzed). What can I do? So I decided to use Python to crawl all the articles of my favorite authors and study them for writing tips. So excited!

Are a lot of people ready to close the page at the mere sight of programming? (Nonsense, a lot of people pass after just reading the title.) Judging from shared data obtained by crawling Jianshu, the site has about 2.35 million users, of whom nearly 440,000 follow the Programmer column. That is about 18% of all users: roughly two out of every eleven people follow programming content. So the share of Jianshu users who are programmers, or are interested in programming, is not small. But I wonder how many of them look at the experts in that column writing about their programming experience and feel daunted.

An expert blind spot is the situation where the more you know about a subject, the harder it is to remember what it was like not to know it.

As someone who tried C, Java, MATLAB, HTML, and CSS, and without exception went from getting started to giving up every time, I really understand beginners! When I first learned to code, my face was one big question mark. (⊙…⊙) Really? (((;꒪ꈊ꒪;))) What is this? _(:3」∠)_ What are you talking about?! The more professional the experts' writing was, the less I understood.





From long ago: “Painting: From Getting Started to Giving Up”. Please call me the soul painter.

So while learning Python myself, I want to show people who cannot program what programming is like, in simpler language. (Come on! Praise me without reservation!)

1. Basic concepts

A computer language is a language used for communication between people and computers. There are two common kinds of programming language nowadays: assembly languages and high-level languages. High-level languages are the choice of most programmers.

The programming languages discussed here are all high-level languages, of which there are many: Java, C, PHP, Ruby, Python…

The logic of programming languages is basically the same from one language to the next, just as a newspaper article, a diary, and a novel are all written for humans about human affairs (supernatural novels excepted O__O"…).

Each programming language has its own syntax, just as the same sentence, “I’m happy today”, is written differently in Chinese and in English.

Programming languages also have different areas of focus: Objective-C for iOS development, PHP for web development, and Python for crawlers (I made that up), just as you write ideological reports in a serious tone, papers in a rigorous tone, and Python tutorials in a funny tone (runs away).

2. Define your goals








No matter what you want to write, write first, try more and practice more, and gradually you will find the direction.

But then, before I started writing, I had already been speaking plain language for decades. Even if my writing is poor, I can write as long as I know the characters. With a programming language, starting from zero, I know nothing at all.

Masako Wakamiya, 82, from Japan, became the oldest developer at WWDC 2017. She taught herself to code for iOS because there were no apps for older people.

If you do not understand programming languages, the grammar is extremely boring (I wither), and what you learn is easily forgotten. So why was this grandmother able to learn? Because she wanted to build an app for older people: she had a goal, so she had motivation. You therefore need a clear goal too: what do you hope to achieve through programming?

Python is recommended first because its syntax is flexible and its language concise, which suits beginners. Second, because… of course it can do a lot of show-off (zhuang) interesting things! Every time you see other people playing with big data, charts, and analysis, can you honestly tell me you are not envious?!

Developing an app from scratch can take months, but going from zero Python to building a crawler can take only a few days. Tempted yet? And with the powerful open-source ecosystem of the Internet, writing large crawlers is just around the corner (wake up!).

Good, enough nonsense and daydreaming (wipes drool). If you have no subject to write about, I can't help you, but the crawler's target has been chosen: let's crawl the articles written by Jianshu's big-V authors.

3. Write an outline

Why write an outline before programming?

It is just like writing a novel: breaking it down into parts, chapters, and plot points builds a framework for the story, helps you tell it better, and keeps you from forgetting why you are writing as you go along (I admit I am that last kind).

When it comes to outlining, programming is a lot easier than writing a novel, because “code is fixed, while a novel is prose with tension” (to quote the Zhihu user Colorless Sugar). Once you know what the program should do, piling up the code is straightforward; but even if you know how a novel ends, it is still unclear how to write it brilliantly. Seen this way, isn't programming easier than writing a novel?





What a crawler is doing when it works


If you have never learned programming, which parts can you understand? The black text? The purple text? The red text? Or the blue?

  • Black text: the things we need to do to reach the goal of generating a PDF of an author's articles.
  • Purple text: what the crawler needs to do, expressed as mechanical steps.
  • Red text: the functions in the program; each function carries out one operation.
  • Blue text: function modules imported from outside.

Already going in circles?





Let’s put it another way.

I was going to write a novel of epic proportions, titled “The stunningly beautiful heroine, unable to bear her stepmother's spite, studies hard every day, is admitted to the Fantastic Beasts department of Tsinghua University, leaves her family, and, fearing no hardship, works part-time as a beast keeper while studying, finally graduating with honours and joining the Animal City Federation to work in the Mother of Dragons' Blue-Eyes White Dragon village, where, by unfortunate coincidence, the hero, a public figure of the magic community who carries a childhood shadow from his farming days and whom she worships, kindly invites her to visit his wonderful frog seed plot, near which she is hit by an alien spacecraft and carried off by aliens, so that the hero, having lost his memories, travels thousands of miles to rescue her but accidentally crosses into the Stone Age, where, with the help of the crazy primitive people, he is found by a fairy who, greatly moved by the story of the hero and heroine, drives away the aliens with a wave of her wand and pronounces the two of them brother and sister.”





When you hear the words “Fantastic Beasts”, “Blue eye white dragon” and “wonderful frog seed”, do they feel very familiar? Do you directly picture a plot that already exists in your memory? This is the equivalent of an external module in programming: the module already exists, I didn't write it, and it happens to have the functionality I need, so I import it into the code I'm writing and use its functionality directly.

Broken down carefully, the story roughly divides into: the heroine's family, the heroine at university, the heroine at work, the hero, how the hero and heroine meet, the alien accident, the thousand-mile rescue, and the reunion. The characters stand for the parameters in a program, and each chapter of the story stands for a function. Different functions do different jobs; they connect to one another to accomplish the program's task (like advancing the plot of a story) and pass shared parameters along (think of characters who run through the whole story).

There are also built-in functions, which come with the language itself, like the word for eating: “吃” in Chinese, “eat” in English, and “食べる” in Japanese.

A complete program is formed from built-in functions, module calls, and the functions you build yourself.
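To make the analogy concrete, here is a minimal Python 3 sketch (the article's own code is Python 2) that combines the three ingredients: a built-in function (`len`), a module imported from outside (`re`), and a function we write ourselves. The function name `count_words` and the sample sentence are made up for illustration.

```python
import re  # an external module that someone else already wrote


def count_words(text):
    """Our own function; 'text' is the parameter, like a character entering a chapter."""
    words = re.findall(r"[A-Za-z']+", text)  # functionality borrowed from the module
    return len(words)                        # len() is a built-in function


sentence = "Fantastic beasts and where to find them"
word_total = count_words(sentence)
print(word_total)
```

The same three layers show up in the crawler later on, only with larger modules and more functions.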

4. Build code

From this section on, a programming background is recommended. Of course, if you can follow along without one, I salute you as a true hero.





4.1 Open the page disguised as a browser

    import urllib2

    def getPage(url):
        # Pretend to be a desktop Chrome browser
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
        request = urllib2.Request(url=url, headers=headers)
        try:
            response = urllib2.urlopen(request, timeout=5)
            page = response.read().decode('utf-8')
            return page
        except (urllib2.URLError, Exception), e:
            if hasattr(e, 'reason'):
                print 'fetch failed, specific reason: ', e.reason
            # Try once more before giving up
            response = urllib2.urlopen(request, timeout=5)
            page = response.read().decode('utf-8')
            return page

The User-Agent is a disguise (ma jia, literally a “vest”): think of it as the crawler putting on a browser's vest. Wearing it, the crawler fetches the page content it needs just as we do when we open a browser and type in an address. This is a low-tech measure for getting past anti-crawler defenses; the war between crawlers and anti-crawlers is a story for another day.
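For readers on Python 3, where `urllib2` no longer exists, the same disguise can be sketched with `urllib.request`. This is only an illustration under assumptions: `build_request` is a hypothetical helper name, `https://example.com/` is a placeholder address, and the actual network call is left commented out.

```python
import urllib.request

# A browser-like User-Agent string, the same one used in getPage
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36")


def build_request(url):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("https://example.com/")
# urllib stores header names in capitalized form, hence "User-agent"
print(request.get_header("User-agent"))

# To actually fetch the page (needs network access):
# page = urllib.request.urlopen(request, timeout=5).read().decode("utf-8")
```

Without the header, `urllib` announces itself as a Python client, which is the easiest thing for a site to filter on.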





4.2 Analyzing the personal home page

Take one Jianshu author as an example: the home page of Huai Zuo is “www.jianshu.com/u/62478ec15…”. If you compare the home page addresses of different authors, only the string of letters and numbers following “www.jianshu.com/u/” changes.

A Jianshu personal home page contains the author's information and total article count, as well as the article titles. As you scroll down, more articles keep loading, and scrolling all the way down eventually reaches the first article. But if you right-click and view the page source, you only see the source for the articles that have already been loaded.

Open the browser's developer tools and click “Network” in the panel that pops up. Then keep scrolling down the page and watch the list in the lower-left corner change: a request whose link contains the word “page” appears. It is not necessarily under the XHR category, but it usually is.









www.jianshu.com/u/62478ec15…
www.jianshu.com/u/62478ec15…
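Once the paged links are known, the crawler can request them one by one instead of scrolling. The sketch below (Python 3) only shows the idea: the real links are truncated in this article, so `https://example.com/u/AUTHOR_ID` is a placeholder, the number of articles per page (`per_page`) is an assumed value, and the real request may carry more query parameters than `page`.

```python
import math
from urllib.parse import urlencode


def page_urls(home_url, article_count, per_page):
    """Build one URL per page of the article list."""
    pages = math.ceil(article_count / per_page)
    return ["{}?{}".format(home_url, urlencode({"page": n}))
            for n in range(1, pages + 1)]


# 244 is the article count shown on the sample home page; per_page is a guess
urls = page_urls("https://example.com/u/AUTHOR_ID", article_count=244, per_page=10)
print(len(urls), urls[0], urls[-1])
```

The article count needed here is the number read from the home page in the next section.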

4.3 Obtaining author information and article links

Excerpted personal page source code

    <div class="title">
      <a class="name" href="/u/62478ec15b74">…</a>
      <span class="author-tag"><i class="iconfont …"></i></span>
    </div>

The author’s name is under title.name, and the regular expression is:

pattern = re.compile(u'<a.*?class="name".*?href=".*?">(.*?)</a>')

Excerpted personal page source code

    <div class="info">
      <ul>
        <li>
          <div class="meta-block">
            <a href="/users/62478ec15b74/following">
              <p>17</p> following <i class="iconfont ic-arrow"></i>
            </a>
          </div>
        </li>
        <li>
          <div class="meta-block">
            <a href="/users/62478ec15b74/followers">
              <p>63600</p> followers <i class="iconfont ic-arrow"></i>
            </a>
          </div>
        </li>
        <li>
          <div class="meta-block">
            <a href="/u/62478ec15b74">
              <p>244</p> <i class="iconfont ic-arrow"></i>
            </a>
          </div>
        </li>
        …
      </ul>
    </div>

The href values that distinguish these numbers from one another contain the author's own ID, so a pattern built on them would not work for every author. Instead, collect every <p> value into a list and take element [2], which is the article count.

pattern = re.compile(u'<p>(.*?)</p>')
metablock = pattern.findall(page)
titleNum = int(metablock[2])
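The same extraction can be tried offline. In this Python 3 sketch, `sample_page` is a simplified stand-in for the excerpt above (labels and icons removed), not the real page source.

```python
import re

# Simplified stand-in for the home page excerpt
sample_page = """
<div class="meta-block"><a href="/users/62478ec15b74/following"><p>17</p></a></div>
<div class="meta-block"><a href="/users/62478ec15b74/followers"><p>63600</p></a></div>
<div class="meta-block"><a href="/u/62478ec15b74"><p>244</p></a></div>
"""

pattern = re.compile(r"<p>(.*?)</p>")
metablock = pattern.findall(sample_page)  # every <p> value, in page order
title_num = int(metablock[2])             # the third value is the article count
print(metablock, title_num)
```

Indexing by position is fragile: if the site adds another `<p>` before these three, `metablock[2]` silently becomes a different number.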

Excerpted personal page source code

    <a class="title" target="_blank" href="/p/c2a4a3b3490e">Through self-discipline, I had a well-deserved summer vacation.</a>

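Here /p/c2a4a3b3490e is the article's relative link; prefixing it with http://www.jianshu.com gives the full address. The regular expression below captures each article's publication time, link, and title: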

    pattern = re.compile(u'<span.*?class="time".*?data-shared-at="(.*?)\+08:00"></span>.*?'
                         + u'<a.*?class="title".*?href="(.*?)">(.*?)</a>', re.S)
    titles = re.findall(pattern, page)
    titlelist = []
    num = 1
    for title in titles:
        # Each entry: [publication time, full article URL, title]
        titlelist.append([title[0], 'http://www.jianshu.com' + title[1], title[2]])
        print 'article ' + str(num)
        num += 1
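Here is a small Python 3 run of the same pattern on a made-up snippet, to show what the three capture groups return. The timestamp and the title text in `sample_page` are invented; only the tag layout follows the excerpts above.

```python
import re

sample_page = """
<span class="time" data-shared-at="2017-08-20T10:00:00+08:00"></span>
<a class="title" target="_blank" href="/p/c2a4a3b3490e">Sample title</a>
"""

pattern = re.compile(
    r'<span.*?class="time".*?data-shared-at="(.*?)\+08:00"></span>.*?'
    r'<a.*?class="title".*?href="(.*?)">(.*?)</a>',
    re.S,  # let "." match newlines, since the two tags sit on different lines
)

title_list = []
for shared_at, href, title in pattern.findall(sample_page):
    title_list.append([shared_at, "http://www.jianshu.com" + href, title])

print(title_list)
```

The `re.S` flag matters: without it, `.*?` stops at the line break between the time tag and the title link and nothing is matched.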

4.4 Convert web pages to PDF

Huh, converting to PDF already? Didn't we only just get the article links? We haven't even read the articles yet, have we?

Hiahiahiahia ~ Time to introduce an imported module, an absolute gem! wkhtmltopdf (used here through the pdfkit Python package) converts web pages directly to PDF, and with the right options it can convert just a specified part of a page.

Of course, we still need to read the article content, but there is no need for a separate function to do it: the content is read directly inside the PDF function. This part follows the article “Python crawler: Convert Liao Xuefeng’s tutorial to a PDF ebook”.









BeautifulSoup

    import os
    import re

    import pdfkit
    import requests
    from bs4 import BeautifulSoup

    # articlelist, html_template, options and authorname are defined elsewhere in the script
    htmls = []
    for index, url in enumerate(articlelist):
        try:
            response = requests.get(url[1])
            soup = BeautifulSoup(response.content, "html.parser")
            title = soup.find('h1').get_text()
            body = soup.find_all(class_="show-content")[0]
            # Put the title, centered, at the top of the article body
            center_tag = soup.new_tag("center")
            title_tag = soup.new_tag('h1')
            title_tag.string = title
            center_tag.insert(1, title_tag)
            body.insert(1, center_tag)
            html = str(body)
            # Image links start with "//upload..."; add the scheme so they load
            pattern = "(<img.*?src=\")(//upload.*?)(\")"

            def func(m):
                if not m.group(2).startswith("http"):
                    rtn = "".join([m.group(1), "http:", m.group(2), m.group(3)])
                    return rtn
                else:
                    return "".join([m.group(1), m.group(2), m.group(3)])

            html = re.compile(pattern).sub(func, html)
            html = html_template.format(content=html)
            html = html.encode("utf-8")
            f_name = ".".join([str(index), "html"])
            with open(f_name, 'wb') as f:
                f.write(html)
            htmls.append(f_name)
        except Exception as e:
            print e

    try:
        pdfkit.from_file(htmls, authorname + ".pdf", options=options)
    except Exception as e:
        print e

    # Delete the temporary HTML files
    for html in htmls:
        os.remove(html)
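The image-link rewrite is the least obvious step in that code, so here it is in isolation as a Python 3 sketch. The host names in `html` are placeholders, and `add_scheme` is an illustrative name for what the code above calls `func`.

```python
import re

# Match <img ... src="//upload..."> and split it into prefix, URL, closing quote
IMG_SRC = re.compile(r'(<img.*?src=")(//upload.*?)(")')


def add_scheme(match):
    """Prefix a protocol-relative image URL ("//host/...") with "http:"."""
    return "".join([match.group(1), "http:", match.group(2), match.group(3)])


html = ('<p>text</p><img class="pic" src="//upload.example.com/a.png">'
        '<img src="http://other.example.com/b.png">')
fixed = IMG_SRC.sub(add_scheme, html)
print(fixed)
```

The rewrite is needed because the saved HTML is opened from a local file, where a link starting with `//` has no scheme to inherit and the image would not load.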





PDF effect (MAC)

The text is a bit small on the Mac QAQ. Let's look at Windows.





PDF effect (Windows)

(* ^ __ ^ *)

4.5 Generating word clouds

While writing the program, I had the extra idea of using a word cloud in Python to analyze the keywords of an author's articles, so I wrote a few more lines. A program with no client (Party A) means you can add whatever features you like (makes a V sign).

The idea is to read the Chinese text of each article from the links extracted in section 4.3, write it to a TXT file, and then use Python's third-party module jieba (the name means “stutter”? Stammer! Stam-mer… hahahahaha hic). This module splits large blocks of text into words, and it supports Chinese word segmentation! Don't ask me what segmentation is for: the computer never learned Chinese, and to it Chinese looks like @!#$%…&*.

    import os

    import jieba

    # Write the article text to a TXT file, read it back, then delete the file
    fileArticle = open(filePath, 'w')
    try:
        for article in articlelist:
            fileArticle.write(article[3])
    finally:
        fileArticle.close()
    text = open(filePath).read()
    os.remove(filePath)

    # Segment the text with jieba
    wordlist = jieba.cut(text, cut_all=False)
    wl = "/ ".join(wordlist)

    # Load the stop-word list, one word per line
    f_stop = open(stopwords_path)
    try:
        f_stop_text = f_stop.read()
        f_stop_text = unicode(f_stop_text, 'utf-8')
    finally:
        f_stop.close()
    f_stop_seg_list = f_stop_text.split('\n')

    # Drop stop words and single-character words
    mywordlist = []
    for myword in wl.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    text = ''.join(mywordlist)
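The filtering step can be tried without installing jieba. In this Python 3 sketch the list `segmented` stands in for the output of `jieba.cut`, with English tokens used for readability; both it and the `stop_words` set are made up for illustration.

```python
from collections import Counter

# Stand-in for jieba.cut(text): a list of already-segmented words (made up)
segmented = ["I", "like", "writing", "and", "reading", ",",
             "writing", "needs", "discipline", "a"]
stop_words = {"and", "needs"}  # hypothetical stop-word list


def keep(word):
    """Keep words that are not stop words and are longer than one character."""
    word = word.strip()
    return word not in stop_words and len(word) > 1


kept = [w for w in segmented if keep(w)]
frequencies = Counter(kept)
print(kept, frequencies.most_common(1))
```

A frequency table like `frequencies` is also the kind of input that wordcloud's `generate_from_frequencies` accepts, as an alternative to passing the whole text to `generate`.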

The word cloud is produced with Python's third-party wordcloud module, which can take a custom image and shape the word cloud to its outline, and can also load a custom font.

    import matplotlib.pyplot as plt
    from wordcloud import ImageColorGenerator

    # Generate the word cloud from the full text; alternatively, compute the
    # word frequencies first and call generate_from_frequencies
    wc.generate(text)
    # Take the colors from the background image
    image_colors = ImageColorGenerator(back_coloring)
    plt.figure()
    plt.imshow(wc.recolor(color_func=image_colors))
    # Save the image
    wc.to_file(imgname2)

I once thought of using each author's avatar as the background image for the word cloud, but some avatars produced poor results, so I chose a fixed image as the background and generated the word cloud for Huai Zuo.





Word cloud of Huai Zuo's articles

Huai Zuo really is an excellent student: working hard, living, writing, reading, studying…

Good! I have also decided to study programming hard and work hard (Flag × MAX). Say no more! I'm off to try! (runs away)

I'm back: the source files have been uploaded to GitHub and are free to download. The code sets an interval between requests; it is only a small crawler, but please don't put too much strain on Uncle Jian's servers.

For learning the basics of Python, I recommend The Python 2.7 Tutorial; there is also a Python 3 tutorial.

If you have any questions about the code, I will answer them as earnestly as I studied: it runs fine on my computer. (escapes)