This article is from netease Cloud community
Author: Wang Bei
Primary school students are now learning Python, as a professional programmer, of course, can not fall behind, so, hurry up, On Saturday and weekend at home to learn Python3, python3 basic syntax is relatively simple, compared to Java development is more agile, python3 foundation does not speak, Here I mainly tell the crawler here to achieve the logic of small procedures
Upper and lower module diagram:
Be clear at a glance, the overall is this step 5, involves python3 requests, bs4, re, sqlalchemy these four modules.
(1) requests for:
HTTP is a powerful HTTP client library that provides rich apis, such as sending a GET request:
with requests.get(url,params={},headers={}) as rsp:
res.text # Return the text content of the valueCopy the code
Send a POST request with json as an input parameter:
with requests.post(url,json={},headers={}) as rsp:
res.text # Return the text content of the valueCopy the code
And so on.
In requests, the return value is RSP. At the end of the with as line, the __exit__() method is executed. The __exit__() method in requests dropped Request close, which is why the program doesn’t show a call to close. Here is an example of how with as works.
Requests and there are many powerful features, reference: https://www.cnblogs.com/lilinwei340/p/6417689.html.
(2) BS4 BeatifulSoup
For example, if we want to get news, map, video, and post bar content, we just need to:
soup=BeautifulSoup(html,'html.parser')
atags=soup.find('div', {'id':'u1'}).findChilren('a', {'class':'mnav'})
values=[]for atag in atags:
values.append(atag.text)Copy the code
Python also has an xpath for parsing HTML and scrapy framework, which we’ll discuss later.
<html> <head> <meta http-equiv=content-type content=text/html; charset=utf-8> <meta http-equiv=X-UA-Compatible content=IE=Edge> <meta content=always name=referrer> <link rel=stylesheettype= "text/CSS href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css > < title > baidu once, you'll know < / title > < / head > < body link =#0000cc>
<div id=wrapper>
<div id=head>
<div > <div > <div > <div id=lg><img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129></div>
<form id=form name=f action=//www.baidu.com/s > <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden
name=rsv_bp
value=1>
<input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span
><input id=kw name=wd > autocomplete=off autofocus></span><span
><input type<div id=u1><a href=http://news.baidu.com name=tj_trnews > name=tj_trhao123 > <a href=http://map.baidu.com name=tj_trmap > > href=http://tieba.baidu.com name=tj_trtieba > <noscript><a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login > <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=' + encodeURIComponent(window.location.href + (window.location.search === "" ? "?" : "&") + "bdorz_come=1") + 'name="tj_login" > login '); </script> <a href=//www.baidu.com/more/ name=tj_briicon > </div> </div> <div id=ftCon> <div id=ftConw><p id=lh><a Href =http://home.baidu.com> About Baidu</a>< a href=http://ir.baidu.com>About Baidu</a></p> <p id=cp>©2017 Baidu< a href=http://www.baidu.com/duty/ > before using baidu required < / a > < a href=http://jianyi.baidu.com/ > src=//www.baidu.com/img/gs.gif > < / p > < / div > </div> </div> </body> </html>Copy the code
(3) re
Re regular modules are very strong, there is a match search sub replace these apis, each has its own special skill, you can refer to: http://www.runoob.com/python3/python3-reg-expressions.html
(4) sqlalchemy
A Python database ORM framework, used under, very easy to use, a bit similar to Java Hibernate, but more flexible.
Said so much, the thread under the crawler script code, the following is the directory structure, after all, is also a professional programmer, can not write a mess, also want to pay attention to the structure, ha ha. 
— — — — — – # youku_any package name
————–datasource. Py # manages datasource sessions exclusively
————–youkubannerdao.py # program capture youkubanner information, this is the DAO layer
————–youkuservice. Py # Needless to say, business logic
There’s one more thing, which is to build tables, and I won’t go into that:
CREATE TABLE `youku_banner` ( `id` bigint(22) NOT NULL AUTO_INCREMENT, `type` int(2) NOT NULL, Genre 1: TV 2: Movie 3: variety show
`year` int(4) NOT NULL, `month` int(2) NOT NULL, `date` int(2) NOT NULL, `hour` int(2) NOT NULL, `minute` int(2) NOT NULL, `img` varchar(255) DEFAULT NULL, `title` varchar(255) DEFAULT NULL, `url` varchar(255) DEFAULT NULL, `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `idx_uniq` (`year`,`month`,`date`,`hour`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=83 DEFAULT CHARSET=utf8mb4Copy the code
Next comes the code implementation:
datasource.py
from sqlalchemy import create_enginefrom sqlalchemy.orm import sessionmaker
dburl = 'mysql+pymysql://root:123@localhost/youku? charset=utf8'Create_engine (dburl,pool_size=100,pool_recycle=3600)
Session = sessionmaker(bind=ds)Class SessionManager():
def __init__(self):
self.session=Session() def __enter__(self):
return self.session # connect pool manages session, no need to display close
def __exit__(self, exc_type, exc_val, exc_tb):
# session.close()
print('not close')Copy the code
youkubannerdao.py
from sqlalchemy import Sequence, Column, Integer, BigInteger, String, TIMESTAMP, textfrom sqlalchemy.ext.declarative import declarative_basefrom youku_any.datasource import SessionManager
Base = declarative_base()Baseclass YoukuBanner(Base)
Create table name
__tablename__ = 'youku_banner'
# define field mapping
id = Column(BigInteger, Sequence('id'), primary_key=True)
type=Column(Integer)
year = Column(Integer)
month = Column(Integer)
date = Column(Integer)
hour = Column(Integer)
minute = Column(Integer)
img = Column(String(255))
title = Column(String(255))
url = Column(String(255))
createTime = Column('create_time', TIMESTAMP) def add(self):
SessionManager __enter__() logic line ends with __exit()__
with SessionManager() as session: try:
session.add(self)
session.commit() except:
session.rollback() def addBatch(self,values):
with SessionManager() as session: try:
session.add_all(values)
session.commit() except:
session.rollback() def select(self,param):
with SessionManager() as session: return session.query(YoukuBanner).select_from(YoukuBanner).filter(param) def remove(self,parma):
with SessionManager() as session: try:
session.query(YoukuBanner).filter(parma).delete(synchronize_session='fetch')
session.commit() except:
session.rollback() def update(self,param,values):
with SessionManager() as session: try:
session.query(YoukuBanner).filter(param).update(values, synchronize_session='fetch')
session.commit() except:
session.rollback()Copy the code
youkuservice.py
import requestsimport jsonimport reimport datetimefrom bs4 import BeautifulSoupfrom sqlalchemy import textfrom youku_any.youkubannerdao import YoukuBannerdef getsoup(url):
with requests.get(url, params=None, headers=None) as req: ifreq.encoding ! ='utf-8':
encodings = requests.utils.get_encodings_from_content(req.text) if encodings:
encode = encodings[0] else:
encode = req.apparent_encoding
encode_content = req.content.decode(encode).encode('utf-8')
soup = BeautifulSoup(encode_content, 'html.parser') return soupdef getbanner(soup):
# soup = BeautifulSoup()
# soup.findChild()
bannerDivP = soup.find('div', {'id': 'm_86804'.'name': 'm_pos'})
bannerScript = bannerDivP.findChildren('script', {'type': 'text/javascript'})[1].text
m = re.search('\ [* \]', bannerScript)
banners = json.loads(m.group()) for banner in banners:
time = datetime.datetime.now()
youkubanner = YoukuBanner(type=1, year=time.year, month=time.month, date=time.day, hour=time.hour,
minute=time.minute,
img=banner['img'], title=banner['title'], url=banner['url'])
youkubanner.add()
soup=getsoup('http://tv.youku.com/')
getbanner(soup)
youkuBanner = YoukuBanner()
youkuBanner.remove(parma=text('id=67 or id=71'))
youkuBanner.update(param=text('id=70'),values={'title':YoukuBanner.title + Wuthering Heights})for i inRange (0100, 00) : youkuBanner. Update (param = the text ('id=70'), values={'title': YoukuBanner.title + Wuthering Heights})
bannerList = youkuBanner.select(param=text('id > 66 and id < 77 order by id asc limit 0,7'))
print("lines--------%d" % i) # time.sleep(10)
for banner in bannerList:
print(banner.id,banner.minute,banner.img,banner.title)Copy the code
At this point, a simple crawler script has been written, and the results of the two days over the weekend are a little satisfying, but this is just the tip of the Python iceberg, and there is much more to explore.
Netease Cloud Free experience pavilion, 0 cost experience 20+ cloud products!
For more information about NETEASE’s r&d, product and operation experience, please visit netease Cloud Community.
Use of the responsibility chain model – analysis of Netty ChannelPipeline and Mina IoFilterChain