Hi, I’m Latiao.

Believe it or not, I'm still an undergrad and my family has already set me up on a blind date... Is it really that urgent? Is this what the marriage market has come to? These days, boys and girls who don't work hard get caught by their families and married off.

This is a chat with my mom, and I’ll show you this girl.

And that's where it ended. Was there something wrong with the way I chatted? Guys, tell me I'm not that clueless. Or do they want me to go out and talk to people in person? That got me thinking about the matchmaking market, and it inspired me to scrape some data from a matchmaking site. It shows everyone what things look like for single men and women right now, and it's a chance to learn some technique too, so why not? Let's get straight to the point!


Crawler target

Website: XXXX matchmaking

Results show

Tools used

Development environment: Win10, Python 3.7
Development tools: PyCharm, Chrome
Libraries: requests, docx (python-docx), lxml

Key learning content

1. Extracting data with XPath
2. Storing data in a docx document
3.

Project idea analysis

Choose the age bracket you're interested in (your "wealth password") and get the page data for the current page.

Extract the corresponding hyperlinks with XPath.

Get the image addresses, which are used to save the pictures.

def get_data(url):
    # Fetch the listing page
    response = requests.get(url, headers=headers)
    # print(response.text)
    data = etree.HTML(response.text)
    # Links to the detail pages and addresses of the profile images
    href_list = data.xpath("//div[@class='e-img']/a/@href")
    img_list = data.xpath("//div[@class='e-img']/a/img/@src")
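To see how this kind of XPath query behaves without hitting the site, here is a minimal sketch. It is not the code from this article: it swaps lxml for the standard library's xml.etree.ElementTree so it runs with no extra installs, and the SAMPLE markup and the extract_links name are made up for illustration. ElementTree only supports a subset of XPath, so the attributes are read with .get() instead of /@href and /@src.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a listing page (not the real site's markup).
SAMPLE = """
<html><body>
  <div class="e-img"><a href="/zhenghun/1.html"><img src="/img/1.jpg"/></a></div>
  <div class="e-img"><a href="/zhenghun/2.html"><img src="/img/2.jpg"/></a></div>
  <div class="other"><a href="/about.html">About</a></div>
</body></html>
"""


def extract_links(markup):
    """Return (hrefs, srcs) for every <a> directly inside a div with class 'e-img'."""
    root = ET.fromstring(markup.strip())
    links = root.findall(".//div[@class='e-img']/a")
    hrefs = [a.get("href") for a in links]
    srcs = [a.find("img").get("src") for a in links]
    return hrefs, srcs


hrefs, srcs = extract_links(SAMPLE)
print(hrefs)
print(srcs)
```

The div with class "other" is skipped because the predicate only matches class="e-img", which is the same filtering the lxml query relies on.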

Splice together the URL of each detail page, fetch the detail page data, and download the image data. The fields to extract:

  • Name
  • Education
  • Profession
  • Marital status
  • Work address
  • Requirements
  • ...
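The URL splicing step can also be sketched with the standard library's urllib.parse.urljoin instead of plain string concatenation. This is only an illustration: the absolute helper and the sample paths are hypothetical, and only the host name comes from this article.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.csflhjw.com"  # host used in this article's code


def absolute(link):
    """Turn a relative href/src taken from the listing page into a full URL."""
    return urljoin(BASE_URL, link)


# Hypothetical paths, for illustration only.
print(absolute("/zhenghun/info-1.html"))
print(absolute("uploads/photo.jpg"))
print(absolute("https://example.com/a.jpg"))
```

Compared with "host" + path, urljoin copes with a missing leading slash and leaves links that are already absolute untouched.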

    for href, img in zip(href_list, img_list):
        # Download the image and save it to disk (the same file is reused for every profile)
        img = requests.get("https://www.csflhjw.com" + img, headers=headers).content
        # print(img)
        with open("1.jpg", "wb") as f:
            f.write(img)
        # Fetch and parse the detail page
        res = requests.get("https://www.csflhjw.com" + href, headers=headers)
        # print(res.text)
        html = etree.HTML(res.text)
        name = html.xpath('//div[@class="team-e"]/h2/text()')[0]
        edu = html.xpath('//div[@class="team-e"]/p[1]/text()')[0]
        # No [0] here: this one stays a list
        profession = html.xpath('//div[@class="team-e"]/p[2]/text()')
        sponsa = html.xpath('//div[@class="team-e"]/p[3]/text()')[0]
        children = html.xpath('//div[@class="team-e"]/p[4]/text()')[0]
        house = html.xpath('//div[@class="team-e"]/p[5]/text()')[0]
        add = html.xpath('//div[@class="team-e"]/p[6]/text()')[0]
        ask_for = html.xpath('//div[@class="hunyin-1-2"]/p[2]/span/text()')[0]

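Indexing an XPath result with [0] raises IndexError whenever a profile is missing that field. A small sketch of one way to guard against that, assuming the results are lists of strings as text() queries return; first_text is a hypothetical helper, not part of lxml.

```python
def first_text(nodes, default=""):
    """Return the first non-blank string from an XPath result list, stripped."""
    for node in nodes:
        text = str(node).strip()
        if text:
            return text
    return default


print(first_text(["  Bachelor  "]))
print(first_text([], default="N/A"))
print(first_text(["\n", "Engineer"]))
```

It could be used as, for example, name = first_text(html.xpath(...)), so an incomplete profile produces an empty field instead of stopping the whole crawl.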

Create the document file and save the data in a docx document.

# Created once, at module level
document = Document()
document.add_heading('Sweet Matchmaker')

        # Inside the loop: one group of paragraphs plus a picture per profile
        document.add_paragraph("Name:" + name)
        document.add_paragraph(edu)
        document.add_paragraph(profession)
        document.add_paragraph(sponsa)
        document.add_paragraph(children)
        document.add_paragraph(house)
        document.add_paragraph(add)
        document.add_paragraph(ask_for)
        document.add_picture("1.jpg")
        document.add_paragraph("")
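Since document.add_picture("1.jpg") reads the file from disk, the image has to be completely written before that call. A minimal sketch of one way to make that explicit, using a with block and one file per profile; save_image is a hypothetical helper, and placeholder bytes stand in for a real download.

```python
import tempfile
from pathlib import Path


def save_image(data, folder, index):
    """Write image bytes to <folder>/<index>.jpg and return the path."""
    path = Path(folder) / f"{index}.jpg"
    with open(path, "wb") as f:  # flushed and closed when the block exits
        f.write(data)
    return path


# Demo with placeholder bytes instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    paths = [save_image(b"fake-image-bytes", tmp, i) for i in range(1, 3)]
    names = [p.name for p in paths]
    sizes = [p.stat().st_size for p in paths]

print(names, sizes)
```

The returned path can then be passed to add_picture as a string, and the separate file names keep one profile's photo from overwriting another's.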

Simple source code analysis

import requests
from docx import Document
from lxml import etree

document = Document()
document.add_heading('Sweet Matchmaker')


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
}


def get_data(url):
    # Fetch the listing page
    response = requests.get(url, headers=headers)
    # print(response.text)
    data = etree.HTML(response.text)
    href_list = data.xpath("//div[@class='e-img']/a/@href")
    img_list = data.xpath("//div[@class='e-img']/a/img/@src")
    # print(href_list)
    for href, img in zip(href_list, img_list):
        # Download the image and save it to disk (the same file is reused for every profile)
        img = requests.get("https://www.csflhjw.com" + img, headers=headers).content
        # print(img)
        with open("1.jpg", "wb") as f:
            f.write(img)
        # Fetch and parse the detail page
        res = requests.get("https://www.csflhjw.com" + href, headers=headers)
        # print(res.text)
        html = etree.HTML(res.text)
        name = html.xpath('//div[@class="team-e"]/h2/text()')[0]
        edu = html.xpath('//div[@class="team-e"]/p[1]/text()')[0]
        # No [0] here: this one stays a list
        profession = html.xpath('//div[@class="team-e"]/p[2]/text()')
        sponsa = html.xpath('//div[@class="team-e"]/p[3]/text()')[0]
        children = html.xpath('//div[@class="team-e"]/p[4]/text()')[0]
        house = html.xpath('//div[@class="team-e"]/p[5]/text()')[0]
        add = html.xpath('//div[@class="team-e"]/p[6]/text()')[0]
        ask_for = html.xpath('//div[@class="hunyin-1-2"]/p[2]/span/text()')[0]
        document.add_paragraph("Name:" + name)
        document.add_paragraph(edu)
        document.add_paragraph(profession)
        document.add_paragraph(sponsa)
        document.add_paragraph(children)
        document.add_paragraph(house)
        document.add_paragraph(add)
        document.add_paragraph(ask_for)
        document.add_picture("1.jpg")
        document.add_paragraph("")


def main():
    # range(1, 2) covers page 1 only; raise the upper bound to crawl more pages
    for i in range(1, 2):
        url = "https://www.csflhjw.com/zhenghun/9.html?page={}".format(i)
        get_data(url)


if __name__ == '__main__':
    main()
    document.save('demo.docx')
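If you want more than one listing page, the loop in main() can be widened. A rough sketch, assuming the page parameter simply counts up from 1 (the code here only ever requests page 1, so that is a guess); page_urls, crawl, and the delay parameter are illustrative names, not part of the script.

```python
import time

PAGE_URL = "https://www.csflhjw.com/zhenghun/9.html?page={}"  # pattern from this article


def page_urls(first, last):
    """Yield listing-page URLs for pages first..last, inclusive."""
    for page in range(first, last + 1):
        yield PAGE_URL.format(page)


def crawl(first, last, fetch, delay=1.0):
    """Call fetch(url) for each page, pausing between requests."""
    for url in page_urls(first, last):
        fetch(url)
        time.sleep(delay)


urls = list(page_urls(1, 3))
print(urls)
```

Calling crawl(1, 3, get_data) would then run the scraper over three pages, with the pause keeping the request rate gentle on the site.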

PS: I can already picture my future self being nagged to get married. Hang in there, brothers: earn good money, start a career first and a family after! This article is for learning and exchange only! If it helped you, remember to give Latiao the triple combo!