
Previously, the committee wrote an article on crawling and screenshotting a hot list, and another on fast and elegant HTML report development.

This time we play it a little bigger: we crawl the hot list and save it directly as an HTML report for viewing.

First, a look at the results:

Let's get to it!

The first step is to generate a report

That's right: the crawler hasn't come on stage yet. First we make something out of nothing, using a little hand-written data to generate the report.

Save the following code as report.py; it will be imported under this name later.

from dominate.tags import *

"""Special HTML report generating function"""
def generate_html(tuples):
    _html = html()
    _head = head()
    _head.add(title("CSDN Hot List Report compiled by Lei School Committee"))
    _head.add(meta(charset="utf-8"))
    _html.add(_head)
    _body = _html.add(body())
    _table = table(border=1)
    with _table.add(tbody()):
        for index, tp in enumerate(tuples, start=1):
            leiXW = tr()
            leiXW += td(str(index))
            leiXW += td(a(tp[1], href=tp[0]))
    with _body.add(div(cls="leixuewei")):
        h3("CSDN Hot List compiled by Lei School Committee")
    _body.add(_table)
    return _html.render()

"""Special function for generating save report directly"""
def lei_report(leixuewei_tuples, path):
    data = generate_html(leixuewei_tuples)
    with open(path, "w") as f:
        f.write(data)
       

if __name__ == "__main__":
    lxw_tuples = []
    lxw_tuples.append(("https://blog.csdn.net/geeklevin/article/details/119594295"."Python generates Html reports"))
    lxw_tuples.append(("https://blog.csdn.net/geeklevin/article/details/116771659"."Tired of Docker, try Vagrant."))
    path = "./csdn_rank.html"
    lei_report(lxw_tuples, path)
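If you run python report.py directly, the demo block at the bottom should write a csdn_rank.html file with two linked rows next to the script.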

Code parsing

The code generates an HTML page and saves it to the path passed in.

  1. Prepare an array of tuples
  2. Pass it to the generate_html function, which builds the head and body; the body iterates over the input array to generate a table (see the dominate sketch below)
  3. Write the rendered HTML out to a file
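
Step 2 leans on a dominate feature worth calling out: tags created inside a "with" block attach themselves to the context tag automatically, while += adds children to a specific tag. A minimal sketch of that pattern (the sample row data here is made up for illustration):

from dominate.tags import a, table, tbody, td, tr

# Made-up sample data for the sketch
rows = [("https://example.com", "Demo article")]

_table = table(border=1)
with _table.add(tbody()):
    for index, (link, title) in enumerate(rows, start=1):
        leiXW = tr()                      # auto-attached to the tbody by the with-context
        leiXW += td(str(index))           # attached to this row, not the tbody
        leiXW += td(a(title, href=link))

print(_table.render())
# Renders roughly:
# <table border="1">
#   <tbody>
#     <tr>
#       <td>1</td>
#       <td><a href="https://example.com">Demo article</a></td>
#     </tr>
#   </tbody>
# </table>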

The effect is as follows:

The second step is to transform the previous crawler code

That is, we take the core code from the previous hot-list crawler screenshot article and transform it directly below.

"""Lei Xuewei's core code for screenshotting stream-loading pages."""
from time import sleep

def resolve_height(driver, pageh_factor=5):
    js = "return document.body.scrollHeight"
    height = 0
    page_height = driver.execute_script(js)
    ref_pageh = int(page_height * pageh_factor)
    step = 150      # pixels per scroll step
    max_count = 15  # safety cap on scroll iterations
    count = 0
    while count < max_count and height < page_height:
        #scroll down to page bottom
        for i in range(height, ref_pageh, step):
            count += 1
            vh = i
            slowjs = 'window.scrollTo(0, {})'.format(vh)
            print('exec js: %s' % slowjs)
            driver.execute_script(slowjs)
            sleep(0.3)
        if i >= ref_pageh - step:
            print('[Demo] not fully read')
            break
        height = page_height
        sleep(2)
        page_height = driver.execute_script(js)
    print("finish scroll")
    return page_height

# Get the actual page height
page_height = resolve_height(driver)
print("[Demo] Page height: %s" % page_height)
sleep(5)
driver.execute_script('document.documentElement.scrollTop=0')
sleep(1)
driver.save_screenshot(img_path)
page_height = driver.execute_script('return document.documentElement.scrollHeight') # page height
print("get accurate height : %s" % page_height)

# The code above is derived from the previous article

# reference report function
from report import lei_report

# drag to the bottom of the page
driver.execute_script(f'document.documentElement.scrollTop={page_height}; ')
sleep(1)
driver.save_screenshot('./leixuewei_rank_end.png')
blogs = driver.find_elements_by_xpath("//div[@class='hosetitem-title']/a")

# Create an array of (link, title) tuples
articles = []
for blog in blogs:
    link = blog.get_attribute("href")
    title = blog.text
    articles.append((link,title))

print('get %s articles' % len(articles))
print('articles : %s ' % str(articles))

# Given a path, generate the HTML report
path = "./leixuewei_csdn_rank.html"
lei_report(articles, path)
print("Save hot list to path :%s" %path)

""" Xuewei Demo code, white whoring so much, pay attention to three even support it! ""

Code parsing

Compared with the streaming-page crawler code from the previous article, the screenshot-merging segment has been removed.

And then, here’s the point. The steps are as follows:

  1. The crawler scrolls straight to the bottom, grabs the article links, and builds the array
  2. It also takes a screenshot of the end of the page, which you can keep as a souvenir
  3. It imports and calls the lei_report function to generate the report page

Relatively simple, so no line-by-line interpretation; a minimal sketch of the assumed driver setup follows below.
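
For readers who skipped the previous article, here is a minimal sketch of the Selenium setup the snippet above assumes already exists. The hot-list URL and the headless option are assumptions for illustration, not taken from the original:

from time import sleep
from selenium import webdriver

# Assumed setup (not from the original article): a Chrome driver
# pointed at a hypothetical hot-list URL.
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # assumption: run without a visible window

driver = webdriver.Chrome(options=options)
driver.get("https://blog.csdn.net/rank/list")  # hypothetical URL
sleep(2)  # give the stream page a moment to start loading

page_height = resolve_height(driver)  # then continue exactly as in the code above
driver.quit()

Note that find_elements_by_xpath is the Selenium 3 API used at the time; on Selenium 4 the equivalent is driver.find_elements(By.XPATH, ...).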

The effect is as follows:

The report is too long, so the screenshot is cut off, but you get the idea.



Conclusion

This article is for demonstration purposes only; if the demonstrated website has any objection, please let us know.

Finally, be careful with crawlers: do not treat crawling organizations' websites as child's play, and even for study, do not hammer real sites with requests. That kind of behavior can land you in prison!

By the way, for long-term reading you can also follow the Lei Xuewei compilation of interesting programming stories or the Lei Xuewei NodeJS series.

Keep learning and keep developing. I am Lei Xuewei! Programming is fun; the key is to get the technique right. Creation is not easy, so please like, bookmark, and support the committee!