Many people in different companies may need to collect external data from the Internet for a variety of reasons: analyzing competition, putting together news summaries, tracking trends in specific markets, or collecting daily stock prices to build predictive models…
Whether you’re a data scientist or a business analyst, you’ve probably run into this situation and asked yourself the perennial question: how can I extract a site’s data for market analysis?
One free way to extract a website’s data and structure is to use a crawler.
In this article, you’ll learn how to do data crawler tasks easily with Python.
What is a crawler?
Broadly speaking, web crawling refers to the process of programmatically extracting a website’s data and structuring it to fit your needs.
Many companies are using data crawlers to collect external data and support their business operations: this is a common practice in many areas today.
What do I need to know to learn web crawling in Python?
It’s easy, but you’ll need some Python and HTML knowledge first.
It also helps to know about two very effective frameworks: Scrapy and Selenium.
Detailed introduction
Next, let’s learn how to turn your website into structured data!
To do this, you first need to install the following libraries:
- Requests: sends HTTP requests (such as GET and POST); we’ll use it primarily to access the source code of any given website
- BeautifulSoup: parses HTML and XML data easily
- lxml: speeds up the parsing of XML and HTML files
- Pandas: structures the data into DataFrames and exports it in a format of your choice (JSON, Excel, CSV, etc.)
If you’re using Anaconda, configuration is easy: these packages come pre-installed.
If you’re not using Anaconda, run the following commands to install the packages:
pip install requests
pip install beautifulsoup4
pip install lxml
pip install pandas
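If you want to confirm the installation worked, a quick optional check from a Python shell is simply importing the packages (nothing here is specific to the site we’ll crawl):

import requests, bs4, lxml, pandas
print("all packages imported successfully")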
What websites and data are we going to grab?
This is the first question to be answered in the crawler process.
This article uses Premium Beauty News as an example.
Featuring premium beauty news, it publishes the latest trends in the beauty market.
If you look at the home page, you’ll see that the articles we’re going to crawl are organized in a grid.
The multi-page organization is as follows:
Of course, we could just grab the title of each article appearing on these pages, but we’ll drill down into each post to get the details we need, for example:
- The title
- The date
- The abstract
- The full text
Coding practices
So far, we’ve covered the basics and the toolkits you’ll need.
Next, let’s walk through the coding steps.
First, you need to import the basic toolkit:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm_notebook
I usually define a function that parses the content of the page at a given URL.
This function will be called many times; let’s name it parse_url:
def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response
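As a side note, if you want the request to fail loudly rather than silently parsing an error page, a slightly hardened sketch of the same function could look like the following (the 10-second timeout is an arbitrary choice, not something from the site):

def parse_url(url):
    # Sketch: add a timeout and an explicit status check so a hung or
    # failed request doesn't go unnoticed (the timeout value is arbitrary).
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.content, "lxml")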
Extracting each post’s data and metadata
First, I’ll define a function that extracts the data (title, date, abstract, and so on) for each post at a given URL.
We will then call this function within the for loop that iterates through all pages.
To build our crawler, we must first understand the basic HTML logic and structure of the page. Let’s take extracting a post’s title as an example.
By checking this element in the Chrome Inspector:
We notice that the title appears inside an h1 tag with the article-title class.
After extracting the page content with BeautifulSoup, you can extract the title using the find method.
title = soup_post.find("h1", {"class": "article-title"}).text
Next, take a look at the date:
The date is displayed within a span, and that span itself sits inside a header with the row sub-header class.
Converting this into code using BeautifulSoup is very easy:
datetime = soup_post.find("header", {"class": "row sub-header"}).find("span")["datetime"]
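If you prefer a proper Python datetime object over the raw string, and assuming the attribute holds an ISO 8601 value (typical for a datetime attribute, but not verified for this site), you could parse it as shown below. Note that the variable above is named datetime, which shadows the standard-library module, hence the import alias:

from datetime import datetime as dt

# Assumes an ISO 8601 string such as "2021-04-30T12:00:00+02:00";
# adjust the parsing if the site uses a different format.
published_at = dt.fromisoformat(datetime)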
The next step is the summary:
It sits in an h2 tag with the article-intro class:
abstract = soup_post.find("h2", {"class": "article-intro"}).text
Now we need to crawl the full content of the post. This part is easy if you’ve understood the previous steps.
The content is spread across multiple paragraphs (p tags) inside a div with the article-text class.
Rather than looping over every p tag, extracting its text, and concatenating the pieces, BeautifulSoup can extract the complete text in one go (a sketch of the manual approach follows the snippet below):
content = soup_post.find("div", {"class": "article-text"}).text
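For reference, the more manual paragraph-by-paragraph approach mentioned above would look roughly like this; it gives finer control, for instance if you want a separator between paragraphs:

paragraphs = soup_post.find("div", {"class": "article-text"}).find_all("p")
content = "\n".join(p.text for p in paragraphs)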
Let’s put them in the same function:
def extract_post_data(post_url):
    soup_post = parse_url(post_url)
    title = soup_post.find("h1", {"class": "article-title"}).text
    datetime = soup_post.find("header", {"class": "row sub-header"}).find("span")["datetime"]
    abstract = soup_post.find("h2", {"class": "article-intro"}).text
    content = soup_post.find("div", {"class": "article-text"}).text
    data = {
        "title": title,
        "datetime": datetime,
        "abstract": abstract,
        "content": content,
        "url": post_url
    }
    return data
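To check that the function behaves as expected, you can try it on a single post first. The URL below is a placeholder for illustration only; substitute any real post URL taken from the site:

# Placeholder URL - replace it with a real post URL from the site.
sample = extract_post_data("https://www.premiumbeautynews.com/fr/some-article-slug")
print(sample["title"])
print(sample["datetime"])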
Extracting post URLs across multiple pages
If we examine the home page’s source code, we can see the title of each article on the page:
As you can see, each of the 10 articles on a page sits inside a div with the post-style1 col-md-6 class:
Now, it’s easy to extract the articles for each page:
url = "https://www.premiumbeautynews.com/fr/marches-tendances/"
soup = parse_url(url)
section = soup.find("section", {"class": "content"})
posts = section.findAll("div", {"class": "post-style1 col-md-6"})
Then, for each individual post, we can extract the URL that appears inside the h4 tag.
We will use this URL to call the function extract_post_data that we defined earlier.
uri = post.find("h4").find("a")["href"]
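A small aside: in the main loop below, this relative URI is joined to the site’s base URL by plain string concatenation. If you want something more robust, the standard library’s urljoin handles both relative and absolute hrefs as well as stray slashes:

from urllib.parse import urljoin

base_url = "https://www.premiumbeautynews.com/"
post_url = urljoin(base_url, uri)  # equivalent to base_url + uri for simple relative paths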
Pagination
After extracting the posts on a given page, we need to move to the next page and repeat the same operation.
Looking at the pagination, there is a “Next” button:
This button becomes inactive once we reach the last page.
In other words, as long as the Next button is active, the crawler should scrape the current page, move to the next one, and repeat; once the button becomes inactive, the process stops.
This logic translates to the following code:
next_button = ""
posts_data = []
count = 1
base_url = 'https://www.premiumbeautynews.com/'

while next_button is not None:
    print(f"page number : {count}")

    soup = parse_url(url)
    section = soup.find("section", {"class": "content"})
    posts = section.findAll("div", {"class": "post-style1 col-md-6"})

    for post in tqdm_notebook(posts, leave=False):
        uri = post.find("h4").find("a")["href"]
        post_url = base_url + uri
        data = extract_post_data(post_url)
        posts_data.append(data)

    next_button = soup.find("p", {"class": "pagination"}).find("span", {"class": "next"})
    if next_button is not None:
        url = base_url + next_button.find("a")["href"]
        count += 1
Once this loop completes, all the data is stored in posts_data, which can be converted into a nice DataFrame and exported as a CSV or Excel file.
df = pd.DataFrame(posts_data)
df.head()
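For example, to export the results (the file names are arbitrary; to_excel additionally requires the openpyxl package):

df.to_csv("posts_data.csv", index=False)
df.to_excel("posts_data.xlsx", index=False)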
And just like that, an unstructured web page has been turned into structured data!