A few months ago, my sister decided to open an online shop on Taobao to sell the fish raised in the family pond as snacks; there are already plenty of online fish-snack shops, and their sales looked good.
My sister knew I had some skills the family considered "superior," so she asked whether I could help her gather some data first. She wanted to do a bit of research before opening the store, for example which price points are the most popular for small fish snacks. You could of course collect this by hand, but that is tedious and error-prone, so using a bit of technology is the better option.
It was not a difficult job for me, so I took some time to write a crawler and do the data analysis for her. After studying the results, she found that the seemingly red-ocean fish-snack market still had a blue-ocean corner. She prepared accordingly and opened her fish-snack shop, and in the first month it brought in more than 30,000 yuan after deducting costs and advertising expenses.
Next, I would like to share how I used Python to do this Taobao market research for my sister.
1. Project requirements
First of all, the target URL I want to crawl is, naturally, Taobao:
s.taobao.com/search?q=%E…
Then let's look at what my sister's requirements are:
1. Crawl the sales volume and price of the Taobao listings on the first 10 pages, count how many products sit in each price range, and show the result as a chart (a small plotting sketch follows below). Prices are divided into the following 10 ranges: under 10 yuan, 10~30, 30~50, 50~70, 70~90, 90~110, 110~130, 130~150, 150~170, and 170~200 yuan.
2. Crawl the geographic locations of the Taobao merchants on the first 10 pages and show the result as a chart.
3. Crawl the names of the top 10 stores on the first 10 pages and the link to each store's fish-snack listing (one link per store).
4. Crawl the review keywords of buyers in the 10 stores with the most purchases and generate a word cloud.
These four requirements make it obvious what my sister wants to know: at which price points the small-fish-snack market already has the most stores, where those stores are located, which stores are doing well, and what buyers care about most.
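For the chart in requirement 1, the counting itself is handled by the tongji() function shown later; the plotting code is not in the excerpt below, so here is a minimal sketch of how the bar chart could be drawn with matplotlib, assuming you already have the per-range counts (the values here are placeholders, not the real results):
import matplotlib.pyplot as plt
# Placeholder per-range counts; in practice use the DATAS dict built by tongji()
datas = {'<10': 0, '10~30': 0, '30~50': 0, '50~70': 0, '70~90': 0,
         '90~110': 0, '110~130': 0, '130~150': 0, '150~170': 0, '170~200': 0}
plt.figure(figsize=(10, 5))
plt.bar(list(datas.keys()), list(datas.values()), color='steelblue')
plt.xlabel('Price range (yuan)')
plt.ylabel('Number of products')
plt.title('Small fish snacks by price range (first 10 pages)')
plt.tight_layout()
plt.show()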
2. Effect preview
The code took me almost two days to write and covers basically all of my sister's requirements. Let's look at the results.
1. Number of small-fish-snack products in each price range across the first 10 pages of Taobao results.
You can see that at 10~30 yuan small fish snacks are already a red ocean: too many sellers and basically no margin. The under-10-yuan and 30~50-yuan ranges come next and also sell a lot. What is interesting is that there are only five products in the 70~90 range and just one between 110 and 130 yuan.
So it is worth considering the mid-to-high end of the market; that part is still a blue ocean.
2. Distribution across China of shops selling small fish snacks.
My sister's home is in Hunan, and thanks to the nourishment of the Dongting Lake water system there are quite a few merchants there selling small fish snacks online; coastal regions come next. It seems her competitors are concentrated in those same areas.
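The original chart is a map of China; as a simpler illustration, the same 'Store Location' column from the scraped CSV can be aggregated and plotted as a horizontal bar chart. A rough sketch, assuming the CSV produced by get_10_pages_datas() (shown later) and matplotlib:
import csv
from collections import Counter
import matplotlib.pyplot as plt
# Count how many listings come from each location in the scraped CSV
locations = Counter()
with open('First ten pages of sales and dollars.csv', encoding='utf-8-sig') as f:
    for row in csv.DictReader(f):
        if row.get('Store Location'):
            locations[row['Store Location']] += 1
top = locations.most_common(15)
plt.barh([loc for loc, _ in top][::-1], [num for _, num in top][::-1])
plt.xlabel('Number of listings')
plt.title('Where the small-fish-snack shops are located')
plt.tight_layout()
plt.show()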
3. Links to the 10 stores with the most buyers.
With this list, my sister can regularly look at how other shops do things. In fact, you can also look up the best-selling products directly on Taobao, so I am not sure why she wanted it this way, but I did it anyway.
4. Word cloud of user reviews
The word cloud is only a rough one, but you can still tell that what users care about most is packaging quality, taste, shelf life, ingredients, and so on. Once you understand what buyers care about, product design, store setup, and advertising can all be targeted accordingly.
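One practical note on the word cloud: the reviews are Chinese text, and wordcloud by itself does not split Chinese into words, so in practice it helps to segment the text first, for example with jieba. A minimal sketch, assuming the 'Top 10 evaluations.txt' file produced by the crawler and a locally available Chinese font (msyh.ttc):
import jieba
import wordcloud
# Read the scraped reviews and segment the Chinese text into space-separated words
with open('Top 10 evaluations.txt', encoding='utf-8') as f:
    text = ' '.join(jieba.lcut(f.read()))
w = wordcloud.WordCloud(width=1000, height=700,
                        background_color='white',
                        font_path='msyh.ttc')  # a font that contains Chinese glyphs
w.generate(text)
w.to_file('Top 10 evaluations word cloud.png')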
3. Part of the source code
Because the full source is fairly long (there are three source files), I will only show part of it here; if you need the complete source, you can contact me. You could also run the same kind of market analysis on other Taobao categories, and maybe find more ways to make money!
Part of the source:
import csv
import os
import random
import time
import wordcloud
from selenium import webdriver
from selenium.webdriver.common.by import By
def tongji():
    prices = []
    with open('First ten pages of sales and dollars.csv', 'r', encoding='utf-8', newline='') as f:
        fieldnames = ['price', 'sales', 'Store Location']
        reader = csv.DictReader(f, fieldnames=fieldnames)
        for index, i in enumerate(reader):
            if index != 0:  # skip the header row
                price = float(i['price'].replace('selections', ''))
                prices.append(price)
    DATAS = {'<10': 0, '10~30': 0, '30~50': 0, '50~70': 0, '70~90': 0,
             '90~110': 0, '110~130': 0, '130~150': 0, '150~170': 0, '170~200': 0}
    for price in prices:
        if price < 10:
            DATAS['<10'] += 1
        elif 10 <= price < 30:
            DATAS['10~30'] += 1
        elif 30 <= price < 50:
            DATAS['30~50'] += 1
        elif 50 <= price < 70:
            DATAS['50~70'] += 1
        elif 70 <= price < 90:
            DATAS['70~90'] += 1
        elif 90 <= price < 110:
            DATAS['90~110'] += 1
        elif 110 <= price < 130:
            DATAS['110~130'] += 1
        elif 130 <= price < 150:
            DATAS['130~150'] += 1
        elif 150 <= price < 170:
            DATAS['150~170'] += 1
        elif 170 <= price < 200:
            DATAS['170~200'] += 1
    for k, v in DATAS.items():
        print(k, ':', v)
def get_the_top_10(url):
    top_ten = []
    # Get a proxy IP (zhima1() is a proxy helper defined in one of the other source files)
    ip = zhima1()[2][random.randint(0, 399)]
    # Run a Quicker action (you can leave this out)
    os.system(r'"C:\Program Files\Quicker\QuickerStarter.exe" runaction:5e3abcd2-9271-47b6-8eaf-3e7c8f4935d8')
    options = webdriver.ChromeOptions()
    # Attach to a Chrome instance that is already running with remote debugging enabled
    options.add_experimental_option('debuggerAddress', '127.0.0.1:9222')
    options.add_argument(f'--proxy-server={ip}')
    driver = webdriver.Chrome(options=options)
    # Implicit wait
    driver.implicitly_wait(3)
    # Open the page
    driver.get(url)
    # Click the page element whose text contains 'sales'
    driver.find_element(By.PARTIAL_LINK_TEXT, 'sales').click()
    time.sleep(1)
    # Scroll to the bottom of the page
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(1)
    # Find the result list
    element = driver.find_element(By.ID, 'mainsrp-itemlist').find_element(By.XPATH, './/div[@class="items"]')
    items = element.find_elements(By.XPATH, './/div[@data-category="auctions"]')
    for index, item in enumerate(items):
        if index == 10:
            break
        # Pull the fields out of each listing
        price = item.find_element(By.XPATH, './div[2]/div[1]/div[contains(@class,"price")]').text
        paid_num_data = item.find_element(By.XPATH, './div[2]/div[1]/div[@class="deal-cnt"]').text
        store_location = item.find_element(By.XPATH, './div[2]/div[3]/div[@class="location"]').text
        store_href = item.find_element(By.XPATH, './div[2]/div[@class="row row-2 title"]/a').get_attribute(
            'href').strip()
        # Add the data to a dictionary
        top_ten.append(
            {'price': price,
             'sales': paid_num_data,
             'Store Location': store_location,
             'Shop Link': store_href})
    for i in top_ten:
        print(i)
def get_top_10_comments(url):
    # Truncate the output file
    with open('Top 10 evaluations.txt', 'w+', encoding='utf-8') as f:
        pass
    # ip = ipidea()[1]
    os.system(r'"C:\Program Files\Quicker\QuickerStarter.exe" runaction:5e3abcd2-9271-47b6-8eaf-3e7c8f4935d8')
    options = webdriver.ChromeOptions()
    options.add_experimental_option('debuggerAddress', '127.0.0.1:9222')
    # options.add_argument(f'--proxy-server={ip}')
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(3)
    driver.get(url)
    driver.find_element(By.PARTIAL_LINK_TEXT, 'sales').click()
    time.sleep(1)
    element = driver.find_element(By.ID, 'mainsrp-itemlist').find_element(By.XPATH, './/div[@class="items"]')
    items = element.find_elements(By.XPATH, './/div[@data-category="auctions"]')
    original_handle = driver.current_window_handle
    item_hrefs = []
    # Get the top 10 links first
    for index, item in enumerate(items):
        if index == 10:
            break
        item_hrefs.append(
            item.find_element(By.XPATH, './/div[2]/div[@class="row row-2 title"]/a').get_attribute('href').strip())
    # Crawl the reviews of each of the top 10 items
    for item_href in item_hrefs:
        # Open a new tab
        # item_href = 'https://item.taobao.com/item.htm?id=523351391646&ns=1&abbucket=11#detail'
        driver.execute_script(f'window.open("{item_href}")')
        # Switch to it
        handles = driver.window_handles
        driver.switch_to.window(handles[-1])
        # Scroll down the page until the 'evaluation' link is visible, then click it
        try:
            driver.find_element(By.PARTIAL_LINK_TEXT, 'evaluation').click()
        except Exception as e1:
            try:
                x = driver.find_element(By.PARTIAL_LINK_TEXT, 'evaluation').location_once_scrolled_into_view
                driver.find_element(By.PARTIAL_LINK_TEXT, 'evaluation').click()
            except Exception as e2:
                try:
                    # Scroll down 100px first, in case the 'evaluation' link is not yet on screen
                    driver.execute_script('var q=document.documentElement.scrollTop=100')
                    x = driver.find_element(By.PARTIAL_LINK_TEXT, 'evaluation').location_once_scrolled_into_view
                except Exception as e3:
                    driver.find_element(By.XPATH, '/html/body/div[6]/div/div[3]/div[2]/div/div[2]/ul/li[2]/a').click()
        time.sleep(1)
        try:
            trs = driver.find_elements(By.XPATH, '//div[@class="rate-grid"]/table/tbody/tr')
            for index, tr in enumerate(trs):
                if index == 0:
                    comments = tr.find_element(By.XPATH, './td[1]/div[1]/div/div').text.strip()
                else:
                    try:
                        comments = tr.find_element(By.XPATH,
                                                   './td[1]/div[1]/div[@class="tm-rate-fulltxt"]').text.strip()
                    except Exception as e:
                        comments = tr.find_element(By.XPATH,
                                                   './td[1]/div[1]/div[@class="tm-rate-content"]/div[@class="tm-rate-fulltxt"]').text.strip()
                with open('Top 10 evaluations.txt', 'a+', encoding='utf-8') as f:
                    f.write(comments + '\n')
                print(comments)
        except Exception as e:
            lis = driver.find_elements(By.XPATH, '//div[@class="J_KgRate_MainReviews"]/div[@class="tb-revbd"]/ul/li')
            for li in lis:
                comments = li.find_element(By.XPATH, './div[2]/div/div[1]').text.strip()
                with open('Top 10 evaluations.txt', 'a+', encoding='utf-8') as f:
                    f.write(comments + '\n')
                print(comments)
def get_top_10_comments_wordcloud():
    file = 'Top 10 evaluations.txt'
    f = open(file, encoding='utf-8')
    txt = f.read()
    f.close()
    # Create the word-cloud object and set the properties of the generated image
    w = wordcloud.WordCloud(width=1000,
                            height=700,
                            background_color='white',
                            font_path='msyh.ttc')
    w.generate(txt)
    name = file.replace('.txt', '')
    w.to_file(name + ' word cloud.png')
    os.startfile(name + ' word cloud.png')
def get_10_pages_datas():
    with open('First ten pages of sales and dollars.csv', 'w+', encoding='utf-8', newline='') as f:
        f.write('\ufeff')  # write a BOM so Excel opens the UTF-8 file correctly
        fieldnames = ['price', 'sales', 'Store Location']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
    infos = []
    options = webdriver.ChromeOptions()
    options.add_experimental_option('debuggerAddress', '127.0.0.1:9222')
    # options.add_argument(f'--proxy-server={ip}')
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(3)
    driver.get(url)
    # driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    element = driver.find_element(By.ID, 'mainsrp-itemlist').find_element(By.XPATH, './/div[@class="items"]')
    items = element.find_elements(By.XPATH, './/div[@data-category="auctions"]')
    for index, item in enumerate(items):
        price = item.find_element(By.XPATH, './div[2]/div[1]/div[contains(@class,"price")]').text
        paid_num_data = item.find_element(By.XPATH, './div[2]/div[1]/div[@class="deal-cnt"]').text
        store_location = item.find_element(By.XPATH, './div[2]/div[3]/div[@class="location"]').text
        infos.append(
            {'price': price,
             'sales': paid_num_data,
             'Store Location': store_location})
    try:
        driver.find_element(By.PARTIAL_LINK_TEXT, 'next').click()
    except Exception as e:
        driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        driver.find_element(By.PARTIAL_LINK_TEXT, 'next').click()
    for i in range(9):
        time.sleep(1)
        driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        element = driver.find_element(By.ID, 'mainsrp-itemlist').find_element(By.XPATH, './/div[@class="items"]')
        items = element.find_elements(By.XPATH, './/div[@data-category="auctions"]')
        for index, item in enumerate(items):
            try:
                price = item.find_element(By.XPATH, './div[2]/div[1]/div[contains(@class,"price")]').text
            except Exception:
                time.sleep(1)
                driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
                price = item.find_element(By.XPATH, './div[2]/div[1]/div[contains(@class,"price")]').text
            paid_num_data = item.find_element(By.XPATH, './div[2]/div[1]/div[@class="deal-cnt"]').text
            store_location = item.find_element(By.XPATH, './div[2]/div[3]/div[@class="location"]').text
            infos.append(
                {'price': price,
                 'sales': paid_num_data,
                 'Store Location': store_location})
        try:
            driver.find_element(By.PARTIAL_LINK_TEXT, 'next').click()
        except Exception as e:
            driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            driver.find_element(By.PARTIAL_LINK_TEXT, 'next').click()
    # Finished paginating
    for info in infos:
        print(info)
    with open('First ten pages of sales and dollars.csv', 'a+', encoding='utf-8', newline='') as f:
        fieldnames = ['price', 'sales', 'Store Location']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        for info in infos:
            writer.writerow(info)
if __name__ == '__main__':
    url = 'https://s.taobao.com/search?q=%E5%B0%8F%E9%B1%BC%E9%9B%B6%E9%A3%9F&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.21814703.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=4&ntoffset=4&p4ppushleft=2%2C48&s=0'
    # get_10_pages_datas()
    # tongji()
    # get_the_top_10(url)
    # get_top_10_comments(url)
    get_top_10_comments_wordcloud()
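One thing the code assumes but does not show: every function attaches to a Chrome instance that is already running with remote debugging enabled on 127.0.0.1:9222 (the debuggerAddress option). Chrome therefore has to be launched with that port open beforehand; a rough sketch of doing so from Python, with the Chrome path and profile directory as placeholders you should adjust:
import subprocess
# Launch Chrome with remote debugging so Selenium can attach via debuggerAddress
# (both paths are examples; point them at your own Chrome install and a spare profile dir)
subprocess.Popen([
    r'C:\Program Files\Google\Chrome\Application\chrome.exe',
    '--remote-debugging-port=9222',
    r'--user-data-dir=C:\chrome-debug-profile',
])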
4. My sister
Are you curious what my sister looks like? Sorry, I can't show you her face; I can only show you her from behind:
So that's all for today's sharing. Remember to give this post a like, a favorite, and a share!