Preface:

This Thursday's post brings you a Taobao product data crawler, and as usual we will visualize the crawled data. Without further ado, let's begin.

Development tools

Python version: 3.6.4

Related modules:

DecryptLogin module;

Pyecharts module;

And some modules that come with Python.

Environment setup

Install Python, add it to your PATH environment variable, and use pip to install the required modules.
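Assuming pip is available on your PATH, the install step might look like this (package names as published on PyPI; DecryptLogin is the author's own open-source library):

```shell
pip install DecryptLogin pyecharts
```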

Data crawling

Since this is a small crawler case built on simulated login, the first step is naturally to log in to Taobao. As before, we use our open-source DecryptLogin library, which gets it done in three lines:

# Simulated login to Taobao
@staticmethod
def login():
    lg = login.Login()
    infos_return, session = lg.taobao()
    return session

Incidentally, I'm often asked to add cookie persistence to the DecryptLogin library. You can easily do it yourself with a few extra lines of code:

if os.path.isfile('session.pkl'):
    self.session = pickle.load(open('session.pkl', 'rb'))
else:
    self.session = TBGoodsCrawler.login()
    f = open('session.pkl', 'wb')
    pickle.dump(self.session, f)
    f.close()
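The caching logic above can also be wrapped into a small reusable helper. Here is a minimal sketch; `load_or_create_session` and `login_func` are illustrative names of my own, not part of DecryptLogin (`login_func` stands in for `TBGoodsCrawler.login`):

```python
import os
import pickle

def load_or_create_session(cache_path, login_func):
    """Load a pickled session from disk if present; otherwise log in and cache it."""
    if os.path.isfile(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    session = login_func()
    with open(cache_path, 'wb') as f:
        pickle.dump(session, f)
    return session
```

On the second run the cached session is returned directly and `login_func` is never called.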

I don't really want to build this feature into the library itself; I'd rather add some other crawling-related features later, but more on that another time. Okay, back to the point. Next, open the web version of Taobao and capture some requests. Press F12 to open the developer tools, then type anything into Taobao's product search bar, like this:

A global search in the developer tools for keywords such as "search" turns up links like this:

Let’s see what it returns:

Looks right. If you can't find this interface API at first, try clicking the next-page button in the upper right corner:

That will definitely capture the request. A quick test shows that although the interface seems to take many parameters, only two actually have to be submitted:

q: the product name to search for
s: the item offset of the current page
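Under these assumptions, building the required parameters for a given page is straightforward: the web search shows 44 items per page, so the offset for page n is n × 44. A small sketch (the helper name `search_params` is my own):

```python
def search_params(goods_name, page_index, page_size=44):
    """Build the two required query parameters for Taobao's search API.

    'q' is the search keyword; 's' is the item offset of the requested page
    (Taobao's web search shows 44 items per page).
    """
    return {'q': goods_name, 's': str(page_index * page_size)}
```

For example, page 0 starts at item 0 and page 2 starts at item 88.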

With this interface and our test results, we can now happily implement the Taobao product data crawler. The main code is as follows:

"External call"
def run(self) :
    search_url = 'https://s.taobao.com/search?'
    while True:
        goods_name = input('Please enter the name of the commodity information you want to capture:')
        offset = 0
        page_size = 44
        goods_infos_dict = {}
        page_interval = random.randint(1.5)
        page_pointer = 0
        while True:
            params = {
                        'q': goods_name,
                        'ajax': 'true'.'ie': 'utf8'.'s': str(offset)
                    }
            response = self.session.get(search_url, params=params)
            if(response.status_code ! =200) :break
            response_json = response.json()
            all_items = response_json.get('mods', {}).get('itemlist', {}).get('data', {}).get('auctions'[]),if len(all_items) == 0:
                break
            for item in all_items:
                if not item['category'] :continue
                goods_infos_dict.update({len(goods_infos_dict)+1: 
                                            {
                                                'shope_name': item.get('nick'.' '),
                                                'title': item.get('raw_title'.' '),
                                                'pic_url': item.get('pic_url'.' '),
                                                'detail_url': item.get('detail_url'.' '),
                                                'price': item.get('view_price'.' '),
                                                'location': item.get('item_loc'.' '),
                                                'fee': item.get('view_fee'.' '),
                                                'num_comments': item.get('comment_count'.' '),
                                                'num_sells': item.get('view_sales'.' ')}})print(goods_infos_dict)
            self.__save(goods_infos_dict, goods_name+'.pkl')
            offset += page_size
            if offset // page_size > 100:
                break
            page_pointer += 1
            if page_pointer == page_interval:
                time.sleep(random.randint(30.60)+random.random()*10)
                page_interval = random.randint(1.5)
                page_pointer = 0
            else:
                time.sleep(random.random()+2)
        print('[INFO]: select * from '%s', select * from' %s' ' % (goods_name, len(goods_infos_dict)))
Copy the code
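The `__save` helper called above is not shown in this excerpt; given that the data is later reloaded from `.pkl` files, a plausible minimal version would simply pickle the dict to disk. This is my own sketch, not the author's actual implementation:

```python
import pickle

def save_pickle(data, filepath):
    """Serialize the crawled data dict to disk with pickle."""
    with open(filepath, 'wb') as f:
        pickle.dump(data, f)

def load_pickle(filepath):
    """Read back a previously saved data dict."""
    with open(filepath, 'rb') as f:
        return pickle.load(f)
```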

Data visualization

Here we visualize the milk tea data we crawled. First, let's look at the nationwide distribution of milk tea merchants on Taobao:
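Counting merchants per region from the crawled data is a one-liner with `collections.Counter`. This sketch assumes the `location` field (Taobao's `item_loc`) looks like "Province City", so the first whitespace-separated token is the province:

```python
from collections import Counter

def merchants_by_region(goods_infos):
    """Count merchants per region from the crawled goods dict.

    Assumes each 'location' string looks like 'Province City'
    (Taobao's item_loc field), so the first token is the province.
    """
    return Counter(info['location'].split()[0]
                   for info in goods_infos.values() if info['location'])
```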

Unexpectedly, Guangdong has the most milk tea shops. T_T

Let’s take a look at the top 10 sales of milk tea shops on Taobao:

And the top 10 milk tea shops with the number of comments on Taobao:
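Both top-10 rankings boil down to sorting the `goods_infos_dict` built by the crawler. A minimal sketch (`top_n_shops` is my own helper name; it assumes the chosen field has already been normalized to a number, and keeps the crawler's original `shope_name` key spelling):

```python
def top_n_shops(goods_infos, key='num_sells', n=10):
    """Return (shop_name, value) pairs sorted descending by a numeric field."""
    pairs = [(info['shope_name'], info[key]) for info in goods_infos.values()]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:n]
```

Passing `key='num_comments'` gives the comment-count ranking instead.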

Next, the proportion of products in these stores that do and do not charge a shipping fee:
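Computing that proportion from the crawled `fee` field (Taobao's `view_fee`) is simple counting. This sketch assumes a fee of "0.00" means free shipping:

```python
def free_shipping_ratio(goods_infos):
    """Fraction of products whose shipping fee is zero.

    Assumes 'fee' holds the view_fee string, where '0.00' means free shipping.
    """
    total = len(goods_infos)
    if total == 0:
        return 0.0
    free = sum(1 for info in goods_infos.values() if float(info['fee']) == 0)
    return free / total
```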

Finally, take a look at the price range of milk tea related products:
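A price-range chart needs the prices grouped into buckets first. Here is one way to do it (the bucket edges below are illustrative, not the ones used in the article's chart):

```python
def price_buckets(prices, edges=(10, 20, 50)):
    """Count prices into ranges [0, 10), [10, 20), [20, 50), [50, +inf)."""
    counts = [0] * (len(edges) + 1)
    for price in prices:
        for i, edge in enumerate(edges):
            if price < edge:
                counts[i] += 1
                break
        else:
            # price is >= every edge, so it falls into the last open-ended bucket
            counts[-1] += 1
    return counts
```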

That's the end of this article. Thanks for reading, and follow me for a daily entry in the Python simulated-login series; the next article will cover a JD.com product data crawler.

To thank you readers, I'd like to share some of my favorite recent programming resources as a way of giving back, in the hope that they help you.

The resources mainly include:

① Over 2000 Python ebooks (both mainstream and classic books should be available)

②Python Standard Library (Most Complete Chinese version)

③ project source code (forty or fifty interesting and classic practice projects and source code)

④ Python videos on basics, crawlers, web development, and big data analysis (suitable for beginners)

⑤ A Roadmap for Learning Python

All done~ For the complete source code, the resources above, and the Python beginners' learning exchange group: 594356095