Preface
Why use a proxy
When crawling the web, we often run into sites with anti-crawling measures, or we crawl so intensively and so fast that we put too much load on the other party's server. If you keep crawling with the same IP address, that IP is likely to be blocked. A crawler therefore can hardly avoid the IP problem: it needs many IP addresses to switch between in order to keep collecting data normally.
Common solutions
The usual approach is an IP proxy pool: requests go out through proxy IP addresses from the pool, which hides our real IP address and gets around anti-crawling measures. Incidentally, here is a recommended open-source GitHub project, github.com/jhao104/pro… : it builds its own proxy IP pool by collecting proxy IP addresses from several commonly used free proxy websites.
In this article, instead of an IP proxy pool, we use Tor (onion routing) to access target addresses anonymously.
Introduction
What is Tor (onion routing)
Tor (The Onion Router) is an implementation of second-generation onion routing. It is designed to protect users from traffic filtering and sniffing analysis. Tor communicates over an overlay network made up of onion routers, which provides anonymous outbound connections and anonymous hidden services.
How the proxy approach works
- Run Tor
- Use Tor as a proxy for Selenium in Python
- Make a request to the target website
- Repeat steps 2 and 3, asking Tor for a new circuit each time
Implementation code
from stem import Signal
from stem.control import Controller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup


# Switch IP through Tor
def switchIP():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)


# Get a browser that uses the proxy
def get_browser(PROXY=None):
    chrome_options = webdriver.ChromeOptions()
    if PROXY is not None:
        chrome_options.add_argument('--proxy-server=socks5://{0}'.format(PROXY))  # proxy
    chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # do not load images, for speed
    chrome_options.add_argument('--headless')  # run the browser without a visible window
    executable_path = '/Users/fewave/project/python/demo/chromedriver'  # path to the ChromeDriver binary
    # Selenium 3 style call; Selenium 4 takes the driver path through a Service object instead
    return webdriver.Chrome(executable_path=executable_path, options=chrome_options)


def main():
    for x in range(5):
        switchIP()
        browser = get_browser('127.0.0.1:9050')
        browser.get('https://cip.cc')
        html = browser.page_source
        soup = BeautifulSoup(html, 'html.parser')
        print('====== the %d request =======' % (x + 1))
        print(soup.find_all('pre'))
        browser.quit()


if __name__ == '__main__':
    main()
Preparation
Before running the code, there are a few more things to do:
- Install Tor. My local machine is a Mac, so I install it directly with Homebrew:
brew install tor
After the installation is complete, start the Tor service:
brew services start tor
- Download the browser driver. I use Chrome locally, so I download the driver for the matching version from sites.google.com/a/chromium… (the driver version must match that of the local browser).
- Install the Python dependencies by running:
pip install selenium stem bs4
- Update the torrc file and restart Tor so that requests can be made to the Tor controller. On a Mac, the torrc.sample file is in /usr/local/etc/tor. Rename torrc.sample to torrc with the mv command:
mv /usr/local/etc/tor/torrc.sample /usr/local/etc/tor/torrc
Then uncomment the following two lines in the torrc file:
ControlPort 9051
CookieAuthentication 1
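Before restarting Tor, it can help to confirm that both lines really are uncommented. The following is a minimal sketch that uses only the standard library; the helper names `active_directives` and `missing_directives` are made up for this example, and the parsing is deliberately simple (one directive per line, `#` starts a comment).

```python
# Directives this article asks us to uncomment in torrc.
REQUIRED = ("ControlPort", "CookieAuthentication")


def active_directives(torrc_text):
    """Return {directive: value} for every non-comment line of a torrc file."""
    directives = {}
    for raw_line in torrc_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)  # directive name, then the rest of the line
        directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return directives


def missing_directives(torrc_text, required=REQUIRED):
    """List the required directives that are still commented out or absent."""
    active = active_directives(torrc_text)
    return [name for name in required if name not in active]


sample = "#ControlPort 9051\nCookieAuthentication 1\n"
print(missing_directives(sample))  # ControlPort is still commented out here
```

To check the real file, pass in its contents, for example `missing_directives(open('/usr/local/etc/tor/torrc').read())`; an empty list means both directives are active.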
Restart Tor
brew services restart tor
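After the restart, a quick way to see whether Tor is listening is to try a plain TCP connection to both ports. This is only a sketch: it assumes the SOCKS port is 9050 (the address used in the code in this article) and the control port is 9051, and the function names are invented for the example. A successful connection shows that something is listening, not that it is definitely Tor.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_tor_ports(host="127.0.0.1", socks_port=9050, control_port=9051):
    """Report whether the assumed SOCKS and control ports accept connections."""
    return {
        "socks": port_is_open(host, socks_port),
        "control": port_is_open(host, control_port),
    }


if __name__ == "__main__":
    print(check_tor_ports())
```

If `control` comes back False, the torrc change has probably not taken effect yet.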
Code walkthrough
Switch IP through Tor
def switchIP():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)
This function lets us switch IP. It sends a signal (Signal.NEWNYM) to the Tor control port, which tells Tor that we need a new circuit to route traffic through. This usually gives us a new exit node, which means our traffic appears to come from a different IP.
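One thing to keep in mind is that Tor rate-limits NEWNYM, so signals sent in quick succession may not produce a new circuit each time. Stem's Controller exposes `get_newnym_wait()` for this. The sketch below waits before signalling; the helper name `signal_new_identity` is invented here, and as far as I know stem tracks the wait per Controller object, so this only helps when the same controller is kept open between calls rather than recreated each time.

```python
import time


def signal_new_identity(controller, newnym_signal, sleep=time.sleep):
    """Wait out the NEWNYM rate limit, then ask Tor for a new circuit.

    `controller` is expected to be an authenticated stem Controller and
    `newnym_signal` is stem's Signal.NEWNYM. Both are passed in so the helper
    can be exercised without a running Tor. Returns the seconds waited.
    """
    wait = controller.get_newnym_wait()  # seconds until NEWNYM will be honored
    if wait > 0:
        sleep(wait)
    controller.signal(newnym_signal)
    return wait


# Usage with stem (needs a running Tor with ControlPort 9051 enabled):
#
#   from stem import Signal
#   from stem.control import Controller
#
#   with Controller.from_port(port=9051) as controller:
#       controller.authenticate()
#       for _ in range(5):
#           signal_new_identity(controller, Signal.NEWNYM)
#           ...  # make a request through the proxy here
```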
Get a browser that uses the proxy
def get_browser(PROXY=None):
    chrome_options = webdriver.ChromeOptions()
    if PROXY is not None:
        chrome_options.add_argument('--proxy-server=socks5://{0}'.format(PROXY))  # proxy
    chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # do not load images, for speed
    chrome_options.add_argument('--headless')  # run the browser without a visible window
    executable_path = '/Users/fewave/project/python/demo/chromedriver'  # path to the ChromeDriver binary
    # Selenium 3 style call; Selenium 4 takes the driver path through a Service object instead
    return webdriver.Chrome(executable_path=executable_path, options=chrome_options)
This function sets up Selenium WebDriver to use the Chrome browser in headless mode, with Tor as the proxy that routes our requests. This ensures that all requests made through Selenium WebDriver go through Tor.
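Because a mistyped flag or proxy address only shows up once Chrome is running, one option is to build the argument list in a separate function that can be inspected without launching a browser. This is a sketch, not part of the original code: `chrome_arguments` and its `headless` and `load_images` parameters are hypothetical names, and the flags are the same three used above.

```python
def chrome_arguments(proxy=None, headless=True, load_images=False):
    """Build the Chrome command-line flags for a browser proxied through Tor.

    `proxy` is a "host:port" string such as "127.0.0.1:9050".
    """
    args = []
    if proxy is not None:
        host, sep, port = proxy.rpartition(":")
        if not sep or not host or not port.isdigit() or not 0 < int(port) < 65536:
            raise ValueError("proxy must look like 'host:port', got %r" % proxy)
        args.append("--proxy-server=socks5://%s:%s" % (host, port))
    if not load_images:
        args.append("--blink-settings=imagesEnabled=false")
    if headless:
        args.append("--headless")
    return args


print(chrome_arguments("127.0.0.1:9050"))

# Feeding the list to Selenium (requires selenium to be installed):
#
#   chrome_options = webdriver.ChromeOptions()
#   for arg in chrome_arguments("127.0.0.1:9050"):
#       chrome_options.add_argument(arg)
```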
def main():
    print('Start program')
    for x in range(5):
        switchIP()
        browser = get_browser('127.0.0.1:9050')
        browser.get('https://cip.cc')
        html = browser.page_source
        soup = BeautifulSoup(html, 'html.parser')
        print('====== the %d request =======' % (x + 1))
        print(soup.find_all('pre'))
        browser.quit()
This last piece of code simply sends a request to cip.cc through Selenium WebDriver so that we can check which IP the request came from, and prints the IP reported for the proxy. Stem is a Python controller library for Tor that can be used to script against or build on Tor processes using Tor's control protocol.
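To check programmatically that the exit IP actually changes between requests, the address can be pulled out of the page text. The sketch below assumes the response contains a line of the form `IP: <address>`; the function names are invented for this example, and the sample strings are shortened stand-ins for real page text.

```python
import ipaddress
import re

# Matches lines such as "IP: 23.129.64.187" (whitespace around the colon is allowed).
IP_LINE = re.compile(r"IP\s*:\s*(\S+)")


def extract_ip(page_text):
    """Return the first valid IP address found on an 'IP: ...' line, or None."""
    for candidate in IP_LINE.findall(page_text):
        try:
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            continue  # not a real address, keep looking
    return None


def distinct_exit_ips(pages):
    """Collect the distinct IPs seen across several responses, in first-seen order."""
    seen = []
    for text in pages:
        ip = extract_ip(text)
        if ip is not None and ip not in seen:
            seen.append(ip)
    return seen


pages = [
    "IP: 23.129.64.187\nURL: http://www.cip.cc/23.129.64.187",
    "IP: 109.70.100.20\nURL: http://www.cip.cc/109.70.100.20",
]
print(distinct_exit_ips(pages))
```

In `main()`, the text to pass in would be the content of the `pre` element that BeautifulSoup finds, for example `soup.find('pre').get_text()`.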
Execution result
====== the 1 request =======
IP: 23.129.64.187
Address: Seattle, Washington, USA
Carrier: Emeraldonion.org
Data 2: US
Data 3: US
URL: http://www.cip.cc/23.129.64.187
====== the 2 request =======
IP: 109.70.100.20
Address: Austria Austria
Data 2: Austria
Data 3: Austria
URL: http://www.cip.cc/109.70.100.20
====== the 3 request =======
IP: 185.220.101.5
Address: Amsterdam, North Holland, Netherlands
Carrier: torservers.net
Data 2: Netherlands
Data 3: Germany
URL: http://www.cip.cc/185.220.101.5
====== the 4 request =======
IP: 23.129.64.194
Address: Seattle, Washington, USA
Carrier: Emeraldonion.org
Data 2: US
Data 3: US
URL: http://www.cip.cc/23.129.64.194
====== the 5 request =======
IP: 162.244.81.196
Address: New York, New York
Carrier: serverroom.net
Data 2: US
Data 3: United States, New York, New York
URL: http://www.cip.cc/162.244.81.196
As the output shows, our real IP has been hidden.
Conclusion
The code above hides our real IP by launching a browser through its driver and routing the browser's traffic through Tor. However, starting the driver is relatively slow, and restarting it frequently greatly reduces crawling efficiency. When using this method, keep the number of browser driver restarts to a minimum.
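One way to act on this advice is to fetch several pages per browser instance and only switch IP and restart the browser every N pages. The sketch below takes the browser factory and the IP-switching function as arguments, so it does not depend on Selenium or Tor directly; `crawl` and `restart_every` are hypothetical names, and the trade-off is that the exit IP changes less often.

```python
def crawl(urls, make_browser, switch_ip, restart_every=10):
    """Fetch each URL, switching IP and restarting the browser every N pages.

    make_browser: zero-argument callable returning an object with
                  .get(url), .page_source and .quit().
    switch_ip:    zero-argument callable that asks Tor for a new circuit.
    """
    if restart_every < 1:
        raise ValueError("restart_every must be at least 1")
    pages = []
    browser = None
    try:
        for index, url in enumerate(urls):
            if index % restart_every == 0:
                if browser is not None:
                    browser.quit()  # close the old browser before rotating
                switch_ip()
                browser = make_browser()
            browser.get(url)
            pages.append(browser.page_source)
    finally:
        if browser is not None:
            browser.quit()  # always release the last browser
    return pages


# Usage with the functions from this article:
#
#   pages = crawl(urls, lambda: get_browser('127.0.0.1:9050'), switchIP,
#                 restart_every=10)
```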
PS:
Selenium: an automated testing tool. It supports a variety of mainstream GUI browsers, including Chrome, Safari, and Firefox. Once the Selenium driver for one of these browsers is installed, web interface testing is straightforward; in other words, Selenium works through these browser drivers.
Beautiful Soup: provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses documents and hands users the data they want to extract; because it is simple, a complete application takes very little code.
Stem: a Python controller library for Tor that can be used to script against or build on Tor processes using Tor's control protocol.
If you liked this article, you can follow the WeChat official account: lifelong kindergarten