Recently I had to crawl some pages from a site that required logging in first. It wasn’t as easy as I expected, so I decided to write a tutorial for it.

In this tutorial, we’ll crawl a list of items from our Bitbucket account.

The code for the tutorial can be found on my GitHub.

We will follow these steps:

  1. Extract the details required for login
  2. Perform site login
  3. Crawl the data you need

For this tutorial, I used the following packages (available in requirements.txt):

requests
lxml

Step 1: Research the site

Open the login page

Open the following page: “bitbucket.org/account/sig…”. You will see a page similar to the one shown below (log out first if you are already logged in).

Go through the details we need to extract for login

In this section, we will create a dictionary to hold the details of performing the login:

1. Right-click the Username or Email field and choose Inspect Element. We will use the value of the input field whose “name” attribute is “username”. “username” will be the key, and our username/email address will be the corresponding value (on other sites this key might be “email”, “user_name”, “login”, etc.).

2. Right-click the Password field and choose Inspect Element. In the script we will use the value of the input field whose “name” attribute is “password”. “password” will be the dictionary key, and the password we enter will be the corresponding value (on other sites this key might be “userPassword”, “loginPassword”, “pwd”, etc.).

3. In the page source, look for a hidden input tag named “csrfmiddlewaretoken”. “csrfmiddlewaretoken” will be the key, and the hidden input’s value will be the corresponding value, e.g. “Vy00PE3Ra6aISwKBrPn72SFml00IcUV8”. (On other sites this may be a hidden input named “csrfToken”, “authenticationToken”, etc.)

We’ll end up with a dictionary that looks something like this:

payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": "<CSRF_TOKEN>"
}

Remember, this is a specific example for this site. While this login form is simple, other sites may require us to examine the browser’s request log and find the relevant keys and values to submit during the login step.
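For forms with several hidden fields, it can be easier to collect every hidden input automatically instead of reading them off one by one. Here is a minimal sketch using Python’s built-in html.parser (the tutorial itself uses lxml; the sample HTML below is invented for illustration):

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collects the name/value pairs of all hidden <input> tags in a form."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value", "")

sample_form = """
<form method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

collector = HiddenInputCollector()
collector.feed(sample_form)

# Start the payload from the hidden fields, then add our own credentials.
payload = dict(collector.fields)
payload["username"] = "<USER NAME>"
payload["password"] = "<PASSWORD>"
print(payload)
```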

Step 2: Log in to the website

For this script, we just need to import the following:

import requests
from lxml import html

First, we need to create a Session object. This object persists cookies across all our requests, keeping the login session alive.

session_requests = requests.session()
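The point of the Session object is that cookies set by one response (such as the session cookie Bitbucket sends back after login) are automatically attached to later requests. A quick sketch of that behavior, setting a cookie by hand instead of over the network:

```python
import requests

session_requests = requests.session()

# Cookies stored on the session are sent with every subsequent request
# to the matching domain; here we plant one manually to illustrate.
session_requests.cookies.set("sessionid", "abc123", domain="bitbucket.org")

print(session_requests.cookies.get("sessionid"))  # abc123
```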

Second, we want to extract the CSRF token from the login page, to be used when logging in. In this example, we use lxml and XPath to extract it. We could also use regular expressions or some other method.

login_url = "https://bitbucket.org/account/signin/?next=/"
result = session_requests.get(login_url)
 
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
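As noted above, a regular expression can do the same job as XPath here. A rough sketch (the HTML snippet is invented for illustration, and real pages may order the attributes differently, so treat this pattern as an assumption):

```python
import re

# Minimal stand-in for the login page's HTML.
html_text = '<input type="hidden" name="csrfmiddlewaretoken" value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8">'

# Capture the value attribute of the csrfmiddlewaretoken input.
match = re.search(
    r'name=["\']csrfmiddlewaretoken["\']\s+value=["\']([^"\']+)["\']',
    html_text,
)
authenticity_token = match.group(1) if match else None
print(authenticity_token)  # Vy00PE3Ra6aISwKBrPn72SFml00IcUV8
```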

More information about XPath and lxml can be found in their documentation.

Next, we perform the login. In this phase, we send a POST request to the login URL, using the payload created in the previous step as the data. We also pass headers with the request, setting the Referer key to the login URL itself.

result = session_requests.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)
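A 200 status alone does not prove the login worked; many sites return 200 with an error message in the body. One option is a small helper (hypothetical, not part of the tutorial’s code) that combines the status code with a check that we were redirected away from the login page:

```python
LOGIN_URL = "https://bitbucket.org/account/signin/?next=/"

def login_succeeded(status_code, final_url, login_url=LOGIN_URL):
    """Heuristic: the request succeeded and we ended up off the login page."""
    return status_code == 200 and not final_url.startswith(login_url)

# After the POST above you might call:
#   login_succeeded(result.status_code, result.url)
print(login_succeeded(200, "https://bitbucket.org/dashboard/overview"))  # True
print(login_succeeded(200, LOGIN_URL))  # False
```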

Step 3: Crawl the content

Now that we’ve logged in successfully, we’ll perform the actual crawl from the Bitbucket Dashboard page.

url = 'https://bitbucket.org/dashboard/overview'
result = session_requests.get(
    url, 
    headers = dict(referer = url)
)

To test the above, we crawl the list of repositories from the Bitbucket dashboard page. We use XPath again to find the target elements, strip the newlines and surrounding whitespace from the text, and print out the results. If everything ran OK, the output should be the list of buckets/projects in your Bitbucket account.

tree = html.fromstring(result.content)
bucket_elems = tree.findall(".//span[@class='repo-name']")
bucket_names = [bucket.text_content().replace("\n", "").strip() for bucket in bucket_elems]

print(bucket_names)
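The cleanup step in that list comprehension (drop embedded newlines, then trim the surrounding whitespace) can be checked in isolation. A small sketch with made-up repository names:

```python
def clean_name(raw):
    # Remove embedded newlines, then strip leading/trailing whitespace.
    return raw.replace("\n", "").strip()

# Made-up examples of the raw text lxml's text_content() might return.
raw_names = ["\n  my-first-repo\n  ", "tutorial-code\n"]
bucket_names = [clean_name(name) for name in raw_names]
print(bucket_names)  # ['my-first-repo', 'tutorial-code']
```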

You can also verify each request by examining its returned status code. It won’t always tell you whether the login phase succeeded, but it can serve as a validation indicator.

For example:

result.ok  # tells us whether the last request was successful
result.status_code  # returns the status code of the last request

That’s it.