
October 1, 2019

Scraping Pull Requests

Sometimes you may want to gather information from a website and create a report of the gathered data. Extracting data from a web page is called web scraping.

I had an assignment to parse Pull Requests from Bitbucket and write the information to Excel. For this we can utilize a couple of great Python libraries.

Selenium is a nifty browser automation library that can navigate to URLs, read information from the page, click links and much more.

Install the libraries we will be using.

{% highlight shell %}
pip install selenium
pip install xlsxwriter
{% endhighlight %}

Basic setup requires importing a webdriver, which can be specified for a browser of your choice. It's good to define the PullRequestScraper as a class and add methods that implement the desired logic.

{% highlight python %}
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import xlsxwriter

class PullRequestScraper:
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path='~/Downloads/chromedriver_win32/chromedriver.exe')
        self.wait = WebDriverWait(self.driver, 10)

if __name__ == '__main__':
    scraper = PullRequestScraper()
{% endhighlight %}

We want to navigate to a page in Bitbucket that lists open Pull Requests.

{% highlight python %}
def get_open(self):
    self.driver.get('https:///stash/projects//repos//pull-requests?state=OPEN')
{% endhighlight %}

Bitbucket will redirect to the login page, so we should implement a method for handling the login process.

The first assert makes sure we are on the correct page.

{% highlight python %}
def login(self):
    assert "Log in - Bitbucket" in self.driver.title

    myusername = ""
    mypassword = ""

    username = self.driver.find_element_by_id("j_username")
    username.clear()
    username.send_keys(myusername)

    password = self.driver.find_element_by_name("j_password")
    password.clear()
    password.send_keys(mypassword)

    self.driver.find_element_by_id("submit").click()

{% endhighlight %}

Don't forget to call your methods in the main block.

{% highlight python %}
scraper.get_open()
scraper.login()
{% endhighlight %}

Next we need to parse the Pull Requests from the list. XPath is a query syntax that lets you pinpoint an element in the DOM tree.

{% highlight python %}
def parse_pull_requests(self):
    pull_requests_table = self.driver.find_element(By.XPATH, '//*[@id="pull-requests-content"]/div/div/table')
    pull_requests_titles = pull_requests_table.find_elements(By.XPATH, '//*[@class="pull-request-title"]')

    for title in pull_requests_titles:
        title = title.get_attribute('innerHTML')

{% endhighlight %}
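If you want to sanity-check an XPath expression before running it in the browser, Python's standard library xml.etree.ElementTree supports a small subset of the same syntax. The markup below is a made-up stand-in for Bitbucket's pull request table, purely for illustration:

{% highlight python %}
import xml.etree.ElementTree as ET

# Made-up markup mimicking the structure we query with Selenium
html = """
<div id="pull-requests-content">
  <table>
    <tr><td><a class="pull-request-title">Fix login redirect</a></td></tr>
    <tr><td><a class="pull-request-title">Update README</a></td></tr>
  </table>
</div>
"""

root = ET.fromstring(html)
# ElementTree supports the //*[@attribute='value'] predicate form
titles = [a.text for a in root.findall(".//*[@class='pull-request-title']")]
print(titles)  # ['Fix login redirect', 'Update README']
{% endhighlight %}

Note that ElementTree only handles well-formed XML and a limited XPath subset, so this is just a quick way to test a selector, not a replacement for Selenium.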

Add the call for this method also.

{% highlight python %}
scraper.parse_pull_requests()
{% endhighlight %}

OK, so now we get a list of titles. The last part is to write it to an Excel spreadsheet. There are multiple libraries for this, but let's use XlsxWriter. Create an instance of a workbook and name it.

{% highlight python %}
workbook = xlsxwriter.Workbook('pull_requests.xlsx')
worksheet = workbook.add_worksheet("Open")
{% endhighlight %}

Now we can pass the worksheet to our parse method.

{% highlight python %}
def parse_pull_requests(self, worksheet):
{% endhighlight %}

Calling the write method of the worksheet inserts each title into the spreadsheet as we loop through the list. The variable i provides the row index.

{% highlight python %}
i = 0
for title in pull_requests_titles:
    # row, column, item
    worksheet.write(i, 0, title.get_attribute('innerHTML'))
    i += 1
{% endhighlight %}
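As a side note, the counter pattern above can also be written with enumerate(), which yields the row index for us. A small stand-alone sketch, using made-up titles in place of the scraped list:

{% highlight python %}
# Made-up titles standing in for the scraped pull request list
titles = ["Fix login redirect", "Update README"]

# enumerate() yields (index, item) pairs, so no manual counter is needed;
# each tuple matches the (row, column, item) arguments worksheet.write() takes
rows = [(row, 0, title) for row, title in enumerate(titles)]
print(rows)  # [(0, 0, 'Fix login redirect'), (1, 0, 'Update README')]
{% endhighlight %}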

In the end you should close the workbook.

{% highlight python %}
workbook.close()
{% endhighlight %}

Also implement the quit() method and call it to stop the scraper.

{% highlight python %}
def quit(self):
    self.driver.quit()
{% endhighlight %}

{% highlight python %}
scraper.quit()
{% endhighlight %}

Finally, save the script as PullRequestScraper.py and launch it with the command:

{% highlight shell %}
python PullRequestScraper.py
{% endhighlight %}

Find the full code in this Gist.