1. October 2019
Scraping Pull Requests
Sometimes you may want to gather information from a website and create a report of the gathered data. Extracting data from a web page is called web scraping.
I had an assignment to parse Pull Requests from Bitbucket and write the information to Excel. For this we can utilize a couple of great Python libraries.
Selenium is a nifty browser automation library that can navigate to URLs, read information from the page, click links and much more.
Install the libraries we will be using:
{% highlight shell %}
pip install selenium
pip install xlsxwriter
{% endhighlight %}
Basic setup requires importing a webdriver for the browser of your choice. It's good to define the PullRequestScraper as a class and add methods that implement the desired logic.
{% highlight python %}
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import xlsxwriter


class PullRequestScraper:
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path='~/Downloads/chromedriver_win32/chromedriver.exe')
        self.wait = WebDriverWait(self.driver, 10)


if __name__ == '__main__':
    scraper = PullRequestScraper()
{% endhighlight %}
We want to navigate to a page in Bitbucket that lists open Pull Requests.
{% highlight python %}
def get_open(self):
    self.driver.get('https://')  # the rest of the URL is omitted here
{% endhighlight %}
Bitbucket will redirect to the login page, so we should implement a method for handling the login process.
The first assert makes sure we are on the correct page.
{% highlight python %}
def login(self):
    assert "Log in - Bitbucket" in self.driver.title

    myusername = ""
    mypassword = ""

    username = self.driver.find_element_by_id("j_username")
    username.clear()
    username.send_keys(myusername)

    password = self.driver.find_element_by_name("j_password")
    password.clear()
    password.send_keys(mypassword)

    self.driver.find_element_by_id("submit").click()
{% endhighlight %}
Don't forget to call your methods in the main block.
{% highlight python %}
scraper.get_open()
scraper.login()
{% endhighlight %}
Next we need to parse the Pull Requests from the list. XPath is a query language that lets you pinpoint an element in the DOM tree with a special path syntax.
{% highlight python %}
def parse_pull_requests(self):
    pull_requests_table = self.driver.find_element(By.XPATH, '//*[@id="pull-requests-content"]/div/div/table')
    pull_requests_titles = pull_requests_table.find_elements(By.XPATH, '//*[@class="pull-request-title"]')

    for title in pull_requests_titles:
        title_text = title.get_attribute('innerHTML')
{% endhighlight %}
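To get a feel for the XPath syntax itself, here is a small standalone sketch using the standard library's ElementTree, which supports a limited XPath subset. The HTML fragment and the titles in it are made up for illustration and only approximate Bitbucket's markup:

```python
import xml.etree.ElementTree as ET

# Made-up fragment approximating the pull request table's structure
html = """
<div id="pull-requests-content">
  <table>
    <tr><td><span class="pull-request-title">Fix login bug</span></td></tr>
    <tr><td><span class="pull-request-title">Add Excel export</span></td></tr>
  </table>
</div>
"""

root = ET.fromstring(html)
# .//*[@class="..."] selects every element with that class, anywhere in the tree
titles = [el.text for el in root.findall('.//*[@class="pull-request-title"]')]
print(titles)
```

The `//*[@class=...]` pattern in the scraper works the same way: `*` matches any tag name, and the `[@attr="value"]` predicate filters by attribute.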
Add the call for this method as well.
{% highlight python %}
scraper.parse_pull_requests()
{% endhighlight %}
OK, so now we get a list of titles. The last part is to write it to an Excel spreadsheet. There are multiple libraries for this, but let's use XlsxWriter. Create an instance of a workbook and name it.
{% highlight python %}
workbook = xlsxwriter.Workbook('pull_requests.xlsx')
worksheet = workbook.add_worksheet('Open')
{% endhighlight %}
Now we can pass the worksheet to our parse method.
{% highlight python %}
def parse_pull_requests(self, worksheet):
{% endhighlight %}
Calling the worksheet's write method inside the loop inserts each title into the spreadsheet. Provide the row index as the variable i.
{% highlight python %}
i = 0
for title in pull_requests_titles:
    # row, column, item
    worksheet.write(i, 0, title.get_attribute('innerHTML'))
    i += 1
{% endhighlight %}
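As a side note, the manual counter can be replaced with Python's built-in enumerate, which yields the index and the item together. A minimal sketch with a made-up list of titles standing in for the scraped elements, and the worksheet call swapped for collecting tuples so it runs without a workbook:

```python
# Hypothetical titles standing in for the scraped pull requests
pull_request_titles = ['Fix login bug', 'Add Excel export']

rows = []
for i, title in enumerate(pull_request_titles):
    # In the real script this would be: worksheet.write(i, 0, title)
    rows.append((i, 0, title))
print(rows)
```

This removes the `i = 0` / `i += 1` bookkeeping and makes the row index impossible to forget to increment.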
In the end you should close the workbook.
{% highlight python %}
workbook.close()
{% endhighlight %}
Also implement a quit() method and call it to stop the scraper.

{% highlight python %}
def quit(self):
    self.driver.quit()
{% endhighlight %}
{% highlight python %}
scraper.quit()
{% endhighlight %}
Finally, save the script as PullRequestScraper.py and launch it with:
{% highlight shell %}
python PullRequestScraper.py
{% endhighlight %}
You can find the full code in this Gist.