Digging into GSoC Data: Scraping 101 (Part 1)

tl;dr Code for scraping GSoC data from the Google Developers website for the years 2005-08 and storing it as csv files can be found here: GSoC data digging on Github
Two csv files will be created: org_numbers.csv (containing information about the number of projects per org) and project_details.csv (containing project, mentor and student details)

The backstory

Over the last few months, I have played around with data quite a bit. Some of it I did while participating in Analytics Vidhya’s amazing data science competitions, and the rest of it while working with curated data sets (like this). This week, I have also started working on my first serious Kaggle competition. What I had never really done previously was obtain data for myself from the wild. It was about time!

[Image: downloading csv files by hand. Caption: "This gets boring after some time!"]

I was struggling to decide on an API to work with when an e-mail last month on the GSoC students’ Google group changed my mind. For this edition of GSoC, Google had decided to discontinue the use of its iconic Melange website and switch to a new one. This particular mail was simply a reminder of that fact, but it provided me with a moment of clarity. Why not work with the Summer of Code data instead? A decision had been taken.

Now, the Melange website conveniently provided me with the organization and project data for the years 2009-15 in the form of csv files. But the data for the previous years (2005-08) was only available as archives on the Google Developers website. Well, it had to be scraped then. Simple enough, right? Turns out, it wasn’t as simple and quick as I’d expected it to be (a few minutes, that is).

Before I move on to the issues I encountered while scraping, let me lay out my methodology and tools.

Tools of the trade

I used the requests library for retrieving the webpage with the data and parsed it using the html module from the library lxml. The csv library took care of writing the data as a csv file while re was used for regular expressions.

Methodology

Check out the code here: scrape_gsoc.py on github

The url for the pages to be scraped was of the form: https://developers.google.com/open-source/gsoc/<year>. So I set a base url and iterated over a list of the year values, with each iteration working with the archive page for a particular year. Inside each iteration, the webpage was first fetched using requests and then parsed into a nice tree structure using the html module.

import csv
import re

import requests
from lxml import html

base_url = "https://developers.google.com/open-source/gsoc/"
year_list = [2005, 2006, 2007, 2008]
...
for year in year_list:
    target_url = base_url + str(year)
    target_page = requests.get(target_url)
    tree = html.fromstring(target_page.content)

We can traverse this html tree using XPath, a query language for selecting nodes in an xml document which works well with html too.
For instance, to obtain a list of the participating organizations from the tree:

org_list = tree.xpath('//section[@class="toc"]/ul/li/a/text()')

Well, it will make more sense after looking at the relevant portion of the webpage’s source code.

[Image: webpage source code showing the organization list. Caption: "First match!"]


We are trying to access the text nested inside multiple nodes. The xpath statement above specifies the path to reach it, and the tree.xpath method returns a list of all the text that satisfies that path; in this case, the organization names. In a similar fashion, the organizations’ GSoC ids can be obtained from the href attributes in the same source code.
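To illustrate extracting both the names and the ids, here is a minimal sketch. The sample markup below is my own invented stand-in mirroring the structure described above, and I assume the href values are anchor links of the form "#<org_id>":

```python
from lxml import html

# Invented sample markup mimicking the archive page's table of contents.
sample = """
<section class="toc">
  <ul>
    <li><a href="#asf">Apache Software Foundation</a></li>
    <li><a href="#gnome">GNOME</a></li>
  </ul>
</section>
"""

tree = html.fromstring(sample)
# Text nodes of the anchors give the organization names.
org_list = tree.xpath('//section[@class="toc"]/ul/li/a/text()')
# The @href axis gives the anchor targets instead of the text.
hrefs = tree.xpath('//section[@class="toc"]/ul/li/a/@href')
# Strip the leading '#' to get the ids used by each org's own section.
org_ids = [h.lstrip('#') for h in hrefs]
print(org_list)  # ['Apache Software Foundation', 'GNOME']
print(org_ids)   # ['asf', 'gnome']
```

The same tree.xpath call thus serves both queries; only the last step of the path changes from text() to @href.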

Once we have the organization ids for a particular year, we can iterate over them to obtain more specific details. The portion of source code containing information about a particular organization and its projects looks like this:

[Image: webpage source code for an organization's section, listing its projects]

So to obtain a list of projects for a particular org, the code will be:

projects = tree.xpath('//section[@id="%s"]/ul/li/h4/text()' % org_ids[num])

(In this case, org_ids[num] will equal "asf".)

Then, we can easily obtain the number of projects for a particular org. Once we have that, we can write a row to one of our csv files (org_numbers.csv).
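The count itself is just the length of each org's project list. A minimal sketch of how a tally list (like the org_tally used in the csv-writing snippet) could be built, with invented per-org project lists standing in for the XPath results:

```python
# Hypothetical per-org project lists, standing in for the XPath query results.
projects_by_org = {
    "asf": ["Project A", "Project B", "Project C"],
    "gnome": ["Project D"],
}

org_ids = ["asf", "gnome"]
# One tally entry per org, aligned with org_ids by index.
org_tally = [len(projects_by_org[org_id]) for org_id in org_ids]
print(org_tally)  # [3, 1]
```

Keeping the tally aligned by index with org_ids and org_list is what lets the writerow call below pick all three values with the same num.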

g = open('org_numbers.csv', 'w')
orgNumWriter = csv.writer(g, lineterminator='\n')
orgNumWriter.writerow(["year", "org_id", "org_name", "num_selections"])
...
## year | org_id | org_name | num_selections
orgNumWriter.writerow([str(year), org_ids[num], org_list[num], str(org_tally[num])])

So far, so good. It is at this point that we encounter the issues I mentioned at the beginning of this post, specifically when we try to extract the student and mentor information and write it to a file. In brief, my troubles had to do with the presence of certain non-standard characters in the names of projects, mentors and students: characters which the default ASCII encoding isn’t equipped to handle. And thus, I finally got acquainted with character encoding. More on it in the next post!
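To see the kind of failure involved, here is a minimal sketch (the name below is an invented example, not from the actual data). ASCII covers only code points 0-127, so accented characters cannot be encoded with it, while UTF-8 handles the full Unicode range:

```python
name = "José Martínez"  # hypothetical student name with non-ASCII characters

# ASCII only covers code points 0-127, so encoding this name fails:
try:
    name.encode('ascii')
except UnicodeEncodeError as err:
    print("ASCII can't handle it:", err)

# UTF-8 can encode any Unicode code point:
encoded = name.encode('utf-8')
print(encoded)  # b'Jos\xc3\xa9 Mart\xc3\xadnez'
```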
