For a month now I have been enjoying something that people call me crazy for: watching commented StarCraft 2 replays. When Cheese Fails 101 got me hooked (watch this cheese gone wrong, for example), and I now use sc2casts.com to stay updated on the latest games. I just enjoyed the TSL 3 and am digging the Zerg playstyle of Spanishiwa. While searching for a database with match results I came across the Team Liquid Database. Since I had wanted to develop my Python and scraping skills further for a while, I thought it would be fun to scrape all the matches available there and play around with them in SPSS to see whether any interesting patterns would emerge. Win ratios for each race on different maps come up in discussion from time to time, so I thought I’d work them out myself using the data in the Team Liquid Database.
In the following I’ll show you how I created a simple scraper using Python and BeautifulSoup and how I stored/used the data gathered.
Choosing a scraper
I already had some experience with BeautifulSoup, but I had also read a lot of great things about Scrapy on Reddit, so I decided to start with Scrapy. After installing it and reading the documentation I did get it to load the pages I wanted, but I had trouble using the XPath selectors for this specific case. That’s when I decided Scrapy might be a bit too heavy for this purpose and that BeautifulSoup was the way to go for now.
Deciding what to scrape
The Team Liquid site works like this: if you go to http://www.teamliquid.net/tlpd/sc2-international/games you see the 50 most recently added games. So I had to write a function that could easily iterate over all pages with games (>15000 games and >430 pages in total). After some initial problems caused by the Team Liquid site using AJAX to update the results table, and with a little help over at Stack Overflow, I managed to find a URL that could be looped through. (N.B. this URL changes every time you visit the site, so visit the site first in your browser, then look up the tabulator_id number and use it in your code.)
if __name__ == '__main__':
    for page in range(1,437):
        url = ("http://www.teamliquid.net/tlpd/tabulator/update.php?"
               "tabulator_id=2229&tabulator_page=" + str(page) + "&tabulator_order_col=1")
        parseWithSoup(url)
Opening each page using BeautifulSoup
With the URL as a parameter I could start writing some code to open the page and access the information in the table.
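# Python 2 with BeautifulSoup 3 (assumed from the print statement and findAll)
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

def parseWithSoup(url):
    # open the page & save the content in a BeautifulSoup object
    print "Reading:", url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)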
Selecting the information you want using the DOM
Now the function parseWithSoup() should extract a couple of things from the results table: the league, the map, the winner’s name and the loser’s name. The races of both the winner and the loser should be extracted as well, so that I can analyze, for example, which race won most often on a specific map.
The first thing you want to do when scraping content is to find an element or attribute in the DOM tree that uniquely identifies the object you want to scrape. If you can uniquely identify the element, you can then use BeautifulSoup to scrape the data within that element. In this case I used Firebug, which easily lets me select the table element, but by inspecting the source code and using the find function you can also quickly find the element you are after. Luckily the table that contained the information I needed had a unique id: tblt_table.
Using BeautifulSoup to get the information
With the table uniquely defined, I could start writing some code to get the information from the table.
    # find the table with results and save the rows in a list
    table = bs.find(lambda tag: tag.name=='table' and tag.get('id')=='tblt_table')
    rows = table.findAll(lambda tag: tag.name=='tr')
    rows.pop(0) #first row is header
As you can see, the first row of the table contained the header information, so I needed to drop that first, after which BeautifulSoup could be used to get the names of the map and players. This way, I only have to write a function that loops through each row and extracts the data.
For the name of the league, the map and both players it was easy, since all of those were inside an <a> tag. The following code looks up all the <a> tags and then appends the content of each tag, read through its .string attribute, to my content list.
    #get the leaguename, map, winner's and loser's name
    for row in rows:
        content = []
        tags = row.findAll(lambda tag: tag.name=='a')
        for tagcontent in tags:
            content.append(tagcontent.string)
Secondly, I needed to have the races of both players, which are shown with a small image in front of the name. After inspecting the source code I noticed each image gets a title attribute with the name of the species it represents. So, I created a second loop that looked for all <img> tags in a row and appended the content of the title attribute of each <img> tag.
        # get the race of the winner and loser
        tags2 = row.findAll(lambda tag: tag.name=='img')
        for tagcontent in tags2:
            content.append(tagcontent['title'])
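To try out the row logic without hitting the site, the same idea (find the table by its id, skip the header row, collect link texts and image titles per row) can be sketched on a small inline sample. Note that this sketch swaps BeautifulSoup for the html.parser module from the Python 3 standard library so that it runs without extra installs. The class name ResultsTableParser and the sample markup, league, map and player names are all made up for illustration; the real table markup may differ.

```python
from html.parser import HTMLParser


class ResultsTableParser(HTMLParser):
    """Collect link texts and image titles for each row of table#tblt_table."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_link = False
        self.row = None   # fields of the row currently being read
        self.rows = []    # one list of fields per <tr>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == "tblt_table":
            self.in_table = True
        elif not self.in_table:
            return
        elif tag == "tr":
            self.row = {"links": [], "races": []}
        elif self.row is not None and tag == "a":
            self.in_link = True
        elif self.row is not None and tag == "img" and attrs.get("title"):
            self.row["races"].append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "tr" and self.row is not None:
            # Same field order as in the post: link texts first, then races.
            self.rows.append(self.row["links"] + self.row["races"])
            self.row = None
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        if self.in_link and self.row is not None and data.strip():
            self.row["links"].append(data.strip())


# Hypothetical, already lower-cased sample markup.
SAMPLE = """
<table id="tblt_table">
<tr><th>league</th><th>map</th><th>winner</th><th>loser</th></tr>
<tr>
<td><a href="#">tsl 3</a></td>
<td><a href="#">metalopolis</a></td>
<td><img title="zerg"><a href="#">playera</a></td>
<td><img title="terran"><a href="#">playerb</a></td>
</tr>
</table>
"""

parser = ResultsTableParser()
parser.feed(SAMPLE)
games = parser.rows[1:]  # the first row is the header
print(games)
```

The header row contains no links or images, so it comes out as an empty list and is dropped by the slice, much like rows.pop(0) above.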
Storing the information in a file
As I wanted to be able to use the data within SPSS I needed to store the data in a file. I am sure that much more advanced methods exist to do this but I decided to create a file and just add all the information in rows separated with tabs.
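        # open the table file (already made; 'games.txt' is a placeholder name) and append to it
        out = open('games.txt', 'a')
        for game in content:
            out.write('\t%s' % game.encode('ascii','ignore') )
        out.write('\n')  # end the row
        out.close()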
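One of the more advanced methods hinted at above is the csv module from the standard library, which takes care of separators and line endings. The sketch below is Python 3 and is not the code used for this post; the function name append_games and the sample row are invented for illustration. Unlike the loop above it does not put a tab in front of the first field.

```python
import csv
import io


def append_games(out, games):
    """Write each game as one tab-separated line to the open text file `out`."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for game in games:
        # A missing value (None) becomes an empty field.
        writer.writerow(["" if field is None else field for field in game])


# Demo with an in-memory file; a real run would pass a file opened
# with open(path, "a", newline="") instead.
buffer = io.StringIO()
append_games(buffer, [["tsl 3", "metalopolis", "playera", "playerb", "zerg", "terran"]])
print(repr(buffer.getvalue()))
```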
Now the program is ready to run. After iterating over 437 pages, which took a couple of minutes, I finally had my file with the results of over 15000 games played. After cleaning the data manually in SPSS (some matches were 2v2 and I only wanted 1v1 for my analysis, and some matches didn’t have a known map), the data was ready to be analyzed.
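The manual cleaning could probably be scripted as well. The sketch below (Python 3) rests on an assumption that I have not verified against the real data: a clean 1v1 row yields exactly six non-empty fields, a 2v2 row yields more because of the extra player links and race images, and a row without a known map yields fewer. The function name and the sample rows are made up for illustration.

```python
EXPECTED_FIELDS = 6  # league, map, winner, loser, winner race, loser race


def is_clean_1v1(game):
    """True when a row has exactly the six expected, non-empty fields.

    Assumption: 2v2 rows and rows without a map have a different field count.
    """
    return len(game) == EXPECTED_FIELDS and all(
        field is not None and field.strip() for field in game
    )


# Hypothetical rows: one clean, one without a map, one 2v2.
games = [
    ["league x", "map y", "player a", "player b", "zerg", "terran"],
    ["league x", "player a", "player b", "zerg", "terran"],
    ["league x", "map y", "a1", "a2", "b1", "b2", "zerg", "zerg", "terran", "terran"],
]
clean = [game for game in games if is_clean_1v1(game)]
print(len(clean))
```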
Using SPSS to get some interesting insights
I used the descriptives->crosstabs function to see which race would win most often on specific maps given my database. I sorted the results to show the wins of each race on the 10 maps that are played most often in this data set:
So, I hope you liked this article and, if you didn’t know already, learned something about scraping web pages with Python and using the result to get some cool insights from large amounts of web data. For now there are still some things on my learning list: I want to start writing my programs in a more Pythonic way. The program does the job now, but the code could be a lot better. Also, the method of saving the data in a .csv file could be improved. Furthermore, instead of using SPSS I should write a script that implements its own version of the crosstabs table.
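As a first step towards that last item, a crosstab of map against winning race can be sketched with collections.Counter. This is a minimal Python 3 sketch, not a replacement for what SPSS does; the function name crosstab and the sample data are invented for illustration, and each game is assumed to be reduced to a (map, winner race) pair.

```python
from collections import Counter


def crosstab(games, top_n=10):
    """Count wins per race on the top_n most played maps.

    `games` is an iterable of (map_name, winner_race) pairs.
    Returns a list of (map_name, {race: wins}) ordered by games played.
    """
    games = [tuple(game) for game in games]
    games_per_map = Counter(map_name for map_name, _ in games)
    wins = Counter(games)
    races = sorted({race for _, race in games})
    table = []
    for map_name, _count in games_per_map.most_common(top_n):
        table.append((map_name, {race: wins[(map_name, race)] for race in races}))
    return table


# Made-up sample data, not results from the scraped file.
sample = [
    ("map a", "zerg"), ("map a", "terran"), ("map a", "zerg"),
    ("map b", "protoss"),
]
for map_name, counts in crosstab(sample):
    print(map_name, counts)
```

Dividing each count by the number of games on that map would turn the counts into win ratios.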