Adding robots.txt to a Django project the easy way

Googling a website I own, I found out that Google had indexed parts of my site erroneously. After inspecting the errors in Google Webmaster Tools, I found the culprit in my code. However, after I fixed the error, Google still had the wrong URLs in its index.

To solve this problem, I needed to list the wrongly indexed URLs in a robots.txt file to tell Google not to index those pages. In this post I’ll show a quick way of adding a robots.txt file to your Django application.

1. Create a robots.txt file in your project’s templates folder

Search engines look for a robots.txt file and read and follow the rules defined in it (if they are playing nice). This Wikipedia page and robotstxt.org explain the different rules you can create pretty nicely.

So, first create a plain-text robots.txt file, write your rules and save it in your project’s templates folder.

For example I used this rule:

User-agent: *
Disallow: /en-us/
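
A quick way to sanity-check a rule like this before deploying it is the robots.txt parser in Python’s standard library. The sketch below assumes Python 3, and the two paths are made up for illustration; it only shows how a well-behaved crawler would interpret the rule above.

```python
from urllib.robotparser import RobotFileParser

# The same rule as above, as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /en-us/",
]

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical paths, used only for illustration.
can_crawl_en_us = parser.can_fetch("*", "/en-us/some-page/")
can_crawl_blog = parser.can_fetch("*", "/blog/some-post/")
print(can_crawl_en_us, can_crawl_blog)
```

Anything under /en-us/ is reported as not fetchable, while other paths stay open to crawlers.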

2. Add a robots.txt URL pattern in urls.py

In order for Google to find your robots.txt file at the standard yourwebsite.com/robots.txt location, you need to create a new URL pattern in your urls.py file.

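The easiest way to do this is to import and use the direct_to_template helper as follows (direct_to_template was removed in Django 1.5; newer versions use the class-based TemplateView instead):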

# urls.py
from django.views.generic.simple import direct_to_template

urlpatterns += patterns('',
    url(r'^robots\.txt$', direct_to_template,
        {'template': 'robots.txt', 'mimetype': 'text/plain'}),
)

That’s it!

If implemented correctly, you will see your robots.txt file when you visit yourwebsite.com/robots.txt. You can now tell Google to reindex your site through Webmaster Tools and check whether the wrong URLs have disappeared.

Alternative methods

The above method is well-suited for when you have a relatively simple website. When things get more complex, you might want to check out a more sophisticated solution such as django-robots.

If you have any other methods, or libraries you use, please do let me know in the comments!

How To Make Sure Your Heroku App Doesn’t Shut Down

I have been working with Heroku lately and am really enjoying how easy it makes deploying a Django application. However, as my app is in early development, I don’t really have visitors yet. I noticed that it would take very long to load my website for the first time, while it was fast after that first request. I did some reading and learned that Heroku ‘shuts down’ your server after a period of inactivity, when your site doesn’t get visited for a while.

I thought this was rather annoying and definitely not desirable when you try to gain initial visitors — nobody likes to wait ten seconds for a page to load. In this post I’ll show you how I made a workaround that will keep the server alive.

Using Cron and a little Python script

First make a simple Python script (e.g. access_page.py) that loads a URL:

#!/usr/bin/env python
import urllib2
if __name__ == '__main__':
    f = urllib2.urlopen('http://myapp.herokuapp.com')
    print f.read(10)

Then upload it to a second, non-Heroku server and put it in some folder (e.g. ~/your/folder/). Then add the following line to your crontab by running the crontab -e command.

*/5 * * * * ~/your/folder/access_page.py > $HOME/cron.log 2>&1

This will run the Python script every five minutes, accessing the webpage defined in the script. To let you test whether it works, it writes the first 10 characters of the webpage to the file cron.log in your $HOME directory. When you see it’s working, you can remove the > $HOME/cron.log 2>&1 part.

If the log file says something like access denied, make the script executable with chmod +x access_page.py.
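
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched as follows. This is a variation rather than the script I actually run: the ping function, its opener parameter and the use of a command-line argument for the URL are choices made for this example.

```python
#!/usr/bin/env python3
import sys
from urllib.request import urlopen


def ping(url, opener=urlopen, nbytes=10):
    """Request the URL and return the first nbytes bytes of the response."""
    response = opener(url)
    try:
        return response.read(nbytes)
    finally:
        response.close()


if __name__ == '__main__' and len(sys.argv) > 1:
    # The URL is passed on the command line instead of being hard-coded.
    print(ping(sys.argv[1]))
```

With this version the crontab line would pass the URL as an argument, e.g. ~/your/folder/access_page.py http://myapp.herokuapp.com. The opener parameter is only there so the function can be tried out without touching the network.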

An Apriori Algorithm Analysis Of Steve Jobs’ Tribute Messages

Yesterday, Neil Kodner wrote an interesting post in which he scraped and analysed the tribute messages for Steve Jobs on the Apple website. Some interesting insights were, for example, that people talked about the Mac and iPhone the most, and compared Steve Jobs with great minds like Einstein, Ford and Edison. Also, Neil found that ‘rest in peace’ was the most used trigram in all the messages.

This made me think of applying the apriori algorithm, which I recently implemented for my Web Text Mining class, to the tribute messages. Wikipedia explains the apriori algorithm as follows:

In computer science and data mining, Apriori is a classic algorithm for learning association rules. As is common in association rule mining, given a set of itemsets, the algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets.
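
To make that definition a bit more concrete, here is a minimal sketch of the level-wise search for frequent itemsets, assuming Python 3. It is not the implementation used for this post; the function name and the toy messages are made up for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support.

    transactions: list of sets of items; min_support: fraction in [0, 1].
    """
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Made-up messages, each reduced to a set of words.
messages = [
    {"friends", "family", "condolences"},
    {"friends", "family"},
    {"apple", "steve"},
    {"friends", "family", "steve"},
]
frequent = frequent_itemsets(messages, min_support=0.5)
```

The prune step is what gives the algorithm its name: a larger itemset can only be frequent if all of its smaller subsets are frequent, so most candidates never need to be counted.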

The way my group-mate Rene Dekker and I implemented it, the algorithm extracts association rules for words from a sentence or document (stopwords, punctuation and numbers are removed from the analysis). So, I took the text file with tribute messages and applied the algorithm to see what word combinations are used frequently within one tribute message. I’ll get into the algorithm in a later blog post, but here are the results for a minimum support level of 1% and a minimum confidence level of 85%.

Interpreting the results

Jobs, friends, condolences -> family

This means that the four words ‘Jobs’, ‘friends’, ‘condolences’, & ‘family’ together (but not necessarily next to each other) occur in at least 1% of the tribute messages. Also, when ‘Jobs’, ‘friends’, & ‘condolences’ occur, the word ‘family’ is also present in the message at least 85% of the time.
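
The two thresholds can be written down in a few lines. The sketch below assumes Python 3 and uses four made-up messages, not the real tribute data; the helper names are chosen for this example.

```python
def support(transactions, items):
    """Fraction of transactions that contain all of the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up messages, each reduced to a set of words.
messages = [
    {"jobs", "friends", "condolences", "family"},
    {"jobs", "friends", "condolences", "family"},
    {"jobs", "friends", "condolences"},
    {"steve", "apple"},
]

rule_support = support(messages, {"jobs", "friends", "condolences", "family"})
rule_confidence = confidence(
    messages, {"jobs", "friends", "condolences"}, {"family"}
)
print(rule_support, rule_confidence)
```

In this toy data the four words appear together in two of the four messages (support 0.5), and ‘family’ appears in two of the three messages that contain the other three words (confidence 2/3), so the rule would pass the 1% support threshold but not the 85% confidence threshold.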

friends, Steve -> family
Peace, Jobs -> Steve
Thank, Jobs, us -> Steve
world, friends, condolences -> family
Mr, world -> Jobs
Jobs, computers -> Steve
know, friends -> family
friends, many -> family
Mr, friends -> family
friends -> family
iPad, Jobs, Apple -> Steve
Jobs, created -> Steve
condolences, friends -> family
Mr, friends -> Jobs
friends, lost -> family
go, friends -> family
friends, Apple -> family
never, friends, Steve -> family
people, Jobs, world -> Steve
friends, like -> family
life, friends, Steve -> family
Jobs, friends, condolences -> family
friends, thoughts -> family
friends, always -> family
never, Mr -> Jobs
friends, Steves -> family
friends, Apple, condolences -> family
world, friends, Steve -> family
friends, man -> family
condolences, Steves -> family
Jobs, life, great -> Steve
prayers, friends -> family
Jobs, world, changed -> Steve
human, Jobs -> Steve
friends, Apple, Steve -> family
brought, Jobs -> Steve
friends, condolences, Steve -> family
friends, condolences -> family
friends, us -> family

I’ll elaborate more on the algorithm and different improvements in efficiency and usefulness we made in a later blogpost — please stay tuned.

My Appsterdam Hackathon Experience

Last week I spent three days at Appsterdam’s Open Data Hackathon at the Picnic Festival. The local government of Amsterdam had made different data sources publicly available and created a contest for building apps on top of that data.

As I arrived solo, I luckily found some great teammates to work together with (@danielsteginga, @jjeekkoo & @devoorzitter). After a small brainstorm session we decided on our product and divided the workload. We wanted to do something with the data of public art in Amsterdam and decided to build an app with which you could find the art near you, learn about it, and discuss it with fellow community members.

Then, over the course of three days we hacked together the neat little app LocART. Unfortunately, we didn’t win the contest, but I learned some great lessons along the way.

Design is important

Even though we didn’t win the competition, we did get an honorable mention for best design. Also, several other people who saw the app explicitly mentioned that they really liked the design. I had the feeling that it gave us some immediate likability that we wouldn’t have gotten otherwise.

I personally notice the same in my thinking when I see the design of new startups: it is often the design that gives some initial likability or perceived professionalism.

Be able to relax when others do things differently

Over the course of the hackathon I noticed that sometimes one of my teammates would do something differently than I would have done, codewise or something else. Instead of trying to show why ‘my’ way would be better, it was great and humbling to learn from seeing people do things in other ways.

Developers are wanted

On several occasions during the hackathon, people would come by looking for developers for their ideas. One guy approached us and asked if we were able to develop his idea. When we asked what his idea was, he basically described an application like group.me or WhatsApp. We politely inquired what the added value over these existing apps was, after which he said: “there is something extra, but it’s a secret. I’ll tell you when you start working for me”. After that it went on and on, but I think you get the gist of it.

Having read numerous anecdotes on Hacker News about annoying business guys with ideas looking for developers, I could only be amused by all this. I politely explained to him why this approach would never work, after which I gave him some pointers on finding technical employees or co-founders and on what not to do.

You can do awesome things in a short timespan with great people

I am genuinely impressed by how fast we got something to work and look pretty neat. In just three days we designed a logo, the layout and the workflow. We set up our database and imported and cleaned our data. We built an API on top of our database and went through the jQuery Mobile documentation. We created the front end and integrated all of this in just three days, with a working app as the result. I mean, that’s a lot of work done, and I learned a lot from it.

Even though we didn’t quite finish the app fully, a basic version is already working — check it out on your mobile phone at m.locart.nl.

For those of you not on a mobile device, here are some pics (or check locart.nl):

Setting up my Python development environment on Mac OS X Lion

As I just purchased a new MacBook Air, I wanted to set up a clean Python development environment and start using virtualenv. This post shows how I set up my new Python development environment. The system came with Python 2.7, which is what I’ve been using and wanted to keep using, so there was no need to install another Python version.

First install pip using easy_install.
For some Python modules you need GCC. You can get it by downloading Xcode, which includes it. However, Xcode will take up gigabytes of disk space. So if you don’t plan on using Xcode, you can use the OSX GCC installer made by Kenneth Reitz, which is only around 300 MB. After you have installed this, you can for example install PIL without getting errors.

Next, I installed virtualenv to be able to create different virtual environments to keep my different projects easily separated.
After doing this I immediately installed virtualenvwrapper to make it easier to use virtualenv.
To make sure you can use the virtualenvwrapper commands, edit your .bash_profile file. Type ‘vim .bash_profile’ in your home directory and put in the following (change yourprogrammingdir to the directory of your projects):

export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/yourprogrammingdir
source /usr/local/bin/virtualenvwrapper.sh

Save it and source your .bash_profile by: ‘source .bash_profile’.

For now this is what my Python development environment looks like. I’ll update this post whenever I add things to it.

How To Change User Profiles In Firefox On Mac OS X

As I was working on a Firefox extension, I wanted to make a special user profile for development in Firefox. Many tutorials advise you to type the following into the terminal to get to the user profile screen.

/Applications/Firefox.app/Contents/MacOS/firefox-bin --profilemanager

However, that wasn’t working for me. I couldn’t easily find a fix for it on Google, so here is an overview of other methods to get into the user profile settings for different Mac OS X versions (source). I hope it can help some fellow Googlers.

On Mac OS Snow Leopard (10.6) and newer, type this in the terminal:

/Applications/Firefox.app/Contents/MacOS/firefox-bin -no-remote -P dev &

On Mac OS Leopard (10.5) and older, type this in the terminal (on one line):

arch -arch i386 /Applications/Firefox.app/Contents/MacOS/firefox-bin -no-remote -P dev &

The last one worked for me. So if you are having problems with switching between user profiles in Firefox on Mac OS X, try the commands above. Let me know if you got it to work!

Python vs Java In CS Classes: A Comparison

As I work my way through Learn Python The Hard Way, exercise 12, which elaborates on how to get user input from the command line, made me reminisce about my time programming in Java. In my Java programs I would have used the following code to get the user input from the command line and store it in the variable ‘age’:

import java.util.Scanner;

public class Question {
    public static void main(String[] args) {
        System.out.print("How old are you? ");
        Scanner question = new Scanner(System.in);
        int age = question.nextInt();
    }
}

And this is how I do the same in Python:

age = raw_input("How old are you? ")

Please note that my Java is rusty and there is probably a shorter way to do the same, but look at the difference!
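
One difference between the two snippets is that the Java version parses the input into an int, while raw_input returns a string. A closer equivalent is sketched below, assuming Python 3, where raw_input was renamed to input; the ask_age function and its read parameter are names made up for this example.

```python
def ask_age(read=input):
    """Prompt for an age and return it as an integer."""
    return int(read("How old are you? "))
```

Calling ask_age() prompts on the command line just like the one-liner above, and it is still a fraction of the Java version. The read parameter only exists so the function can be exercised without typing anything.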

As I understand it, more and more universities are using Python in their introductory computer science classes. Even though I had fun in my Programming in Java class, I think this is a good development. Python provides an easier first experience of the world of programming and is less intimidating at first sight for new students.

Personally, I find it a big relief not having to work with the ugly Java syntax and being able to focus on writing awesome code instead of searching for missing semicolons and curly brackets. Of course there will always be the need for Java, C, or other languages that are much faster, but for introductory purposes, I would be happy to see Python used in the classrooms more often!

Finally I’m Learning Python The Hard Way

I remember the first time I had to use Python for my Text Mining class at the University of Amsterdam very well. It was a delight not having to use the awful syntax and other time-wasting conventions that I had learned while programming in Java. After that class, I continued to play around with Python for several spare-time projects. Subsequently, my little experience with Python made me decide to go the Python/Django route instead of the Ruby/Rails one on some websites I am working on.

However, even though I’ve been using Python for almost a year by now, I never got a good, thorough introduction to the language. Along the way, I basically taught myself what I needed to know of the language at that time. For a while now, I have wanted to get a good introduction to the language and read Learn Python The Hard Way, but somehow never got to it.

To do: get a structured introduction to Python

In the coming weeks though, I will be reading and doing the exercises of the freshly purchased second edition of Learn Python The Hard Way by Zed A. Shaw. I am glad he asked $1 for this second edition, giving me, being Dutch, all the more reason to finish it.

As I work my way through the 52 exercises, I will post notes on insights that stand out to me, the things I now finally understand and the parts I don’t grasp at all. Please feel free to contact me if you would like to share your experiences with Learn Python The Hard Way.

Scraping and analysing StarCraft 2 games using Python and BeautifulSoup

For a month now I have been enjoying something for which I am being called crazy: watching commented StarCraft 2 replays. Since When Cheese Fails 101 got me hooked (watch this cheese gone wrong for example), I have been using sc2casts.com to stay updated on the latest games. I just enjoyed the TSL 3 and am digging the zerg playstyle of Spanishiwa. When searching for a database with match results I came across the Team Liquid Database. Since I have wanted to further develop my Python and scraping skills for a while now, I thought it would be fun to scrape all the matches available in the Team Liquid Database and play around with them in SPSS a bit to see if some interesting patterns would emerge. As statistics on the win ratio of each race on different maps are sometimes talked about, I thought I’d find out what they were myself using the data in the Team Liquid Database.

In the following I’ll show you how I created a simple scraper using Python and BeautifulSoup and how I stored/used the data gathered.

Choosing a scraper

I already had some experience with BeautifulSoup, but also read a lot of great things about Scrapy on reddit, so I decided to start using Scrapy. After installing it and reading the documentation I did get it to load the pages I wanted; however, I found myself having trouble using the XPath selectors for this specific case. That’s when I decided Scrapy might be a little bit too heavy for this purpose and that BeautifulSoup was the way to go for now.

Deciding what to scrape

The way the Team Liquid site works is that if you go to http://www.teamliquid.net/tlpd/sc2-international/games you see the last 50 games that were added. So I had to write a function that could easily iterate over all pages with games (>15000 games & >430 pages in total). After some initial problems with the fact that the Team Liquid site uses AJAX to update the results table, with a little help over at Stack Overflow I managed to find a URL that could be looped through. (N.B. this URL changes every time you visit the site, so visit the site first in your browser, then look at the tabulator_id number and use it in your code.)

if __name__ == '__main__':
    # tabulator_id changes per visit; look it up in your browser first
    for page in range(1, 437):
        # parentheses join the adjacent string literals into one URL
        url = ("http://www.teamliquid.net/tlpd/tabulator/update.php?"
               "tabulator_id=2229&tabulator_page=" + str(page) +
               "&tabulator_order_col=1"
               "&tabulator_order_desc=1&tabulator_Search&tabulator_search=")
        parseWithSoup(url)
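
As an alternative to gluing the query string together by hand, the URL can be built from its parameters. The sketch below assumes Python 3; page_url and BASE_URL are names made up for this example, and it leaves out the value-less tabulator_Search flag from the URL above, which urlencode cannot express (whether the server needs that flag is not something I checked).

```python
from urllib.parse import urlencode

BASE_URL = "http://www.teamliquid.net/tlpd/tabulator/update.php"


def page_url(page, tabulator_id=2229):
    """Build the URL for one page of the results table."""
    params = [
        ("tabulator_id", tabulator_id),
        ("tabulator_page", page),
        ("tabulator_order_col", 1),
        ("tabulator_order_desc", 1),
        ("tabulator_search", ""),
    ]
    return BASE_URL + "?" + urlencode(params)


# The first three page URLs, as an example.
urls = [page_url(page) for page in range(1, 4)]
```

Keeping tabulator_id as a parameter makes it easy to plug in the new value each time the site hands out a different one.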

Opening each page using BeautifulSoup

With the URLs as parameter I could start writing some code to open the website and access the information in the table.

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2

# open the page & save the content in a BeautifulSoup object
def parseWithSoup(url):
    print "Reading:", url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)

Selecting the information you want using the DOM

Now the function parseWithSoup() should extract a couple of things from the table with results: the league, the map, the winner’s name and the loser’s name. The races of both the winner and the loser should also be extracted, so that I can for example analyze which race won most often on a specific map.

The first thing you want to do when scraping content is to find an element or attribute in the DOM tree that uniquely identifies the object you want to scrape. If you can uniquely identify the element, you can then use BeautifulSoup to scrape the data within that element. In this case I used Firebug, which easily lets me select the table element; however, by inspecting the source code and using the find function you can also quickly find the element you are after. Luckily the table that contained the information I needed had a unique id: tblt_table.

Using BeautifulSoup to get the information

With the table uniquely defined, I could start writing some code to get the information from the table.

    # find the table with results and save the rows in a list
    table = bs.find(lambda tag: tag.name == 'table'
                    and tag.has_key('id')
                    and tag['id'] == "tblt_table")
    rows = table.findAll(lambda tag: tag.name == 'tr')
    rows.pop(0)  # first row is header

As you can see, the first row of the table contained the header information, so I needed to drop that first, after which BeautifulSoup could be used to get the names of the map and players. This way, I only have to write a function that loops through each row and extracts the data.

For the name of the league, the map and both players it was easy, since all those were inside an <a> tag. The following code looks up all the <a> tags and then appends the content of each tag, read through its .string attribute, to my content list.

    # get the league name, map, winner's and loser's name
    for row in rows:
        content = []

        tags = row.findAll(lambda tag: tag.name == 'a')
        for tagcontent in tags:
            content.append(tagcontent.string)

Secondly, I needed to have the races of both players, which are shown with a small image in front of the name. After inspecting the source code I noticed each image gets a title attribute with the name of the species it represents. So, I created a second loop that looked for all <img> tags in a row and appended the content of the title attribute of each <img> tag.

        # get the race of the winner and loser
        tags2 = row.findAll(lambda tag: tag.name == 'img')
        for tagcontent in tags2:
            content.append(tagcontent['title'])
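
For readers who want to try the row-by-row extraction without installing anything, the same idea can be sketched with only the standard library. This swaps BeautifulSoup for Python 3’s html.parser, so it is not the code I ran; the class name and the sample HTML (including the league, map and player names) are made up for illustration.

```python
from html.parser import HTMLParser


class ResultsTableParser(HTMLParser):
    """Collect link texts and image titles for each row of one table."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_link = False
        self.rows = []        # one dict per <tr>
        self.current = None   # the row currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            self.in_table = True
        elif not self.in_table:
            return
        elif tag == "tr":
            self.current = {"links": [], "titles": []}
        elif self.current is None:
            return
        elif tag == "a":
            self.in_link = True
        elif tag == "img" and "title" in attrs:
            self.current["titles"].append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "a":
            self.in_link = False
        elif tag == "tr" and self.current is not None:
            self.rows.append(self.current)
            self.current = None

    def handle_data(self, data):
        # Only text inside an <a> within a row of the table is kept.
        if self.in_link and self.current is not None:
            self.current["links"].append(data.strip())


# Made-up sample with the same shape as the results table.
sample = """
<table id="tblt_table">
  <tr><th>League</th><th>Map</th><th>Winner</th><th>Loser</th></tr>
  <tr>
    <td><a href="#">Example League</a></td>
    <td><a href="#">Example Map</a></td>
    <td><img title="Zerg" src="z.png"><a href="#">PlayerA</a></td>
    <td><img title="Terran" src="t.png"><a href="#">PlayerB</a></td>
  </tr>
</table>
"""
parser = ResultsTableParser("tblt_table")
parser.feed(sample)
games = parser.rows[1:]  # first row is the header
```

It is a lot more code than the BeautifulSoup version, which is a good illustration of why a scraping library is worth using for anything beyond a quick experiment.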

Storing the information in a file

As I wanted to be able to use the data within SPSS I needed to store the data in a file. I am sure that much more advanced methods exist to do this but I decided to create a file and just add all the information in rows separated with tabs.

        # open the tablefile (already made) and append to it
        out = open("tablefile", 'a')

        # write one tab-separated line per game
        for game in content:
            out.write('\t%s' % game.encode('ascii', 'ignore'))
        out.write('\n')
        out.close()
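
One of those more advanced methods is the csv module, which takes care of delimiters and quoting. The sketch below assumes Python 3; write_games is a name made up for this example, and the rows are invented to show the output format, not real results.

```python
import csv
import io


def write_games(games, out):
    """Write each game (a list of fields) as one tab-separated line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for game in games:
        writer.writerow(game)


# Made-up rows, only to show the output format.
games = [
    ["Example League", "Example Map", "PlayerA", "PlayerB", "Zerg", "Terran"],
    ["Example League", "Other Map", "PlayerC", "PlayerA", "Protoss", "Zerg"],
]
buffer = io.StringIO()
write_games(games, buffer)
print(buffer.getvalue())
```

To write to a real file instead of an in-memory buffer, pass a file opened with open("tablefile", "a", newline="") as the out argument.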

Now the program is ready to run. After iterating over 437 pages, which took a couple of minutes, I finally had my file with the results of over 15000 games played. After cleaning the data manually in SPSS (some matches were 2v2 while I only wanted 1v1 for my analysis, and some matches didn’t have a known map), the data was ready to be analyzed.

Using SPSS to get some interesting insights

I used the descriptives->crosstabs function to see which race would win most often on specific maps given my database. I sorted the results to show the wins of each race on the 10 maps that are played most often in this data set:

[Table: wins of each race on the 10 most-played maps]
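
The same kind of crosstab can also be produced without SPSS. The sketch below assumes Python 3 and counts wins per map and race with collections.Counter; the crosstab function and the four games are made up for illustration and say nothing about the real numbers.

```python
from collections import Counter


def crosstab(games, top_n=10):
    """Return (maps, races, counts) for the top_n most-played maps.

    games: iterable of (map_name, winner_race) pairs, one per game.
    """
    games = list(games)
    per_map = Counter(map_name for map_name, _ in games)
    maps = [map_name for map_name, _ in per_map.most_common(top_n)]
    races = sorted({race for _, race in games})
    counts = Counter(games)
    return maps, races, counts


# Made-up games, only to illustrate the shape of the data.
games = [
    ("Example Map", "Zerg"),
    ("Example Map", "Zerg"),
    ("Example Map", "Terran"),
    ("Other Map", "Protoss"),
]
maps, races, counts = crosstab(games)
for map_name in maps:
    print(map_name, [counts[(map_name, race)] for race in races])
```

Because a Counter returns 0 for combinations it has never seen, the table has no gaps for races that never won on a given map.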

Wrap-up

So, I hope you liked this article and, if you didn’t know already, learned something about scraping webpages with Python and using the results to get some cool insights into large amounts of web data. For now there are still some things on my learning list: I want to start writing my programs in a more pythonic way. Now the program does the job, but the code could be a lot better. Also, the method of saving the data to a file could be improved. Furthermore, instead of using SPSS I should write a script that implements its own version of the crosstabs table.