Fix for RE error: illegal byte sequence on Mac OS X

I ran into the following problem running sed on my Mac OS X installation:

sed: RE error: illegal byte sequence

To fix it, you need to set two variables in your .bash_profile or .zshrc file by adding two lines.

Step 1: Open your .bash_profile or .zshrc

Go to Terminal and type: vim ~/.bash_profile or vim ~/.zshrc. Then press “i” to enter insert mode.

Step 2: Add these lines

export LC_CTYPE=C 
export LANG=C

Step 3: Save and restart your terminal

To save your edits, press Esc, type “:wq” and press Enter. Now restart your Terminal application for the changes to take effect.

This should fix your problems with the “sed: RE error: illegal byte sequence” error on Mac OS X.
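Why the fix works can be illustrated outside sed. The sketch below is only an analogy, using Python's UTF-8 decoder in place of sed's locale handling: in a UTF-8 locale a stray byte is an "illegal byte sequence", while in the C locale every byte counts as one character. The sample bytes are made up for the illustration.

```python
# Illustration only: Python's UTF-8 decoder stands in for sed's locale handling.
# Hypothetical input: 0xE9 is a Latin-1 "e acute" and is not valid UTF-8 here.
data = b"caf\xe9 latte"

try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # Comparable to sed stopping with "illegal byte sequence"
    utf8_ok = False

# Treating every byte as one character (roughly what LC_CTYPE=C does) always succeeds
as_single_bytes = data.decode("latin-1")

print(utf8_ok)
print(len(as_single_bytes))
```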

Setting up a global .gitignore for Python projects on Mac OS X

If you are working on a lot of Python projects and are using Git version control, it is very convenient to set up a global .gitignore file that ignores often occurring file types like .pyc and .egg that you generally don’t want to have in your repository.

Ignoring common Python files not wanted in your repository

First make a file .gitignore in your home directory (touch ~/.gitignore) and include the following:

.DS_Store

*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

Then make sure to add this .gitignore file to Git with the following command:

git config --global core.excludesfile '~/.gitignore'

If all went right, now files like .pyc and .egg are not included in Git repositories!
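To get a feel for what a pattern such as *.py[cod] matches, here is a small sketch using Python's fnmatch module. It is an approximation under the assumption that simple filename patterns behave the same way in both; Git's own matching has extra rules (for directories and negation, for example) that fnmatch does not cover. The file names are made up.

```python
from fnmatch import fnmatchcase

# A few of the patterns from the global .gitignore above
patterns = ["*.py[cod]", "*.egg", "*.so", ".DS_Store"]


def is_ignored(filename):
    """Return True if the bare filename matches any of the patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# Hypothetical file names, for illustration only
for name in ["views.pyc", "views.py", "mypackage.egg", ".DS_Store"]:
    print(name, is_ignored(name))
```

To check a real file against Git itself, git check-ignore -v <path> reports which pattern (if any) matched.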

How to filter data on attribute values in RapidMiner

If you are using RapidMiner and want to filter your data based on certain values of an attribute or variable, this is the way to do it:

1. Select the “Filter Examples” operator.

2. Put it between your data and the end result node.

3. Choose the condition class “attribute_value_filter” and fill in the parameter string field like: <attribute_name>=<value> (e.g. Region=USA).

You can also stack values with “|” or “OR” statements, like Region=USA|Europe.

Finally, please note that the filter is case sensitive and that spaces are allowed in the values.
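If it helps to see the effect in code, the sketch below mimics what a parameter string like Region=USA|Europe does to a data set. It is not RapidMiner's implementation; the function name is borrowed from the condition class, and the rows are hypothetical.

```python
# Hypothetical example rows; in RapidMiner these would be the examples in your data set
rows = [
    {"Region": "USA", "Sales": 120},
    {"Region": "Europe", "Sales": 95},
    {"Region": "Asia", "Sales": 80},
    {"Region": "usa", "Sales": 40},
]


def attribute_value_filter(rows, parameter_string):
    """Keep rows whose attribute equals one of the '|'-separated values."""
    attribute, _, values = parameter_string.partition("=")
    allowed = set(values.split("|"))
    return [row for row in rows if row[attribute] in allowed]


kept = attribute_value_filter(rows, "Region=USA|Europe")
print([row["Region"] for row in kept])
```

Note that the lowercase “usa” row is dropped, which mirrors the case sensitivity mentioned above.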

That’s it! Hope this tutorial helped you to filter data in RapidMiner based on variable values.

How to fix a Django DatabaseError that follows an IntegrityError with PostgreSQL

When I deployed my Django app from my local machine using SQLite to my Heroku instance using PostgreSQL, I came across a little problem with the way my code handled Django IntegrityErrors.

Locally, I could catch an IntegrityError fine and just continue with other database writes. On my PostgreSQL machine, however, I got a django.db.utils.DatabaseError after the first IntegrityError.

Turns out, PostgreSQL works a little differently from SQLite in these cases: after an IntegrityError it marks the current transaction as aborted and rejects further queries until that transaction ends. Calling connection.close() in the except block of the IntegrityError lets Django start over with a fresh connection and go on with other database writes.

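from django.db import IntegrityError, connection

try:
    pass  # your database writes go here
except IntegrityError:
    # Drop the connection so Django opens a fresh one for later writes
    connection.close()
except Exception:
    print("some other error")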
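For comparison, this is the SQLite behaviour I was relying on locally, sketched with Python's built-in sqlite3 module rather than Django's ORM. The table and e-mail addresses are made up; the point is only that the writes after the failed insert still go through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT UNIQUE)")

inserted = []
# Hypothetical data: the second address violates the UNIQUE constraint
for email in ["a@example.com", "a@example.com", "b@example.com"]:
    try:
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        inserted.append(email)
    except sqlite3.IntegrityError:
        # SQLite rejects only the offending statement; later writes still succeed
        pass
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)
```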

Adding robots.txt to a Django project the easy way

Googling a website I own, I found out Google had indexed parts of my site erroneously. After inspecting the errors in Google Webmaster Tools, I found the culprit in my code. However, after I cleaned up the error, Google still had the wrong URLs in its index.

To solve this problem, I needed to declare the wrongly indexed URLs in a robots.txt file to tell Google not to index those pages. In this post I’ll show a quick way of adding a robots.txt file to your Django application.

1. Create a robots.txt file in your project’s templates folder

Search engines look for a robots.txt file and read and follow the rules defined in it (if they are playing nice). The Wikipedia page on robots.txt and robotstxt.org explain the different rules you can create pretty nicely.

So, first create a plaintext robots.txt file, write your rules and save it in your project’s templates folder.

For example I used this rule:

User-agent: *
Disallow: /en-us/
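If you want to check what a rule like this blocks before deploying it, Python's standard library ships a robots.txt parser. The sketch below feeds it the rule above; the domain and the two paths are placeholders, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /en-us/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder domain and paths, for illustration only
en_us_allowed = parser.can_fetch("Googlebot", "http://yourwebsite.com/en-us/old-page/")
blog_allowed = parser.can_fetch("Googlebot", "http://yourwebsite.com/blog/")
print(en_us_allowed, blog_allowed)
```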

2. Add a robots.txt URL pattern in urls.py

In order for Google to find your robots.txt file at the standard yourwebsite.com/robots.txt location, you need to create a new URL pattern in your urls.py file.

The easiest way to do this is to import and use the direct_to_template helper (available in Django 1.4 and earlier; later versions removed it in favour of TemplateView) as follows:

# urls.py
from django.views.generic.simple import direct_to_template

urlpatterns += patterns('',
    url(r'^robots\.txt$', direct_to_template,
    {'template': 'robots.txt', 'mimetype': 'text/plain'}),
)

That’s it!

If implemented correctly, you will see your robots.txt file when you visit yourwebsite.com/robots.txt. You can now tell Google to reindex your site through Webmaster Tools and check whether the wrong URLs have disappeared.

Alternative methods

The above method is well-suited for when you have a relatively simple website. When things get more complex, you might want to check out a more sophisticated solution such as django-robots.

If you have any other methods, or libraries you use, please do let me know in the comments!

Adding cronjobs to a Django project with Heroku Scheduler

Recently, Heroku switched from offering a cronjobs add-on to offering the Heroku Scheduler add-on. The Scheduler add-on lets you schedule tasks like you are used to with cronjobs. You can run a task every 10 minutes, every hour or every day.

In this post I’ll talk about how to create custom management commands in Django and how you can let Heroku Scheduler perform these commands at set times.

1. Create a management folder with a commands folder in it

Django looks for custom management commands in the management/commands/ subfolders of your app.

Django picks up every file in the commands folder whose name doesn’t start with an underscore. The name of each such .py file, minus the extension, is also the name of the command when you run it.

2. Create a Command class that inherits from BaseCommand

In your command_name.py file, create a class Command that inherits from the BaseCommand class. Also don’t forget to import the BaseCommand class.

The action that you want your custom Django command to perform should go in a handle(self, *args, **options) method.
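A minimal sketch of what such a command file might look like, assuming the handle(self, *args, **options) signature described here. The message text is made up, and the fallback stub exists only so the sketch also runs on a machine without Django installed. Newer Django versions declare their arguments through an add_arguments method instead of reading *args.

```python
try:
    from django.core.management.base import BaseCommand
except ImportError:
    # Fallback stub so this sketch also runs where Django is not installed
    import sys

    class BaseCommand(object):
        stdout = sys.stdout


class Command(BaseCommand):
    help = "Prints every argument passed on the command line"

    def handle(self, *args, **options):
        # Loop over the positional arguments given after the command name
        for arg in args:
            self.stdout.write("Got argument: %s\n" % arg)
```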

As the above code shows, you can capture command line arguments for your custom command. If you don’t need those, you can skip the for loop that loops over the args.

3. Test if your custom Django command works

Before committing your custom command and pushing it to Heroku, you should test your custom Django command locally.

To do this run:

python manage.py your_command_name

(if your command has additional arguments you can append them at the end)

If your custom command works like you want it to, commit the changes to your repo and push them to Heroku.

Now you should check if your new Django management command also works on your Heroku app.

For this, try running the following command:

heroku run python manage.py your_command_name

If all went well, this should successfully run your custom management command on Heroku.

Now let’s make sure this command is run at scheduled intervals.

4. Add the Heroku Scheduler add-on to your Heroku app

You can add Heroku Scheduler to your app in two ways:

A) Go to your Heroku Dashboard and select the app you like, then go to add-ons and choose the Heroku Scheduler add-on.

B) Instead of using the Heroku website, you can use the following heroku command to add the Scheduler add-on:

$ heroku addons:add scheduler:standard

5. Configure the Heroku Scheduler add-on

Now you need to configure the Heroku Scheduler add-on to run your newly created Django custom management command. To open the Heroku Scheduler, use the following command in your shell: heroku addons:open scheduler (the Heroku Scheduler site will open). Alternatively, you can go to your Heroku admin site > your app and then select its Heroku Scheduler page.

Here you can schedule your custom Django command tasks.

In the leftmost box you now type the following:

python manage.py your_command_name

Next, you should choose at what interval (10 minutes, 1 hour, 1 day) your custom command should be performed. Additionally, if you need to, you can set at what time the next run of the command should take place. When you are done configuring, press Save.

That’s it! You have created a custom Django management command and scheduled it for repeated execution using the Heroku Scheduler add-on.

Please do note that Heroku will spin up a worker dyno to complete the scheduled tasks. This dyno is prorated per second so you should keep an eye on how much time the task takes if you want to keep costs in check.

If you have any questions, please feel free to post them below.

Fixing the problem with rel=’next’ and Yoast

Since I moved this blog across domains, I have been using the WordPress SEO by Yoast plugin instead of the infamous All In One SEO Pack plugin I previously used. So far I’ve been happy with it and it is pretty neat.

However, today I noticed an unfamiliar line of code when looking at the source of this page. On the homepage the Yoast plugin had created a <link rel=”next”> tag:

<link rel="next" href="http://guidovanoorschot.nl/page/2/">

I wasn’t familiar with the rel=’next’ attribute, so I looked up the details in the HTML specification. Looking at the code again, I also found that the canonical URL of this page was the same as the URL:

<link rel="canonical" href="http://guidovanoorschot.nl/page/2">

At first I thought this would be a problem as Google might regard the content on this page 2, which was the same as individual posts, as a duplicate. However the plugin also added the following line: <meta name="robots" content="noindex,follow"/>, which solves that problem by not allowing Google to index that page.

Fixing the rel=”next” problem in Yoast

However, I found that for some people with different themes the rel=’next’ was causing problems. If this is the case, you can fix the problem by commenting out one line in the plugin. Go to line 241 in wp-content/plugins/wordpress-seo/frontend/class-frontend.php and change:

$this->adjacent_rel_links();

to

//$this->adjacent_rel_links();

This way the link won’t show up anymore in your homepage’s code. Do keep in mind that when upgrading the plugin you should check to make sure it doesn’t pop up again.

How to fix Spotlight when it keeps indexing in Mac OS X Lion

Today I noticed that the fan of my MacBook was continuously spinning because my Spotlight kept indexing for hours. Even when it was finished, a couple moments later it started reindexing again. After searching Google for a while finding no working solutions, I finally found one that worked for me.

Spotlight keeps indexing – the fix

If you execute the following command in your Terminal (Applications/Utilities) Spotlight should stop indexing (N.B. make sure you copy it exactly.):

sudo rm -rf /.Spotlight-V100/*

Re-index Spotlight

The above command deletes your current, corrupted index. Now execute the following command to force a one-time re-index of Spotlight (if it gives you a -400 error, rebooting and trying again should work). After running the command, Spotlight should start re-indexing within a minute.

sudo mdutil -i on -E /

Please note: It might seem that the problem hasn’t been solved, but the second command tells Spotlight to do a real, proper reindex. If everything is okay, it doesn’t reindex indefinitely now, but should be done in a couple of hours.

This is not a permanent solution, as the problem happened to me twice in the last eight months. So if your Spotlight keeps indexing too, please do let me know if it worked for you, and if you have any other solutions or know the root cause of this problem.

How To Make Sure Your Heroku App Doesn’t Shut Down

I have been working with Heroku lately and am really enjoying how easy it makes deploying a Django application. However, as I am in early development of my app, I don’t really have visitors yet. I noticed that it would take very long to load my website for the first time, while it was fast after that first request. I did some reading and learned that Heroku ‘shuts down’ your server after a period of inactivity, that is, when your site hasn’t been visited for a while.

I thought this was rather annoying and definitely not desirable when you try to gain initial visitors — nobody likes to wait ten seconds for a page to load. In this post I’ll show you how I made a workaround that will keep the server alive.

Using Cron and a little Python script

First make a simple Python script (e.g. access_page.py) that loads a URL:

#!/usr/bin/env python3
import urllib.request

if __name__ == '__main__':
    # Request the page so Heroku sees traffic, then print the first 10 bytes
    f = urllib.request.urlopen('http://myapp.herokuapp.com')
    print(f.read(10))

Then upload it to a second server (not Heroku) and put it in some folder (e.g. ~/your/folder/). Then add the following line to your crontab by running the crontab -e command.

*/5 * * * * ~/your/folder/access_page.py > $HOME/cron.log 2>&1

This will run the Python script every five minutes, accessing the webpage defined in the script. So you can test whether it works, the first 10 characters of the webpage are written to the file cron.log in your $HOME directory. When you see it’s working, you can remove the > $HOME/cron.log 2>&1 part.

If the log file says something like access denied, make the script executable with chmod +x access_page.py.
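In case the */5 in the first crontab field looks cryptic: it is a step value over the minute range 0-59. The small sketch below just spells out the minutes it selects and how many requests per day that adds up to.

```python
# Minutes of each hour at which "*/5" in the first crontab field fires
minutes = [m for m in range(60) if m % 5 == 0]

# With "*" in the hour field the job runs at those minutes in every hour of the day
runs_per_day = len(minutes) * 24

print(minutes)
print(runs_per_day)
```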

An Apriori Algorithm Analysis Of Steve Jobs’ Tribute Messages

Yesterday, Neil Kodner wrote an interesting post in which he scraped and analysed the tribute messages for Steve Jobs on the Apple website. Some interesting insights were, for example, that people talked about the Mac and iPhone the most, and compared Steve Jobs with great minds like Einstein, Ford and Edison. Also, Neil found that ‘rest in peace’ was the most used trigram in all the messages.

Seeing this made me think of applying the apriori algorithm, which I recently implemented for my Web Text Mining class, to the tribute messages. The apriori algorithm explained according to Wikipedia:

In computer science and data mining, Apriori is a classic algorithm for learning association rules. As is common in association rule mining, given a set of itemsets, the algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets.
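To make that definition a bit more concrete, here is a minimal sketch of the level-wise search for frequent itemsets. It is a textbook-style illustration on hypothetical toy data, not the implementation from my class; min_count plays the role of the minimum number C in the quote.

```python
from itertools import combinations

# Hypothetical toy "messages", each already reduced to a set of words
transactions = [
    {"condolences", "family", "friends"},
    {"family", "friends", "steve"},
    {"apple", "steve", "thank"},
    {"family", "friends", "thoughts"},
]
min_count = 2  # the minimum number C from the definition above


def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Count in how many transactions each candidate occurs
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        size += 1
        # Join frequent itemsets into candidates that are one item larger
        joined = {a | b for a in current for b in current if len(a | b) == size}
        # Apriori pruning: keep a candidate only if all its subsets were frequent
        candidates = [
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
    return frequent


result = frequent_itemsets(transactions, min_count)
print(result[frozenset({"family", "friends"})])
```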

The way my group-mate Rene Dekker and I implemented it, the algorithm extracts association rules for words from a sentence or document (stopwords, punctuation and numbers are removed from analysis). So, I took the text file with tribute messages and applied the algorithm to see what word combinations are used frequently within one tribute message. I’ll get into the algorithm in a later blogpost, but here are the results for a minimum support level of 1% and minimum confidence level of 85%.

Interpreting the results

Jobs, friends, condolences -> family

This means that the four words ‘Jobs’, ‘friends’, ‘condolences’ and ‘family’ together (but not necessarily next to each other) occur in at least 1% of the tribute messages. Also, when ‘Jobs’, ‘friends’ and ‘condolences’ occur, the word ‘family’ is also present in the message at least 85% of the time.
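The two thresholds can be written down as small functions. The sketch below computes support and confidence for this rule on four hypothetical toy messages (not the real tribute data), so the numbers it prints say nothing about the actual results.

```python
# Hypothetical toy messages, already reduced to sets of lowercase words
messages = [
    {"jobs", "friends", "condolences", "family"},
    {"jobs", "friends", "condolences", "family", "apple"},
    {"jobs", "friends", "condolences"},
    {"steve", "thank"},
]


def support(itemset, messages):
    """Fraction of messages that contain every word in itemset."""
    return sum(1 for m in messages if itemset <= m) / len(messages)


def confidence(antecedent, consequent, messages):
    """Of the messages containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent, messages) / support(antecedent, messages)


antecedent = {"jobs", "friends", "condolences"}
consequent = {"family"}

rule_support = support(antecedent | consequent, messages)
rule_confidence = confidence(antecedent, consequent, messages)
print(rule_support, rule_confidence)
```

A rule is reported only if its support and its confidence both reach the chosen minimums (1% and 85% for the list below).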

friends, Steve -> family
Peace, Jobs -> Steve
Thank, Jobs, us -> Steve
world, friends, condolences -> family
Mr, world -> Jobs
Jobs, computers -> Steve
know, friends -> family
friends, many -> family
Mr, friends -> family
friends -> family
iPad, Jobs, Apple -> Steve
Jobs, created -> Steve
condolences, friends -> family
Mr, friends -> Jobs
friends, lost -> family
go, friends -> family
friends, Apple -> family
never, friends, Steve -> family
people, Jobs, world -> Steve
friends, like -> family
life, friends, Steve -> family
Jobs, friends, condolences -> family
friends, thoughts -> family
friends, always -> family
never, Mr -> Jobs
friends, Steves -> family
friends, Apple, condolences -> family
world, friends, Steve -> family
friends, man -> family
condolences, Steves -> family
Jobs, life, great -> Steve
prayers, friends -> family
Jobs, world, changed -> Steve
human, Jobs -> Steve
friends, Apple, Steve -> family
brought, Jobs -> Steve
friends, condolences, Steve -> family
friends, condolences -> family
friends, us -> family

I’ll elaborate more on the algorithm and different improvements in efficiency and usefulness we made in a later blogpost — please stay tuned.