Monday, September 17, 2012

RSS pubDate to python date

In RSS feeds, one of the fields is pubDate, which is common across feeds. This date is required to be in the RFC 822 format (Standard for ARPA Internet Text Messages).

Sample - Mon, 17 Sep 2012 00:00:01 GMT

If you want to convert it to a Python datetime, get help from Python's email.utils module.

>>> from email.utils import parsedate_tz
>>> parsedate_tz('Thu, 26 Jul 2012 13:30:52 EDT')
(2012, 7, 26, 13, 30, 52, 0, 1, -1, -14400)

The call above returns a tuple, which you will need to convert to a datetime. The last element of the tuple is the timezone offset in seconds.
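One way to finish the conversion, as a minimal sketch assuming Python 3: email.utils.mktime_tz turns the tuple into a UTC timestamp, and datetime.fromtimestamp reads that back as a timezone-aware datetime. The helper name pubdate_to_utc is my own and not part of any library.

```python
from datetime import datetime, timezone
from email.utils import mktime_tz, parsedate_tz


def pubdate_to_utc(pub_date):
    """Convert an RFC 822 date string into a timezone-aware UTC datetime."""
    parsed = parsedate_tz(pub_date)
    if parsed is None:
        # parsedate_tz returns None when it cannot parse the string
        raise ValueError('not a valid RFC 822 date: %r' % pub_date)
    # mktime_tz applies the offset stored in the last tuple element
    return datetime.fromtimestamp(mktime_tz(parsed), tz=timezone.utc)


print(pubdate_to_utc('Thu, 26 Jul 2012 13:30:52 EDT'))
```

For the EDT sample this gives 17:30:52 UTC, since EDT is four hours behind UTC. On Python 3.3 and later, email.utils.parsedate_to_datetime may be a shorter route, though it keeps the original offset instead of converting to UTC.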

If you don't care about the timezone (e.g. EDT, GMT etc.), use the following -

>>> from datetime import datetime
>>> datetime.strptime('Thu, 26 Jul 2012 13:30:52 EDT'[:-4], '%a, %d %b %Y %H:%M:%S')
datetime.datetime(2012, 7, 26, 13, 30, 52)
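Note that the [:-4] slice assumes the zone is a space followed by exactly three letters, so a numeric offset such as +0000 would break it. A small sketch that drops the last token whatever its length (the function name parse_pubdate_naive is made up for this example):

```python
from datetime import datetime


def parse_pubdate_naive(pub_date):
    """Parse an RFC 822 date string, ignoring the trailing timezone token."""
    # Split off everything after the last space: 'EDT', 'GMT', '+0000', ...
    date_part, _zone = pub_date.rsplit(' ', 1)
    return datetime.strptime(date_part, '%a, %d %b %Y %H:%M:%S')


print(parse_pubdate_naive('Thu, 26 Jul 2012 13:30:52 +0000'))
```

Keep in mind that %a and %b follow the current locale, so this assumes English day and month names.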

I am sure there are many other ways to do this; please suggest any good ones you come across :)

Sunday, September 16, 2012

Boilerpipe integration in python

Boilerpipe is a library for boilerplate removal and full-text extraction from HTML. In most scenarios it works amazingly well; you can try it out here.

We wanted to use it with Python, so we tried out the Python wrapper for Boilerpipe.

It requires JPype as a prerequisite, which is available here.

Once it's downloaded, run 'sudo python install'

During installation I ran into several errors; here are the steps I followed to get past them -

1. Install JPype.

The gcc command failed with this error -

error: command 'gcc-4.2' failed with exit status 1

I followed the explanation here to get it installed. Basically, you need to update the javaHome path for your machine (on Mac, Windows or Linux, change the method that matches your platform). The next step is to find the Java path on your machine and add it to .bash_profile if it's not already set.

I made the following changes in my–

def setupMacOSX(self):
        self.javaHome = '/Developer/SDKs/MacOSX10.7.sdk/System/Library/Frameworks/JavaVM.framework'
        self.jdkInclude = ""
        self.libraries = ["dl"]
        self.libraryDir = [self.javaHome+"/Libraries"]
        self.macros = [('MACOSX',1)]

def setupInclusion(self):
        self.includeDirs = [

That should do the trick, and JPype should install fine.

2. Run the installation of the boilerpipe wrapper, which will also add the boilerpipe jars and chardet (universal encoding detector) to your environment, since chardet is one of the packages boilerpipe requires.

I ran ‘sudo python install’ to get the Java boilerpipe wrapper installed on the machine.

3. Start running your app as instructed in the documentation.

I got the following error upon running the app -

java.lang.Exception: Class de.l3s.boilerpipe.sax.HTMLHighlighter not found

This happens because your $JAVA_HOME is not set up correctly. Another part of the reason was that the boilerpipe jars were missing from the path (the python-boilerpipe install probably didn't bring the boilerpipe Java jars, so I brought them in manually). After getting those in place, it works fine.
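Both causes can be checked before starting the app. The sketch below is only a diagnostic helper of my own (check_boilerpipe_env is not part of boilerpipe or JPype), and it assumes the jars are named boilerpipe*.jar and sit in a directory you pass in.

```python
import glob
import os


def check_boilerpipe_env(jar_dir, environ=os.environ):
    """Return a list of problems that could explain a 'Class ... not found' error."""
    problems = []
    java_home = environ.get('JAVA_HOME')
    if not java_home:
        problems.append('JAVA_HOME is not set')
    elif not os.path.isdir(java_home):
        problems.append('JAVA_HOME does not point to a directory: %s' % java_home)
    # Assumed jar naming; adjust the pattern to match your download
    if not glob.glob(os.path.join(jar_dir, 'boilerpipe*.jar')):
        problems.append('no boilerpipe*.jar found in %s' % jar_dir)
    return problems


for problem in check_boilerpipe_env('.'):
    print(problem)
```

An empty list means both checks passed; it does not guarantee that JPype will actually load the classes.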

Please suggest any other good tools for extracting the main content from a web page.

Saturday, September 1, 2012

django JSON with DateTime

In a Django view, json.dumps(data) throws the error ‘datetime (...) is not JSON serializable’ because the default JSON encoder does not know how to handle datetime objects. Use the encoder below to serialize dates.

import json

from django.core.serializers.json import DjangoJSONEncoder

def test(request, title):
    # DjangoJSONEncoder knows how to serialize datetime values
    data = json.dumps(qset, cls=DjangoJSONEncoder)

This is similar to extending the default JSONEncoder with extra code that checks for datetime values and converts them.
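To illustrate the idea of extending JSONEncoder, here is a minimal sketch using only the standard library (Python 3). The class name DateTimeEncoder and the sample payload are made up for this example, and the ISO 8601 output is an assumption that may not match Django's exact format.

```python
import json
from datetime import date, datetime


class DateTimeEncoder(json.JSONEncoder):
    """JSON encoder that writes date and datetime values as ISO 8601 strings."""

    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        # Fall back to the base class, which raises TypeError for unknown types
        return super().default(obj)


payload = {'title': 'RSS pubDate to python date',
           'published': datetime(2012, 9, 17, 0, 0, 1)}
encoded = json.dumps(payload, cls=DateTimeEncoder)
print(encoded)
```

The default() hook is only called for objects the encoder cannot serialize on its own, so strings, numbers, lists and dicts are handled as usual.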

This should resolve the issue.