Sunday, September 16, 2012

Boilerpipe integration in Python

Boilerpipe is a library for boilerplate removal and full-text extraction from HTML. In most scenarios it works remarkably well; you can try it out here.

We wanted to use it with Python, so we tried out the Python wrapper for Boilerpipe.

It requires JPype as a prerequisite, which is available here.

Once it's downloaded, run 'sudo python install'

During installation I ran into several errors. The steps below describe what I hit and how I got past it:

1. Install JPype.

The gcc command failed with this error:

error: command 'gcc-4.2' failed with exit status 1

I followed the explanation here to get it installed. Basically, you need to update the javaHome path for your machine (Mac, Windows, or Linux; change the setup method that matches your platform). The next step is to find the Java path on your machine and add it to .bash_profile if it's not already set.
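Before editing anything, it can help to confirm what JAVA_HOME currently points to. This is only a minimal sketch of such a check; the helper name `check_java_home` is my own invention and is not part of JPype or boilerpipe.

```python
import os


def check_java_home(env=None):
    """Return (path, problem); problem is None when JAVA_HOME looks usable.

    Hypothetical helper for illustration, not part of JPype.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None, "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return java_home, "JAVA_HOME does not point to an existing directory"
    return java_home, None


if __name__ == "__main__":
    path, problem = check_java_home()
    print(problem or "JAVA_HOME = " + path)
```

If it reports a problem, fix the export line in .bash_profile and open a new shell before retrying the install.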

I made the following changes:

def setupMacOSX(self):
        self.javaHome = '/Developer/SDKs/MacOSX10.7.sdk/System/Library/Frameworks/JavaVM.framework'
        self.jdkInclude = ""
        self.libraries = ["dl"]
        self.libraryDir = [self.javaHome+"/Libraries"]
        self.macros = [('MACOSX',1)]

def setupInclusion(self):
        self.includeDirs = [

That should do the trick, and JPype should install fine.

2. Run the installation of the boilerpipe wrapper. It will also install the boilerpipe jars and chardet (universal encoding detector) into your environment, since chardet is one of the packages boilerpipe requires.

Run ‘sudo python install’ to get the Java boilerpipe wrapper installed on your machine.

3. Run your app by following the instructions in the documentation.
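For reference, a minimal app along those lines might look like the sketch below. It assumes the wrapper exposes an `Extractor` class in `boilerpipe.extract` that accepts `extractor=` and `html=` keyword arguments and has a `getText()` method, which is how I understand its README; the function name `extract_main_text` is my own.

```python
def extract_main_text(html):
    """Return the main text of an HTML string using python-boilerpipe.

    Assumes the wrapper's Extractor API as described in its README.
    """
    # Imported lazily so that a missing or broken install surfaces
    # here, when the function is called, rather than at module import.
    from boilerpipe.extract import Extractor

    extractor = Extractor(extractor="ArticleExtractor", html=html)
    return extractor.getText()
```

Calling it starts the JVM through JPype, so this is the point where any problem with JAVA_HOME or the boilerpipe jars shows up.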

I got the following error when running the app:

java.lang.Exception: Class de.l3s.boilerpipe.sax.HTMLHighlighter not found

This happens because your $JAVA_HOME is not set up correctly. Another part of the reason was that the boilerpipe jars were missing from the path (probably the python-boilerpipe install didn't bring the boilerpipe Java jars, so I added them manually). After getting those, it works fine.
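To check whether the jars you have actually contain the missing class, you can look inside them, since a jar is just a zip archive. This is a rough sketch using only the standard library; `jars_containing` is a hypothetical helper name, and the directory you pass in depends on where the jars ended up on your machine.

```python
import os
import zipfile


def jars_containing(directory, class_name):
    """List jar files in `directory` that contain the given Java class."""
    # A class like a.b.C is stored inside the jar as a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".jar"):
            continue
        path = os.path.join(directory, name)
        with zipfile.ZipFile(path) as jar:
            if entry in jar.namelist():
                matches.append(path)
    return matches
```

An empty result for 'de.l3s.boilerpipe.sax.HTMLHighlighter' means the boilerpipe jar is not in that directory and has to be added manually.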

Please suggest any other good tools for extracting the main content from a web page.

