Friday, February 11, 2011

Extract POIs from OSM PBF ("Protocolbuffer Binary Format") dumps with python

  • I am a GIS n00b - this is my first attempt at handling OSM data.
  • This code would have never been written if it wasn't for Chris Hill's excellent parsepbf.py - blog post here.


I looked at OSM to obtain railway station locations in the country for an in-house project we are running.Parsing through their data dumps sounded like an easy job. I grabbed india.osm.bz2 and india.osm.pbf from Geofabrik. Uncompressing the bz2 file resulted in a 614MB xml whereas the pbf was just 26MB. Intrigued by the small file size of the pbf files ( I never read up on google protocol buffers before) I went to the OSM wiki to read up the format and see if any python libraries are available for this. I found Chris' parsepbf script and ran it with the pbf I had. Turns out running the the script without asking it to spit out osm xml was a bad idea - ended up eating all the memory on my machine [ no swap enabled ] and crashing the system.

I modified the parsepbf file to make as somewhat generic class for picking out nodes with specified tags.


Stats:

  • It took about 5 minutes to pickout all the railway stations on my linode ( 512MB ) VPS.
  • I think a speed up can be achieved by using multiprocessing (?)


Example usage:



Current code can be found here.

Tuesday, February 08, 2011

Daemonize Selenium-grid with init scripts on Debian

In the process of setting up selenium grid for Sahana Eden CI - I needed to run selenium-grid as a daemon, as I could not find any init scripts for it on the internets, I hacked Jenkins ( Hudson ) init script for this.
The script could be generalized to run generic ant builds as a daemon.