Blog.: Extract POIs from OSM PBF ("Protocolbuffer Binary Format") dumps with python

Friday, February 11, 2011

I am a GIS n00b - this is my first attempt at handling OSM data.
This code would have never been written if it wasn't for Chris Hill's excellent parsepbf.py - blog post here.

I looked at OSM to obtain railway station locations in the country for an in-house project we are running. Parsing through their data dumps sounded like an easy job. I grabbed india.osm.bz2 and india.osm.pbf from Geofabrik. Uncompressing the bz2 file resulted in a 614MB XML file, whereas the pbf was just 26MB. Intrigued by the small size of the pbf file (I had never read up on Google Protocol Buffers before), I went to the OSM wiki to read up on the format and see if any Python libraries were available for it. I found Chris' parsepbf script and ran it on the pbf I had. It turns out that running the script without asking it to spit out OSM XML was a bad idea: it ended up eating all the memory on my machine (no swap enabled) and crashing the system.

I modified the parsepbf file to make a somewhat generic class for picking out nodes with specified tags.
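For readers who have not seen the format, here is a minimal sketch of the idea of picking nodes by tag from a dense node block. It is not the code linked from this post. It assumes the layout described on the OSM wiki: ids and coordinates are delta-coded, and tags sit in one flat keys_vals array of string-table indexes with a 0 after each node. The function names, the tuple layout and the sample values are made up for illustration, and the string table is assumed to be already decoded to str. Using generators keeps only one node in memory at a time instead of collecting everything first.

```python
from itertools import accumulate


def iter_dense_nodes(ids, lats, lons, keys_vals, strings,
                     granularity=100, lat_offset=0, lon_offset=0):
    """Yield (node_id, lat, lon, tags) tuples from dense-node style arrays.

    Hypothetical helper: `strings` stands in for the block's string table,
    where index 0 is reserved and doubles as the per-node delimiter.
    """
    kv = iter(keys_vals)
    # accumulate() turns the delta-coded values back into absolute ones.
    for node_id, lat, lon in zip(accumulate(ids), accumulate(lats), accumulate(lons)):
        tags = {}
        # Read key/value index pairs until this node's 0 delimiter.
        # If keys_vals is empty (no tagged nodes), the loop does nothing.
        for key_index in kv:
            if key_index == 0:
                break
            tags[strings[key_index]] = strings[next(kv)]
        yield (node_id,
               1e-9 * (lat_offset + granularity * lat),
               1e-9 * (lon_offset + granularity * lon),
               tags)


def pick_nodes(nodes, wanted):
    """Yield only the nodes whose tags contain every key/value in `wanted`."""
    for node in nodes:
        tags = node[3]
        if all(tags.get(key) == value for key, value in wanted.items()):
            yield node


if __name__ == "__main__":
    # Made-up sample block with three nodes; the middle one has no tags.
    strings = ["", "railway", "station", "name", "Example Junction", "halt"]
    nodes = iter_dense_nodes(
        ids=[10, 1, 1],
        lats=[185000000, 1000, -500],
        lons=[738000000, 0, 200],
        keys_vals=[1, 2, 3, 4, 0, 0, 1, 5, 0],
        strings=strings,
    )
    for node in pick_nodes(nodes, {"railway": "station"}):
        print(node)
```

A real parser would first have to read and decompress each blob and decode the protobuf messages; that part is left out here.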

Stats:

It took about 5 minutes to pick out all the railway stations on my Linode (512MB) VPS.
I think a speed-up could be achieved by using multiprocessing (?)
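A rough sketch of what the multiprocessing idea could look like, untested against real PBF data. Each blob in a PBF file is compressed on its own, so blocks can in principle be handled independently. Here a "block" is simply assumed to be an already decoded list of (node_id, tags) pairs, and both function names are hypothetical. Whether this actually helps depends on how many cores the machine has and on where the time goes.

```python
from multiprocessing import Pool


def stations_in_block(block):
    """Worker: return the (node_id, tags) pairs tagged railway=station.

    `block` is a hypothetical, already decoded list of (node_id, tags) pairs.
    """
    return [(node_id, tags) for node_id, tags in block
            if tags.get("railway") == "station"]


def extract_parallel(blocks, processes=2):
    """Spread the blocks over a pool of worker processes and merge the hits."""
    results = []
    with Pool(processes) as pool:
        # imap_unordered hands back each block's result as soon as it is
        # ready, so the merged list is not guaranteed to be in file order.
        for found in pool.imap_unordered(stations_in_block, blocks):
            results.extend(found)
    return results
```

Call extract_parallel(blocks) from under an `if __name__ == "__main__":` guard so that worker processes do not re-run the script on platforms that spawn rather than fork.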

Example usage:

Current code can be found here.

2 comments:

Chris Hill said...: Thanks for the mention.

I think I need to put out a warning on the raw code about gobbling memory. The main reason I created the code was to do exactly as you have done - customise it to extract data you are interested in.; 10:41 PM
Tao said...: Hi

I used your code to see if it works on the map of Philadelphia. However, it always returns the following error message:

Traceback (most recent call last):
  File "", line 1, in
    tags = foo.return_tags(refresh=True)
  File "osmnodepbf.py", line 216, in return_tags
    self.parse()
  File "osmnodepbf.py", line 113, in parse
    self.processDense(pg.dense,tag)
  File "osmnodepbf.py", line 199, in processDense
    self.tags[node["sky"]] = [node["svl"]]
KeyError: 'svl'

I tried it on some other maps and it keeps giving the same error; 1:54 AM