Ryan Seys

engineer at verily / google life sciences

Fooling with Open Data

February 09, 2014

This evening after a long day of working on school assignments (BORING), I got a really big urge to start fooling with open data. So I did.

I live in Ottawa, Canada, and Ottawa has an Open Data Catalog that gives you access to various types of data. Struggling to find something interesting to do with that data in the little time and imagination that I have, I found myself scouring other sources for other types of data to pique my interest.

I made my way to Open Data Toronto, Open Data Waterloo and finally found myself at Health Canada, more specifically their food and nutrition site containing the Canadian Nutrient File. I was just hungry… for data!

The Canadian Nutrient File contains a plethora of information on a wide range of foods. I was specifically drawn to the Refuse Amount File, which lists the amount of refuse, i.e. the inedible portion, of each food.

I immediately downloaded their data set (11MB!), which turned out to contain a boatload of files I had no idea how to open.

Refuse DBF

A quick Google search for DBF directed me to a dbf library to open said files with Ruby. Time to whip out the Terminal!

First I installed the library like a boss.

$ gem install dbf
Successfully installed dbf-2.0.7
1 gem installed

Rather anticlimactic, but so far so good. Next, let's just jump into irb to fool around.

$ irb

And now let's require 'dbf' and read in the DBF file.

irb(main):001:0> require 'dbf'
=> true
irb(main):002:0> data = DBF::Table.new("REFUSE.DBF")
=> #<DBF::Table:0x007f8b1b02b4a0 ...etc... @memo=nil>

Cool! We read in this weird file! So data is now a table with records. Let’s get the first one and take a look at its attributes.

irb(main):003:0> data.first.attributes
=> {"FD_ID"=>2.0, "REFUSE_ID"=>750.0, "REFUSE_AMT"=>0.0, "CF_DT_ENT"=>"01/05/1997"}

So each record has some refuse amount. Let’s do something neat with it, say, calculate the average across all food records.

irb(main):004:0> count = 0
=> 0
irb(main):005:0> sum = 0
=> 0
irb(main):006:0> data.each do |record|
irb(main):007:1*  sum += record.attributes["REFUSE_AMT"]
irb(main):008:1>  count += 1
irb(main):009:1> end
=> 7138
irb(main):010:0> sum/count
=> 9.901528593443539

Wow! So across 7138 food items, the average refuse amount is 9.9%. That’s kinda interesting.
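For what it's worth, the same average can be computed without the manual counters. Here is a minimal sketch, assuming each record responds to attributes with a hash like the one shown earlier. The names fake_record and average_refuse and the sample values are made up so the snippet runs without REFUSE.DBF, and Array#sum needs Ruby 2.4 or newer.

```ruby
# Stand-in for DBF records: anything that responds to #attributes with a Hash.
# The values below are invented for illustration, not taken from REFUSE.DBF.
fake_record = Struct.new(:attributes)

sample = [
  fake_record.new({ "FD_ID" => 1.0, "REFUSE_AMT" => 0.0 }),
  fake_record.new({ "FD_ID" => 2.0, "REFUSE_AMT" => 10.0 }),
  fake_record.new({ "FD_ID" => 3.0, "REFUSE_AMT" => 20.0 }),
]

# Average REFUSE_AMT over the given records, skipping missing values.
# Returns nil when there is nothing to average.
def average_refuse(records)
  amounts = records.map { |record| record.attributes["REFUSE_AMT"] }.compact
  return nil if amounts.empty?

  amounts.sum / amounts.size.to_f
end

puts average_refuse(sample)
```

With the real table loaded as data, average_refuse(data) should behave the same way, assuming every record exposes attributes as in the session above.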

Let’s see what the max refuse amount is…

irb(main):011:0> max = 0
=> 0
irb(main):012:0> data.each do |record|
irb(main):013:1*  if record.attributes["REFUSE_AMT"] > max then
irb(main):014:2*    max = record.attributes["REFUSE_AMT"]
irb(main):015:2>  end
irb(main):016:1> end
=> 7138
irb(main):017:0> max
=> 86.0

Disregarding my terrible coding, this is actually cool. The maximum refuse a food has in the database is 86%! That’s a huge inedible portion! Now in theory I could keep going and dig through the other DBF files to find out what that food was, and possibly compare it with the other high-refuse foods. For the sake of this example I just wanted to show you what is possible with a couple hours, a few cups of tea and a little imagination. And really, the possibilities are endless, all thanks to Open Data (and tea)!
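As a rough sketch of that lookup idea: pick the rows with the largest refuse amounts and match them to names by FD_ID. Only FD_ID and REFUSE_AMT come from the data shown above. The NAME column, the food names, the sample values and the top_refuse helper are all hypothetical stand-ins, since the layout of the other DBF files isn't shown here.

```ruby
# Hypothetical rows standing in for REFUSE.DBF records (values are made up).
refuse_rows = [
  { "FD_ID" => 2.0, "REFUSE_AMT" => 0.0 },
  { "FD_ID" => 5.0, "REFUSE_AMT" => 86.0 },
  { "FD_ID" => 7.0, "REFUSE_AMT" => 35.0 },
]

# Hypothetical rows standing in for a food-name table; the NAME column is assumed.
food_rows = [
  { "FD_ID" => 2.0, "NAME" => "Example food A" },
  { "FD_ID" => 5.0, "NAME" => "Example food B" },
  { "FD_ID" => 7.0, "NAME" => "Example food C" },
]

# Returns the n rows with the largest REFUSE_AMT as [name, amount] pairs,
# highest first. Names are looked up through a hash keyed by FD_ID.
def top_refuse(refuse_rows, food_rows, n)
  names_by_id = food_rows.each_with_object({}) do |row, names|
    names[row["FD_ID"]] = row["NAME"]
  end

  refuse_rows
    .max_by(n) { |row| row["REFUSE_AMT"] }
    .map { |row| [names_by_id[row["FD_ID"]], row["REFUSE_AMT"]] }
end

top_refuse(refuse_rows, food_rows, 2).each do |name, amount|
  puts "#{name}: #{amount}%"
end
```

Building the hash once keeps the lookup cheap even with thousands of records, instead of scanning the name table for every refuse row.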

Go make some cool stuff with open data! :) Get started here.