Monthly Archives: May 2012

Creating a Globe of Data (revisited for Programming Historian Second Edition)

Module Goals

After seeing the basics of Python and how it can help us in our daily work, we will introduce one of the many options for visualizing data. In this case, we will take a data source in CSV format and process it into JSON notation. Finally, we will represent all the information on a world globe designed for modern browsers that support WebGL. Along the way, we will need to get the spatial coordinates of countries across the world. Before starting, you can see the final result of this unit on World Poverty, so don’t be put off by all the new names mentioned above; we will explain them below.

The Globe of Data

Since the end of 2009, some browsers have started to implement an incipient specification for rendering 3D content on the Web. Although it is not yet part of the W3C’s specifications (the W3C is the organization that proposes, defines and approves almost all Web standards), WebGL, as it is called, is being supported by all major browsers and the industry.

WebGL is the most recent way to render 3D content on the Web, and it makes a new form of data representation available. Artists, scientists, game designers, statisticians and others are already creating amazing visualizations from their data.

Google WebGL Globe

One of these new forms of representation was made by Google. It is called WebGL Globe, and it displays geolocated statistical data.

JSON & World Coordinates

JSON, an acronym for JavaScript Object Notation, is not only a format for representing data in JavaScript, the language of browsers. It is also the data format that WebGL Globe needs to work. In this format, a list is enclosed in square brackets, “[” to start and “]” to end. The data for WebGL Globe is therefore a list of lists. Each of these lists has two elements: the first is the name of the series and the second is another list containing the data. Although it is good to know how JSON lists are encoded, there are Python libraries that do the conversion for you, so you only have to handle plain Python objects.

>>> import json

>>> json.dumps([1, 2, 3])
    '[1, 2, 3]'

>>> json.dumps({"key1": "val1", "key2": "val2"})
    '{"key2": "val2", "key1": "val1"}'

The data for WebGL Globe is written as comma-separated values in sets of three elements: the first is the latitude, the second is the longitude, and the third is the magnitude you would like to represent, normalized between 0 and 1. This means that if we have the magnitudes 10, 50 and 100, they will have to be translated into 0.1, 0.5 and 1.
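The translation of 10, 50 and 100 into 0.1, 0.5 and 1 amounts to dividing each magnitude by the largest one. Here is a minimal sketch of that arithmetic, assuming non-negative values; the helper name normalize_by_max is only illustrative and is not part of WebGL Globe.

```python
def normalize_by_max(values):
    """Scale non-negative magnitudes into the range 0..1 by dividing by the maximum."""
    largest = max(values)
    if largest == 0:
        # Avoid dividing by zero when every magnitude is zero.
        return [0.0 for _ in values]
    return [value / float(largest) for value in values]


print(normalize_by_max([10, 50, 100]))  # prints [0.1, 0.5, 1.0]
```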

Briefly, “A geographic coordinate system is a coordinate system that enables every location on the Earth to be specified by a set of numbers.” These numbers are often chosen to represent the vertical and horizontal position of a point on the globe (to be more precise, it is even possible to add the elevation). They are commonly expressed as angles, with latitude measured from the equatorial plane and longitude from the prime meridian, but as far as we are concerned those angles can be transformed into a pair of numbers with several decimal places.
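When a source gives coordinates as angles in degrees, minutes and seconds, the conversion to decimal numbers is simple arithmetic. The following is a sketch under the usual convention that south latitudes and west longitudes are negative; the function name dms_to_decimal and the sample angles are made up for illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    # South latitudes and west longitudes are negative by convention.
    if hemisphere in ("S", "W"):
        decimal = -decimal
    return decimal


# Illustrative angles only: 40 degrees 30 minutes north, 3 degrees 45 minutes west.
print(dms_to_decimal(40, 30, 0, "N"))  # prints 40.5
print(dms_to_decimal(3, 45, 0, "W"))   # prints -3.75
```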

Latitude and Longitude of the Earth (Source: Wikipedia.org)

The only thing you now need is to split up your data into several series of latitude, longitude and magnitude in JSON format, as the next example illustrates:

var data = [
  [
    'seriesA', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ]
  ],
  [
    'seriesB', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ]
  ]
];

This said, we can write the data for our globe in pure Python and then apply a conversion into JSON.

>>> data = [
 ...: ["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]],
 ...: ["seriesB", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]],
 ...: ...
 ...: ]

>>> json.dumps(data)
'[["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]], ["seriesB", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]], ...]'
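The listing above elides most of the values. For a version that can be run as is, here is a small sketch written for Python 3 with made-up numbers; it also decodes the string again to check that the nested list structure survives the conversion.

```python
import json

# Made-up sample values: each series is a flat run of latitude, longitude, magnitude.
data = [
    ["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12]],
    ["seriesB", [34.56, -5.23, 0.45, 27.78, 10.56, 0.67]],
]

encoded = json.dumps(data)
print(encoded)

# Decoding gives back the same nested lists.
decoded = json.loads(encoded)
# Every series should hold whole (latitude, longitude, magnitude) triples.
complete = all(len(values) % 3 == 0 for _, values in decoded)
print(complete)
```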

The Data Set

Let’s say we want to represent information from the Human Poverty Index. The first thing we need to do is download the data in the format provided by the United Nations’ site for the Multidimensional Poverty Index, which has replaced the old Human Poverty Index. Now that we have a spreadsheet document, it’s time to open it and collect just the data we need: go to page 5 of the workbook, and copy and paste the cells into a clean spreadsheet. We remove all the data we don’t need, such as titles, captions and extra columns, and keep just the country names, the second “Value” column under the cell “Multidimensional Poverty Index”, the population in poverty in thousands, and the “Intensity of deprivation” column. The next step is to remove the rows with no data for those indicators, marked as “..”. After doing this, we should have a document with 4 columns and 109 rows.

Spreadsheet before getting coordinates for countries

Although we have the names of the countries, we still need their geographical coordinates. There are several services that provide the latitude and longitude for a given address. When only the name of a country is given, a single main set of coordinates for that country is returned. We will use geopy, a Python library that can connect to different providers and retrieve several kinds of information. To install geopy we need a terminal or console, where a single command is enough.

$ easy_install geopy

After that, we can open a terminal or an interactive console such as IPython and get the latitude and longitude of, for instance, “Spain”, with the following commands:

>>> from geopy import geocoders

>>> g = geocoders.Google()

>>> g.geocode("Spain")
(u'Spain', (40.463667000000001, -3.7492200000000002))

In this way, we can build a list of our countries and pass it to the next script:

>>> from geopy import geocoders

>>> g = geocoders.Google()

>>> countries = ["Slovenia", "Czech Republic", ...]
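>>> for country in countries:
....:     try:
....:         placemark = g.geocode(country)
....:         # placemark is (name, (latitude, longitude)); format the floats as text
....:         print "%s,%s,%s" % (placemark[0], placemark[1][0], placemark[1][1])
....:     except Exception:
....:         # If the lookup fails, print the bare name so it can be fixed by hand
....:         print country
....: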
Slovenia,46.151241,14.995463
Czech Republic,49.817492,15.472962
United Arab Emirates,23.424076,53.847818
...

Now we can select all the results, with the latitude and longitude of every country, and copy them with Ctrl-C or with a right-click and Copy. Then we go to our spreadsheet, select the first row of a new column, and paste. A dialogue for pasting the data should appear; in it, we check the option that splits the values on commas.

Paste the result comma separated

Once this is done, we have the coordinates for almost all the countries. There may still be some locations for which the script didn’t get the right coordinates, like “Moldova (Republic of)” or “Georgia”. For these countries, after careful review, the best thing to do is to try again with corrected names (“Moldova” instead of “Moldova (Republic of)”) or simply to look up the location in Wikipedia; for Georgia, for example, Wikipedia provides a link with the exact coordinates in the information box on the right-hand side. When the process is over, we remove the column with the names and reorder the columns so that the latitude comes first, the longitude second, and the rest of the columns after that. The data is almost ready. Finally, we need to save the spreadsheet as a CSV file so that it can be processed by a Python script that converts it into the JSON format that WebGL Globe is able to handle.
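Part of the name fixing can be automated before retrying the lookup. Here is a sketch of one possible approach, which drops a trailing parenthetical from the name; the helper simplify_country_name is hypothetical, and ambiguous names such as “Georgia” still need to be checked by hand.

```python
import re


def simplify_country_name(name):
    """Drop a trailing parenthetical such as ' (Republic of)' from a country name."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", name).strip()


print(simplify_country_name("Moldova (Republic of)"))  # prints Moldova
print(simplify_country_name("Georgia"))                # prints Georgia
```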

Reading CSV Files

A CSV file is a format for writing tables as plain text. There are plenty of CSV dialects, but the most common one writes one row per line with the fields separated by commas. For example, the following table produces the output shown below.

Field 1            | Field 2
Row 1 Value Cell 1 | Row 1 Value Cell 2
Row 2 Value Cell 1 | Row 2 Value Cell 2

And the output will be:

Field 1,Field 2
Row 1 Value Cell 1,Row 1 Value Cell 2
Row 2 Value Cell 1,Row 2 Value Cell 2

Depending on the case, you can choose which character to use as a separator instead of “,”, or leave the header out. But what happens if I need to write commas inside a value? Well, you can escape them or just wrap the entire value in double quotes.

"Row 1, Value Cell 1","Row 1, Value Cell 2"
"Row 2, Value Cell 1","Row 2, Value Cell 2"

And again you may wonder what happens if I need to write double quotes. In that case you can change the quoting character or escape the quotes with a backslash. This is the origin of all the CSV dialects. However, we are not going that deep here; we will focus on reading CSV through Python. To do so, we use the standard “csv” library and call its “reader” function with a file object opened from disk. Once this is done, we can iterate over the lines, each of which arrives as a list, and store every value in a variable during the iteration.
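To see how the reader deals with quoted values, here is a short sketch written for Python 3. It parses the two quoted rows shown earlier from an in-memory string rather than from a file on disk, which is an assumption made only to keep the example self-contained.

```python
import csv
import io

# The two quoted rows shown earlier, held in memory instead of a file on disk.
text = (
    '"Row 1, Value Cell 1","Row 1, Value Cell 2"\n'
    '"Row 2, Value Cell 1","Row 2, Value Cell 2"\n'
)

rows = list(csv.reader(io.StringIO(text)))
for first, second in rows:
    # Each row arrives as a list of two strings; the inner commas are preserved.
    print(first, "|", second)
```

Each row comes back as a list with exactly two fields, because the commas inside the double quotes are treated as part of the value and not as separators.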

 

In our case every line has, in this order, the latitude, the longitude, the value of the multidimensional poverty index, the number of people in poverty in thousands, and finally the intensity of deprivation. Note that our CSV file has no header, so we do not have to skip the first line. We will use three lists to store the different values of our series and finally, using the json library, we can print the JSON output. The script that processes the CSV file and produces the JSON output is shown next:

import csv
import json

# Python 2: the csv module expects the file opened in binary mode
lines = csv.reader(open("poverty.csv", "rb"))
mpis = []  # Multidimensional Poverty Index
thousands = []  # People, in thousands, in a poverty situation
deprivations = []  # Intensity of Deprivation
for lat, lon, mpi, thousand, deprivation in lines:
    # Each series is a flat run of latitude, longitude, magnitude triples;
    # the values stay as the strings read from the CSV file
    mpis.extend([lat, lon, mpi])
    thousands.extend([lat, lon, thousand])
    deprivations.extend([lat, lon, deprivation])
output = [
    ["Multidimensional Poverty Index", mpis],
    ["People affected (in thousands)", thousands],
    ["Intensity of Deprivation", deprivations]
]
print json.dumps(output)

And the output must look like:

[
["Multidimensional Poverty Index", ["46.151241", "14.995463", "0", ... ]
...
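The values in this output are quoted because the csv reader returns every field as text. If plain numbers are preferred in the JSON, one possible variant converts each field with float. This is a sketch for Python 3 that assumes every field parses as a number; the function name rows_to_series and the sample rows are made up and stand in for poverty.csv.

```python
import csv
import io
import json


def rows_to_series(rows):
    """Turn rows of latitude, longitude, MPI, thousands and deprivation
    into the three named series, with every field converted to float."""
    mpis, thousands, deprivations = [], [], []
    for row in rows:
        lat, lon, mpi, thousand, deprivation = [float(field) for field in row]
        mpis.extend([lat, lon, mpi])
        thousands.extend([lat, lon, thousand])
        deprivations.extend([lat, lon, deprivation])
    return [
        ["Multidimensional Poverty Index", mpis],
        ["People affected (in thousands)", thousands],
        ["Intensity of Deprivation", deprivations],
    ]


# Made-up sample rows standing in for poverty.csv.
sample = io.StringIO("46.151241,14.995463,0,0,0\n10.5,20.25,0.5,1200,45.5\n")
output = rows_to_series(csv.reader(sample))
print(json.dumps(output))
```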

Putting it all together

Now, if we copy that output into a file called poverty.json, we will have our input data for WebGL Globe. The last step is to put the Globe and the data file together. We need to download the webgl-globe.zip file and extract the directory named “globe” into a directory with the same name. We copy our poverty.json file into it and then edit index.html to replace the occurrences of “population909500.json” with “poverty.json”, and make some other changes such as the names of the series. Finally, to see the result, you can put all the files on a static web server and browse to the URL. Another option, just for local debugging, is to run the following command inside the directory itself:

$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

And then, go to http://localhost:8000 to see the result.

Globe before normalization

It seems like there is something wrong with two of the series: the population in poverty, and the intensity of deprivation. This is because we need to normalize the values so that they fall in the range 0 to 1. To do that, we open our CSV file again as a spreadsheet, calculate the sum of each column that we want to normalize, and then create a new column in which every cell is the old value of the cell divided by the sum of all the values in the old column. We repeat the process with the other column and replace the old columns with the values of the new ones. Now we can repeat the steps to generate the JSON file and try again.
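The same division can also be done in Python instead of in the spreadsheet. Here is a minimal sketch of the step described above, dividing every value by the sum of its column; the function name normalize_by_sum and the sample figures are made up and are not taken from the poverty data set.

```python
def normalize_by_sum(values):
    """Divide every value by the column total so each result lies between 0 and 1."""
    total = float(sum(values))
    if total == 0:
        # Avoid dividing by zero when the whole column is zero.
        return [0.0 for _ in values]
    return [value / total for value in values]


# Made-up figures, not taken from the poverty data set.
shares = normalize_by_sum([2, 3, 5])
print(shares)  # prints [0.2, 0.3, 0.5]
```

With this method the normalized values of a column add up to 1, so each one expresses a share of the column total.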

Now you can click on World Poverty to see everything working properly.

Suggested Readings

The Python Standard Library Documentation

Lutz, Learning Python

  • Ch. 9: Tuples, Files, and Everything Else
