Monthly Archives: February 2013

Río and Reykjavik

Recently, a paper I collaborated on was accepted to Digital Humanities 2013, to be hosted in Lincoln, Nebraska. Our paper is titled “Not Exactly Prima Facie: Understanding the Representation of the Human Through the Analysis of Faces in World Painting.” This research uses face recognition techniques to identify similarities in faces across time.

The representations of the human face contain a virtual archive of human expressions and emotions that can help decipher, through a science of the face, various traits of the human condition and its evolution through time and space. In this project we aim to explore this through the use of powerful tools of facial recognition, data mining, graph theory and visualization and cultural history. Our methodology takes advantage of these tools and concepts to answer questions about periods in art history, such as the significance of the Baroque as a culture derived from human expansion, and the cultural meaning of the progressive erasing of the human face from modern painting. Quantitative analysis of huge amounts of data has been shown to provide answers to new and different questions that otherwise couldn’t have been considered. Our study takes some ideas from the concept of Culturomics by creating a set of more than 123,500 paintings from all periods of art history, and applying the same face recognition algorithm used today by Facebook in its photo-tagging system. The result is a set of over 26,000 faces ready to analyze according to a variety of features extracted by the algorithm. We found a mean of approximately 1 face out of every 5 paintings.

But what I am most excited about lately is the submission we also made a week ago. There are two target conferences, DATA and WWW, to be hosted in Reykjavik, Iceland, and Río de Janeiro, Brazil, respectively. It’s the first time I have collaborated on a paper sent to highly technical conferences, so I don’t know what the chances are of being accepted to at least one of them. I’ll just cross my fingers and wait until the notification deadline comes.


Creating a Globe of Data (PH2)

Lesson Goals

This is a lesson designed for intermediate users, although beginner users should be able to follow along.

In this lesson we will cover the following main topics:

  • Using Python to produce a visualization of the World Poverty Index on an interactive globe.
  • Transforming CSV data into JSON notation in Python.
  • Getting spatial coordinates from Google and other sources with the geopy library.

Having seen the basics of Python and how it can help us in our daily work, we will now introduce one of the many options for visualizing data. In this case, we will take a data source in CSV format and process it to transform it into JSON notation. Finally, we will represent all the information on a world globe designed for modern browsers using WebGL technology. Along the way, we will need to get the spatial coordinates for countries across the world. Before starting, you can see the final result of this unit on World Poverty, so don’t be put off by all the new names mentioned above; we will explain them below.

The Globe of Data

Since the end of 2009, some browsers have been implementing an incipient specification for rendering 3D content on the Web. Although it is not yet part of the W3C’s specifications (the W3C is the organization that proposes, defines and approves almost all Web standards), WebGL, as it is called, is supported by all major browsers and by the industry.

WebGL is the most recent way to render 3D content on the Web, and with it a new form of data representation becomes available. In fact, artists, scientists, game designers, statisticians and others are creating amazing visualizations from their data.

Google WebGL Globe


One of these new forms of representation was made by Google. It is called WebGL Globe and is intended to show geolocated statistical data.

JSON

JSON, an acronym for JavaScript Object Notation, is not only a format for representing data in JavaScript; it is also the format WebGL Globe needs in order to work. In this format, a list is enclosed in square brackets, starting with “[” and ending with “]”. The data series for WebGL Globe is a list of lists. Each of these lists has two elements: the first is the name of the series and the second is another list containing the data. Although it is good to know how JSON lists are encoded, there are Python libraries that do the conversion for you, so you only have to handle pure Python objects. The next code snippet shows how native Python lists and dictionaries are transformed into JSON.

>>> import json

>>> json.dumps([1, 2, 3])
    '[1, 2, 3]'

>>> json.dumps({"key1": "val1", "key2": "val2"})
    '{"key2": "val2", "key1": "val1"}'

The data for WebGL Globe is written as comma-separated values in sets of three elements: the first is the latitude, the second is the longitude, and the third is the value of the magnitude you would like to represent, normalized between 0 and 1. This means that if we have the values 10, 50 and 100 for magnitudes, they will have to be translated into 0.1, 0.5 and 1.

All you need now is to split your data into several series of latitude, longitude and magnitude in JSON format, as the next example illustrates:
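As a small illustration of that normalization, here is a minimal sketch that divides every magnitude by the largest one. The function name normalize is only an assumption made for this example, and it presumes the magnitudes are not negative.

```python
def normalize(values):
    """Scale non-negative magnitudes into the 0..1 range by dividing by the largest one."""
    largest = max(values)
    if largest == 0:
        # Avoid dividing by zero when every magnitude is 0
        return [0.0 for _ in values]
    return [value / float(largest) for value in values]


print(normalize([10, 50, 100]))  # prints [0.1, 0.5, 1.0]
```

Dividing by the maximum keeps the proportions between values and makes the largest magnitude exactly 1, which matches the 10, 50, 100 example above.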

var data = [
  [
    'seriesA', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ]
  ],
  [
    'seriesB', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ]
  ]
];

That said, we can create a list in Python with the format described above and then convert it to JSON using the json library. Because JSON is handled in Python as a string, and because it is easy to introduce syntax errors if you write JSON by hand, we recommend creating the objects in Python and converting them into JSON, so that we can guarantee the final JSON is free of syntax errors.

>>> import json

>>> data = [
 ...: ["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]],
 ...: ["seriesB", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]],
 ...: ...
 ...: ]

>>> json.dumps(data)
'[["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]], ["seriesB", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]], ...]'
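If your data starts out as separate (latitude, longitude, magnitude) points rather than one long list, a helper along these lines could flatten them into the shape WebGL Globe expects. This is only a sketch: build_series is a hypothetical name, and the sample points are the illustrative numbers used above.

```python
import json


def build_series(name, points):
    """Flatten (latitude, longitude, magnitude) tuples into one WebGL Globe series."""
    flat = []
    for latitude, longitude, magnitude in points:
        flat.extend([latitude, longitude, magnitude])
    return [name, flat]


# Illustrative points only, reusing the numbers from the example above
points_a = [(34.56, -5.23, 0.89), (27.78, 10.56, 0.12)]
data = [build_series("seriesA", points_a)]
print(json.dumps(data))
```

Each call returns one [name, values] pair, so the outer list stays a list of lists no matter how many series you add.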

The Data Set

Let’s say we want to represent information from the Human Poverty Index. We first need to download the data in the format provided by the United Nations’ site for the Multidimensional Poverty Index, which has replaced the old Human Poverty Index. Now that we have a spreadsheet document, it’s time to open it and collect just the data we need: go to page 5 of the workbook, and copy and paste the cells into a clean spreadsheet. We remove what we don’t need (titles, captions, extra columns, etc.) and leave just the country names, the second “Value” column under the cell “Multidimensional Poverty Index”, the population in poverty in thousands, and the “Intensity of deprivation” column. The next step is to remove the rows with no data for those indicators, marked as “..”. After doing this, we should have a document with 4 columns and 109 rows. Then remember to normalize all the values between 0 and 1. Alternatively, you can simply download the cleaned and normalized file in CSV or Excel (XLS) format to avoid getting lost in spreadsheet manipulation.


Spreadsheet before normalizing

Although we have the names of the countries, we still need their geographical coordinates. There are several services that provide the latitude and longitude for a given address. When only the name of a country is given, the main coordinates for the capital are provided. We will use geopy, a Python library that can connect to different providers and get several kinds of information. geopy is installed from a terminal or console with a single command.

$ easy_install geopy

After that, we can open a terminal with the standard Python interpreter, or an interactive console like IPython, and get the latitude and longitude of, for instance, “Spain” with the following commands:

>>> from geopy import geocoders

>>> g = geocoders.Google()

>>> g.geocode("Spain")
(u'Spain', (40.463667000000001, -3.7492200000000002))

By default, geopy will try to return only one match, but you can change that behaviour by passing the argument exactly_one=False. geopy will then return a list of matches, and it will be your task to pick one. Google has a low limit on queries per day, so you should try a different provider for the geocoder if you reach that limit.

>>> from geopy import geocoders

# Using GeoNames as provider
>>> g = geocoders.GeoNames()

# Getting the whole list of matches and getting just one
>>> g.geocode("Spain", exactly_one=False)[0]
(u'Spain', (40.463667000000001, -3.7492200000000002))
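Switching providers by hand when one reaches its limit can also be automated. The following is a minimal sketch, not part of geopy: first_match is a hypothetical helper that only assumes each geocoder object has a geocode(name, exactly_one=False) method like the ones used above, and the two small classes are made-up stand-ins so the example runs without a network connection.

```python
def first_match(geocoders, name):
    """Return (place, (lat, lon)) from the first geocoder that yields a result, else None."""
    for geocoder in geocoders:
        try:
            matches = geocoder.geocode(name, exactly_one=False)
        except Exception:
            # Quota errors, timeouts and similar: move on to the next provider
            continue
        if matches:
            return matches[0]
    return None


class FailingGeocoder(object):
    """Hypothetical stand-in for a provider whose daily quota is exhausted."""

    def geocode(self, name, exactly_one=True):
        raise RuntimeError("query limit reached")


class FixedGeocoder(object):
    """Hypothetical stand-in that always answers with the same coordinates."""

    def geocode(self, name, exactly_one=True):
        return [(name, (40.463667, -3.74922))]


result = first_match([FailingGeocoder(), FixedGeocoder()], "Spain")
print(result)
```

In real use you would pass actual geocoder instances in order of preference; the first one that answers wins.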

We can now build a list of the countries in our spreadsheet and pass it to the script below. To build the list, you can simply copy the column of countries into your code editor and replace every line break (\n) with ", " (including the quotes) so that the result is something like:

["Slovenia", "Czech Republic", "United Arab Emirates", "Estonia", "Slovakia", "Hungary", "Latvia", "Argentina", "Croatia", "Uruguay", "Montenegro", "Mexico", "Serbia", "Trinidad and Tobago", "Belarus", "Russian Federation", "Kazakhstan", "Albania", "Bosnia and Herzegovina", "Georgia", "Ukraine", "The former Yugoslav Republic of Macedonia", "Peru", "Ecuador", "Brazil", "Armenia", "Colombia", "Azerbaijan", "Turkey", "Belize", "Tunisia", "Jordan", "Sri Lanka", "Dominican Republic", "China", "Thailand", "Suriname", "Gabon", "Paraguay", "Bolivia (Plurinational State of)", "Maldives", "Mongolia", "Moldova (Republic of)", "Philippines", "Egypt", "Occupied Palestinian Territory", "Uzbekistan", "Guyana", "Syrian Arab Republic", "Namibia", "Honduras", "South Africa", "Indonesia", "Vanuatu", "Kyrgyzstan", "Tajikistan", "Viet Nam", "Nicaragua", "Morocco", "Guatemala", "Iraq", "India", "Ghana", "Congo", "Lao People's Democratic Republic", "Cambodia", "Swaziland", "Bhutan", "Kenya", "Sao Tome and Principe", "Pakistan", "Bangladesh", "Timor-Leste", "Angola", "Myanmar", "Cameroon", "Madagascar", "Tanzania (United Republic of)", "Yemen", "Senegal", "Nigeria", "Nepal", "Haiti", "Mauritania", "Lesotho", "Uganda", "Togo", "Comoros", "Zambia", "Djibouti", "Rwanda", "Benin", "Gambia", "Côte d'Ivoire", "Malawi", "Zimbabwe", "Ethiopia", "Mali", "Guinea", "Central African Republic", "Sierra Leone", "Burkina Faso", "Liberia", "Chad", "Mozambique", "Burundi", "Niger", "Congo (Democratic Republic of the)", "Somalia"]
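If you would rather not do the search and replace by hand, the same list can be built in Python from the pasted column. This is a minimal sketch; the variable names are assumptions, and only three countries are pasted to keep it short.

```python
# The column copied from the spreadsheet, pasted as a multi-line string
column = """Slovenia
Czech Republic
United Arab Emirates
"""

# One list item per non-empty line, with surrounding whitespace removed
countries = [line.strip() for line in column.splitlines() if line.strip()]
print(countries)  # prints ['Slovenia', 'Czech Republic', 'United Arab Emirates']
```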

And use this list in the next script:

>>> from geopy import geocoders

>>> g = geocoders.GeoNames()

>>> countries = ["Slovenia", "Czech Republic", ...]
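>>> for country in countries:
...     try:
...         placemark = g.geocode(country, exactly_one=False)[0]
...         # Coordinates may be numbers, so format them instead of concatenating
...         print "%s,%s,%s" % (placemark[0], placemark[1][0], placemark[1][1])
...     except Exception:
...         # Print just the name when no coordinates were found
...         print country
...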
Slovenia,46.151241,14.995463
Czech Republic,49.817492,15.472962
...

Now we can select all the results with the latitudes and longitudes of every country and copy them with Ctrl-C, Cmd-C, or right-click and Copy. Go to the first row of a new column in our spreadsheet and paste everything. We should see a paste dialogue; in it, check the option that separates the values by commas.

Paste the result comma separated


Once this is done, we have coordinates for almost all the countries. There may be some locations for which the script didn’t get the right coordinates (geopy raises an error and the script just prints the country name instead), like “Moldova (Republic of)” or “Georgia”. For these countries, after careful checking, the best thing to do is to retry with corrected names (trying “Moldova” instead of “Moldova (Republic of)”) or to look up the location in Wikipedia; for Georgia, for example, Wikipedia provides a link in the information box on the right-hand side with the exact coordinates. When the process is over, we remove the column with the names and sort the columns so that the latitude comes first, the longitude second, and the rest of the columns after that. Our data is almost ready. Finally, we need to save the spreadsheet as a CSV file so that it can be processed by a Python script that converts it into the JSON format WebGL Globe can handle.
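Retrying with a corrected name can be partly automated for names that carry a parenthetical qualifier. Below is a minimal sketch; name_candidates is a hypothetical helper and the regular expression is just one possible way to drop the text in parentheses.

```python
import re


def name_candidates(name):
    """Return the original name followed by a simplified variant without parentheses."""
    candidates = [name]
    simplified = re.sub(r"\s*\([^)]*\)", "", name).strip()
    if simplified and simplified != name:
        candidates.append(simplified)
    return candidates


print(name_candidates("Moldova (Republic of)"))  # prints ['Moldova (Republic of)', 'Moldova']
```

The results still need checking by eye: “Congo (Democratic Republic of the)” would be simplified to “Congo”, which is a different country in our list, so a shorter name can silently match the wrong place.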

Reading CSV Files

Instead of passing a list of countries to geopy, we can use our clean and normalized CSV file as input to produce the JSON file we need.

A CSV file is a plain-text format for representing tables. There are plenty of CSV dialects, but the most common one prints one row per line with fields separated by commas. For example, the following table produces the output shown below it.

Field 1              Field 2
Row 1 Value Cell 1   Row 1 Value Cell 2
Row 2 Value Cell 1   Row 2 Value Cell 2

And the output will be:

Field 1,Field 2
Row 1 Value Cell 1,Row 1 Value Cell 2
Row 2 Value Cell 1,Row 2 Value Cell 2

Depending on the case, you can choose a character other than “,” as the separator, or leave the header out. But what happens if a value itself contains commas? In that case you can escape them or wrap the entire value in double quotes.

"Row 1, Value Cell 1","Row 1, Value Cell 2"
"Row 2, Value Cell 1","Row 2, Value Cell 2"

You may then wonder what happens if a value needs to contain double quotes. In that case you can change the quoting character or escape the quotes with a backslash. This is the origin of all the CSV dialects. However, we will not go that deep here and will focus on reading CSV with Python. To do so, we use the standard “csv” library and call its “reader” function with a file object opened from disk. We can then iterate over the lines, each of which arrives as a list, and store every value in a variable during the iteration.

In our case every line has, in this order, the country name, the value of the Multidimensional Poverty Index, the number of people in poverty in thousands, and the intensity of deprivation. Note that our CSV file has no header, so we do not have to skip the first line. We will use three lists to store the values of our series and, finally, use the json library to write the JSON output to a file. The final poverty.py script, which processes the CSV file and produces the JSON file, is shown next:
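To see how Python’s csv library deals with commas and double quotes inside values, here is a minimal round-trip sketch. It assumes Python 3 (io.StringIO stands in for a file on disk) and uses made-up cell values; with the library’s default dialect, a double quote inside a value is written as two double quotes rather than escaped with a backslash.

```python
import csv
import io

# Made-up values containing a comma and double quotes
rows = [["Row 1, Value Cell 1", 'Row 1 "Value" Cell 2']]

# Write the rows to an in-memory text buffer instead of a real file
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back yields the original values, quotes and commas included
parsed = list(csv.reader(io.StringIO(text)))
print(parsed == rows)
```

Letting the library do both the writing and the reading means you rarely have to think about which dialect rule applies.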

import csv
import json
from geopy import geocoders

# Load the GeoNames geocoder
g = geocoders.GeoNames()

# Every CSV row is split into a list of values
file_name = "multidimensional_poverty_index_normalized_2011_ph2.csv"
rows = csv.reader(open(file_name, "rb"))

# Initialize the lists that will store our data
mpis = []  # Multidimensional Poverty Index
thousands = []  # People, in thousands, in a poverty situation
deprivations = []  # Intensity of Deprivation

# Iterate through all the rows in our CSV
for country, mpi, thousand, deprivation in rows:
    try:
        # Get the coordinates of the country
        place, (lat, lon) = g.geocode(country, exactly_one=False)[0]
        # Append the coordinates and the value to each series
        mpis = mpis + [lat, lon, mpi]
        thousands = thousands + [lat, lon, thousand]
        deprivations = deprivations + [lat, lon, deprivation]
    except Exception:
        # We skip countries that geopy is unable to process
        print "Unable to get coordinates for " + country

# Format the output
output = [
    ["Multidimensional Poverty Index", mpis],
    ["People affected (in thousands)", thousands],
    ["Intensity of Deprivation", deprivations]
]

# Generate the JSON file, closing it so that everything is written to disk
with open("poverty.json", "w") as json_file:
    json.dump(output, json_file)

And the JSON file poverty.json, using GeoNames, should look like:

[["Multidimensional Poverty Index", ["46.25", "15.1666667", "0", "49.75", "15.0", "0.01", "24.0", "54.0", "0.002", ... ]
...

Keep in mind that this script will skip some countries and print their names on the screen. If you choose a different provider in geopy, you will probably get slightly different coordinates and a different set of unrecognized country names.

Unable to get coordinates for Bolivia (Plurinational State of)
Unable to get coordinates for Congo (Democratic Republic of the)

Putting it all together

Now we have the poverty.json file, our input data for WebGL Globe. The last step is to set up the Globe and the input data file together. We need to download the webgl-globe.zip file and extract the directory named “globe” into a directory with the same name. We copy our poverty.json file into it and then edit the provided index.html to replace the occurrences of “population909500.json” with “poverty.json”, and make some other additions such as the names of the series. The resulting index.html, excluding the style block, should look like the following.

<!DOCTYPE HTML>
<html lang="en">
  <head>
    <title>WebGL Poverty Globe</title>
    <meta charset="utf-8">
  </head>
  <body>

  <div id="container"></div>

  <div id="info">
    <strong><a href="http://www.chromeexperiments.com/globe">WebGL Globe</a></strong>
    <span class="bull">&bull;</span> Created by the Google Data Arts Team
    <span class="bull">&bull;</span> Data acquired from <a href="http://hdr.undp.org/">UNDP</a>
  </div>

  <div id="currentInfo">
    <span id="serie0" class="serie">Multidimensional Poverty Index</span>
    <span id="serie1" class="serie">Population (in thousands)</span>
    <span id="serie2" class="serie">Intensity of Deprivation</span>
  </div>

  <div id="title">
    World Poverty
  </div>

  <a id="ce" href="http://www.chromeexperiments.com/globe">
    <span>This is a Chrome Experiment</span>
  </a>

  <script type="text/javascript" src="/globe/third-party/Three/ThreeWebGL.js"></script>
  <script type="text/javascript" src="/globe/third-party/Three/ThreeExtras.js"></script>
  <script type="text/javascript" src="/globe/third-party/Three/RequestAnimationFrame.js"></script>
  <script type="text/javascript" src="/globe/third-party/Three/Detector.js"></script>
  <script type="text/javascript" src="/globe/third-party/Tween.js"></script>
  <script type="text/javascript" src="/globe/globe.js"></script>
  <script type="text/javascript">

    if(!Detector.webgl){
      Detector.addGetWebGLMessage();
    } else {

      var series = ['Multidimensional Poverty Index','Population (in thousands)','Intensity of Deprivation'];
      var container = document.getElementById('container');
      var globe = new DAT.Globe(container);
      console.log(globe);
      var i, tweens = [];

      var settime = function(globe, t) {
        return function() {
          new TWEEN.Tween(globe).to({time: t/series.length},500).easing(TWEEN.Easing.Cubic.EaseOut).start();
          var y = document.getElementById('serie'+t);
          if (y.getAttribute('class') === 'serie active') {
            return;
          }
          var yy = document.getElementsByClassName('serie');
          for(i=0; i<yy.length; i++) {
            yy[i].setAttribute('class','serie');
          }
          y.setAttribute('class', 'serie active');
        };
      };

      for(var i = 0; i<series.length; i++) {
        var y = document.getElementById('serie'+i);
        y.addEventListener('mouseover', settime(globe,i), false);
      }

      var xhr;
      TWEEN.start();
      xhr = new XMLHttpRequest();
      xhr.open('GET', 'poverty.json', true);
      xhr.onreadystatechange = function(e) {
        if (xhr.readyState === 4) {
          if (xhr.status === 200) {
            var data = JSON.parse(xhr.responseText);
            window.data = data;
            for (i=0;i<data.length;i++) {
              globe.addData(data[i][1], {format: 'magnitude', name: data[i][0], animated: true});
            }
            globe.createPoints();
            settime(globe,0)();
            globe.animate();
          }
        }
      };
      xhr.send(null);
    }
  </script>
  </body>
</html>

Finally, to see the result, you must put all the files on a static web server and browse to its URL. The fastest way to do this is to run a local web server with Python. You will be the only one able to see the globe this way, but managing HTML files and publishing small websites is outside the scope of this lesson. Run the next command from the globe directory itself.

$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

Then go to http://localhost:8000 and navigate to index.html to see the result.

Globe before normalization


If it looks like this, something is wrong with some of the series. Remember that we need to normalize the values so that they fall in the range 0 to 1. To do that, we open our CSV file again as a spreadsheet and calculate the sum of each column we want to normalize; then we create a new column in which every cell is the old value of the cell divided by the total sum of the old column. We repeat the process with the other two columns and replace the old columns with the new ones. Finally, we rerun the steps to generate a new JSON file and try again.
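The same sum-based normalization can be done in Python instead of the spreadsheet. The following is a minimal sketch; normalize_by_sum is a hypothetical name and the figures are made up for illustration.

```python
def normalize_by_sum(column):
    """Divide every value by the column total so each result falls between 0 and 1."""
    total = float(sum(column))
    if total == 0:
        # Avoid dividing by zero when the whole column is 0
        return [0.0 for _ in column]
    return [value / total for value in column]


thousands = [100, 300, 600]  # made-up figures for illustration
print(normalize_by_sum(thousands))  # prints [0.1, 0.3, 0.6]
```

Applied to a column of non-negative values, the results add up to 1, so every value necessarily stays within the 0 to 1 range.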

Now, you can click on World Poverty to see everything properly working.

Suggested Readings

The Python Standard Library Documentation

Lutz, Learning Python

  • Ch. 9: Tuples, Files, and Everything Else
