Calculating an average face

Example of average face

In the comings and goings of my thesis, now that I am certain that my main topic was way too big (my supervisor liked it, but he said it was more like a whole research curriculum than a single thesis), I am trying to focus and cover only a fragment of the initial goal. Big Culture still sounds in my head since that October of 2012 when I first thought of the idea. However, my thesis won't be a monograph anymore, but a set of articles related to Big Data in the Humanities.

One of these articles is already in progress, and its topic is related to the representation of faces in world painting. An abstract has been sent to DH 2014, hosted at the University of Lausanne, Switzerland. After successful preliminary work for DH 2013, this time I have been working on a deeper analysis of our proudly collected data set of 47k faces in paintings across time. As part of the research process, and as usual in any paper conceived in the CulturePlex, there is some programming involved. In this case a lot of matplotlib, scipy, numpy, IPython, and Pandas (and even PIL/Pillow), a set of Python libraries and tools that has quickly become our main stack for data analysis.

One interesting challenge that came out of this research was the generation of an average face. The first thing I noticed was that my machine was not able to handle that amount of images due to its limited RAM (4GB), so I asked SHARCNet for help, and Prof. Mark Daley kindly offered me a user account on one of the 32GB Mac machines available. I installed IPython with all the libraries inside a virtualenv, copied all the files needed, and then started the Notebook.

$ ipython notebook --pylab inline --no-browser --ip=<YOUR PUBLIC IP HERE> --NotebookApp.password=`python -c "from IPython.lib import passwd; print(passwd())"`

Among the features that I have available for a face (after applying face detection algorithms), there is the centroid of the face. From that point, and using the height and width as well, I can trace a rectangle that delimits the boundaries of the face. Then I center all the faces by their centroid and resize all the images to have the same height. In order to calculate the average face, I first implemented a solution that made use of opacity/alpha levels in matplotlib, but that seems to be limited to 256 layers (I don't know if it can be increased), works pretty slowly, and consumes all the resources of the machine really fast. After trying some other methods, I came up with the idea that an average image is as simple as a standard statistical mean calculated for every single pixel. Images were in the RGB color model, so the corresponding matrices had 3 dimensions. If I had used grey-scale images the whole process would have been 3 times faster, although for the sizes of images that I am handling (faces of 200 by 200 pixels), there is almost no difference.
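
In other words, once all the faces are centered and cropped to the same size, the average is just the mean over a stack of pixel matrices. A minimal sketch of the idea with numpy (the random arrays here simply stand in for the real face images):

import numpy as np

# Three stand-in 200x200 RGB "faces" with random pixel values
imgs = [np.random.randint(0, 256, (200, 200, 3)) for _ in range(3)]

# Per-pixel (and per-channel) mean over the stack, cast back to 8-bit values
avg_img = np.array(imgs).mean(axis=0).astype(np.uint8)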

A simplified version of the code used is shown below, although it is subject to performance improvements.

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image as PILImage
from scipy import ndimage


def face_detail(face):
    mode = 'RGB'
    desired_height = 250
    center_at = [400, 400]
    img = faces.load_image(face)
    features = faces.load_features(face)
    # Centroid of the face, as percentages of the painting size
    # (the exact feature key is an assumption here)
    center_pct = features['center_pct']
    height = features['height']
    width = features['width']
    painting_height = features['painting_height']
    painting_width = features['painting_width']
    # Resizing: scale the whole painting so the face gets the desired height
    pil_img = PILImage.fromarray(img, mode)
    resize_height = 1.0 * painting_height * desired_height / height
    resize_width = 1.0 * painting_width * desired_height / height
    resized_img = pil_img.resize(
        (int(resize_width), int(resize_height))
    )
    # Shifting: move the centroid of the face to the center_at point
    shift_point = [
        center_at[1] - (center_pct[1] * resize_height / 100.0),
        center_at[0] - (center_pct[0] * resize_width / 100.0),
    ]
    shifted_img = ndimage.shift(
        np.array(resized_img),
        shift_point + [0],  # no shift along the color channel axis
    )
    # Cropping: keep a 400x400 square around the now centered face
    xlim = slice(int(center_at[0] * 0.5), int(center_at[0] * 1.5))
    ylim = slice(int(center_at[1] * 0.5), int(center_at[1] * 1.5))
    cropped_img = shifted_img[xlim, ylim]
    return cropped_img

def get_average_face(faces):
    imgs = []
    center_at = [400, 400]
    for index, face in faces.iterrows():
        try:
            img = face_detail(face)
            if img is not None:
                # Adding images: force a common 400x400x3 shape
                array_img = np.array(img)
                array_img.resize(center_at + [3])
                imgs.append(array_img)
        except Exception as e:
            msg = "Error found when processing image {}:\n\t{}"
            print(msg.format(face, e))
    # Averaging: per-pixel mean over the stack of faces
    avgface = np.array(imgs).mean(axis=0)
    avgface = avgface.astype(np.uint8)
    return avgface

fig, ax = plt.subplots(1, 1, figsize=(10, 10), dpi=300, facecolor="none")
average_face = get_average_face(faces)
ax.imshow(average_face, interpolation='bicubic')

Some other problems still need to be addressed, e.g., face rotations. The use of affine and projective transformations can solve that, as well as replace the current method of resizing and shifting to re-center all the faces.

Filed under Tasks

The second course that I have to design

I was lurking on my co-workers' blog posts when I realized that I had to pick a topic for my second course, a requirement of the program. This time the course can be designed for the graduate level (actually, the first one could have been as well).

In the last few months, I have spent a considerable amount of time reading about big data, data mining, machine learning, and statistical analysis, as well as art history, women's rights movements, and the representation of body parts. All of this is for my current research on the representation of the human face in world painting, which is expected to materialize first in an abstract for DH 2014, and later in the first chapter of my dissertation. The second and third chapters of my thesis may include an authorship attribution study of a very famous Spanish novel, and a computer-based sentiment and meter analysis of a set of a specific kind of poetry plays.

All this work is being carried out thanks to extensive documentation and reading of primary and secondary sources, as well as by dealing with considerable amounts of data generated mainly ad hoc for these purposes. In the process, I started to follow a certain workflow: 1) data collection and curation, 2) data cleansing, 3) automatic annotation of metadata, 4) data formatting, and finally 5) data analysis employing a varying set of tools and concepts borrowed from Computer Science.

Consequently, that made me think that my second course for this PhD will be on Data for Humanities Research. So, let's talk to my supervisor to see if he is as happy as I am with this topic 😀

Filed under Tasks, Topics

The amazing GraphConnect and Drivers Hackathon

One morning back in September I got an e-mail. At first glance it seemed to have been written to a different person. The sender was Emil Eifrem, one of the guys behind Neo4j, and he was inviting me to San Francisco. After reading it like three times I realized that the e-mail was actually addressed to me. They were inviting all the Neo4j drivers' authors to a two-day hackathon at their office in San Mateo, and then to the GraphConnect conference. And what was even more awesome, they were going to take care of everything, from transportation to accommodation, and even meals. I don't really know if that is a common practice in the real-world market outside of college, but compared to our PhD student situation, in which we have to beg for our $400 per term for attending conferences, and only if we are speakers, the gesture from Neo4j is even nicer.

The day came, so I flew to San Francisco from my city with a short stop in Toronto. During the flight I met an old woman who was a real character, and she told me everything about herself and her daughter, a partier and free spirit who had gone to Burning Man for the last 15 years. And what a coincidence, they were from London, Ontario too!

When I first arrived at SFO, I headed straight to the Neo4j office in San Mateo. But the taxi driver was an annoying and impolite man who started to yell at me because I didn't know how to reach my destination. I was like, what? Man, you are the taxi driver, calm down! After arriving two hours after the first meeting had started, I joined the group in the discussion about future data types in Neo4j. After that we introduced ourselves and the discussion kept going. It wasn't one of those sponsored things where the company that organizes it already knows every single detail of the implementation and the whole event is just a farce. Not at all. The Neo4j team really wanted to know our opinions, and after some debate, they even adopted some of our thoughts! I have to say that I loved that :)

The rest of the day, as well as the following day, was for planning and coding the individual milestones for our respective drivers. But we also had some time to hang out a bit with the rest of the team. And everything was accompanied by meals, including a special dinner the day before the conference that featured real jamón serrano! I was amazed to find that random product at a dinner in San Francisco.

The Hackathon attendees: Back row: Nigel Small, Tobias Lindaaker, Max DeMarzi, Jason McAllen, Stefan Armbruster, Michael Hunger, Tatham Oddie, Andreas Ronge, Elad Olasson, Philip Rathle, Javier de la Rosa Front row: Josh Adell, Aseem Kishore, Peter Neubauer, Matt Luongo, Wes Freeman

At that dinner, with the whole team, the drivers' authors, and the GraphConnect speakers, we took the picture that accompanies these lines. And I almost forgot, I got a prize for the most innovative community contribution 😀

The GraphConnect conference started the next day. And I really liked it, because the talks were very varied, from technical introductions and discussions to real-world cases where graph databases had proven to be a better solution than relational databases. I also had the pleasure of meeting in person and talking a bit with Josh Adell, Matt Luongo, Aseem Kishore, Elad Olasson, Peter Neubauer, Kenny Bastani, Andrés Taylor, Max De Marzi and Alberto Perdomo. Really great people.

But all good things come to an end. Even though on the night of the conference the whole gang went to a party at the Embarcadero, my flight was really early the next morning, so I went to bed soon. Although I still had the chance to rent a bike with Josh and ride along the shore, getting close to the Golden Gate Bridge. A very enjoyable ride, I must say.

And that was all. They ended by saying that it might become a good tradition to see each other every year. So, let's see how things go in 2014 for Neo4j and neo4j-rest-client 😀

This time I got closer to the bridge. Next time I’ll cross it, I promise!

Filed under Events


Well, it is that time of the year again, so I might start writing again on a bi-weekly basis :-) The summer was good: I was at DH2013 and got a lot of good feedback on my project about the representation of the face in world painting, as well as nice outcomes for the rest of the team, who traveled to Nebraska by van. Yes, by van. I have to admit that parts of it were fun. I also went to Iceland to present a poster at DATA2013 about our own graph database management system, SylvaDB. Neither of those conferences was what I was expecting.

DH2013, which stands for Digital Humanities 2013, wasn't engaging enough. To tell the truth, there were really good presentations and talks, although usually the bad speakers were working on really interesting projects, and the poor projects had very professional sellers. On the other hand, the digital humanities thing (which doesn't have anything to do with humanitarian work –disclaimer: anecdote sponsored by the Homeland Security customs officer of the United States of America, right before crossing the border) is becoming its own thing, its own cult with its own rituals and beliefs. And as in any other cult, there are people doing stuff, the practitioners, and people just blowing the bubble and doing some meta-discussion, the believers.

It looks to me as if the Digital Humanities thing were now officially established and everybody were just looking to keep it as it is (with some honorable exceptions, like Isabel Galina). In contrast, I enjoyed the talks at the DATA conference a lot: they were clear, wonderful, and to the point. It is a shame that the conference covered so many topics and was so specialized. Or maybe it's just that I've lost the habit of computer-related conferences. Anyhow, the approach I saw to relating academia to software development looked very money-oriented to me. That's not necessarily negative, but what about research for knowledge or for the pleasure of solving a problem, as we see in other fields? Well, that DATA conference didn't have any of that.

In the end, software is led by industry in order to make money, and digital humanities is led by academia in order to stay the same once settled.

In this scenario, I think it's better to just do my research as well as I can. There will always be someone to put labels on my work 😀

Filed under Debates

Computer Tools for Linguists

My last blog entry was about a new course I would like to design and teach. But some of you were thinking, "hey, what happened to the other course, the one about linguists?" And you are right. Not one word since then. But now, I am finally ready to release the course to the world. I will defend it on April 11th, but all the content is ready. So, here you go :-) (it is only the content; exercises, syllabus and defense document are not included… until I pass the defense)

The course is a hands-on and pragmatic introduction to the computer tools that can be used in Linguistics, from basic computer concepts, to give an understanding of how machines work, to the applications, programming languages and programs that can make the life of the linguist researcher a little bit easier. The course covers subjects such as computer architecture, programming languages, regular expressions, the general-purpose language Python, the statistical language R, and the set of tools for child language analysis, CHILDES.

Feel free to report any error!


Filed under Tasks

Thinking about teaching web development to humanists

Now that my first course on computer tools for linguists is on its way (I already have almost half of the lessons designed), it is time to think about the next one. The CulturePlex laboratory is, so far, a multidisciplinary environment where people have backgrounds in different fields, mixing Computer Science-like disciplines with Humanities-like disciplines. However, because of the rise of computers in every aspect of our lives, programming literacy is increasingly becoming a demanded skill, to the point that how-to-code courses are now a must for researchers across academia. In the field of the Humanities, people are using computer tools to formulate new questions as well as to solve both new and old ones. This trend is usually called and marketed as Digital Humanities, but the term is under discussion to such an extent that some people even consider the discussion itself to be Digital Humanities. But more than that, it is really about the crystallization of the needs of both current and future researchers. Therefore, our goal is to stop being just a multidisciplinary laboratory, and start being a poly-disciplinary one.

And in order to fill this gap, my second course addresses the needs of digital humanists through an intensive intersession course. It will cover all the aspects of web development, from scratch and zero knowledge of programming, to pretty complex web sites with some logic and even persistence in relational databases. The name, Web Development From Scratch for Humanists, says it all. After finishing this course, students will be able to take an arbitrary data set from their investigations and build a query-able website to show the data to the world.

And to do so, a preliminary outline for this course is shown below:

Week 1

Day 1: Introduction to Computers and Architecture
Day 2: Programming Languages and Python. Conditionals, Loops, and Functions
Day 3: Data Types. Recursion
Day 4: Libraries and Object-Oriented Programming

Week 2

Day 1: Internet and the Web
Day 2: Frameworks. Introduction to Django
Day 3: Views and Templates
Day 4: HTML Fundamentals

Week 3

Day 1: CSS
Day 2: Introduction to Javascript
Day 3: jQuery and AJAX
Day 4: Bootstrap and D3.js

Week 4

Day 1: Introduction to Relational Databases
Day 2: Schemas and Models
Day 3: Decorators and User Authentication
Day 4: Migrations

Week 5

Day 1: REST Interfaces
Day 2: Agile Integration
Day 3: Git and Version Control Systems
Day 4: Test Driven Development

And also, as an experiment, we will probably run this course in the lab first, just to see if it is too ambitious and simply unrealistic, or, on the contrary, something that we can achieve with a lot of effort. Time will tell.

Filed under Tasks

Río and Reykjavik

Recently, a paper I collaborated on was accepted at Digital Humanities 2013, to be hosted in Lincoln, Nebraska. Our paper is titled "Not Exactly Prima Facie: Understanding the Representation of the Human Through the Analysis of Faces in World Painting." This research makes use of face recognition techniques in order to identify similarities in faces across time.

The representations of the human face contain a virtual archive of human expressions and emotions that can help decipher, through a science of the face, various traits of the human condition and its evolution through time and space. In this project we aim to explore this through the use of powerful tools of facial recognition, data mining, graph theory, visualization, and cultural history. Our methodology takes advantage of these tools and concepts to answer questions about periods in art history, such as the significance of the Baroque as a culture derived from human expansion, and the cultural meaning of the progressive erasing of the human face from modern painting. Quantitative analysis of huge amounts of data has been shown to provide answers to new and different questions that otherwise couldn't have been considered. Our study takes some ideas from the concept of Culturomics, creating a set of more than 123,500 paintings from all periods of art history and applying the same face recognition algorithm used today by Facebook in its photo-tagging system. The result is a set of over 26,000 faces ready to be analyzed according to a variety of features extracted by the algorithm. We found a mean of approximately 1 face for every 5 paintings.

But what I am most excited about lately are the submissions we also made a week ago to two target conferences, DATA and WWW, to be hosted in Reykjavik, Iceland, and Río de Janeiro, Brazil. It's the first time I have collaborated on papers sent to highly technical conferences, so I don't know what the chances are of being accepted at at least one of them. I'll just cross my fingers and wait until the notification deadline comes.

Filed under Events

Creating a Globe of Data (PH2)

Lesson Goals

This is a lesson designed for intermediate users, although beginner users should be able to follow along.

In this lesson we will cover the following main topics:

  • Using Python to produce a visualization of the World Poverty Index on an interactive globe.
  • Transforming CSV data into JSON notation in Python.
  • Getting spatial coordinates from Google and other providers through the geopy library.

After seeing the basics of Python and how it can help us in our daily work, we will introduce one of the many options for data visualization. In this case, we will take a data source in CSV format and process it to transform it into JSON notation. Finally, we will represent all the information on a world globe designed for modern browsers using the WebGL technology. During the process, we will need to get the spatial coordinates of countries across the world. Before starting, you can take a look at the final result of this unit on World Poverty, so don't worry about all the new names mentioned above; we will explain them below.

The Globe of Data

Since the end of 2009, some browsers have started to implement an incipient specification for rendering 3D content on the Web. Although it is not yet part of the W3C's specifications –the W3C is the organization that proposes, defines and approves almost all Web standards–, WebGL, as it is called, is supported by all major browsers and the industry.

WebGL is the most recent way to render 3D representations on the Web, and with it a new form of data representation becomes available. In fact, there are artists, scientists, game designers, statisticians and so on creating amazing visualizations from their data.

Google WebGL Globe

One of these new forms of representation was made by Google. It is called the WebGL Globe and is intended to show geo-located statistical data.


JSON, an acronym for JavaScript Object Notation, is not only a format to represent data in Javascript, but also the data type that the WebGL Globe needs in order to work. In this format, a list is enclosed in brackets, "[" to start and "]" to end. The data series for the WebGL Globe is a list of lists; every one of these inner lists has two elements: the first one is the name of the series and the second one is another list containing the data. Although it is good to know how JSON lists are encoded, there are Python libraries that do that conversion for you, so you only have to handle pure Python objects. The next code snippet shows how native Python lists and dictionaries are transformed into JSON.

>>> import json

>>> json.dumps([1, 2, 3])
    '[1, 2, 3]'

>>> json.dumps({"key1": "val1", "key2": "val2"})
    '{"key2": "val2", "key1": "val1"}'

The data for the WebGL Globe is written comma separated, so you must provide your information as sets of three elements: the first is the geographical coordinate for latitude, the second is the same for longitude, and the third is the value of the magnitude you would like to represent, normalized between 0 and 1. This means that if we have the values 10, 50 and 100 for magnitudes, they will have to be translated into 0.1, 0.5 and 1.
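
For instance, a quick way to perform that normalization in Python (a minimal sketch that divides by the maximum value, as in the example above) could be:

>>> magnitudes = [10, 50, 100]
>>> [1.0 * m / max(magnitudes) for m in magnitudes]
[0.1, 0.5, 1.0]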

The only thing you now need is to split up your data into several series of latitude, longitude and magnitude in JSON format, as the next example illustrates:

var data = [
    [ 'seriesA', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ] ],
    [ 'seriesB', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ] ]
];

That said, we can create the structure in Python with the format described above and then convert it to JSON using the json library. Since JSON is actually handled in Python as a string, and since it is easy to produce syntax errors if you try to write JSON by hand, we recommend creating the objects in Python and then converting them into JSON, so we can guarantee that the final JSON is free of errors.

>>> import json

>>> data = [
 ...:     ["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]],
 ...:     ["seriesB", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]],
 ...:     ...
 ...: ]

>>> json.dumps(data)
'[["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]], ["seriesB", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]], ...]'

The Data Set

Let's say we want to represent information from the Human Poverty Index. We need to download the data in the format provided by the United Nations' site for the Multidimensional Poverty Index, which has replaced the old Human Poverty Index. Once we have the spreadsheet document, it's time to open it and collect just the data we need: go to page 5 of the workbook, and copy and paste the cells into a clean spreadsheet. We remove what we don't need, like titles, captions, extra columns, etc., and leave just the country names, the second "Value" column under the "Multidimensional Poverty Index" cell, the population under poverty in thousands, and the "Intensity of deprivation" column. The next step is to remove the rows with no data for those indicators, marked as "..". After doing this, we should have a document with 4 columns and 109 rows. Then remember to normalize all the values between 0 and 1. Or you can simply download the cleaned and normalized file in CSV or Excel (XLS) format to avoid getting lost in spreadsheet manipulation.


Spreadsheet before normalizing

But, although we have the names of the countries, we still need their geographical coordinates. There are several services that provide the latitude and longitude for a given address; in the case of having just the name of a country, the main coordinates, those of its capital, are provided. We will use geopy, a Python library able to connect to different providers and get several kinds of information. To install geopy, a terminal or console is needed, and the installation itself is as easy as a single command:

$ easy_install geopy

After that, we can open a terminal with the regular Python interpreter, or an interactive console like IPython, and get the latitude and longitude of, for instance, "Spain", with the following commands:

>>> from geopy import geocoders

>>> g = geocoders.Google()

>>> g.geocode("Spain")
(u'Spain', (40.463667000000001, -3.7492200000000002))

By default, geopy will try to get only one match, but you can easily change that behaviour by setting the argument exactly_one to False. Then geopy will return a list of elements, and it will be your task to pick just one. Google has a rather low limit of queries per day, so you should try a different provider for the geocoder if you reach that limit.

>>> from geopy import geocoders

# Using GeoNames as provider
>>> g = geocoders.GeoNames()

# Getting the whole list of matches and getting just one
>>> g.geocode("Spain", exactly_one=False)[0]
(u'Spain', (40.463667000000001, -3.7492200000000002))

In this way, we can build a list of our countries from our spreadsheet and pass it to the script below. To build the list of countries you can simply copy the column of countries into your code editor, and replace every newline ('\n') with '", "' (adding the opening and closing quotes at both ends), so that the result is something like:

["Slovenia", "Czech Republic", "United Arab Emirates", "Estonia", "Slovakia", "Hungary", "Latvia", "Argentina", "Croatia", "Uruguay", "Montenegro", "Mexico", "Serbia", "Trinidad and Tobago", "Belarus", "Russian Federation", "Kazakhstan", "Albania", "Bosnia and Herzegovina", "Georgia", "Ukraine", "The former Yugoslav Republic of Macedonia", "Peru", "Ecuador", "Brazil", "Armenia", "Colombia", "Azerbaijan", "Turkey", "Belize", "Tunisia", "Jordan", "Sri Lanka", "Dominican Republic", "China", "Thailand", "Suriname", "Gabon", "Paraguay", "Bolivia (Plurinational State of)", "Maldives", "Mongolia", "Moldova (Republic of)", "Philippines", "Egypt", "Occupied Palestinian Territory", "Uzbekistan", "Guyana", "Syrian Arab Republic", "Namibia", "Honduras", "South Africa", "Indonesia", "Vanuatu", "Kyrgyzstan", "Tajikistan", "Viet Nam", "Nicaragua", "Morocco", "Guatemala", "Iraq", "India", "Ghana", "Congo", "Lao People's Democratic Republic", "Cambodia", "Swaziland", "Bhutan", "Kenya", "Sao Tome and Principe", "Pakistan", "Bangladesh", "Timor-Leste", "Angola", "Myanmar", "Cameroon", "Madagascar", "Tanzania (United Republic of)", "Yemen", "Senegal", "Nigeria", "Nepal", "Haiti", "Mauritania", "Lesotho", "Uganda", "Togo", "Comoros", "Zambia", "Djibouti", "Rwanda", "Benin", "Gambia", "Côte d'Ivoire", "Malawi", "Zimbabwe", "Ethiopia", "Mali", "Guinea", "Central African Republic", "Sierra Leone", "Burkina Faso", "Liberia", "Chad", "Mozambique", "Burundi", "Niger", "Congo (Democratic Republic of the)", "Somalia"]

And use this list in the next script:

>>> from geopy import geocoders

>>> g = geocoders.GeoNames()

>>> countries = ["Slovenia", "Czech Republic", ...]
>>> for country in countries:
...     try:
...         placemark = g.geocode(country, exactly_one=False)[0]
...         print placemark[0] + "," + str(placemark[1][0]) + "," + str(placemark[1][1])
...     except:
...         # If the geocoder cannot find the country, just print its name
...         print country
Czech Republic,49.817492,15.472962

Now, we can select all the results corresponding to the latitudes and longitudes of every country and copy them with Ctrl-C, Cmd-C, or right-click and copy. Then we go to our spreadsheet, click on the first row of a new column, and paste everything. A dialogue for pasting the data should appear; in it, check the option to split the values by commas.

Paste the result comma separated

Once this is done, we have the coordinates for almost all the countries. There could be some locations for which the script didn't get the right coordinates (geopy raises an error and the script just prints the country name instead), like "Moldova (Republic of)" or "Georgia". For these countries, and after careful supervision, the best thing to do is to run a few more tries with fixed names (trying "Moldova" instead of "Moldova (Republic of)"), or just look the location up on Wikipedia –for Georgia, for example, Wikipedia provides the exact coordinates in the information box on the right side. When the process is over, we remove the columns with the names and reorder the columns so that latitude comes first, longitude second, and the rest of the columns after that. We almost have the data prepared. Finally, we need to save the spreadsheet as a CSV file so it can be processed by a Python script that converts it into the JSON format that the WebGL Globe is able to handle.
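
One way to script those retries is to keep a small dictionary of replacement names and query the geocoder with the fixed name when there is one (a minimal sketch; apart from the Moldova case mentioned above, the replacements you need will depend on your provider, and the Tanzania entry is just a hypothetical example):

>>> fixes = {
...     "Moldova (Republic of)": "Moldova",
...     "Tanzania (United Republic of)": "Tanzania",
... }
>>> country = "Moldova (Republic of)"
>>> placemark = g.geocode(fixes.get(country, country), exactly_one=False)[0]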

Reading CSV Files

Instead of passing a list of countries to geopy, we can use our clean and normalized CSV file as input to produce the JSON file we need.

A CSV file is a data format for storing tables as plain text. There are plenty of dialects of CSV, but the most common one prints one row per line with the fields separated by commas. For example, the following table produces the output shown below it.

Field 1              Field 2
Row 1 Value Cell 1   Row 1 Value Cell 2
Row 2 Value Cell 1   Row 2 Value Cell 2

And the output will be:

Field 1,Field 2
Row 1 Value Cell 1,Row 1 Value Cell 2
Row 2 Value Cell 1,Row 2 Value Cell 2

And depending on the case, you can choose which character will be used as a separator instead of ",", or just leave the header out. But what happens if you need to print commas? Well, you can escape them or just wrap the entire value in double quotes.

"Row 1, Value Cell 1","Row 1, Value Cell 2"
"Row 2, Value Cell 1","Row 2, Value Cell 2"

And again you may wonder what to do if you need to print double quotes. In that case you can change the quoting character or escape the quotes. This is the origin of all the CSV dialects. However, we are not covering dialects in that much depth here; we will focus on reading CSV through Python. To do so, we use the standard csv library and invoke its reader function with a file object after opening the file from disk. Once this is done, we can just iterate over every line as a list and store every value in a variable on each iteration.
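
As a quick illustration, the csv module already understands the quoting rules described above, so a quoted value with embedded commas comes back as a single field:

>>> import csv
>>> from StringIO import StringIO
>>> quoted = '"Row 1, Value Cell 1","Row 1, Value Cell 2"'
>>> list(csv.reader(StringIO(quoted)))
[['Row 1, Value Cell 1', 'Row 1, Value Cell 2']]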

In our case every line has, in this order, the name of the country, the value for the Multidimensional Poverty Index, the value for thousands of people in a poverty situation, and finally the value for the intensity of deprivation; the script asks geopy for the latitude and longitude of each country, as we did before. Note that our CSV file has no header, so we do not have to skip the first line. We will use three lists to store the different values of our series and finally, using the json library, we will write a JSON output to a file. The final script that processes the CSV file and produces the JSON file is detailed next:

import csv
import json
from geopy import geocoders

# Load the GeoNames geocoder
g = geocoders.GeoNames()

# Every CSV row is split into a list of values
file_name = "multidimensional_poverty_index_normalized_2011_ph2.csv"
rows = csv.reader(open(file_name, "rb"))

# Init the lists that will store our data
mpis = []  # Multidimensional Poverty Index
thousands = []  # People, in thousands, in a poverty situation
deprivations = []  # Intensity of Deprivation

# Iterate through all the rows in our CSV
for country, mpi, thousand, deprivation in rows:
    try:
        # Get the coordinates of the country
        place, (lat, lon) = g.geocode(country, exactly_one=False)[0]
        # Fill the series lists
        mpis = mpis + [lat, lon, mpi]
        thousands = thousands + [lat, lon, thousand]
        deprivations = deprivations + [lat, lon, deprivation]
    except:
        # We ignore countries that geopy is unable to process
        print "Unable to get coordinates for " + country

# Format the output as a list of [name, data] series
output = [
    ["Multidimensional Poverty Index", mpis],
    ["People affected (in thousands)", thousands],
    ["Intensity of Deprivation", deprivations],
]

# Generate the JSON file
json_file = open("poverty.json", "w")
json.dump(output, json_file)
json_file.close()

And the JSON file poverty.json, using GeoNames, should look like this:

[["Multidimensional Poverty Index", ["46.25", "15.1666667", "0", "49.75", "15.0", "0.01", "24.0", "54.0", "0.002", ... ]

Take into account that this script will omit some countries and print their names on the screen. If you choose a different provider in geopy, you will probably get slightly different coordinates and a different set of unrecognized country names.

Unable to get coordinates for Bolivia (Plurinational State of)
Unable to get coordinates for Congo (Democratic Republic of the)

Putting it all together

Now we have the poverty.json file, our input data for the WebGL Globe. The last step is to set up the Globe and the data input file together. We need to download the WebGL Globe package and extract the directory named "globe" into a directory with the same name. Into it, we copy our poverty.json file, and then we edit the provided index.html to replace the occurrences of "population909500.json" with "poverty.json", and make some other additions like the names of the series. The resulting index.html, excluding the style block, should look like the one below.

<html lang="en">
    <title>WebGL Poverty Globe</title>
    <meta charset="utf-8">

  <div id="container"></div>

  <div id="info">
    <strong><a href="">WebGL Globe</a></strong>
    <span class="bull">&bull;</span> Created by the Google Data Arts Team
    <span class="bull">&bull;</span> Data acquired from <a href="">UNDP</a>

  <div id="currentInfo">
    <span id="serie0" class="serie">Multidimensional Poverty Index</span>
    <span id="serie1" class="serie">Population (in thousands)</span>
    <span id="serie2" class="serie">Intensity of Deprivation</span>

  <div id="title">
    World Poverty

  <a id="ce" href="">
    <span>This is a Chrome Experiment</span>

  <script type="text/javascript" src="/globe/third-party/Three/ThreeWebGL.js"></script>
  <script type="text/javascript" src="/globe/third-party/Three/ThreeExtras.js"></script>
  <script type="text/javascript" src="/globe/third-party/Three/RequestAnimationFrame.js"></script>
  <script type="text/javascript" src="/globe/third-party/Three/Detector.js"></script>
  <script type="text/javascript" src="/globe/third-party/Tween.js"></script>
  <script type="text/javascript" src="/globe/globe.js"></script>
  <script type="text/javascript">

    } else {

      var series = ['Multidimensional Poverty Index','Population (in thousands)','Intensity of Deprivation'];
      var container = document.getElementById('container');
      var globe = new DAT.Globe(container);
      var i, tweens = [];

      var settime = function(globe, t) {
        return function() {
          new TWEEN.Tween(globe).to({time: t/series.length},500).easing(TWEEN.Easing.Cubic.EaseOut).start();
          var y = document.getElementById('serie'+t);
          if (y.getAttribute('class') === 'serie active') {
          var yy = document.getElementsByClassName('serie');
          for(i=0; i<yy.length; i++) {
          y.setAttribute('class', 'serie active');

      for(var i = 0; i<series.length; i++) {
        var y = document.getElementById('serie'+i);
        y.addEventListener('mouseover', settime(globe,i), false);

      var xhr;
      xhr = new XMLHttpRequest();'GET', 'poverty.json', true);
      xhr.onreadystatechange = function(e) {
        if (xhr.readyState === 4) {
          if (xhr.status === 200) {
            var data = JSON.parse(xhr.responseText);
   = data;
            for (i=0;i<data.length;i++) {
              globe.addData(data[1], {format: 'magnitude', name: data[0], animated: true});

Finally, to see the result, you must put all the files on a static web server and browse to its URL. The fastest way to do this is to run a local web server in Python; you will be the only one able to see the globe this way, but managing HTML files and publishing small websites is out of the scope of this lesson. Run the next command from within the globe directory itself.

$ python -m SimpleHTTPServer
Serving HTTP on port 8000 ...

Then, go to http://localhost:8000 and navigate to index.html to see the result.

Globe before normalization

If it looks like this, it is because there is something wrong with some of the series. Remember that we need to normalize the values so that they fall in the range 0 to 1. To do that, we open our CSV file again as a spreadsheet, calculate the sum of each column that we want to normalize, and then create a new column in which every single cell is the result of dividing the old value of the cell by the total sum of all the values in the old column. We repeat the process with the other two columns and replace the old ones with just the values in the new ones. We then run the steps to generate a new JSON file and try again.
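
The division that every new cell performs is simply this (a sketch with made-up numbers):

>>> column = [0.05, 0.20, 0.25]   # hypothetical values from one of the columns
>>> total = sum(column)
>>> [value / total for value in column]
[0.1, 0.4, 0.5]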

Now, you can click on World Poverty to see everything properly working.

Suggested Readings

The Python Standard Library Documentation

Lutz, Learning Python

  • Ch. 9: Tuples, Files, and Everything Else


Filed under Topics

The experience of the FirefoxOS Apps Day in Toronto #firefoxos

Last Saturday was the FirefoxOS Apps Day. It was a worldwide event, from Argentina to Romania, from Spain to Japan. Here in Canada there were two cities hosting the event, Vancouver and Toronto. Basically, it was an event to present the brand new FirefoxOS, an open operating system by Mozilla and the Spanish telco Telefonica.

The event was hosted in Mozilla's amazing office between Chinatown and downtown Toronto. It was pretty informal because it was intended to be more of a hackathon than a conference. Even so, there were a couple of slots for speakers. The first one was John Karahalis, giving a brief introduction to FirefoxOS. After him, Jennifer Fong-Adwent and Jonathan Lin of Mozilla shared tools and technology built to extend and support the new platform. Then the hacking part started and ran for 4 hours. Our turn came after the hacking time, and we presented Dr. Glearning for FirefoxOS, talking a bit about motivation and goals, but focusing on the problems we found developing for FirefoxOS in both modes (hosted and packaged apps), and how we came up with solutions and workarounds.

Later, it was time for showing the demos built during the hacking time. Even if 4 hours is not enough time to develop something complete, I was amazed by two different things: first, the creativity of the developers, from utility apps like the ones you use every day to check the trolley times, to games in cavalier projection; and second, the good taste in matters of design, especially one app on the Pomodoro Technique and another one on compasses. And everything in just 4 hours!

My conclusion, after seeing the success of the event in other parts of the world as well, is that the event was a perfect starting point to captivate developers with the simplicity of developing for FirefoxOS. But Mozilla still has a long way ahead. One thing for the future could be a World App Challenge with good prizes for the winners. I also have to say that all the teams that presented a demo got one of the awesome GeeksPhones with FirefoxOS, a very elegant gesture from Mozilla towards the attendees.

Filed under Events

Now I have a MOOC platform, what physical stuff do I need?

It's been a while since I wrote my last MOOC-related post. But now, after the crazy days of starting the first MOOC class in which I have the honour to participate, I can write a bit about the second main aspect of a MOOC: what you need to create those awesome videos. I already know that these posts are not about content, but about the things you need to get started. About the content and the political or philosophical implications of teaching a MOOC, there are already conversations out there that may fit your interest and answer your questions. For me, it is simply an interesting trend that universalizes access to higher education, so as an academic member of a university, it's a must to at least give it a try.

That said, let's talk about the physical stuff. In my last post, I talked about infrastructure needs. Well, we finally forked the OpenMOOC engine and started our own development, which includes an all-in-one solution (registration, users, discussion, etc.) with a very easy installation process –stay tuned for detailed instructions on deploying it on your own server. And now that the course has started, we are producing the videos as fast as we can. In an ideal world, you would buy one of those amazing Wacom tablets that already do all the work for you, but if you don't have $4,000, as is our case in a lab where resources are limited, you should use what you already have. So far, what we are using is:

  • Digital camera recorder. The Panasonic HC-V700 ($460), but any modern camera, a good DSLR or even a small digital camera, is able to record in good quality (1080p) and is not that expensive.
  • Tabletop monopod. This time we bought one from Amazon, a Sharpics SPMP16 ($30), in order to record what the teacher writes.
  • Lamp. In order to avoid annoying hand shadows when writing, we got a basic swing-arm lamp ($25).
  • iPad ($399). We already had one, so no more to say.
  • Stylus. We are using a Bamboo Stylus Solo ($30), but there are cheaper options out there. It's mostly about how comfortable you feel with it.
And I think that's all. The process we are following on the cheap, in order to achieve results as close as possible to Udacity's videos, shown above, in which the hand never hides the written content, is the following:
  1. Write a small script for the video, which is called a nugget in the OpenMOOC terminology.
  2. Fix the iPad on a desk under the camera lens, using the monopod and the lamp light.
  3. Write the content in Paper or Sketchbook Pro ($4.99) and record the whole thing.
  4. At the same time, screencast the iPad using screen mirroring through AirServer ($14.99) and Camtasia ($99).
  5. Then, in Camtasia, using chroma key, we put the texts and diagrams over the hand, creating a similar effect.

But we still need a lot more practice 😀

On the other hand, we are also streaming the classes, so we can record and cut each session into pieces and make more concept videos. So far, we are not using videos for homework, but Dr. Glearning, a service that enables you to create homework that your students can do on their phones. I wish you could see the students' faces after telling them they will do homework on their phones; it's simply priceless. But, although the Dr. Glearning app is already available on the iTunes and Google Play stores, it is still in beta for teachers who want to create their courses. In addition, in our OpenMOOC fork we developed a basic integration, so you can embed Dr. Glearning courses into your MOOC course. Awesome, isn't it?


Filed under Analysis