# Monthly Archives: November 2011

## Word Frequency and Sentiment Analysis of the Spanish Elections Manifestos for #20N

Yesterday was an important date for all Spaniards: it was polling day, the first elections since the global crisis took hold. Two days before, we had the so-called Reflection Day, named that way to invite all voters to think over their decision. However, the prevailing feeling in recent months has been the indignation of the Spanish people across the country and, why not, across the world. Well, as a Spaniard living in Canada, what I did on my Reflection Day was a quick-and-dirty word frequency and sentiment analysis of the election manifestos of some of the most “important” parties in Spain.

To achieve this, I followed a series of steps, mechanically, and applied them to the official manifestos. Of course, some of them, for example the EAJ-PNV manifesto, were not in text format but in image format; those files were not processed. The parties analyzed are IU, PP, PSOE, Esquerra, UPyD, CC, EQUO and Geroa Bai.

Once I had downloaded all the PDF files, some of them really heavy, I used the pandoc tool to extract just the text. With that done, I wrote a little Python script to split the text into single sentences, join sentences that span two lines, and clean up several things, such as page numbers and extra dots. The script then connects to the Sentiment API from ViralHeat to get the positive or negative feeling of every sentence in each manifesto. With the results properly stored in a file in JSON format, one line per sentence, a second Python script extracts just the numbers into CSV format, so they can be loaded into a spreadsheet to calculate some statistics.
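The cleaning and splitting step can be sketched roughly as follows. This is a minimal reconstruction, not the exact script I used, and the regular expressions and the sample text are illustrative assumptions:

```python
import re

def clean_and_split(raw_text):
    """Rough sketch of the sentence-splitting step."""
    # Re-join words hyphenated across line breaks
    text = re.sub(r"-\n\s*", "", raw_text)
    # Drop lines that contain only a page number
    text = re.sub(r"(?m)^\s*\d+\s*$", "", text)
    # Merge sentences that span two lines into a single stream
    text = re.sub(r"\s*\n\s*", " ", text)
    # Collapse runs of extra dots into a single full stop
    text = re.sub(r"\.{2,}", ".", text)
    # Split on end-of-sentence punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Made-up fragment mimicking pandoc output: a broken line,
# a page number, and a run of extra dots
sample = "Defenderemos el empleo para\ntodos los ciudadanos.\n12\nMejoraremos la sanidad..\nY la educacion."
print(clean_and_split(sample))
```

Each resulting sentence is then what gets sent, one by one, to the sentiment API.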

The last part of the analysis was to create a visualization of the data. For this, I chose the Nightingale’s Rose from the Protovis visualization toolkit, and the Wordle tool to create tag clouds. The result of the first one can be seen below.

Diagram of Sentiments of the Political Manifestos in the Spanish Elections

Every slice in the diagram has two areas: the blue one represents the total number of sentences with a positive sentiment, and the red one the total number of sentences with a negative sentiment. The length of the manifestos ranges from 2,250 sentences in Esquerra’s to the much shorter 623 sentences in UPyD’s. In relative terms, we find the following results.

Percentage of Positive and Negative Sentences

It seems that UPyD is the most realistic party: its manifesto has a higher percentage of negative sentences than the rest (~16%), while still preserving a good number of positive ones. At the other end, CC and PP have the most optimistic manifestos, with a percentage of positive sentences above 95%. But what kinds of words appear most often in their respective manifestos? Let’s see…

PSOE Election Manifesto Tag Cloud

This one is the cloud for the party currently in Government. Its manifesto seems to centre on the words social, employment (“empleo”), system (“sistema”), politics (“política”), economy (“economía”), equality (“igualdad”) and companies (“empresas”); precisely the topics on which it has notably failed.

PP Election Manifesto Tag Cloud

The PP is the main opposition party and allegedly the more right-wing of the two (actually, both have shown the same social policies in the past). Its manifesto is strongly focused on the word change (“cambio”), followed by employment (“empleo”), society (“sociedad”), stability (“estabilidad”), reforms (“reformas”), better (“mejor”), European (“europea”), welfare (“bienestar”), and the future tense of motivate, stimulate or boost (“impulsaremos”). It is, of course, a really positive speech. Who wouldn’t vote for them with that kind of happiness and improvement on offer? Not me…

IU Election Manifesto Tag Cloud

On the other hand, the historically most left-wing party highlights, once again, the words left (“izquierda”), proposals (“propuestas”), rights (“derecho”), united (“unida”, simply because the party’s name translates as United Left), social in many forms (“social”), elections (“electoral”), public in another bunch of forms (“público”, “pública” and so on) and services (“servicios”). In my opinion, not a very strong manifesto, and maybe a little fainthearted.

UPyD Election Manifesto Tag Cloud

It looks like our most realistic party has no single prominent word. Instead, it focuses on communities (“comunidades”), development (“desarrollo”), autonomous regions (“autónomas”) and administration (“administración”). It is perhaps the most heterogeneous manifesto of all the ones I analyzed.

Esquerra Election Manifesto Tag Cloud

Esquerra is a Catalonian party and I couldn’t find its manifesto in Spanish. In any case, it seems to centre on state (“estat”), people (“persones”), the name of its region, Catalonia (“Catalunya”), social (“social”), action (“acció”) and politics (“política”).

EQUO Election Manifesto Tag Cloud

EQUO is a newly created party, founded by the former director of Greenpeace Spain, with a strong focus on the environment and global warming. That’s why we find words like health (“salud”), sustainable (“sostenible”) and development (“desarrollo”).

CC Election Manifesto Tag Cloud

Geroa Bai Election Manifesto Tag Cloud

In these last two clouds we can see the name of the party and, more importantly, the name of the corresponding autonomous region: Canarias and Navarra, respectively. The rest of the words are barely used. Maybe they are trying to win voters in their own regions, because the whole manifesto revolves around the region’s name.

Sadly, the worst was still to come. And what is it? It’s not about having a hard right-wing party for the next four years. It’s about granting one party the power to rule the Government alone, with an absolute majority that is absurd and, most of the time, counterproductive.

Final Congress Results 2011 (Source: elpais.com)

Filed under Analysis

## Creating a Globe of Data

Before starting, you can see the final result of this post on World Poverty.

Some months ago, I was impressed by the Chrome Experiments website. On that site you can find a lot of experiments built with the new WebGL technology, which is supposed to work in most new browsers. WebGL is the most recent standard for 3D rendering on the Web, and with it a new form of data representation is now possible. In fact, artists, scientists, game designers, statisticians and others are creating amazing visualizations of their data.

One of these new forms of representation was made by Google. It’s called WebGL Globe and it displays statistical geo-located data. The only thing you need is to split your data into several series of latitude, longitude and magnitude in JSON format, as the next example illustrates:

```javascript
var data = [
  [
    'seriesA', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ]
  ],
  [
    'seriesB', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ]
  ]
];
```


JSON, short for JavaScript Object Notation, is not just a format for representing data in JavaScript; it is also the data format that WebGL Globe needs in order to work. In JSON, a list is enclosed in brackets, “[” to start and “]” to end. The data series for WebGL Globe is therefore a list of lists. Each of these inner lists has two elements: the first is the name of the series, and the second is another list containing the data. The data values are comma-separated, and you must provide your information in groups of three: the geographical latitude, the longitude, and the value of the magnitude you want to represent.
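As a quick sanity check, the same structure can be built and serialized from Python with the standard json module. The series names and coordinates here are made-up placeholders:

```python
import json

# Two hypothetical series, each a flat list of
# latitude, longitude, magnitude triples
data = [
    ["seriesA", [40.46, -3.75, 0.5, 46.15, 14.99, 0.2]],
    ["seriesB", [49.82, 15.47, 0.8]],
]

payload = json.dumps(data)
print(payload)
```

Loading the serialized string back gives the same list of lists, which is exactly the shape WebGL Globe consumes.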

Let’s say we want to represent information from the Human Poverty Index. The first thing we need is to download the data from the United Nations’ site for the Multidimensional Poverty Index, which has replaced the old Human Poverty Index. Once we have the spreadsheet document, it’s time to open it and collect just the data we need: go to sheet 5 of the workbook, and copy and paste the cells into a clean spreadsheet. We remove everything we don’t need (titles, captions, extra columns, etc.) and keep just the country names, the second “Value” column under the “Multidimensional Poverty Index” cell, the population in poverty in thousands, and the “Intensity of deprivation” column. The next step is to remove the rows with no data for those indicators, marked as “..”. After doing this, we should have a document with 4 columns and 109 rows.
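Filtering out the “..” rows can also be done programmatically instead of by hand. Here is a minimal sketch over a made-up three-row extract of the spreadsheet:

```python
import csv
import io

# Hypothetical miniature of the cleaned UN export:
# country, MPI value, population in poverty (thousands), intensity
raw = """Slovenia,0.000,..,..
Czech Republic,0.010,12.0,40.0
United Arab Emirates,0.002,6.0,35.2
"""

# Keep only the rows where every indicator has a real value
rows = [r for r in csv.reader(io.StringIO(raw)) if ".." not in r]
print(rows)
```

In this made-up sample, the Slovenia row is dropped because two of its indicators are missing.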

Spreadsheet before getting coordinates for countries

But although we have the names of the countries, we still need their geographical coordinates. There are several services that provide the latitude and longitude for a given address; when given just the name of a country, they return the coordinates of its capital. We will use geopy, a Python library that can connect to different providers and retrieve several kinds of information. Installing geopy from a terminal takes just one command.

```shell
$ easy_install geopy
```

After that, we can open an interactive console like IPython and get the latitude and longitude of, for instance, “Spain”, with the following commands:

```python
>>> from geopy import geocoders
>>> g = geocoders.Google()
>>> g.geocode("Spain")
(u'Spain', (40.463667000000001, -3.7492200000000002))
```

In the same way, we can build a list of our countries and pass it to the next snippet:

```python
>>> from geopy import geocoders
>>> g = geocoders.Google()
>>> countries = ["Slovenia", "Czech Republic", ...]
>>> for country in countries:
...     try:
...         placemark = g.geocode(country)
...         print "%s,%s,%s" % (placemark[0], placemark[1][0], placemark[1][1])
...     except:
...         print country
...
Slovenia,46.151241,14.995463
Czech Republic,49.817492,15.472962
United Arab Emirates,23.424076,53.847818
...
```

Now we can select all the resulting latitude and longitude lines and copy them with Ctrl-C or right-click and copy. Go to our spreadsheet and paste them into the first row of a new column. A paste dialogue should appear; in it, check the option that splits the values by commas.

Paste the results comma-separated

With this done, we have coordinates for almost all the countries. There may still be some locations for which the script didn’t get the right coordinates, such as “Moldova (Republic of)” or “Georgia”. For these countries, after careful checking, the best thing to do is to retry with fixed names (trying “Moldova” instead of “Moldova (Republic of)”), or simply to look up the location on Wikipedia; for Georgia, for example, Wikipedia provides the exact coordinates in the information box on the right side. When the process is over, we remove the column with the country names and reorder the columns so that latitude comes first, longitude second, and the rest of the columns after that. The data is now almost ready.
After this, we need to save the spreadsheet as a CSV file so it can be processed by a Python script that converts it into the JSON format WebGL Globe can handle. The script that processes the CSV file and produces the JSON output is the following:

```python
import csv

lines = csv.reader(open("poverty.csv", "rb"))
mpis = []          # Multidimensional Poverty Index
thousands = []     # People, in thousands, in a poverty situation
deprivations = []  # Intensity of Deprivation
for lat, lon, mpi, thousand, deprivation in lines:
    mpis += (lat, lon, mpi)
    thousands += (lat, lon, thousand)
    deprivations += (lat, lon, deprivation)
print """[
["Multidimensional Poverty Index", [%s]],
["People affected (in thousands)", [%s]],
["Intensity of Deprivation", [%s]]
]""" % (",".join(mpis), ",".join(thousands), ",".join(deprivations))
```

And the output will look like:

```
[
["Multidimensional Poverty Index", [46.151241,14.995463,0, ... ]
...
```

Now, if we copy that output into a file called poverty.json, we will have our input data for WebGL Globe. The last step is to set up the Globe and the data input file together. We need to download the webgl-globe.zip file and extract the directory named “globe”. Into it, we copy our poverty.json file, and then edit index.html to replace the occurrences of “population909500.json” with “poverty.json”, along with some other additions such as the names of the series. Finally, to see the result, you can put all the files on a static web server and browse to its URL. Another option, just for local debugging, is to run the next command inside the directory itself:

```shell
$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
```


And then, go to http://localhost:8000 to see the result.

Globe before normalization

It seems that something is wrong with two of the series: the population in poverty conditions, and the intensity of the poverty. This is because we need to normalize the values into the range 0 to 1. To do that, we open our CSV file as a spreadsheet again, calculate the sum of each column we want to normalize, and then create a new column in which every cell is the old value divided by the total sum of the old column. We repeat the process for the other column and replace the old columns with just the values of the new ones. Now we can regenerate the JSON file and try again.
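The same normalization can be done in a couple of lines of Python instead of a spreadsheet. The column values below are placeholders:

```python
def normalize(values):
    """Divide each value by the column total so the series
    falls into the 0-to-1 range that WebGL Globe expects."""
    total = sum(values)
    return [v / total for v in values]

# Hypothetical "people in poverty (thousands)" column
thousands = [1000.0, 3000.0, 6000.0]
print(normalize(thousands))
```

Dividing by the column total keeps the relative proportions between countries intact while bringing the magnitudes down to a scale the Globe renders sensibly.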

Now, you can click on World Poverty to see everything working properly.