Some notes about the new features of neo4j-rest-client

This is the first time I write in technical terms and about a development of my own. But first, a bit of history. Back in December 2009, I met the Neo4j database. Neo4j was one of the first graph databases with serious uses in the real world. I was amazed at how easy it was to create nodes and edges (they call them relationships, an even more intuitive name). But, like everything else in the graph world, it was written in Java with no other language binding, except a very basic Python binding, neo4j.py. A couple of months later, they released the REST standalone server and then, because getting neo4j.py to work was really hard for pure Python coders, I decided to write a client-side library. That’s how neo4j-rest-client was born. It was a really basic tool, but it started to grow and grow, and later the next year the first packaged version was released on the Python Package Index. Since then, everything has improved, both the Neo4j REST API and the Python community around it. The Neo4j guys finally deprecated neo4j.py and released a new python-embedded client, also based on the Java runtime, at the same time that other alternatives appeared on the scene: bulbflow, neo4django, or the newest py2neo, for example. However, neo4j-rest-client was always as low level as possible: it didn’t manage caching, or lazy loads, or delayed requests, to name just a few. But when the Cypher plugin, the preferred method to query the graph in Neo4j, became part of the core, I decided to implement some cool features based on it.

The first thing was to have better iterables for objects, as well as laziness in loads and requests. I implemented a way to query the graph database using Cypher, while taking advantage of the existing neo4j-rest-client objects like Node or Relationship. So, every time you make a query, you can get the objects exactly as returned by the server, what is called the RAW response, by using the `constants.RAW` option in the `returns` parameter of the `query` method of `GraphDatabase` objects.


from neo4jrestclient.client import GraphDatabase
from neo4jrestclient.constants import RAW

gdb = GraphDatabase("http://localhost:7474/db/data/")

q = "START n=node(*) RETURN ID(n), n.name"
params = {}
gdb.query(q, params=params, returns=RAW)

Or you can use `params` to pass parameters safely to your query.


q = "START n=node({nodes}) RETURN ID(n), n.name"
params = {"nodes": [1, 2, 3, 4, 5]}

gdb.query(q, params=params, returns=RAW)

However you define your query, you can omit `returns` in the last line if the value is `RAW`. But the power of this parameter is the possibility of passing casting functions in order to format the results.

from neo4jrestclient.client import Node

q = "START n=node({nodes}) RETURN ID(n), n.name!, n"
params = {"nodes": [1, 2, 3, 4, 5]}
returns = (int, unicode, Node)
gdb.query(q, params=params, returns=returns)

Or you can even create your own casting function, which is really useful when dealing with nullable properties, referenced in Cypher with `?` and `!`.


from neo4jrestclient.client import Node

def my_custom_casting(val):
    try:
        return unicode(val)
    except:  # Never ever leave a bare except like this
        return val

q = "START n=node({nodes}) RETURN ID(n), n.name!, n"
params = {"nodes": [1, 2, 3, 4, 5]}
returns = (int, my_custom_casting, Node)
gdb.query(q, params=params, returns=returns)

Now I can be sure that if the name of a node is not present, a proper RAW value will be returned. But what happens if the number of columns doesn’t match the number of elements passed as casting functions? Nothing: the remaining elements will be returned as RAW, as usual. Nice graceful degradation 😀
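
For instance, a quick sketch of that degradation (reusing the `params` defined above): only one casting function is passed for three returned columns, so the second and third columns simply come back RAW.

q = "START n=node({nodes}) RETURN ID(n), n.name!, n"
# Only ID(n) is cast to int; n.name! and n are returned RAW
gdb.query(q, params=params, returns=(int,))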

On the other hand, and using the new queries feature, I implemented some filtering helpers that could eventually replace the Lucene query method used so far. The star here is the `Q` object.


from neo4jrestclient.query import Q

The syntax, borrowed from Django and inspired by lucene-querybuilder, is the following:


Q(property_name, lookup=value_to_match, [nullable])

The `nullable` option can take `True` (the default), `False` or `None`, and sets the behaviour of Cypher when an element doesn’t have the queried property. In a real example, it looks like:


lookup = Q("name", istartswith="william")
williams = gdb.nodes.filter(lookup)

The complete list of lookup options is in the documentation. And lookups can be as complicated as you want.


lookups = (
    Q("name", exact="James")
    & (Q("surname", startswith="Smith")
       | ~Q("surname", endswith="e"))
)
nodes = gdb.nodes.filter(lookups)

The `filter` method, added to nodes and relationships, can take an extra argument `start`, in order to set the `START` clause instead of using all the nodes or relationships (`node(*)`). The `start` parameter can be a mixed list of integers and Node objects, a mixed list of integers and Relationship objects, or an Index object.


n1 = gdb.nodes.create()
start = [1, 2, 3, n1]
lookup = Q("name", istartswith="william")
nodes = gdb.nodes.filter(lookup, start=start)

index = gdb.nodes.indexes.create(name="williams")
index["name"]["w"] = n1
nodes = gdb.nodes.filter(lookup, start=index)
nodes = gdb.nodes.filter(lookup, start=index["name"])

Or using just the index:


nodes = index.filter(lookup, key="name", value="w")

Also, all filtering functions support lazy loading when slicing, so you can safely do slices on huge graph databases, because internally the `SKIP` and `LIMIT` Cypher options are added before doing the query.
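
As a minimal sketch of that laziness (reusing the `lookup` object defined above), only the requested window of results is fetched, since `SKIP` and `LIMIT` are part of the generated Cypher before the request is sent.

# Only the second page of ten matching nodes is requested from the server
second_page = gdb.nodes.filter(lookup)[10:20]
for node in second_page:
    print node["name"]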

Finally, just a mention of the ordering method, which allows you to order ascending (the default) or descending, just by chaining calls.


from neo4jrestclient.constants import DESC

nodes = gdb.nodes.filter(lookup)[:100]
nodes.order_by("name", DESC).order_by("age")

And that’s all. Let’s see what the future has prepared for Neo4j and the Python neo4j-rest-client!

Leave a Comment

Filed under Topics

What if I decide to teach a MOOC? Well, then I should learn some Python :)

Well, I think that today everybody knows what a MOOC is. MOOC stands for Massive Open Online Course. You may have heard the term first from Stanford University, and then from Udacity, Coursera, edX, or TEDEd. So there is a lot of hype around the concept and the idea of MOOCs, although it is not as new as we may think. The Open Learning Initiative could be one of the first to explore this trend, or more recently, P2P University and Khan Academy. However, when you decide to teach your content following the MOOC model, there are some steps to overcome. The first question is whether you need your own platform or can just use one of the available ones. If the answer to this question is something like “what are you talking about”, then you can put your content on sites like Udemy, CourseSites by Blackboard or iTunes U, and forget about systems administration, user registration, machine requirements, bandwidth, etc. But you will be tied to a company and its constraints. Or, if you are part of a bigger institution, you can beg your boss to join one of the big consortiums mentioned above. Let me tell you something: this is not going to happen quickly (or at all, the wheels of bureaucracy turn slowly), so better find another approach. On the other hand, if you have a passable server with acceptable bandwidth, some tech guys with free time (which is an oxymoron), and a lot of energy and passion, you can also set up your own infrastructure. If this is your case, what are your options? Well, just a few, which I will enumerate.

  • OpenMOOC aims to be an open source platform (Apache License 2.0) that implements a fully open MOOC solution. It is fully video-centered, following the steps of the first Stanford AI-Class experiment. It is a new approach, but it makes it harder to add traditional questions not based on videos, or even essay submissions. It is prepared to be used with an IdP in order to have an identity federation for big sites. It is able to automatically process YouTube videos and extract the last frame as the question if required. Because we particularly don’t need the federation, we removed that feature and added some more in our own fork, just to try the solution. It is also able to connect to AskBot for a forum-like space for questions and answers. It has been successfully deployed in UNED COMA.
  • Class2Go is easier to install and get running, but kind of complex to manage. It integrates very well with services such as Amazon SES (which we added to our OpenMOOC fork), Piazza, the Khan Academy HTML-based exercise framework, and Amazon AWS. Used by Stanford.
  • Course Builder is pretty but hard to deploy or add content to. Used by Google for some of its free courses.
  • Learnata is without a doubt the best documented and the easiest to install. It is the underlying system of the P2PU and it has a real and active community behind it. It has an awesome badges system, a detailed dashboard, an API, and a bunch of modules (formerly Django applications). But it doesn’t manage videos as well as the other two.

All of them are built using Python and, except for Course Builder, Django as the core technology. It just so happens that here at the CulturePlex Lab we use Python and Django a lot. That’s why we are currently forking everywhere and creating our own MOOC system. And that’s the magic of Open Source: we can fork OpenMOOC, take some features from Class2Go and others from Learnata and, as long as we respect the licenses, release a new MOOC system, the CulturePlex Courses (still under heavy testing).

Next post? Some notes about what you need in physical terms, like a camera, a monopod, a tablet, etc.

12 Comments

Filed under Analysis

The experience of the PyCon Canada 2012 #PyConCa

Well, so finally the date arrived and I had to go to Toronto to give a talk about graph databases in Python. From the beginning of the event I could feel the energy and good vibrations of the wonderful team of organizers. From here, my humble congratulations to all of them, including the volunteers, for an awesome job that made an amazing experience possible. It was my first PyCon so far. I had already heard about Python conferences and how cool they are, but I had never had the opportunity to be at one. PyCon Canada, the first one of its kind Canada-wide, gave me the chance I was waiting for.

The reception took place on Friday night, where I could meet some people, register as a speaker and get my credential. I must say that the credential with the shiny speaker tag made me very happy.

Saturday was the first formal day of the conference, starting relatively early (above all for those who went out the night before). The session began with a keynote by Jessica McKellar about Hacker School. After a small break, the sessions split up into three tracks: main hall, lower hall and tutorial room. Unfortunately, I couldn’t attend any of the tutorials. So for the next talk I had to decide between the 40 min talk about SQLAlchemy (given by its creator, Michael Bayer) and the two 20 min talks about MongoDB and gene databases, and about writing self-documenting scientific code using physical quantities. So I went to the SQLAlchemy talk for 20 min and then to the one about MongoDB. In the latter I met Vid Ayer, who was really interested in graph databases and didn’t miss my talk.

After another small break, I saw a really good talk by Mike Fletcher, an independent consultant from Toronto. He gave a presentation about Profiling for Performance, in which I discovered awesome tools like Coldshot, a better alternative to hotshot, or RunSnakeRun, a graphical interface for profiling logs that is really helpful.

The lunch, which was included in the price of the conference, was acceptable and did the trick of keeping the stomach quiet until dinner. The next talks were about the App Engine Python SDK; the Cloudant REST API, which has been rewritten in Python using Flask, a light web framework; a Python DynamoDB mapper; and a funny presentation on everything you wanted to know about deploying web apps on Windows but were too horrified to ask. After the coffee break, Daniel Lindsley, author of tastypie and Haystack among others, gave an excellent talk about search.

After Daniel’s excellent presentation, I stayed in the Main Hall and listened to the talk I Wish I Knew How to Quit You: Secrets to Sustainable Python Communities by Elizabeth Leddy, a core developer of Plone, the previously famous Python CMS. She is, as we say in Spanish, toda una personaja, and talked about how to successfully manage a Python community. The next talk was by Mahdi Yusuf, a passionate developer from Ottawa and maintainer of the PyCoders Weekly newsletter, who explained the history of Python packaging. Right after his successful talk, Martín Alderete from the Argentinian Python users group presented the Ninja IDE, a totally awesome integrated development environment with a ton of features, open source and designed with Python in mind. And that was all for the first day.

The next morning, because my talk was in the first slot, I missed the keynote given by Michael Feathers. I was half nervous, half excited. So I went to the Lower Hall, where my talk would take place, and waited for the guy before me to finish. Steve Singer, that was his name, talked about using Python as a procedural language for PostgreSQL, which is really interesting. And finally I gave my talk about graph databases, mostly Neo4j, in Python. I presented basic concepts like what a graph is, or what types of graphs exist. Then a landscape of graph database solutions and which of them are suitable to be used from Python. Finally some examples using these libraries and even a quick hint about how to deploy Neo4j on Heroku and connect to it from neo4j-rest-client. Unfortunately, at the same time, Kenneth Reitz, author of requests and working at Heroku, was giving a really interesting talk called Python for Humans.

Nevertheless, I enjoyed the experience of giving a talk, even though there weren’t a lot of people attending it. One thing I noticed was that most of the attendees were scientists looking to solve problems that could be better understood using graph structures, like state machines or even biology.

And, why not, here are my talk and my slides, even though I know that the only thing worse than listening to yourself speaking in a video is, doubtless, listening to yourself speaking in a video in English. But it is what it is :)

After that, I stayed in the same room to attend a talk on server log analysis using Pandas. Pandas aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. And the guy who talked about it used the IPython Notebook to do the presentation, a killer feature of the interactive console IPython that was new to me. Then came some talks about big data using Disco and Inferno, and about horizontally scaling databases in Django, given by guys from Chango and Wave Accounting, both Toronto-based companies. Diego Muñoz, an ex-member of the CulturePlex, gave a talk about an Ember.js adapter for Django that avoids any change to your REST API if you are already using tastypie.

After a small coffee break, there were talks about real-time web apps with Stack.io by Gabriel Grant, Urwid by Ian Ward (including the awesome bpython interpreter and the speedometer tool), workloads and the cloud by Chayin Kirshen, and speeding up your database by Anna Filina. But in the afternoon, I must say that the best one was given by Alexandre Bourget from Montreal, aka the showman: Gevent-socketio, a cross-framework real-time web live demo.

To end the day and the conference, Fernando Pérez gave a talk on science and Python as a retrospective of a (mostly) successful decade. In his slides you can clearly see the use of the IPython Notebook.

And that was pretty much everything. I am already looking forward to the next one. It is a really good experience: you learn a lot from great people and spend an amazing weekend surrounded by other Python coders. It is totally worth it. Let’s see if I get accepted for PyCon US. Cross your fingers!

1 Comment

Filed under Events

More ideas on the Virtual Cultural Laboratory

These days, after reading the article about the VCL, A Virtual Laboratory for the Study of History and Cultural Dynamics (Juan Luis Suárez and Fernando Sancho, 2011), for our first session of the incipient reading group in the lab, some ideas came to my mind. The article presents a tool intended to help researchers model and analyze historical processes and cultural dynamics.

The tool defines a model with messages, agents, and behaviours. Very briefly, a message is the most basic unit of information that can be stored or exchanged. There are three types of agents: individuals, mobile members of a social group that can exchange messages among themselves or acquire new ones; repositories, like individuals but fixed in space; and cultural items, a way to store an immutable message to transfer, also immobile. Finally, we find four ways in which agents can behave: reception, memory, learning and emission. Every kind of agent has a different set of behaviours. Cultural items do not receive information and always emit the same message; repositories work as a limited collection of messages: when the repository is full, a message is selected for elimination. And individuals can be Creative, Dominant and Passive, according to the levels of attentionality and engagement with the messages they show. These three simple models make the VCL a really versatile cultural simulator. However, as the authors say in the article, the VCL is a beta version and could be improved a bit.

I am lucky enough to be able to talk to the authors, and we are having a really interesting discussion about new ways to expand the VCL. On my side, I have been quite influenced by the book La evolución de la cultura (Luigi Luca Cavalli-Sforza, 2007) and the previously mentioned Maps of Time (David Christian, 2011), in such a way that demography and concept networks have become very significant factors from my point of view.

The idea is to use graphs to represent and store the culture of the individuals, and also graphs to represent the different cultures, trying to shift everything a bit towards the domain of Graph Theory. We would be able to store the whole universe of concepts, defined through semantic relationships among them. In this scenario, we can figure out a degree pruning to get the different connected components that represent the cultures, while always keeping the source graph. This pruning function could be a measure over the relationships, for example “remove relationships between nodes with this value of betweenness centrality”, or even a random way to get connected components. But it is better if the removed relationships make sense in semantic terms.
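
Just to make the idea a bit more concrete, here is a minimal sketch of that kind of pruning using NetworkX (the library, the toy graph and the threshold are my own assumptions for illustration, not part of the VCL): relationships with a high edge betweenness centrality are removed, and the remaining connected components are taken as candidate cultures, while the original universe graph is kept untouched.

import networkx as nx

# A toy universe of concepts linked by semantic relationships (made-up data)
universe = nx.Graph()
universe.add_edges_from([
    ("dog", "animal"), ("cat", "animal"), ("dog", "cat"),
    ("guitar", "music"), ("music", "art"), ("guitar", "art"),
    ("animal", "music"),  # a weak bridge between two dense zones
])

# Remove the relationships that act as bridges between dense zones
betweenness = nx.edge_betweenness_centrality(universe)
threshold = 0.4  # arbitrary value, chosen only for the example
pruned = universe.copy()
pruned.remove_edges_from([edge for edge, value in betweenness.items()
                          if value > threshold])

# Every connected component of the pruned graph is a candidate culture
cultures = list(nx.connected_components(pruned))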

After we have the different culture graphs, we put them all in different places. Then we can take culture sub-graphs and store them in the individuals in order to give them a cultural feeling of membership to a certain culture. Sub-graphs from the same culture can overlap each other, but sub-graphs from different cultures should be disjoint. Now, individuals start to move across the world. I would also introduce the notion of innovations for culture sub-graphs: an innovation is a deciduous concept with no relationships to any concept of the sub-graph, but with at least one relationship if we consider the set of relationships of the original graph. Somehow, this implies that everything is already in the world, but it is an interesting assumption to experiment with. Maybe the original graph could be dynamic and get new concepts over time.

So, individuals could show specific behaviours with regard to innovations: Conservative, Conformist and Liberal. And another property to represent the feeling of belonging to a group distinct from the one the individual was born into. This value is kind of similar to the permeability to ideas, but different: while permeability works during the whole life of the individual, the membership feeling could operate only until it is satisfied, so we can use it as a way to stop individuals, or to define the equilibrium.

Well, these are just ideas. Another approach could be to use population pyramids as inputs for the simulation. Yes, it’s me and demography again. If we do this, given a culture and a number of individuals that changes across time thanks to the population pyramid, we could see, and this is the point, how concepts move through cultures, and even more important, what the culture of the individuals is when the simulation stops. Calculating this is as easy as checking which sub-graphs are a subset of the existing cultures. This idea of using a population pyramid seems interesting to me because it allows us to analyze the importance of the loss of permeability of individuals to innovations. Therefore, we could find what the elements of vertical cultural transmission are (traditional, familiar, and ritual), in opposition to horizontal transmission (which does not imply kinship but relations between individuals).

And one more idea! This one is the craziest, I think. We could use a biology-inspired model for the concepts, so a concept would be defined by a vector that quantifies it using previously established knowledge fields. For instance, let’s say that an idea, i, is formed by 20% Literature, 20% Physics, and 0% Biology, so the resulting vector will be i = [20, 20, 0]. Also, ideas are related to each other through a graph. Following this biological analogy, we could set the vector to have 23 pairs of values, in such a way that allows individuals to adopt new ideas and modify them according to random changes in the last pair of values… or maybe this is too much craziness. Let’s see!

Leave a Comment

Filed under Analysis

Making Sense of Teaching Computer Science Tools to Linguists

As a part of the PhD program, I am required to design and defend two different courses, one intended for undergraduate students and another one for grads. I don’t know yet if both can be for graduate students, but my first course will be. It is going to be about some useful tools that any experimental, and not just theoretical, linguist should know. As of today, we are getting more and more accustomed to hearing terms like digital humanities, digital history, or digital whatever. There is even a sub-discipline (if we can call it that) named Computational Linguistics. However, it seems to me like two ancient rival soccer teams: what a traditional linguist does is pure Linguistics, but what a computational linguist does is just the future. Again, the everlasting fight between old and new ways of doing things. When what really should matter is what your questions are, and what tools you have to find the answers. That’s why I am proposing this course: to teach tools and new ways to think about them, and to make students able to solve their own research problems. And, what is even more important, to make them lose the fear of experimenting with Computer Science and Linguistics.

That being said, I am going to introduce you to a very early draft of my intended syllabus. Every week is going to have a one-hour class, to explain and introduce concepts, and a two-hour lab class in which to test, experiment with, and expand the subjects previously covered that week.

  1. Computer Architecture. One of the most important things that virtually anybody should know is how a computer actually works. In order to understand what is possible to do and what is not, one needs to know the common components of almost any current computer, like RAM memory, CPU, GPU, hard drives, input/output operations and devices, etc. Also, a brief introduction to the existing types will be given. Once you know how a machine is built, you can control and understand things like having enough memory to run your programs, why this file freezes my computer when loading, and so on.
  2. Fundamentals of Programming. The first thing to note is that almost everything inside your computer is a program. And I will say more: an important amount of the processes in your everyday life are pretty similar to computer programs. The order you follow when you are taking a shower, that awesome recipe for cooking pumpkin pie, the steps you take before starting the engine of your car, or the movements you make while dancing. All of them are in a way similar to computer programs, or better said, to algorithms. A computer program is a set of instructions that a machine runs one by one. Those programs are usually algorithms, in the sense that they are steps to achieve an output given a specific input. Very likely, an introduction to coding using something like pseudo-languages, flow diagrams, or NetLogo will be given.
  3. Programming Languages. A brief introduction to programming languages and why they are the way they are. Some linguists are really accustomed to handling language peculiarities; however, natural languages seem to answer to the hypothesis of Universal Grammar, as argued by the generative grammar studies of Noam Chomsky. A grammar is a set of rules, but in natural languages, unlike formal languages like those used to code, the set of rules is usually huge. Fortunately, programming languages are built using a small set of rules, and the grammars that describe them can, according to Chomsky, be classified by the way they generate sentences. We could even say that studying and learning how to program is like understanding how another language works. So, in the end, programming languages are just a kind of language: constructed languages. And instead of learning how the brain works in order to understand them, you have to know how machines do. Once this is clear, it is time to meet Python.
  4. Writing Code. After the introduction to Python, students have to learn what well-structured code looks like. Concepts like loops, flow control statements, functions and parameters will be taught. Then it is the moment to show them what libraries are and how to create their own. Finally, notions of Object-Oriented Programming (OOP) in Python will be shown, just in order to guide them in the use of objects and third-party libraries. Regrettably, a hard-core lesson like this one is really needed in order to expand the skills of the students facing real-life problems in their research. Thinking exactly the way machines do is the best way to code efficiently.
  5. Python Libraries. After getting some basic knowledge about programming, some third-party libraries will be presented. In particular, the commonly used Natural Language Toolkit, or simply NLTK. This library is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, part-of-speech tagging (POS), parsing, and semantic reasoning.
    The scientific Python extensions scipy, numpy, pylab and matplotlib will also be introduced, but only briefly, because visualization is covered in Week 8.
  6. R Language. R is a language and environment for statistical computing and graphics. Its syntax and usage differ a bit from Python’s, and it is much more focused on manipulating and analyzing data sets. Although R has built-in functions and libraries for almost any measure, there is a very active community behind it that provides even more: the Comprehensive R Archive Network, or CRAN. Learning how to use it and where to find the function that does exactly what you want is as important as knowing the language. R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible for any research purpose. This week will just cover basic statistical concepts like population, sample, frequency, and measures of center (mode, median, mean), spread (range, interquartile range, variance, standard deviation) and shape (symmetry, skewness, kurtosis).
  7. Statistics. More advanced features will be introduced, not just how to calculate them, but when and why. ANOVA, Chi-square, the Pearson coefficient, and arithmetic regression are some of them. Knowing the meaning of statistical measures is really important to understand your data and what is happening with them. However, the point will always be how to get these measures working in R, instead of discussing their deep theoretical aspects.
  8. Plotting. Both R and Python have powerful tools to represent and visualize data. Unfortunately, the syntax of these functions can be a little tricky and deserves a whole week to be explained. Producing good charts and visualizations of your data can be a crucial step in getting your research properly understood. That’s why this week will introduce different methods to plot data: bar charts, pie charts, scatter plots, histograms, heatmaps, quadrilateral meshes, spectrograms, stem plots, cross correlation, etc.
  9. Regular Expressions. As defined in Wikipedia, “a regular expression provides a concise and flexible means to ‘match’ (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.” The origins of regular expressions lie in automata theory and formal language theory. These fields study models of computation and ways to describe and classify formal languages. Given a formal definition, we have Σ, an alphabet of symbols, and constants: the empty set, the empty string, and literal characters. The operations of concatenation, alternation and the Kleene star (and cross) define regular expressions. However, we will use the POSIX Basic Regular Expressions syntax, because it is the most used and the easiest to learn. Of course, it does include boolean “OR”, grouping and quantification. Linguists can benefit from regular expressions because they are a tool that allows filtering and discovering data. Let’s say we have a set of words in Spanish and we want to extract all conjugations of the verb “jugar” (to play). The regular expression “^ju(e?)g[au].*$” will return the proper matches (see the short Python sketch right after this list).
  10. CHAT Format. CHAT is a file format for transcribing conversations in natural language using plain text. First designed for transcribing children’s speech, it is actually a quite powerful format for any kind of transcription. It allows registering parts of sentences, omissions, speakers, dates and times, and more features, including some from phonology. There is even an application that translates CHAT XML files into Phon files.
    Besides, since CHAT files are actually plain text files, the Python Natural Language Toolkit already has a parser for them, so we can do certain kinds of analysis using just Python and our already acquired skills with regular expressions.
  11. CHILDES CLAN. Besides the specification of the CHAT format (.cha), CHILDES provides a set of tools for working with transcripts. One of them is CLAN, the most general tool available for transcription, coding, and analysis. Over a corpus of texts written in CHAT format, CLAN provides methods such as headers, gems, comments, and postcodes that can be used for some qualitative data analysis (QDA).
  12. Presentations. Although at the beginning this week was intended for introducing PRAAT and CHILDES Phon for the analysis of phonological data, later I thought it would be even more useful to present the students’ projects to the class for commenting and getting feedback. So, this week won’t have any new content, but students will have to comment on and critique their classmates’ works.
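
As a small taste of what the lab sessions of week 9 could look like, here is a minimal Python sketch using the pattern mentioned above (the word list is made up just for the example):

import re

# The pattern discussed in week 9 for forms of the verb "jugar"
pattern = re.compile(r"^ju(e?)g[au].*$")

words = ["jugar", "juegan", "jugaba", "cantar", "comer"]
matches = [word for word in words if pattern.match(word)]
print matches  # ['jugar', 'juegan', 'jugaba']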

Quite interesting, not only from a linguistic point of view, but also for a computer scientist who enjoys teaching computer science to others. Fortunately, the linguist Prof. Yasaman Rafat agreed to be my co-supervisor for the linguistic side of this course. However, because the course bridges two disciplines, I still needed a computer scientist to guide me in the process of teaching non-technical students. Fortunately I asked the Biology and Computer Science (and Python expert) Prof. Mark Daley, and he kindly said yes. So now there is no obstacle at all to making this course a real thing :)

5 Comments

Filed under Tasks, Topics

Talking about Graph Databases at PyCon Canada

Good news, everyone! This November, at PyCon Canada 2012, I will be giving a short talk (about 20 min) about graph databases in Python, and will most likely share the time with Diego Muñoz. And now, some info about the talk 😀

Good news, everyone!

Abstract

Since the irruption of the NoSQL concept into the market, graph databases have traditionally been designed to be used from Java or C. With some honorable exceptions, there isn’t an easy way to manage graph databases from Python. In this talk, I will introduce you to some of the tools that you can use today in order to work with these new and challenging databases, using of course our favorite language: Python.

Outline

Starting with a very basic definition of what a graph is and why we would want to start using one, I will introduce some real-life examples. Companies like Facebook or Twitter started using graph-like databases pretty recently.

Then, an overview of the available options will be shown, detailing the ecosystem a bit and highlighting which of the solutions can be used with Python and how.

After that, and focusing on Blueprints databases, I will show how to use pyblueprints for basic stuff, and bulbflow for building Django-style models based on nodes.

Finally, for using just Neo4j, one of the most solid graph databases on the market and dual licensed (GPL for Open Source), I will introduce py2neo, a Cypher-centric approach, and neo4j-rest-client, which is the one I actively develop. And if there is some time left, some real examples connecting to AWS or Heroku.

The talk is intended for an audience with a basic or novice level of Python, so I hope to see you all there!

1 Comment

Filed under Events

Raiders of the Lost Thesis: A Proposal for Big Culture?

Well, well, well. It’s been a while with no entries on this blog, mainly due to the end of the last academic year, my awesome vacation during August and, why not, the fact that I didn’t feel like communicating or writing.

The year has already started and my main goal is to start the thesis. “But, hey! Before writing anything, you should read a lot”, someone could think. And he would be right. I have never been a reader of essays or articles, but it is almost the only way to go, it seems. “But, hey again! You first need a topic”, somebody could also say. And he would be god damn right again! I don’t have a “topic”, as people usually do. However, I expect the topic to emerge from the readings. We could say that my research is focused on Culture, with upper or lower case “C”, the frontiers or borders that delimit it, and how it evolves. With the hope, of course, of finding some interesting result or conclusion.

For the time being, I have read “Mainstream Culture” by Frédéric Martel, I am finishing “Maps of Time” by David Christian, and I am starting “Things and Places” by Zenon W. Pylyshyn. It is not that much, but it is a beginning. At this point of my research, and with a lot of weird and strange thoughts and connections in my mind, I started to think of something that could be interesting: Big Culture. Let me explain.

In the last of the books mentioned above, I am discovering how the mind is able to link perception and the world. It is a tough start, but it is needed to unveil the mechanisms that operate in the brain, and to understand how demonstrative thoughts and perception are related. As Pylyshyn cites, John Perry “argued that such demonstratives are essential in thoughts that occasion action”, actions by the motor system of the body. And to make this possible, humans need some frame of reference that does not necessarily have to be global, but local. Which could be a good starting point for a cultural reference system.

On the other hand we have civilizations. In “Maps of Time”, David Christian summarizes the history of everything, including us, as a cycle of manipulation of energy and the emergence of more and more complex orders: life, which seems to go against the Second Law of Thermodynamics. This is not a negative criticism, quite the contrary. He does an extremely brilliant exercise of synthesis, from the creation of the Universe to our days. This idea of energy consumption-production and Malthusian cycles is really valid for pre-modern civilizations, like agrarian or pastoral ones. But in the last two or three hundred years, when the modern concept of time was invented, the commercial networks –one of the big drivers of innovation– were followed by cultural transmissions. And at the same time, innovation was one of the causes of the biggest increase of population in history. In the current mega-cities, all the “natural” purposes and preoccupations of humans are a bit hidden. First, this provokes what Émile Durkheim calls anomie, and secondly, blurred definitions of identity and cultural unity.

Finally we have our current crazy world, moved by economic interests, egos, and supposedly superior morals: the mainstream culture, as defined by Frédéric Martel in “Mainstream Culture”. This huge research exposes how delicate, vague and artificial all cultures actually are. The complexity of the information networks, joined to global-scale commercial networks, defines what we understand by culture. However, while reading this excellent book, a thing came to my mind: maybe, instead of everything being local and global at the same time, humans have developed an unusual skill for handling cultural scopes across time.

So, I think it may be a good idea to organize a good set of thoughts about what Culture is, why it exists and what it means. From its origin in cognitive studies and the neuroscience of the brain, to the daily world, governed by complex networks. Without forgetting the process by which we became cultural beings, from our ancestors until today.

Leave a Comment

Filed under Analysis

Creating a Globe of Data (revisited for Programming Historian Second Edition)

Module Goals

After seeing the basics of Python and how they could help us in our daily work, we will introduce one of the many options for the visualization of data. In this case, we will combine a data source in CSV format, which will be processed and transformed into JSON notation. Finally we will represent all the information on a world globe, designed for modern browsers using the WebGL technology. During the process, we will need to get the spatial coordinates for countries across the world. And before starting, you can see the final result of this unit on World Poverty, so don’t be afraid of all the new names mentioned above; we will explain them below.

The Globe of Data

Since the end of 2009, some browsers have started to implement an incipient specification for rendering 3D content on the Web. Although it is not yet part of the W3C‘s specifications –the W3C is the organization that proposes, defines and approves almost all Internet standards–, WebGL, as it is called, is supported by all major browsers and the industry.

WebGL is the most recent way to do 3D representations on the Web. So, with WebGL, a new form of data representation is made available. In fact, there are artists, scientists, game designers, statisticians and so on creating amazing visualizations from their data.

Google WebGL Globe

One of these new forms of representation was made by Google. It is called the WebGL Globe and allows you to show statistical geo-located data.

JSON & World Coordinates

JSON, an acronym for JavaScript Object Notation, is not only a format to represent data in Javascript, the language of the browsers. It is also the data type that the WebGL Globe needs to work. In this format, a list is enclosed between brackets, “[” to start and “]” to end. Therefore, the data series for the WebGL Globe is a list of lists. Every one of these lists has two elements: the first one is the name of the series and the second one is another list containing the data. Although it is good to know how JSON lists are encoded, there are Python libraries to do that conversion for you, so you only have to handle pure Python objects.

>>> import json

>>> json.dumps([1, 2, 3])
    '[1, 2, 3]'

>>> json.dumps({"key1": "val1", "key2": "val2"})
    '{"key2": "val2", "key1": "val1"}'

The data for the WebGL Globe is written comma separated, so you must provide your information in sets of three elements: the first is the geographical coordinate for latitude, the second one is the same for longitude, and the third one is the value of the magnitude you would like to represent, but normalized between 0 and 1. This means that if we have the values 10, 50, 100 as magnitudes, these will have to be translated into 0.1, 0.5 and 1.
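
For example, a quick way to do that normalization in Python is dividing every value by the maximum of the series:

>>> magnitudes = [10, 50, 100]

>>> maximum = float(max(magnitudes))

>>> [value / maximum for value in magnitudes]
    [0.1, 0.5, 1.0]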

Briefly, “A geographic coordinate system is a coordinate system that enables every location on the Earth to be specified by a set of numbers.” These numbers are often chosen to represent the vertical position and horizontal position of a point on the globe (more precisely, it is even possible to add the elevation). They commonly refer to angles from the equatorial plane, but as far as we are concerned those angles can be transformed into a couple of single numbers with several decimal places.

Latitude and Longitude of the Earth (Source: Wikipedia.org)

The only thing you now need is to split up your data into several series of latitude, longitude and magnitude in JSON format, as the next example illustrates:

var data = [
  [
    'seriesA', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ]
  ],
  [
    'seriesB', [ latitude, longitude, magnitude, latitude, longitude, magnitude, ... ]
  ]
];

This said, we can write the data for our globe in pure Python and then apply a conversion into JSON.

>>> data = [
 ...:     ["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]],
 ...:     ["seriesB", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]],
 ...:     ...
 ...: ]

>>> json.dumps(data)
'[["seriesA", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]], ["seriesB", [34.56, -5.23, 0.89, 27.78, 10.56, 0.12, ...]], ...]'

The Data Set

Let’s say we want to represent information from the Human Poverty Index. The first thing we need is to download the data in the format provided by the United Nations’ site for the Multidimensional Poverty Index, which has replaced the old Human Poverty Index. Now that we have a spreadsheet document, it’s time to open it and collect just the data we need. Thus, go to page 5 of the workbook, and copy and paste the cells into a clean spreadsheet. We clean out all the data we don’t need, like titles, captions, extra columns, etc., and we leave just the country names, the second “Value” column under the cell “Multidimensional Poverty Index”, the population under poverty in thousands, and the “Intensity of deprivation” column. The next step is to remove the rows with no data for those indicators, marked as “..”. After doing this, we should have a document with 4 columns and 109 rows.

Spreadsheet before getting coordinates for countries

But, although we have the names of the countries, we need their geographical coordinates. There are several services that provide the latitude and longitude for a given address. In the case of having just the name of a country, the main coordinates of its capital are provided. We will use geopy, a Python library able to connect to different providers and get several kinds of information. To install geopy, a terminal or console is needed, and the installation is very easy with just one command.

$ easy_install geopy

After that, we can open a terminal or an interactive console like IPython and get the latitude and longitude of, for instance, “Spain”, with the following commands:

>>> from geopy import geocoders

>>> g = geocoders.Google()

>>> g.geocode("Spain")
(u'Spain', (40.463667000000001, -3.7492200000000002))

In this way, we can build a list of our countries and pass it to the following script:

>>> from geopy import geocoders

>>> g = geocoders.Google()

>>> countries = ["Slovenia", "Czech Republic", ...]
>>> for country in countries:
....:     try:
....:         placemark = g.geocode(country)
....:         print "%s,%s,%s" % (placemark[0], placemark[1][0], placemark[1][1])
....:     except:  # if geocoding fails, just print the name so we can fix it by hand
....:         print country
....:
....:
Slovenia,46.151241,14.995463
Czech Republic,49.817492,15.472962
United Arab Emirates,23.424076,53.847818
...

Now, we can select all the results corresponding to the latitudes and longitudes of every country and copy them with Ctrl-C or right-click and copy. Go to our spreadsheet, to the first row of a new column, and then paste it all. We should see a dialogue for pasting the data; in it, check the right option in order to get the values separated by commas.

Paste the result comma separated

With this done, we have almost all the coordinates for all the countries. Anyway, there could be some locations for which the script didn’t get the right coordinates, like “Moldova (Republic of)” or “Georgia”. For these countries, and after careful supervision, the best thing to do is to run several tries fixing the names (trying “Moldova” instead of “Moldova (Republic of)”) or just look up the location on Wikipedia –for example for Georgia, Wikipedia provides a link in the information box on the right side with the exact coordinates. When the process is over, we remove the columns with the names and reorder the columns in order to get first the latitude, second the longitude, and the rest of the columns after that. We almost have the data prepared. After this, we need to save the spreadsheet as a CSV file so it can be processed by a Python script that converts it into the JSON format that the WebGL Globe is able to handle.

Reading CSV Files

A CSV file is a data format for printing tables into plain-text data. There are plenty of dialects of CSV, but the most common is to print one row per line with every field comma separated. For example, the next table will produce the output shown below.

Field 1             Field 2
Row 1 Value Cell 1  Row 1 Value Cell 2
Row 2 Value Cell 1  Row 2 Value Cell 2

And the output will be:

Field 1,Field 2
Row 1 Value Cell 1,Row 1 Value Cell 2
Row 2 Value Cell 1,Row 2 Value Cell 2

And depending on the case, you can choose which character will be used as a separator instead of the “,”, or just leave the header out. But what happens if I need to print commas? Well, you can escape them or just wrap the entire value in double quotes.

"Row 1, Value Cell 1","Row 1, Value Cell 2"
"Row 2, Value Cell 1","Row 2, Value Cell 2"

And again you may wonder what happens if I need to print double quotes. In that case you can change the quoting character or just escape them with a backslash. This is the origin of all the dialects of CSV. However, we are not covering this in that much depth and we will focus on CSV reading with Python. To achieve it we use the standard “csv” library and invoke the “reader” method with a file object after opening it from disk. Once this is done, we can just iterate over every line as a list and store each value in a variable for that iteration.
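
As a minimal sketch of that reading loop (the file name is just an example, matching the two-column table shown above):

import csv

# Every iteration gives us one row of the file as a list of values
lines = csv.reader(open("example.csv", "rb"))
for field1, field2 in lines:
    print field1, field2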

 

In our case every line has, in this order, the latitude, the longitude, the value for the Multidimensional Poverty Index, the value for thousands of people in a poverty situation, and finally the value for the intensity of deprivation. Note that our CSV file has no header, so we do not have to ignore the first line. We will use three lists to store the different values of our series and finally, using the json library, we can print a JSON output to a file. The script that processes the CSV file and produces the JSON output is detailed next:

import csv
import json

lines = csv.reader(open("poverty.csv", "rb"))
mpis = []  # Multidimensional Poverty Index
thousands = []  # People, in thousands, in a poverty situation
deprivations = []  # Intensity of Deprivation
for lat, lon, mpi, thousand, deprivation in lines:
    mpis = mpis + [lat, lon, mpi]
    thousands = thousands + [lat, lon, thousand]
    deprivations = deprivations + [lat, lon, deprivation]
output = [
    ["Multidimensional Poverty Index", mpis],
    ["People affected (in thousands)", thousands],
    ["Intensity of Deprivation", deprivations]
]
print json.dumps(output)

And the output should look like:

[
["Multidimensional Poverty Index", ["46.151241", "14.995463", "0", ... ]
...

Putting it all together

Now, if we copy that output into a file called poverty.json, we will have our input data for the WebGL Globe. So, the last step is to set up the Globe and the data input file all together. We need to download the webgl-globe.zip file and extract the directory named “globe” into a directory with the same name. In it, we copy our poverty.json file and then edit index.html in order to replace the occurrences of “population909500.json” with “poverty.json”, and make some other additions like the names of the series. Finally, to see the result, you can put all the files on a static web server and browse the URL. Another option, just for local debugging, is to run the next command inside the directory itself:

$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

And then, go to http://localhost:8000 to see the result.

Globe before normalization

It seems like there is something wrong with two of the series: the population in poverty conditions, and the intensity of the poverty. This is because we need to normalize the values in order to get values in the range 0 to 1. To do that, we open our CSV file again as a spreadsheet, calculate the sum of the columns that we want to normalize, and then create a new column in which every single cell is the result of dividing the old cell value by the total sum of all the values in the old column. We repeat the process with the other column and replace the old columns with just the values of the new ones. Now, we can run the steps to generate the JSON file again and retry.
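
If you prefer to skip the spreadsheet for this step, a small Python sketch like the following one could do the same normalization (the file names and the column positions are assumptions following our CSV layout):

import csv

rows = list(csv.reader(open("poverty.csv", "rb")))
# Columns: latitude, longitude, MPI, thousands of people, intensity of deprivation
thousands_total = sum(float(row[3]) for row in rows)
deprivations_total = sum(float(row[4]) for row in rows)
writer = csv.writer(open("poverty_normalized.csv", "wb"))
for row in rows:
    row[3] = str(float(row[3]) / thousands_total)
    row[4] = str(float(row[4]) / deprivations_total)
    writer.writerow(row)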

Now, you can click on World Poverty to see everything properly working.

Suggested Readings

The Python Standard Library Documentation

Lutz, Learning Python

  • Ch. 9: Tuples, Files, and Everything Else

4 Comments

Filed under Topics

Final Post: Gamex and Faces in Baroque Paintings

Face recognition algorithms (like those used in digital cameras) allowed us to detect faces in paintings. This has given us the possibility of having a collection of faces from a particular epoch (in this case, the Baroque). However, the results of the algorithms are not perfect when applied to paintings instead of photographs. Gamex gives us the chance to clean this collection. This is very important, since these paintings are the only historical visual inheritance we have from the period. A period that started after the meeting of two worlds.

1. Description

Gamex was born from the merging of different ideas we had at the very beginning of the Interactive Exhibit Design course. It basically combines motion detection, face recognition and games to produce an interactive exhibit of Baroque paintings. The user interacts with the game by touching, or more properly poking, the faces, eyes, ears, noses, mouths and throats of the characters in the paintings. We score the user depending on whether or not there is a face already recognized at those points. The database already holds a repository with all the information the face recognition algorithms have detected. With this idea, we will be able to clean up the mistakes that the automatic face recognition has introduced.

The Gamex Set

2. The Architecture

A Tentative Architecture for Gamex explains the general architecture in more detail. Basically we have four physical components:

  • A screen. Built with a wooden frame and elastic stretch fabric where the images are projected from the back and where the user interacts by poking it.
  • The projector. Just to project the image from the back onto the screen (rear screen projection).
  • Microsoft Kinect. It captures the deformations of the fabric and sends them to the computer.
  • Computer. It captures the deformations sent by the Kinect device and translates them into touch events (similar to mouse clicks). These events are used in a game to mark different parts of the faces of people from Baroque paintings. All the information is stored in a database and we are going to use it to refine a previously calculated set of faces obtained through face recognition algorithms.

3. The Technology

There were several important pieces of technology that were involved in this project.

Face Recognition

Recent technologies offer us the possibility of recognizing objects in digital images. In this case, we were interested in recognizing faces. To achieve that, we used the OpenCV and SimpleCV libraries. The second one just allowed us to use OpenCV from Python, the glue of our project. There are several posts in which we explain this technology in a bit more detail and how we used it.
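
Just to give an idea of how little code is involved, here is a minimal face detection sketch using the OpenCV Python bindings directly (the image file and the Haar cascade path are assumptions for the example; in the project itself we worked through SimpleCV on top of OpenCV):

import cv2

# Hypothetical paths, just for illustration
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
painting = cv2.imread("baroque_painting.jpg")
gray = cv2.cvtColor(painting, cv2.COLOR_BGR2GRAY)

# Every detected face comes back as a rectangle (x, y, width, height)
faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(painting, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detected_faces.jpg", painting)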

Multi Touch Screen

One of the biggest parts of our work involved working with multi-touch screens. Probably because it is still a very new technology where things haven’t settled down that much, we had several problems, but fortunately we managed to solve them all. The idea is to have a rear projection screen using the Microsoft Kinect. Initially thought for the Microsoft Xbox 360 video-game system, there are a lot of people creating hacks (such as Simple Kinect Touch) to take advantage of the ability of this artifact to capture depth. Using two infrared lights and arithmetic, the device is able to capture the distance from the Kinect to the objects in front of it. It basically returns an image in which each pixel is the depth of the object with respect to the Kinect. All sorts of magic tricks can be performed, from recognizing gestures or faces to detecting deformations in a piece of fabric. This last idea is the heart of our project. Again, there are some posts explaining how (and how not) to use this technology.

Calibrating the multi-touch screen

Games

Last but not least, Kivy. Kivy is an open source framework for the development of applications that make use of innovative user interfaces, such as multi-touch applications. So, it fits our purposes. As programmers, we have developed interfaces on many different types of platforms, such as Java, Microsoft Visual, Python, C++ and HTML. We found Kivy to be very different from anything we knew before. After struggling for two or three weeks we came up with our interface. The real thing about Kivy is that it uses a very different approach which, apart from having its own language, the developers claim to be very efficient. In the end, we started to like it and, to be fair, it has been out there for just one year, so it will probably improve a lot. Finally, it has the advantage that it is straightforward to have a version for Android and iOS devices.

4. Learning

There has been a lot of personal learning in this project. We had never used the three main technologies of this project before. We also included a relatively new NoSQL database system called MongoDB. So that makes four different technologies. However, Javier and I agree that one of the most difficult parts was building the frame. We tried several approaches: from using my loft bed as a frame to a monstrously big frame (with massive pieces of wood carried from downtown to the university on my bike) that the psycho duck would bring down with the movement of its wings.

It is also interesting how ideas change over time; some of them we probably forgot. Others we tried and they didn’t work as expected. Most of them changed a little bit, but the spirit of our initial concept is in our project. I guess the creative process is a long road between a driving idea and the hacks to get to it.

5. The Exhibition

Technology fails on the big day, and on the day of the presentation we couldn’t get our video working, but ThatCamp is coming soon: a new opportunity to see users in action. So the video of the final result, although not public yet, is attached here. More will come soon!

6. Future Work

This has been a long post, but there are still a few more things to say. And probably much more in the future. We liked the idea so much that we are continuing to work on this, and we would like to mention some ideas that need to be polished and some pending work:

  • Score of the game. We want to build a better system for scores. Our main problem is that the data we have to score against is incomplete and imperfect (who always has the right answers anyway?). We want to give a fair solution to this. Our idea is to work with fuzzy logic to lessen the damage in case the computer is not right.
  • Graphics. We need to improve our icons. We consider some of them very cheesy and they need to be refined. Also, we would like to adapt the size of the icon to the size of the face the computer already recognized, so the image would adjust almost perfectly.
  • Sounds. A nice improvement, but also a lot of work to have a good collection of MIDI or MP3 files if we don’t find any publicly available.
  • Mobile versions. Since Kivy offers this possibility, it would be silly not to take advantage of it. After all, we know addictive games are the key to entertaining people on buses. This would turn the application into a real crowdsourcing project, even if it implies building a better system for storing the information, following REST principles with OAuth and API keys.
  • Cleaning the collection. Finally, after having enough data, it will be the right time to collect the faces and have the first repository of “The Baroque Face”. This will give us a spectrum of how people from the 16th to the 18th centuries looked. Exciting, isn’t it?
  • Visualizations. We will also be able to do some interesting visualizations, like heat maps of where people touched to mark a mouth, an ear, or a head.

7. Conclusions

In conclusion, we can say that the experience has been awesome. Even better than that was to see the really high level of our classmates’ projects. In honour of the truth, we must say that we have a background in Computer Science and we somehow played with a bit of an advantage. Anyway, the presentation of all the projects was an amazing experience. We really liked the course and we recommend it to future students. Let’s see what the future has prepared for Gamex!

Some of the very interesting projects

This post was written and edited together with my classmate Roberto, so you can also find it on his blog.

4 Comments

Filed under Analysis, Tasks

Building the proper screen

The last step in the project, after we were able to overcome all the technical difficulties like the Kivy language, was the building of a suitable screen for our purposes, that is, a poke-able rear screen. By doing this we avoid the problem of calibrating the Kinect device each time and for each user, and, foremost, we can do the setup just once.

The first attempt of a rear screen

Our first attempt was building a very big frame and using a table cover or bed sheet as the screen. But we found several serious problems:

  1. The frame was too big to move.
  2. The frame wasn’t rigid enough, and with the interaction of the users it got deformed.
  3. The screen, after a user interaction, never got back to normal and stayed deformed forever.

Among all the problems (and others not mentioned here), the last one was totally frustrating, because the whole platform depends on the stability of the screen. If the screen is not perfectly flat, the Kinect will detect that, and it will send touch points to the Kivy application when there are actually none. Everything was wrong.

Choosing the most beautiful tissue

Choosing the most beautiful tissue

The alternative was to use a stretch fabric with the capacity to recover its initial shape after virtually any number of interactions, even hard punches. But we didn’t know where to buy it or if it was even cheap enough for our students’ pockets. Fortunately, Prof. William Turkel recommended Fabric Land to us, with three locations in the city and a lot of options for fabrics. I must say that the place was a bit weird for me, with a bunch of middle-aged ladies looking for good materials. I felt like Mrs. Doubtfire there. Finally a girl, very gentle, young and nice, helped us to find what we wanted, and she sold it to us at the price of $5 per meter!

Colours pins! That's the most decorative we can be

Colours pins! That's the most decorative we can be

And with all the raw material ready, we got down to work and built, after several tries, the proper rear-projection stretch screen. In the very first test we discovered that the accuracy was amazing. And the most interesting thing: somehow, the fact that the screen is elastic demands interaction from the users and keeps people playing. So, we can say mission accomplished!

2 Comments

Filed under Tasks