Monthly Archives: October 2012

Making Sense of Teaching Computer Science Tools to Linguists

As a part of the PhD program, I am required to design and defend two different courses: one intended for undergraduate students and another one for grads. I don’t know yet if both can be for graduate students, but my first course will be. It is going to be about some useful tools that any experimental, and not just theoretical, linguist should know. Today, we are getting more and more accustomed to hearing terms like digital humanists, digital history, or digital whatever. There is even a sub-discipline (if we can call it that) named Computational Linguistics. However, it seems to me like two ancient rival soccer teams: what a traditional linguist does is pure Linguistics, but what a computational linguist does is just the future. Again, the everlasting fight between old and new ways of doing things, when what really should matter is what your questions are and what tools you have to find the answers. That’s why I am proposing this course: to teach tools and new ways to think about them, and to make students able to solve their own research problems. And, even more important, to make them lose the fear of experimenting with Computer Science and Linguistics.

That being said, I am going to introduce you to a very early draft of my intended syllabus. Every week is going to have a one-hour class, in order to explain and introduce concepts, and a two-hour lab class in which to test, experiment with and expand on the subjects covered that week.

  1. Computer Architecture. One of the most important things that virtually anybody should know is how a computer actually works. In order to understand what is possible to do and what is not, you need to know the common components of almost any current computer, like RAM, CPU, GPU, hard drives, input/output operations and devices, etc. Also, a brief introduction to the existing types will be given. Once you know how a machine is built, you can control and understand things like whether you have enough memory to run your programs, why a given file freezes your computer when loading, and so on.
  2. Fundamentals of Programming. The first thing to note is that almost everything inside your computer is a program. And I will say more: an important number of processes in your everyday life are pretty similar to computer programs. The order you follow when you are taking a shower, that awesome recipe for cooking pumpkin pie, the steps you take before starting the engine of your car, or the movements you make while dancing. All of them are in a way similar to computer programs or, better said, to algorithms. A computer program is a set of instructions that a machine runs one by one. Those programs are usually algorithms, in the sense that they are steps to achieve an output given a specific input. Very likely, an introduction to coding using something like pseudo-languages, flowcharts, or NetLogo will be given.
  3. Programming Languages. A brief introduction to programming languages and why they are the way they are. Some linguists are really accustomed to dealing with language peculiarities; however, natural languages seem to answer to the hypothesis of Universal Grammar, as argued by the generative grammar studies of Noam Chomsky. A grammar is a set of rules, but in natural languages, unlike formal languages such as those used to code, the set of rules is usually huge. Fortunately, programming languages are built using a small set of rules, and the grammars that describe them can be, according to Chomsky, classified according to the way they generate sentences. We could even say that studying and learning how to program is like understanding how another language works. So, in the end, programming languages are just a kind of language: constructed languages. And instead of learning how the brain works in order to understand it, you have to know how machines do. Once this is clear, it is time to meet Python.
  4. Writing Code. After the introduction to Python, students have to learn what well-structured code looks like. Concepts like loops, flow control statements, functions and parameters will be taught. Then it is time to show them what libraries are and how to create their own. Finally, notions of Object-Oriented Programming (OOP) in Python will be shown, just in order to guide them in the use of objects and third-party libraries. Regrettably, a hard-core lesson like this one is really needed in order to expand the skills of students facing real-life problems in their research. Thinking exactly the way machines do is the best way to code efficiently.
  5. Python Libraries. After getting some basic knowledge about programming, some third-party libraries will be presented. In particular, the commonly used Natural Language Toolkit, or simply NLTK. This library is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, part-of-speech (POS) tagging, parsing, and semantic reasoning.
    The scientific Python extensions scipy, numpy, pylab and matplotlib will also be introduced, but only briefly, because visualization is covered in Week 8.
  6. R Language. R is a language and environment for statistical computing and graphics. Its syntax and usage differ a bit from Python's, and it is much more focused on manipulating and analyzing data sets. Although R has built-in functions and libraries for almost any measure, there is a very active community behind it that provides even more through the Comprehensive R Archive Network, or CRAN. Learning how to use it and where to find the function that does exactly what you want is as important as knowing the language. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible for any research purpose. This week will cover just basic statistical concepts like population, sample, frequency, and measures of center (mode, median, mean), spread (range, interquartile range, variance, standard deviation) and shape (symmetry, skewness, kurtosis).
  7. Statistics. More advanced features will be introduced: not just how to calculate them, but when and why. ANOVA, chi-square, Pearson coefficient, arithmetic regression are some of them. Knowing the meaning of statistical measures is really important to understand your data and what is happening with them. However, the point will always be how to get these measures working in R, instead of discussing their deeper theoretical aspects.
  8. Plotting. Both R and Python have powerful tools to represent and visualize data. Unfortunately, the syntax of these functions can be a little tricky and deserves a whole week to be explained. Producing good charts and visualizations of your data can be a crucial step in getting your research properly understood. That’s why this week will introduce different methods to plot data: bar charts, pie charts, scatter plots, histograms, heatmaps, quadrilateral meshes, spectrograms, stem plots, cross-correlation, etc.
  9. Regular Expressions. As defined in Wikipedia, “a regular expression provides a concise and flexible means to ‘match’ (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.” The origins of regular expressions lie in automata theory and formal language theory. These fields study models of computation and ways to describe and classify formal languages. In the formal definition, given an alphabet of symbols Σ, the constants are the empty set, the empty string, and literal characters. The operations of concatenation, alternation and Kleene star (and plus) define regular expressions. However, we will use the POSIX regular expression syntax, in its extended flavor, because it is widely used and easy to learn. Of course, it does include boolean “OR”, grouping and quantification. Linguists can benefit from regular expressions because they are a tool that allows filtering and discovering data. Let’s say we have a set of words in Spanish and we want to extract all conjugations of the verb “jugar” (to play). The regular expression “^ju(e?)g[au].*$” will return most of them, although forms such as “juego” or “jugó” need “o” and “ó” added to the character class.
  10. CHAT Format. CHAT is a file format for transcribing conversations in natural language using plain text. First designed for transcribing children’s speech, it is actually a quite powerful format for any kind of transcription. It allows recording parts of a sentence, omissions, speakers, dates and times, and more features, including some from phonology. There is even an application that translates CHAT XML files to Phon files.
    Besides, because CHAT files are actually plain text files, the Python Natural Language Toolkit already has a parser for them, so we can do certain kinds of analysis using just Python and our already acquired skills with regular expressions.
  11. CHILDES CLAN. Besides the specification of the CHAT format (.cha), CHILDES is intended for . One of the tools it provides is CLAN, the most general tool available for transcription, coding, and analysis. Over a corpus of texts written in CHAT format, CLAN provides methods such as headers, gems, comments, and postcodes that can be used for some qualitative data analysis (QDA).
  12. Presentations. Although at the beginning this week was intended for introducing PRAAT and CHILDES Phon for the analysis of phonological data, later I thought that it would be even more useful for students to present their projects to the class for comments and feedback. So, this week won’t have any new content, but students will have to comment on and critique their classmates’ work.
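
To make the regular expression from week 9 a bit more concrete, here is a minimal sketch in Python, using the built-in re module rather than a POSIX tool. The word list is made up for illustration, and the second pattern (WIDER) is only a hypothetical variant of the one quoted in the syllabus, not a complete description of the conjugation of “jugar”.

```python
import re

# Pattern quoted in the syllabus; Python's re syntax accepts it unchanged.
ORIGINAL = re.compile(r"^ju(e?)g[au].*$")
# Hypothetical wider variant: also allows "o" and "ó" right after the stem.
WIDER = re.compile(r"^ju(e?)g[aouó].*$")

# Made-up word list, for illustration only.
WORDS = ["juego", "juegas", "jugamos", "jugué", "jugó",
         "jugaron", "casa", "juez", "juguete"]


def matching(pattern, words):
    """Return the words matched by the compiled pattern, in input order."""
    return [word for word in words if pattern.match(word)]


original_hits = matching(ORIGINAL, WORDS)
wider_hits = matching(WIDER, WORDS)

print("original:", original_hits)
print("wider:   ", wider_hits)
```

Two things are worth noticing when running it: the original pattern skips “juego” and “jugó”, and both patterns accept “juguete” (toy), which is not a verb form at all. A regular expression filters by shape, not by meaning, so the results still need a linguist's eye.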

Quite interesting. Not only from a linguistic point of view, but also for a computer scientist who enjoys teaching computer science to others. Fortunately, the linguist Prof. Yasaman Rafat agreed to be my co-supervisor for the linguistic side of this course. However, because the course bridges two disciplines, I still needed a computer scientist to guide me in the process of teaching non-technical students. Fortunately, I asked Prof. Mark Daley, of Biology and Computer Science (and a Python expert), and he kindly said yes. So now there is no obstacle at all to making this course a real thing :)


Filed under Tasks, Topics

Talking about Graph Databases at PyCon Canada

Good news, everyone! This November, at PyCon Canada 2012, I will be giving a short talk (about 20 min.) about graph databases in Python, and will most likely share the time with Diego Muñoz. And now, some info about the talk 😀

Since the NoSQL concept burst onto the market, graph databases have traditionally been designed to be used with Java or C. With some honorable exceptions, there isn’t an easy way to manage graph databases from Python. In this talk, I will introduce you to some of the tools that you can use today in order to work with those new, challenging databases, using of course our favorite language: Python.


Starting from a very basic definition of what a graph is and why we want to start using one, I will introduce some real-life examples. Companies like Facebook or Twitter pretty recently started using graph-like databases.
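
Just to give a flavor of that basic definition, here is a minimal sketch in plain Python, with no graph database involved: nodes are dictionary keys, edges are the lists they point to, and a breadth-first traversal answers "who can I reach from here?". The FOLLOWS data and the reachable function are hypothetical names made up for illustration; they are not part of any of the libraries covered in the talk.

```python
from collections import deque

# Hypothetical "follows" graph: each key is a user (node) and each value
# lists the users they follow (outgoing edges).
FOLLOWS = {
    "ana": ["bob", "carla"],
    "bob": ["carla"],
    "carla": ["dan"],
    "dan": [],
}


def reachable(graph, start):
    """Breadth-first traversal: nodes reachable from start, in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append(neighbour)
    return seen


print(reachable(FOLLOWS, "ana"))
```

Questions like this one, which hop from node to node along relationships, are the kind that graph databases are designed to answer without hand-written traversal code.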

Then, an overview of the available options will be shown, detailing the ecosystem a bit and highlighting which of the solutions can be used with Python and how.

After that, focusing on Blueprints databases, I will show how to use pyblueprints for basic stuff, and bulbflow for building Django-style models based on nodes.

Finally, for using just Neo4j, one of the most solid graph databases on the market and dual licensed (GPL for Open Source), I will introduce py2neo, a Cypher-centric approach, and neo4j-rest-client, which is the one I actively develop. And if there is some time left, some real examples connecting to AWS or Heroku.

The talk is intended for an audience with a basic or novice level of Python, so I hope to see you all there!

Filed under Events