2012-12-26

On parallel computing

C is a certain kind of devil - it can turn on you in the blink of an eye. If you're not careful for an instant, it's gonna bite you in the ass. Hard. Parallel computing is another kind of devil that does the same thing. When you combine the two, you get some of the most intricate problems humankind has to face. But I'll only discuss the parallel part in this post.
In my experience using threads, I've created completely crazy code (as pointed out by my colleagues, Paul Țiței and Sebi Orțan). I'd like to come clean, and help you avoid making the same mistakes.

First, here's the classic tried-and-true formula guaranteed to occasionally spawn a deadlock:

void *a() {
    lock_resource(X);   /* a() takes X first... */
    lock_resource(Y);   /* ...then Y */
    do_work();
    unlock(X);
    unlock(Y);
    return NULL;
}

void *b() {
    lock_resource(Y);   /* b() takes Y first... */
    lock_resource(X);   /* ...then X -- the opposite order to a() */
    do_work();
    unlock(X);
    unlock(Y);
    return NULL;
}

Suppose a() and b() are called in quick succession on two threads. Say that a() locks X, and immediately after, b() locks Y. Now a() is waiting for Y, and b() is waiting for X. Deadlock.
This can easily be prevented by making sure a() and b() lock the resources in the same order. Use pencil and paper, or whatever you like, but you MUST guarantee a single global locking order - it's the simplest reliable way to rule out this kind of deadlock.
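Here's a minimal sketch of the same discipline using Python's threading module (a different language than the C above, and the worker names are made up, but the principle is identical): both workers take the locks in one agreed-upon order, so neither can end up holding one lock while waiting for the other.

import threading

lock_x = threading.Lock()
lock_y = threading.Lock()

def worker_a():
    # Acquire in the global order: X first, then Y.
    with lock_x:
        with lock_y:
            pass  # do_work()

def worker_b():
    # Same order as worker_a - no circular wait is possible.
    with lock_x:
        with lock_y:
            pass  # do_work()

threads = [threading.Thread(target=f) for f in (worker_a, worker_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()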

Our teacher, Rareș Boian, told us in the course that once you run into a deadlock and manage to fix it, your lab project should be OK. Well, I fixed my deadlock, but I created a more subtle problem (very, very wrong, according to my colleagues, and eventually to me as well) - one that our lab examiner didn't notice, luckily (or not) for me:

void *thread() {
    lock_mutex();
    do_things_that_dont_need_mutex();       /* shouldn't be inside the lock */
    do_things_that_need_mutex();            /* the actual critical section */
    do_other_things_that_dont_need_mutex(); /* shouldn't be inside the lock either */
    unlock_mutex();
    return NULL;
}

Of course, when I put it this way, you're gonna notice it. My program has no concurrency, defeating the purpose of threads! Since each thread holds the mutex for its whole run, what are the other threads doing in the meantime? Mostly just sitting blocked on that mutex; the only work that actually overlaps is thread creation and scheduling, which is trivial compared to all the instructions that don't need the mutex. They don't call it a critical section for nothing - don't put non-critical work inside it, like I did!
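Here's roughly what the fixed shape looks like, as a small Python sketch (the helper names and the shared counter are invented for illustration): the expensive independent work runs outside the lock, and only the brief shared-state update is protected.

import threading

shared_total = 0
mutex = threading.Lock()

def worker(items):
    global shared_total
    # Work that doesn't touch shared state stays outside the lock,
    # so it never makes other threads wait on the mutex.
    partial = sum(x * x for x in items)
    # Only the shared-state update is the critical section.
    with mutex:
        shared_total += partial

threads = [threading.Thread(target=worker, args=(range(i, i + 1000),))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_total)
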
Maybe this wasn't stressed enough in the course, or maybe it was just me (skipping some classes and being really stupid), but I felt the need to point this out to the rest of the world.
While our overworked teachers may have missed this important philosophical point, I am leveraging the power of the Internet (free content distribution) to point it out to you. Be kind and share this with your fellow prospective programmers!

2012-10-14

On dolphins

There's lots of evidence that dolphins are conscious - some species even have bigger brains than we do. It's an unfortunate arrangement: the species with the biggest brains on the planet have very limited means of physically manipulating their environment, which significantly limits this planet's innovation output.


So, I wrote a blog post expressing my respect and solidarity. Maybe some day a dolphin will read it and tell the other dolphins that humanity didn't mean any harm, despite the pollution and other trouble it's brought upon them. Especially Japan. Those bastards.

On the other hand, maybe mankind will cooperate with dolphinkind one day. There already exist machines that facilitate dolphin-computer interaction via sound. Maybe, using more advanced AI techniques (such as an autoencoder :D ), a computer could find a mapping between dolphin words (if there are any) and human words. But dolphin language could, as far as we know, be just as complicated as human languages, and we've had a great deal of trouble modeling even those, even with all our knowledge of English phonemes, words, and grammar. Dolphinese is a whole new language, alien to us.

Hopefully, however, in the future, our machines will allow us to talk to dolphins, and tell them our ideas and problems, and allow them to express theirs, which will increase the number of intelligent beings having a say on how the Universe works. Or maybe they'll just tell you they kinda like you.

2012-08-29

Web scraping tutorial

I've seen this video about scraping websites, and after trying it myself and finding it fun, I want to write a short and quick tutorial on how to do it. If you know of some handy tool that I haven't written about and that might fit in this tutorial, don't hesitate to comment!

Suppose you're looking for a job. But you're not satisfied clicking on every link you see on a listings site such as Craigslist. You need an automated solution that does that for you, because it's too repetitive and boring. But that's what computers are for! Let's turn boring repetitiveness into exciting fascination!

Scraping means getting structured data out of a website using an automated tool. Website owners don't always appreciate it: a bot doesn't look at any advertisements, so they can't make money off it, yet it still relentlessly uses up their bandwidth. So, morally, it's something of a grey area.

For doing this, I use Python and a module for it, namely lxml. There are also mechanize, BeautifulSoup, and hpricot, but those are beyond the scope of this tutorial. Watch the video for more.
Also, a much nicer interactive interface for Python is IPython, which feels more like an environment than a simple interpreter: it has auto-completion, syntax coloring, and detailed tracebacks, and I think it looks better than IDLE even though it's a console app.
On Ubuntu, installing these was as easy as typing:
sudo apt-get install python-lxml ipython

First, we need a way to tell our Python script which elements of a page we want. We are going to use these things called "CSS Selectors". They're some sort of rules specifying what should and what shouldn't be matched - sort of like regular expressions, but specifically designed for web pages. Here's an example of one matching all bold elements on a page:
b
There you go. Looks simple, right? It's just the letter b. Now here's one that matches all bold, italic, and level-1 heading text:
b, i, h1
You get the idea. But real expressions used to find more specific content aren't as simple. They might refer to element IDs, elements following other elements, parents and children, attributes, and so on. Here's the one that matches the job links on the Craigslist listings page we'll use below:
.row a:nth-child(2)
This one matches every link that is the second child of its parent and sits inside an element with the class "row". If you want the details, look here.

But you don't need to learn this syntax in order to do useful stuff. Thank goodness, since it would be such a tedious task. It turns out there's a tool that allows you to visually mark elements and generate the simplest expression for you. It's called SelectorGadget. Such a creative name.

Install it. It's a bookmarklet - a snippet of JavaScript that runs on the current page, right in your browser.

For this tutorial, we're going to download all links to job pages on Craigslist Romania. These could, for example, be saved into a text file, one per line; then you could call:
wget -i file_with_links.txt
to download each of them into the current directory.

Ok. With the SelectorGadget installed, visit the page and click on the bookmarklet. Be warned, all hell may break loose if you do that. Then, click on a link to a job.
When you click on it, lots of things may turn yellow. That means those things have also been selected by the generated selector. To mark one as unwanted, click on it again, and it will turn red, and will serve as a counterexample of what you want. Keep doing this until you only have the links to the jobs. It may take a few clicks until you get exactly what you want, and it's not guaranteed to find a solution, but it saves you a whole lot of effort when it does.

Now that we have this... alien CSS Selectonator thingy... we save it for later. We're going to use it in our program, even though we don't fully understand it. Computers have a mind of their own.

Now, let's learn how to use lxml.html! I picked this one because it's quite robust (it hasn't let me down yet) and very fast - it's a Python binding to C libraries, so it's about as fast as you can get from Python. Fire up IPython and enjoy the view:

Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.

?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]:


Ah, bliss!
Now, assuming you did install lxml, write these commands:
from lxml.html import parse
import urllib2
url = urllib2.urlopen('http://bucharest.craigslist.org/jjj/')
Now the raw HTML could be read with url.read(), but lxml can parse the response object directly and give us the root node of the resulting tree:
doc = parse(url).getroot()
And to apply the selector to get our job links:
links = doc.cssselect('.row a:nth-child(2)')
To print them, we need their "href" attribute. We can simply iterate over them like this:
for link in links:
    print link.attrib['href']

Great. Hopefully you'll now have your screen full of Craigslist links. This is good.
But we want the links to be in a file. The easiest way to do this is by redirecting the output. Save the commands in a script file, and run it like this:
python CraigsLinks.py > file_with_links.txt
and you've got the file you can use in the command at the beginning of the article. And a lot of jobs! Hopefully you'll find a way to make the computer read them too. Congratulations!
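In case you want everything in one place, the script is just the commands above collected together - something like this (assuming you name it CraigsLinks.py, as in the command above):
from lxml.html import parse
import urllib2

# Fetch the Craigslist jobs listing and parse it into an element tree.
url = urllib2.urlopen('http://bucharest.craigslist.org/jjj/')
doc = parse(url).getroot()

# Apply the selector that SelectorGadget generated for us.
links = doc.cssselect('.row a:nth-child(2)')

# Print one URL per line; redirect stdout to get a file wget can read.
for link in links:
    print link.attrib['href']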

That was surprisingly easy to write, and it could be easily adapted to sites which don't take steps against scrapers. Become filled with astonishment at the simplicity of Python and its modules!

There are many more things to learn. For example: running JavaScript on sites that rely on it, tricking the site into thinking we're a regular browser (faking the User-Agent string), navigating from page to page ("crawling"), or scraping and crawling in parallel. For more, watch the video and explore the tools' tutorials and documentation.

2012-07-25

On popularity over time

This post is about homemade science. There's probably lots of wheel reinventing going on, since I'm not very knowledgeable about theory, but it illustrates how to craft your own "research" pipelines.

I was wondering how something posted online varies in popularity over time, but I had no data to analyze. Then I remembered YouTube has a cute little "Show Video Statistics" button:

So I saved a bunch of these in a folder, and ran the following command:
convert * -average a.png
This uses a program called ImageMagick to average the pictures. I obtained the following result:

I then cropped and rescaled the image so that one pixel on the horizontal axis corresponds to a day, and the vertical axis measures the percentage of total views. Mirroring the image vertically then makes the X and Y coordinates of the curve's pixels exactly the data I'm looking for. After that I adjusted the colors, extended the canvas to 200 pixels vertically, applied a 100px vertical Gaussian blur, thresholded and trimmed the result, and traced a clean line over the curve:


Now I write down some coordinates from the line, then plug them into Formulize. I set the target function to be:
D(visitors, day) = f(day)

This tells Formulize that we want to fit the derivative of the number of visitors with respect to the day. We want the derivative because YouTube plots cumulative total views, while we're interested in views per day.
Also, I let it use a wide variety of building-block functions, since what I'm trying to approximate is nonlinear.
Then I run for a little bit, and I get the following:


Yays! The best match is a function of the form exp(-x^2) - which is a Gaussian function (the bell curve). So the visits per day to something posted online decay like the right-hand side of a normal distribution. This can probably be derived from some theory somewhere, and if you know any papers on it, please comment!
If a dataset matches the normal distribution, it's as if you were throwing darts at the center but your hits scatter a bit around it. This particular dataset is the same, except all the hits land after the publishing time (since you can't view a post before it's posted).
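If you don't have Formulize at hand, a similar fit can be done with SciPy; here's a rough sketch, where the data points are made-up placeholders standing in for the coordinates read off the traced line:
import numpy as np
from scipy.optimize import curve_fit

# Placeholder (day, views-per-day) points - substitute the coordinates
# you actually read off the traced line.
days = np.array([0, 2, 4, 6, 8, 10, 15, 20, 30], dtype=float)
views_per_day = np.array([100, 97, 88, 75, 61, 46, 17, 4, 0], dtype=float)

# Half-Gaussian model: views peak when the video is published and then
# decay like the right-hand side of a bell curve.
def half_gaussian(t, peak, sigma):
    return peak * np.exp(-t ** 2 / (2.0 * sigma ** 2))

params, _ = curve_fit(half_gaussian, days, views_per_day, p0=[100.0, 10.0])
print("peak=%.1f views/day, sigma=%.1f days" % tuple(params))
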
This kind of empirical thinking is greatly valued among scientific people. If you enjoy doing stuff like this, consider pursuing a career in research.

2012-05-25

Will teaching soon be over?

MOOCs (Massive Open Online Courses) are sprouting up all over the Internet. While this seems like a good thing, since education becomes much more accessible, it's also a double-edged sword: it will have a great impact on society, and, inevitably, some people will have to find other jobs. Let me explain how that could happen.
So far, these courses seem pretty harmless, but sooner or later some of them will become better and better recognized in industry. They're also very cheap to produce: new content only needs to be created once, or when the scientific understanding of a subject changes. The same material can be reused many times and broadcast at an ever-decreasing cost (think of cloud services such as YouTube or Amazon's EC2).
This content also allows much greater interactivity - you can pause and rewind a lecture to understand it better, you can track your progress on each of the concepts presented in the course, and you can receive exercises and examples tailored by a machine specifically for you. Contrast this with an in-person lecture, where you have to listen in a linear fashion and can't rewind if you missed something - perhaps losing track of what is being taught.
There are still some obstacles to overcome, such as verifying the identity of a user or making the assessments and exams credible, but if the courses truly manage to teach the required material - and some will almost certainly manage to - then these problems are trivial, in my opinion. Some people disagree, however.
So, if high-quality education is available online for free (or almost for free), then schools and universities become redundant. People will be just as able to build careers as if they had actually gone to university, which is bad news for universities, since far fewer people will want to attend them. Therefore, many teachers lose their jobs.
This is another example of technological unemployment. Teachers, people who have devoted their lives to teaching, might be forced to find other jobs. Even though teaching is a highly specialized and very hard job to get right, demand for it will drop drastically over the coming years, just as it did for many manual trades that were automated away.
Such phenomena are actually indicators of progress. There can be no improvement without change, and this can be seen as society ripping off a band-aid.

2012-05-18

All fuzzy inside

Sometimes - a whole lot of times - things aren't simply black and white. Sometimes they're thrown together and blended into an unintelligible mess. If you have to deal with such a mess, strict boolean-logic rules and algorithms will only make your mess bigger (and maybe cause some segfaults).
Sometimes you need to estimate a state, to respond proportionally, or to reason under uncertainty. In those cases, you must use a probabilistic interpretation instead of a brittle, hardcoded case-based controller.
Maybe the simplest and most robust system that deals with this is fuzzy logic.
[Figure: fuzzy thermostat membership functions - shamelessly stolen from Wikipedia]
As you can see in the graph, the temperature is a continuous variable, while the system describes it with three overlapping categories - cold, warm, and hot - each holding to some degree between 0 and 1. This could be, for instance, a thermostat deciding whether to cool or heat a room.
Part of its robustness lies in transitioning from one state to another smoothly, without abrupt switching at the border between them. The slopes of the membership functions can also be adjusted to compensate for the subjectivity in choosing the thresholds, or for errors in the measurement, since the real world can be very noisy.
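Here's a minimal sketch of the idea in Python, with made-up temperature breakpoints (a real controller would tune these and add a defuzzification step to turn the degrees back into one output):

def ramp(x, start, end):
    # Goes linearly from 0 at start to 1 at end (works in either direction).
    if start == end:
        return 1.0 if x >= end else 0.0
    t = (x - start) / float(end - start)
    return max(0.0, min(1.0, t))

def memberships(temp_c):
    # Each category gets a degree of truth in [0, 1]; they overlap on purpose.
    cold = ramp(temp_c, 18.0, 12.0)    # fully cold at 12 C, not cold at all at 18 C
    hot = ramp(temp_c, 24.0, 30.0)     # fully hot at 30 C, not hot at all at 24 C
    warm = max(0.0, 1.0 - cold - hot)  # toy definition: whatever truth is left over
    return {'cold': cold, 'warm': warm, 'hot': hot}

print(memberships(16.0))  # e.g. a bit cold, mostly warm, not hot at all
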
There also exist more advanced controllers in this space, such as the PID controller, which also takes into account the integral and the derivative of the error signal with respect to time. However, deciding how much weight each component (the proportional, integral, and derivative responses) should get requires extra parameters, which shrinks the range of settings that lead to a "good" solution. The more parameters that need tuning, the harder a system is to keep stable.
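For reference, here's a bare-bones PID step in Python, with invented gains that aren't tuned for anything in particular:

class PID(object):
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        # Accumulate the integral and estimate the derivative of the error.
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # The three gains decide how much each component contributes.
        return self.kp * error + self.ki * self.integral + self.kd * derivative

controller = PID(kp=1.2, ki=0.1, kd=0.05)
print(controller.step(error=2.0, dt=0.1))
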
This is loosely related to the principle of Occam's razor - among competing explanations, the one with the fewest assumptions should be preferred. It's also why laziness is one of the programmer's virtues: generally, the simpler the code, the better it is (as long as it does the required job).

2012-05-13

On Kaggle and the Turing Test

There's this website called Kaggle, where you can compete at data analysis and win money (a lot of it, if you win). Essentially, people build models of some data provided by an organization, and the models with the best scores are awarded prizes (or karma).
Recently there was a competition in which the competitors were asked to come up with a solution for automatically grading essays. The organizers provided examples for training the models: essays together with their human-assigned grades.
The scoring metric, a form of inter-rater agreement called "kappa" in this case, ranges from 0 (no better than chance) to 1 (a model that agrees perfectly with the human graders).
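To give a flavor of the metric, here's a rough sketch of plain (unweighted) Cohen's kappa in Python, which captures the basic idea: compare the observed agreement against the agreement you'd expect by pure chance.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Unweighted Cohen's kappa between two equally long lists of grades.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / float(n)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters happen to pick the same grade.
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / float(n * n)
    return (observed - expected) / (1.0 - expected)

print(cohens_kappa([1, 2, 2, 3, 4], [1, 2, 3, 3, 4]))  # ~0.74
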
One contender noticed that there's a certain discrepancy between the human grades - the graders only agree with each other at a score of about 76%. They were worried that this would be a ceiling for how well a computer model could do the job (since humans are supposedly better than computers at understanding human language).
However, when the competition came to an end, the best model scored better than the professional human graders, with a score of 81.4%.
While this may be partly because the algorithms had a closer estimate of the "true" grade (they could average the human ratings), it still suggests that computers may now be better than humans at grading essays - that a machine can judge the quality of a given text more consistently than people can. I believe this illustrates how quickly technology has been progressing in recent years, becoming almost incomprehensibly better.
The idea behind the Turing Test is that if a program can successfully deceive humans into thinking it's another human (over instant messaging, say), then it's safe to call it intelligent.
I wonder, if programs can understand humans better than humans themselves, what will that mean for humankind? Will all our jobs be automated? Will all of us become unemployed? Do we need another economic model?

Intro

Let me tell you what this blog will be about.
I'm a CS student, and from time to time I have too much time (aka I get bored of doing what I'm supposed to be doing). That's when I look at stuff on the Internet - the great series of tubes linking people together - and see stuff that I enjoy thinking about.
An autoencoder is a neural network that learns to compress its input into a compact code and then reconstruct the input from that code as accurately as possible. Sparse autoencoders are awesome because the codes they learn make very good features for classification (and you can easily target new categories while retraining only the last layer). I'm a fanboy of Andrew Ng (I took his ML class and thought it was awesome).
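For the curious, here's a toy sketch of the idea in numpy - the layer sizes, learning rate, and random data are all arbitrary, and real sparse autoencoders add biases and a sparsity penalty on top of this:

import numpy as np

# A tiny autoencoder: 8-dimensional inputs squeezed through a 3-unit code.
rng = np.random.RandomState(0)
X = rng.rand(100, 8)            # toy data, one row per example

W1 = rng.randn(8, 3) * 0.1      # encoder weights
W2 = rng.randn(3, 8) * 0.1      # decoder weights
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    code = sigmoid(X.dot(W1))   # encode
    recon = code.dot(W2)        # decode (linear output)
    err = recon - X             # reconstruction error
    # Backpropagate the squared-error loss through both layers.
    grad_W2 = code.T.dot(err) / len(X)
    grad_code = err.dot(W2.T) * code * (1 - code)
    grad_W1 = X.T.dot(grad_code) / len(X)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

print("reconstruction MSE: %.4f" % np.mean((sigmoid(X.dot(W1)).dot(W2) - X) ** 2))
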
The motto of the blog - "Listen to the sound of the machine" - is a quote from Elephants Dream, the first Open Movie - I'm also a fan of open source and consider the movie a great achievement.
I intend to post about once a week, though real life has priority.