2012-07-25

On popularity over time

This post is about homemade science. There's probably lots of wheel reinventing going on, since I'm not very knowledgeable about theory, but it illustrates how to craft your own "research" pipelines.

I was wondering how something posted online varies in popularity over time, but I had no data to analyze. Then I remembered YouTube has a cute little "Show Video Statistics" button:

So I saved a bunch of these in a folder, and ran the following command:
convert * -average a.png
This uses a program called ImageMagick to average the pictures. I obtain the following result:

I then proceeded to crop and rescale the image, such that a pixel on the horizontal scale is a day, and a pixel on the vertical scale is the percentage of the total views. Then, if I vertically mirror the image, the X and Y coordinates of the pixels represent the data I'm looking for. I adjust colors, canvas size to 200 vertically, Gaussian blur vertically 100px, threshold and trim conveniently, and draw a convenient line:


Now I write down some coordinates from the line, then plug them into Formulize. I set the target function to be:
D(visitors, day) = f(day)

This tells Formulize that we want to find the derivative of the function w.r.t. the number of days. We want the derivative, because YouTube plots the total views, but we want the views per day.
Also, I select many functions to make use of, since this is a nonlinear thing that I'm trying to approximate.
Then I run for a little bit, and I get the following:


Yays! The best match is a function of the form exp(-x^2) - which is actually a Gaussian function (or the bell curve). Now we know that the visits per day to something posted online decrease similarly to the right side of a Normal distribution. This can be probably derived from some theory somewhere, and if you know any papers or such, please comment!
If a dataset matches the normal distribution, it means it's just like you tried to throw darts at the center, but the points you hit are a bit off. This particular dataset is the same, except all the hits are after the publishing time (since you can't see a post before it's posted.)
This kind of empirical thinking is greatly valued among scientific people. If you enjoy doing stuff like this, consider pursuing a career in research.