Linear Regression Using Ruby

In the following post I am going to walk you through the basics of linear regression, and show you how you can perform simple linear regression using Ruby.

While Ruby is not commonly recognized as a tool for statistical analysis, there are times at Sharethrough when we need to perform basic statistical modeling in our web application, which is written using Ruby on Rails. In addition, Ruby’s elegant syntax makes computational regression very approachable.

Let’s begin with a little mathematical review. Simple linear regression is a mathematical technique used to model the relationship between a dependent variable (y) and an independent variable (x).

Since we are attempting to find a linear relationship between a dependent variable and a single independent variable, the basic equation is one most readers will recognize as the equation of a line:

\(y = \beta_0 + \beta_1 x\)

Linear regression is the process of finding the best values for \(\beta_{0}\) (the y-intercept) and \(\beta_{1}\) (the slope). Finding these values will take up the remainder of this post.

The best values for \(\beta_{0}\) and \(\beta_{1}\) will minimize the error between our line and the dataset. Unless you have a perfectly linear dataset, which almost never occurs in the real world, you will never find perfect values for \(\beta_{0}\) and \(\beta_{1}\). Therefore, we’ll try to estimate the best possible values for \(\beta_{0}\) and \(\beta_{1}\). These estimates will be denoted \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\).

We can now define the regression equation as:

\(\hat{r}(x) = \hat\beta_{0} + \hat\beta_{1}x\)

The error between our model and the data is calculated using the residual sum of squares, which is defined as:

\(\sum_{i=1}^{n}\hat{\varepsilon}_i^{2} = \sum_{i=1}^{n}(y_i - (\hat\beta_0 + \hat\beta_1 x_i))^{2}\)
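To make the definition concrete, here is a minimal Ruby sketch that evaluates the residual sum of squares for a candidate line. The method name `residual_sum_of_squares` and the tiny dataset are illustrative assumptions, not part of the original code.

```ruby
# Residual sum of squares for a candidate line y = b0 + b1 * x.
# xs and ys are equal-length arrays of numbers (hypothetical names).
def residual_sum_of_squares(xs, ys, b0, b1)
  xs.zip(ys).reduce(0.0) do |sum, (x, y)|
    residual = y - (b0 + b1 * x) # observed y minus the y predicted by the line
    sum + residual**2
  end
end

# Made-up points: the line y = 2x passes through the first two and misses the third.
xs = [1, 2, 3]
ys = [2, 4, 7]
puts residual_sum_of_squares(xs, ys, 0.0, 2.0)
```

A smaller value means the candidate line sits closer to the data, which is what the estimates are chosen to achieve.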

The goal is to minimize the residual sum of squares. Setting its partial derivatives with respect to \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\) to zero and solving gives the equations for \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\):

\(\hat\beta_{1} = \frac{ \sum_{i=1}^{n} (X_{i}-\bar{X})(Y_{i}-\bar{Y}) }{ \sum_{i=1}^{n} (X_{i}-\bar{X})^2 }\)

\(\hat\beta_{0} = \bar{Y} - \hat\beta_{1}\,\bar{X}\)
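For readers who want the intermediate step, here is a sketch of how these two formulas fall out of the minimization. The symbol \(S\) is introduced here for the residual sum of squares and is not part of the original notation.

\[
\frac{\partial S}{\partial \hat\beta_0} = -2\sum_{i=1}^{n}\left(Y_i - \hat\beta_0 - \hat\beta_1 X_i\right) = 0
\quad\Longrightarrow\quad
\hat\beta_0 = \bar{Y} - \hat\beta_1\,\bar{X}
\]

\[
\frac{\partial S}{\partial \hat\beta_1} = -2\sum_{i=1}^{n} X_i\left(Y_i - \hat\beta_0 - \hat\beta_1 X_i\right) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} X_i\left((Y_i - \bar{Y}) - \hat\beta_1 (X_i - \bar{X})\right) = 0
\]

Because \(\sum_{i=1}^{n}(X_i - \bar{X}) = 0\) and \(\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0\), the leading \(X_i\) in the last sum can be replaced by \(X_i - \bar{X}\) without changing its value. Solving the result for \(\hat\beta_1\) gives the ratio of sums shown above.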

Now that we have the equations, let’s write some Ruby that solves them numerically.

We’ll start by attacking the simplest part of the equations: \(\bar{X}\) and \(\bar{Y}\). These symbols represent the means of the x and y variables in the dataset. In Ruby we can write the following function to compute the mean:
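A minimal sketch of such a function, assuming `values` is a non-empty array of numbers. The `reduce` line follows the snippet quoted in the comments below; the rest is an assumption about how the original read.

```ruby
# Arithmetic mean of a non-empty array of numbers.
def mean(values)
  total = values.reduce(0) { |sum, x| x + sum }
  total.to_f / values.length # to_f avoids integer division
end

puts mean([1, 2, 3, 4])
```

The `to_f` call matters: without it, an array of integers would be divided with integer division and the fractional part of the mean would be lost.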

Now that we have the easy part out of the way, let’s attack the equation for \(\beta_1\).

To simplify, break the equation into two parts, the numerator and the denominator.

The numerator becomes:

\(\sum_{i=1}^{n} (x_{i}-\bar{x})(y_{i}-\bar{y})\)

What this equation says is “for every pair of values in x and y, multiply the difference between the observed x and the mean of x by the difference between the observed y and the mean of y, then add up all of those products.”

In Ruby, this would be:
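One way to write it, as a sketch: the method name `slope_numerator` and the argument names `xs` and `ys` are assumptions, and `mean` is repeated so the snippet runs on its own.

```ruby
def mean(values)
  values.reduce(0, :+).to_f / values.length
end

# Sum over all pairs of (x - mean of x) * (y - mean of y).
def slope_numerator(xs, ys)
  x_mean = mean(xs)
  y_mean = mean(ys)
  xs.zip(ys).reduce(0.0) do |sum, (x, y)|
    sum + (x - x_mean) * (y - y_mean)
  end
end

puts slope_numerator([1, 2, 3], [2, 4, 6])
```

`zip` pairs each x with the y at the same position, which mirrors the shared index \(i\) in the sum.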

Once we have the numerator we can compute the denominator of our equation for \(\beta_1\). The equation for the denominator is:

\(\sum_{i=1}^{n} (x_{i}-\bar{x})^2\)

Writing Ruby to compute this value is also pretty easy:
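A sketch under the same assumptions as before (`slope_denominator` and `xs` are illustrative names; `mean` is repeated so the snippet is self-contained):

```ruby
def mean(values)
  values.reduce(0, :+).to_f / values.length
end

# Sum of squared differences between each x and the mean of x.
def slope_denominator(xs)
  x_mean = mean(xs)
  xs.reduce(0.0) { |sum, x| sum + (x - x_mean)**2 }
end

puts slope_denominator([1, 2, 3])
```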

With the numerator and the denominator identified, we can put them together into a Ruby function that estimates slope:
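A self-contained sketch of that function; the name `slope` and its arguments are assumptions, and the numerator and denominator are computed inline so the snippet can be run by itself.

```ruby
def mean(values)
  values.reduce(0, :+).to_f / values.length
end

# Estimated slope: the numerator divided by the denominator.
def slope(xs, ys)
  x_mean = mean(xs)
  y_mean = mean(ys)
  numerator = xs.zip(ys).reduce(0.0) do |sum, (x, y)|
    sum + (x - x_mean) * (y - y_mean)
  end
  denominator = xs.reduce(0.0) { |sum, x| sum + (x - x_mean)**2 }
  numerator / denominator
end

puts slope([1, 2, 3], [2, 4, 6])
```

Note that if every x value is identical the denominator is zero, so this sketch will not return a usable slope for such input.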

Having solved for \(\beta_1\), we can tackle the solution for \(\beta_0\), which is the y-intercept of our regression line.

The equation for \(\beta_0\) is simply:

\(\hat\beta_{0} = \bar{Y} - \hat\beta_{1}\,\bar{X}\)

Translated into English this equation says “the y-intercept can be estimated as the average of y minus the slope of our line multiplied by the average of x.”

In Ruby this is:
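A sketch, again with assumed names (`y_intercept`, `slope`, `xs`, `ys`) and with the helper methods repeated so it runs standalone:

```ruby
def mean(values)
  values.reduce(0, :+).to_f / values.length
end

def slope(xs, ys)
  x_mean = mean(xs)
  y_mean = mean(ys)
  numerator = xs.zip(ys).reduce(0.0) { |sum, (x, y)| sum + (x - x_mean) * (y - y_mean) }
  denominator = xs.reduce(0.0) { |sum, x| sum + (x - x_mean)**2 }
  numerator / denominator
end

# Estimated y-intercept: mean of y minus slope times mean of x.
def y_intercept(xs, ys)
  mean(ys) - slope(xs, ys) * mean(xs)
end

puts y_intercept([1, 2, 3], [3, 5, 7])
```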

Now that you have both the slope and the y-intercept, you’ve written all the Ruby necessary to perform simple linear regression.

Putting it all together, we end up with the following Ruby class:
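The sketch below shows one way the pieces could be assembled. The class name `SimpleLinearRegression`, its method names, and the length check are assumptions for illustration rather than the original listing.

```ruby
class SimpleLinearRegression
  def initialize(xs, ys)
    # Each x needs a matching y; this guard is an addition for safety.
    raise ArgumentError, 'xs and ys must have the same length' unless xs.length == ys.length

    @xs = xs
    @ys = ys
  end

  # Estimated slope of the regression line.
  def slope
    x_mean = mean(@xs)
    y_mean = mean(@ys)
    numerator = @xs.zip(@ys).reduce(0.0) { |sum, (x, y)| sum + (x - x_mean) * (y - y_mean) }
    denominator = @xs.reduce(0.0) { |sum, x| sum + (x - x_mean)**2 }
    numerator / denominator
  end

  # Estimated y-intercept of the regression line.
  def y_intercept
    mean(@ys) - slope * mean(@xs)
  end

  private

  def mean(values)
    values.reduce(0, :+).to_f / values.length
  end
end
```

Keeping `mean` private leaves `slope` and `y_intercept` as the only public methods, which are the two values the regression line needs.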

Let’s try out our simple linear regression class on a sample dataset: video views vs. the number of days a video has been online. It’s a good example of the data we analyze at Sharethrough.

The dataset looks like:

Days Online    Number of Views
1              5500
2              45000
...            ...

You can download the full dataset here.

We can use the code below to run our Ruby-based regression on the sample dataset:
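A hedged sketch of such a run follows. The full dataset is not reproduced in this post, so the snippet uses only the two rows shown in the table above; its output therefore will not match the full-dataset result. The class, the column names, and the inline CSV text are all assumptions.

```ruby
# Compact stand-in for the regression class described above
# (class and method names are assumptions, not the original listing).
class SimpleLinearRegression
  def initialize(xs, ys)
    @xs = xs
    @ys = ys
  end

  def slope
    x_mean = mean(@xs)
    y_mean = mean(@ys)
    numerator = @xs.zip(@ys).reduce(0.0) { |sum, (x, y)| sum + (x - x_mean) * (y - y_mean) }
    denominator = @xs.reduce(0.0) { |sum, x| sum + (x - x_mean)**2 }
    numerator / denominator
  end

  def y_intercept
    mean(@ys) - slope * mean(@xs)
  end

  private

  def mean(values)
    values.reduce(0, :+).to_f / values.length
  end
end

# Only the two rows shown in the table; the header names are hypothetical.
SAMPLE_CSV = <<~CSV
  days_online,views
  1,5500
  2,45000
CSV

# Skip the header, split each line on commas, and convert the fields to floats.
rows = SAMPLE_CSV.lines.drop(1).map { |line| line.strip.split(',').map(&:to_f) }
days, views = rows.transpose

model = SimpleLinearRegression.new(days, views)
puts format('y = %.2fx + %.2f', model.slope, model.y_intercept)
```

With the full dataset, the same steps would apply after reading the downloaded file's contents in place of `SAMPLE_CSV`.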

Our Ruby-based regression says the best fit line for our sample dataset is:

\(y = 2463.53x + 25071.51\)

This Ruby-based solution corresponds to the regression line generated using the following R code

To help visualize the results of our regression, below is a graph of the dataset along with our regression line.

You can find all the code used in this blog post on GitHub.

Stay tuned for part 2 where we will look at confidence intervals, model error, and the predictive power of our simple linear regression model.

Ryan Weald is a Data Scientist at Sharethrough. You can follow him on Twitter @rweald.

Interested in working on hard data problems in a dynamic, collaborative environment? We’re hiring

18 Comments

  1. What did you use to plot the final plot? Is that an R plot or a ruby plot? I just implemented this (plus confidence intervals on alpha and beta) the other day, but haven’t gotten around to visualizing it from Ruby yet.

  2. @Sameer: it’s also important to note that GSL is GPL-licensed and they make it really clear that any application that uses GSL must be GPL-licensed as well. That rules it out for a number of potential users.

  3. I think it would be best to coerce the data type to Decimal. Computing averages like you do above is bound to run into rounding errors. For large data sets “sum” will eventually become much larger than “value”, and you lose precision.
    It’s neat to have “simple” implementations of mathematical formulas, but floating point math is tricky.

  4. I used R to generate the graph, ggplot2 to be specific. I wish there was a Ruby library that provided rich graphing capabilities but I have been disappointed with most of the ones I have found.

  5. @Sameer Thanks for your feedback. I looked at ruby-gsl but decided against it for this post. My goal was to illustrate the underlying math, and show the simplicity of the Ruby necessary to perform this basic statistical method. If I was going to be deploying a large scale production system I would definitely consider using a C based approach. The GPL license of GSL does make it somewhat problematic.

  6. Marco Falcioni thanks for your feedback. I thought about the risk with large values but decided that for this particular post I wanted to make the code as simple as possible to try and help people understand the math. In a production setting I would work with a big number library to ensure precision wouldn’t be lost. This code is meant more as a basic example than a fully featured library but perhaps I should have included a warning about the risk with large numbers and I appreciate you pointing this out.

  7. In “mean”, “total = values.reduce(0) { |sum, x| x + sum }” is the same as the much shorter “total = values.reduce(0, :+)”. Personally I’d just reduce the whole method body to “values.reduce(0, :+).to_f / values.size”.

  8. I dig. You might be interested in this presentation from a dude who just wrote a neat book on doing rad things with Ruby and R: http://www.slideshare.net/sausheong/rubyand-r

    It describes the pros and cons of three Ruby/R interfaces (RinRuby, RSRuby, and RServe) and illustrates with a text classification problem.

    Probably not a solution for very large data sets, but for the non-trivial stuff where you don’t want to have to dump out of Ruby but you also don’t want to hand-code, could come in handy.

  9. Why not just write this in a different language? I notice you refer to methods as functions in several cases. This seems like a square peg in a round hole. Classes are superfluous when you’re just using conditions and Enumerable methods, and even if you decided to ignore that aspect of Ruby (forced OOP), then you’d still need to contend with the dismal performance of Ruby’s numerics, lack of native threads, and so on.

  10. I’d like to see a post that shows an example of when Float is not appropriate, and alternatives that can be used. That would be a great follow-up blog post.

  11. Pingback: This Week in Ruby: Rails Rumble Dates, Active Admin 0.5, Protected Methods in Ruby 2.0

  12. Linear Regression using PHP:

    /**
     * Compute linear least squares regression line.
     *
     * @link http://en.wikipedia.org/wiki/Linear_least_squares#Computation
     * @access public
     * @static
     * @param array $y An array of y values.
     * @param array $x An array of x values or y-keys if not specified.
     * @return array(b,m) for the equation y = mx + b
     */
    function linest($y, $x = null) {
        $x = ($x === null) ? array_keys($y) : array_values($x);
        $y = array_values($y);
        $n = count($y);
        if ($n < 2) {
            return false;
        }

        $sum_x = 0;
        $sum_xx = 0;
        $sum_y = 0;
        $sum_xy = 0;
        for ($i = 0; $i < $n; $i++) {
            $sum_x += $x[$i];
            $sum_y += $y[$i];
            $sum_xx += $x[$i] * $x[$i];
            $sum_xy += $x[$i] * $y[$i];
        }
        $m = (($n * $sum_xy) - ($sum_y * $sum_x)) / (($n * $sum_xx) - ($sum_x * $sum_x));
        $b = ($sum_y - $m * $sum_x) / $n;

        return array($b, $m);
    } // END: function linest($y,$x=null)

  13. Minor correction: In the “residual sums of squares” equation, you refer to alpha and beta, but in the other equations these are referred to as beta-not and beta-one.

    Thanks for the approachable introduction to linear regression!