This Post Has 19 Comments

    1. rweald

      @Sameer Thanks for your feedback. I looked at ruby-gsl but decided against it for this post. My goal was to illustrate the underlying math and to show how little Ruby is needed to perform this basic statistical method. If I were deploying a large-scale production system, I would definitely consider a C-based approach, although the GPL license of GSL does make it somewhat problematic.

  1. Tony Arkles

    What did you use to plot the final plot? Is that an R plot or a ruby plot? I just implemented this (plus confidence intervals on alpha and beta) the other day, but haven’t gotten around to visualizing it from Ruby yet.

    1. rweald

      I used R to generate the graph, ggplot2 to be specific. I wish there were a Ruby library that provided rich graphing capabilities, but I have been disappointed with most of the ones I have found.

  2. Tony Arkles

    @Sameer: it’s also important to note that GSL is GPL-licensed and they make it really clear that any application that uses GSL must be GPL-licensed as well. That rules it out for a number of potential users.

  3. Marco Falcioni

    I think it would be best to coerce the data type to Decimal. Computing averages like you do above is bound to run into rounding errors. For large data sets “sum” will eventually become much larger than “value”, and you lose precision.
    It’s neat to have “simple” implementations of mathematical formulas, but floating point math is tricky.

    1. rweald

      @Marco Falcioni Thanks for your feedback. I thought about the risk with large values but decided that, for this particular post, I wanted to keep the code as simple as possible to help people understand the math. In a production setting I would work with a big number library to ensure precision isn’t lost. This code is meant more as a basic example than a fully featured library, but perhaps I should have included a warning about the risk with large numbers. I appreciate you pointing this out.
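
      To illustrate the precision concern raised in this thread, here is a minimal Ruby sketch. The data set and the method names `naive_sum` and `exact_sum` are made up for the illustration, and Ruby's built-in `Rational` stands in for the "big number library"; it is one possible choice, not necessarily what the post's author would use.

      ```ruby
      # Naive left-to-right Float accumulation, as in a simple mean calculation.
      def naive_sum(values)
        values.reduce(0.0) { |sum, value| sum + value }
      end

      # Exact accumulation: Float#to_r converts each Float to the exact
      # Rational it represents, so no rounding happens while adding.
      def exact_sum(values)
        values.reduce(0r) { |sum, value| sum + value.to_r }
      end

      # Hypothetical data: one large value followed by ten small ones.
      values = [1.0e16] + [1.0] * 10

      puts naive_sum(values).to_i  # prints 10000000000000000 (the ten 1.0s are lost)
      puts exact_sum(values).to_i  # prints 10000000000000010
      ```

      Once the running sum is far larger than each new value, every Float addition rounds the small value away, which is exactly the "sum much larger than value" situation described above.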

  4. Dave Guarino

    I dig. You might be interested in this presentation from a dude who just wrote a neat book on doing rad things with Ruby and R: http://www.slideshare.net/sausheong/rubyand-r

    It describes the pros and cons of three Ruby/R interfaces (RinRuby, RSRuby, and RServe) and illustrates with a text classification problem.

    Probably not a solution for very large data sets, but for the non-trivial stuff where you don’t want to have to dump out of Ruby but you also don’t want to hand-code, could come in handy.

  5. Non Plus

    Why not just write this in a different language? I notice you refer to methods as functions in several cases. This seems like a square peg in a round hole. Classes are superfluous when you’re just using conditions and Enumerable methods, and even if you decided to ignore that aspect of Ruby (forced OOP), then you’d still need to contend with the dismal performance of Ruby’s numerics, lack of native threads, and so on.

  6. Pingback: This Week in Ruby: Rails Rumble Dates, Active Admin 0.5, Protected Methods in Ruby 2.0

  7. Paul

    Linear Regression using PHP:

    /**
     * Compute linear least squares regression line.
     *
     * @link http://en.wikipedia.org/wiki/Linear_least_squares#Computation
     * @access public
     * @static
     * @param array $y An array of y values.
     * @param array $x An array of x values or y-keys if not specified.
     * @return array|false array(b,m) for the equation y = mx + b, or false if no line can be fitted
     */
    function linest($y, $x = null) {
        $x = ($x === null) ? array_keys($y) : array_values($x);
        $y = array_values($y);
        $n = count($y);
        if ($n < 2) {
            return false; // need at least two points
        }

        // Accumulate the sums used by the closed-form solution.
        $sum_x = 0;
        $sum_xx = 0;
        $sum_y = 0;
        $sum_xy = 0;
        for ($i = 0; $i < $n; $i++) {
            $sum_x += $x[$i];
            $sum_y += $y[$i];
            $sum_xx += $x[$i] * $x[$i];
            $sum_xy += $x[$i] * $y[$i];
        }

        $denominator = ($n * $sum_xx) - ($sum_x * $sum_x);
        if ($denominator == 0) {
            return false; // all x values are identical, so the slope is undefined
        }
        $m = (($n * $sum_xy) - ($sum_y * $sum_x)) / $denominator;
        $b = ($sum_y - $m * $sum_x) / $n;

        return array($b, $m);
    } // END: function linest($y,$x=null)
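
    For comparison with the Ruby in the post, the same closed-form computation might look like the sketch below. It is a hypothetical port, not code from the post: the method name `linest` is borrowed from the PHP above, a missing `xs` is assumed to mean the indices 0, 1, 2, ..., and `nil` is returned where the PHP returns `false`.

    ```ruby
    # Hypothetical Ruby counterpart of the PHP linest function above.
    # Returns [b, m] for the line y = m*x + b, or nil when no line can be fitted.
    def linest(ys, xs = nil)
      xs ||= (0...ys.length).to_a # assumption: default x values are the indices
      n = ys.length
      return nil if n < 2 || xs.length != n

      # Sums used by the closed-form least squares solution.
      sum_x  = xs.sum.to_f
      sum_y  = ys.sum.to_f
      sum_xx = xs.sum { |x| x * x }.to_f
      sum_xy = xs.zip(ys).sum { |x, y| x * y }.to_f

      denominator = n * sum_xx - sum_x * sum_x
      return nil if denominator.zero? # all x values identical, slope undefined

      m = (n * sum_xy - sum_y * sum_x) / denominator
      b = (sum_y - m * sum_x) / n
      [b, m]
    end

    b, m = linest([1, 3, 5, 7]) # x defaults to 0, 1, 2, 3
    puts "y = #{m}x + #{b}"     # prints "y = 2.0x + 1.0"
    ```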

  8. Sam Umbach

    Minor correction: In the “residual sums of squares” equation, you refer to alpha and beta, but in the other equations these are referred to as beta-naught and beta-one.

    Thanks for the approachable introduction to linear regression!
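
    For reference, with the beta-naught/beta-one naming, the residual sum of squares for a simple linear fit takes the standard form below. This is the textbook definition, given on the assumption that alpha corresponds to the intercept \beta_0 and beta to the slope \beta_1; it is not copied from the post.

    ```latex
    \mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl( y_i - (\beta_0 + \beta_1 x_i) \bigr)^2
    ```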
