covar_pop(), covar_samp(), corr(), regr_%()

Functions for linear regression analysis

This parent section and its two child sections describe the aggregate functions for linear regression analysis: covar_pop(), covar_samp(), corr(), and the regr_%() family.

Overview

See, for example, this Wikipedia article on Regression analysis. Briefly, linear regression analysis estimates the relationship between a dependent variable and an independent variable, aiming to find the line that most closely fits the data. This is why each of the functions described has two input formal parameters. The dependent variable, the first formal parameter, is conventionally designated by "y"; and the independent variable, the second formal parameter, is conventionally designated by "x".

See, for example, the article "How To Interpret R-squared in Regression Analysis". It says this:

Linear regression identifies the equation that produces the smallest difference between all of the observed values and their fitted values. To be precise, linear regression finds the smallest sum of squared residuals that is possible for the dataset.

In terms of the high school equation for a straight line:

y = m*x + c

the function regr_slope(y, x) estimates the gradient, "m", of the straight line that best fits the set of coordinate pairs over which the aggregation is done; and the function regr_intercept(y, x) estimates its intercept with the y-axis, "c". The so-called "R-squared" measure, implemented by regr_r2(y, x), indicates the goodness of fit. It measures the proportion of the variance in the dependent variable that the independent variable explains; in other words, the strength of the relationship between the fitted line and the dependent variable, on a scale from 0 to 1 (equivalently, 0% to 100%). For example, if regr_r2() returns 0.7, then seventy percent of the variation in the putative dependent variable is explained by the straight line whose gradient and intercept are returned, respectively, by regr_slope() and regr_intercept(). The remaining thirty percent can be attributed to stochastic variation.
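
In symbols, sticking with the straight-line form above and writing the aggregated pairs as (y_1, x_1) through (y_n, x_n), linear regression chooses "m" and "c" so as to minimize the residual sum of squares:

  RSS(m, c) = Σ ( y_i - (m*x_i + c) )²   summed for i = 1 .. n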

The purpose of each of these functions is rather specialized; and the domain is very familiar to people who need to do linear regression analysis. For this reason, the aim here is simply to explain enough for specialists to understand exactly what is available and how to invoke what they decide they need. Each function is illustrated with a simple example.

Each of these aggregate functions can be invoked using any of the same three syntax patterns:

  • either the simple syntax, select aggregate_fn(expr, expr) from t
  • or the GROUP BY syntax
  • or the OVER syntax

Only the simple invocation is illustrated. See, for example, the sections GROUP BY syntax and OVER syntax in the section avg(), count(), max(), min(), sum() for how to use these syntax patterns.
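
As a sketch only (the grouping column "g" here is hypothetical and is not part of the test table that is created below), the three patterns look like this for regr_r2():

-- Simple syntax.
select regr_r2(y, x) from t;

-- GROUP BY syntax ("g" is a hypothetical grouping column).
select g, regr_r2(y, x) from t group by g;

-- OVER syntax ("g" is a hypothetical grouping column).
select g, regr_r2(y, x) over (partition by g) from t;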

Signature:

Each one of the aggregate functions for linear regression analysis, except for regr_count(), has the same signature:

input value:       double precision, double precision

return value:      double precision

The exception, regr_count(), returns a count and therefore has a bigint return value, thus:

input value:       double precision, double precision

return value:      bigint

In all cases, the first input parameter represents the values that you want to be taken as the dependent variable (conventionally denoted by "y") and the second input parameter represents the values that you want to be taken as the independent variable (conventionally denoted by "x").
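
If you want to confirm these data types for yourself, a query along these lines (a sketch, run against the test table "t" that is created below) does the job; pg_typeof() reports double precision for regr_slope() and bigint for regr_count():

select
  pg_typeof(regr_slope(y, x)) as "regr_slope() returns",
  pg_typeof(regr_count(y, x)) as "regr_count() returns"
from t;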

About nullness

If, for a particular input row, either the expression for "y", or the expression for "x", evaluates to null, then that row is implicitly filtered out.
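
Here is a minimal self-contained sketch of that rule, using an ad hoc values list rather than the test table created below. count(*) sees all four rows; but regr_count(y, x) sees only the two rows where both inputs are non null, and so reports 2:

with v(y, x) as (
  values
    (1.0::double precision, 10.0::double precision),
    (2.0,                   20.0),
    (null,                  30.0),
    (3.0,                   null))
select
  count(*)         as "count(*)",
  regr_count(y, x) as "regr_count(y, x)"
from v;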

Create the test table

The same test table recipe serves for illustrating all of the functions for linear regression analysis. The design is straightforward. Noise is added to a pure linear function, thus:

  y = slope*x + intercept + delta

where "delta" is picked, for each "x" value from a pseudorandom normal distribution with specified mean and standard deviation.

The procedure "populate_t()" lets you try different values for "slope", "intercept", and for the size and variability of "delta". It uses the function normal_rand(), brought by the tablefunc extension.
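
If the tablefunc extension has not yet been created in the database that you're using, install it first. (This assumes that your role has the privilege to create extensions.)

create extension if not exists tablefunc;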

drop procedure if exists populate_t(
  int,  double precision,  double precision,  double precision,  double precision)
  cascade;
drop table if exists t cascade;

create table t(
  k      int primary key,
  x      double precision,
  y      double precision,
  delta  double precision);

create procedure populate_t(
  no_of_rows  in int,
  slope       in double precision,
  intercept   in double precision,
  mean        in double precision,
  stddev      in double precision)
  language plpgsql
as $body$
begin
  delete from t;

  with
    -- The noise-free values on the straight line.
    a1 as (
      select
        s.v                       as k,
        s.v                       as x,
        (s.v * slope) + intercept as y
      from generate_series(1, no_of_rows) as s(v)),

    -- The pseudorandom noise values.
    a2 as (
      select
        row_number() over() as k,
        r.v                 as delta
      from normal_rand(no_of_rows, mean, stddev) as r(v))

  insert into t(k, x, y, delta)
  select
    k, x, a1.y, a2.delta
  from a1 inner join a2 using(k);

  -- Insert two extra rows, each with one null input, to illustrate
  -- the rule described in "About nullness" above.
  insert into t(k, x, y, delta) values
    (no_of_rows + 1,    0, null, null),
    (no_of_rows + 2, null,    0, null);
end;
$body$;

\set no_of_rows 100
call populate_t(
  no_of_rows  => :no_of_rows,

  mean        =>  0.0,
  stddev      => 20.0,
  slope       =>  5.0,
  intercept   =>  3.0);

\pset null <null>
with a as(
  select k, x, y, delta from t where x between 1 and 5
  union all
  select k, x, y, delta from t where k between 96 and (:no_of_rows + 2))
select
  to_char(x, '990.9')           as x,
  to_char(y, '990.9')           as y,
  to_char((y + delta), '990.9999') as "y + delta"
from a
order by k;

Here is a typical result of invoking "populate_t()" with the values shown. Because normal_rand() is pseudorandom, your values in the "y + delta" column will differ. The blank line has been added manually to mark the elided rows between k = 5 and k = 96.

   x    |   y    | y + delta
--------+--------+-----------
    1.0 |    8.0 |   -5.9595
    2.0 |   13.0 |  -14.8400
    3.0 |   18.0 |   40.4009
    4.0 |   23.0 |   27.8537
    5.0 |   28.0 |   68.7411

   96.0 |  483.0 |  483.9196
   97.0 |  488.0 |  464.3205
   98.0 |  493.0 |  528.2446
   99.0 |  498.0 |  514.0421
  100.0 |  503.0 |  549.7692
    0.0 | <null> | <null>
 <null> |    0.0 | <null>
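
With "t" populated like this, a query along these lines shows how closely the estimates recover the chosen "slope" and "intercept", and how strong the fit is. (This is a sketch; your numbers will differ because "delta" is pseudorandom. Expect "m" to be close to 5.0, "c" to be in the rough vicinity of 3.0 because the intercept estimate is the noisiest of the three, and "r2" to be close to 1.0.)

select
  to_char(regr_slope((y + delta), x),     '990.99') as m,
  to_char(regr_intercept((y + delta), x), '990.99') as c,
  to_char(regr_r2((y + delta), x),        '0.9999') as r2
from t;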

The individual functions are described in the two child sections of this parent section.