Logo

PEARSON

Updated 2023-11-01 21:03:56.553000

Syntax

SELECT [westclintech].[wct].[PEARSON] (
 ,<@Known_y, float,>
 ,<@Known_x, float,>)

Description

Use the aggregate PEARSON function to calculate the correlation coefficient between two datasets. The equation for the correlation coefficient is

r_{xy}=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}}

Arguments

@Known_x

the x-values to be used in the PEARSON calculation. @Known_x is an expression of type float or of a type that can be implicitly converted to float.

@Known_y

the y-values to be used in the PEARSON calculation. @Known_y is an expression of type float or of a type that can be implicitly converted to float.

Return Type

float

Remarks

PEARSON is an AGGREGATE function and follows the same conventions as all other AGGREGATE functions in SQL Server.

Examples

In this example, we calculate the Pearson coefficient as correlation for a single set of x- and y-values

SELECT wct.PEARSON(y, x) as PEARSON
  FROM (   SELECT 0.75,
                  1
           UNION ALL
           SELECT 2.5,
                  2
           UNION ALL
           SELECT 6.75,
                  3
           UNION ALL
           SELECT 10,
                  4) n(x, y);

This produces the following result

{"columns":[{"field":"PEARSON","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"}],"rows":[{"PEARSON":"0.988719187867937"}]}

In this example, we will populate some temporary table with some historical financial information and then calculate the Pearson coefficient of correlation. First, create the table and put some data in it:

CREATE TABLE #c (
    SYM NVARCHAR(5),
    YE BIGINT,
    REV FLOAT,
    GPROF FLOAT,
    OPINC FLOAT,
    NETINC FLOAT);

INSERT INTO #c
VALUES ('YHOO', 2009, 6460.32, 3588.57, 386.69, 597.99);
INSERT INTO #c
VALUES ('YHOO', 2008, 72.5, 4185.14, 12.96, 418.92);
INSERT INTO #c
VALUES ('YHOO', 2007, 6969.27, 4130.52, 695.41, 639.16);
INSERT INTO #c
VALUES ('YHOO', 2006, 6425.68, 3749.96, 940.97, 751.39);
INSERT INTO #c
VALUES ('YHOO', 2005, 5257.67, 3161.47, 1107.73, 1896.23);
INSERT INTO #c
VALUES ('GOOG', 2009, 23650.56, 14806.45, 8312.19, 6520.45);
INSERT INTO #c
VALUES ('GOOG', 2008, 21795.55, 13174.04, 5537.21, 4226.86);
INSERT INTO #c
VALUES ('GOOG', 2007, 16593.99, 9944.9, 54.44, 4203.72);
INSERT INTO #c
VALUES ('GOOG', 2006, 10604.92, 6379.89, 3550, 3077.45);
INSERT INTO #c
VALUES ('GOOG', 2005, 6138.56, 3561.47, 2017.28, 1465.4);
INSERT INTO #c
VALUES ('MSFT', 2010, 62484, 509, 24167, 18760);
INSERT INTO #c
VALUES ('MSFT', 2009, 58437, 46282, 21225, 14569);
INSERT INTO #c
VALUES ('MSFT', 2008, 60420, 48822, 22271, 17681);
INSERT INTO #c
VALUES ('MSFT', 2007, 51122, 40429, 18438, 14065);
INSERT INTO #c
VALUES ('MSFT', 2006, 44282, 36632, 16064, 12599);
INSERT INTO #c
VALUES ('ORCL', 2010, 26820, 21056, 9062, 6135);
INSERT INTO #c
VALUES ('ORCL', 2009, 23252, 18458, 8321, 5593);
INSERT INTO #c
VALUES ('ORCL', 2008, 22430, 17449, 7844, 5521);
INSERT INTO #c
VALUES ('ORCL', 2007, 17996, 13805, 5974, 4274);
INSERT INTO #c
VALUES ('ORCL', 2006, 14380, 11145, 4736, 3381);
INSERT INTO #c
VALUES ('SAP', 2009, 10672, 6980, 2588, 1748);
INSERT INTO #c
VALUES ('SAP', 2008, 11575, 7370, 2701, 1847);
INSERT INTO #c
VALUES ('SAP', 2007, 10256, 6631, 2698, 1906);
INSERT INTO #c
VALUES ('SAP', 2006, 9393, 6064, 2578, 1871);
INSERT INTO #c
VALUES ('SAP', 2005, 8509, 5460, 2337, 1496);

Now, calculate the correlation of the revenue (REV) against the year (YE) for each company (SYM)

SELECT #c.SYM,
       wct.PEARSON(REV, YE) as PEARSON
  FROM #c
 GROUP BY SYM;

This produces the following result.

{"columns":[{"field":"SYM"},{"field":"PEARSON","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"}],"rows":[{"SYM":"GOOG","PEARSON":"0.988604792733014"},{"SYM":"MSFT","PEARSON":"0.91861026921264"},{"SYM":"ORCL","PEARSON":"0.983795721235544"},{"SYM":"SAP","PEARSON":"0.873067973316442"},{"SYM":"YHOO","PEARSON":"-0.219384585146269"}]}

In this example, we will calculate the Pearson coefficient of correlation of the operating income (OPINC) against the revenue (REV)

SELECT #c.SYM,
       wct.PEARSON(OPINC, REV) as PEARSON
  FROM #c
 GROUP BY SYM;

This produces the following result.

{"columns":[{"field":"SYM"},{"field":"PEARSON","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"}],"rows":[{"SYM":"GOOG","PEARSON":"0.651906713868849"},{"SYM":"MSFT","PEARSON":".987612258172035"},{"SYM":"ORCL","PEARSON":".9924157389967"},{"SYM":"SAP","PEARSON":"0.844595495520328"},{"SYM":"YHOO","PEARSON":"0.677389856742323"}]}

Let’s say we wanted to perform the same analysis as in Example #1, but we only want to return the results where the Pearson coefficient of correlation is positive.

SELECT #c.SYM,
       wct.PEARSON(REV, YE) as PEARSON
  FROM #c
 GROUP BY SYM
HAVING wct.PEARSON(REV, YE) > 0;

This produces the following result.

{"columns":[{"field":"SYM"},{"field":"PEARSON","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"}],"rows":[{"SYM":"GOOG","PEARSON":"0.988604792733014"},{"SYM":"MSFT","PEARSON":"0.91861026921264"},{"SYM":"ORCL","PEARSON":"0.983795721235544"},{"SYM":"SAP","PEARSON":"0.873067973316442"}]}