Logo

VIF

Updated 2023-11-02 13:41:11.003000

Syntax

SELECT * FROM [westclintech].[wct].[VIF](
   <@Matrix_RangeQuery, nvarchar(max),>)

Description

Use the table-valued function VIF to calculate the R2, tolerance and the variance inflation factor for a set of independent variables.

VIF performs a linear regression for each independent variable against all the other independent variables. For each of these regression analyses it calculates R2 as:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}

where SSres is the residual sum of squares and SStot is the total sum of squares.

The tolerance is derived directly from R2.

\text{VIF} = \frac{1}{\text{Tolerance}}

The variance inflation factor is calculated from the tolerance.

Arguments

@Matrix_RangeQuery

the SELECT statement, as a string, which, when executed, creates the resultant table of x-values used in the calculation.

Return Type

table

{"columns": [{"field": "colName", "headerName": "Name", "header": "name"}, {"field": "colDatatype", "headerName": "Type", "header": "type"}, {"field": "colDesc", "headerName": "Description", "header": "description", "minWidth": 1000}], "rows": [{"id": "ed51afc6-645a-453e-ac16-9014f253072e", "colName": "ColNum", "colDatatype": "int", "colDesc": "The column number from the resultant table returned by @Matrix_RangeQuery."}, {"id": "6af10133-c43d-4776-a049-feddd149cddb", "colName": "Rsquared", "colDatatype": "float", "colDesc": "The R-squared value."}, {"id": "ef69de05-e8cc-4b5b-b3d2-390b7cded241", "colName": "VIF", "colDatatype": "float", "colDesc": "The variance inflation factor."}, {"id": "21255d7a-d5dd-4a8c-9fdb-02707baca80e", "colName": "Tolerance", "colDatatype": "float", "colDesc": "The tolerance."}]}

Remarks

@Matrix_RangeQuery must return at least 2 columns or an error will be returned.

If Rsquared = 1 then Tolerance is zero and VIF is NULL.

If there is only one column of data, no rows will be returned.

Examples

In this example we evaluate 3 independent variables: age, height and weight.

SELECT *

INTO #t

FROM

(

    VALUES

        (22, 76, 197.4),

        (27, 68.9, 189.3),

        (28, 67.7, 215.1),

        (20, 72.3, 173.6),

        (29, 71.2, 171),

        (23, 70.3, 184.6),

        (20, 73.4, 178.7),

        (23, 64.6, 217),

        (23, 68.2, 204),

        (24, 78.9, 249.9),

        (25, 76.3, 209.3),

        (22, 78.8, 209.7),

        (26, 61.4, 137.3),

        (24, 73.4, 186.2),

        (28, 76.5, 220.4),

        (20, 75.9, 171.8),

        (25, 77.1, 239.3),

        (25, 67.3, 169.9),

        (28, 71.8, 184.3),

        (21, 77.6, 180.3)

) n (Age, Height, Weight);

SELECT *

FROM wct.VIF('SELECT * FROM #t');

This produces the following result.

{"columns":[{"field":"ColNum","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"Rsquared","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"VIF","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"Tolerance","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"}],"rows":[{"ColNum":"1","Rsquared":"0.166469615505746","VIF":"1.1997163134092","Tolerance":"0.833530384494254"},{"ColNum":"2","Rsquared":"0.349782414258577","VIF":"1.53794671495963","Tolerance":"0.650217585741423"},{"ColNum":"3","Rsquared":"0.296159110711204","VIF":"1.42077565429661","Tolerance":"0.703840889288796"}]}

While we put our data into a temporary table (#t) and executed the table-valued function by referencing #t, we could have just as easily put the SQL inside the call to the function as the following SQL demonstrates.

SELECT *

FROM wct.VIF('

   SELECT

       *

   FROM (VALUES

        (22,76,197.4)

       ,(27,68.9,189.3)

       ,(28,67.7,215.1)

       ,(20,72.3,173.6)

       ,(29,71.2,171)

       ,(23,70.3,184.6)

       ,(20,73.4,178.7)

       ,(23,64.6,217)

       ,(23,68.2,204)

       ,(24,78.9,249.9)

       ,(25,76.3,209.3)

       ,(22,78.8,209.7)

       ,(26,61.4,137.3)

       ,(24,73.4,186.2)

       ,(28,76.5,220.4)

       ,(20,75.9,171.8)

       ,(25,77.1,239.3)

       ,(25,67.3,169.9)

       ,(28,71.8,184.3)

       ,(21,77.6,180.3)

       )n(Age,Height,Weight)');

Since the table-valued function uses dynamic SQL, however, it will not produce the desired result if it references a derived table or a common table expression outside the scope of the dynamic SQL. For example, the following SQL will not work.

WITH mycte

AS (SELECT *

    FROM

    (

        VALUES

            (22, 76, 197.4),

            (27, 68.9, 189.3),

            (28, 67.7, 215.1),

            (20, 72.3, 173.6),

            (29, 71.2, 171),

            (23, 70.3, 184.6),

            (20, 73.4, 178.7),

            (23, 64.6, 217),

            (23, 68.2, 204),

            (24, 78.9, 249.9),

            (25, 76.3, 209.3),

            (22, 78.8, 209.7),

            (26, 61.4, 137.3),

            (24, 73.4, 186.2),

            (28, 76.5, 220.4),

            (20, 75.9, 171.8),

            (25, 77.1, 239.3),

            (25, 67.3, 169.9),

            (28, 71.8, 184.3),

            (21, 77.6, 180.3)

    ) n (Age, Height, Weight) )

SELECT *

FROM wct.VIF('SELECT * FROM mycte');

This produces the following message.

Exception:[Product version 1.13 Build: 0219.573]An error occurred in Function Package while attempting to access database data:Invalid object name 'mycte'. (msg:208 state:1 ln:0)

In this example we generate a random variable (x1) from the normal distribution with a mean of 0 and standard deviation of 1, another random variable (x2) from the normal distribution a mean of 2 and a standard deviation of 4. The x3 variable is calculated as x2 plus a uniform random variable between 0 and 1.

SELECT x1,

       x2,

       x2 + wct.RAND() as x3

INTO #t1

FROM

(

    SELECT X as X1,

           wct.RANDNORM(2, 4) as X2

    FROM wct.RANDSNORMAL(1000)

) n;

SELECT *

FROM wct.VIF('SELECT x1,x2,x3 FROM #t1');

This produces the following result (your result will be different).

{"columns":[{"field":"ColNum","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"Rsquared","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"VIF","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"Tolerance","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"}],"rows":[{"ColNum":"1","Rsquared":"0.000933473047366573","VIF":"1.00093434523346","Tolerance":"0.999066526952633"},{"ColNum":"2","Rsquared":"0.994621815206927","VIF":"185.936340693973","Tolerance":"0.00537818479307317"},{"ColNum":"3","Rsquared":"0.994621666804285","VIF":"185.931210211501","Tolerance":"0.00537833319571512"}]}

As expected the R2 between x2 and x3 is very close to and the VIF for the 2 columns is very high.

In this example, we generate 3 independent random variables and a fourth variable which is calculated as x1 + 2*x2 - 3*x3. In other words, co-linearity arises not between any two columns but among all the columns.

SELECT x1,

       x2,

       x3,

       x1 + 2 * x2 - 3 * x3 as x4

INTO #t2

FROM

(

    SELECT X as X1,

           wct.RANDNORM(2, 4) as X2,

           wct.RANDNORM(5, 10) as X3

    FROM wct.RANDSNORMAL(1000)

) n;

SELECT *

FROM wct.VIF('SELECT x1,x2,x3,x4 FROM #t2');

This produces the following result (your result will be different).

{"columns":[{"field":"ColNum","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"Rsquared","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"VIF"},{"field":"Tolerance","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"}],"rows":[{"ColNum":"1","Rsquared":"1","VIF":"NULL","Tolerance":"0"},{"ColNum":"2","Rsquared":"1","VIF":"NULL","Tolerance":"0"},{"ColNum":"3","Rsquared":"1","VIF":"NULL","Tolerance":"0"},{"ColNum":"4","Rsquared":"1","VIF":"NULL","Tolerance":"0"}]}

Eliminating any of the columns will eliminate the co-linearity.

SELECT *

FROM wct.VIF('SELECT x1,x2,x4 FROM #t2');

This produces the following result (your result will be different).

{"columns":[{"field":"ColNum","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"Rsquared","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"VIF","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"},{"field":"Tolerance","headerClass":"ag-right-aligned-header","cellClass":"ag-right-aligned-cell"}],"rows":[{"ColNum":"1","Rsquared":"8.47196531994499E-05","VIF":"1.00008472683123","Tolerance":"0.999915280346801"},{"ColNum":"2","Rsquared":"0.0746046878401549","VIF":"1.08061926277326","Tolerance":"0.925395312159845"},{"ColNum":"3","Rsquared":"0.0746307067340327","VIF":"1.08064964687842","Tolerance":"0.925369293265967"}]}

See Also

LINEST - the Ordinary Least Squares (OLS) solution for a series of x-values and y-values

LOGEST - Logarithmic regression

LOGIT - Logit regression

LOGITPRED - Calculate predicted values based on a logit regression

LOGITPROB - Calculate the probability that Y = 1 given a set of coefficients from a logistic regression

LOGITSUM - Logit regression using summary data