Predictive Analytics is the wave of the future and has become vital in today’s data evolution.
Predictive analytics incorporates a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future.
I’ll describe the linear regression approach and how to write a T-SQL function to calculate the regression and produce the Intercept, Slope and R2 which are used in a regression equation to predict a value. In simple linear regression, the topic of this post, the predictions of Y when plotted as a function of X form a straight line.
The following table lists a sample data of car drivers driving experiences (in years) and monthly auto insurance premiums.
Driving Experience (X value) |
Insurance premium (Y values) |
7 |
75 |
10 |
68 |
12 |
65 |
18 |
60 |
20 |
57 |
25 |
50 |
Table1
Here is the T-SQL code will take the X and Y values from table 1 and calculate the regression coefficients:
declare @n int,
@Intercept DECIMAL(38, 10),
@Slope DECIMAL(38, 10),
@R2 DECIMAL(38, 10)
Select @n=count(*) from Table1
Select
@Slope = ((@n * sum(x*y)) – (sum(x)*sum(y)))/ ((@n * sum(Power(x,2)))-Power(Sum(x),2)),
@Intercept = avg(y) – ((@n * sum(x*y)) – (sum(x)*sum(y)))/((@n * sum(Power(x,2)))-Power(Sum(x),2)) * avg(x)
From Table1
Select @R2 = (@Intercept * SUM(Y) + @Slope * SUM(x*y)-SUM(Y)*SUM(y)/@n) / (SUM(y*y) – SUM(Y)* SUM(Y) / @n)
Select @Slope as Slope, @Intercept as Intercept, @R2 AS R2
From Table1
So to predict what will be the insurance premium if the driving experience years = 15
Y = Intercept + 15 * Slope (where y is the insurance premium)
For an in depth article about the linear regression, please refer to the Wikipedia Entry,