The fundamental process in the machine learning development life cycle is identifying
the dependent and independent variables for developing a data model. The two basic
categories of supervised machine learning are classification and regression.
Regression is arguably the most basic form of machine learning algorithms and
suitable for beginners in machine learning. Though it is one of the most basic algorithms
for machine learning, it is also one of the modes widely applied machine learning
algorithms, with multiple variations of this algorithm. In this tip we will learn
how to develop a machine learning model using a linear regression algorithm.
It is recommended that if you are new to Python or Machine Learning Services
in SQL Server 2017, consider reading the
Python and SQL Server 2017 Basics tutorial. It is also assumed that you have
SQL Server 2017, Python and Machine Learning Services installed on your development
The first step for any data science exercise is analyzing sample data and determining
fields of interest. So, we need to consider a sample dataset that we will use for
this exercise, from which we will predict the output of dependent variable using
the independent variable. Data Quality is a crucial factor for data science, for
which ETL, MDM and other mechanisms are applied to standardize and cleanse data.
As our focus in on learning the machine learning model, so we would consider it
as granted that we have access to processed quality data.
The next step in machine learning is exploratory data analysis to identify the
independent and dependent variables of interest. Since this is more domain driven,
we will take it as granted that independent and dependent variables have already
been identified to maintain focus on developing machine learning model.
As we have considered the preliminary tasks in machine learning lifecycle as
granted, we will be using the dataset as shown below, having just two fields. Generally,
after applying techniques like exploratory data analysis, principal component analysis,
dimensionality reduction etc., the dependent and independent fields of interest
are identified. In the sample dataset of Celsius and Fahrenheit that we are using,
consider C as the independent field and F as the dependent or the response field.
Our intention is to find the relationship between the two variables from the below
In real-life scenarios, statistical models would almost never have a perfect
linear relationship unless and until it’s a deterministic model. One short-cut method
of determining whether linear regression can be applied to the data in question
is by creating a scatterplot of dependent versus independent variables and creating
a trend line to analyze whether there is a linear or near-linear relationship.
If we create a scatterplot of the data shown above, it would look as shown in
the picture below, which is a strong indicator that linear regression algorithms
can be applied on this data.
The linear regression algorithm finds the shortest line (also known as best-fit
line) that passes through all the data points in a way that all the data points
are at a minimum distance from the line. Consider reading more about the statistical
formula for this algorithm from
We intend to discover the relationship between Celsius and Fahrenheit using linear
regression algorithm, and we would implement the same using Python and T-SQL, with
the assumption that this data is hosted in SQL Server. Create a table named Readings
which has the readings of Celsius and Fahrenheit as shown below before proceeding
the actual algorithm implementation.
We will be implementing the T-SQL code for the linear regression algorithm with
the approach mentioned below.
- Fahrenheit is the dependent variable and Celsius is the independent variable.
- The formula of ordinary least squares linear regression algorithm is Y (also
known as Y-hat) = a + bX, where a is the y-intercept and b is the slope. By
applying the algorithm, we will derive the coefficients “a” and “b”.
- In our case, Y-hat is Fahrenheit, X is Celsius, “a” is the Y-intercept
and “b” is the slope as shown below. We already know what is X and
Y. The result of our model will deliver the values of a and b.
- Using the revoscalepy library we are using functions rx_lin_mod to create
linear model and rx_predict to predict Fahrenheit from the linear model.
- Finally, we will print the original input and predicted output in the output
Execute the below code to create a linear model, train the data model using the
above dataset and finally predict the output for a given input value.
EXEC sp_execute_external_script @language = N'Python' , @script = N' from revoscalepy import rx_lin_mod, rx_predict linearmodel = rx_lin_mod(formula = "Fahrenheit ~ Celsius", data = InputDataSet); # Formula specifies Fahrenheit is dependent variable and Celsius in independent variable predict = rx_predict(linearmodel, data = InputDataSet[["Celsius"]]) # Predict the data for the input Celcius values predict.insert(loc=0, column="Celsius", value=InputDataSet[["Celsius"]]) # Add the original Celcius field to the predicted Fahrenheit field OutputDataSet = predict # Assigned the dataset to output data frame ' , @input_data_1 = N'SELECT * FROM Readings'
Once you execute the above code, the output would look as shown below. From a
model life-cycle perspective, we may want to store, extract and use the model at
required. For the same we can store the model in a varbinary variable after serializing
the model, and deserialize it while extraction. You can learn about this from an
example explained in this
We successfully created the model and predicted the output too. But the key value
of creating the model is to find the co-efficient values, which gives insights to
data analysts or data scientists regarding the influence of independent field on
the dependent field. To find out the coefficients “a” and “b”,
just print the summary of your linear model, and you should be able to find result
as shown below.
Some points that can be derived from the above result are as follows:
- R-squared is the accuracy of the model with which it was able to explain
the variation in the dataset. Here “1” mean 100% variation is explained,
so we can be confident that the coefficients are accurate. It is almost never
100% in real-life cases of statistical models.
- There are other statistics as well like skew, kurtosis, p-value, etc. These
statistics explain different properties of the model related to accuracy and
confidence in the model.
- Using the derived co-efficient and linear regressions equation, we can easily
calculate or predict values. For example, in this case the intercept here is
32 and slope is 1.8. Given that we intend to predict the value of Fahrenheit
for 1 degree Celsius, the formula of linear regression method to predict the
same would be as mentioned below.
Fahrenheit = (Y-intercept) + (slope * Celsius) i.e. Fahrenheit = 32 + (1.8 * 1) = 33.8.
This means 1 degree Celsius would be 33.8-degree Fahrenheit. Using this equation
with the derived coefficients, we can predict the value of Fahrenheit for any
given value of Celsius. We have developed a model using Linear Regression algorithm
and trained our model by feeding it a dataset, from which it learned and derived
inferences to predict intended results. This is what we term as machine learning
in its simplest form.
- Consider learning thoroughly about the algorithm of your choice and learn
more details of how the implement the same in Python using SQL Server Machine
Last Update: 2018-08-08
About the author
Siddharth Mehta is an Associate Manager with Accenture in the Avanade Division focusing on Business Intelligence.
View all my tips