Using Python with Microsoft Reporting Services (SSRS)
This article is originally published at https://tomaztsql.wordpress.com
Using Python with SQL Server 2017 SSRS should not be an obstacle, since the framework and the technology is the same as with R language. With SQL Server 2017, Python got a full and functional support for native SSRS.
The usability and functionality of Python is simply immense. This blog post will be dedicated on using Python for data science in SSRS
Most of the samples will be derived from the Adventure-Works, both SQL Server sample datasets are available at Microfost Github.
Selecting Python results with SSRS
External procedure sp_execute_external_script within SQL Server database using Python language does not need to have the ODBC drivers or python modules like pyodbc or sqlachemy for extracting or writing data between Sql server engine and python engine, the only module needed is Python pandas, since the communication between sql server requires data frames for exposing data to python (from sql server), respectively.
-- Check Python runtime EXECUTE sp_execute_external_script @language =N'Python' ,@script=N'OutputDataSet = InputDataSet' ,@input_data_1 = N'SELECT 1 AS result' WITH RESULT SETS (( result INT )) GO
Please note: Python code, as specified with @script parameter, must be validated and is sensitive to indentation. So single missing (or excessive) indent might cause the complete query to fail.
We will introduce the input parameters for better python data selection. At the same time, this parameter will be used as input parameter in SSRS. T-SQL query with python script:
-- query parametrization DECLARE @MaritalStatusIN CHAR(1) = 'S' -- S/M Single/Married EXECUTE sp_execute_external_script @language =N'Python', @script = N' import pandas as pd df = InputDataSet df_gender = df[''MaritalStatus''] == MaritalStatus df_gen = df[df_gender] correlation = df_gen.corr(method=''pearson'') OutputDataSet = pd.DataFrame(correlation, columns=["nof","age"])', @input_data_1 = N'SELECT COUNT(*) AS nof,MaritalStatus,age FROM AdventureWorksDW2014.dbo.vTargetMail WHERE age < 100 GROUP BY maritalstatus,age' ,@params = N'@MaritalStatus CHAR(1)' ,@MaritalStatus = @MaritalStatusIN WITH RESULT SETS (( CountObs FLOAT ,Age FLOAT )); GO
For the purposes of SSRS, I will encapsulate the upper code into stored procedure with one input parameter – that is Marital status. This same parameter will be propagated into python script.
EXEC SelectingPythonResults @MaritalStatusIN = 'S'
With results presented through SSRS with option of selecting the values.
Predicting and scoring with Python in SSRS
In addition to selecting filtered parameters in SSRS, we can also check how our prediction model is working.
For this manner, we will create a slightly different subset and we will try to predict the number of cars owned by the customers in the dataset, based on the age and the daily commute distance. Query for the dataset will the following:
--- Query for scoring the model SELECT age ,NumberCarsOwned ,CAST(REPLACE(LEFT(CommuteDistance,2),'-','') AS TINYINT) as CommuteDistance FROM AdventureWorksDW2014.dbo.vTargetMail WHERE age < 100
We will run a simple linear regression with dependent (y) variable = NumberCarsOwned and age + CommuteDistance both as X variables. We will store couple of different versions of prediction model in database, trained on different size of training subsets; 20%, 30%, 40% and 50% of the original dataset for training purposes. All four different models will be serialized and stored in a table.
Let’s create a table to store serialized models:
-- creating table for storing models CREATE TABLE PredictingWithPy_models (model VARBINARY(MAX) ,modelName VARCHAR(100) ,trainSize FLOAT )
Now, we will use Scikit-learn python module to train the model based on different subsets of original dataset.
CREATE OR ALTER PROCEDURE [dbo].[RunningPredictionsPy] ( @size FLOAT --- format: 0.3 or 0.4 or 0.5 ,@name VARCHAR(100) ,@trained_model varbinary(max) OUTPUT ) AS BEGIN EXEC sp_execute_external_script @language = N'Python' ,@script = N' import numpy as np import pandas as pd import pickle from sklearn.linear_model import LinearRegression from sklearn.model_selection import train_test_split ##Create SciKit-Learn linear regression model X = df2[["age", "CommuteDistance"]] y = np.ravel(df2[["nofCars"]]) name = name ##Create training (and testing) variables based on test_size X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=size) ## CreateLinear Model SKL_lr = LinearRegression() linRegObj = SKL_lr.fit(X_train, y_train) ##Serialize model trained_model = pickle.dumps(linRegObj)' ,@input_data_1 = N'SELECT * FROM vTargetMail' ,@input_data_1_name = N'df2' ,@params = N'@trained_model varbinary(max) OUTPUT, @size FLOAT, @name VARCHAR(100)' ,@trained_model = @trained_model OUTPUT ,@size = @size ,@name = @name END; GO
And executing this procedure to store trained models in previously created table:
DECLARE @model VARBINARY(MAX); EXEC [RunningPredictionsPy] @size = 0.2 ,@name = 'Ln_20' ,@trained_model = @model OUTPUT INSERT INTO PredictingWithPy_models (model, modelName, trainSize) VALUES(@model, 'Ln_20', 0.2); GO
We repeat, so that at the end, we end up with percentage for 20, 30, 40 and 50; complete list is available at GitHub. At the end, the table with stored models should hold 4 lines with serialized models.
Now that we have trained the models, we want to predict the results, using SSRS.
For this manner, we will need another stored procedure, that will have X variables (independent variables: age and Commutedistance) as input parameters, besides selected model and have the prediction returned (Y variable – Number of cars owned).
Following procedure will be used in SSRS:
-- CREATE Procedure to predict the number of cars owned CREATE OR ALTER PROCEDURE [dbo].[RunningPredictionWithValesPy] ( @model varchar(100) ,@age INT ,@commuteDistance INT ) AS BEGIN DECLARE @modelIN VARBINARY(MAX) = (SELECT model FROM PredictingWithPy_models WHERE modelName = @model) -- Create a T-SQL Query DECLARE @q NVARCHAR(MAX) = N'SELECT '+CAST(@age AS VARCHAR(5))+' AS age, '+ CAST(@commuteDistance AS VARCHAR(5))+' AS CommuteDistance' -- Store the new values for prediction in temp table DROP TABLE IF EXISTS #t CREATE TABLE #t (age INT, CommuteDistance INT) INSERT INTO #t (age, CommuteDistance) EXEC sp_executesql @q EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pickle import numpy as np import pandas as pd from sklearn import metrics ##Deserialize model mod = pickle.loads(modelIN) X = InputDataSet[["age", "CommuteDistance"]] ##Create numpy Array when you introducte more values at the same time (bulk prediction) predArray = mod.predict(X) #predList = [] #for i in range(len(predArray)): # predList.append((predArray[i])[1]) OutputDataSet = pd.DataFrame(data = predArray, columns = ["predictions"])' ,@input_data_1 = N'SELECT * FROM #t' ,@input_data_1_name = N'InputDataSet' ,@params = N'@modelIN varbinary(max)' ,@modelIN = @modelIN WITH RESULT SETS (( prediction_Score FLOAT )); END GO
Checking for the procedure to work fine, the following T-SQL query will run the predictions against new dataset:
EXEC [RunningPredictionWithValesPy] @model = 'Ln_30' ,@age = 44 ,@commuteDistance = 1
We can integrate the procedure into reporting services (SSRS) and have the combination of the input values displayed with the selected model (size of test set):
with the results of predictions in SSRS or in SSMS; both return same results:
All reports featured here will also be available at GitHub.
Visualizing results with Python in SSRS
As we have created four different models, we would also like to have the accuary of the model visually represented using SSRS.
Showing plots created with Python might not be as straight forward, as with R Language.
Following procedure will extract the data from database and generate plot, that can be used and visualized in SSRS.
CREATE OR ALTER PROCEDURE [dbo].[VisualizeWithPyR2] ( @inputVariable VARCHAR(100) ) AS BEGIN DECLARE @q NVARCHAR(MAX) = N'SELECT '+CAST(@inputVariable AS VARCHAR(50))+' AS val1 FROM vTargetMail' -- Store the values in temp table DROP TABLE IF EXISTS #t CREATE TABLE #t (val1 FLOAT) INSERT INTO #t (val1) EXEC sp_executesql @q EXEC sp_execute_external_script @language = N'Python' ,@script = N' import numpy as np import pandas as pd import matplotlib import matplotlib.pyplot as plt fig = plt.figure(figsize=(12, 5)) plt.plot(plotdataset) fig.savefig(''C:\\\PyGraphs\\firstGraph.png'') OutputDataSet = pd.DataFrame(data =[1], columns = ["plot"])' ,@input_data_1 = N'SELECT * FROM #t' ,@input_data_1_name = N'plotdataset' WITH RESULT SETS (( plot INT )); END GO
So far, I have found this as to be the simplest way – saving the plot on same server, where SSRS is running and exposing the disk path to SSRS. Python stores the graph to disk on the host / client / server with following command:
fig = plt.figure(figsize=(12, 5)) plt.plot(plotdataset) fig.savefig(''C:\\\PyGraphs\\firstGraph.png'')
Resulting in SSRS as:
In Report builder, one additional setting worth mentioning is the image properties, where the path to the saved plot should be the same as the one presented in Python code. Savefig function in python is pointing to same location as the external source for this image setting, with function as: file://C://PyGraphs//firstGraph.png.
Evaluating all four linear regression models, based on the size of the test sample size, can be calculated using R2 and accompanying python code:
from sklearn.metrics import r2_score def R2EvalMod(test_sizeIN): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_sizeIN) SKL_lr = LinearRegression() linRegObj = SKL_lr.fit(X_train, y_train) predArray = linRegObj.predict(X_test) R2 = r2_score(y_test, predArray) return R2 sql_conn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=TOMAZK\\MSSQLSERVER2017;DATABASE=SQLPY;Trusted_Connection=yes') query3 = ''' SELECT trainSize FROM [dbo].[PredictingWithPy_models] ''' df3 = pd.read_sql(query3, sql_conn) R2List = [] for i in range(len(df3)): z = df3.at[i, 'trainSize'] R2=R2EvalMod(z) R2List.append(([z],[R2])) labels = ['trainsize', 'R2'] df33 = pd.DataFrame.from_records(R2List, columns=labels)
With the fact that, better the model the higher R2 is – between 0 and 1 (also negative -1 to 0), showing that the model R2 is best when using 30% of dataset for testing and 70% for training.
As always, code is available at GitHub.
Happy Pythoning.
Thanks for visiting r-craft.org
This article is originally published at https://tomaztsql.wordpress.com
Please visit source website for post related comments.