Python / R News

Python Pandas MultiIndex and reading data from SQL Server

by tomaztsql · June 26, 2019

This article is originally published at https://tomaztsql.wordpress.com

Python Pandas multiIndex is a hierarchical indexing over multiple tuples or arrays of data, enabling advanced dataframe wrangling and analysis on higher data dimensionality.

Everyone have had come across multiIndex in Python Pandas and had little annoyancens as the first time.

1. Python Pandas MultiIndex in SQL Server

With this in mind, we can create an example of pandas dataframe:

import pandas as pd

dt = pd.DataFrame([
                    ['2019-05-12','python',6],
                    ['2019-05-13','python',5],
                    ['2019-05-14','python',10],
                    ['2019-05-12','t-sql',12],
                    ['2019-05-13','t-sql',12],
                    ['2019-05-14','t-sql',12]
                  ],
                  columns = ['date','language','version'])

And the results are:

MultiIndex will give further data wrangling the capability of higher dimensionality wrangling. By simple:

EXEC sp_execute_external_script
	 @language = N'Python'
	,@script = N'
import pandas as pd

dt = pd.DataFrame([
                    [''2019-05-12'',''python'',6],
                    [''2019-05-13'',''python'',5],
                    [''2019-05-14'',''python'',10],
                    [''2019-05-12'',''t-sql'',12],
                    [''2019-05-13'',''t-sql'',12],
                    [''2019-05-14'',''t-sql'',12]
                  ],
                  columns = [''date'',''language'',''version''])

OutputDataSet = dt'
WITH RESULT SETS
((
  py_date SMALLDATETIME
 ,py_lang VARCHAR(10)
 ,py_ver TINYINT
))

So what happens, when we have data with primary key over multiple columns and we want to maintain this dimensionality in Python, as well as, propagate the results using Python Pandas with MultiIndex.

Now, let’s run the same example with additional option of adding pandas MultiIndex:

EXEC sp_execute_external_script
	 @language = N'Python'
	,@script = N'
import pandas as pd

dt = pd.DataFrame([
                    [''2019-05-12'',''python'',6],
                    [''2019-05-13'',''python'',5],
                    [''2019-05-14'',''python'',10],
                    [''2019-05-12'',''t-sql'',12],
                    [''2019-05-13'',''t-sql'',12],
                    [''2019-05-14'',''t-sql'',12]
                  ],
                  columns = [''date'',''language'',''version''])
dt.set_index([''language'', ''version''], inplace=True)
OutputDataSet = dt'
WITH RESULT SETS
((
 py_date SMALLDATETIME
,py_lang VARCHAR(10)
,py_ver TINYINT
))

We get in return an error message:

Msg 11537, Level 16, State 3, Line 135
EXECUTE statement failed because its WITH RESULT SETS clause specified 3 column(s) for result set number 1, but the statement sent 1 column(s) at run time.

This is due to the fact, how pandas MultiIndex operates. It in-places the index columns (hence multiIndex) into same column and keep it for higher dimensionality data wrangling. The error message would have not occurred, if I would have changed the RESULT SETS to a single column (py_date). Same can be seen in your favorite Python IDE, how columns are in-placed in index:

To make your exports using pandas dataframes in in-database Machine Learning services in your SQL Server, one way is to concatenate the index within Python, to preserve (and expose) the information of index.

-- MultiIndex with inplace True
-- adding a concatenated value
EXEC sp_execute_external_script
 @language = N'Python'
,@script = N'

import pandas as pd
dt = pd.DataFrame([
                    [''2019-05-12'',''python'',6],
                    [''2019-05-13'',''python'',5],
                    [''2019-05-14'',''python'',10],
                    [''2019-05-12'',''t-sql'',12],
                    [''2019-05-12'',''t-sql'',12],
                    [''2019-05-12'',''t-sql'',12]
                  ],
                  columns = [''date'',''language'',''version''])

dt[''PreservedIndex''] = dt[''language''].astype(str) + '';'' + \
 dt[''version''].astype(str)

dt.set_index([''language'', ''version''], inplace=True)
OutputDataSet=dt'
WITH RESULT SETS
((
  py_date SMALLDATETIME
 ,py_PreservedIndex VARCHAR(30)
))

And the results would be:

So we managed to exposed the constructed MultiIndex back to SQL Server.

2. How are SQL Server Indexes manages by Python Pandas

Continuing with the same example, let’s create a SQL Server table and with a Primary key constraint.

-- Going from T-SQL
-- And why T-SQL constraints does not play a role in Python

DROP TABLE IF EXISTS PyLang;
GO

CREATE TABLE PyLang (
	 [Date] DATETIME NOT NULL
	,[language] VARCHAR(10) NOT NULL
	,[version] INT
	,CONSTRAINT PK_PyLang PRIMARY KEY(date, language)
);
GO

INSERT INTO PyLang (date, language, version)
          SELECT '2019-05-12', 'python', 6
UNION ALL SELECT '2019-05-13', 'python', 5
UNION ALL SELECT '2019-05-14', 'python', 10
UNION ALL SELECT '2019-05-12', 't-sql', 12
UNION ALL SELECT '2019-05-13', 't-sql', 12
UNION ALL SELECT '2019-05-14', 't-sql', 12;

SELECT * FROM PyLang;
GO

with a simple dataset as:

With index created on Date and Language, we can easily override this in Python pandas and created different MultiIndex with different columns:

EXEC sp_execute_external_script
 @language = N'Python'
,@script = N'

import pandas as pd

dt = InputDataSet
dt[''PreservedIndex''] = dt[''language''].astype(str) + '';'' + \
 dt[''version''].astype(str)
dt.set_index([''language'', ''version''], inplace=True)
OutputDataSet=dt'
,@input_data_1 = N'SELECT * FROM PyLang'
WITH RESULT SETS
((
  py_date SMALLDATETIME
 ,py_MyPreservedIndex VARCHAR(40)
))

With the results

showing that pandas MultiIndex was created using Language and Version column, as in T-SQL table, index was created on Date and Language.

Conclusion

SQL Server and Python Pandas Indexes are two different worlds and should not be mixed.
SQL Server uses Index primarily for DML operations and to keep data ACID.
Python Pandas uses Index and MultiIndex for keeping data dimensionality when performing data wrangling and statistical analysis.
SQL Server Index and Python Pandas Index don’t know about each other’s existence, meaning if user want to propagate the T-SQL index to Python Pandas (in order to minimize the impact of duplicates, missing values or to impose the relational model), it needs to be introduced and created, once data enters “in the python world”.
When inserting Python Pandas MultiIndex results into SQL Server table with index, make sure that the data dimensionality fits together and in case of T-SQL primary keys, make sure, Python code is not producing any duplicates.
Both, Python and T-SQL, indexes are help you having consistent data
Performance is down-graded in both cases, if indexes are not used correctly or when overused.

As always, code is available at GitHub.

Happy coding.

Thanks for visiting r-craft.org
This article is originally published at https://tomaztsql.wordpress.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Python Pandas MultiIndex and reading data from SQL Server

You may also like...

Categories

Python Pandas MultiIndex and reading data from SQL Server

1. Python Pandas MultiIndex in SQL Server

2. How are SQL Server Indexes manages by Python Pandas

Conclusion

You may also like...

DART: Dropout Regularization in Boosting Ensembles

Resampling to understand gender in #TidyTuesday art history data

Data discovery: seasonal speed

Categories