Python Pandas MultiIndex and reading data from SQL Server
This article is originally published at https://tomaztsql.wordpress.com
Python Pandas multiIndex is a hierarchical indexing over multiple tuples or arrays of data, enabling advanced dataframe wrangling and analysis on higher data dimensionality.
Everyone have had come across multiIndex in Python Pandas and had little annoyancens as the first time.
1. Python Pandas MultiIndex in SQL Server
With this in mind, we can create an example of pandas dataframe:
import pandas as pd dt = pd.DataFrame([ ['2019-05-12','python',6], ['2019-05-13','python',5], ['2019-05-14','python',10], ['2019-05-12','t-sql',12], ['2019-05-13','t-sql',12], ['2019-05-14','t-sql',12] ], columns = ['date','language','version'])
And the results are:
MultiIndex will give further data wrangling the capability of higher dimensionality wrangling. By simple:
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd dt = pd.DataFrame([ [''2019-05-12'',''python'',6], [''2019-05-13'',''python'',5], [''2019-05-14'',''python'',10], [''2019-05-12'',''t-sql'',12], [''2019-05-13'',''t-sql'',12], [''2019-05-14'',''t-sql'',12] ], columns = [''date'',''language'',''version'']) OutputDataSet = dt' WITH RESULT SETS (( py_date SMALLDATETIME ,py_lang VARCHAR(10) ,py_ver TINYINT ))
So what happens, when we have data with primary key over multiple columns and we want to maintain this dimensionality in Python, as well as, propagate the results using Python Pandas with MultiIndex.
Now, let’s run the same example with additional option of adding pandas MultiIndex:
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd dt = pd.DataFrame([ [''2019-05-12'',''python'',6], [''2019-05-13'',''python'',5], [''2019-05-14'',''python'',10], [''2019-05-12'',''t-sql'',12], [''2019-05-13'',''t-sql'',12], [''2019-05-14'',''t-sql'',12] ], columns = [''date'',''language'',''version'']) dt.set_index([''language'', ''version''], inplace=True) OutputDataSet = dt' WITH RESULT SETS (( py_date SMALLDATETIME ,py_lang VARCHAR(10) ,py_ver TINYINT ))
We get in return an error message:
Msg 11537, Level 16, State 3, Line 135
EXECUTE statement failed because its WITH RESULT SETS clause specified 3 column(s) for result set number 1, but the statement sent 1 column(s) at run time.
This is due to the fact, how pandas MultiIndex operates. It in-places the index columns (hence multiIndex) into same column and keep it for higher dimensionality data wrangling. The error message would have not occurred, if I would have changed the RESULT SETS to a single column (py_date). Same can be seen in your favorite Python IDE, how columns are in-placed in index:
To make your exports using pandas dataframes in in-database Machine Learning services in your SQL Server, one way is to concatenate the index within Python, to preserve (and expose) the information of index.
-- MultiIndex with inplace True -- adding a concatenated value EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd dt = pd.DataFrame([ [''2019-05-12'',''python'',6], [''2019-05-13'',''python'',5], [''2019-05-14'',''python'',10], [''2019-05-12'',''t-sql'',12], [''2019-05-12'',''t-sql'',12], [''2019-05-12'',''t-sql'',12] ], columns = [''date'',''language'',''version'']) dt[''PreservedIndex''] = dt[''language''].astype(str) + '';'' + \ dt[''version''].astype(str) dt.set_index([''language'', ''version''], inplace=True) OutputDataSet=dt' WITH RESULT SETS (( py_date SMALLDATETIME ,py_PreservedIndex VARCHAR(30) ))
And the results would be:
So we managed to exposed the constructed MultiIndex back to SQL Server.
2. How are SQL Server Indexes manages by Python Pandas
Continuing with the same example, let’s create a SQL Server table and with a Primary key constraint.
-- Going from T-SQL -- And why T-SQL constraints does not play a role in Python DROP TABLE IF EXISTS PyLang; GO CREATE TABLE PyLang ( [Date] DATETIME NOT NULL ,[language] VARCHAR(10) NOT NULL ,[version] INT ,CONSTRAINT PK_PyLang PRIMARY KEY(date, language) ); GO INSERT INTO PyLang (date, language, version) SELECT '2019-05-12', 'python', 6 UNION ALL SELECT '2019-05-13', 'python', 5 UNION ALL SELECT '2019-05-14', 'python', 10 UNION ALL SELECT '2019-05-12', 't-sql', 12 UNION ALL SELECT '2019-05-13', 't-sql', 12 UNION ALL SELECT '2019-05-14', 't-sql', 12; SELECT * FROM PyLang; GO
with a simple dataset as:
With index created on Date and Language, we can easily override this in Python pandas and created different MultiIndex with different columns:
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd dt = InputDataSet dt[''PreservedIndex''] = dt[''language''].astype(str) + '';'' + \ dt[''version''].astype(str) dt.set_index([''language'', ''version''], inplace=True) OutputDataSet=dt' ,@input_data_1 = N'SELECT * FROM PyLang' WITH RESULT SETS (( py_date SMALLDATETIME ,py_MyPreservedIndex VARCHAR(40) ))
With the results
showing that pandas MultiIndex was created using Language and Version column, as in T-SQL table, index was created on Date and Language.
Conclusion
- SQL Server and Python Pandas Indexes are two different worlds and should not be mixed.
- SQL Server uses Index primarily for DML operations and to keep data ACID.
- Python Pandas uses Index and MultiIndex for keeping data dimensionality when performing data wrangling and statistical analysis.
- SQL Server Index and Python Pandas Index don’t know about each other’s existence, meaning if user want to propagate the T-SQL index to Python Pandas (in order to minimize the impact of duplicates, missing values or to impose the relational model), it needs to be introduced and created, once data enters “in the python world”.
- When inserting Python Pandas MultiIndex results into SQL Server table with index, make sure that the data dimensionality fits together and in case of T-SQL primary keys, make sure, Python code is not producing any duplicates.
- Both, Python and T-SQL, indexes are help you having consistent data
- Performance is down-graded in both cases, if indexes are not used correctly or when overused.
As always, code is available at GitHub.
Happy coding.
Thanks for visiting r-craft.org
This article is originally published at https://tomaztsql.wordpress.com
Please visit source website for post related comments.