Returning Tables from Embedded R Execution … Simplified
This article is originally published at https://blogs.oracle.com/r/compendium/rss
In this tips and tricks blog, we share techniques from our own use of Oracle R Enterprise on data science projects that you may find useful in your projects. This time, we focus on automating the process of returning the data frame schema from the output of embedded R execution runs.
Embedded R Execution

ORE embedded R execution provides a powerful and convenient way to execute custom R scripts at the database server, from either R or SQL. It also enables running those scripts in a data-parallel or task-parallel manner. With embedded R execution, the user can also call any third-party R package and spawn one or more R engines to run a user-defined R function in parallel. A more detailed explanation can be found in the blog post Introduction to ORE Embedded R Script Execution.
The ORE embedded R execution functions, e.g., ore.tableApply and ore.rowApply, offer flexible choices of objects to return, such as R objects, models, etc. One of the most common choices is an ore.frame when called from R, or a table when called from SQL.
In order to get an ORE frame as a return object, we need to supply the schema (column names and types) to the embedded R function in the argument FUN.VALUE.
For instance, consider the dataset iris.
```
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1            5.1         3.5          1.4         0.2    setosa
2            4.9         3.0          1.4         0.2    setosa
3            4.7         3.2          1.3         0.2    setosa
4            4.6         3.1          1.5         0.2    setosa
5            5.0         3.6          1.4         0.2    setosa
6            5.4         3.9          1.7         0.4    setosa
...
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
```

We want to build a model for each iris species, say, a linear regression predicting Sepal.Length from the remaining numeric features, and return the coefficients associated with each species as a data.frame. To this end, the ore.groupApply function can build a linear regression model on the data for each species and do so in parallel. Consider the following code.
```r
grpAp2 <- ore.groupApply(IRIS, IRIS$Species,
  function(df) {
    if (nrow(df) == 0) {
      species <- character()
      cf <- numeric()
      names(cf) <- character()
    } else {
      species <- as.character(df$Species[1])
      cf <- coef(lm(Sepal.Length ~ ., data = df[1:4]))
    }
    data.frame(Species = species,
               CoefName = names(cf),
               CoefValue = unname(cf),
               stringsAsFactors = FALSE)
  },
  FUN.VALUE = data.frame(Species = character(),
                         CoefName = character(),
                         CoefValue = numeric(),
                         stringsAsFactors = FALSE),
  parallel = TRUE)
```

To get back an ore.frame from ore.groupApply, we need to supply FUN.VALUE containing the schema of the returned ore.frame. This schema contains three columns: Species, CoefName, and CoefValue. Supplying it is required if we want the output of ore.groupApply to be an ore.frame; otherwise the output is a list of the component data.frames. The schema also requires specifying the type of each column, such as character(), numeric(), etc. This is fine when the data frame has only a few columns, but in practice there are many use cases where the returned ore.frame has a large number of columns. Even when the number of columns is small, specifying the type of each column takes some effort.
In most cases, the user can first run the user-defined function locally and then obtain the schema from the output data. However, this requires repeating a similar process every time. Also, the user needs to pull the data from the database, which may not be efficient.
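The local workaround can be sketched in a few lines of base R: run the user-defined function on a representative subset and keep zero rows of the result, which preserves the column names and types. The build_model function below is a simplified stand-in for the user-defined function from the example above, run on the local iris data.

```r
# Sketch: derive a FUN.VALUE-style schema by running the user-defined
# function locally and keeping zero rows of its output.
build_model <- function(df) {
  cf <- coef(lm(Sepal.Length ~ ., data = df[1:4]))
  data.frame(Species = as.character(df$Species[1]),
             CoefName = names(cf),
             CoefValue = unname(cf),
             stringsAsFactors = FALSE)
}

out <- build_model(iris[iris$Species == "setosa", ])
schema <- out[0, ]  # zero-row data frame: column names and types only
str(schema)
```

This zero-row data frame is exactly the form FUN.VALUE expects; the convenience function below applies the same idea at the database server.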
To automate this process and let it run at the database server machine, we introduce a convenience function, ore.getApplySchema. The idea is to supply a representative subset data frame as input to the function; the schema can then be produced from the output and returned to the user, who uses it when calling the embedded R execution function to return an ore.frame. While it is not difficult to realize this process by writing a few lines of R code, we wanted to provide a general-purpose convenience function that avoids writing such code and illustrates working with embedded R execution.
Our ore.getApplySchema function below supports rowApply, tableApply, and groupApply; however, it could be extended to work with doEval and indexApply as well. Note that the type of embedded R execution function must be specified, because the input and the code to retrieve the schema differ across functions.
```r
ore.getApplySchema <- function(ODF, function_name, ..., col = NULL,
                               row.num = NULL, type = 'rowApply',
                               ore.connect.flag = FALSE) {
  # check whether the script is loaded in the R script repository
  res <- ore.scriptList(name = function_name)
  if (nrow(res) == 0) {
    stop('Function does not exist in the repository!')
  }
  rownames(ODF) <- ODF[, 1]
  if (is.null(row.num)) INPUT <- ODF else INPUT <- ODF[row.num, ]
  switch(type,
    rowApply = {
      res <- ore.rowApply(INPUT, FUN.NAME = function_name, ...,
                          ore.connect = ore.connect.flag)
    },
    tableApply = {
      res <- ore.tableApply(INPUT, FUN.NAME = function_name, ...,
                            ore.connect = ore.connect.flag)
    },
    groupApply = {
      if (is.null(col)) stop("group apply requires the col information!")
      res <- ore.groupApply(INPUT, INPUT[, col], FUN.NAME = function_name, ...,
                            ore.connect = ore.connect.flag)
    }
  )
  if (type == 'tableApply') {
    schema <- ore.pull(res)[0, ]
  } else {
    schema <- res[[1]][0, ]
  }
  return(schema)
}
```

For the IRIS data, we first write the user-defined function separately. Note that the user-defined function should return a data frame whose schema conforms to the desired ore.frame schema. As seen in the following code, we add lines to handle the case where the input is an empty data frame.
```r
build_model <- function(df) {
  if (nrow(df) == 0) {
    species <- character()
    cf <- numeric()
    names(cf) <- character()
  } else {
    species <- as.character(df$SPECIES[1])
    cf <- coef(lm(SEPAL.LENGTH ~ ., data = df[1:4]))
  }
  data.frame(SPECIES = species,
             COEFNAME = names(cf),
             COEFVALUE = unname(cf),
             stringsAsFactors = FALSE)
}
```

After uploading build_model to the R script repository (for example, with ore.scriptCreate), call our function as follows.
```r
schema <- ore.getApplySchema(IRIS, 'build_model', col = 'SPECIES',
                             row.num = NULL, type = 'groupApply')
schema
```

```
[1] SPECIES   COEFNAME  COEFVALUE
<0 rows> (or 0-length row.names)
```

Next, we use another example for a more detailed demonstration of this function, in two different scenarios: one in the R environment and one using the SQL interface to embedded R execution.
Example in Oracle R Enterprise

To illustrate working with significantly more columns, consider an analysis on the adult dataset from the UCI data repository. The dataset contains demographic information about people. We first load the data and view a few rows:
```r
adult.df <- read.csv(file = "/scratch/data/adult.csv", header = FALSE)
colnames(adult.df) <- c("age", "workclass", "fnlwgt", "education",
                        "education_num", "marital_status", "occupation",
                        "relationship", "race", "sex", "capital_gain",
                        "capital_loss", "hours_per_week", "native_country",
                        "label")
head(adult.df)
```

```
  age        workclass fnlwgt education education_num
1  50 Self-emp-not-inc  83311 Bachelors            13
2  38          Private 215646   HS-grad             9
3  53          Private 234721      11th             7
4  28          Private 338409 Bachelors            13
5  37          Private 284582   Masters            14
6  49          Private 160187       9th             5
         marital_status        occupation  relationship  race    sex
1    Married-civ-spouse   Exec-managerial       Husband White   Male
2              Divorced Handlers-cleaners Not-in-family White   Male
3    Married-civ-spouse Handlers-cleaners       Husband Black   Male
4    Married-civ-spouse    Prof-specialty          Wife Black Female
5    Married-civ-spouse   Exec-managerial          Wife White Female
6 Married-spouse-absent     Other-service Not-in-family Black Female
  capital_gain capital_loss hours_per_week native_country label
1            0            0             13  United-States <=50K
2            0            0             40  United-States <=50K
3            0            0             40  United-States <=50K
4            0            0             40           Cuba <=50K
5            0            0             40  United-States <=50K
6            0            0             16        Jamaica <=50K
```

Since the dataset contains many categorical variables, we wish to create dummy variables (one-hot encoding) for them. This is useful because many R modeling packages, such as xgboost or glmnet, require converting categorical variables into numeric vectors.
For instance, the categorical variable marital_status contains six levels: Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, and Separated. Dummy-variable encoding generates a vector of six binary values (0, 1), with one column per level. For persons with the marital status 'Separated', the dummy columns will be (0, 0, 0, 0, 0, 1).
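As a quick illustration of the encoding itself, base R's model.matrix can produce the same kind of 0/1 columns for a single factor (this is only to show the idea on a toy vector; the workflow below uses the dummies package):

```r
# One-hot encode a toy marital_status factor; '- 1' removes the intercept
# so each of the six levels gets its own 0/1 indicator column.
status <- factor(c("Divorced", "Separated", "Married-civ-spouse"),
                 levels = c("Divorced", "Married-AF-spouse",
                            "Married-civ-spouse", "Married-spouse-absent",
                            "Never-married", "Separated"))
dummies <- model.matrix(~ status - 1)
dummies[2, ]  # the 'Separated' observation: (0, 0, 0, 0, 0, 1)
```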
There are 8 categorical variables, so the total number of dummy variables equals the number of distinct levels across all categorical features. As we can imagine, the resulting ore.frame will have many columns, which would be tedious to write out explicitly.
We apply the function dummy.data.frame from the 'dummies' package to the categorical columns:
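The local call might look like the following (a sketch; it assumes the dummies package is installed and uses the lower-case column names assigned above):

```r
library(dummies)  # provides dummy.data.frame()
factor.features <- c("workclass", "education", "marital_status", "occupation",
                     "relationship", "race", "sex", "native_country")
# replace each categorical column with one 0/1 column per level
adult.dummies <- dummy.data.frame(adult.df, names = factor.features, sep = "_")
ncol(adult.dummies)
```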
In total, there are 109 columns after the dummy variables are created.
Suppose the dataset resides inside Oracle Database. We can use tableApply to call the dummies package at the database server, but supplying a data frame schema explicitly in that call is cumbersome. In this case, we can use the function ore.getApplySchema to retrieve the schema.
First, let us load the dataset into Oracle Database.
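This can be done with ore.create (a sketch; the table name ADULT and the upper-cased column names are our choices for this walk-through):

```r
# upper-case the names so they map cleanly to Oracle column identifiers
colnames(adult.df) <- toupper(colnames(adult.df))
ore.create(adult.df, table = "ADULT")  # ADULT is now an ore.frame proxy
```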
We write the function to generate the dummy variables:
```r
convert_dummies <- function(adult.df) {
  library(dummies)
  # convert the column names to upper case to avoid adding extra quotes
  # when using the columns in queries
  factor.features <- toupper(c("workclass", "education", "marital_status",
                               "occupation", "relationship", "race", "sex",
                               "native_country"))
  output.df <- dummy.data.frame(adult.df, names = factor.features, sep = "_")
  return(output.df)
}
```

(Note that the generated column names come from the original values of the features, which contain spaces and '-' characters that may not be accepted as Oracle table column names. We added extra code to reformat the column names.)
Upload this function to the R script repository:
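For example, with ore.scriptCreate (overwrite = TRUE replaces any previously stored version):

```r
# register the user-defined function under the name "convert_dummies"
ore.scriptCreate("convert_dummies", convert_dummies, overwrite = TRUE)
```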
The code to call the function is as follows.
```r
schema <- ore.getApplySchema(ADULT, "convert_dummies", col = NULL,
                             row.num = NULL, type = 'tableApply')
```

After we run the function, the output schema is returned. Let us take a look:
```
> schema
 [1] AGE                     WORKCLASS_?             WORKCLASS_FEDERALGOV
 [4] WORKCLASS_LOCALGOV      WORKCLASS_NEVERWORKED   WORKCLASS_PRIVATE
 [7] WORKCLASS_SELFEMPINC    WORKCLASS_SELFEMPNOTINC WORKCLASS_STATEGOV
[10] WORKCLASS_WITHOUTPAY    FNLWGT                  EDUCATION_10TH
[13] EDUCATION_11TH          EDUCATION_12TH          EDUCATION_1ST4TH
[16] EDUCATION_5TH6TH        EDUCATION_7TH8TH        EDUCATION_9TH
[19] EDUCATION_ASSOCACDM     EDUCATION_ASSOCVOC      EDUCATION_BACHELORS
...
```

Looks like we retrieved all the new columns in the schema!
Now we can use the schema to actually run tableApply.
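A sketch of the call, passing the retrieved schema as FUN.VALUE so the result comes back as an ore.frame:

```r
res.odf <- ore.tableApply(ADULT, FUN.NAME = "convert_dummies",
                          FUN.VALUE = schema)
```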
The result res.odf is an ore.frame containing all the dummy-variable columns. Let us inspect this output:
```
> names(res.odf)
  [1] "AGE"                     "WORKCLASS_?"             "WORKCLASS_FEDERALGOV"    "WORKCLASS_LOCALGOV"
  [5] "WORKCLASS_NEVERWORKED"   "WORKCLASS_PRIVATE"       "WORKCLASS_SELFEMPINC"    "WORKCLASS_SELFEMPNOTINC"
  [9] "WORKCLASS_STATEGOV"      "WORKCLASS_WITHOUTPAY"    "FNLWGT"                  "EDUCATION_10TH"
 [13] "EDUCATION_11TH"          "EDUCATION_12TH"          "EDUCATION_1ST4TH"        "EDUCATION_5TH6TH"
 [17] "EDUCATION_7TH8TH"        "EDUCATION_9TH"           "EDUCATION_ASSOCACDM"     "EDUCATION_ASSOCVOC"
 [21] "EDUCATION_BACHELORS"     "EDUCATION_DOCTORATE"     "EDUCATION_HSGRAD"        "EDUCATION_MASTERS"
...
> head(res.odf)
  AGE WORKCLASS_? WORKCLASS_FEDERALGOV WORKCLASS_LOCALGOV WORKCLASS_NEVERWORKED WORKCLASS_PRIVATE WORKCLASS_SELFEMPINC WORKCLASS_SELFEMPNOTINC
1  23           0                    0                  0                     0                 1                    0                       0
2  40           0                    0                  0                     0                 1                    0                       0
3  41           0                    0                  0                     0                 0                    0                       1
4  24           0                    0                  0                     0                 0                    0                       0
5  20           1                    0                  0                     0                 0                    0                       0
6  38           0                    0                  0                     0                 1                    0                       0
  WORKCLASS_STATEGOV WORKCLASS_WITHOUTPAY FNLWGT EDUCATION_10TH EDUCATION_11TH EDUCATION_12TH EDUCATION_1ST4TH EDUCATION_5TH6TH
1                  0                    0 115458              0              0              0                0                0
2                  0                    0 347890              0              0              0                0                0
3                  0                    0 196001              0              0              0                0                0
4                  1                    0 273905              0              0              0                0                0
5                  0                    0 119156              0              0              0                0                0
6                  0                    0 179488              0              0              0                0                0
  EDUCATION_7TH8TH EDUCATION_9TH EDUCATION_ASSOCACDM EDUCATION_ASSOCVOC EDUCATION_BACHELORS EDUCATION_DOCTORATE EDUCATION_HSGRAD
1                0             0                   0                  0                   0                   0                1
2                0             0                   0                  0                   1                   0                0
3                0             0                   0                  0                   0                   0                1
4                0             0                   1                  0                   0                   0                0
5                0             0                   0                  0                   0                   0                0
6                0             0                   0                  0                   0                   0                0
  EDUCATION_MASTERS EDUCATION_PRESCHOOL EDUCATION_PROFSCHOOL EDUCATION_SOMECOLLEGE EDUCATION_NUM MARITAL_STATUS_DIVORCED
1                 0                   0                    0                     0             9                       0
2                 0                   0                    0                     0            13                       0
3                 0                   0                    0                     0             9                       0
...
```

It works! Using this function, we avoid writing the schema explicitly, which speeds up applying embedded R execution.
Example in Oracle SQL Embedded R Execution

Embedded R execution can also be initiated from Oracle SQL. Consider the following use case: a data scientist calls our convenience function to produce the schema of the output, then hands it over to an analyst who mainly works in SQL and wants to call the R code through the SQL counterpart of tableApply. How can we facilitate this process?
First, on the R side, make sure to upload the convert_dummies function to the R script repository:
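For instance (a sketch; global = TRUE makes the script visible outside the owning schema, which is useful when a different SQL user will invoke it):

```r
ore.scriptCreate("convert_dummies", convert_dummies,
                 global = TRUE, overwrite = TRUE)
```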
The main difficulty here is how to specify, in SQL, the output schema of the table with all the dummy variables. Our solution is to save the schema into a table in Oracle Database and then reference that table when calling rqTableEval. The entire process can be automated from the R side by adding a few lines to the convenience function ore.getApplySchema.
```r
ore.getApplySchema <- function(ODF, function_name, ..., col = NULL,
                               row.num = NULL, type = 'rowApply',
                               ore.connect.flag = FALSE, sql = FALSE,
                               schema.table = NULL) {
  # check whether the script is loaded in the R script repository
  res <- ore.scriptList(name = function_name)
  if (nrow(res) == 0) {
    stop('Function does not exist in the repository!')
  }
  rownames(ODF) <- ODF[, 1]
  if (is.null(row.num)) INPUT <- ODF else INPUT <- ODF[row.num, ]
  switch(type,
    rowApply = {
      res <- ore.rowApply(INPUT, FUN.NAME = function_name, ...,
                          ore.connect = ore.connect.flag)
    },
    tableApply = {
      res <- ore.tableApply(INPUT, FUN.NAME = function_name, ...,
                            ore.connect = ore.connect.flag)
    },
    groupApply = {
      if (is.null(col)) stop("group apply requires the col information!")
      res <- ore.groupApply(INPUT, INPUT[, col], FUN.NAME = function_name, ...,
                            ore.connect = ore.connect.flag)
    }
  )
  if (type == 'tableApply') {
    schema <- ore.pull(res)[0, ]
  } else {
    schema <- res[[1]][0, ]
  }
  if (sql == TRUE) {
    stopifnot(!is.null(schema.table))
    ore.drop(table = schema.table)
    ore.create(schema, table = schema.table)
    qry <- paste0("SELECT * FROM ", schema.table)
    return(qry)
  }
  return(schema)
}
```

In this use case, we can call the function in the following way.
```r
qry <- ore.getApplySchema(ADULT, "convert_dummies", col = NULL,
                          row.num = NULL, type = 'tableApply',
                          sql = TRUE, schema.table = 'ADULT_SCHEMA')
```

The function saves the schema into a database table named ADULT_SCHEMA. The returned value qry is the query string 'SELECT * FROM ADULT_SCHEMA', which is used as the output-format query in rqTableEval.
From the SQL side, we can also create a table ADULT_RESULT to store the result. This avoids the extra work of materializing the ORE frame into an Oracle Database table.
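A sketch of the SQL side, assuming convert_dummies is registered in the R script repository and ADULT_SCHEMA was created by ore.getApplySchema as above. rqTableEval takes an input cursor, a parameter cursor, an output-format query, and the script name:

```sql
CREATE TABLE ADULT_RESULT AS
SELECT *
  FROM table(rqTableEval(
         cursor(SELECT * FROM ADULT),   -- input data
         NULL,                          -- no parameters
         'SELECT * FROM ADULT_SCHEMA',  -- output format (the saved schema)
         'convert_dummies'));           -- script name in the repository
```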
Let us check the results:
```sql
SELECT * FROM ADULT_RESULT;
```

The output shows the result is returned as a table with all the dummy columns.
Conclusion

We provided a convenience function that automatically generates the result data frame schema for use in embedded R execution when returning a table, and we illustrated how to use it from both R and SQL. Automating this step lets the user focus on the more important data processing tasks.