R to Oracle Database Connectivity: Use ROracle for both Performance and Scalability (2018)
This article is originally published at https://blogs.oracle.com/r/compendium/rss
R users have a few choices of how to connect to Oracle Database. The most commonly seen include: RODBC, RJDBC, and ROracle. However, these three packages have significantly different performance and scalability characteristics, which can greatly impact your application development. In this blog, we’ll discuss these options and highlight performance benchmark results on a wide range of data sets. This performance benchmark post is an update from our 2013 blog post and uses Exadata 7-2 with ROracle 1.3-1. There are some improvements to both the performance and scalability of RODBC and RJDBC compared to our first benchmark, but ROracle remains the overall leader. I'd like to acknowledge the contributions of Rajendra Pingte who provided the raw data for this blog.
By way of introduction, RODBC is an R package that implements ODBC database connectivity. There are two groups of functions: the largely internal odbc* functions that implement low-level access to the corresponding ODBC functions having a similar name, and the higher level sql* functions that support read, save, copy, and manipulation of data between R data.frame objects and database tables. Here is an example using RODBC:library(RODBC) con <- odbcConnect("DD1", uid="rquser", pwd="rquser", rows_at_time = 500) sqlSave(con,test_table, "TEST_TABLE") sqlQuery(con,"select count(*) from TEST_TABLE") d <- sqlQuery(con, "select * from TEST_TABLE") close(con)
The R package RJDBC is an implementation of the R DBI package – database interface – that uses JDBC as the back-end connection to the database. Any database that supports a JDBC driver can be used in connection with RJDBC. Here is an example using RJDBC:library(RJDBC) drv <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="…/ojdbc8.jar"," ") con <- dbConnect(drv, " jdbc:oracle:thin:@myHost:1521:db", "rquser","rqpasswd") dbWriteTable(con,"TEST_TABLE", test_table) dbGetQuery(con,"select count(*) from TEST_TABLE") d <- dbReadTable(con, "TEST_TABLE") dbDisconnect(con)
The ROracle package is an implementation of the R DBI package that uses Oracle OCI for high performance and scalability with Oracle Database. It requires Oracle Instant Client or Oracle Database Client to be installed on the client machine. Here is an example using ROracle:library(ROracle) drv <- dbDriver("Oracle") con <- dbConnect(drv, "rquser", "rqpasswd") dbWriteTable(con,"TEST_TABLE", test_table) dbGetQuery(con,"select count(*) from TEST_TABLE") d <- dbReadTable(con, "TEST_TABLE") dbDisconnect(con)
Notice that since both RJDBC and ROracle implement the DBI interface, their code is similar except for the driver and connection details.
To compare the performance of these interfaces, we prepared tests along several dimensions:
- Data Tables
- Number of rows – 1K, 10K, 100K, and 1M
- Number of columns – 10, 100, 1000
- Data types – NUMBER, BINARY_DOUBLE, TIMESTAMP, and VARCHAR
Numeric data is randomly generated, all character data is 10 characters long.
- Packages supporting RODBC, RJDBC, and ROracle:
DBI_1.0.0.tar.gz RODBCDBI_0.1.1.tar.gz RODBCext_0.3.1.tar.gz
odbc_1.1.6.tar.gz RJDBC_0.2-7.1.tar.gz RODBC_1.3-15.tar.gz
- Types of operations: select, create, insert, and connect
Loading database data to an R data.frame
Where an in-database R capability as provided by Oracle R Enterprise is not available, R users typically pull data to the R client for data exploration, preparation, modeling, etc. In Figure 1, we compare the execution time to pull 10, 100, and 1000 columns of data from 1K, 10K, 100K, and 1M rows for BINARY_DOUBLE data on a log-log scale. Notice that ROracle is consistently faster than both RODBC and RJDBC, and that all three interfaces largely scale linearly. The driver is labeled with a ".get" suffix to flag that this result is using dbGetQuery(), as opposed to dbSendQuery with a fetch, however both results were comparable and so dbSendQuery results are omitted. Below, we discuss the full set of benchmark results in greater detail.
Figure 1: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE for Select *
In Figure 2, we provide the benchmark results for RODBC, RJDBC, and ROracle across the four data types. Notice that ROracle provides the best performance for all data types except TIMESTAMP. With TIMESTAMP, RJDBC performs consistently better, excluding the smallest data set. We have found that the first call to RDJBC can result in a java startup cost that goes away upon subsequent execution.This also appears for VARCHAR2 as depicted below.
ROracle is consistently faster than RODBC: BINARY_DOUBLE data up to 11X faster, NUMBER data up to 3.4X faster, VARCHAR2 data up to 5.7X faster, and TIMESTAMP data up to 4.9X faster.
For RJDBC, ROracle is up to 48X faster on BINARY_DOUBLE data (mean of 17X), up to 10X faster on NUMBER data (mean of 3X), 36X for VARCHAR2 data (mean of 6X). As noted above, TIMESTAMP data is where RJDBC performs faster than ROracle due to the fact that RJDBC treats TIMESTAMP data as a character string, not as POSIXct.
Note that RODBC and RJDBC have a limit of 255 characters on the length the VARCHAR2 columns. ROracle creates VARCHAR2(4000).
Figure 2: Comparison of RJDBC, RODBC, and ROracle for select * from <table>
For reference, the data set sizes (represented in megabytes) are captured in Table 1 for all data types.
Table 1: Dataset sizes in megabytes
Creating database tables from an R data.frame
Data or results created in R may need to be written to a database table. In Figure 3, we compare the execution time to create tables with 10, 100, and 1000 columns with 1K, 10, 100K, and 1M rows for each data type, performing row inserts in batches of 1000 rows. In all cases, RODBC had the slowest create times. ROracle is up to 198X faster (mean of 23X and median 9X) across all entries. For RJDBC, ROracle is up to 5.6X faster (mean and median approaching 2X) across all entries. Note that RJDBC does not support the TIMESTAMP data type for CREATE.
Figure 3: Comparison of RJDBC, RODBC, and ROracle for creating a table
Inserting rows to database tables from an R data.frame
In other cases, data or results created in R may need to be inserted into existing database tables. In Figure 4, we compare the execution time to insert rows into existing tables with 10, 100, and 1000 columns, insert data.frames with 1K, 10K, 100K, and 1M rows for each data type as before, in batches of 1000 rows. Similar to CREATE, RODBC has the slowest performance. ROracle is up to 243X faster (mean of 34X and median 12.4X) across all entries. For RJDBC, ROracle is up to 18X faster (mean of 2.6X and median 2X). As for CREATE, RJDBC does not support the TIMESTAMP data type for INSERT.
Figure 4: Comparison of RJDBC, RODBC, and ROracle for inserting rows into a table
Connecting to Oracle Database
Depending on the application, sub-second response time may be sufficient to meet application database connection requirements. As depicted in Figure 5, ROracle and RJDBC require minimal time (0.05 seconds) to establish a database connection compared to RODBC, which is 8X slower.
Figure 5: Database connection times for ROracle, RJDBC, and RODBC
In summary, ROracle supports a wide range of application needs for performance and scalability. While RJDBC and RODBC have improved their overall performance, ROracle remains the best choice for Oracle Database.
All tests were performed on a quarter rack X7-2 Exadata. Oracle Database was version 126.96.36.199 with R 3.5.0.
For JDBC, we used a larger size for VM:
options(java.parameters = "-Xmx80g")
Please visit source website for post related comments.