Another Benchmark for Joining Two Data Frames
In yesterday's post comparing the efficiency of joining two data frames, I overlooked the computing cost of converting data.frames to data.table / ff objects. Today, I did the...continue reading.
In R, there are multiple ways to merge two data frames, and their efficiency can differ widely. Therefore, it is worthwhile to test the performance...continue reading.
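As a point of reference for the comparison, here is a minimal sketch of the base R approach, merge(), timed with system.time(). The toy data frames and column names are assumptions made for illustration and are not taken from the original benchmark.

```r
# Hypothetical toy data; the column names are illustrative assumptions.
left  <- data.frame(id = c(1, 2, 3, 4), x = c("a", "b", "c", "d"))
right <- data.frame(id = c(2, 3, 5), y = c(20, 30, 50))

# Inner join on the shared key: only ids present in both frames survive.
inner <- merge(left, right, by = "id")

# Left join: every row of `left` is kept, with NA where `right` has no match.
left_join <- merge(left, right, by = "id", all.x = TRUE)

# Time a single call; a real benchmark would use much larger inputs
# and repeat the call several times.
timing <- system.time(merge(left, right, by = "id"))
```

When comparing against other approaches, any conversion step (for example, to another object type) belongs inside the timed expression so that its cost is counted.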
> require('RWeka')
> require('pROC')
>
> # SEPARATE DATA INTO TRAINING AND TESTING SETS
> df1 <- read.csv('credit_count.csv')
> df2 <- df1[df1$CARDHLDR == 1, 2:12]
> set.seed(2013)
> rows <-...continue reading.
In the example below, 552 rows are extracted from a data frame with 10 million rows using six different methods. Results show a significant disparity between the least and the...continue reading.
Similar to the NLMIXED procedure in SAS, optim() in R can estimate a model from an explicitly specified log likelihood function. Below is a demo showing how to...continue reading.
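To make the idea concrete, here is a minimal sketch, assuming a Poisson regression on simulated data (the model, the data, and all variable names are assumptions, not the post's own demo). The negative log likelihood is written out by hand and passed to optim(), and glm() is used only as a cross-check.

```r
# Simulated data; the true coefficients 0.5 and 0.3 are arbitrary assumptions.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))

# Negative log likelihood of a Poisson regression with a log link.
# optim() minimizes, so the sign of the log likelihood is flipped.
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(dpois(y, lambda = exp(eta), log = TRUE))
}

# Minimize from a zero start; hessian = TRUE returns the observed information.
fit <- optim(par = c(0, 0), fn = neg_loglik, method = "BFGS", hessian = TRUE)

# Approximate standard errors from the inverse Hessian.
se <- sqrt(diag(solve(fit$hessian)))

# Cross-check against the built-in estimator.
ref <- coef(glm(y ~ x, family = poisson))
```

The estimates in fit$par should be close to ref; small differences come from the numerical gradient and the stopping tolerance.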
data.table (http://datatable.r-forge.r-project.org/) inherits from data.frame and provides fast subsetting, fast grouping, and fast joins. In previous posts, it was shown that the shortest CPU time to aggregate a...continue reading.
Motivated by my young friend, HongMing Song, I managed to find more handy ways to calculate aggregated statistics by group in R. They require loading additional packages, plyr, doBy, Hmisc,...continue reading.
Below is an R snippet comparing data import efficiency among CSV, SQLite, and HDF5. As in the Python case posted yesterday, HDF5 shows the highest efficiency. Continue reading.
After posting “Removing Records by Duplicate Values” yesterday, I had an interesting communication thread with my friend Jeffrey Allard tonight regarding how to code this in R, a combination of...continue reading.
Removing records from a data table based on duplicate values in one or more columns is a common and important data cleaning technique. Below is an example of how...continue reading.
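A minimal sketch of the technique in base R, using duplicated(). The toy data frame and its column names are assumptions made for illustration and are not the post's own example.

```r
# Hypothetical toy data; column names are illustrative assumptions.
df <- data.frame(
  id  = c(1, 1, 2, 3, 3, 3),
  grp = c("a", "a", "b", "c", "c", "d"),
  val = 1:6
)

# Keep the first record for each value of a single column.
by_id <- df[!duplicated(df$id), ]

# Keep the first record for each combination of several columns.
by_id_grp <- df[!duplicated(df[, c("id", "grp")]), ]

# Keep the last record per id instead of the first.
last_by_id <- df[!duplicated(df$id, fromLast = TRUE), ]
```

Which record survives depends on row order, so sorting the data first is the usual way to control whether the earliest, latest, or largest record is kept.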
In the practice of risk modeling, it is sometimes mandatory to maintain a monotonic relationship between the response and each predictor. Below is a demonstration showing how to develop a...continue reading.
In [1]: import pandas as pd
In [2]: import statsmodels.api as sm
In [3]: data = pd.read_table('/home/liuwensui/Documents/data/csdata.txt')
In [4]: Y = data.LEV_LT3
In [5]: X = sm.add_constant(data[['COLLAT1', 'SIZE1', 'PROF2', 'LIQ',...continue reading.
SQLite is a lightweight, zero-configuration database. Being fast, reliable, and simple, SQLite is a good choice to store / query large data, e.g. terabytes, and is well supported by...continue reading.
library(chron)
library(zoo)
# STOCK TICKER OF Fifth Third Bancorp
stock <- 'FITB'
# DEFINE STARTING DATE
start.date <- 1
start.month <- 1
start.year <- 2012
# DEFINE ENDING DATE
end.date...continue reading.
#################################################
## FIT A MULTIVARIATE ADAPTIVE REGRESSION ##
## SPLINES MODEL (MARS) USING MDA PACKAGE ##
## DEVELOPED BY HASTIE AND TIBSHIRANI ##
##############################################…continue reading.