Programming / R / R News / Statistics

Beyond Univariate, Single-Sample Data with MCHT

by ntguardian · November 6, 2018

This article is originally published at https://ntguardian.wordpress.com

Introduction

I’ve spent the past few weeks writing about MCHT, my new package for Monte Carlo and bootstrap hypothesis testing. After discussing how to use MCHT safely, I discussed how to use it for maximized Monte Carlo (MMC) testing, then bootstrap testing. One may think I’ve said all I want to say about the package, but in truth, I’ve only barely passed the halfway point!

Today I’m demonstrating how general MCHT is, allowing one to use it for multiple samples and on non-univariate data. I’ll be doing so with two examples: a permutation test and the $F$ test for significance of a regression model.

Permutation Test

The idea of the permutation test dates back to Fisher (see [1]) and it forms the basis of computational testing for difference in mean. Let’s suppose that we have two samples with respective means $\mu_X$ and $\mu_Y$ , respectively. Suppose we wish to test

$H_0: \mu_X = \mu_Y$

against

$H_A: \mu_X > \mu_Y$

using samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ , respectively.

If the null hypothesis is true and we also make the stronger assumption that the two samples were drawn from distributions that could differ only in their means, then the labelling of the two samples is artificial, and if it were removed the two samples would be indistinguishable. Relabelling the data and artificially calling one sample the $X$ sample and the other the $Y$ sample would produce highly similar statistics to the one we actually observed. This observation suggests the following procedure:

Generate $N$ new datasets by randomly assigning labels to the combined sample of $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ .
Compute copies of the test statistic on each of the new samples; suppose that the test statistic used is the difference in means, $\bar{X} - \bar{Y}$ .
Compute the test statistic on the actual sample and compare to the simulated statistics. If the actual statistic is relatively large compared to the simulated statistics, then reject the null hypothesis in favor of the alternative; otherwise, don’t reject.

In practice step 3 is done by computing a $p$ -value representing the proportion of simulated statistics larger than the one actually computed.

Permutation Tests Using MCHT

The permutation test is effectively a bootstrap test, so it is supported by MCHT, though one may wonder how that’s the case when the parameters test_stat, stat_gen, and rand_gen only accept one parameter, x, representing the dataset (as opposed to, say, t.test(), which has an x and an optional y parameter). But MCHTest() makes very few assumptions about what object x actually is; if your object is either a vector or tabular, then the MCHTest object should not have a problem with it (it’s even possible a loosely structured list would be fine, but I have not tested this; tabular formats should cover most use cases).

In this case, putting our data in long-form format makes doing a permutation test fairly simple. One column will contain the group an observation belongs to while the other contains observation values. The test_stat function will split the data according to group, compute group-wise means, and finally compute the test statistic. rand_gen generates new dataset by permuting the labels in the data frame. stat_gen merely serves as the glue between the two.

The result is the following test.

library(MCHT)
library(doParallel)

registerDoParallel(detectCores())

ts <- function(x) {
  grp_means <- aggregate(value ~ group, data = x, FUN = mean)

  grp_means$value[1] - grp_means$value[2]
}

rg <- function(x) {
  x$group <- sample(x$group)
  x
}

sg <- function(x) {
  test_stat(x)
}

permute.test <- MCHTest(ts, sg, rg, seed = 123, N = 1000,
                        localize_functions = TRUE)

df <- data.frame("value" = c(rnorm(5, 2, 1), rnorm(10, 0, 1)),
                 "group" = rep(c("x", "y"), times = c(5, 10)))

permute.test(df)

## 
## 	Monte Carlo Test
## 
## data:  df
## S = 1.3985, p-value = 0.036

Linear Regression F Test

Suppose for each observation in our dataset there is an outcome of interest, $y_i$ , and there are variables $x_{1,i}, \ldots, x_{p,i}$ that could together help predict the value of $y_i$ if they are known. Consider then the following linear regression model (with $E[\epsilon_i] = 0$ ):

$y_i = \beta_0 + \beta_1 x_{1,i} + \ldots + \beta_p x_{p,i} + \epsilon_i$

The first question someone should asked when considering a regression model is whether it’s worth anything at all. An alternative approach to predicting $y_i$ is simply to predict its mean value. That is, the model

$y_i = \beta_0 + \epsilon_i$

is much simpler and should be preferred to the more complicated model listed above if it’s just as good at explaining the behavior of $y_i$ for all $i$ . Notice the second model is simply the first model with all the coefficients $\beta_1, \ldots, \beta_p$ identically equal to zero.

The $F$ -test (described in more detail here) can help us decide between these two competing models. Under the null hypothesis, the second model is the true model.

$H_0: \beta_1 = \ldots = \beta_p = 0$

The alternative says that at least one of the regressors is helpful in predicting $y_i$ .

$H_A: H_0 \text{ is false}$

We can use the $F$ statistic to decide between the two models:

$F = \frac{\frac{\text{RSS}_2 - \text{RSS}_1}{p - 1}}{\frac{\text{RSS}_1}{n - p}}$

$\text{RSS}_1$ and $\text{RSS}_2$ are the residual sum of squares of models 1 and
2, respectively.

This test is called the $F$ -test because usually the F-distribution is used to compute $p$ -values (as this is the distributiont the $F$ statistic should follow when certain conditions hold, at least asymptotically if not exactly). What then would a bootstrap-based procedure look like?

If the null hypothesis is true then the best model for the data is this:

$y_i = \bar{y} + \hat{\epsilon}_i$

$\bar{y}$ is the sample mean of $y$ and $\hat{epsilon}_i = y_i - \bar{y}$ is the residual. This suggests the following procedure:

Shuffle $y_i$ over all rows of the input dataset, with replacement, to generate new datasets.
Compute $F$ statistics for each of the generated datasets.
Compare the $F$ statistic of the actual dataset to the generated datasets’ statistics.

F Test Using MCHT

Let’s perform the $F$ test on a subset of the iris dataset. We will see if there is a relationship between the sepal length and sepal width among iris setosa flowers. Below is an initial split and visualization:

library(dplyr)

setosa <- iris %>% filter(Species == "setosa") %>%
            select(Sepal.Length, Sepal.Width)
plot(Sepal.Width ~ Sepal.Length, data = setosa)

There is an obvious relationship between the variables. Thus we should expect the test to reject the null hypothesis. That is what we would conclude if we were to run the conventional $F$ test:

res <- lm(Sepal.Width ~ Sepal.Length, data = setosa)
summary(res)

## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = setosa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72394 -0.18273 -0.00306  0.15738  0.51709 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.5694     0.5217  -1.091    0.281    
## Sepal.Length   0.7985     0.1040   7.681 6.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2565 on 48 degrees of freedom
## Multiple R-squared:  0.5514,	Adjusted R-squared:  0.542 
## F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10

Let’s now implement the procedure I described with MCHTest().

ts <- function(x) {
  res <- lm(Sepal.Width ~ Sepal.Length, data = x)
  summary(res)$fstatistic[[1]]  # Only way I know to automatically compute the
                                # statistic
}

# rand_gen's function can use both x and n, and n will be the number of rows of
# the dataset
rg <- function(x, n) {
  x$Sepal.Width <- sample(x$Sepal.Width, replace = TRUE, size = n)
  x
}

b.f.test.1 <- MCHTest(ts, ts, rg, seed = 123, N = 1000)

b.f.test.1(setosa)

## 
## 	Monte Carlo Test
## 
## data:  setosa
## S = 58.994, p-value < 2.2e-16

Excellent! It reached the correct conclusion.

Conclusion

One may naturally ask whether we can write functions a bit more general than what I’ve shown here at least in the regression context. For example, one may want parameters specifying a formula so that the regression model isn’t hard-coded into the test. In short, the answer is yes; MCHTest objects try to pass as many parameters to the input functions as they can.

Here is the revised example that works for basically any formula:

ts <- function(x, formula) {
  res <- lm(formula = formula, data = x)
  summary(res)$fstatistic[[1]]
}

rg <- function(x, n, formula) {
  dep_var <- all.vars(formula)[1]  # Get the name of the dependent variable
  x[[dep_var]] <- sample(x[[dep_var]], replace = TRUE, size = n)
  x
}

b.f.test.2 <- MCHTest(ts, ts, rg, seed = 123, N = 1000)

b.f.test.2(setosa, formula = Sepal.Width ~ Sepal.Length)

## 
## 	Monte Carlo Test
## 
## data:  setosa
## S = 58.994, p-value < 2.2e-16

This shows that you can have a lot of control over how MCHTest objects handle their inputs, giving you considerable flexibility.

Next post: time series and MCHT

References

R. A. Fisher, The design of experiments (1935)

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

Thanks for visiting r-craft.org
This article is originally published at https://ntguardian.wordpress.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Beyond Univariate, Single-Sample Data with MCHT

You may also like...

Categories

Beyond Univariate, Single-Sample Data with MCHT

Introduction

Permutation Test

Permutation Tests Using MCHT

Linear Regression F Test

F Test Using MCHT

Conclusion

References

You may also like...

parking riddle

More about Flexible Frequency Models

Spatial data and maps conference – FOSS4G

Categories