longley Macroeconomic Data)The built-in dataset longley contains macroeconomic data for predicting employment. We will attempt to model the Employed variable.
View(longley)
?longley
(a) What is the largest correlation between any pair of predictors in the dataset?
m = as.matrix(cor(longley[ , 1:6]))
max = max(m[m != 1.0])
which(m == max)
## [1] 12 32
max
## [1] 0.9953
m
## GNP.deflator GNP Unemployed Armed.Forces Population Year
## GNP.deflator 1.0000 0.9916 0.6206 0.4647 0.9792 0.9911
## GNP 0.9916 1.0000 0.6043 0.4464 0.9911 0.9953
## Unemployed 0.6206 0.6043 1.0000 -0.1774 0.6866 0.6683
## Armed.Forces 0.4647 0.4464 -0.1774 1.0000 0.3644 0.4172
## Population 0.9792 0.9911 0.6866 0.3644 1.0000 0.9940
## Year 0.9911 0.9953 0.6683 0.4172 0.9940 1.0000
The largest correlation factor is between Year and GNP whose value is 0.9953.
(b) Fit a model with Employed as the response and the remaining variables as predictors. Calculate and report the variance inflation factor (VIF) for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?
library(faraway)
m = lm(Employed ~ ., data = longley)
#summary(m)
v = vif(m)
v
## GNP.deflator GNP Unemployed Armed.Forces Population Year
## 135.532 1788.513 33.619 3.589 399.151 758.981
vif = max(v)
vif
## [1] 1789
The GNP variable has the largest variance inflation factor of 1788.5135. The other variables which have high possibility of colinearity are GNP.deflator, Unemployed, Population, and Year using a criteria of vif > 5.
(c) What proportion of the observed variation in Population is explained by a linear relationship with the other predictors?
m_pop = lm(Population ~ . - Employed, data = longley)
#summary(m_pop)
r2 = summary(m_pop)$r.squared
r2
## [1] 0.9975
The observed variation of the predictor variable, Population, can be explained by 0.9975 relative to the other predictors.
(d) Calculate the partial correlation coefficient for Population and Employed with the effects of the other predictors removed.
m0 = lm(Employed ~ . - Population, data = longley)
m1 = lm(Population ~ . - Employed, data = longley)
pcorr = cor(resid(m0), resid(m1))
pcorr
## [1] -0.07514
The partial correlation coefficient is -0.0751 and is fairly small. There is probably not much of a benefit to include Population as a predictor with the other predictors currently in the model.
(e) Fit a new model with Employed as the response and the predictors from the model in (b) that were significant. (Use \(\alpha = 0.05\).) Calculate and report the variance inflation factor for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?
library(knitr)
m_e = lm(Employed ~ Unemployed + Armed.Forces + Year, data = longley)
#summary(m_e)
df = data.frame(vif = as.vector(vif(m_e)))
rownames(df) = c('Unemployed', 'Armed.Forces', 'Year')
kable(df)
| vif | |
|---|---|
| Unemployed | 3.318 |
| Armed.Forces | 2.223 |
| Year | 3.891 |
With the new model, the largest vif is 3.891 and belongs to Year. All of the vif values are less than 5 and do not show high colinearity.
(f) Use an \(F\)-test to compare the models in parts (b) and (e). Report the following:
dim(longley)
## [1] 16 7
m_b = lm(Employed ~ ., data = longley)
anova_res = anova(m_e, m_b)
anova_res
## Analysis of Variance Table
##
## Model 1: Employed ~ Unemployed + Armed.Forces + Year
## Model 2: Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population +
## Year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 12 1.323
## 2 9 0.836 3 0.487 1.75 0.23
The null hypothesis is \(H_0: \beta_{gnp.deflator} = \beta_{armed.forces} = \beta_{population} = 0\) and the alternate hypothesis is \(H_A: at\ least\ one: \beta_{gnp.deflator} , \beta_{armed.forces} , \beta_{population} \neq 0\). The test statistic is \(F = 1.75\). It follows a F distribution with df1 = 6, and df2 = 9. The p-value is 0.23 and if we used a criteria of \(\alpha = 0.05\) we would fail to reject the null hypothesis. As a result, we would choose the smaller model (e).
(g) Check the assumptions of the model chosen in part (f). Do any assumptions appear to be violated?
library(lmtest)
bp_test = bptest(m_e)
bp_test
##
## studentized Breusch-Pagan test
##
## data: m_e
## BP = 2.5, df = 3, p-value = 0.5
sw_test = shapiro.test(resid(m_e))
sw_test
##
## Shapiro-Wilk normality test
##
## data: resid(m_e)
## W = 0.93, p-value = 0.2
par(mfrow = c(1,2))
plot_fitted_resid(m_e)
plot_qq(m_e)
The assumption of constant variance and normality of the residual have not been violated with criteria of \(\alpha = 0.05\). Both Breusch Pagan test and Shapiro Wilkes test both have p-values greater than 0.05.
Credit Data)For this exercise, use the Credit data from the ISLR package. Use the following code to remove the ID variable which is not useful for modeling.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
data(Credit)
Credit = subset(Credit, select = -c(ID))
Use ?Credit to learn about this dataset.
(a) Find a “good” model for balance using the available predictors. Use any methods seen in class except transformations of the response. The model should:
1400.90Store your model in a variable called mod_a. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.
str(Credit)
pairs(Credit)
loocv = function(n, model){
sqrt (sum((resid(model)/(1 - hatvalues(model)))^2) * (1/n))
}
n=nrow(Credit)
m0 = lm(Balance ~ 1, data = Credit)
mod_a = step(m0, scope = Balance ~ Income * Limit * Cards * Age * Education * Married + I(Limit^2) + I(Cards^2) + I(Income^2) + I(Limit^3) + I(Income^3) + I(Limit^4) + I(Age^2) + I(Age^3) , direction = 'forward', trace = FALSE, k=2)
summary(mod_a)
##
## Call:
## lm(formula = Balance ~ Limit + Income + I(Limit^2) + I(Limit^3) +
## I(Limit^4) + Cards + I(Age^3) + Married + I(Income^3) + Education +
## Limit:Income + Limit:Education + Cards:Education, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -266.2 -69.8 -36.4 -0.6 508.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.72e+02 1.24e+02 1.39 0.16590
## Limit -2.68e-01 6.74e-02 -3.98 8.4e-05 ***
## Income -1.81e+00 1.71e+00 -1.06 0.28905
## I(Limit^2) 1.15e-04 1.88e-05 6.09 2.7e-09 ***
## I(Limit^3) -8.36e-09 2.17e-09 -3.86 0.00013 ***
## I(Limit^4) 2.41e-13 7.90e-14 3.05 0.00244 **
## Cards 5.85e+01 2.15e+01 2.72 0.00673 **
## I(Age^3) -1.23e-04 4.05e-05 -3.03 0.00259 **
## MarriedYes -3.33e+01 1.45e+01 -2.30 0.02197 *
## I(Income^3) 2.00e-04 4.71e-05 4.25 2.7e-05 ***
## Education 2.21e+00 7.05e+00 0.31 0.75464
## Limit:Income -1.74e-03 3.94e-04 -4.42 1.3e-05 ***
## Limit:Education 2.07e-03 9.85e-04 2.10 0.03622 *
## Cards:Education -2.86e+00 1.57e+00 -1.82 0.06903 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 139 on 386 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.909
## F-statistic: 306 on 13 and 386 DF, p-value: <2e-16
bptest(mod_a)
##
## studentized Breusch-Pagan test
##
## data: mod_a
## BP = 26, df = 13, p-value = 0.02
loocv(n, mod_a)
## [1] 141.5
library(lmtest)
get_bp_decision = function(model, alpha) {
decide = unname(bptest(model)$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision = function(model, alpha) {
decide = unname(shapiro.test(resid(model))$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_num_params = function(model) {
length(coef(model))
}
get_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_adj_r2 = function(model) {
summary(model)$adj.r.squared
}
get_loocv_rmse(mod_a)
## [1] 141.5
get_adj_r2(mod_a)
## [1] 0.9085
get_bp_decision(mod_a, alpha = 0.01)
## [1] "Fail to Reject"
get_num_params(mod_a)
## [1] 14
(b) Find another “good” model for balance using the available predictors. Use any methods seen in class except transformations of the response. The model should:
1300.85Store your model in a variable called mod_b. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.
library(leaps)
mod = summary(regsubsets(Balance ~ Income * Limit * Cards * Student + I(Income^2) + I(Limit^2) + I(Cards^2) + I(Income^3) + I(Limit^3) + I(Cards^3) + log(Limit) +log(Income) + log(Cards) + I(Income^4) + I(Limit^4) + I(Cards^4), data = Credit))
mod
## Subset selection object
## Call: regsubsets.formula(Balance ~ Income * Limit * Cards * Student +
## I(Income^2) + I(Limit^2) + I(Cards^2) + I(Income^3) + I(Limit^3) +
## I(Cards^3) + log(Limit) + log(Income) + log(Cards) + I(Income^4) +
## I(Limit^4) + I(Cards^4), data = Credit)
## 27 Variables (and intercept)
## Forced in Forced out
## Income FALSE FALSE
## Limit FALSE FALSE
## Cards FALSE FALSE
## StudentYes FALSE FALSE
## I(Income^2) FALSE FALSE
## I(Limit^2) FALSE FALSE
## I(Cards^2) FALSE FALSE
## I(Income^3) FALSE FALSE
## I(Limit^3) FALSE FALSE
## I(Cards^3) FALSE FALSE
## log(Limit) FALSE FALSE
## log(Income) FALSE FALSE
## log(Cards) FALSE FALSE
## I(Income^4) FALSE FALSE
## I(Limit^4) FALSE FALSE
## I(Cards^4) FALSE FALSE
## Income:Limit FALSE FALSE
## Income:Cards FALSE FALSE
## Limit:Cards FALSE FALSE
## Income:StudentYes FALSE FALSE
## Limit:StudentYes FALSE FALSE
## Cards:StudentYes FALSE FALSE
## Income:Limit:Cards FALSE FALSE
## Income:Limit:StudentYes FALSE FALSE
## Income:Cards:StudentYes FALSE FALSE
## Limit:Cards:StudentYes FALSE FALSE
## Income:Limit:Cards:StudentYes FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## Income Limit Cards StudentYes I(Income^2) I(Limit^2) I(Cards^2)
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " "*" " "
## 4 ( 1 ) " " " " " " " " "*" "*" " "
## 5 ( 1 ) " " " " " " " " " " "*" " "
## 6 ( 1 ) " " " " " " " " " " "*" " "
## 7 ( 1 ) " " " " "*" " " " " "*" " "
## 8 ( 1 ) " " " " " " " " " " "*" " "
## I(Income^3) I(Limit^3) I(Cards^3) log(Limit) log(Income) log(Cards)
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " "
## 8 ( 1 ) "*" "*" " " "*" " " " "
## I(Income^4) I(Limit^4) I(Cards^4) Income:Limit Income:Cards
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " "
## 4 ( 1 ) " " " " " " "*" " "
## 5 ( 1 ) " " " " " " "*" " "
## 6 ( 1 ) " " "*" " " "*" " "
## 7 ( 1 ) " " "*" " " "*" " "
## 8 ( 1 ) " " " " " " "*" " "
## Limit:Cards Income:StudentYes Limit:StudentYes Cards:StudentYes
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " "*" " "
## 4 ( 1 ) " " " " "*" " "
## 5 ( 1 ) " " " " "*" " "
## 6 ( 1 ) " " " " "*" " "
## 7 ( 1 ) " " " " "*" " "
## 8 ( 1 ) "*" " " "*" " "
## Income:Limit:Cards Income:Limit:StudentYes Income:Cards:StudentYes
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" " "
## 8 ( 1 ) " " "*" " "
## Limit:Cards:StudentYes Income:Limit:Cards:StudentYes
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
mod$adjr2
## [1] 0.7419 0.8705 0.9672 0.9751 0.9795 0.9848 0.9885 0.9896
which.max(mod$adjr2)
## [1] 8
n=nrow(Credit)
mod_b = lm(Balance ~ (Limit:Student) + (Income:Limit) + I(Limit^2) , data = Credit )
summary(mod_b)
##
## Call:
## lm(formula = Balance ~ (Limit:Student) + (Income:Limit) + I(Limit^2),
## data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -310.36 -53.35 4.59 49.16 253.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.57e+02 1.59e+01 -9.89 <2e-16 ***
## I(Limit^2) 2.93e-05 8.20e-07 35.76 <2e-16 ***
## Limit:StudentNo 5.74e-02 6.42e-03 8.95 <2e-16 ***
## Limit:StudentYes 1.46e-01 6.51e-03 22.42 <2e-16 ***
## Limit:Income -1.62e-03 3.31e-05 -48.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76 on 395 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.973
## F-statistic: 3.55e+03 on 4 and 395 DF, p-value: <2e-16
bptest(mod_b)
##
## studentized Breusch-Pagan test
##
## data: mod_b
## BP = 72, df = 4, p-value = 1e-14
loocv(n, mod_b)
## [1] 77.69
library(lmtest)
get_bp_decision = function(model, alpha) {
decide = unname(bptest(model)$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision = function(model, alpha) {
decide = unname(shapiro.test(resid(model))$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_num_params = function(model) {
length(coef(model))
}
get_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_adj_r2 = function(model) {
summary(model)$adj.r.squared
}
get_loocv_rmse(mod_b)
## [1] 77.69
get_adj_r2(mod_b)
## [1] 0.9726
get_sw_decision(mod_b, alpha = 0.01)
## [1] "Reject"
get_num_params(mod_b)
## [1] 5
Sacramento Housing Data)For this exercise, use the Sacramento data from the caret package. Use the following code to perform some preprocessing of the data.
library(caret)
library(ggplot2)
data(Sacramento)
sac_data = Sacramento
sac_data$limits = factor(ifelse(sac_data$city == "SACRAMENTO", "in", "out"))
sac_data = subset(sac_data, select = -c(city, zip))
Instead of using the city or zip variables that exist in the dataset, we will simply create a variable (limits) indicating whether or not a house is technically within the city limits of Sacramento. (We do this because they would both be factor variables with a large number of levels. This is a choice that is made due to laziness, not necessarily because it is justified. Think about what issues these variables might cause.)
Use ?Sacramento to learn more about this dataset.
A plot of longitude versus latitude gives us a sense of where the city limits are.
qplot(y = longitude, x = latitude, data = sac_data,
col = limits, main = "Sacramento City Limits ")
After these modifications, we test-train split the data.
set.seed(420)
sac_trn_idx = sample(nrow(sac_data), size = trunc(0.80 * nrow(sac_data)))
sac_trn_data = sac_data[sac_trn_idx, ]
sac_tst_data = sac_data[-sac_trn_idx, ]
The training data should be used for all model fitting. Our goal is to find a model that is useful for predicting home prices.
(a) Find a “good” model for price. Use any methods seen in class. The model should reach a LOOCV-RMSE below 77,500 in the training data. Do not use any transformations of the response variable.
str(sac_trn_data)
## 'data.frame': 745 obs. of 8 variables:
## $ beds : int 5 4 3 4 4 3 4 3 3 3 ...
## $ baths : num 4.5 2 2 2 2 1 2 2 4 2 ...
## $ sqft : int 3179 1713 1516 1182 1354 936 1316 1118 1197 1520 ...
## $ type : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ price : int 389000 263500 97750 247480 126714 200000 89000 122000 224500 170000 ...
## $ latitude : num 38.9 38.6 38.5 38.7 38.7 ...
## $ longitude: num -121 -121 -121 -121 -121 ...
## $ limits : Factor w/ 2 levels "in","out": 2 2 2 2 2 2 1 2 2 1 ...
head(sac_trn_data)
## beds baths sqft type price latitude longitude limits
## 821 5 4.5 3179 Residential 389000 38.87 -121.3 out
## 101 4 2.0 1713 Residential 263500 38.55 -121.2 out
## 170 3 2.0 1516 Residential 97750 38.49 -121.1 out
## 462 4 2.0 1182 Residential 247480 38.69 -121.5 out
## 376 4 2.0 1354 Residential 126714 38.71 -121.4 out
## 73 3 1.0 936 Residential 200000 38.66 -121.3 out
n_trn = nrow(sac_trn_data)
m_0 = lm(price ~ 1, data = sac_trn_data)
m_2 = step(m_0, scope = price ~ beds * baths * sqft * type * latitude * longitude * limits, direction = 'forward', trace = FALSE, k = log(n_trn))
model = m_2
summary(model)
##
## Call:
## lm(formula = price ~ sqft + longitude + beds + longitude:beds +
## sqft:beds, data = sac_trn_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -205249 -48770 -9895 35602 413031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.08e+07 9.77e+06 -3.15 0.0017 **
## sqft 2.19e+02 1.62e+01 13.48 < 2e-16 ***
## longitude -2.53e+05 8.04e+04 -3.15 0.0017 **
## beds 1.52e+07 2.97e+06 5.12 3.8e-07 ***
## longitude:beds 1.25e+05 2.44e+04 5.12 3.8e-07 ***
## sqft:beds -1.83e+01 3.86e+00 -4.75 2.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75500 on 739 degrees of freedom
## Multiple R-squared: 0.658, Adjusted R-squared: 0.656
## F-statistic: 285 on 5 and 739 DF, p-value: <2e-16
loocv(n_trn, model)
## [1] 76004
The LOOCV-RMSE training value for my model was $76004 which is a sizeable amount to be in error.
(b) Is a model that achieves a LOOCV-RMSE below 77,500 useful in this case? That is, is an average error of 77,500 low enough when predicting home prices? To further investigate, use the held-out test data and your model from part (a) to do two things:
Based on all of this information, argue whether or not this model is useful.
n = nrow(sac_tst_data)
y_pred = predict(m_2, newdata = sac_tst_data)
ave_percenterror = sum((abs(y_pred - sac_tst_data$price) / sac_tst_data$price) * 100) * 1/n
ave_percenterror
## [1] 29.14
l = lm(y_pred ~ sac_tst_data$price)
plot(y_pred ~ sac_tst_data$price, main = 'Predicted Price vs. Actual Price', xlab = 'Actual Price', ylab = 'Predicted Price', col = 'grey')
abline(l, col = 'blue')
The average percentage of error of 29% is too high to be useful in the real estate market. This is a considerable amount of money to be in error. A real estate agent wouldn’t be in business very long if they were off by this amount.
In this exercise, we will investigate how well backwards AIC and BIC actually perform. For either to be “working” correctly, they should result in a low number of both false positives and false negatives. In model selection,
Consider the true model
\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10} + \epsilon \]
where \(\epsilon \sim N(0, \sigma^2 = 4)\). The true values of the \(\beta\) parameters are given in the R code below.
beta_0 = 1
beta_1 = -1
beta_2 = 2
beta_3 = -2
beta_4 = 1
beta_5 = 1
beta_6 = 0
beta_7 = 0
beta_8 = 0
beta_9 = 0
beta_10 = 0
sigma = 2
Then, as we have specified them, some variables are significant, and some are not. We store their names in R variables for use later.
not_sig = c("x_6", "x_7", "x_8", "x_9", "x_10")
signif = c("x_1", "x_2", "x_3", "x_4", "x_5")
We now simulate values for these x variables, which we will use throughout part (a).
set.seed(420)
n = 100
x_1 = runif(n, 0, 10)
x_2 = runif(n, 0, 10)
x_3 = runif(n, 0, 10)
x_4 = runif(n, 0, 10)
x_5 = runif(n, 0, 10)
x_6 = runif(n, 0, 10)
x_7 = runif(n, 0, 10)
x_8 = runif(n, 0, 10)
x_9 = runif(n, 0, 10)
x_10 = runif(n, 0, 10)
We then combine these into a data frame and simulate y according to the true model.
sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 +
beta_5 * x_5 + rnorm(n, 0 , sigma)
)
We do a quick check to make sure everything looks correct.
head(sim_data_1)
## x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 y
## 1 6.055 4.088 8.7894 1.8180 0.8198 8.146 9.7305 9.6673 6.915 4.5523 -11.627
## 2 9.703 3.634 5.0768 5.5784 6.3193 6.033 3.2301 2.6707 2.214 0.4861 -0.147
## 3 1.745 3.899 0.5431 4.5068 1.0834 3.427 3.2223 5.2746 8.242 7.2310 15.145
## 4 4.758 5.315 7.6257 0.1287 9.4057 6.168 0.2472 6.5325 2.102 4.5814 2.404
## 5 7.245 7.225 9.5763 3.0398 0.4194 5.937 9.2169 4.6228 2.527 9.2349 -7.910
## 6 8.761 5.177 1.7983 0.5949 9.2944 9.392 1.0017 0.4476 5.508 5.9687 9.764
Now, we fit an incorrect model.
fit = lm(y ~ x_1 + x_2 + x_6 + x_7, data = sim_data_1)
coef(fit)
## (Intercept) x_1 x_2 x_6 x_7
## -1.3758 -0.3572 2.1040 0.1344 -0.3367
Notice, we have coefficients for x_1, x_2, x_6, and x_7. This means that x_6 and x_7 are false positives, while x_3, x_4, and x_5 are false negatives.
To detect the false negatives, use:
# which are false negatives?
signif
## [1] "x_1" "x_2" "x_3" "x_4" "x_5"
names(coef(fit))
## [1] "(Intercept)" "x_1" "x_2" "x_6" "x_7"
!(signif %in% names(coef(fit)))
## [1] FALSE FALSE TRUE TRUE TRUE
To detect the false positives, use:
# which are false positives?
names(coef(fit)) %in% not_sig
## [1] FALSE FALSE FALSE TRUE TRUE
Note that in both cases, you could sum() the result to obtain the number of false negatives or positives.
(a) Set a seed equal to your birthday; then, using the given data for each x variable above in sim_data_1, simulate the response variable y 300 times. Each time,
x variables.Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table.
runsim = function(n_sim) {
df = data.frame(AIC_fp = 0, AIC_fn = 0, BIC_fp = 0, BIC_fn = 0)
not_sig = c("x_6", "x_7", "x_8", "x_9", "x_10")
signif = c("x_1", "x_2", "x_3", "x_4", "x_5")
set.seed(03031970)
n = 100
for (i in c(1:n_sim)) {
x_1 = runif(n, 0, 10)
x_2 = runif(n, 0, 10)
x_3 = runif(n, 0, 10)
x_4 = runif(n, 0, 10)
x_5 = runif(n, 0, 10)
x_6 = runif(n, 0, 10)
x_7 = runif(n, 0, 10)
x_8 = runif(n, 0, 10)
x_9 = runif(n, 0, 10)
x_10 = runif(n, 0, 10)
sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 +
beta_5 * x_5 + rnorm(n, 0 , sigma)
)
m = lm(y ~ ., data = sim_data_1) #full additive 1st deg model
m_AIC = step(m, direction = 'backward', trace = FALSE, k = 2)
#coef(m_AIC)
aic_fn = sum(!(signif %in% names(coef(m_AIC))))#false negatives
aic_fp = sum(names(coef(m_AIC)) %in% not_sig) #false positives
m_BIC = step(m, direction = 'backward', trace = FALSE, k = log(n))
#coef(m_BIC)
bic_fn = sum(!(signif %in% names(coef(m_BIC)))) #false negatives
bic_fp = sum(names(coef(m_BIC)) %in% not_sig) #false positives
df$AIC_fp = df$AIC_fp + aic_fp
df$AIC_fn = df$AIC_fn + aic_fn
df$BIC_fp = df$BIC_fp + bic_fp
df$BIC_fn = df$BIC_fn + bic_fn
}
#df = df / n_sim
row.names(df) = c('sum')
df
}
df_result = runsim(300)
library(knitr)
kable(df_result)
| AIC_fp | AIC_fn | BIC_fp | BIC_fn | |
|---|---|---|---|---|
| sum | 267 | 0 | 58 | 0 |
The AIC model results show a higher false positive rate compared to the BIC model. The total false positives for the AIC model was 267 and the total for the BIC model was 58. The BIC has a higher penalty for larger models and thus produces smaller models. This is the reason why it has a smaller false positive rate compared to the AIC model. The AIC penalty is governed by \(2*p\) whereas the BIC is \(log(n)*p\). For n=100 the penalty is about twice as much in the BIC model. The false negative rate was zero in both models. There was enough signal for both models to detect the true predictors.
(b) Set a seed equal to your birthday; then, using the given data for each x variable below in sim_data_2, simulate the response variable y 300 times. Each time,
x variables.Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table. Also compare to your answers in part (a) and suggest a reason for any differences.
set.seed(94)
x_1 = runif(n, 0, 10)
x_2 = runif(n, 0, 10)
x_3 = runif(n, 0, 10)
x_4 = runif(n, 0, 10)
x_5 = runif(n, 0, 10)
x_6 = runif(n, 0, 10)
x_7 = runif(n, 0, 10)
x_8 = x_1 + rnorm(n, 0, 0.1)
x_9 = x_1 + rnorm(n, 0, 0.1)
x_10 = x_2 + rnorm(n, 0, 0.1)
sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 +
beta_5 * x_5 + rnorm(n, 0 , sigma)
)
runsim2 = function(n_sim) {
df = data.frame(AIC_fp = rep(0, n_sim), AIC_fn = rep(0, n_sim), BIC_fp = rep(0, n_sim), BIC_fn = rep(0, n_sim))
not_sig = c("x_6", "x_7", "x_8", "x_9", "x_10")
signif = c("x_1", "x_2", "x_3", "x_4", "x_5")
set.seed(03031970)
n = 100
for (i in c(1:n_sim)) {
x_1 = runif(n, 0, 10)
x_2 = runif(n, 0, 10)
x_3 = runif(n, 0, 10)
x_4 = runif(n, 0, 10)
x_5 = runif(n, 0, 10)
x_6 = runif(n, 0, 10)
x_7 = runif(n, 0, 10)
x_8 = x_1 + rnorm(n, 0, 0.1)
x_9 = x_1 + rnorm(n, 0, 0.1)
x_10 = x_2 + rnorm(n, 0, 0.1)
sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 +
beta_5 * x_5 + rnorm(n, 0 , sigma)
)
m = lm(y ~ ., data = sim_data_2) #full additive 1st deg model
m_AIC = step(m, direction = 'backward', trace = FALSE, k = 2)
#coef(m_AIC)
aic_fn = sum(!(signif %in% names(coef(m_AIC))))#false negatives
aic_fp = sum(names(coef(m_AIC)) %in% not_sig) #false positives
m_BIC = step(m, direction = 'backward', trace = FALSE, k = log(n))
#coef(m_BIC)
bic_fn = sum(!(signif %in% names(coef(m_BIC)))) #false negatives
bic_fp = sum(names(coef(m_BIC)) %in% not_sig) #false positives
df$AIC_fp[i] = aic_fp
df$AIC_fn[i] = aic_fn
df$BIC_fp[i] = bic_fp
df$BIC_fn[i] = bic_fn
}
df
}
df_result2 = runsim2(300)
#head(df_result2)
df_summary = data.frame(AIC_fp = sum(df_result2[, 1]),
AIC_fn = sum(df_result2[, 2]),
BIC_fp = sum(df_result2[, 3]),
BIC_fn = sum(df_result2[, 4]))
row.names(df_summary) = 'sum'
kable(df_summary)
| AIC_fp | AIC_fn | BIC_fp | BIC_fn | |
|---|---|---|---|---|
| sum | 472 | 223 | 297 | 245 |
The AIC false positive rate was roughly 40% greater than the BIC model, the total false positives were 472 versus 297, respectively. The differences in false negatives were not as great, the AIC and BIC models had total false negatives of 297 and 245, respectively. The reason why both false positives and false negatives are higher in this case (b) is because of colinearity in the data set. Variables \(x_8,\ x_9,\ x_{10}\) are correlated with \(x_1,\ x_2\) which cause a wider distribution of \(\beta\) values when performing the simulations. The correlated predictors can have explain a considerable amount of the variation but only because they are correlated to the true predictors. This will influence the beta values depending which predictors are in the model.