Exercise 1 (longley Macroeconomic Data)

The built-in dataset longley contains macroeconomic data for predicting employment. We will attempt to model the Employed variable.

View(longley)
?longley

(a) What is the largest correlation between any pair of predictors in the dataset?

m = as.matrix(cor(longley[ , 1:6]))
max_cor = max(m[m != 1])  # avoid masking base::max with the result
which(m == max_cor)
## [1] 12 32
max_cor
## [1] 0.9953
m
##              GNP.deflator    GNP Unemployed Armed.Forces Population   Year
## GNP.deflator       1.0000 0.9916     0.6206       0.4647     0.9792 0.9911
## GNP                0.9916 1.0000     0.6043       0.4464     0.9911 0.9953
## Unemployed         0.6206 0.6043     1.0000      -0.1774     0.6866 0.6683
## Armed.Forces       0.4647 0.4464    -0.1774       1.0000     0.3644 0.4172
## Population         0.9792 0.9911     0.6866       0.3644     1.0000 0.9940
## Year               0.9911 0.9953     0.6683       0.4172     0.9940 1.0000

The largest correlation is between Year and GNP, with a value of 0.9953.

(b) Fit a model with Employed as the response and the remaining variables as predictors. Calculate and report the variance inflation factor (VIF) for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?

library(faraway)
m = lm(Employed ~ ., data = longley)
#summary(m)
v = vif(m)
v
## GNP.deflator          GNP   Unemployed Armed.Forces   Population         Year 
##      135.532     1788.513       33.619        3.589      399.151      758.981
max_vif = max(v)  # avoid masking faraway::vif, which is called again later
max_vif
## [1] 1789

The GNP variable has the largest variance inflation factor, 1788.51. Using a criterion of VIF > 5, the variables GNP.deflator, Unemployed, Population, and Year also suggest multicollinearity.

(c) What proportion of the observed variation in Population is explained by a linear relationship with the other predictors?

m_pop = lm(Population ~ . - Employed, data = longley)
#summary(m_pop)
r2 = summary(m_pop)$r.squared
r2
## [1] 0.9975

About 99.75% of the observed variation in Population (\(R^2 = 0.9975\)) is explained by a linear relationship with the other predictors.
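This \(R^2\) connects directly to the VIF from part (b), since \(VIF_j = 1 / (1 - R_j^2)\). A quick check, reusing the m_pop fit above:

```r
# recover Population's VIF from its R^2 against the other predictors;
# this should match the 399.151 reported by vif() in part (b)
1 / (1 - summary(m_pop)$r.squared)
```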

(d) Calculate the partial correlation coefficient for Population and Employed with the effects of the other predictors removed.

m0 = lm(Employed ~ . - Population, data = longley)
m1 = lm(Population ~ . - Employed, data = longley)
pcorr = cor(resid(m0), resid(m1))
pcorr
## [1] -0.07514

The partial correlation coefficient is -0.0751, which is fairly small. There is probably not much benefit to including Population as a predictor alongside the others currently in the model.
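As a sanity check, the same partial correlation can be recovered (a sketch, using the standard identity \(r = t / \sqrt{t^2 + df_{res}}\)) from the \(t\) statistic of Population in the full model from part (b):

```r
# partial correlation from the t statistic of Population in the full model
m_full = lm(Employed ~ ., data = longley)
t_pop = summary(m_full)$coefficients["Population", "t value"]
t_pop / sqrt(t_pop ^ 2 + df.residual(m_full))
```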

(e) Fit a new model with Employed as the response and the predictors from the model in (b) that were significant. (Use \(\alpha = 0.05\).) Calculate and report the variance inflation factor for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?

library(knitr)
m_e = lm(Employed ~ Unemployed + Armed.Forces + Year, data = longley)
#summary(m_e)
df = data.frame(vif = as.vector(vif(m_e)))
rownames(df) = c('Unemployed', 'Armed.Forces', 'Year')
kable(df)
|             |   vif|
|:------------|-----:|
|Unemployed   | 3.318|
|Armed.Forces | 2.223|
|Year         | 3.891|

With the new model, the largest VIF is 3.891 and belongs to Year. All of the VIF values are less than 5 and do not suggest collinearity.

(f) Use an \(F\)-test to compare the models in parts (b) and (e). Report the following:

dim(longley)
## [1] 16  7
m_b = lm(Employed ~ ., data = longley)
anova_res = anova(m_e, m_b)
anova_res
## Analysis of Variance Table
## 
## Model 1: Employed ~ Unemployed + Armed.Forces + Year
## Model 2: Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + 
##     Year
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     12 1.323                         
## 2      9 0.836  3     0.487 1.75   0.23

The null hypothesis is \(H_0: \beta_{GNP.deflator} = \beta_{GNP} = \beta_{Population} = 0\) and the alternative hypothesis is \(H_A:\) at least one of \(\beta_{GNP.deflator}, \beta_{GNP}, \beta_{Population} \neq 0\). The test statistic is \(F = 1.75\), which follows an \(F\) distribution with \(df_1 = 3\) and \(df_2 = 9\). The p-value is 0.23, so at \(\alpha = 0.05\) we fail to reject the null hypothesis. As a result, we would choose the smaller model from (e).

(g) Check the assumptions of the model chosen in part (f). Do any assumptions appear to be violated?

library(lmtest)
bp_test = bptest(m_e)
bp_test
## 
##  studentized Breusch-Pagan test
## 
## data:  m_e
## BP = 2.5, df = 3, p-value = 0.5
sw_test = shapiro.test(resid(m_e))
sw_test
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(m_e)
## W = 0.93, p-value = 0.2
par(mfrow = c(1,2))
plot_fitted_resid(m_e)
plot_qq(m_e)
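plot_fitted_resid() and plot_qq() are course-provided helpers that are not defined in this document; a minimal base-graphics sketch of what they are assumed to do:

```r
# hypothetical stand-ins for the course helper functions used above
plot_fitted_resid = function(model) {
  plot(fitted(model), resid(model),
       xlab = "Fitted", ylab = "Residuals", main = "Fitted vs Residuals")
  abline(h = 0, lwd = 2)
}

plot_qq = function(model) {
  qqnorm(resid(model), main = "Normal Q-Q Plot")
  qqline(resid(model), lwd = 2)
}
```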

The assumptions of constant variance and normality of the residuals do not appear to be violated at \(\alpha = 0.05\); both the Breusch-Pagan test and the Shapiro-Wilk test have p-values greater than 0.05.


Exercise 2 (Credit Data)

For this exercise, use the Credit data from the ISLR package. Use the following code to remove the ID variable which is not useful for modeling.

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
data(Credit)
Credit = subset(Credit, select = -c(ID))

Use ?Credit to learn about this dataset.

(a) Find a “good” model for balance using the available predictors. Use any methods seen in class except transformations of the response. The model should:

Store your model in a variable called mod_a. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.

str(Credit)
pairs(Credit)

loocv = function(n, model) {
  sqrt(sum((resid(model) / (1 - hatvalues(model))) ^ 2) / n)
}
n = nrow(Credit)
m0 = lm(Balance ~ 1, data = Credit)
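The loocv() helper uses the leverage shortcut, which avoids refitting the model \(n\) times:

\[ \text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_i} \right)^2 } \]

where \(e_i\) are the residuals and \(h_i\) the leverages (hat values) of the fitted model.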

mod_a = step(
  m0,
  scope = Balance ~ Income * Limit * Cards * Age * Education * Married +
    I(Limit^2) + I(Cards^2) + I(Income^2) + I(Limit^3) + I(Income^3) +
    I(Limit^4) + I(Age^2) + I(Age^3),
  direction = 'forward', trace = FALSE, k = 2
)
summary(mod_a)
## 
## Call:
## lm(formula = Balance ~ Limit + Income + I(Limit^2) + I(Limit^3) + 
##     I(Limit^4) + Cards + I(Age^3) + Married + I(Income^3) + Education + 
##     Limit:Income + Limit:Education + Cards:Education, data = Credit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -266.2  -69.8  -36.4   -0.6  508.4 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.72e+02   1.24e+02    1.39  0.16590    
## Limit           -2.68e-01   6.74e-02   -3.98  8.4e-05 ***
## Income          -1.81e+00   1.71e+00   -1.06  0.28905    
## I(Limit^2)       1.15e-04   1.88e-05    6.09  2.7e-09 ***
## I(Limit^3)      -8.36e-09   2.17e-09   -3.86  0.00013 ***
## I(Limit^4)       2.41e-13   7.90e-14    3.05  0.00244 ** 
## Cards            5.85e+01   2.15e+01    2.72  0.00673 ** 
## I(Age^3)        -1.23e-04   4.05e-05   -3.03  0.00259 ** 
## MarriedYes      -3.33e+01   1.45e+01   -2.30  0.02197 *  
## I(Income^3)      2.00e-04   4.71e-05    4.25  2.7e-05 ***
## Education        2.21e+00   7.05e+00    0.31  0.75464    
## Limit:Income    -1.74e-03   3.94e-04   -4.42  1.3e-05 ***
## Limit:Education  2.07e-03   9.85e-04    2.10  0.03622 *  
## Cards:Education -2.86e+00   1.57e+00   -1.82  0.06903 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 139 on 386 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.909 
## F-statistic:  306 on 13 and 386 DF,  p-value: <2e-16
bptest(mod_a)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_a
## BP = 26, df = 13, p-value = 0.02
loocv(n, mod_a)
## [1] 141.5
library(lmtest)

get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_num_params = function(model) {
  length(coef(model))
}

get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}
get_loocv_rmse(mod_a)
## [1] 141.5
get_adj_r2(mod_a)
## [1] 0.9085
get_bp_decision(mod_a, alpha = 0.01)
## [1] "Fail to Reject"
get_num_params(mod_a)
## [1] 14

(b) Find another “good” model for balance using the available predictors. Use any methods seen in class except transformations of the response. The model should:

Store your model in a variable called mod_b. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.

library(leaps)
mod = summary(regsubsets(Balance ~ Income * Limit * Cards * Student + I(Income^2) + I(Limit^2) + I(Cards^2) + I(Income^3) + I(Limit^3) + I(Cards^3) + log(Limit) +log(Income) + log(Cards) + I(Income^4) + I(Limit^4) + I(Cards^4), data = Credit))
mod
## Subset selection object
## Call: regsubsets.formula(Balance ~ Income * Limit * Cards * Student + 
##     I(Income^2) + I(Limit^2) + I(Cards^2) + I(Income^3) + I(Limit^3) + 
##     I(Cards^3) + log(Limit) + log(Income) + log(Cards) + I(Income^4) + 
##     I(Limit^4) + I(Cards^4), data = Credit)
## 27 Variables  (and intercept)
##                               Forced in Forced out
## Income                            FALSE      FALSE
## Limit                             FALSE      FALSE
## Cards                             FALSE      FALSE
## StudentYes                        FALSE      FALSE
## I(Income^2)                       FALSE      FALSE
## I(Limit^2)                        FALSE      FALSE
## I(Cards^2)                        FALSE      FALSE
## I(Income^3)                       FALSE      FALSE
## I(Limit^3)                        FALSE      FALSE
## I(Cards^3)                        FALSE      FALSE
## log(Limit)                        FALSE      FALSE
## log(Income)                       FALSE      FALSE
## log(Cards)                        FALSE      FALSE
## I(Income^4)                       FALSE      FALSE
## I(Limit^4)                        FALSE      FALSE
## I(Cards^4)                        FALSE      FALSE
## Income:Limit                      FALSE      FALSE
## Income:Cards                      FALSE      FALSE
## Limit:Cards                       FALSE      FALSE
## Income:StudentYes                 FALSE      FALSE
## Limit:StudentYes                  FALSE      FALSE
## Cards:StudentYes                  FALSE      FALSE
## Income:Limit:Cards                FALSE      FALSE
## Income:Limit:StudentYes           FALSE      FALSE
## Income:Cards:StudentYes           FALSE      FALSE
## Limit:Cards:StudentYes            FALSE      FALSE
## Income:Limit:Cards:StudentYes     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          Income Limit Cards StudentYes I(Income^2) I(Limit^2) I(Cards^2)
## 1  ( 1 ) " "    "*"   " "   " "        " "         " "        " "       
## 2  ( 1 ) "*"    "*"   " "   " "        " "         " "        " "       
## 3  ( 1 ) " "    " "   " "   " "        " "         "*"        " "       
## 4  ( 1 ) " "    " "   " "   " "        "*"         "*"        " "       
## 5  ( 1 ) " "    " "   " "   " "        " "         "*"        " "       
## 6  ( 1 ) " "    " "   " "   " "        " "         "*"        " "       
## 7  ( 1 ) " "    " "   "*"   " "        " "         "*"        " "       
## 8  ( 1 ) " "    " "   " "   " "        " "         "*"        " "       
##          I(Income^3) I(Limit^3) I(Cards^3) log(Limit) log(Income) log(Cards)
## 1  ( 1 ) " "         " "        " "        " "        " "         " "       
## 2  ( 1 ) " "         " "        " "        " "        " "         " "       
## 3  ( 1 ) " "         " "        " "        " "        " "         " "       
## 4  ( 1 ) " "         " "        " "        " "        " "         " "       
## 5  ( 1 ) "*"         "*"        " "        " "        " "         " "       
## 6  ( 1 ) "*"         " "        " "        " "        " "         " "       
## 7  ( 1 ) "*"         " "        " "        " "        " "         " "       
## 8  ( 1 ) "*"         "*"        " "        "*"        " "         " "       
##          I(Income^4) I(Limit^4) I(Cards^4) Income:Limit Income:Cards
## 1  ( 1 ) " "         " "        " "        " "          " "         
## 2  ( 1 ) " "         " "        " "        " "          " "         
## 3  ( 1 ) " "         " "        " "        "*"          " "         
## 4  ( 1 ) " "         " "        " "        "*"          " "         
## 5  ( 1 ) " "         " "        " "        "*"          " "         
## 6  ( 1 ) " "         "*"        " "        "*"          " "         
## 7  ( 1 ) " "         "*"        " "        "*"          " "         
## 8  ( 1 ) " "         " "        " "        "*"          " "         
##          Limit:Cards Income:StudentYes Limit:StudentYes Cards:StudentYes
## 1  ( 1 ) " "         " "               " "              " "             
## 2  ( 1 ) " "         " "               " "              " "             
## 3  ( 1 ) " "         " "               "*"              " "             
## 4  ( 1 ) " "         " "               "*"              " "             
## 5  ( 1 ) " "         " "               "*"              " "             
## 6  ( 1 ) " "         " "               "*"              " "             
## 7  ( 1 ) " "         " "               "*"              " "             
## 8  ( 1 ) "*"         " "               "*"              " "             
##          Income:Limit:Cards Income:Limit:StudentYes Income:Cards:StudentYes
## 1  ( 1 ) " "                " "                     " "                    
## 2  ( 1 ) " "                " "                     " "                    
## 3  ( 1 ) " "                " "                     " "                    
## 4  ( 1 ) " "                " "                     " "                    
## 5  ( 1 ) " "                " "                     " "                    
## 6  ( 1 ) " "                "*"                     " "                    
## 7  ( 1 ) " "                "*"                     " "                    
## 8  ( 1 ) " "                "*"                     " "                    
##          Limit:Cards:StudentYes Income:Limit:Cards:StudentYes
## 1  ( 1 ) " "                    " "                          
## 2  ( 1 ) " "                    " "                          
## 3  ( 1 ) " "                    " "                          
## 4  ( 1 ) " "                    " "                          
## 5  ( 1 ) " "                    " "                          
## 6  ( 1 ) " "                    " "                          
## 7  ( 1 ) " "                    " "                          
## 8  ( 1 ) " "                    " "
mod$adjr2
## [1] 0.7419 0.8705 0.9672 0.9751 0.9795 0.9848 0.9885 0.9896
which.max(mod$adjr2)
## [1] 8
n = nrow(Credit)
# fit the size-3 subset selected above: I(Limit^2), Income:Limit, Limit:Student
mod_b = lm(Balance ~ Limit:Student + Income:Limit + I(Limit^2), data = Credit)
summary(mod_b)
## 
## Call:
## lm(formula = Balance ~ (Limit:Student) + (Income:Limit) + I(Limit^2), 
##     data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -310.36  -53.35    4.59   49.16  253.74 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.57e+02   1.59e+01   -9.89   <2e-16 ***
## I(Limit^2)        2.93e-05   8.20e-07   35.76   <2e-16 ***
## Limit:StudentNo   5.74e-02   6.42e-03    8.95   <2e-16 ***
## Limit:StudentYes  1.46e-01   6.51e-03   22.42   <2e-16 ***
## Limit:Income     -1.62e-03   3.31e-05  -48.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76 on 395 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.973 
## F-statistic: 3.55e+03 on 4 and 395 DF,  p-value: <2e-16
bptest(mod_b)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_b
## BP = 72, df = 4, p-value = 1e-14
loocv(n, mod_b)
## [1] 77.69
get_loocv_rmse(mod_b)
## [1] 77.69
get_adj_r2(mod_b)
## [1] 0.9726
get_sw_decision(mod_b, alpha = 0.01)
## [1] "Reject"
get_num_params(mod_b)
## [1] 5

Exercise 3 (Sacramento Housing Data)

For this exercise, use the Sacramento data from the caret package. Use the following code to perform some preprocessing of the data.

library(caret)
library(ggplot2)
data(Sacramento)
sac_data = Sacramento
sac_data$limits = factor(ifelse(sac_data$city == "SACRAMENTO", "in", "out"))
sac_data = subset(sac_data, select = -c(city, zip))

Instead of using the city or zip variables that exist in the dataset, we will simply create a variable (limits) indicating whether or not a house is technically within the city limits of Sacramento. (We do this because they would both be factor variables with a large number of levels. This is a choice that is made due to laziness, not necessarily because it is justified. Think about what issues these variables might cause.)

Use ?Sacramento to learn more about this dataset.

A plot of longitude versus latitude gives us a sense of where the city limits are.

qplot(y = longitude, x = latitude, data = sac_data,
      col = limits, main = "Sacramento City Limits ")

After these modifications, we test-train split the data.

set.seed(420)
sac_trn_idx  = sample(nrow(sac_data), size = trunc(0.80 * nrow(sac_data)))
sac_trn_data = sac_data[sac_trn_idx, ]
sac_tst_data = sac_data[-sac_trn_idx, ]

The training data should be used for all model fitting. Our goal is to find a model that is useful for predicting home prices.

(a) Find a “good” model for price. Use any methods seen in class. The model should reach a LOOCV-RMSE below 77,500 in the training data. Do not use any transformations of the response variable.

str(sac_trn_data)
## 'data.frame':    745 obs. of  8 variables:
##  $ beds     : int  5 4 3 4 4 3 4 3 3 3 ...
##  $ baths    : num  4.5 2 2 2 2 1 2 2 4 2 ...
##  $ sqft     : int  3179 1713 1516 1182 1354 936 1316 1118 1197 1520 ...
##  $ type     : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ price    : int  389000 263500 97750 247480 126714 200000 89000 122000 224500 170000 ...
##  $ latitude : num  38.9 38.6 38.5 38.7 38.7 ...
##  $ longitude: num  -121 -121 -121 -121 -121 ...
##  $ limits   : Factor w/ 2 levels "in","out": 2 2 2 2 2 2 1 2 2 1 ...
head(sac_trn_data)
##     beds baths sqft        type  price latitude longitude limits
## 821    5   4.5 3179 Residential 389000    38.87    -121.3    out
## 101    4   2.0 1713 Residential 263500    38.55    -121.2    out
## 170    3   2.0 1516 Residential  97750    38.49    -121.1    out
## 462    4   2.0 1182 Residential 247480    38.69    -121.5    out
## 376    4   2.0 1354 Residential 126714    38.71    -121.4    out
## 73     3   1.0  936 Residential 200000    38.66    -121.3    out
n_trn = nrow(sac_trn_data)
m_0 = lm(price ~ 1, data = sac_trn_data)

m_2 = step(
  m_0,
  scope = price ~ beds * baths * sqft * type * latitude * longitude * limits,
  direction = 'forward', trace = FALSE, k = log(n_trn)  # BIC penalty
)

model = m_2
summary(model)
## 
## Call:
## lm(formula = price ~ sqft + longitude + beds + longitude:beds + 
##     sqft:beds, data = sac_trn_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -205249  -48770   -9895   35602  413031 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.08e+07   9.77e+06   -3.15   0.0017 ** 
## sqft            2.19e+02   1.62e+01   13.48  < 2e-16 ***
## longitude      -2.53e+05   8.04e+04   -3.15   0.0017 ** 
## beds            1.52e+07   2.97e+06    5.12  3.8e-07 ***
## longitude:beds  1.25e+05   2.44e+04    5.12  3.8e-07 ***
## sqft:beds      -1.83e+01   3.86e+00   -4.75  2.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75500 on 739 degrees of freedom
## Multiple R-squared:  0.658,  Adjusted R-squared:  0.656 
## F-statistic:  285 on 5 and 739 DF,  p-value: <2e-16
loocv(n_trn, model)
## [1] 76004

The LOOCV-RMSE on the training data was $76,004, which meets the target but is still a sizeable average error.

(b) Is a model that achieves a LOOCV-RMSE below 77,500 useful in this case? That is, is an average error of 77,500 low enough when predicting home prices? To further investigate, use the held-out test data and your model from part (a) to do two things:

Based on all of this information, argue whether or not this model is useful.

n = nrow(sac_tst_data)
y_pred = predict(m_2, newdata = sac_tst_data)
ave_percenterror = mean(abs(y_pred - sac_tst_data$price) / sac_tst_data$price) * 100
ave_percenterror
## [1] 29.14
l = lm(y_pred ~ sac_tst_data$price)
plot(y_pred ~ sac_tst_data$price, main = 'Predicted Price vs. Actual Price', xlab = 'Actual Price', ylab = 'Predicted Price', col = 'grey')
abline(l, col = 'blue')

An average percent error of 29% is too high to be useful in the real estate market. This is a considerable amount of money to be off by; a real estate agent would not stay in business long with errors of this size.


Exercise 4 (Does It Work?)

In this exercise, we will investigate how well backwards AIC and BIC actually perform. For either to be “working” correctly, they should result in a low number of both false positives and false negatives. In model selection,

Consider the true model

\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10} + \epsilon \]

where \(\epsilon \sim N(0, \sigma^2 = 4)\). The true values of the \(\beta\) parameters are given in the R code below.

beta_0  = 1
beta_1  = -1
beta_2  = 2
beta_3  = -2
beta_4  = 1
beta_5  = 1
beta_6  = 0
beta_7  = 0
beta_8  = 0
beta_9  = 0
beta_10 = 0
sigma = 2

Then, as we have specified them, some variables are significant, and some are not. We store their names in R variables for use later.

not_sig  = c("x_6", "x_7", "x_8", "x_9", "x_10")
signif = c("x_1", "x_2", "x_3", "x_4", "x_5")

We now simulate values for these x variables, which we will use throughout part (a).

set.seed(420)
n = 100
x_1  = runif(n, 0, 10)
x_2  = runif(n, 0, 10)
x_3  = runif(n, 0, 10)
x_4  = runif(n, 0, 10)
x_5  = runif(n, 0, 10)
x_6  = runif(n, 0, 10)
x_7  = runif(n, 0, 10)
x_8  = runif(n, 0, 10)
x_9  = runif(n, 0, 10)
x_10 = runif(n, 0, 10)

We then combine these into a data frame and simulate y according to the true model.

sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
      beta_5 * x_5 + rnorm(n, 0 , sigma)
)

We do a quick check to make sure everything looks correct.

head(sim_data_1)
##     x_1   x_2    x_3    x_4    x_5   x_6    x_7    x_8   x_9   x_10       y
## 1 6.055 4.088 8.7894 1.8180 0.8198 8.146 9.7305 9.6673 6.915 4.5523 -11.627
## 2 9.703 3.634 5.0768 5.5784 6.3193 6.033 3.2301 2.6707 2.214 0.4861  -0.147
## 3 1.745 3.899 0.5431 4.5068 1.0834 3.427 3.2223 5.2746 8.242 7.2310  15.145
## 4 4.758 5.315 7.6257 0.1287 9.4057 6.168 0.2472 6.5325 2.102 4.5814   2.404
## 5 7.245 7.225 9.5763 3.0398 0.4194 5.937 9.2169 4.6228 2.527 9.2349  -7.910
## 6 8.761 5.177 1.7983 0.5949 9.2944 9.392 1.0017 0.4476 5.508 5.9687   9.764

Now, we fit an incorrect model.

fit = lm(y ~ x_1 + x_2 + x_6 + x_7, data = sim_data_1)
coef(fit)
## (Intercept)         x_1         x_2         x_6         x_7 
##     -1.3758     -0.3572      2.1040      0.1344     -0.3367

Notice, we have coefficients for x_1, x_2, x_6, and x_7. This means that x_6 and x_7 are false positives, while x_3, x_4, and x_5 are false negatives.

To detect the false negatives, use:

# which are false negatives?
signif
## [1] "x_1" "x_2" "x_3" "x_4" "x_5"
names(coef(fit))
## [1] "(Intercept)" "x_1"         "x_2"         "x_6"         "x_7"
!(signif %in% names(coef(fit)))
## [1] FALSE FALSE  TRUE  TRUE  TRUE

To detect the false positives, use:

# which are false positives?
names(coef(fit)) %in% not_sig
## [1] FALSE FALSE FALSE  TRUE  TRUE

Note that in both cases, you could sum() the result to obtain the number of false negatives or positives.
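For the incorrect fit above, summing the logical vectors gives:

```r
sum(!(signif %in% names(coef(fit))))  # number of false negatives
## [1] 3
sum(names(coef(fit)) %in% not_sig)    # number of false positives
## [1] 2
```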

(a) Set a seed equal to your birthday; then, using the given data for each x variable above in sim_data_1, simulate the response variable y 300 times. Each time,

Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table.

runsim = function(n_sim) {
  df = data.frame(AIC_fp = 0, AIC_fn = 0, BIC_fp = 0, BIC_fn = 0)
  not_sig  = c("x_6", "x_7", "x_8", "x_9", "x_10")
  signif = c("x_1", "x_2", "x_3", "x_4", "x_5")
  set.seed(03031970)
  n = 100

  for (i in c(1:n_sim)) { 
    x_1  = runif(n, 0, 10)
    x_2  = runif(n, 0, 10)
    x_3  = runif(n, 0, 10)
    x_4  = runif(n, 0, 10)
    x_5  = runif(n, 0, 10)
    x_6  = runif(n, 0, 10)
    x_7  = runif(n, 0, 10)
    x_8  = runif(n, 0, 10)
    x_9  = runif(n, 0, 10)
    x_10 = runif(n, 0, 10)
    
    sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
      y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
          beta_5 * x_5 + rnorm(n, 0 , sigma)
    )
    m = lm(y ~ ., data = sim_data_1) #full additive 1st deg model
    
    m_AIC = step(m, direction = 'backward', trace = FALSE, k = 2)
    #coef(m_AIC)
    aic_fn = sum(!(signif %in% names(coef(m_AIC))))#false negatives
    aic_fp = sum(names(coef(m_AIC)) %in% not_sig)   #false positives
    
    m_BIC = step(m, direction = 'backward', trace = FALSE, k = log(n))
    #coef(m_BIC)
    bic_fn = sum(!(signif %in% names(coef(m_BIC)))) #false negatives
    bic_fp = sum(names(coef(m_BIC)) %in% not_sig)  #false positives
    
    df$AIC_fp = df$AIC_fp + aic_fp
    df$AIC_fn = df$AIC_fn + aic_fn
    
    df$BIC_fp = df$BIC_fp + bic_fp
    df$BIC_fn = df$BIC_fn + bic_fn
  }
  #df = df / n_sim
  row.names(df) = c('sum')
  df
}

df_result = runsim(300)
library(knitr)
kable(df_result)
|    | AIC_fp| AIC_fn| BIC_fp| BIC_fn|
|:---|------:|------:|------:|------:|
|sum |    267|      0|     58|      0|

The AIC procedure shows a much higher false positive count than BIC: 267 total false positives versus 58. BIC imposes a larger penalty on model size and thus produces smaller models, which is why its false positive rate is lower. The AIC penalty is \(2p\) whereas the BIC penalty is \(\log(n) \cdot p\); for \(n = 100\), the BIC penalty is roughly 2.3 times as large. The false negative count was zero for both methods; there was enough signal for both to detect the true predictors.
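A quick check of the per-parameter penalties, assuming \(n = 100\) as in the simulation:

```r
n = 100
c(AIC = 2, BIC = log(n))  # per-parameter penalty terms
log(n) / 2                # BIC penalty relative to AIC
## [1] 2.302585
```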

(b) Set a seed equal to your birthday; then, using the given data for each x variable below in sim_data_2, simulate the response variable y 300 times. Each time,

Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table. Also compare to your answers in part (a) and suggest a reason for any differences.

set.seed(94)
x_1  = runif(n, 0, 10)
x_2  = runif(n, 0, 10)
x_3  = runif(n, 0, 10)
x_4  = runif(n, 0, 10)
x_5  = runif(n, 0, 10)
x_6  = runif(n, 0, 10)
x_7  = runif(n, 0, 10)
x_8  = x_1 + rnorm(n, 0, 0.1)
x_9  = x_1 + rnorm(n, 0, 0.1)
x_10 = x_2 + rnorm(n, 0, 0.1)

sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
      beta_5 * x_5 + rnorm(n, 0 , sigma)
)
runsim2 = function(n_sim) {
  
  df = data.frame(AIC_fp = rep(0, n_sim), AIC_fn = rep(0, n_sim), BIC_fp = rep(0, n_sim), BIC_fn = rep(0, n_sim))
  not_sig  = c("x_6", "x_7", "x_8", "x_9", "x_10")
  signif = c("x_1", "x_2", "x_3", "x_4", "x_5")
  set.seed(03031970)
  n = 100

  for (i in c(1:n_sim)) { 
    
    x_1  = runif(n, 0, 10)
    x_2  = runif(n, 0, 10)
    x_3  = runif(n, 0, 10)
    x_4  = runif(n, 0, 10)
    x_5  = runif(n, 0, 10)
    x_6  = runif(n, 0, 10)
    x_7  = runif(n, 0, 10)
    x_8  = x_1 + rnorm(n, 0, 0.1)
    x_9  = x_1 + rnorm(n, 0, 0.1)
    x_10 = x_2 + rnorm(n, 0, 0.1)
    
    sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
      y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
          beta_5 * x_5 + rnorm(n, 0 , sigma)
    )

    m = lm(y ~ ., data = sim_data_2) #full additive 1st deg model
    
    m_AIC = step(m, direction = 'backward', trace = FALSE, k = 2)
    #coef(m_AIC)
    aic_fn = sum(!(signif %in% names(coef(m_AIC))))#false negatives
    aic_fp = sum(names(coef(m_AIC)) %in% not_sig)   #false positives
    
    m_BIC = step(m, direction = 'backward', trace = FALSE, k = log(n))
    #coef(m_BIC)
    bic_fn = sum(!(signif %in% names(coef(m_BIC)))) #false negatives
    bic_fp = sum(names(coef(m_BIC)) %in% not_sig)  #false positives
    
    df$AIC_fp[i] =  aic_fp
    df$AIC_fn[i] =  aic_fn
    
    df$BIC_fp[i] =  bic_fp
    df$BIC_fn[i] =  bic_fn
  }

  df
}

df_result2 = runsim2(300)
#head(df_result2)
df_summary = data.frame(AIC_fp = sum(df_result2[, 1]),
  AIC_fn = sum(df_result2[, 2]),
  BIC_fp = sum(df_result2[, 3]),
  BIC_fn = sum(df_result2[, 4]))
row.names(df_summary) = 'sum'
kable(df_summary)
|    | AIC_fp| AIC_fn| BIC_fp| BIC_fn|
|:---|------:|------:|------:|------:|
|sum |    472|    223|    297|    245|

The AIC procedure produced about 60% more false positives than BIC: 472 versus 297. The difference in false negatives was smaller, with AIC and BIC totals of 223 and 245, respectively. Both false positives and false negatives are higher here than in part (a) because of collinearity in the data: \(x_8, x_9, x_{10}\) are strongly correlated with \(x_1, x_2\), which widens the distribution of estimated \(\beta\) values across simulations. The correlated predictors can explain a considerable amount of the variation, but only because they are correlated with the true predictors, so the estimates depend on which predictors happen to be in the model.