In this tutorial, we will use linear regression to predict professor salaries from a set of input attributes.

At a high level, we will take the following steps to perform the prediction analysis.

- Environment setup
- Loading the dataset (Data Ingestion)
- Data Cleansing and Transformation
- Basic Data Exploration and Visualization
- Predicting salary with linear fitted model

## Environment Setup

- Install R
- Install RStudio
- Install the car package

## Loading the dataset

We will be using the Salaries dataset of professors from the car package. It includes the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors, and Professors at a college in the U.S. The data were collected as part of the ongoing effort of the college’s administration to monitor salary differences between male and female faculty members.

#### Load the file

Run the following commands in RStudio.

```
##### Install the car package and load the data
install.packages("car")
library(car)
data(Salaries)
```

#### Understand data

Let’s look at sample data from the Salaries dataset by running the command below, which selects every 5th row starting from the 1st row.

```
Salaries[seq(from=1, to=60, by=5),]
```

Here is the description given by the car package for this data. It’s a data frame with 397 observations on the following 6 variables.

- rank - a factor with levels AsstProf, AssocProf, and Prof.
- discipline - a factor with levels A (“theoretical” departments) and B (“applied” departments).
- yrs.since.phd - years since PhD.
- yrs.service - years of service.
- sex - a factor with levels Female and Male.
- salary - nine-month salary, in dollars.

#### Evaluate the data

Once the data is loaded, examine it with the str() function.

```
##### View the data type of the dataset
class(Salaries)
##### output
[1] "data.frame"
##### Check the data with the str function
str(Salaries)
##### output
'data.frame':	397 obs. of  6 variables:
 $ rank         : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
 $ discipline   : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
 $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
 $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
 $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
 $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
```

Notice the different attributes, their data types, and the first few values of each attribute.

## Data Cleansing and Transformation

In R, the lm(), or “linear model,” function can be used to create a simple regression model.
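As a minimal sketch of what that looks like (the choice of yrs.service as the lone predictor here is purely illustrative), a one-variable fit is a single call:

```
# Illustrative sketch: model salary as a linear function of years of service.
library(car)   # provides the Salaries data frame
data(Salaries)

fit <- lm(salary ~ yrs.service, data = Salaries)
coef(fit)      # intercept and slope of the fitted line
```

The formula notation `y ~ x` reads as "model y as a function of x"; we will use the same notation with multiple predictors later.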

### Detecting missing values

Input sample data may contain missing values, which can affect the overall prediction. Let’s find out whether we have any.

```
##### Use the Amelia package to find missing values
install.packages("Amelia")
require(Amelia)
missmap(Salaries, main="Professor Salary Data - Missings Map", col=c("red", "green"),
legend=FALSE)
```

It appears that we do not have any missing values, so we are good to proceed.
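As a quick numeric cross-check (a base-R sketch, separate from the Amelia workflow above), we can count NA values directly:

```
# Count missing values per column and overall.
library(car)
data(Salaries)

colSums(is.na(Salaries))   # per-attribute NA counts
sum(is.na(Salaries))       # total NA count; 0 means no missing values
```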

## Basic Data Exploration and Visualization

#### Find variable relationships with Salary attribute

First, we will plot salary against rank, discipline, yrs.since.phd, yrs.service, and sex.

```
par(mfrow=c(3,2))
plot(Salaries$salary ~ Salaries$rank)
plot(Salaries$salary ~ Salaries$discipline)
plot(Salaries$salary ~ Salaries$yrs.since.phd)
plot(Salaries$salary ~ Salaries$yrs.service)
plot(Salaries$salary ~ Salaries$sex)
```

**Things to Note:**

- From the plots, a professor’s sex does not appear to be a decisive factor in salary determination, although males show slightly higher salaries than females.
- There appears to be a roughly linear relationship between salary and the numeric attributes yrs.since.phd and yrs.service.
- rank and discipline also appear to be positively associated with salary: higher ranks and discipline B correspond to higher salaries.

Let’s use the lm() function to fit the model and analyze which attributes are significant.

```
lmfit = lm(salary ~ ., data = Salaries)
summary(lmfit)
##### output
Call:
lm(formula = salary ~ ., data = Salaries)

Residuals:
   Min     1Q Median     3Q    Max
-65248 -13211  -1775  10384  99592

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    65955.2     4588.6  14.374  < 2e-16 ***
rankAssocProf  12907.6     4145.3   3.114  0.00198 **
rankProf       45066.0     4237.5  10.635  < 2e-16 ***
disciplineB    14417.6     2342.9   6.154 1.88e-09 ***
yrs.since.phd    535.1      241.0   2.220  0.02698 *
yrs.service     -489.5      211.9  -2.310  0.02143 *
sexMale         4783.5     3858.7   1.240  0.21584
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22540 on 390 degrees of freedom
Multiple R-squared: 0.4547,  Adjusted R-squared: 0.4463
F-statistic: 54.2 on 6 and 390 DF,  p-value: < 2.2e-16
```

**Things to Note:**

- Looking at the Coefficients section, we can see that rank, discipline, yrs.since.phd, and yrs.service are significant (p-value < 0.05).
- We can drop the insignificant sex attribute (p-value > 0.05).

Let’s fit the significant independent variables (rank, discipline, yrs.since.phd, and yrs.service) against the dependent variable (salary) in the linear model.

```
lmfit = lm(salary ~ rank + discipline + yrs.since.phd + yrs.service, data = Salaries)
summary(lmfit)
##### output
Call:
lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
    data = Salaries)

Residuals:
   Min     1Q Median     3Q    Max
-65244 -13498  -1455   9638  99682

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    69869.0     3332.1  20.968  < 2e-16 ***
rankAssocProf  12831.5     4147.7   3.094  0.00212 **
rankProf       45287.7     4236.7  10.689  < 2e-16 ***
disciplineB    14505.2     2343.4   6.190 1.52e-09 ***
yrs.since.phd    534.6      241.2   2.217  0.02720 *
yrs.service     -476.7      211.8  -2.250  0.02497 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22550 on 391 degrees of freedom
Multiple R-squared: 0.4525,  Adjusted R-squared: 0.4455
F-statistic: 64.64 on 5 and 391 DF,  p-value: < 2.2e-16
```

Fitting only the selected predictor variables raises the F-statistic from 54.2 to 64.64.

#### Draw Diagnostic plot

Let’s draw the diagnostic plots of lmfit.

```
par(mfrow=c(2,2))
plot(lmfit)
```

**Things to Note:**

- All four diagnostic plots indicate that the model broadly satisfies the regression assumptions.
- However, the residuals-versus-fitted and scale-location plots show that the residuals for smaller fitted values are biased toward the regression line.

#### Diagnose the multicollinearity of the regression model

To detect multicollinearity, we can calculate variance inflation factors (and, for models with factor predictors, generalized variance inflation factors) using the vif() function from the car package.

Multicollinearity occurs when a predictor is highly correlated with other predictors. If it exists in the model, the model may show a high R-squared while individual predictors appear insignificant.

Run the commands below.

```
vif(lmfit)
##### output
                  GVIF Df GVIF^(1/(2*Df))
rank          2.003562  2        1.189736
discipline    1.063139  1        1.031086
yrs.since.phd 7.518920  1        2.742065
yrs.service   5.908984  1        2.430840
sqrt(vif(lmfit)) > 2
##### output
               GVIF    Df GVIF^(1/(2*Df))
rank          FALSE FALSE           FALSE
discipline    FALSE FALSE           FALSE
yrs.since.phd  TRUE FALSE           FALSE
yrs.service    TRUE FALSE           FALSE
```

**Things to Note:**

- If multicollinearity exists, we will find predictors whose square root of the variance inflation factor is above 2. In that case, we may remove redundant predictors or use principal component analysis to transform the predictors into a smaller set of uncorrelated components.
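For illustration only (the tutorial itself keeps both year variables), here is how one might drop one of the two correlated year predictors and refit; the choice to keep yrs.service rather than yrs.since.phd is an assumption made for this sketch, not a recommendation from the analysis:

```
# Sketch: yrs.since.phd and yrs.service are strongly correlated,
# so one option is to keep only one of them and refit the model.
library(car)
data(Salaries)

cor(Salaries$yrs.since.phd, Salaries$yrs.service)  # inspect the correlation

lmfit2 <- lm(salary ~ rank + discipline + yrs.service, data = Salaries)
vif(lmfit2)  # the GVIF-based values should now fall below the threshold
```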

#### Diagnose heteroscedasticity

We can perform the Breusch-Pagan test for heteroscedasticity with the bptest() function from the lmtest package.

The simple regression model assumes that the variance of the error is constant or homogeneous across observations. Heteroscedasticity means that the variance is unequal across observations.

Run the commands below to diagnose the heteroscedasticity of the regression model using the bptest function.

```
install.packages("lmtest")
library(lmtest)
bptest(lmfit)
##### output
	studentized Breusch-Pagan test

data:  lmfit
BP = 63.172, df = 5, p-value = 2.681e-12
```

**Things to Note:**

- In this case, the p-value is 2.681e-12 (< 0.05), so we reject the null hypothesis of homoscedasticity (constant error variance); the model exhibits heteroscedasticity.

#### Correct the standard errors

The test above implies that the standard errors of the parameter estimates are unreliable. We can compute robust standard errors to correct them (this does not remove the heteroscedasticity itself) with the robcov() function from the rms package.

Run the commands below to correct the standard errors with robcov():

```
install.packages("rms")
library(rms)
olsfit = ols(salary ~ rank + discipline + yrs.since.phd + yrs.service, data = Salaries, x= TRUE, y= TRUE)
robcov(olsfit)
##### output
Linear Regression Model

ols(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
    data = Salaries, x = TRUE, y = TRUE)

                  Model Likelihood     Discrimination
                     Ratio Test            Indexes
Obs        397    LR chi2    239.17    R2       0.453
sigma 22554.1128  d.f.            5    R2 adj   0.446
d.f.       391    Pr(> chi2) 0.0000    g    22503.353

Residuals

   Min     1Q Median     3Q    Max
-65244 -13498  -1455   9638  99682

               Coef       S.E.      t     Pr(>|t|)
Intercept      69869.0110 1999.2090 34.95 <0.0001
rank=AssocProf 12831.5375 2201.3419  5.83 <0.0001
rank=Prof      45287.6890 3262.5067 13.88 <0.0001
discipline=B   14505.1514 2295.5470  6.32 <0.0001
yrs.since.phd    534.6313  311.1552  1.72 0.0865
yrs.service     -476.7179  304.4571 -1.57 0.1182
```

## Predicting salary with linear fitted model

With a fitted regression model, we can apply the model to predict unknown values. For regression models, we can express the precision of prediction with a prediction interval and a confidence interval.

We will adopt the following approach to predict professor salaries:

- Select 2 rows from the existing dataset to serve as test data.
- Use the linear regression model fitted previously.
- Run the prediction on this test data.

```
##### Select 2 rows
Salaries[seq(from=1, to=10, by=5),]
##### output
       rank discipline yrs.since.phd yrs.service  sex salary
1      Prof          B            19          18 Male 139750
6 AssocProf          B             6           6 Male  97000
newProfessors = Salaries[seq(from=1, to=10, by=5),]
##### Predict with a confidence interval
predict(lmfit, newProfessors, interval="confidence", level=0.95)
##### output
        fit      lwr      upr
1 131238.92 126982.8 135495.1
6  97553.18  91283.9 103822.5
##### Predict with a prediction interval
predict(lmfit, newProfessors, interval="prediction")
##### output
        fit       lwr      upr
1 131238.92  86692.62 175785.2
6  97553.18  52769.68 142336.7
```

Note that the confidence-interval predictions are quite close to the actual salary values for the rows in question.
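To quantify that closeness, a short sketch comparing the point predictions with the actual salaries of the two selected rows (this step is an addition for illustration, not part of the original walkthrough):

```
# Sketch: absolute prediction error, in dollars, for the two test rows.
library(car)
data(Salaries)

lmfit <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service,
            data = Salaries)
newProfessors <- Salaries[seq(from = 1, to = 10, by = 5), ]

pred <- predict(lmfit, newProfessors)
abs(pred - newProfessors$salary)   # per-row absolute errors
```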

## References

| # | Reference |
|---|---|
| 1 | wehrley’s Blog |
| 2 | Machine Learning with R |

## Version History

| Date | Description |
|---|---|
| 2016-09-01 | Initial Version |