In this tutorial, we will use linear regression to predict salaries based on given input attributes.
At a high level, we will take the following steps to perform the prediction analysis.
- Environment setup
- Loading the dataset (Data Ingestion)
- Data Cleansing and Transformation
- Basic Data Exploration and Visualization
- Predicting salary with linear fitted model
Environment setup
- Install R
- Install RStudio
- Install the car package
Loading the dataset
We will be using the Salaries dataset of professors from the car package. The Salaries data includes the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the ongoing effort of the college's administration to monitor salary differences between male and female faculty members.
Load the file
Run the following commands in RStudio.
##### Install the car package and load the data
install.packages("car")
library(car)
data(Salaries)
Let's look at sample data from the Salaries dataset by running the command below, which selects every 5th row starting from the 1st row.
Salaries[seq(from=1, to=60, by=5),]
Here is the description given by the car package for this data. It's a data frame with 397 observations on the following 6 variables.
- rank - a factor with levels AssocProf AsstProf Prof
- discipline - a factor with levels A (“theoretical” departments) or B (“applied” departments).
- yrs.since.phd - years since PhD.
- yrs.service - years of service.
- sex - a factor with levels Female Male
- salary - nine-month salary, in dollars.
Evaluate the data
Once the data is loaded, examine it using the str function.
##### View data type of training dataset
class(Salaries)
##### Check data with str function
str(Salaries)
'data.frame':	397 obs. of  6 variables:
 $ rank         : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
 $ discipline   : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
 $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
 $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
 $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
 $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
Notice the different attributes above, their data types, and the first few values of each attribute.
Data Cleansing and Transformation
In R, the lm(), or “linear model,” function can be used to create a simple regression model.
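As a minimal sketch of the lm() function (this one-predictor example is an illustration, not part of the original walkthrough), we can regress salary on a single attribute:

```r
# Minimal sketch: fit a simple one-predictor linear model on the
# car package's Salaries data (assumes the car package is installed)
library(car)
data(Salaries)

fit <- lm(salary ~ yrs.service, data = Salaries)
coef(fit)  # intercept and slope of the fitted line
```

The formula syntax `salary ~ yrs.service` reads "model salary as a function of yrs.service"; later in this tutorial we extend the formula to several predictors.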
Detecting missing values
As you know, input sample data may have missing values, which can impact the overall prediction. Let's find out if we have any missing values.
##### Use the Amelia package to find missing values
install.packages("Amelia")
require(Amelia)
missmap(Salaries, main="Professor Salary Data - Missings Map",
        col=c("red", "green"), legend=FALSE)
It appears that we do not have any missing values, so we are good to proceed.
Basic Data Exploration and Visualization
Find variable relationships with Salary attribute
First, we will plot salary against rank, discipline, yrs.since.phd, yrs.service and sex.
par(mfrow=c(3,2))
plot(Salaries$salary ~ Salaries$rank)
plot(Salaries$salary ~ Salaries$discipline)
plot(Salaries$salary ~ Salaries$yrs.since.phd)
plot(Salaries$salary ~ Salaries$yrs.service)
plot(Salaries$salary ~ Salaries$sex)
Things to Note:
- From the plots, a professor's sex does not appear to be a major factor in salary determination, although males show a slightly higher salary than females.
- There appears to be a somewhat linear relationship between salary and the attributes yrs.since.phd and yrs.service.
- rank and discipline also appear to have a positive relationship with salary: higher ranks and the applied discipline (B) are associated with higher salaries.
Let's use the lm function to fit the model and analyze which attributes are significant.
lmfit = lm(salary ~ ., data = Salaries)
summary(lmfit)

##### output
Call:
lm(formula = salary ~ ., data = Salaries)

Residuals:
   Min     1Q Median     3Q    Max
-65248 -13211  -1775  10384  99592

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    65955.2     4588.6  14.374  < 2e-16 ***
rankAssocProf  12907.6     4145.3   3.114  0.00198 **
rankProf       45066.0     4237.5  10.635  < 2e-16 ***
disciplineB    14417.6     2342.9   6.154 1.88e-09 ***
yrs.since.phd    535.1      241.0   2.220  0.02698 *
yrs.service     -489.5      211.9  -2.310  0.02143 *
sexMale         4783.5     3858.7   1.240  0.21584
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22540 on 390 degrees of freedom
Multiple R-squared:  0.4547,	Adjusted R-squared:  0.4463
F-statistic:  54.2 on 6 and 390 DF,  p-value: < 2.2e-16
Things to Note:
- If you look at the Coefficients section, we can see that rank, discipline, yrs.since.phd and yrs.service show significance (p-value < 0.05).
- We can drop the insignificant sex attribute (which has a p-value greater than 0.05).
Let's fit the significant independent variables (rank, discipline, yrs.since.phd and yrs.service) against the dependent variable (salary) in the linear model.
lmfit = lm(salary ~ rank + discipline + yrs.since.phd + yrs.service, data = Salaries)
summary(lmfit)

##### output
Call:
lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
    data = Salaries)

Residuals:
   Min     1Q Median     3Q    Max
-65244 -13498  -1455   9638  99682

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    69869.0     3332.1  20.968  < 2e-16 ***
rankAssocProf  12831.5     4147.7   3.094  0.00212 **
rankProf       45287.7     4236.7  10.689  < 2e-16 ***
disciplineB    14505.2     2343.4   6.190 1.52e-09 ***
yrs.since.phd    534.6      241.2   2.217  0.02720 *
yrs.service     -476.7      211.8  -2.250  0.02497 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22550 on 391 degrees of freedom
Multiple R-squared:  0.4525,	Adjusted R-squared:  0.4455
F-statistic: 64.64 on 5 and 391 DF,  p-value: < 2.2e-16
Fitting only the selected predictor variables raises the F-statistic from 54.2 to 64.64.
Draw Diagnostic plot
Let's draw the diagnostic plots of lmfit.
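The code for this step is not shown above; a minimal sketch, assuming lmfit is the reduced model fitted earlier, is:

```r
# Refit the reduced model from above (assumes the car package is installed)
library(car)
data(Salaries)
lmfit <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service,
            data = Salaries)

# Draw the four standard diagnostic plots in a 2x2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(2, 2))
plot(lmfit)
```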
Things to Note:
- Within the diagnostic plots, all four plots indicate that the regression model follows the regression assumptions.
- However, the Residuals vs Fitted and Scale-Location plots show that residuals of smaller fitted values are biased toward the regression model.
Diagnose the multicollinearity of the regression model
To detect multicollinearity, we can calculate the variance inflation and generalized variance inflation factors for linear and generalized linear models with the vif function.
Multicollinearity occurs when a predictor is highly correlated with others. If multicollinearity exists in the model, the overall model might show a high R-squared value while individual variables appear insignificant.
Run the commands below.
vif(lmfit)
##### output
                  GVIF Df GVIF^(1/(2*Df))
rank          2.003562  2        1.189736
discipline    1.063139  1        1.031086
yrs.since.phd 7.518920  1        2.742065
yrs.service   5.908984  1        2.430840

sqrt(vif(lmfit)) > 2
##### output
               GVIF    Df GVIF^(1/(2*Df))
rank          FALSE FALSE           FALSE
discipline    FALSE FALSE           FALSE
yrs.since.phd  TRUE FALSE           FALSE
yrs.service    TRUE FALSE           FALSE
Things to Note:
- If multicollinearity exists, we will find predictors whose square root of the variance inflation factor is above 2 (here, yrs.since.phd and yrs.service). We may then remove redundant predictors or use a principal component analysis to transform the predictors into a smaller set of uncorrelated components.
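For example, since yrs.since.phd and yrs.service are highly correlated here, one option is to drop one of them and refit. This sketch is an illustration of that remedy, not a step the rest of the tutorial depends on:

```r
# Refit the reduced model (assumes the car package is installed)
library(car)
data(Salaries)
lmfit <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service,
            data = Salaries)

# Drop one of the two correlated predictors and re-check the
# generalized variance inflation factors
lmfit2 <- update(lmfit, . ~ . - yrs.since.phd)
vif(lmfit2)
```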
Diagnose the heteroscedasticity
We can perform the Breusch-Pagan test for heteroscedasticity with the bptest function available with the lmtest package.
The simple regression model assumes that the variance of the error is constant or homogeneous across observations. Heteroscedasticity means that the variance is unequal across observations.
Run the commands below to diagnose heteroscedasticity of the regression model using the bptest function.
install.packages("lmtest")
library(lmtest)
bptest(lmfit)

	studentized Breusch-Pagan test

data:  lmfit
BP = 63.172, df = 5, p-value = 2.681e-12
Things to Note:
- In this case, the p-value is 2.681e-12 (< 0.05), so we reject the null hypothesis of homoscedasticity; that is, heteroscedasticity is present.
Correct the standard errors
The p-value above implies that the standard errors of the parameter estimates are biased. However, we can use robust standard errors to correct the standard errors (this does not remove the heteroscedasticity) and increase the significance of truly significant parameters, with the robcov function available in the rms package.
Run the commands below to correct the standard errors with robcov:
install.packages("rms")
library(rms)
olsfit = ols(salary ~ rank + discipline + yrs.since.phd + yrs.service,
             data = Salaries, x = TRUE, y = TRUE)
robcov(olsfit)

##### output
Linear Regression Model

ols(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
    data = Salaries, x = TRUE, y = TRUE)

                  Model Likelihood     Discrimination
                     Ratio Test           Indexes
Obs        397    LR chi2    239.17    R2       0.453
sigma 22554.1128  d.f.            5    R2 adj   0.446
d.f.       391    Pr(> chi2) 0.0000    g    22503.353

Residuals
   Min     1Q Median     3Q    Max
-65244 -13498  -1455   9638  99682

               Coef       S.E.      t     Pr(>|t|)
Intercept      69869.0110 1999.2090 34.95 <0.0001
rank=AssocProf 12831.5375 2201.3419  5.83 <0.0001
rank=Prof      45287.6890 3262.5067 13.88 <0.0001
discipline=B   14505.1514 2295.5470  6.32 <0.0001
yrs.since.phd    534.6313  311.1552  1.72 0.0865
yrs.service     -476.7179  304.4571 -1.57 0.1182
Predicting salary with linear fitted model
With a fitted regression model, we can apply the model to predict unknown values. For regression models, we can express the precision of prediction with a prediction interval and a confidence interval.
We will adopt the following approach for predicting salaries.
- Get 2 rows from the existing dataset.
- Use the linear regression model generated previously.
- Run the prediction on this test data.
##### Select 2 rows
Salaries[seq(from=1, to=10, by=5),]
       rank discipline yrs.since.phd yrs.service  sex salary
1      Prof          B            19          18 Male 139750
6 AssocProf          B             6           6 Male  97000

newProfessors = Salaries[seq(from=1, to=10, by=5),]

##### Predict using confidence interval
predict(lmfit, newProfessors, interval="confidence", level=0.95)
        fit      lwr      upr
1 131238.92 126982.8 135495.1
6  97553.18  91283.9 103822.5

##### Predict using prediction interval
predict(lmfit, newProfessors, interval="predict")
        fit      lwr      upr
1 131238.92 86692.62 175785.2
6  97553.18 52769.68 142336.7
Note that the predicted salary values (fit) are quite close to the actual salaries for the rows in question, and the confidence intervals are much narrower than the prediction intervals.