Predicting salaries using Linear Regression

In supervised machine learning, regression algorithms help us build a model that predicts the values of a dependent variable from the values of one or more independent variables. For example, predicting future demand for a product based on previous demand.
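As a minimal sketch of that demand example, a simple linear model in R might look like this (the figures below are made up purely for illustration):

#####  A minimal sketch with made-up demand figures (hypothetical data)
past = data.frame(month = 1:6, demand = c(100, 120, 125, 140, 150, 165))
fit = lm(demand ~ month, data = past)
predict(fit, data.frame(month = 7))   # predicted demand for the next month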

In this tutorial, we will use linear regression to predict professor salaries from a set of input attributes.

At a high level, we will take the following steps to perform the prediction analysis.

Environment Setup

  • Install R
  • Install RStudio
  • Install the car package

Loading the dataset

We will be using the Salaries dataset of professors from the car package. It includes the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors, and Professors at a college in the U.S. The data were collected as part of the ongoing effort of the college’s administration to monitor salary differences between male and female faculty members.

Load the file

Run the following commands in RStudio.

#####  Install the car package and load the data
install.packages("car")
library(car)

data(Salaries)

Understand data

Let’s look at sample data from the Salaries dataset by running the command below, which selects every 5th row starting from the 1st row.

Salaries[seq(from=1, to=60, by=5),]

Sample rows from the Salaries dataset

Here is the description given by the car package for this dataset. It is a data frame with 397 observations on the following 6 variables.

  • rank - a factor with levels AssocProf, AsstProf, Prof.
  • discipline - a factor with levels A (“theoretical” departments) or B (“applied” departments).
  • yrs.since.phd - years since PhD.
  • yrs.service - years of service.
  • sex - a factor with levels Female, Male.
  • salary - nine-month salary, in dollars.
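
Since rank, discipline, and sex are factors, a quick cross-tabulation (an optional check, not part of the original steps) shows how many professors fall into each group:

#####  Optional: count professors by rank and sex
table(Salaries$rank, Salaries$sex)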

Evaluate the data

Once the data is loaded, evaluate it using the str function.


#####  View the data type of the dataset
class(Salaries)

#####  Check data with str function
str(Salaries)

 'data.frame': 397 obs. of  6 variables:
 $ rank         : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
 $ discipline   : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
 $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
 $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
 $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
 $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...

Notice above the different attributes, their data types, and the first few values of each attribute.
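
To complement str, you can also get quick summary statistics with the base summary function (an optional check):

#####  Optional: summary statistics for each attribute
summary(Salaries)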

Data Cleansing and Transformation

In R, the lm(), or “linear model,” function can be used to create a simple regression model.
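
As a minimal sketch of lm in action, here is a single-predictor model (the name simplefit is our own); we fit the full model properly later in this tutorial:

#####  A minimal single-predictor sketch (the full model comes later)
simplefit = lm(salary ~ yrs.service, data = Salaries)
summary(simplefit)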

Detecting missing values

As you know, input sample data may have missing values, which can impact the overall prediction. Let’s find out whether we have any.

#####  Use the Amelia package to find missing values
install.packages("Amelia")
library(Amelia)
missmap(Salaries, main="Professor Salary Data - Missings Map",
        col=c("red", "green"), legend=FALSE)

Missing values map for the Salaries dataset

It appears that we do not have any missing values, so we are good to proceed.
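
If you prefer to avoid installing an extra package, an equivalent base R check (our own alternative, not part of the original steps) gives the same answer:

#####  Base R alternative: count missing values per column
colSums(is.na(Salaries))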

Basic Data Exploration and Visualization

Find variable relationships with Salary attribute

First, we will plot salary against rank, discipline, yrs.since.phd, yrs.service, and sex.

par(mfrow=c(3,2)) 
plot(Salaries$salary ~ Salaries$rank)
plot(Salaries$salary ~ Salaries$discipline)
plot(Salaries$salary ~ Salaries$yrs.since.phd)
plot(Salaries$salary ~ Salaries$yrs.service)
plot(Salaries$salary ~ Salaries$sex)

Salary plotted against each of the other attributes

Things to Note:

  • From the plots, sex does not appear to be a major factor in salary determination, although males show slightly higher salaries than females.
  • There appears to be a somewhat linear relationship between salary and the numeric attributes yrs.since.phd and yrs.service (quantified in the sketch after this list).
  • rank and discipline also appear to be positively related to salary: higher ranks and the applied (B) discipline show higher salaries.
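
To put a number on the linear relationships noted above, you can compute the correlations among salary and the numeric predictors (an optional check):

#####  Optional: correlations among salary and the numeric predictors
cor(Salaries[, c("yrs.since.phd", "yrs.service", "salary")])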

Let’s use the lm function to fit the model and analyze which attributes are significant.

lmfit = lm(salary ~ ., data = Salaries)

summary(lmfit)

##### output
Call:
lm(formula = salary ~ ., data = Salaries)

Residuals:
   Min     1Q Median     3Q    Max 
-65248 -13211  -1775  10384  99592 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    65955.2     4588.6  14.374  < 2e-16 ***
rankAssocProf  12907.6     4145.3   3.114  0.00198 ** 
rankProf       45066.0     4237.5  10.635  < 2e-16 ***
disciplineB    14417.6     2342.9   6.154 1.88e-09 ***
yrs.since.phd    535.1      241.0   2.220  0.02698 *  
yrs.service     -489.5      211.9  -2.310  0.02143 *  
sexMale         4783.5     3858.7   1.240  0.21584    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22540 on 390 degrees of freedom
Multiple R-squared:  0.4547,	Adjusted R-squared:  0.4463 
F-statistic:  54.2 on 6 and 390 DF,  p-value: < 2.2e-16

Things to Note:

  • Looking at the coefficients section, we can see that rank, discipline, yrs.since.phd, and yrs.service show significance (p-value < 0.05).
  • We can drop the insignificant sex attribute (which has a p-value greater than 0.05).

Let’s fit the independent variables (rank, discipline, yrs.since.phd, and yrs.service) against the dependent variable (salary) in the linear model.

lmfit = lm(salary ~ rank + discipline + yrs.since.phd + yrs.service, data = Salaries)

summary(lmfit)

##### output
Call:
lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service, 
    data = Salaries)

Residuals:
   Min     1Q Median     3Q    Max 
-65244 -13498  -1455   9638  99682 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    69869.0     3332.1  20.968  < 2e-16 ***
rankAssocProf  12831.5     4147.7   3.094  0.00212 ** 
rankProf       45287.7     4236.7  10.689  < 2e-16 ***
disciplineB    14505.2     2343.4   6.190 1.52e-09 ***
yrs.since.phd    534.6      241.2   2.217  0.02720 *  
yrs.service     -476.7      211.8  -2.250  0.02497 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22550 on 391 degrees of freedom
Multiple R-squared:  0.4525,	Adjusted R-squared:  0.4455 
F-statistic: 64.64 on 5 and 391 DF,  p-value: < 2.2e-16

Fitting only the selected predictor variables raises the F-statistic from 54.2 to 64.64.
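
To confirm that dropping sex does not significantly hurt the fit, one option is a partial F-test via anova; this is a sketch, and the names lmfit_full and lmfit_reduced are our own:

#####  Optional: partial F-test comparing the nested models
lmfit_full = lm(salary ~ ., data = Salaries)
lmfit_reduced = lm(salary ~ rank + discipline + yrs.since.phd + yrs.service, data = Salaries)
anova(lmfit_reduced, lmfit_full)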

Draw diagnostic plots

Let’s draw the diagnostic plots of lmfit.

par(mfrow=c(2,2))

plot(lmfit)

Diagnostic plots of lmfit

Things to Note:

  • Within the diagnostic plot, all four plots indicate that the regression model broadly follows the regression assumptions.
  • However, the residuals-versus-fitted and scale-location plots show that residuals for smaller fitted values are biased toward the regression model (to inspect one of these plots on its own, see the sketch after this list).
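
To view a single diagnostic plot on its own, plot.lm accepts a which argument (an optional step):

#####  Optional: view a single diagnostic plot (1 = residuals vs fitted)
plot(lmfit, which = 1)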

Diagnose the multicollinearity of the regression model

To detect multicollinearity, we can calculate the variance inflation and generalized variance inflation factors for linear and generalized linear models with the vif function.

Multicollinearity occurs when a predictor is highly correlated with others. If multicollinearity exists in the model, the model may show a high R-squared value while individual variables appear insignificant.

Run the commands below.

vif(lmfit)

##### output
                  GVIF Df GVIF^(1/(2*Df))
rank          2.003562  2        1.189736
discipline    1.063139  1        1.031086
yrs.since.phd 7.518920  1        2.742065
yrs.service   5.908984  1        2.430840


sqrt(vif(lmfit)) > 2

##### output
               GVIF    Df GVIF^(1/(2*Df))
rank          FALSE FALSE           FALSE
discipline    FALSE FALSE           FALSE
yrs.since.phd  TRUE FALSE           FALSE
yrs.service    TRUE FALSE           FALSE

Things to Note:

  • If multicollinearity exists, we should find predictors whose square root of the variance inflation factor is above 2. We may then remove redundant predictors or use a principal component analysis to transform the predictors into a smaller set of uncorrelated components (sketched below).
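
Here is a small sketch of the PCA option mentioned above, applied to the two correlated year counts; this is illustrative only, not a required step:

#####  Optional sketch: PCA on the two correlated year counts
pc = prcomp(Salaries[, c("yrs.since.phd", "yrs.service")], scale. = TRUE)
summary(pc)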

Diagnose the heteroscedasticity

We can perform the Breusch-Pagan test for heteroscedasticity with the bptest function available with the lmtest package.

The simple regression model assumes that the variance of the error is constant or homogeneous across observations. Heteroscedasticity means that the variance is unequal across observations.

Run the commands below to diagnose the heteroscedasticity of the regression model using the bptest function.

install.packages("lmtest")
library(lmtest)

bptest(lmfit)

	studentized Breusch-Pagan test

data:  lmfit
BP = 63.172, df = 5, p-value = 2.681e-12

Things to Note:

  • In this case, the p-value is 2.681e-12 (< 0.05), which rejects the null hypothesis of homoscedasticity; the variance of the errors appears to be unequal across observations.

Correct the standard errors

The p-value above implies that the standard errors of the parameter estimates are incorrect. However, we can use robust standard errors to correct them (this does not remove the heteroscedasticity) and increase the significance of truly significant parameters with the robcov function available in the rms package.

Run the commands below to correct the standard errors with robcov:

install.packages("rms")
library(rms)
olsfit = ols(salary ~ rank + discipline + yrs.since.phd + yrs.service, data = Salaries, x= TRUE, y= TRUE)
robcov(olsfit)

##### output
Linear Regression Model

ols(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service, 
    data = Salaries, x = TRUE, y = TRUE)
                    Model Likelihood     Discrimination    
                       Ratio Test           Indexes        
Obs          397    LR chi2    239.17    R2       0.453    
sigma 22554.1128    d.f.            5    R2 adj   0.446    
d.f.         391    Pr(> chi2) 0.0000    g    22503.353    

Residuals

   Min     1Q Median     3Q    Max 
-65244 -13498  -1455   9638  99682 

               Coef       S.E.      t     Pr(>|t|)
Intercept      69869.0110 1999.2090 34.95 <0.0001 
rank=AssocProf 12831.5375 2201.3419  5.83 <0.0001 
rank=Prof      45287.6890 3262.5067 13.88 <0.0001 
discipline=B   14505.1514 2295.5470  6.32 <0.0001 
yrs.since.phd    534.6313  311.1552  1.72 0.0865  
yrs.service     -476.7179  304.4571 -1.57 0.1182

Predicting salary with the fitted linear model

With a fitted regression model, we can apply the model to predict unknown values. For regression models, we can express the precision of prediction with a prediction interval and a confidence interval.

We will adopt the following approach for predicting professor salaries.

  • Get 2 rows from the existing dataset.
  • Use the linear regression model generated previously.
  • Run the prediction on those rows.

#####  Select 2 rows
Salaries[seq(from=1, to=10, by=5),]
       rank discipline yrs.since.phd yrs.service  sex salary
1      Prof          B            19          18 Male 139750
6 AssocProf          B             6           6 Male  97000

newProfessors = Salaries[seq(from=1, to=10, by=5),]

#####  Predict using confidence interval
predict(lmfit, newProfessors, interval="confidence", level=0.95)

        fit      lwr      upr
1 131238.92 126982.8 135495.1
6  97553.18  91283.9 103822.5

#####  Predict using prediction interval
predict(lmfit, newProfessors, interval="prediction")

        fit      lwr      upr
1 131238.92 86692.62 175785.2
6  97553.18 52769.68 142336.7

Note that the predicted salary values are quite close to the actual values for the rows in question.
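
We can also score a professor who is not in the dataset at all. Below is a hedged sketch; newProf and its attribute values are hypothetical:

#####  Optional: predict salary for a hypothetical new professor
newProf = data.frame(rank = "AsstProf", discipline = "B",
                     yrs.since.phd = 5, yrs.service = 3)
predict(lmfit, newProf, interval = "prediction", level = 0.95)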


Version History


Date          Description
2016-09-01    Initial Version