Logistic Regression Model R
You have to do STAT 8561 project on data from the Titanic. The data set has 1371 observations from the titanic. Each observation is one of the people on the titanic and each variable is the age, class, and whether or not that particular person survived when the titanic sank. You have to create a logistic regression model to predict the probability that someone lived on the titanic based on their age, class on the boat, and their gender.
In order to properly conduct this, we will first need that the assumptions of logistic regression are met:
- Dependent variable is binary
- P(y=1) is the probability of the event occurring (i.e passenger lives)
- Model should be fitted correctly. We will use a stepwise method to determine which variables are to be included in the model and perform the Hosmer-Lemeshow GOF test. We will also look at the deviance residuals and the partial residuals. The partial residuals will indicate if any of the predictors need to be transformed.
- Error terms are independent.
- Linearity of independent variables and log odds. We will check this by looking at a scatter plot of each predictor versus log odds.
- Large sample size. This is certainly done since n = 1371.
By doing this, the data is displayed as a two way table to display the information in a better way. Then use R to produce a logistic regression model of the data. The estimated coefficients interpreted in terms of the predicted survival odds. Then use R to create confidence intervals for the coefficient estimates and the coefficient standard errors.
If there is still enough pages left in the project, the proportion of incorrectly predicted outcomes estimated while also showing a few predicted values. Upon doing this, graph a receiver operating characteristic curve to see a graphical display of the predictive ability of the model.
I’d like to do my STAT project on mtcars dataset inbuilt in R. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Description of the dataset :
A data frame with 32 observations on 11 variables.
|[, 1]||mpg||Miles/(US) gallon|
|[, 2]||cyl||Number of cylinders|
|[, 3]||disp||Displacement (cu.in.)|
|[, 4]||Hp||Gross horsepower|
|[, 5]||drat||Rear axle ratio|
|[, 6]||Wt||Weight (1000 lbs)|
|[, 7]||qsec||1/4 mile time|
|[, 9]||Am||Transmission (0 = automatic, 1 = manual)|
|[, 10]||gear||Number of forward gears|
|[, 11]||carb||Number of carburetors|
Create a multiple linear regression model to predict the variable mpg based on the other variables.
For doing this, we’d follow steps below:
- Observe that the variable mpg is a continuous variable
- Model should be as good as possible. To ensure so, first we’ll draw the scatterplots and look at the correlation matrix to select which variables are important predictors in predicting mpg . This’d help us to keep the model simple & neat.
- After running the multiple regression models in R, we’d look at the standardised coefficients to check their p-values and would conclude about significant predictors.
- We’d check the normality of residuals.
Once we get the results, we’d look at the value to conclude how well does the model work in predicting mpg. We’ll also try to get confidence interval of the coefficients in the model to get a better understanding of the model.