Project 2
GitHub Pages Repo
Regular Repo
For Project 2, I practiced summarizing data with an exploratory data analysis (EDA), creating linear regression and ensemble tree-based models, and automating Markdown reports.
What Would I Do Differently?
At first, I struggled with the EDA because I did not know where to start or which variables to focus on. I began by making graphs that focused on just one or two variables, then realized I would need to repeat this analysis for all of the variables, both numeric and categorical. My first attempt at the EDA did not flow very well because it did not tell a story about the entire data set, so I decided to start fresh and investigate only the response variable. After I made a few graphs, I discovered that I could make multiple histograms at once with hist.data.frame(). I then began to visualize both the numeric and non-numeric variables. If I were able to do this part again, I would start by planning out my EDA and deciding which variables to visualize first. Then, I would prepare tables and graphs to explain each variable.
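As an illustration of that approach, here is a minimal sketch assuming a data frame named dat with a mix of numeric and categorical columns (the name dat and the column handling are hypothetical, not the project's actual data):

```r
# A minimal sketch of EDA over all variables; `dat` is a hypothetical
# data frame with mixed numeric and categorical columns.
library(Hmisc)  # provides hist.data.frame()

# Histograms for every numeric variable in one call
hist.data.frame(dat[sapply(dat, is.numeric)])

# Bar plots for each categorical (factor or character) variable
cat_vars <- names(dat)[sapply(dat, function(x) is.factor(x) || is.character(x))]
for (v in cat_vars) {
  barplot(table(dat[[v]]), main = v)
}
```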
What was the Most Difficult Part?
The most difficult part of this project was understanding how to automate the report generation. I remember learning how to do this in R Markdown in lecture by setting parameters and using the apply and render functions. I learned that the trick is to avoid hard-coding values, so that the same template can be rendered once for every parameter setting.
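For reference, here is a minimal sketch of that workflow, assuming a parameterized template named report.Rmd with a single params entry called channel (the file name and parameter values are hypothetical stand-ins):

```r
# A minimal sketch of automated report rendering; `report.Rmd` and its
# `channel` parameter are hypothetical stand-ins for the real template.
library(rmarkdown)

channels <- c("lifestyle", "tech", "world")  # hypothetical parameter values

# Pair each parameter value with an output file name
reports <- lapply(channels, function(ch) {
  list(output_file = paste0(ch, ".md"),
       params      = list(channel = ch))
})

# Render the same template once per parameter setting
lapply(reports, function(r) {
  render("report.Rmd", output_file = r$output_file, params = r$params)
})
```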
What are the Big Take-Aways from this Project?
Every model has its upsides and downsides. For example, a tree model is more flexible than a simple linear regression (SLR) model: a tree splits the predictor space into regions and predicts the average of the observed values within each region, whereas an SLR fits a single line to the entire data set. That extra flexibility, however, may increase the variance.
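To make the contrast concrete, here is a minimal sketch fitting both kinds of model to the built-in mtcars data (purely illustrative; this is not the project's data set):

```r
# A minimal sketch contrasting an SLR with a regression tree on the
# built-in mtcars data (illustrative only, not the project data).
library(rpart)

slr  <- lm(mpg ~ wt, data = mtcars)     # one line fit to the whole data set
tree <- rpart(mpg ~ wt, data = mtcars)  # piecewise-constant regional averages

# Compare the two models' fitted values side by side
head(data.frame(slr = predict(slr), tree = predict(tree)))
```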
One way to decrease the variance of trees is to use the average prediction from many trees as our final prediction (e.g., bagged tree and random forest models). If the predictions are independent of one another, taking the average is a great way to reduce variance. But if the data set has one strong predictor, all the trees will probably use that predictor for their first split, resulting in trees that are highly correlated with one another. To fix this, we can use a random forest model, which considers only a random subset of predictors at each split instead of every predictor. However, there is still a risk of overfitting. One alternative is a boosted tree model, which trains trees sequentially so that each new tree is built from the errors of the ones before it. Boosted tree models, though, are sensitive to outliers.
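Here is a minimal sketch of fitting both ensembles, again on mtcars and with hyperparameter values chosen only for illustration:

```r
# A minimal sketch of both ensemble models on mtcars (illustrative only).
library(randomForest)
library(gbm)

# Random forest: `mtry` predictors are sampled at each split, which
# decorrelates the trees before their predictions are averaged
rf <- randomForest(mpg ~ ., data = mtcars, mtry = 3, ntree = 500)

# Boosted tree: small trees are fit sequentially, each one to the
# residuals left by the ensemble so far
bt <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
          n.trees = 500, interaction.depth = 2, shrinkage = 0.1)
```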
You need to pick the best model based on the data and the resulting statistics, such as the root mean square error (RMSE), which is the standard deviation of the prediction errors. The repository we made can compare multiple models across several data sets and report the best one for each.
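As a sketch of that comparison, assuming the models fit above and a hypothetical held-out data frame named test with the same columns as the training data, the model with the lowest RMSE would be declared the winner:

```r
# A minimal sketch of model comparison by RMSE; `test` is a hypothetical
# held-out data set with the same columns as the training data.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

c(slr  = rmse(test$mpg, predict(slr,  newdata = test)),
  tree = rmse(test$mpg, predict(tree, newdata = test)),
  rf   = rmse(test$mpg, predict(rf,   newdata = test)),
  bt   = rmse(test$mpg, predict(bt,   newdata = test, n.trees = 500)))
```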