Project 2 Reflections
For the second project in ST 558, we had to automate the process of ETL, EDA, model building, model comparison, and outputting a report for 6 different subsets of data through a single .Rmd file. In each report, we performed some basic EDA, then constructed 4 models: 2 linear regression models, 1 random forest, and 1 boosted tree model. We then selected the best model based on its performance on the test set.
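The automation side of it basically came down to rendering the same parameterized .Rmd once per subset. Here's a rough sketch of what that looks like - the file name, parameter name, and subset names below are placeholders, not the ones from my repo:

```r
library(rmarkdown)

# Placeholder names for the six data subsets - not the project's actual names
subsets <- paste0("subset", 1:6)

# Render one report per subset from the single source .Rmd by passing the
# subset name in as a parameter (read inside the .Rmd via params$subset)
for (s in subsets) {
  rmarkdown::render(
    input       = "analysis.Rmd",          # assumed name for the single .Rmd
    output_file = paste0(s, ".md"),        # one output report per subset
    params      = list(subset = s)
  )
}
```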
The repo containing all of the files can be found here, and the GitHub Pages site for the repo can be found here.
Difficulties
To be honest, I didn’t find any part of this project too difficult. I think that programmatically, this project was a good bit easier than the first one. It was drilled into me from when I first started programming that “hard coding is bad coding,” so it didn’t really require any adjustment from me to adapt things for the automation.
There were 2 things outside of the programming that caused some trouble:
- GitHub in general. It wasn’t as bad this time as the first project, but GitHub still remains one of the biggest things I feel like I don’t actually understand at all. I can get it to work (obviously), but I just feel so lost using it for the most part.
- The computation time for the random forest model. When I first tried to run the code to build the model, I hadn’t bothered to set up any of the parallel processing yet - I just wanted to see if the code worked. I probably waited around 45 minutes before I realized I was being kind of an idiot, and that I really should just stop and parallelize it since I was going to have to run it multiple times to test different methods anyway. I ended up testing several different approaches to speed it up (such as inputting the predictors as a matrix instead of a data frame so they took up less memory across the workers), and by the end it would run in under 10 minutes or so. That was definitely a lot more manageable, but still a bit of a pain to work around. A rough sketch of the kind of parallel setup I mean is below.
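For reference, here's a minimal sketch of that kind of setup using doParallel with caret - the object names (train_x, train_y) and the worker count are placeholders, not my exact code:

```r
library(caret)
library(doParallel)

# Spin up a cluster of workers, leaving one core free for the OS
cl <- parallel::makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# caret's train() farms the resampling folds out to the workers automatically.
# Supplying x/y as a matrix + vector (instead of a formula on a data frame)
# cuts down on what gets copied over to each worker.
rf_fit <- train(
  x = as.matrix(train_x),   # train_x / train_y are placeholders for the project's data
  y = train_y,
  method    = "rf",
  trControl = trainControl(method = "cv", number = 5, allowParallel = TRUE)
)

# Shut the cluster down and go back to sequential processing
stopCluster(cl)
registerDoSEQ()
```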
What I’d Change
If I could go back in time, I would definitely have started working on the project earlier, though that feels like a bit of an eternal problem (and honestly I don’t know if I would have had the time to start this one any earlier anyway). I suppose I also would have set up the parallel processing from the beginning, but I fixed that relatively quickly anyway. Other than that, I don’t think there’s anything I’d really want to change.
Main Takeaways
Probably the biggest takeaway I had from this project is how much work it would be to fully implement a model in production when there are many thousands (or even millions) more records to deal with. I obviously don’t have the computational resources that a large-scale company would, but my computer is reasonably powerful and I still had to do a fair bit of work to make the computation times manageable. I can’t imagine how difficult it would be with that much more data and a need for something that is “live,” so to speak.
The other main takeaway I had is how cool ensemble methods are. I was able to throw basically every variable in the dataset into the random forest model without having to worry about any higher-order terms, interactions, etc. - and it would still reasonably account for all of that. When I started to build my linear regression models, the number of variables in the data was a little overwhelming, so I ended up just using stepwise selection so I wouldn’t have to go through all of the variables individually. There could have been all sorts of interactions or higher-order effects that weren’t accounted for in that model, but the random forest handles them naturally. Obviously the computational cost is much higher, but outside of that it made things much easier (and more accurate in almost every case).
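To make that contrast concrete, here's a minimal toy sketch of the two approaches side by side - it uses mtcars as a stand-in dataset and made-up tuning settings, not the project's actual data or code:

```r
library(caret)
library(MASS)

# Toy setup: mtcars as a stand-in dataset, mpg as the response
train_dat <- mtcars

# Stepwise selection on a linear model - it only ever considers the terms you
# feed it, so interactions and higher-order effects have to be written out up front
lm_full <- lm(mpg ~ ., data = train_dat)
lm_step <- stepAIC(lm_full, direction = "both", trace = FALSE)

# Random forest via caret - throw every predictor in as-is and let the trees
# pick up nonlinearities and interactions on their own
rf_fit <- train(
  mpg ~ ., data = train_dat,
  method    = "rf",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid  = data.frame(mtry = c(2, 4, 6))
)
```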