Mortgage Modeling


Capstone Machine Learning Data Analytics Project

Social Media | Machine Learning Mortgage Modeling

Capstone Project: Mortgage Default Prediction Project Proposal

Fannie Mae Data:

Fannie Mae Single-Family Acquisition and Performance data. Files are released quarterly but contain monthly information. Each of the quarterly files (Acquisition and Performance) contains information on loans that originated in that quarter and all history to the most recent quarter. Fannie Mae is the nickname of FNMA, the Federal National Mortgage Association. Fannie Mae was established in 1938 by congress as part of the New Deal to stimulate the housing market by making mortgage more attainable for low- and middle-income families. Fannie Mae does not originate loans, but it does back or guarantee them.

Canueza's Goal:

The goal of this project is to use Fannie Mae’s mortgage performance data to make predictions about a given customer. We want to forecast 1. the probability a borrower will go into default in the next quarter, 2. the probability a borrower will not pay their next mortgage payment, 3. how long until a borrower goes into default, and 4. how these trends vary geographically.

Inspiration:

The Federal National Mortgage Association (Fannie Mae) guarantees mortgages to stimulate lenders to assume the risk of extending a mortgage to a qualified borrower. In order to qualify for a mortgage backed by Fannie Mae, borrowers must meet certain criteria that are strong indicators of their ability to pay the loan. Despite choosing borrowers that meet criteria, sometimes, borrowers are not able to make regular payments on their loan and might even go into default. Lenders know there will be some loss involved in mortgage lending. If the lender can predict how much loss they might incur, they can employ methods to hedge their losses.

In this project we seek to provide a tool that will help a lender to determine if an individual account will default given certain origination information and current age.

Analysis:

The focus of this analysis is to develop a loan level model that predicts whether a borrower will default (miss 4 or more payments) on their mortgage based on origination nand performance information available through Fannie Mae.

Model fitting began with logistic regression on a random sample of Fannie Mae data. The resulting model had an accuracy score off 99.8%, which seems too good to be true. When you look at the precision of both outcomes: default or not default. There was not a single prediction of defaulting. Thus, not defaulting was predicted with 100% accuracy, and defaulting was predicted with 0% accuracy. This averaged out to a score of 99.8 because the number of defaults present in the data is very small.

In order to make predictions about defaulting, more observations of default are needed. Because the model assumes independence among the observations and is not trying to make predictions about the population, but instead a single account, the proportion of loans that defaulted are sampled at a higher rate than loans that do not default.

Running logistic regression using LogisticRegressionCV with the oversampled data provided better results. While the accuracy number was still high, 91%, the precision of default was improved, though not ideal at 61%.

A decision tree was fitted to the oversampled data set. In order to see any significant splits that identified delinquency, the depth had to be set at at least 5. At this depth, the tree is very complex, with 32 nodes, with only 5 of those nodes identifying default.

A random forest model was fitted using XGBoost provided a very accurate prediction of both defaulting and not defaulting (0 and 100%, respectively).

All fitted models seem to be most influenced by the age of a loan. As a loan ages, it's likelihood of defaulting increases. This might be because once a borrower begins to miss payments, it becomes increasingly hard as time passes to right their account. As the borrower gets further away from the date they qualified for their loan, there is more opportunity for their credit score, or debt to income to change, thus providing more possibility for default as time passes.

Conclusion:

Though several modeling techniques were investigated, the best models to provide a prediction of default were logistic regression and Random Forest using XGBoost. Both models are used to provide a user with a prediction of default given information enter for an individual loan.

What’s next for analysis?:

This initial analysis of the Fannie Mae data uses two modeling techniques to answer one question, whether a borrower will default. There are more opportunities for analysis and prediction using this data. There are also other methods and modifications to the existing analysis that might improve the prediction of default.

Further Default Prediction Analysis There are more modeling techniques that might provide better prediction of a borrower defaulting. A Bayesian forecast technique could be implemented assuming a distribution on the current loan delinquency status from the previous month to predict whether a borrower will default. A RandomizedSearchCV method of random forest optimization is thought to yield more robust solutions than the XGBoost method that was used here. It would be of interest to compare the accuracy of different random forest modeling methods.

Data Enhancements This study focuses on a narrow timeline to draw conclusions about defaulting early in the life of a mortgage. Defaulting is possible at any stage of a loan’s lifetime. Including data over a broader timeline would create a more robust model could predict default in more mature mortgages. To better capture the regional differences in the economic wellbeing in the country, including an MSA level Case Shiller Index would provide another variable that likely has a high correlation with defaulting trends.

Additional Questions the Data Can Answer While this study focuses on defaulting (missing 4 or more payments), an early look at delinquency could also be useful. The likelihood of missing a payment given the current loan delinquency status could be modeled using a Bayesian Logistic Regression, where the current delinquency status has an assumed distribution. Another outcome of interest is whether a loan will be prepaid. It would be interesting to look at factors that are predictive of whether a borrower will prepay their loan. Another way to look at the data is to consider how long it takes for a customer to default or miss a payment. Estimating the time until default could be accomplished using a hazard survival model where defaulting is the event of interest and loans that do not default in the observed time period are considered censored. Considering 3 possible ways for a mortgage to end, payment until maturity, default, and prepay, a competing risk hazard model could predict the time until any event that might result in the loan not reaching maturity.