Exploring Prosper Data
Intro
The dataset investigated comes from Prosper, a peer-to-peer lending website where people can invest in or borrow money. It works in the following way: borrowers choose a loan amount and purpose and post a loan listing; investors review the listings and invest in those they are interested in; once the process is complete, borrowers make fixed monthly payments and investors receive a portion of those payments directly in their Prosper account.
My goals for this exploratory data analysis are twofold. The first is to understand some of the variables and visualize their distributions. The second is to find possible correlations among the variables.
The methodology consists of univariate, bivariate and multivariate analysis. The tools I use are R’s visualization package ggplot2 and linear models.
Approaches
I conducted univariate analysis, bivariate analysis and multivariate analysis to explore this dataset.
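As a rough illustration of what these three levels of analysis look like in R, here is a minimal sketch using base R only. The data frame is a handful of made-up values, not rows from the Prosper dataset, and the choice of BorrowerRate, LenderYield and Term as example columns is just an assumption for the demonstration.

```r
# Toy data: made-up values for illustration, NOT taken from the Prosper dataset.
# Column names follow variables mentioned in this write-up.
loans <- data.frame(
  BorrowerRate = c(0.08, 0.12, 0.15, 0.20, 0.25, 0.31),
  LenderYield  = c(0.07, 0.11, 0.145, 0.19, 0.235, 0.30),
  Term         = c(36, 36, 60, 36, 60, 36)
)

# Univariate: look at one variable at a time.
rate_summary <- summary(loans$BorrowerRate)  # min, quartiles, mean, max
term_counts  <- table(loans$Term)            # number of loans per term

# Bivariate: Pearson correlation between two variables.
rate_yield_cor <- cor(loans$BorrowerRate, loans$LenderYield)

# Multivariate: a linear model with more than one predictor.
fit <- lm(LenderYield ~ BorrowerRate + Term, data = loans)

print(rate_summary)
print(term_counts)
print(rate_yield_cor)
print(coef(fit))
```

In the actual analysis the same questions were explored visually with ggplot2; the numeric summaries here are only a compact stand-in for those plots.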
Insight
The Prosper loan dataset has 113,937 transaction records with 81 variables, of which I explored 15. I started by reading the documentation to find interesting variables, then used various plots to check how these variables were distributed. At first I struggled to understand the dataset and to ask interesting questions of it. To understand it better, I visualized the interaction between pairs of variables. For instance, I investigated how BorrowerRate is correlated with LenderYield.
Then one question came to mind: if I were a lender, what kind of borrower should I lend my money to? From the lender’s perspective, the borrower’s income range, credit score, purpose of borrowing and delinquent amount do not matter in themselves. The only thing the lender cares about is making more money. With this in mind, I made more bivariate plots to find which variables are correlated with LenderYield. I found that the Prosper rating is a good indicator, in a seemingly strange way: the lower the rating, the higher the expected yield. This was surprising at first, but it makes sense, since lenders need a higher return to compensate for higher risk.
Through multivariate analysis, I also found BorrowerState, Term, AmountDelinquent and DebtToIncomeRatio helpful for predicting the loan result. To summarize my findings in a single sentence: to maximize expected yield, lend to borrowers from Alabama with a Prosper rating of HR, a high delinquent amount and a high debt-to-income ratio, for a term of 36 or 60 months.
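The relationship between rating and yield can be checked by averaging LenderYield within each rating group. The sketch below does this in base R on made-up values, not on the Prosper data. The column name `ProsperRating` and the three levels shown (AA, C, HR, a subset of Prosper's rating scale) are assumptions chosen for illustration.

```r
# Toy data: made-up values for illustration, NOT taken from the Prosper dataset.
# `ProsperRating` is a hypothetical column name for the letter rating.
loans <- data.frame(
  ProsperRating = c("AA", "AA", "C", "C", "HR", "HR"),
  LenderYield   = c(0.06, 0.08, 0.18, 0.20, 0.30, 0.32)
)

# Order the ratings from lowest risk (AA) to highest risk (HR).
loans$ProsperRating <- factor(loans$ProsperRating, levels = c("AA", "C", "HR"))

# Mean yield per rating group; rows come back in factor-level order.
yield_by_rating <- aggregate(LenderYield ~ ProsperRating, data = loans, FUN = mean)

print(yield_by_rating)
```

If the pattern described above holds, the mean yield should increase from the first row to the last, as it does by construction in this toy example.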