This dataset contains more than 110,000 loan records with 81 variables. In this analysis I started by learning the basics of the data structure and variables, and then focused on exploring the relationship and correlation between APR and other variables.
In this section, I visualized the distributions and attributes of some variables to get a deeper understanding of the dataset. The results are shown in the Univariate Analysis section.
A paired analysis of 11 variables was conducted to reveal more about the correlation between them and to find out which correlate most strongly with APR. The results are shown in the Bivariate Analysis section.
Multivariate analysis and plots showed clearer relationships among three variables at a time. The results are shown in the Multivariate Analysis section.
The personal loan dataset contains 113,937 personal loan records with 81 variables, originated from 2005 to 2014. Since it is a large dataset, I started by learning the basics of Prosper loans while exploring the variables. After some initial analysis, I focused on the borrower’s APR, because I think APR is the key variable in this dataset and one of the most important ratios in any loan.
Many variables interact with APR. Some have a clear correlation, such as borrower rate, credit grade and Prosper score; others show a correlation only after adjustment, such as delinquencies in the last 7 years, current credit lines, and inquiries in the last 6 months. Among these, borrower rate is the most obvious because of its strong correlation coefficient (almost 0.99). But after exploring the plot of the two, I noticed parallel bands with the same slope, meaning that the same borrower rate maps to different levels of APR even though the two move together at almost the same rate.
As I continued exploring the relationship between APR and borrower rate, I found that loan origination year and employment status reflect these bands. The service fee and related tax could differ from year to year, which could explain the gap between APR and borrower rate, but the employment status effect suggests that the fee may also depend on some conditions of the borrower.
This report explores a dataset containing more than 100,000 loan records with more than 80 attributes for each loan.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 8 levels "AA","A","B","C",..: 4 NA 7 NA NA NA NA NA NA NA ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2802 levels "2005-11-25 00:00:00",..: 1137 NA 1262 NA NA NA NA NA NA NA ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 7 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 51 levels "AK","AL","AR",..: 6 6 11 11 24 33 17 5 15 15 ...
## $ Occupation : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
## $ EmploymentStatus : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 706 levels "00343376901312423168731",..: NA NA 334 NA NA NA NA NA NA NA ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11585 levels "1947-08-24 00:00:00",..: 8638 6616 8926 2246 9497 496 8264 7684 5542 5542 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## ListingKey ListingNumber
## 17A93590655669644DB4C06: 6 Min. : 4
## 349D3587495831350F0F648: 4 1st Qu.: 400919
## 47C1359638497431975670B: 4 Median : 600554
## 8474358854651984137201C: 4 Mean : 627886
## DE8535960513435199406CE: 4 3rd Qu.: 892634
## 04C13599434217079754AEE: 3 Max. :1255725
## (Other) :113912
## ListingCreationDate CreditGrade Term
## 2013-10-02 17:20:16.550000000: 6 C : 5649 Min. :12.00
## 2013-08-28 20:31:41.107000000: 4 D : 5153 1st Qu.:36.00
## 2013-09-08 09:27:44.853000000: 4 B : 4389 Median :36.00
## 2013-12-06 05:43:13.830000000: 4 AA : 3509 Mean :40.83
## 2013-12-06 11:44:58.283000000: 4 HR : 3508 3rd Qu.:36.00
## 2013-08-21 07:25:22.360000000: 3 (Other): 6745 Max. :60.00
## (Other) :113912 NA's :84984
## LoanStatus ClosedDate
## Current :56576 2014-03-04 00:00:00: 105
## Completed :38074 2014-02-19 00:00:00: 100
## Chargedoff :11992 2014-02-11 00:00:00: 92
## Defaulted : 5018 2012-10-30 00:00:00: 81
## Past Due (1-15 days) : 806 2013-02-26 00:00:00: 78
## Past Due (31-60 days): 363 (Other) :54633
## (Other) : 1108 NA's :58848
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :1.000 C :18345 Min. : 1.00
## 1st Qu.:3.000 B :15581 1st Qu.: 4.00
## Median :4.000 A :14551 Median : 6.00
## Mean :4.072 D :14274 Mean : 5.95
## 3rd Qu.:5.000 E : 9795 3rd Qu.: 8.00
## Max. :7.000 (Other):12307 Max. :11.00
## NA's :29084 NA's :29084 NA's :29084
## ListingCategory..numeric. BorrowerState Occupation
## Min. : 0.000 CA :14717 Other :28617
## 1st Qu.: 1.000 TX : 6842 Professional :13628
## Median : 1.000 NY : 6729 Computer Programmer: 4478
## Mean : 2.774 FL : 6720 Executive : 4311
## 3rd Qu.: 3.000 IL : 5921 Teacher : 3759
## Max. :20.000 (Other):67493 (Other) :55556
## NA's : 5515 NA's : 3588
## EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner
## Employed :67322 Min. : 0.00 False:56459
## Full-time :26355 1st Qu.: 26.00 True :57478
## Self-employed: 6134 Median : 67.00
## Not available: 5347 Mean : 96.07
## Other : 3806 3rd Qu.:137.00
## (Other) : 2718 Max. :755.00
## NA's : 2255 NA's :7625
## CurrentlyInGroup GroupKey
## False:101218 783C3371218786870A73D20: 1140
## True : 12719 3D4D3366260257624AB272D: 916
## 6A3B336601725506917317E: 698
## FEF83377364176536637E50: 611
## C9643379247860156A00EC0: 342
## (Other) : 9634
## NA's :100596
## DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper
## 2013-12-23 09:38:12: 6 Min. : 0.0 Min. : 19.0
## 2013-11-21 09:09:41: 4 1st Qu.:660.0 1st Qu.:679.0
## 2013-12-06 05:43:16: 4 Median :680.0 Median :699.0
## 2014-01-14 20:17:49: 4 Mean :685.6 Mean :704.6
## 2014-02-09 12:14:41: 4 3rd Qu.:720.0 3rd Qu.:739.0
## 2013-09-27 22:04:54: 3 Max. :880.0 Max. :899.0
## (Other) :113912 NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## 1993-12-01 00:00:00: 185 Min. : 0.00 Min. : 0.00
## 1994-11-01 00:00:00: 178 1st Qu.: 7.00 1st Qu.: 6.00
## 1995-11-01 00:00:00: 168 Median :10.00 Median : 9.00
## 1990-04-01 00:00:00: 161 Mean :10.32 Mean : 9.26
## 1995-03-01 00:00:00: 159 3rd Qu.:13.00 3rd Qu.:12.00
## (Other) :112389 Max. :59.00 Max. :54.00
## NA's : 697 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.00
## 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 25.00 Median : 6.00
## Mean : 26.75 Mean : 6.97
## 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :136.00 Max. :51.00
## NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000
## Median :0.940 Median : 0.000
## Mean :0.886 Mean : 0.802
## 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :20.000
## NA's :7544 NA's :7544
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.000 $25,000-49,999:32192 False: 8669
## 1st Qu.: 0.140 $50,000-74,999:31050 True :105268
## Median : 0.220 $100,000+ :17337
## Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 0.320 Not displayed : 7741
## Max. :10.010 $1-24,999 : 7274
## NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 CB1B37030986463208432A1: 6 Min. :0.00
## 1st Qu.: 3200 2DEE3698211017519D7333F: 4 1st Qu.:1.00
## Median : 4667 9F4B37043517554537C364C: 4 Median :1.00
## Mean : 5608 D895370150591392337ED6D: 4 Mean :1.42
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4 3rd Qu.:2.00
## Max. :1750003 0D8F37036734373301ED419: 3 Max. :8.00
## (Other) :113912 NA's :91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00
## Median : 16.00 Median : 15.00
## Mean : 22.93 Mean : 22.27
## 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2014-01-22 00:00:00: 491 Q4 2013:14450
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 Q1 2014:12172
## Median : 6500 2014-02-19 00:00:00: 439 Q3 2013: 9180
## Mean : 8337 2013-10-16 00:00:00: 434 Q2 2013: 7099
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 Q3 2012: 5632
## Max. :35000 2013-09-24 00:00:00: 316 Q2 2012: 5061
## (Other) :111428 (Other):60343
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
Our dataset consists of 81 variables, with more than 110,000 observations.
First I explored some attributes of the loans to get a basic idea of the Prosper loans in this dataset. From the loan amount plots, we can see that most loans are below $15,000, with a few records above $30,000.
The term is either 12, 36 or 60 months; most loans have a 36-month term.
I converted the origination date to POSIXlt format, then extracted the year and converted it to a factor. Here we can see that few records are from 2005, and the highest number of loans is in 2013.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
Most APRs range from 0.05 to 0.42 (5% to 42%), with a peak at around 0.36.
Most borrower rates range from 0.05 to 0.35 (5% to 35%), with a peak at around 0.32.
The top 3 categories are debt consolidation, home improvement and business.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
Around half of the loans are current. Note that there are 11,992 charged-off loans.
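As a quick check of the "around half" statement, the shares can be computed directly from the status counts printed above. This is only a small sketch: the counts are copied by hand from that table, and the names `status_counts` and `status_share` are mine.

```r
# Loan status counts copied from the table output above.
status_counts <- c(
  "Cancelled" = 5, "Chargedoff" = 11992, "Completed" = 38074,
  "Current" = 56576, "Defaulted" = 5018, "FinalPaymentInProgress" = 205,
  "Past Due (>120 days)" = 16, "Past Due (1-15 days)" = 806,
  "Past Due (16-30 days)" = 265, "Past Due (31-60 days)" = 363,
  "Past Due (61-90 days)" = 313, "Past Due (91-120 days)" = 304
)

# Share of each status among all loans, largest first.
status_share <- status_counts / sum(status_counts)
print(round(sort(status_share, decreasing = TRUE), 3))
```

The counts add up to the 113,937 records in the dataset; current loans come out at just under 50% and charged-off loans at about 10.5%.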
Here I started to look at the variables describing the borrower. From the plot, the income ranges $25,000-$49,999 and $50,000-$74,999 have the highest counts.
A high percentage of the borrowers are employed.
The number of homeowners is slightly higher than the number of non-homeowners.
There is a high count for inquiries in the last 6 months at or close to 0, but note that the outliers range from around 3 to above 100.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.32 13.00 59.00 7604
The mean of current credit lines is 10.32 and the 3rd quartile is 13, but there are some high values above 40, so I trimmed the credit lines below 40 to get more detail on these high values.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
Similar to the credit lines, some extremely high values can be found in delinquencies in the last 7 years. I made a plot of the top 95% of the delinquency values to identify the range.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
Debt-to-income ratio is another variable with extremely high values. The histogram is heavily skewed, and the extreme values are quite unusual: some records show a debt-to-income ratio of about 1000%, which is unlikely to be true considering that 45% is risky enough for some lenders.
To find out more detail, I plotted the values above the 99% quantile. They range from around 0.8 (80%) to 10.0 (1000%).
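The quantile-based filter described above can be sketched as follows. The vector `dti` is synthetic stand-in data (not the real DebtToIncomeRatio column), and `cutoff` and `extreme` are names I chose for illustration.

```r
# Synthetic debt-to-income values: mostly ordinary, a few extreme, one missing.
set.seed(1)
dti <- c(runif(995, min = 0, max = 0.6), 1.5, 3, 5, 10.01, NA)

# 99% quantile, ignoring missing values.
cutoff <- quantile(dti, probs = 0.99, na.rm = TRUE)

# Keep only the non-missing values above the cutoff.
extreme <- dti[!is.na(dti) & dti > cutoff]

print(cutoff)
print(range(extreme))
```

Plotting only `extreme` (for example with `hist(extreme)`) makes the tail visible without the bulk of ordinary values compressing the axis.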
The original credit grade levels were ordered A, AA, B, C, D, E, HR, NC; I moved AA to be the first level.
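One way to make this level change in base R is to rebuild the factor with an explicit level order. This is a sketch on a small made-up vector; `grades` and `grade_order` are illustrative names, and the report may have used a different call.

```r
# A few made-up credit grades; by default the levels sort alphabetically,
# which puts "A" before "AA".
grades <- factor(c("C", "AA", "HR", "A", "B", "AA", "NC"))
print(levels(grades))

# Rebuild the factor with AA as the best (first) grade.
grade_order <- c("AA", "A", "B", "C", "D", "E", "HR", "NC")
grades <- factor(grades, levels = grade_order)
print(levels(grades))
```

With the levels in this order, bar plots and box plots by grade are drawn from best to worst rating instead of alphabetically.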
The Prosper score distribution looks like a bell curve with peaks at 4, 6 and 8.
The Prosper rating peaks at C, and the plot is close to a bell curve.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 984.5 0.0 463881.0 7622
From the summary of the amount delinquent we can see that a large percentage of the values are 0. The plot shows a very high count at 0 and a long tail to the right, which means the distribution is right-skewed.
To get more detail from the data, I transformed the amount to a log10 scale and filtered out the NA and 0 values. The result shows a curve resembling a normal distribution, with a peak at around 800.
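The filtering and log transform can be sketched like this. The `amount` vector is made up for illustration, and the bins are computed with `plot = FALSE` so the snippet also runs without a graphics device; dropping that argument draws the histogram.

```r
# Made-up delinquent amounts, including zeros and missing values.
amount <- c(0, 0, NA, 472, 10056, 0, 85, 1200, NA, 640)

# log10(0) is -Inf and NA stays NA, so both are removed before transforming.
positive <- amount[!is.na(amount) & amount > 0]
log_amount <- log10(positive)

# Histogram bins on the log10 scale (computed, not drawn).
bins <- hist(log_amount, breaks = seq(1, 5, by = 0.5), plot = FALSE)
print(data.frame(lower = head(bins$breaks, -1), count = bins$counts))
```

On this scale a value of 3 corresponds to $1,000, so a peak near 800 would sit just below 3.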
There are 113,937 loan records in this dataset with 81 attributes. 20 of the variables are factors, categorized as follows.
ListingKey, GroupKey, LoanKey , ClosedDate, MemberKey
ListingCreationDate, DateCreditPulled, LoanOriginationDate, LoanOriginationQuarter, FirstRecordedCreditLine
CreditGrade,
original order:
A AA B C D E HR NC
ProsperRating..Alpha.
original order:
A AA B C D E HR
BorrowerState, Occupation, LoanStatus, EmploymentStatus,
IncomeVerifiable, IsBorrowerHomeowner, CurrentlyInGroup,
IncomeRange
I would like to divide the data into three sections: one for the listing, one for the borrowers and one for the lenders.
I found that debt-to-income ratio and delinquencies in the last 7 years have some unusual outliers, and the amount delinquent has a long right tail and a high percentage of 0 values.
For tidying and adjustment, I converted all blank values to NA when importing the dataset by adding na.strings = c("", "NA") to the loading function, so that these values are excluded at once. I also converted the loan origination date to POSIXlt format and then to a factor, so that I can easily separate loans into buckets or select a certain date range.
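The two tidying steps might look roughly like the sketch below. It reads a three-row inline CSV instead of the real file (whose name is not given here), and `csv_text`, `loans` and `OriginationYear` are names I made up. `stringsAsFactors = TRUE` is an assumption that matches the factor columns in the structure output above.

```r
# Tiny inline stand-in for the loan CSV; the real file is not used here.
csv_text <- paste(
  "LoanOriginationDate,CreditGrade,BorrowerAPR",
  "2007-08-26 00:00:00,C,0.165",
  "2014-03-03 00:00:00,,0.120",
  "2013-11-13 00:00:00,NA,",
  sep = "\n"
)

# na.strings turns both empty fields and the literal "NA" into missing values.
loans <- read.csv(text = csv_text, na.strings = c("", "NA"),
                  stringsAsFactors = TRUE)

# Parse the timestamp, then extract the year as a factor for grouping.
origination <- as.POSIXlt(as.character(loans$LoanOriginationDate),
                          format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
loans$OriginationYear <- factor(origination$year + 1900)

print(loans)
```

POSIXlt stores the year as an offset from 1900, which is why 1900 is added before converting to a factor.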
First I started exploring a subset of the data that includes the following variables:
BorrowerAPR, BorrowerRate, ProsperRating..numeric., ProsperScore, CreditScoreRangeLower, OpenCreditLines, InquiriesLast6Months, DelinquenciesLast7Years, OpenRevolvingAccounts, PublicRecordsLast12Months, ProsperPaymentsOneMonthPlusLate.
From the paired analysis, we can see that APR has a very strong correlation with borrower rate (0.989) and Prosper rating (-0.948), and a moderate correlation with Prosper score (-0.574) and credit score range (-0.434). All the other variables correlate only slightly with APR, with absolute coefficients below 0.2.
Among the other variables, Prosper rating, Prosper score and credit score range seem to correlate with each other. Open credit lines has a strong correlation with open revolving accounts, which makes sense.
So, within the subset we picked, the variables with the strongest correlation with APR can be ranked in the following order:
Borrower rate, Prosper rating, Prosper score, credit score range.
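A ranking like the one above can be reproduced from a correlation matrix. The sketch below uses synthetic columns that only borrow three of the variable names, so the coefficients will not match the reported ones. `use = "pairwise.complete.obs"` is my assumption for handling the NA values, not necessarily what the paired plot used.

```r
# Synthetic stand-in data with a built-in relationship between the columns.
set.seed(42)
n <- 200
rate <- runif(n, min = 0.05, max = 0.35)
subset_df <- data.frame(
  BorrowerRate = rate,
  BorrowerAPR = rate + 0.02 + rnorm(n, sd = 0.005),
  CreditScoreRangeLower = round(800 - 600 * rate + rnorm(n, sd = 40))
)
subset_df$CreditScoreRangeLower[1:10] <- NA  # mimic missing credit scores

# Pairwise Pearson correlations, skipping missing values pair by pair.
cors <- cor(subset_df, use = "pairwise.complete.obs")

# Correlations with APR, ranked by absolute strength.
apr_cors <- cors["BorrowerAPR", colnames(cors) != "BorrowerAPR"]
print(round(apr_cors[order(abs(apr_cors), decreasing = TRUE)], 3))
```

Ranking by absolute value matters here because the rating and score variables correlate negatively with APR.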
Since there are more than 80 variables in the dataset, it is not possible to list all of them in this paired analysis, so I would like to see how other variables interact with APR.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 22
The APR for 2005 doesn’t show up in the plot. Running a summary of the APR in 2005 shows that there is no data except NA for that year.
No obvious pattern is found in the plot, but there are some outliers in 2006 with APR higher than almost all the rest of the data. I decided to take a closer look at these outliers.
First I performed a borrower rate by year analysis to see how the borrower rate in 2006 compares to the other years. It shows a plot similar to the APR one, with very high outliers above 0.4 in 2006, so the APR outliers could result from high interest rates. But what is the reason behind the high interest rates?
Since the Prosper rating and score are not available for data earlier than 2009, I decided to see what the credit grade is for these outliers in 2006. The plot here shows the credit grade for APR above the 0.99 quantile. The APR values in the plot should be those outliers above 0.4, and most of the credit grades here are HR (high risk). So, does that mean the high-risk credit grade is the reason behind the APR and interest rate outliers?
To answer the question, I took a closer look at the credit grade before 2009 and the “HR” credit grade in each year. It turned out that there are more “HR” credit grades in 2007, and there is not much evidence that the credit grade in 2006 caused the outliers.
As per our preliminary analysis, credit score range also correlates with APR and interest rate, and the plot shows something interesting: there is no credit score range for the APR values above 0.4 in 2006. I want to compare this with the data in other years to see the difference.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 10
From this credit score range and APR plot, faceted by year from 2006 to 2014, we can see that there is no credit score range associated with the APR outliers (above 0.4) in 2006, and there do not seem to be any credit scores for APR values higher than 0.32, which is abnormal compared with the plots for other years. The summary of the credit score range of these APR outliers in 2006 also shows only NA.
I think it is appropriate to assume that the APR outliers in 2006 could be related to the lack of a credit score.
Then I started to look at how other variables interact with APR. The loan term doesn’t have much impact on APR.
Credit grade has a strong correlation with APR, like Prosper rating and Prosper score.
Whether the borrower is a homeowner affects the APR slightly; it is reasonable that a homeowner gets a lower APR.
Whether the borrower’s income is verifiable affects the mean APR by almost 0.05.
The categorical variable employment status doesn’t show a specific pattern, but it makes sense that full-time and part-time employed borrowers get a better APR.
The debt-to-income ratio groups don’t show any clear relationship between the two variables.
The plot of amount delinquent against APR doesn’t show any obvious correlation.
But I would like to perform further analysis on these two variables, because I strongly suspect that they are correlated somehow.
So I grouped the data by AmountDelinquent and added the mean APR of each group to the subset. However, the result still doesn’t show any clear relationship.
As with the amount delinquent, I decided to look closer at how inquiries affect APR. The initial plot doesn’t show any clear relationship between the two variables.
However, when I grouped the inquiries and added the mean APR of each group, the plot started to show something interesting, but the relationship was still not clear.
After I removed the outliers above the 0.99 quantile of inquiries in the last 6 months, the result became clear. This would be helpful if I were to build a model to predict the APR.
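The group-then-trim approach for inquiries can be sketched in base R as below. The data is simulated with a built-in upward trend, so it only illustrates the mechanics; `loans`, `trimmed` and `apr_by_inq` are hypothetical names, and the report may have used a different grouping function.

```r
# Simulated inquiries with a few extreme outliers, and an APR that rises with
# the number of inquiries (this relationship is built in for illustration).
set.seed(7)
n <- 1000
inquiries <- c(rpois(n - 5, lambda = 1.4), 40, 55, 70, 90, 105)
apr <- 0.18 + 0.01 * pmin(inquiries, 8) + rnorm(n, sd = 0.05)
loans <- data.frame(InquiriesLast6Months = inquiries, BorrowerAPR = apr)

# Drop rows above the 99% quantile of inquiries.
cutoff <- quantile(loans$InquiriesLast6Months, probs = 0.99, na.rm = TRUE)
trimmed <- loans[loans$InquiriesLast6Months <= cutoff, ]

# Mean APR for each inquiry count.
apr_by_inq <- aggregate(BorrowerAPR ~ InquiriesLast6Months,
                        data = trimmed, FUN = mean)
print(apr_by_inq)
```

Averaging within each inquiry count removes the per-loan noise, and trimming keeps a handful of sparse, extreme groups from dominating the picture.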
There is no obvious pattern in the income range to APR plot, but the fact that the $0 income range gets a lower APR than most of the other income ranges is quite surprising. After examining the distribution of the $0 income range across years, we can see that most of the $0 income range records came from 2007 and 2008, the two years with a lower mean APR than the others, which explains the result.
##
## Pearson's product-moment correlation
##
## data: data$BorrowerRate and data$BorrowerAPR
## t = 2347.7, df = 113910, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9897057 0.9899409
## sample estimates:
## cor
## 0.989824
Finally, I would like to take a closer look at the relationship between borrower APR and interest rate. The plot shows that the correlation is very strong and positive, as expected. The parallel layers with the same slope may indicate different service fees and other fees charged; this is the part that needs further investigation. As the calculation shows, the correlation coefficient is almost 0.99, and it is the strongest relationship I have found.
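The output above has the form produced by `cor.test()`. The sketch below runs the same test on simulated data in which APR equals the rate plus one of three fixed gaps. That gap structure is purely my assumption, chosen to mimic parallel layers with the same slope; it is not a finding about Prosper's fees.

```r
# Simulated rates, with APR = rate + one of three hypothetical fee gaps.
set.seed(3)
rate <- runif(500, min = 0.05, max = 0.35)
fee_gap <- sample(c(0.01, 0.02, 0.03), size = 500, replace = TRUE)
data <- data.frame(BorrowerRate = rate, BorrowerAPR = rate + fee_gap)

# Pearson correlation test (the default method of cor.test).
result <- cor.test(data$BorrowerRate, data$BorrowerAPR)

print(result$estimate)   # correlation coefficient
print(result$conf.int)   # 95 percent confidence interval
print(result$parameter)  # degrees of freedom, n - 2
```

The degrees of freedom equal the number of complete pairs minus 2, which is consistent with the real output: 113,937 rows minus the 25 missing APR values, minus 2, gives 113,910.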
My first observation from the matrix is that open revolving accounts and open credit lines have a strong correlation, above 0.9.
Then I explored more about the APR. It turns out that credit grade, Prosper rating, and Prosper score all have a strong correlation with the borrower’s APR.
Whether the borrower is a homeowner, whether income is verifiable, employment status, and loan term have a moderate correlation with borrower APR.
The other variables I explored, such as inquiries in the last 6 months, current credit lines, and delinquencies in the last 7 years, correlate with the mean borrower APR within each group.
There are also variables that have little or no clear correlation with the APR, for example income range and debt-to-income ratio.
One interesting thing I noticed is in the income range and borrower APR plot: the $0 income range gets a better APR than most of the other income ranges, which doesn’t seem to make sense.
After examining the distribution of the $0 income range across years, I found the reason: most of the $0 income range records came from 2007 and 2008, the two years that have a lower mean APR than most of the other years.
The strongest relationship I found is between the borrower APR and the borrower interest rate, with a correlation of almost 0.99.
By analyzing the relationship between APR and Prosper rating/credit grade in each year, we can get a better idea of how the year affects the APR. It is easy to understand that differences in interest rates between years could be the main reason why the APR differs between periods even for the same rating/grade, but could there be any other reason? I would like to look at the relationship between APR and BorrowerRate (interest rate).
From the plot above, it looks like the relationship between APR and interest rate varies across years. For the same interest rate, the APR seems to be higher in 2012-2014 than in 2006. I would like to perform another analysis to get more detail.
I created a new variable measuring the ratio of APR to interest rate to get more information on the relationship between the two. From the result, it is clear that the ratio is higher in 2011-2014 than in 2006-2008, meaning that the same interest rate is paired with a higher APR.
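A ratio variable of this kind could be built as in the sketch below. The eight rows are invented for illustration, `AprToRate` is a hypothetical column name, and the guard against a zero rate is my own addition (the summary above shows a minimum BorrowerRate of 0, which would otherwise give an infinite ratio).

```r
# Invented rows: later years carry a larger APR for the same rate.
loans <- data.frame(
  Year = factor(c(2006, 2006, 2007, 2008, 2012, 2013, 2014, 2014)),
  BorrowerRate = c(0.20, 0.15, 0.18, 0.10, 0.20, 0.15, 0.18, 0.10),
  BorrowerAPR  = c(0.207, 0.157, 0.187, 0.107, 0.238, 0.186, 0.217, 0.128)
)

# Ratio of APR to rate; rows with a zero rate get NA instead of Inf.
loans$AprToRate <- ifelse(loans$BorrowerRate > 0,
                          loans$BorrowerAPR / loans$BorrowerRate,
                          NA)

# Mean ratio per origination year.
ratio_by_year <- tapply(loans$AprToRate, loans$Year, mean, na.rm = TRUE)
print(round(ratio_by_year, 3))
```

A ratio close to 1 means the APR is nearly equal to the interest rate, and larger values mean more cost on top of the rate.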
The APR x Rate x Prosper Score plot doesn’t show any relationship between prosper score and the different APR to Rate lines.
Credit score doesn’t seem to show any relationship between different APR to Rate lines too.
Although APR and interest rate doesn’t have clear correlation with credit rating, it seems that it has correlation with employment stauts.
The result matches the previous plot, and it indicates that the gap between APR and the actual interest rate may not consist only of standard service fees: borrowers with a full-time job seem more likely to get an APR closer to the actual interest rate.
However, when I made a similar plot of delinquencies in the last 7 years against the APR-to-interest-rate ratio, the result was surprising: the more delinquencies a borrower has, the lower the APR is relative to the interest rate.
I then looked at the grouped plot again, replacing the APR-to-interest-rate ratio with the mean interest rate, and found that beyond roughly 50 delinquencies in the last 7 years the interest rate is no longer correlated with the number of delinquencies. I think the reason is that interest rate doesn’t really correlate with these attributes, so this may not be a good method for analyzing the APR-to-interest-rate relationship.
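The grouping step can be sketched like this. It is a rough illustration only: the bin width of 10 and the `(delinquencies, rate)` tuple layout are assumptions, since the grouping actually used in the analysis is not stated. The loan count per bin is returned alongside the mean because a mean computed from very few loans is noisy, which is worth checking before reading anything into the high-delinquency groups.

```python
from collections import defaultdict
from statistics import mean


def mean_rate_by_delinquency_bin(records, bin_width=10):
    """Group loans into delinquency bins and summarize the interest rate.

    `records` is an iterable of (delinquencies, rate) tuples (assumed layout).
    Returns {bin_start: (loan_count, mean_rate)}, ordered by bin_start.
    """
    bins = defaultdict(list)
    for delinquencies, rate in records:
        bin_start = (delinquencies // bin_width) * bin_width
        bins[bin_start].append(rate)
    return {
        start: (len(rates), mean(rates))
        for start, rates in sorted(bins.items())
    }


# Made-up toy loans: (delinquencies in last 7 years, interest rate).
toy_loans = [(0, 0.15), (3, 0.17), (12, 0.20), (18, 0.22), (55, 0.19)]
summary = mean_rate_by_delinquency_bin(toy_loans)
```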
Whether a borrower is a homeowner or not doesn’t affect the APR-to-rate ratio.
The income range doesn’t have a clear correlation with the APR-to-interest-rate ratio.
First, I observed the relationship between APR and both Prosper rating and credit grade on a yearly basis. The result is very consistent and as expected: the better the rating/grade, the lower the APR is likely to be.
Another relationship is between APR and interest rate. The plots show that it varies by year: the same interest rate in 2012-2014 seems to correspond to a higher APR than it did in 2006-2008. In addition, it is moderately related to employment status: a person with a full-time job seems to have an APR much closer to the interest rate than borrowers with other employment statuses.
The surprising interaction is in the plot of delinquencies in the last 7 years against the APR-to-interest-rate ratio. The plot shows that the more delinquencies a borrower has, the lower the APR-to-interest-rate ratio, meaning that the APR gets closer to the interest rate.
By grouping loans by delinquencies in the last 7 years and computing the mean interest rate within each group, we can see that the mean interest rate becomes uncorrelated with delinquencies after a certain point, even with the extremely high values removed.
The original amount delinquent plot is highly skewed, with an extremely high count at 0, but after rescaling the data to remove the extreme values and transforming to a log scale, it shows a bell curve with a peak around 800.
It is hard to find any relationship between the two variables in the original borrower APR and inquiries last 6 months plot. After rescaling the data to remove extremely high values and transforming to a log scale, the linear model line clearly shows the relationship.
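The trimming and log transformation used for the two plots above can be sketched as follows. The exact cutoff used in the analysis is not stated, so the 99th percentile default here is an assumption, as is the simple index-based quantile. Zeros are dropped because they have no logarithm, which is also why the large spike at 0 disappears on the log scale.

```python
import math


def trim_and_log10(values, upper_quantile=0.99):
    """Drop non-positive and extreme high values, then take log10.

    The cutoff is the value at `upper_quantile` of the sorted positive
    values (assumed rule; the original analysis may have used another).
    """
    positive = sorted(v for v in values if v > 0)
    if not positive:
        return []
    cutoff = positive[int(upper_quantile * (len(positive) - 1))]
    return [math.log10(v) for v in positive if v <= cutoff]


# Made-up toy amounts: many zeros, a few typical values, one extreme value.
toy_amounts = [0, 0, 10, 100, 1000, 1_000_000]
logged = trim_and_log10(toy_amounts, upper_quantile=0.75)
```

With the toy input, the zeros and the single extreme value are removed and only the three typical amounts remain on the log scale.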
This plot clearly shows the relationship between borrower APR and borrower rate and, surprisingly, reveals how the borrower’s employment status interacts with the two variables.
The Prosper loan dataset contains 113,937 Prosper loan records with 81 variables, originated from 2005 to 2014. Since it is a large dataset, I started by learning the basics about Prosper loans while exploring the variables. After some initial analysis, I focused on the borrower’s APR because I think APR is the key variable in this dataset and one of the most important ratios in any loan.
Many variables interact with APR. Some have a clear correlation, like borrower rate, credit grade, and Prosper score; others show a correlation only after adjustment, like delinquencies last 7 years, current credit lines, and inquiries last 6 months. Among these variables, borrower rate is the most obvious because of its strong correlation (almost 0.99), but after exploring the plot of the two, I noticed several separate lines with the same slope, meaning that the same borrower rate maps to different levels of APR even though the rate at which the two move together is almost the same.
As I continued exploring the relationship between APR and borrower rate, I found that loan origination year and employment status are reflected in these intervals. Service fees and related taxes could differ between years, which could explain the gap between APR and borrower rate, but the employment status pattern could mean that the fee also depends on some conditions of the borrower.
I didn’t build a linear model to predict the APR, but I think more analysis is needed to reveal the variables that affect the relationship between borrower rate and APR in order to improve such a model.