Personal Loan Analysis

This dataset contains more than 110,000 loan records with 81 variables. In this example I start by learning the basics of the data structure and variables, and then focus on exploring the relationship and correlation between APR and other variables.

Personal Loan Exploration

This report explores a dataset containing more than 100,000 loan records with more than 80 attributes for each loan.

Univariate Plots Section

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 8 levels "AA","A","B","C",..: 4 NA 7 NA NA NA NA NA NA NA ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2802 levels "2005-11-25 00:00:00",..: 1137 NA 1262 NA NA NA NA NA NA NA ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 7 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 51 levels "AK","AL","AR",..: 6 6 11 11 24 33 17 5 15 15 ...
##  $ Occupation                         : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
##  $ EmploymentStatus                   : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 706 levels "00343376901312423168731",..: NA NA 334 NA NA NA NA NA NA NA ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11585 levels "1947-08-24 00:00:00",..: 8638 6616 8926 2246 9497 496 8264 7684 5542 5542 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
##                    ListingKey     ListingNumber    
##  17A93590655669644DB4C06:     6   Min.   :      4  
##  349D3587495831350F0F648:     4   1st Qu.: 400919  
##  47C1359638497431975670B:     4   Median : 600554  
##  8474358854651984137201C:     4   Mean   : 627886  
##  DE8535960513435199406CE:     4   3rd Qu.: 892634  
##  04C13599434217079754AEE:     3   Max.   :1255725  
##  (Other)                :113912                    
##                     ListingCreationDate  CreditGrade         Term      
##  2013-10-02 17:20:16.550000000:     6   C      : 5649   Min.   :12.00  
##  2013-08-28 20:31:41.107000000:     4   D      : 5153   1st Qu.:36.00  
##  2013-09-08 09:27:44.853000000:     4   B      : 4389   Median :36.00  
##  2013-12-06 05:43:13.830000000:     4   AA     : 3509   Mean   :40.83  
##  2013-12-06 11:44:58.283000000:     4   HR     : 3508   3rd Qu.:36.00  
##  2013-08-21 07:25:22.360000000:     3   (Other): 6745   Max.   :60.00  
##  (Other)                      :113912   NA's   :84984                  
##                  LoanStatus                  ClosedDate   
##  Current              :56576   2014-03-04 00:00:00:  105  
##  Completed            :38074   2014-02-19 00:00:00:  100  
##  Chargedoff           :11992   2014-02-11 00:00:00:   92  
##  Defaulted            : 5018   2012-10-30 00:00:00:   81  
##  Past Due (1-15 days) :  806   2013-02-26 00:00:00:   78  
##  Past Due (31-60 days):  363   (Other)            :54633  
##  (Other)              : 1108   NA's               :58848  
##   BorrowerAPR       BorrowerRate     LenderYield     
##  Min.   :0.00653   Min.   :0.0000   Min.   :-0.0100  
##  1st Qu.:0.15629   1st Qu.:0.1340   1st Qu.: 0.1242  
##  Median :0.20976   Median :0.1840   Median : 0.1730  
##  Mean   :0.21883   Mean   :0.1928   Mean   : 0.1827  
##  3rd Qu.:0.28381   3rd Qu.:0.2500   3rd Qu.: 0.2400  
##  Max.   :0.51229   Max.   :0.4975   Max.   : 0.4925  
##  NA's   :25                                          
##  EstimatedEffectiveYield EstimatedLoss   EstimatedReturn 
##  Min.   :-0.183          Min.   :0.005   Min.   :-0.183  
##  1st Qu.: 0.116          1st Qu.:0.042   1st Qu.: 0.074  
##  Median : 0.162          Median :0.072   Median : 0.092  
##  Mean   : 0.169          Mean   :0.080   Mean   : 0.096  
##  3rd Qu.: 0.224          3rd Qu.:0.112   3rd Qu.: 0.117  
##  Max.   : 0.320          Max.   :0.366   Max.   : 0.284  
##  NA's   :29084           NA's   :29084   NA's   :29084   
##  ProsperRating..numeric. ProsperRating..Alpha.  ProsperScore  
##  Min.   :1.000           C      :18345         Min.   : 1.00  
##  1st Qu.:3.000           B      :15581         1st Qu.: 4.00  
##  Median :4.000           A      :14551         Median : 6.00  
##  Mean   :4.072           D      :14274         Mean   : 5.95  
##  3rd Qu.:5.000           E      : 9795         3rd Qu.: 8.00  
##  Max.   :7.000           (Other):12307         Max.   :11.00  
##  NA's   :29084           NA's   :29084         NA's   :29084  
##  ListingCategory..numeric. BorrowerState                 Occupation   
##  Min.   : 0.000            CA     :14717   Other              :28617  
##  1st Qu.: 1.000            TX     : 6842   Professional       :13628  
##  Median : 1.000            NY     : 6729   Computer Programmer: 4478  
##  Mean   : 2.774            FL     : 6720   Executive          : 4311  
##  3rd Qu.: 3.000            IL     : 5921   Teacher            : 3759  
##  Max.   :20.000            (Other):67493   (Other)            :55556  
##                            NA's   : 5515   NA's               : 3588  
##       EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner
##  Employed     :67322   Min.   :  0.00           False:56459        
##  Full-time    :26355   1st Qu.: 26.00           True :57478        
##  Self-employed: 6134   Median : 67.00                              
##  Not available: 5347   Mean   : 96.07                              
##  Other        : 3806   3rd Qu.:137.00                              
##  (Other)      : 2718   Max.   :755.00                              
##  NA's         : 2255   NA's   :7625                                
##  CurrentlyInGroup                    GroupKey     
##  False:101218     783C3371218786870A73D20:  1140  
##  True : 12719     3D4D3366260257624AB272D:   916  
##                   6A3B336601725506917317E:   698  
##                   FEF83377364176536637E50:   611  
##                   C9643379247860156A00EC0:   342  
##                   (Other)                :  9634  
##                   NA's                   :100596  
##             DateCreditPulled  CreditScoreRangeLower CreditScoreRangeUpper
##  2013-12-23 09:38:12:     6   Min.   :  0.0         Min.   : 19.0        
##  2013-11-21 09:09:41:     4   1st Qu.:660.0         1st Qu.:679.0        
##  2013-12-06 05:43:16:     4   Median :680.0         Median :699.0        
##  2014-01-14 20:17:49:     4   Mean   :685.6         Mean   :704.6        
##  2014-02-09 12:14:41:     4   3rd Qu.:720.0         3rd Qu.:739.0        
##  2013-09-27 22:04:54:     3   Max.   :880.0         Max.   :899.0        
##  (Other)            :113912   NA's   :591           NA's   :591          
##         FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
##  1993-12-01 00:00:00:   185     Min.   : 0.00      Min.   : 0.00  
##  1994-11-01 00:00:00:   178     1st Qu.: 7.00      1st Qu.: 6.00  
##  1995-11-01 00:00:00:   168     Median :10.00      Median : 9.00  
##  1990-04-01 00:00:00:   161     Mean   :10.32      Mean   : 9.26  
##  1995-03-01 00:00:00:   159     3rd Qu.:13.00      3rd Qu.:12.00  
##  (Other)            :112389     Max.   :59.00      Max.   :54.00  
##  NA's               :   697     NA's   :7604       NA's   :7604   
##  TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   :  2.00             Min.   : 0.00        
##  1st Qu.: 17.00             1st Qu.: 4.00        
##  Median : 25.00             Median : 6.00        
##  Mean   : 26.75             Mean   : 6.97        
##  3rd Qu.: 35.00             3rd Qu.: 9.00        
##  Max.   :136.00             Max.   :51.00        
##  NA's   :697                                     
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries   
##  Min.   :    0.0             Min.   :  0.000      Min.   :  0.000  
##  1st Qu.:  114.0             1st Qu.:  0.000      1st Qu.:  2.000  
##  Median :  271.0             Median :  1.000      Median :  4.000  
##  Mean   :  398.3             Mean   :  1.435      Mean   :  5.584  
##  3rd Qu.:  525.0             3rd Qu.:  2.000      3rd Qu.:  7.000  
##  Max.   :14985.0             Max.   :105.000      Max.   :379.000  
##                              NA's   :697          NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   DelinquenciesLast7Years
##  Min.   : 0.0000      Min.   :     0.0   Min.   : 0.000         
##  1st Qu.: 0.0000      1st Qu.:     0.0   1st Qu.: 0.000         
##  Median : 0.0000      Median :     0.0   Median : 0.000         
##  Mean   : 0.5921      Mean   :   984.5   Mean   : 4.155         
##  3rd Qu.: 0.0000      3rd Qu.:     0.0   3rd Qu.: 3.000         
##  Max.   :83.0000      Max.   :463881.0   Max.   :99.000         
##  NA's   :697          NA's   :7622       NA's   :990            
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   : 0.0000          Min.   : 0.000            Min.   :      0       
##  1st Qu.: 0.0000          1st Qu.: 0.000            1st Qu.:   3121       
##  Median : 0.0000          Median : 0.000            Median :   8549       
##  Mean   : 0.3126          Mean   : 0.015            Mean   :  17599       
##  3rd Qu.: 0.0000          3rd Qu.: 0.000            3rd Qu.:  19521       
##  Max.   :38.0000          Max.   :20.000            Max.   :1435667       
##  NA's   :697              NA's   :7604              NA's   :7604          
##  BankcardUtilization AvailableBankcardCredit  TotalTrades    
##  Min.   :0.000       Min.   :     0          Min.   :  0.00  
##  1st Qu.:0.310       1st Qu.:   880          1st Qu.: 15.00  
##  Median :0.600       Median :  4100          Median : 22.00  
##  Mean   :0.561       Mean   : 11210          Mean   : 23.23  
##  3rd Qu.:0.840       3rd Qu.: 13180          3rd Qu.: 30.00  
##  Max.   :5.950       Max.   :646285          Max.   :126.00  
##  NA's   :7604        NA's   :7544            NA's   :7544    
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months
##  Min.   :0.000                      Min.   : 0.000         
##  1st Qu.:0.820                      1st Qu.: 0.000         
##  Median :0.940                      Median : 0.000         
##  Mean   :0.886                      Mean   : 0.802         
##  3rd Qu.:1.000                      3rd Qu.: 1.000         
##  Max.   :1.000                      Max.   :20.000         
##  NA's   :7544                       NA's   :7544           
##  DebtToIncomeRatio         IncomeRange    IncomeVerifiable
##  Min.   : 0.000    $25,000-49,999:32192   False:  8669    
##  1st Qu.: 0.140    $50,000-74,999:31050   True :105268    
##  Median : 0.220    $100,000+     :17337                   
##  Mean   : 0.276    $75,000-99,999:16916                   
##  3rd Qu.: 0.320    Not displayed : 7741                   
##  Max.   :10.010    $1-24,999     : 7274                   
##  NA's   :8554      (Other)       : 1427                   
##  StatedMonthlyIncome                    LoanKey       TotalProsperLoans
##  Min.   :      0     CB1B37030986463208432A1:     6   Min.   :0.00     
##  1st Qu.:   3200     2DEE3698211017519D7333F:     4   1st Qu.:1.00     
##  Median :   4667     9F4B37043517554537C364C:     4   Median :1.00     
##  Mean   :   5608     D895370150591392337ED6D:     4   Mean   :1.42     
##  3rd Qu.:   6825     E6FB37073953690388BC56D:     4   3rd Qu.:2.00     
##  Max.   :1750003     0D8F37036734373301ED419:     3   Max.   :8.00     
##                      (Other)                :113912   NA's   :91852    
##  TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   :  0.00             Min.   :  0.00       
##  1st Qu.:  9.00             1st Qu.:  9.00       
##  Median : 16.00             Median : 15.00       
##  Mean   : 22.93             Mean   : 22.27       
##  3rd Qu.: 33.00             3rd Qu.: 32.00       
##  Max.   :141.00             Max.   :141.00       
##  NA's   :91852              NA's   :91852        
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   : 0.00                       Min.   : 0.00                  
##  1st Qu.: 0.00                       1st Qu.: 0.00                  
##  Median : 0.00                       Median : 0.00                  
##  Mean   : 0.61                       Mean   : 0.05                  
##  3rd Qu.: 0.00                       3rd Qu.: 0.00                  
##  Max.   :42.00                       Max.   :21.00                  
##  NA's   :91852                       NA's   :91852                  
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   :    0            Min.   :    0              
##  1st Qu.: 3500            1st Qu.:    0              
##  Median : 6000            Median : 1627              
##  Mean   : 8472            Mean   : 2930              
##  3rd Qu.:11000            3rd Qu.: 4127              
##  Max.   :72499            Max.   :23451              
##  NA's   :91852            NA's   :91852              
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-209.00             Min.   :   0.0           
##  1st Qu.: -35.00             1st Qu.:   0.0           
##  Median :  -3.00             Median :   0.0           
##  Mean   :  -3.22             Mean   : 152.8           
##  3rd Qu.:  25.00             3rd Qu.:   0.0           
##  Max.   : 286.00             Max.   :2704.0           
##  NA's   :95009                                        
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber    
##  Min.   : 0.00                 Min.   :  0.0              Min.   :     1  
##  1st Qu.: 9.00                 1st Qu.:  6.0              1st Qu.: 37332  
##  Median :14.00                 Median : 21.0              Median : 68599  
##  Mean   :16.27                 Mean   : 31.9              Mean   : 69444  
##  3rd Qu.:22.00                 3rd Qu.: 65.0              3rd Qu.:101901  
##  Max.   :44.00                 Max.   :100.0              Max.   :136486  
##  NA's   :96985                                                            
##  LoanOriginalAmount          LoanOriginationDate LoanOriginationQuarter
##  Min.   : 1000      2014-01-22 00:00:00:   491   Q4 2013:14450         
##  1st Qu.: 4000      2013-11-13 00:00:00:   490   Q1 2014:12172         
##  Median : 6500      2014-02-19 00:00:00:   439   Q3 2013: 9180         
##  Mean   : 8337      2013-10-16 00:00:00:   434   Q2 2013: 7099         
##  3rd Qu.:12000      2014-01-28 00:00:00:   339   Q3 2012: 5632         
##  Max.   :35000      2013-09-24 00:00:00:   316   Q2 2012: 5061         
##                     (Other)            :111428   (Other):60343         
##                    MemberKey      MonthlyLoanPayment LP_CustomerPayments
##  63CA34120866140639431C9:     9   Min.   :   0.0     Min.   :   -2.35   
##  16083364744933457E57FB9:     8   1st Qu.: 131.6     1st Qu.: 1005.76   
##  3A2F3380477699707C81385:     8   Median : 217.7     Median : 2583.83   
##  4D9C3403302047712AD0CDD:     8   Mean   : 272.5     Mean   : 4183.08   
##  739C338135235294782AE75:     8   3rd Qu.: 371.6     3rd Qu.: 5548.40   
##  7E1733653050264822FAA3D:     8   Max.   :2251.5     Max.   :40702.39   
##  (Other)                :113888                                         
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-664.87  
##  1st Qu.:  500.9              1st Qu.:  274.87   1st Qu.: -73.18  
##  Median : 1587.5              Median :  700.84   Median : -34.44  
##  Mean   : 3105.5              Mean   : 1077.54   Mean   : -54.73  
##  3rd Qu.: 4000.0              3rd Qu.: 1458.54   3rd Qu.: -13.92  
##  Max.   :35000.0              Max.   :15617.03   Max.   :  32.06  
##                                                                   
##  LP_CollectionFees  LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-9274.75   Min.   :  -94.2       Min.   : -954.5    
##  1st Qu.:    0.00   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.00   Median :    0.0       Median :    0.0    
##  Mean   :  -14.24   Mean   :  700.4       Mean   :  681.4    
##  3rd Qu.:    0.00   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.00   Max.   :25000.0       Max.   :25000.0    
##                                                              
##  LP_NonPrincipalRecoverypayments PercentFunded    Recommendations   
##  Min.   :    0.00                Min.   :0.7000   Min.   : 0.00000  
##  1st Qu.:    0.00                1st Qu.:1.0000   1st Qu.: 0.00000  
##  Median :    0.00                Median :1.0000   Median : 0.00000  
##  Mean   :   25.14                Mean   :0.9986   Mean   : 0.04803  
##  3rd Qu.:    0.00                3rd Qu.:1.0000   3rd Qu.: 0.00000  
##  Max.   :21117.90                Max.   :1.0125   Max.   :39.00000  
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   : 0.00000           Min.   :    0.00            Min.   :   1.00  
##  1st Qu.: 0.00000           1st Qu.:    0.00            1st Qu.:   2.00  
##  Median : 0.00000           Median :    0.00            Median :  44.00  
##  Mean   : 0.02346           Mean   :   16.55            Mean   :  80.48  
##  3rd Qu.: 0.00000           3rd Qu.:    0.00            3rd Qu.: 115.00  
##  Max.   :33.00000           Max.   :25000.00            Max.   :1189.00  
## 

Our dataset consists of 81 variables, with more than 110,000 observations.

First I explored some attributes of the loans to get a basic idea of the Prosper loans in this dataset. From the loan amount plots, we can see that most of the loans are below $15,000, with a few records above $30,000.

The term is either 12, 36 or 60 months; most loans have a 36-month term.

I converted the origination date to POSIXlt format, extracted the year, and converted it to a factor. Here we can see that few records are from 2005, and the highest number of loans is in 2013.
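The date conversion described above can be sketched in base R as follows. This is a minimal sketch on a hypothetical four-row sample; the column name OriginationYear and the sample values are assumptions for illustration, not the code used for the plot.

```r
# Hypothetical sample of LoanOriginationDate values in the dataset's format.
loans <- data.frame(
  LoanOriginationDate = c("2005-11-15 00:00:00", "2013-10-16 00:00:00",
                          "2013-11-13 00:00:00", "2014-01-22 00:00:00"),
  stringsAsFactors = TRUE
)

# Parse the factor as POSIXlt; the $year field counts years since 1900.
origination <- as.POSIXlt(as.character(loans$LoanOriginationDate),
                          format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# OriginationYear is an assumed column name for the extracted year.
loans$OriginationYear <- factor(origination$year + 1900)

# Count loans per year.
year_counts <- table(loans$OriginationYear)
print(year_counts)
```

Storing the year as a factor makes it easy to use as a grouping variable in later plots.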

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

Most APR values range from 0.05 to 0.42 (5% to 42%), with a peak at around 0.36 (36%).

Most borrower rates range from 0.05 to 0.35 (5% to 35%), with a peak at around 0.32 (32%).

The top 3 categories are debt consolidation, home improvement and business.

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

Around half of the loans are current. Note that there are 11,992 charged-off loans.
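To put numbers on "around half", the shares can be worked out directly from the counts in the table above. This is a small sketch; the variable names are my own.

```r
# Loan status counts copied from the table above.
status_counts <- c(
  "Cancelled" = 5, "Chargedoff" = 11992, "Completed" = 38074,
  "Current" = 56576, "Defaulted" = 5018, "FinalPaymentInProgress" = 205,
  "Past Due (>120 days)" = 16, "Past Due (1-15 days)" = 806,
  "Past Due (16-30 days)" = 265, "Past Due (31-60 days)" = 363,
  "Past Due (61-90 days)" = 313, "Past Due (91-120 days)" = 304
)

# Share of each status as a percentage of all loans, largest first.
status_share <- round(100 * status_counts / sum(status_counts), 1)
sort(status_share, decreasing = TRUE)
```

With these counts, current loans come to about 49.7% of all records and charged-off loans to about 10.5%.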

Here I started to look at the variables about the borrower. From the plot, the income ranges $25,000-49,999 and $50,000-74,999 have the highest counts.

A high percentage of the borrowers are employed.

The number of homeowners is slightly higher than the number of non-homeowners (57,478 vs. 56,459).

There is a high count for inquiries in the last 6 months at or close to 0, but note that the outliers range from around 3 to above 100.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.32   13.00   59.00    7604

The mean number of current credit lines is 10.32 and the 3rd quartile is 13, but there are some high values beyond 40, so I limited the plot to credit lines of 40 and above to see these high values in more detail.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990

Similar to the credit lines, some extremely high values can be found in the delinquencies in the last 7 years. I made a plot of the top 95% of the delinquency values to identify the range.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Debt-to-income ratio is another variable with extremely high values. The histogram is heavily skewed, and the extreme values are quite unusual: some records have a debt-to-income ratio of around 1000%, which is unlikely to be true considering that 45% is risky enough for some lenders.

To find out more detail, I plotted the values that are above the 99% quantile. They range from around 0.8 (80%) to 10.0 (1000%).
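Isolating the upper tail of a variable like this can be done with quantile(). The sketch below uses made-up debt-to-income values, so the cutoff it prints will not match the real data.

```r
# Hypothetical debt-to-income ratios with a few extreme values and one NA.
set.seed(1)
dti <- c(runif(995, min = 0, max = 0.6), 0.9, 1.5, 3.0, 7.5, 10.01, NA)

# 99% quantile, ignoring missing values.
cutoff <- quantile(dti, probs = 0.99, na.rm = TRUE)

# Keep only the non-missing records above the cutoff.
dti_tail <- dti[!is.na(dti) & dti > cutoff]

# Range of the tail that would be passed on to the plot.
range(dti_tail)
```

Passing na.rm = TRUE matters here, because quantile() stops with an error when the input contains NA values.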

The original credit grade levels were ordered A, AA, B, C, D, E, HR, NC; I moved AA to be the first level.
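One way to change the level order is to rebuild the factor with explicit levels. This is a sketch on a hypothetical vector of grades, not the exact code behind the plot.

```r
# Hypothetical credit grades; factor levels sort alphabetically by default,
# which puts "A" ahead of "AA".
grades <- factor(c("A", "AA", "B", "HR", "C", "AA", "NC", "D", "E"))
levels(grades)

# Rebuild the factor with an explicit level order, best grade first.
grade_order <- c("AA", "A", "B", "C", "D", "E", "HR", "NC")
grades <- factor(grades, levels = grade_order)
levels(grades)
```

Only the level order changes; the grade attached to each record stays the same, so bar plots simply draw the bars in the new order.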

The Prosper score distribution looks like a bell curve, with peaks at 4, 6 and 8.

The Prosper rating peaks at C, and the plot is close to a bell curve.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0      0.0      0.0    984.5      0.0 463881.0     7622

From the summary of the amount delinquent we can see that a large percentage of the values are 0. The plot shows a very high count at 0 and a long tail to the right, which means the distribution is skewed to the right.

To get more detail from the data, I filtered out the NA and 0 values and transformed the amount to a log10 scale. The result is a curve resembling a normal distribution, with a peak at around 800.
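The filter-then-transform step can be sketched as below. The amounts are simulated under an assumed log-normal shape purely for illustration, so the peak it reports is not a result from the real data.

```r
# Hypothetical AmountDelinquent values: mostly zero, some NA,
# and a right-skewed positive part (assumed log-normal for the sketch).
set.seed(42)
amount <- c(rep(0, 800), rep(NA, 50),
            rlnorm(150, meanlog = log(800), sdlog = 1.5))

# Drop NA and zero values (log10(0) is -Inf), then transform.
positive <- amount[!is.na(amount) & amount > 0]
log_amount <- log10(positive)

# Bin the transformed values; drop plot = FALSE to draw the histogram.
h <- hist(log_amount, breaks = 20, plot = FALSE)

# Midpoint of the fullest bin, converted back to the original scale.
peak_mid <- h$mids[which.max(h$counts)]
10^peak_mid
```

Filtering before the transform avoids infinite values, which hist() would otherwise reject.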





Univariate Analysis

Dataset Structure

There are 113,937 loan records in this dataset with 81 attributes. 20 of the variables are factors, categorized as follows.

Key:

ListingKey, GroupKey, LoanKey, MemberKey

Date:

ListingCreationDate, DateCreditPulled, LoanOriginationDate, LoanOriginationQuarter, FirstRecordedCreditLine, ClosedDate

Rating:

CreditGrade
original order:
A AA B C D E HR NC

ProsperRating..Alpha.
original order:
A AA B C D E HR

Status:

BorrowerState, Occupation, LoanStatus, EmploymentStatus

Boolean:

IncomeVerifiable, IsBorrowerHomeowner, CurrentlyInGroup

Range:

IncomeRange

I would like to divide the data into three sections: one for the listing, one for the borrowers, and one for the lenders.



Unusual distribution

I found that the debt-to-income ratio and delinquencies in the last 7 years have some unusual outliers, and that the amount delinquent has a long tail to the right and a high percentage of 0 values.

For tidying and adjustment, I converted all blank values to NA when importing the dataset by adding na.strings = c("", "NA") to the loading function, so that these values can be excluded at once. I also converted the loan origination date to POSIXlt format and then to a factor so that I can easily separate the records into different buckets or select a certain date range.
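The effect of the na.strings argument can be seen on a tiny example. The three-row CSV text below is hypothetical and read from a string; for the real file, the same argument would be passed to read.csv() along with the file path.

```r
# Hypothetical three-row extract with two blank fields.
csv_text <- "ListingNumber,CreditGrade,BorrowerState
193129,C,CO
1209647,,CO
81716,HR,"

# Treat empty fields and the literal string NA as missing values.
loans <- read.csv(text = csv_text, na.strings = c("", "NA"),
                  stringsAsFactors = TRUE)

# Number of missing values per column.
colSums(is.na(loans))
```

Without na.strings, the blank fields would be read as an empty-string factor level instead of NA, and would show up as a separate category in summaries and plots.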





Bivariate Plots Section

First I started exploring a subset of the data that includes the following:

BorrowerAPR, BorrowerRate, ProsperRating..numeric., ProsperScore, CreditScoreRangeLower, OpenCreditLines, InquiriesLast6Months, DelinquenciesLast7Years, OpenRevolvingAccounts, PublicRecordsLast12Months, ProsperPaymentsOneMonthPlusLate.

From the paired analysis, we can see that APR has a very strong correlation with borrower rate (0.989) and Prosper rating (-0.948), and a moderate correlation with Prosper score (-0.574) and credit score range (-0.434). All the other variables have only a slight correlation with APR, with absolute values below 0.2.

Among the other variables, Prosper rating, Prosper score and credit score range seem to correlate with each other. Open credit lines has a strong correlation with open revolving accounts, which makes sense.

So, the variables in our subset that correlate most strongly with APR can be ranked in the following order:
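Pairwise correlation coefficients of this kind can be computed with cor(). The sketch below uses simulated columns that only mimic the direction of the relationships, so its coefficients are not the ones quoted above, and it is not necessarily how the paired plot was produced.

```r
# Hypothetical data: rate tracks APR closely, credit score moves against it.
set.seed(7)
n <- 500
apr <- runif(n, 0.05, 0.42)
loans <- data.frame(
  BorrowerAPR = apr,
  BorrowerRate = apr - 0.02 + rnorm(n, sd = 0.005),
  CreditScoreRangeLower = round(800 - 300 * apr + rnorm(n, sd = 40)),
  InquiriesLast6Months = rpois(n, lambda = 1.4)
)
loans$CreditScoreRangeLower[sample(n, 25)] <- NA  # mimic missing scores

# Pearson correlations, using all complete pairs for each column pair.
cor_matrix <- cor(loans, use = "pairwise.complete.obs")

# Correlation of every other variable with APR, strongest first.
apr_cor <- cor_matrix["BorrowerAPR", colnames(cor_matrix) != "BorrowerAPR"]
apr_cor[order(abs(apr_cor), decreasing = TRUE)]
```

Ordering by absolute value keeps strong negative correlations, such as the one with the ratings, next to strong positive ones.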

Borrower rate, prosper rating, prosper score, credit score range.

Since there are more than 80 variables in the dataset, it is not possible to include all of them in this paired analysis, so I would like to see how other variables interact with APR.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA      22

The APR for 2005 doesn’t show up in the plot. Running a summary of the APR for 2005 shows that there is no data other than NA for that year.

There is no obvious pattern in the plot, but there are some outliers in 2006 with APR higher than almost all of the rest of the data. I decided to take a closer look at these outliers.

First I performed a borrower rate by year analysis to see how the borrower rate in 2006 compares to the other years. It shows a plot similar to that of APR, with very high outliers above 0.4 in 2006, so the APR outliers could be the result of high interest rates. But what is the reason behind the high interest rates?

Since the Prosper rating and score are not available for data earlier than 2009, I decided to see what the credit grades are for these outliers in 2006. The plot here shows the credit grade for records with APR above the 0.99 quantile. The APRs in the plot should be those outliers above 0.4, and most of the credit grades here are HR (high risk). So, does that mean the high-risk credit grade is the reason behind the APR and interest rate outliers?

To answer the question, I took a closer look at the credit grades before 2009 and at the “HR” credit grade in each year. It turned out that there are more “HR” credit grades in 2007, and there is not much evidence that the credit grade caused the outliers in 2006.

As per our preliminary analysis, credit score range also correlates with APR and interest rate, and the plot shows something interesting: there is no credit score range for the records with APR above 0.4 in 2006. I want to compare this with the data from other years to see the difference.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA      10

From this credit score range and APR plot, facet-wrapped by year from 2006 to 2014, we can see that there is no credit score range associated with the APR outliers (above 0.4) in 2006. There also seems to be no credit score for the loans with APR higher than 0.32, which is abnormal compared with the plots for other years. The summary of the credit score range of these APR outliers in 2006 also shows only NA.

I think it is reasonable to assume that the APR outliers in 2006 could be related to the lack of a credit score.

Then I started to look at how other variables interact with APR. The loan term doesn’t have much impact on APR.

Credit grade has a strong correlation with APR, like Prosper rating and Prosper score.

Whether the borrower is a home owner or not affects the APR slightly. It is reasonable that a home owner gets a lower APR.

Whether the borrower’s income is verifiable affects the mean APR by almost 0.05.

The categorical variable employment status doesn’t show a specific pattern, but it makes sense that full-time and part-time employed borrowers get a better APR.

The debt to income ratio group plot doesn’t show any clear relationship between the two variables.

The amount delinquent to APR plot doesn’t show any obvious correlation.

But I would like to perform further analysis on these two variables because I strongly suspect that they are correlated in some way.

So I grouped the data by AmountDelinquent and added the mean APR of each group to the subset. However, the result still doesn’t show any clear relationship.
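One way to group a skewed amount and attach a per-group mean is sketched below with base R `cut()` and `ave()`. The bin edges are hypothetical, since the report does not state which groups were used, and the data is made up.

```r
set.seed(2)  # made-up data, for illustration only
loans <- data.frame(
  AmountDelinquent = c(rep(0, 50), round(rexp(50, rate = 1 / 2000))),
  BorrowerAPR = runif(100, 0.05, 0.40)
)

# Hypothetical bin edges; the report does not state which ones were used.
breaks <- c(0, 1, 100, 1000, 10000, Inf)
loans$DelinqGroup <- cut(loans$AmountDelinquent, breaks = breaks,
                         right = FALSE, include.lowest = TRUE)

# Attach each group's mean APR to every row in that group.
loans$GroupMeanAPR <- ave(loans$BorrowerAPR, loans$DelinqGroup,
                          FUN = function(x) mean(x, na.rm = TRUE))

# One row per group: the bin and its mean APR.
print(unique(loans[order(loans$DelinqGroup), c("DelinqGroup", "GroupMeanAPR")]))
```

With `right = FALSE` the first bin is `[0,1)`, so loans with nothing delinquent get a group of their own instead of being mixed with small positive amounts.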

As with the amount delinquent, I decided to look closer at how inquiries affect APR. The initial plot doesn’t show any clear relationship between the two variables.

However, when I grouped the inquiries and added the mean APR of each group, the plot started to show something interesting, but the relationship was still not clear.

After I removed the outliers above the 0.99 quantile of inquiries in the last 6 months, the result became clear. This would be helpful if I were to build a model to predict the APR.
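The trimming step can be sketched as follows. The column name `InquiriesLast6Months` and the simulated relationship between inquiries and APR are assumptions made for this example, not findings from the dataset.

```r
set.seed(3)  # made-up data, for illustration only
n <- 2000
inq <- rpois(n, lambda = 2)
inq[1:5] <- c(40, 55, 63, 78, 105)  # a few extreme values
# Simulated APR that rises with inquiries, clamped to a plausible range.
apr <- pmin(pmax(0.10 + 0.01 * inq + rnorm(n, sd = 0.03), 0.05), 0.45)
loans <- data.frame(InquiriesLast6Months = inq, BorrowerAPR = apr)

# Drop rows above the 99th percentile of inquiries.
cutoff <- quantile(loans$InquiriesLast6Months, 0.99, na.rm = TRUE)
trimmed <- subset(loans, InquiriesLast6Months <= cutoff)

# Mean APR for each inquiry count that remains.
apr_by_inq <- aggregate(BorrowerAPR ~ InquiriesLast6Months,
                        data = trimmed, FUN = mean)
print(apr_by_inq)
```

Trimming before averaging matters because the few very large inquiry counts each form a group of one or two loans, and their noisy means can dominate the plot.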

There is no obvious pattern in the income range to APR plot, but the fact that the $0 income range gets a lower APR than most of the other income ranges is quite surprising. After examining the distribution of the $0 income range across years, we can see that most of the $0 income range loans came from 2007 and 2008, the two years that have a lower mean APR than the others, which explains the result.

## 
##  Pearson's product-moment correlation
## 
## data:  data$BorrowerRate and data$BorrowerAPR
## t = 2347.7, df = 113910, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9897057 0.9899409
## sample estimates:
##      cor 
## 0.989824

Finally, I would like to take a closer look at the relationship between borrower APR and interest rate. The plot shows that the correlation is very strong and positive, as expected. The layers with the same slope may indicate different service fees and other charges; this is the part that needs further investigation. As the calculation above shows, the correlation coefficient is almost 0.99, and it is the strongest relationship I have found.
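The `data:` line in the test output names `data$BorrowerRate` and `data$BorrowerAPR`, so the call was presumably a plain `cor.test()` on those two columns. A small sketch on simulated data follows; the fixed offset between rate and APR is an assumption chosen only to produce a strong correlation.

```r
set.seed(4)  # made-up data, for illustration only
rate <- runif(500, 0.05, 0.35)
# Assumption: APR is the rate plus a small fee-related offset and some noise.
apr <- rate + 0.02 + rnorm(500, sd = 0.005)
data <- data.frame(BorrowerRate = rate, BorrowerAPR = apr)

# Pearson correlation test, which is the default method of cor.test().
result <- cor.test(data$BorrowerRate, data$BorrowerAPR, method = "pearson")
print(result$estimate)  # correlation coefficient
print(result$conf.int)  # 95 percent confidence interval
```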





Bivariate Analysis



Observations

My first observation from the matrix is that open revolving accounts and open credit lines have a strong correlation, above 0.9.

Then I explored the APR further. It turns out that the credit grade, Prosper rating, and Prosper score all have a strong correlation with the borrower’s APR.

Whether the borrower is a home owner, whether income is verifiable, employment status, and the loan term have a moderate correlation with borrower APR.

The other variables I explored, like inquiries in the last 6 months, current credit lines, and delinquencies in the last 7 years, correlate with the mean borrower APR of each group.

There are also variables that have little or no clear correlation with the APR, for example, income range and debt to income ratio.

One interesting thing I noticed is in the income range and borrower APR plot: the $0 income range gets a better APR than most of the other income ranges, which doesn’t make sense at first.

After examining the distribution of the $0 income range across years, I found the reason: most of the $0 income range loans came from 2007 and 2008, the two years that have a lower mean APR than most of the other years.



The strongest relationship

The strongest relationship I found is between the borrower APR and the borrower interest rate, with a correlation of almost 0.99.





Multivariate Plots Section

By analyzing the relationship between APR and Prosper rating/credit grade in each year, we can get a better idea of how the year affects the APR. It is easy to understand that differences in interest rates between years could be the main reason why the APR differs across periods even for the same rating/grade, but could there be any other reason? I would like to look at the relationship between APR and BorrowerRate (interest rate).

From the plot above, it looks like the relationship between APR and interest rate varies across years. For the same interest rate, the APR seems to be higher in 2012-2014 than in 2006. I would like to perform another analysis to get more detail.

I created a new variable, the ratio of APR to interest rate, to get more information on the relationship between the two. From the result, it is clear that the ratio is higher in 2011-2014 than in 2006-2008, meaning that the same interest rate is paired with a higher APR.
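A sketch of such a ratio variable and its yearly mean, on a handful of made-up rows. The names `Year` and `AprToRate` are hypothetical, and the zero-rate guard is an added precaution rather than something the report describes.

```r
# Made-up rows, for illustration only.
loans <- data.frame(
  Year         = c(2006, 2006, 2008, 2012, 2012, 2014),
  BorrowerRate = c(0.20, 0.15, 0.10, 0.20, 0.15, 0.10),
  BorrowerAPR  = c(0.21, 0.16, 0.11, 0.24, 0.18, 0.13)
)

# Ratio of APR to interest rate; a zero rate would give Inf, so use NA there.
loans$AprToRate <- ifelse(loans$BorrowerRate > 0,
                          loans$BorrowerAPR / loans$BorrowerRate, NA)

# Mean ratio per year (rows with an NA ratio are dropped by the formula method).
ratio_by_year <- aggregate(AprToRate ~ Year, data = loans, FUN = mean)
print(ratio_by_year)
```

A ratio near 1 means the APR is close to the interest rate, while a larger ratio means more cost on top of the rate for the same nominal interest.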

The APR x Rate x Prosper Score plot doesn’t show any relationship between Prosper score and the different APR to rate lines.

Credit score doesn’t seem to show any relationship with the different APR to rate lines either.

Although the APR to interest rate relationship doesn’t have a clear correlation with credit rating, it seems to correlate with employment status.

The result matches the previous plot, and it indicates that the gap between APR and the actual interest rate may not be only standard service fees. It looks like people with a full-time job are more likely to get an APR closer to the actual interest rate.

However, when I ran a similar plot of delinquencies in the last 7 years against the APR to interest rate ratio, the result was surprising: the more delinquencies a borrower has, the lower the APR is relative to the interest rate.

I then took a look at the group plot but replaced the APR to interest rate ratio with just the mean interest rate, and I found that beyond around 50 delinquencies in the last 7 years, the interest rate is no longer correlated with the delinquencies. I think the reason is that the interest rate doesn’t really correlate with these attributes, so this may not be a good method for analyzing the APR to interest rate relationship.

Whether a borrower is a home owner or not doesn’t affect the APR to rate ratio.

The income range doesn’t have a clear correlation with APR/Interest rate ratio.






Multivariate Analysis

Observations

First, I observed the relationship between APR and both Prosper rating and credit grade on a yearly basis. The result is very consistent and as expected: the better the rating/grade, the more likely a borrower is to get a lower APR.

Another relationship is between APR and interest rate. The plots show that it varies by year: the same interest rate in 2012-2014 seems to correspond to a higher APR than it did in 2006-2008. In addition, it has a moderate correlation with employment status: a person with a full-time job seems to have an APR much closer to the interest rate than borrowers with other employment statuses.

The surprising interaction is in the plot of delinquencies in the last 7 years against the APR to interest rate ratio. The plot shows that the more delinquencies a borrower has, the lower the APR to interest rate ratio, meaning that the APR gets closer to the interest rate.

By analyzing the delinquencies in the last 7 years groups and the mean interest rate within each group, we can see that the mean interest rate becomes uncorrelated with delinquencies after a certain point, even with the extremely high values removed.







Final Plots and Summary

Plot One

Description One

The original amount delinquent plot is highly skewed, with an extremely high count at 0. After the extreme values were removed and the data was transformed to a log scale, it shows a bell curve with a peak around 800.
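A rough sketch of this transformation on simulated amounts. Exactly which values the report removed is not stated, so dropping the zeros and everything above the 99th percentile is an assumption here, and the simulated distribution is made up.

```r
set.seed(5)  # made-up data, for illustration only
amount <- c(rep(0, 800), round(rlnorm(200, meanlog = log(800), sdlog = 1.2)))

# Assumption: "removing the extreme values" means dropping the zeros and
# anything above the 99th percentile of the positive amounts.
positive <- amount[amount > 0]
upper <- quantile(positive, 0.99)
kept <- positive[positive <= upper]

# Work on the log10 scale so that the long right tail is compressed.
log_amount <- log10(kept)
h <- hist(log_amount, breaks = 20, plot = FALSE)

# Midpoint of the tallest bin, converted back to the original units.
peak <- 10^h$mids[which.max(h$counts)]
print(round(peak))
```

Calling `plot(h)` afterwards draws the histogram; `plot = FALSE` is used here only so that the bin counts can be inspected without opening a graphics device.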

Plot Two

Description Two

It is hard to find any relationship between the two variables in the original borrower APR and inquiries last 6 months plot. After removing the extremely high values and transforming to a log scale, the linear model line shows the relationship clearly.

Plot Three

Description Three

This plot clearly shows the relationship between borrower APR and borrower rate, and surprisingly it reveals how the borrower’s employment status interacts with the two variables.


Reflection

The Prosper loan dataset contains 113,937 records of Prosper loans with 81 variables, originated from 2005 to 2014. Since it is a large dataset, I started by learning the basics about Prosper loans while exploring the variables. After some initial analysis, I focused on the borrower’s APR because I think that APR is the key variable in this dataset and it is also one of the most important ratios in any loan.

There are many variables that interact with APR. Some have a clear correlation, like borrower rate, credit grade, and Prosper score, and some show a correlation only after adjustment, like delinquencies in the last 7 years, current credit lines, and inquiries in the last 6 months. Among these variables, borrower rate is the most obvious one because of the strong correlation (almost 0.99), but after exploring the plot of the two, I noticed that there are bands with the same slope, meaning that the same borrower rate maps to different levels of APR even though the slope of the relationship is almost the same.

As I continued exploring the relationship between APR and borrower rate, I found that the loan origination year and employment status are reflected in these bands. The service fee and related tax could differ from year to year, which could be the reason for the APR to borrower rate gap, but the employment status effect could mean that the fee depends on some conditions of the borrower.

I didn’t build a linear model to predict the APR, but I think more analysis should be done to reveal the variables that affect the relationship between borrower rate and APR in order to improve such a model.