The data set explored in this project contains information on 113,937 loans obtained via Prosper, the first American company in the field of peer-to-peer lending. Borrowers request personal loans on Prosper, and investors can fund the loan amount partially or in full.
There are 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The data cover the period from November 2005 to March 2014.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## ListingKey ListingNumber
## 17A93590655669644DB4C06: 6 Min. : 4
## 349D3587495831350F0F648: 4 1st Qu.: 400919
## 47C1359638497431975670B: 4 Median : 600554
## 8474358854651984137201C: 4 Mean : 627886
## DE8535960513435199406CE: 4 3rd Qu.: 892634
## 04C13599434217079754AEE: 3 Max. :1255725
## (Other) :113912
## ListingCreationDate CreditGrade Term
## 2013-10-02 17:20:16.550000000: 6 :84984 Min. :12.00
## 2013-08-28 20:31:41.107000000: 4 C : 5649 1st Qu.:36.00
## 2013-09-08 09:27:44.853000000: 4 D : 5153 Median :36.00
## 2013-12-06 05:43:13.830000000: 4 B : 4389 Mean :40.83
## 2013-12-06 11:44:58.283000000: 4 AA : 3509 3rd Qu.:36.00
## 2013-08-21 07:25:22.360000000: 3 HR : 3508 Max. :60.00
## (Other) :113912 (Other): 6745
## LoanStatus ClosedDate
## Current :56576 :58848
## Completed :38074 2014-03-04 00:00:00: 105
## Chargedoff :11992 2014-02-19 00:00:00: 100
## Defaulted : 5018 2014-02-11 00:00:00: 92
## Past Due (1-15 days) : 806 2012-10-30 00:00:00: 81
## Past Due (31-60 days): 363 2013-02-26 00:00:00: 78
## (Other) : 1108 (Other) :54633
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :1.000 :29084 Min. : 1.00
## 1st Qu.:3.000 C :18345 1st Qu.: 4.00
## Median :4.000 B :15581 Median : 6.00
## Mean :4.072 A :14551 Mean : 5.95
## 3rd Qu.:5.000 D :14274 3rd Qu.: 8.00
## Max. :7.000 E : 9795 Max. :11.00
## NA's :29084 (Other):12307 NA's :29084
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation EmploymentStatus
## Other :28617 Employed :67322
## Professional :13628 Full-time :26355
## Computer Programmer : 4478 Self-employed: 6134
## Executive : 4311 Not available: 5347
## Teacher : 3759 Other : 3806
## Administrative Assistant: 3688 : 2255
## (Other) :55456 (Other) : 2718
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 False:56459 False:101218
## 1st Qu.: 26.00 True :57478 True : 12719
## Median : 67.00
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## GroupKey DateCreditPulled
## :100596 2013-12-23 09:38:12: 6
## 783C3371218786870A73D20: 1140 2013-11-21 09:09:41: 4
## 3D4D3366260257624AB272D: 916 2013-12-06 05:43:16: 4
## 6A3B336601725506917317E: 698 2014-01-14 20:17:49: 4
## FEF83377364176536637E50: 611 2014-02-09 12:14:41: 4
## C9643379247860156A00EC0: 342 2013-09-27 22:04:54: 3
## (Other) : 9634 (Other) :113912
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 697 Min. : 0.00 Min. : 0.00
## 1993-12-01 00:00:00: 185 1st Qu.: 7.00 1st Qu.: 6.00
## 1994-11-01 00:00:00: 178 Median :10.00 Median : 9.00
## 1995-11-01 00:00:00: 168 Mean :10.32 Mean : 9.26
## 1990-04-01 00:00:00: 161 3rd Qu.:13.00 3rd Qu.:12.00
## 1995-03-01 00:00:00: 159 Max. :59.00 Max. :54.00
## (Other) :112389 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.00
## 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 25.00 Median : 6.00
## Mean : 26.75 Mean : 6.97
## 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :136.00 Max. :51.00
## NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000
## Median :0.940 Median : 0.000
## Mean :0.886 Mean : 0.802
## 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :20.000
## NA's :7544 NA's :7544
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.000 $25,000-49,999:32192 False: 8669
## 1st Qu.: 0.140 $50,000-74,999:31050 True :105268
## Median : 0.220 $100,000+ :17337
## Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 0.320 Not displayed : 7741
## Max. :10.010 $1-24,999 : 7274
## NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 CB1B37030986463208432A1: 6 Min. :0.00
## 1st Qu.: 3200 2DEE3698211017519D7333F: 4 1st Qu.:1.00
## Median : 4667 9F4B37043517554537C364C: 4 Median :1.00
## Mean : 5608 D895370150591392337ED6D: 4 Mean :1.42
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4 3rd Qu.:2.00
## Max. :1750003 0D8F37036734373301ED419: 3 Max. :8.00
## (Other) :113912 NA's :91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00
## Median : 16.00 Median : 15.00
## Mean : 22.93 Mean : 22.27
## 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2014-01-22 00:00:00: 491 Q4 2013:14450
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 Q1 2014:12172
## Median : 6500 2014-02-19 00:00:00: 439 Q3 2013: 9180
## Mean : 8337 2013-10-16 00:00:00: 434 Q2 2013: 7099
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 Q3 2012: 5632
## Max. :35000 2013-09-24 00:00:00: 316 Q2 2012: 5061
## (Other) :111428 (Other):60343
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
The data set uses four key fields to identify each listing:
ListingKey - unique key for each listing, the same value as the ‘key’ used in the listing object in the API.
ListingNumber - the number that uniquely identifies the listing to the public, as displayed on the website.
LoanKey - equivalent to the ListingKey and the ‘key’ in the API.
LoanNumber - unique numeric value associated with the loan.
Since all four variables can be used interchangeably to identify listings, three of them can be omitted. ListingKey will be used in further analysis.
Though this data set may be considered tidy in general, the summary statistics show that a number of ListingKey values appear in more than one row.
##
## 1 2 3 4 6
## 112239 790 32 4 1
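A frequency table of this kind can be obtained by tabulating the number of rows per key and then tabulating those counts again. The following is a minimal sketch on a toy data frame; the `toy` object and its values are invented for illustration, and only the column names come from the data set.

```r
# Toy stand-in for the loan data; keys and scores are made up.
toy <- data.frame(
  ListingKey = c("A", "B", "B", "C", "C", "C", "D"),
  ProsperScore = c(4, 7, 8, 5, 6, 9, 3),
  stringsAsFactors = FALSE
)

# Number of rows recorded for each key.
rows_per_key <- table(toy$ListingKey)

# How many keys occur once, twice, three times, and so on.
key_multiplicity <- table(rows_per_key)
print(key_multiplicity)
```

The names of `key_multiplicity` are the row counts per key, and its values are the number of keys with that count.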
Here is one example with the maximum of 6 rows per key.
## ListingKey ListingCreationDate CreditGrade Term
## 13079 17A93590655669644DB4C06 2013-10-02 17:20:16 60
## 14889 17A93590655669644DB4C06 2013-10-02 17:20:16 60
## 20570 17A93590655669644DB4C06 2013-10-02 17:20:16 60
## 31451 17A93590655669644DB4C06 2013-10-02 17:20:16 60
## 42751 17A93590655669644DB4C06 2013-10-02 17:20:16 60
## 42752 17A93590655669644DB4C06 2013-10-02 17:20:16 60
## LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield
## 13079 Current <NA> 0.16662 0.1435 0.1335
## 14889 Current <NA> 0.16662 0.1435 0.1335
## 20570 Current <NA> 0.16662 0.1435 0.1335
## 31451 Current <NA> 0.16662 0.1435 0.1335
## 42751 Current <NA> 0.16662 0.1435 0.1335
## 42752 Current <NA> 0.16662 0.1435 0.1335
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## 13079 0.1264 0.0524 0.074
## 14889 0.1264 0.0524 0.074
## 20570 0.1264 0.0524 0.074
## 31451 0.1264 0.0524 0.074
## 42751 0.1264 0.0524 0.074
## 42752 0.1264 0.0524 0.074
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## 13079 5 B 4
## 14889 5 B 8
## 20570 5 B 7
## 31451 5 B 10
## 42751 5 B 5
## 42752 5 B 6
## ListingCategory..numeric. BorrowerState Occupation EmploymentStatus
## 13079 1 MD Other Employed
## 14889 1 MD Other Employed
## 20570 1 MD Other Employed
## 31451 1 MD Other Employed
## 42751 1 MD Other Employed
## 42752 1 MD Other Employed
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## 13079 26 False False
## 14889 26 False False
## 20570 26 False False
## 31451 26 False False
## 42751 26 False False
## 42752 26 False False
## GroupKey DateCreditPulled CreditScoreRangeLower
## 13079 2013-12-23 09:38:12 720
## 14889 2013-12-23 09:38:12 720
## 20570 2013-12-23 09:38:12 720
## 31451 2013-12-23 09:38:12 720
## 42751 2013-12-23 09:38:12 720
## 42752 2013-12-23 09:38:12 720
## CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines
## 13079 739 1986-12-26 12
## 14889 739 1986-12-26 12
## 20570 739 1986-12-26 12
## 31451 739 1986-12-26 12
## 42751 739 1986-12-26 12
## 42752 739 1986-12-26 12
## OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts
## 13079 12 20 6
## 14889 12 20 6
## 20570 12 20 6
## 31451 12 20 6
## 42751 12 20 6
## 42752 12 20 6
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## 13079 348 0 5
## 14889 348 0 5
## 20570 348 0 5
## 31451 348 0 5
## 42751 348 0 5
## 42752 348 0 5
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## 13079 0 0 0
## 14889 0 0 0
## 20570 0 0 0
## 31451 0 0 0
## 42751 0 0 0
## 42752 0 0 0
## PublicRecordsLast10Years PublicRecordsLast12Months
## 13079 0 0
## 14889 0 0
## 20570 0 0
## 31451 0 0
## 42751 0 0
## 42752 0 0
## RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit
## 13079 14635 0.57 10865
## 14889 14635 0.57 10865
## 20570 14635 0.57 10865
## 31451 14635 0.57 10865
## 42751 14635 0.57 10865
## 42752 14635 0.57 10865
## TotalTrades TradesNeverDelinquent..percentage.
## 13079 17 1
## 14889 17 1
## 20570 17 1
## 31451 17 1
## 42751 17 1
## 42752 17 1
## TradesOpenedLast6Months DebtToIncomeRatio IncomeRange
## 13079 0 0.41 $25,000-49,999
## 14889 0 0.41 $25,000-49,999
## 20570 0 0.41 $25,000-49,999
## 31451 0 0.41 $25,000-49,999
## 42751 0 0.41 $25,000-49,999
## 42752 0 0.41 $25,000-49,999
## IncomeVerifiable StatedMonthlyIncome TotalProsperLoans
## 13079 True 3000 NA
## 14889 True 3000 NA
## 20570 True 3000 NA
## 31451 True 3000 NA
## 42751 True 3000 NA
## 42752 True 3000 NA
## TotalProsperPaymentsBilled OnTimeProsperPayments
## 13079 NA NA
## 14889 NA NA
## 20570 NA NA
## 31451 NA NA
## 42751 NA NA
## 42752 NA NA
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 13079 NA NA
## 14889 NA NA
## 20570 NA NA
## 31451 NA NA
## 42751 NA NA
## 42752 NA NA
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 13079 NA NA
## 14889 NA NA
## 20570 NA NA
## 31451 NA NA
## 42751 NA NA
## 42752 NA NA
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 13079 NA 0
## 14889 NA 0
## 20570 NA 0
## 31451 NA 0
## 42751 NA 0
## 42752 NA 0
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination
## 13079 NA 2
## 14889 NA 2
## 20570 NA 2
## 31451 NA 2
## 42751 NA 2
## 42752 NA 2
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## 13079 10000 2014-01-13 Q1 2014
## 14889 10000 2014-01-13 Q1 2014
## 20570 10000 2014-01-13 Q1 2014
## 31451 10000 2014-01-13 Q1 2014
## 42751 10000 2014-01-13 Q1 2014
## 42752 10000 2014-01-13 Q1 2014
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 13079 F80D3694083622957BA09F2 234.5 234.5
## 14889 F80D3694083622957BA09F2 234.5 234.5
## 20570 F80D3694083622957BA09F2 234.5 234.5
## 31451 F80D3694083622957BA09F2 234.5 234.5
## 42751 F80D3694083622957BA09F2 234.5 234.5
## 42752 F80D3694083622957BA09F2 234.5 234.5
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## 13079 112.62 121.88 -8.49
## 14889 112.62 121.88 -8.49
## 20570 112.62 121.88 -8.49
## 31451 112.62 121.88 -8.49
## 42751 112.62 121.88 -8.49
## 42752 112.62 121.88 -8.49
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## 13079 0 0 0
## 14889 0 0 0
## 20570 0 0 0
## 31451 0 0 0
## 42751 0 0 0
## 42752 0 0 0
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## 13079 0 1 0
## 14889 0 1 0
## 20570 0 1 0
## 31451 0 1 0
## 42751 0 1 0
## 42752 0 1 0
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## 13079 0 0 96
## 14889 0 0 96
## 20570 0 0 96
## 31451 0 0 96
## 42751 0 0 96
## 42752 0 0 96
As can be seen from the example above, the only variable that changes for this listing key is ProsperScore, a custom risk score built using historical Prosper data and applicable to loans originated after July 2009. The distribution over time of the listings with additional rows caused by changes in ProsperScore shows that the practice of recording score changes was introduced only in the most recent data.
Since only ProsperScore changes, these listings are double-counted for other variables such as LoanStatus, ListingCategory, etc. Of 113,066 unique listing keys, 827 produce 871 duplicate rows. There is no specific logging of changes in ProsperScore in the data, and, given the intent to omit ProsperScore from further analysis because of its high proportion of NA values, the rows where a listing key appears for the second time or more can be dropped to avoid double counting.
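The counts quoted above follow directly from the frequency table of rows per key. The sketch below redoes that arithmetic and then shows one possible way to keep only the first row per key with base R's `duplicated()`. The toy data frame is invented for illustration, and this is not necessarily how the rows were dropped in the original analysis.

```r
# Frequency table from the text: number of keys that occur 1, 2, 3, 4 or 6 times.
times <- c(1, 2, 3, 4, 6)
keys <- c(112239, 790, 32, 4, 1)

unique_keys <- sum(keys)                 # all distinct listing keys
repeated_keys <- sum(keys[times > 1])    # keys that occur more than once
extra_rows <- sum((times - 1) * keys)    # rows beyond the first for each key
total_rows <- sum(times * keys)          # rows in the original data

# Dropping repeated keys on a toy data frame (values are made up).
toy <- data.frame(
  ListingKey = c("A", "B", "B", "C"),
  ProsperScore = c(4, 7, 8, 5),
  stringsAsFactors = FALSE
)
toy_unique <- toy[!duplicated(toy$ListingKey), ]
```

Note that `duplicated()` keeps the first occurrence of each key, so which `ProsperScore` value survives depends on row order; that matters little here only because the column is to be omitted anyway.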
A high proportion of NA values is an issue for 14 columns: in 4 of them about 26% of values are missing, and in the other 10 the proportion is over 50% and goes up to 80-85%.
## EstimatedEffectiveYield EstimatedLoss
## 0.2572303 0.2572303
## EstimatedReturn ProsperRating..numeric.
## 0.2572303 0.2572303
## ClosedDate TotalProsperLoans
## 0.5128863 0.8061044
## TotalProsperPaymentsBilled OnTimeProsperPayments
## 0.8061044 0.8061044
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 0.8061044 0.8061044
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 0.8061044 0.8061044
## ScorexChangeAtTimeOfListing LoanFirstDefaultedCycleNumber
## 0.8327349 0.8500699
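Per-column NA shares like these can be computed with `colMeans(is.na(...))`. Here is a small sketch, assuming missing values are already coded as `NA`; the toy columns and values are invented, and in the raw data empty strings (as in `ClosedDate`) would first need to be converted to `NA`.

```r
# Toy data frame with made-up values and different amounts of missing data.
toy <- data.frame(
  EstimatedLoss = c(NA, NA, 0.09, 0.05),
  TotalProsperLoans = c(NA, NA, NA, 1),
  LoanOriginalAmount = c(9425, 10000, 3001, 10000)
)

# Share of NA values in every column.
na_share <- colMeans(is.na(toy))

# Keep only the columns where more than 25% of values are missing.
high_na <- na_share[na_share > 0.25]
print(high_na)
```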
ClosedDate is applicable to the Cancelled, Completed, Chargedoff and Defaulted loan statuses, so we can assume that about half of the loans in the data set are in progress.
EstimatedEffectiveYield, EstimatedLoss, EstimatedReturn, ProsperRating..numeric., and ProsperScore were introduced in July 2009. This leads to missing values for earlier data, which make up about 26% of the data set. ProsperRating..Alpha. has a comparable number of empty string values for the same reason. On the other hand, CreditGrade, the credit rating assigned at the time the listing went live, contains 84,984 empty strings out of 113,937 rows of the original data, because it is applicable only to listings before 2009.
TotalProsperLoans, TotalProsperPaymentsBilled, OnTimeProsperPayments, ProsperPaymentsLessThanOneMonthLate, ProsperPaymentsOneMonthPlusLate, ProsperPrincipalBorrowed, ProsperPrincipalOutstanding, and ScorexChangeAtTimeOfListing have null values when the borrower had no prior loans with Prosper. That means that about 81% of loans are the borrowers’ first Prosper loans.
LoanFirstDefaultedCycleNumber is the cycle in which the loan was charged off, if ever. Its 85% share of missing values means that fewer than 15% of loans were charged off.
We can clearly see two periods in the data: from the end of 2005 to the middle of 2009, and from the middle of 2009 onwards, separated by the relaunch the company conducted in 2009. The same pattern can be seen in loan origination dates.
Since it takes some time for listings to become loans, the distribution of loan origination dates is shifted slightly to the right compared to the listing creation dates. We can also notice that after the relaunch the number of listings/loans in 2010-2011 was about half of what it was in 2007-2009. However, the number of loans started growing in 2011 and, after a decrease to pre-relaunch levels at the beginning of 2013, by the end of the period in question reached a level about 3 times higher than before the relaunch.
As for seasonal patterns, April has the lowest number of listings created and loans originated, while January has the highest. As a result, the second quarter is the lowest in terms of new loans, and the decline over the months of the first quarter gives it a lower total than the fourth quarter, which holds the highest value. Still, as could be seen from the previous charts of the number of listings and loans by month and year, the last months of 2013 and the first months of 2014 have a pronounced impact on total counts. This affects the seasonal picture as well.
As can be seen from the chart above, borrowers tend to ask for round amounts: about 73.8% of loans are multiples of $1,000, and another 15.9% are other multiples of $500. The most popular loan amounts are $4,000, $10,000, and $15,000, though 75% of all loans do not exceed $12,000. The overall distribution is right-skewed, ranging from $1,000 to $35,000, with a median of $6,300 and a mean of $8,315.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6300 8315 12000 35000
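The shares of round loan amounts mentioned above can be computed with the modulo operator. This is a sketch only: the `amount` vector holds hypothetical values, not the actual `LoanOriginalAmount` column.

```r
# Hypothetical loan amounts in dollars.
amount <- c(4000, 10000, 15000, 2500, 7500, 3001, 9425, 1000)

# Share of amounts that are exact multiples of $1,000.
share_thousands <- mean(amount %% 1000 == 0)

# Share of amounts that are multiples of $500 but not of $1,000.
share_other_500 <- mean(amount %% 500 == 0 & amount %% 1000 != 0)

print(c(thousands = share_thousands, other_500 = share_other_500))
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which keeps the calculation to one line per share.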
## 2005 2006 2007 2008 2009 2010 2011
## 3576.682 4763.325 7049.545 6021.628 4354.859 4766.540 6692.021
## 2012 2013 2014
## 7833.842 10540.158 11926.927
Loan amounts have also grown in recent years, after some decrease in the middle of the period, following the total number of listings and loans.
The chart shows that loan terms are discrete and take a limited number of possible values, which can be presented more compactly as a table.
##
## 12 36 60
## 1614 87224 24228
##
## 12 36 60
## 1.427485 77.144323 21.428192
There are three loan terms used in Prosper: 12, 36 and 60 months, that is, 1, 3 or 5 years. Most loans (77.1%) are issued for 3 years; about one fifth have 5-year terms.
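The percentages follow from the counts in the table above. As a short worked sketch, with the counts typed in by hand rather than computed from the data frame:

```r
# Loan counts per term in months, copied from the table above.
term_counts <- c("12" = 1614, "36" = 87224, "60" = 24228)

# Percentage of loans per term, rounded to one decimal place.
term_share <- round(100 * term_counts / sum(term_counts), 1)

# The same terms expressed in years.
term_years <- as.integer(names(term_counts)) / 12

print(term_share)
print(term_years)
```

The counts add up to 113,066, the number of unique listing keys, which suggests the table was built after the repeated keys were dropped.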
3-year loans tend, on average, to be for lower amounts than 5-year loans. The number of 1-year loans is small, but their distribution is closer to that of 3-year loans, with a prevalence of lower amounts.
As for loan origination dates for different terms, the timeline shows that 1-year loans were offered for a period of time during 2011-2014, after which the company seems to have ceased offering this option. 5-year loans were also introduced around 2011, and the number of such loans has grown noticeably in the following years.
Since longer terms tend, on average, to have higher loan amounts, this leads to higher mean loan amounts in more recent years. Loan amounts should therefore be explored with regard not only to the time period but to loan terms as well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1929 0.2506 0.4975
Borrowers’ APR is based on their interest rates plus some additional fees, so, as expected, it follows the same distribution, shifted slightly to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20984 0.21898 0.28386 0.51229 25
Lenders’ yield is the borrower’s rate minus the platform’s service fees, so it also follows the same distribution, but shifted slightly to the left.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1245 0.1740 0.1829 0.2406 0.4925
The distribution of investors per listing is right-skewed, with the mode at the very bottom of the range. We can switch to a logarithmic scale to take a closer look at the smaller values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.88 116.00 1189.00
## [1] "df$Investors == 1"
##
## FALSE TRUE
## 75.96 24.04
As can be seen from the histogram, many loans are funded by a small number of investors, and 24% by only one. The average number of investors is 44 in terms of the median or about 81 in terms of the mean.
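The share of single-investor loans can be computed by tabulating a logical condition. The sketch below uses only the first ten `Investors` values shown in the `str()` output earlier, so its result illustrates the technique and does not reproduce the 24% figure.

```r
# First ten Investors values from the str() output; a sample, not the full column.
investors <- c(258, 1, 41, 158, 20, 1, 1, 1, 1, 1)

# Percentage of loans with exactly one investor versus more than one.
single_share <- round(100 * prop.table(table(investors == 1)), 2)
print(single_share)
```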
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 130.9 217.4 271.9 370.6 2251.5
75% of loans have monthly payments lower than $370.57, with an average monthly payment of $271.93.
##
## Current Completed Chargedoff
## 0.493 0.337 0.106
## Defaulted Past Due (1-15 days) Past Due (31-60 days)
## 0.044 0.007 0.003
## Past Due (61-90 days) Past Due (91-120 days) FinalPaymentInProgress
## 0.003 0.003 0.002
## Past Due (16-30 days) Cancelled Past Due (>120 days)
## 0.002 0.000 0.000
Of all loans in the data set, about 51% are loans in progress: 1.82% are past due to some extent, and the rest are being paid on time. About 34% of loans have been fully paid by borrowers, about 4.4% defaulted, and about 10.6% were charged off. As for dynamics, the number of charged-off loans decreased in 2010-2012, getting back to the levels of 2008 in 2014, while the number of defaulted loans noticeably decreased in 2010-2014 in comparison with the earlier period (see the chart below).
## Cancelled Chargedoff Defaulted
## 1700.000 6398.917 6486.799
## Completed FinalPaymentInProgress Current
## 6188.146 8344.606 10346.692
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 8491.334 8156.430 8504.055
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 7683.267 8003.977 8281.250
The average amount of current loans is higher than that of closed loans, in line with the trends mentioned above in the Loan amount and Term sections.
As can be seen on the plot, the most frequent purpose of loans is debt consolidation. Still, it looks rather strange that, with so many options among the listing categories, the second most popular category is “Not Available”, meaning that either the borrowers are not willing to declare the purpose of their loans, or the list of categories was changed to a more detailed version only recently.
57.2% of listings were created by borrowers who declared an income above $50,000. However, the income range variable also seems to have undergone some changes over time, given that three of its levels have a meaning close to “zero income”.
##
## Not displayed Not employed $0 $1-24,999 $25,000-49,999
## 6.85 0.71 0.55 6.40 28.25
## $50,000-74,999 $75,000-99,999 $100,000+
## 27.20 14.84 15.20
##
## CA TX FL NY IL GA OH MI VA
## 12.9 6.0 5.9 5.9 5.2 4.9 4.4 3.7 3.2 2.9
Once again, the levels of the Employment Status variable, some of which have the same meaning, may be the result of changes in the information required from borrowers. Still, most borrowers are clearly employed.
##
## Other Professional Computer Programmer
## 25.14 11.97 3.93
## Executive Teacher Administrative Assistant
## 3.79 3.30 3.25
## Analyst Sales - Commission
## 3.16 3.12 3.02
## Accountant/CPA
## 2.84
Each bar represents the number of listings by borrowers within a specific credit score range. The range is encoded with two variables, CreditScoreRangeLower and CreditScoreRangeUpper, and spans 20 points (for example, 660-679).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.5 720.0 880.0 591
For easier interpretation, credit score groups were added, based on the FICO score intervals published on CreditKarma.com.
##
## Too low Poor Fair Good Very Good Excellent <NA>
## 0.12 4.87 32.88 39.29 18.03 4.29 0.52
61.61% of borrowers have credit scores in the Good range or better.
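Grouping scores into named ranges can be sketched with cut. The break points below are an assumption based on commonly quoted FICO bands (the text only confirms that Good starts at 670), and binning on the lower bound of the range is also an assumption; the exact cut-offs used for the table above are not shown in this report.

```r
# Hypothetical lower bounds of credit score ranges (not the real data)
score_lower <- c(520, 600, 660, 680, 720, 760, 820, NA)

# Assumed FICO-style break points; right = FALSE gives intervals [a, b)
breaks <- c(-Inf, 300, 580, 670, 740, 800, Inf)
labels <- c("Too low", "Poor", "Fair", "Good", "Very Good", "Excellent")
score_group <- cut(score_lower, breaks = breaks, labels = labels, right = FALSE)

# Percentage of each group, keeping missing scores as their own category
group_pct <- round(100 * prop.table(table(score_group, useNA = "ifany")), 2)
print(group_pct)
```

Keeping NA as a separate category mirrors the <NA> column in the table above.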
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8472
For 99% of borrowers who reported a debt-to-income ratio, the ratio is below 0.8607, which makes us wonder about the outliers whose debt is several times higher than their income. We can check how the borrowers with the highest ratios differ in other aspects, for example, in their loan status. Here are two plots of LoanStatus for those who have a debt-to-income ratio between 1 and 5 (on the left plot) and over 5 (on the right plot).
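The 99th percentile and the two outlier groups can be obtained along these lines. This is a sketch on invented ratios; the vector dti and the group labels are hypothetical names, and whether the original boundaries were inclusive is not stated in the report.

```r
# Illustrative debt-to-income ratios with missing values (not the real data)
dti <- c(0.10, 0.22, 0.35, 0.18, NA, 1.50, 0.27, 10.01, NA, 0.31)

# 99th percentile among borrowers who reported a ratio
p99 <- quantile(dti, probs = 0.99, na.rm = TRUE)
print(p99)

# Two outlier groups: (1, 5] and (5, Inf); everything else becomes NA
dti_group <- cut(dti, breaks = c(1, 5, Inf), labels = c("1 to 5", "over 5"))
print(table(dti_group))
```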
The data set contains information on 113066 unique listings created at Prosper.com to obtain loans. Each listing includes the following:
- its identification numbers and time of creation;
- the information about the borrower - location, income, employment and occupation, credit score, debt to income ratio and other aspects of borrower’s financial situation, and their history with Prosper.com on the listing in question and previous loans, if any;
- the information about the loan - date of origination, amount, borrowers’ rate and APR, lenders’ yield, loan status and closed date (if closed) and the information about the most recent payment.
Every listing in the data set resulted in an originated loan. The average loan amount is $8,314.76, but this number differs a lot from year to year, taking into account the pause in 2008-2009 and the growth of 2013-2014, the latter also marked by the increasing number of 5-year loans. The most frequent purpose of loans is debt consolidation. About half of the loans are closed; the others are still in progress, and less than 5% of those are past due to some extent. In terms of listing creation time, the data span from November 9, 2005 to March 10, 2014. 24% of loans are funded by one investor. There are three options for loan terms: 1, 3 and 5 years. 1-year loans appear to be a temporary option, available in 2011-2013, while 5-year loans were introduced in the middle of the time period in question and became a growing category.
Most borrowers are employed, have an income of $50,000 or higher and a credit score of Good or better (670+). There are borrowers from all states, with the highest number in California (13%), followed by Texas, Florida, New York (about 6% each) and Illinois (5%). The average number of current credit lines is about 10 and the median debt-to-income ratio is 22%.
The main feature of interest in the data set for me is the company’s progress over time. We can see that in the early period the number of listings was growing, yet there was a pause in 2009 (apparently the pause coincides with the Great Recession of 2008-2009). After this pause the company restarted with lower listing numbers, but made great progress in the following years. From the variable dictionary and the behavior of some variables (Term and ListingCategory, for example) we may assume that some approaches and policies had changed, affecting the company’s performance and maybe the types of clients it attracts (though we may have to take into account the overall improvement of the financial situation).
investigation into your feature(s) of interest?
I suppose that studying the borrowers’ characteristics for any changes that followed the relaunch can also help in understanding the effect of Prosper’s policy changes.
I added ListingYear / LoanYear and ListingMonth / LoanMonth, based on ListingCreationDate / LoanOriginationDate, to make these time parameters more accessible. For borrowers’ credit scores I created a new variable that maps the values to the general ranges usually used to describe scores: “Poor”, “Fair”, “Good”, “Very Good” and “Excellent” (I also added a “Too low” level for scores below “Poor”). I also added a categorical equivalent of ListingCategory..numeric., based on the data dictionary, for better readability.
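Deriving year and month columns from a timestamp can look roughly like this. The sketch uses two timestamps that appear in this report; the variable names and the UTC time zone are assumptions, and the original preparation code may differ.

```r
# Timestamps in the format shown by str() in the data overview
created <- c("2005-11-09 20:44:28.847000000", "2013-12-27 12:02:50.000000000")

# Keep the first 19 characters (date and time to the second), then parse
created_dt <- as.POSIXct(substr(created, 1, 19),
                         format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Year and month as integers, convenient for grouping and sorting
listing_year  <- as.integer(format(created_dt, "%Y"))
listing_month <- as.integer(format(created_dt, "%m"))
print(data.frame(created_dt, listing_year, listing_month))
```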
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?
I converted the variables that contain date and time information from character type to date-time and modified LoanOriginationQuarter for easier sorting. I also excluded duplicated rows for a set of listing keys (see more information in the Data Preparation section).
The distribution of the Term variable showed a limited number of values, so I converted it to a factor with 3 levels: 12, 36 and 60 months.
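The two tidying steps described above, dropping repeated listing keys and turning Term into a factor, might be sketched as follows. The toy data frame is invented, and keeping the first row for each key is an assumption about how the duplicates were resolved.

```r
# Toy listings with one repeated key (illustrative values only)
loans <- data.frame(
  ListingKey = c("A1", "B2", "B2", "C3"),
  Term       = c(36, 60, 60, 12),
  stringsAsFactors = FALSE
)

# Keep the first row for every ListingKey
loans_unique <- loans[!duplicated(loans$ListingKey), ]

# Treat the three term lengths as categories rather than numbers
loans_unique$Term <- factor(loans_unique$Term, levels = c(12, 36, 60))
print(table(loans_unique$Term))
```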
The distribution of DebtToIncomeRatio has a number of outliers, who demonstrate a higher share of charged-off loans as the ratio grows. They also have fewer current loans, which may reflect some changes in policies regarding this index, so the outliers may require further investigation. There is a significant proportion of loans, about 24%, that are funded by only one investor, as the distribution of investors per loan shows. No other number of investors yields a comparable percentage, so this kind of loan may be specific in some way. Overall, I found applying a logarithmic scale to count axes useful for making small values in specific groups or periods more visible.
On the plots above the scale is changed to logarithmic to make all months visible. Here the interruption in the data is shown in more detail: from October 2008 to April 2009 for the creation of new listings and to May 2009 for the origination of new loans. Still, the logarithmic scale may make monthly results harder to perceive; they are more accessible without transformations.
Adding colors to the plot based on year (and changing the scale of counts to logarithmic for better visibility of smaller values), we can see that the “Not Available” category was mostly used in listings of 2005-2007 (and seems to be excluded in 2008-2010). “Personal Loan” and “Student Use” were applicable only before 2009 and 2011, respectively. The most detailed categories, like “Vacation”, “Medical/Dental” or “Boat”, came into use only in 2011 or 2012.
As expected, missing values in employment status of borrowers refer to the earliest period of data. The categories Employed and Other were implemented in 2010. As we can see from the plot below, Employed became the most frequently used status in recent years.
Though some other categorical variables seem to have changed their levels over time, Occupation has observations for each year in almost all of its levels. The lack of missing values in 2009-2012 may lead to the assumption that in these years it was a required field in loan applications.
Here all levels, except “Not displayed”, have been in use since 2007. There are no listings with $0 income or “Not employed” status in 2014, but since these are not very frequently used categories and the data of 2014 cover less than 3 months, this is to be expected.
Exploring income ranges with frequency polygons, we can notice that the most frequent range before the relaunch and in the first few years after it was $25,000-49,999, but in 2013 it changed to $50,000-74,999.
The charts above show how the most frequent loan status changes from completed to current for the most recent data. We can also see that the number of loans which are past due to some extent is comparatively low. FinalPaymentInProgress describes a rather specific state of a loan’s “life”, so its number is expectedly low as well.
If we compare the distribution of listing creation dates with the distribution of dates when the listings were closed, we can see a 3-year shift of peaks and drops for most of the data: the pause in listing creation in 2009 corresponds to a dip on the histogram for closed loans.
Among the earliest loans, which were all 3-year loans, we can see an increase of defaulted loans in the second half of 2007 and an increase of charged-off loans from the beginning of 2008 up to the first months of 2009. We should mention that the number of defaulted loans decreased markedly in the following years, while the number of charged-off loans started to grow after 2012. But for the latter we should also take into account the growing number of current loans, in which the share of 5-year loans is also growing, so the comparison of proportions between charged-off and completed loans will be accurate only by 2015-2017.
Having compared the loan status frequency over time with the distributions of listing creation and loan closing, we may assume that the growing share of defaulted and charged-off loans led to the decreased number of loans after the relaunch. This might have been caused by stricter criteria for borrowers, by the lack of funds for dealing with a larger number of loans, or maybe by decreasing attractiveness for investors and therefore their decreasing numbers. We can also suppose that the growing number of completed loans around 2011 created an impulse for the growing number of new listings in 2012.
To get a deeper understanding of how long listings last, we can compare the listing creation dates to their closed dates (for the listings that have a closing date).
There is a noticeable line marking the closing of 3-year loans, which compose the majority of the data set. We can also see a similar line starting from 2011, which is 2 years lower and may correspond to the period when 1-year loans were available. However, there are many listings that were closed much earlier than the end of their term.
## $Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02270 0.07086 0.07850 0.07262 0.09333 0.09772
##
## $Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4187 0.8775 1.3555 1.4853 1.9465 4.6102
##
## $Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.7877 0.7161 1.0940 1.3015 1.6893 4.2856
##
## $Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.004109 0.795757 1.668924 1.742848 2.948581 5.530674
As can be seen from the summary above, it takes only a short time for a loan to be cancelled, and about a year and a half to reach a charged-off status. 75% of completed loans reach this status in less than 3 years, though the maximum exceeds not only the 3-year term, but also the 5-year one. The defaulted group has some negative values in the distribution, so it is necessary to study the data for possible errors.
## ListingKey ListingCreationDate LoanStatus ClosedDate
## 108298 DEAA359893047281162F432 2013-12-27 12:02:50 Defaulted 2010-03-16
There is only one listing where ClosedDate is earlier than ListingCreationDate, which is definitely an error. The exclusion of this listing doesn’t affect the distribution much, and the negative minimum was not meaningful.
## $Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2932 0.7163 1.0942 1.3025 1.6893 4.2856
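The check for impossible dates and the per-status summaries can be sketched as follows. The third row reuses the creation and closing dates of the erroneous record shown above; the other rows, the column names and the 365.25-day year are assumptions for illustration.

```r
# Toy listings: the third one is closed before it was created
listings <- data.frame(
  created = as.Date(c("2007-01-15", "2008-03-01", "2013-12-27")),
  closed  = as.Date(c("2008-06-15", "2009-01-10", "2010-03-16")),
  status  = c("Chargedoff", "Defaulted", "Defaulted")
)

# Listing life in years (365.25 days per year is an assumption)
listings$life_years <- as.numeric(listings$closed - listings$created) / 365.25

# Flag impossible records, then summarise the remaining ones by status
bad <- listings$life_years < 0
life_by_status <- tapply(listings$life_years[!bad], listings$status[!bad], summary)
print(life_by_status)
```

Flagging the bad rows first, instead of deleting them silently, keeps the error visible for later inspection.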
Exploring the distribution of listing life (in years) over time and terms, we can see that 75% of closed loans had a listing life of about the length of their term or shorter, while about 25% of listings tend to have a listing life longer than their loan term. For 1-year loans the median is closer to the end of the term, while for 3-year loans the median listing life is close to 2 years.
Starting from 2011, the growing proportion of loans in progress starts affecting the range of the distributions: they include only the loans that were closed earlier than their terms required. The proportion can be seen in the table and chart below.
##
## 2005 2006 2007 2008 2009 2010 2011
## Cancelled 0.00 0.07 0.00 0.01 0.00 0.00 0.00
## Chargedoff 0.00 16.12 25.64 23.87 11.33 13.27 16.01
## Defaulted 0.00 23.25 13.84 9.09 3.86 3.31 3.09
## Completed 100.00 60.57 60.52 67.03 84.81 82.94 48.88
## FinalPaymentInProgress 0.00 0.00 0.00 0.00 0.00 0.00 0.35
## Current 0.00 0.00 0.00 0.00 0.00 0.34 29.04
## Past Due (1-15 days) 0.00 0.00 0.00 0.00 0.00 0.05 1.13
## Past Due (16-30 days) 0.00 0.00 0.00 0.00 0.00 0.00 0.34
## Past Due (31-60 days) 0.00 0.00 0.00 0.00 0.00 0.00 0.39
## Past Due (61-90 days) 0.00 0.00 0.00 0.00 0.00 0.00 0.37
## Past Due (91-120 days) 0.00 0.00 0.00 0.00 0.00 0.09 0.37
## Past Due (>120 days) 0.00 0.00 0.00 0.00 0.00 0.00 0.04
##
## 2012 2013 2014
## Cancelled 0.00 0.00 0.00
## Chargedoff 11.58 0.88 0.00
## Defaulted 1.83 0.12 0.00
## Completed 28.22 6.74 0.58
## FinalPaymentInProgress 0.28 0.27 0.14
## Current 53.63 89.44 99.12
## Past Due (1-15 days) 1.56 1.04 0.11
## Past Due (16-30 days) 0.55 0.34 0.03
## Past Due (31-60 days) 0.78 0.48 0.01
## Past Due (61-90 days) 0.77 0.35 0.00
## Past Due (91-120 days) 0.76 0.32 0.00
## Past Due (>120 days) 0.04 0.01 0.00
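A table of status percentages within each year, like the one above, can be built with prop.table and margin = 2. The vectors below are invented for illustration and the names are hypothetical.

```r
# Made-up statuses and listing years (not the real data)
status <- c("Completed", "Chargedoff", "Completed", "Current", "Current", "Completed")
year   <- c(2011, 2011, 2011, 2012, 2012, 2012)

# Column-wise percentages: the statuses within each year sum to 100
status_by_year <- round(100 * prop.table(table(status, year), margin = 2), 2)
print(status_by_year)
```

Using margin = 2 normalises by column (year), which is what makes years with very different listing counts comparable.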
As for loan amount, it seems not to have any relationship with closed loan status, except for cancelled loans, but their number is quite small. For completed, defaulted and charged-off loans the distributions are more or less the same (the outliers in completed loans may refer to the loans completed in 2013-2014, when such amounts became available, but the proportion of other closed statuses is too low for these years due to the term lengths).
Another aspect of listing life is the time between the creation of the listing and the origination of the loan.
From the chart above we can state that the waiting time is usually rather short, though in more recent years it may have become longer. However, there were periods when the waiting time could reach years, though this applied to only a small number of loans. The distribution of days in waiting can help us to be more precise.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.405 4.574 8.053 11.619 12.376 1094.189
The overall distribution of days in waiting is centered at about 10 days. We can plot it over the years of listing creation to see if there are any changes corresponding to the spikes on the scatterplot.
As can be seen from the plot above, the median waiting time was the highest in 2009, while the longest waiting time together with a lot of outliers could be seen in 2008. This period coincides with the Great Recession, so we may assume either an increase in borrowers who might be considered not very reliable or a decrease in the number of investors due to the overall economic situation. Also, the relaunch of Prosper’s website in 2009 might have affected the waiting time for the listings created before the relaunch.
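The waiting time in days is simply the difference between two timestamps. A minimal sketch, with invented timestamps and hypothetical variable names:

```r
# Made-up creation and origination times (not the real data)
created    <- as.POSIXct(c("2013-05-01 10:00:00", "2008-09-20 08:30:00"), tz = "UTC")
originated <- as.POSIXct(c("2013-05-09 00:00:00", "2008-10-02 00:00:00"), tz = "UTC")

# Fixing the unit avoids difftime choosing hours or weeks on its own
days_waiting <- as.numeric(difftime(originated, created, units = "days"))
print(summary(days_waiting))
```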
We can check if the duration of waiting may be affected by some characteristics of the borrowers, for example, their employment status or credit score.
For the employment status, though the median waiting time for “Employed” and “Self-Employed” categories is slightly lower, the largest number of outliers can be seen in “Full-time” category, which was the most numerous before “Employed” level was implemented.
We can also check if the ability to verify the income had any impact on the distribution of waiting days.
## [1] "Income Verifiable"
##
## False True
## 8587 104479
As for credit score ranges, again we see the largest number of outliers in the most numerous categories, but the “best” levels are also affected.
In terms of borrowers’ characteristics, I’d say that the highest number of outliers can be observed in the categories that were the most frequent in the periods when most listings with the longest waiting times occurred. This may speak in favor of the other factors mentioned above: the website relaunch, the difficulty of finding investors during the economic recession, or some others.
The loans in the data set have terms of either 1, 3 or 5 years, with 3-year loans being the most frequent. However, they differ noticeably in average loan amounts and average borrowers’ rates.
These differences may also affect the distribution of other variables if used as an additional dimension. It will be explored further in the Multivariate Plots Section.
On the scatterplot the popularity of round amounts is clearly visible. We can also see that after the beginning of 2013 the maximum acceptable loan amount increased from $25,000 to $35,000. The minimum amount was also increased in 2011. We can also assume that before the relaunch the loan amounts were usually lower than after it, but this is easier to see on a boxplot (see below).
Debt consolidation, being the most popular category, also has the widest range of loan amount distribution. Of comparable range are also Business category, Wedding Loans and Baby&Adoption, though the first two have lower medians. Auto and Vacation loans are characterised by comparatively lower loan amounts.
Though the employment statuses “Retired”, “Part-time” and “Not employed” have expectedly lower median loan amounts, the difference between “Employed” and “Full-time” may be caused mostly by the usage of the latter in the period when generally lower loan amounts were accepted.
The median number of investors per loan grew in 2007-2010, but decreased in 2011, dropping to 1 investor in 2013. The change of the plot type or the scale can give a closer look into the data.
As we can see on the scatterplot, the number of investors per loan was growing from 2006 to 2008, but in the period of 2009-2011 there were almost no loans funded by fewer than 10 investors. The situation changed in 2012 and further in 2013, when a growing number of loans was funded by only 1 investor.
The following tables are aimed at exploring the differences between loans funded by one investor and loans funded by more than one.
##
## Not Available Debt Consolidation Home Improvement
## 0.89 76.02 5.12
## Business Personal Loan Student Use
## 3.31 0.10 0.03
## Auto Other Baby&Adoption
## 1.13 6.10 0.21
## Boat Cosmetic Procedure Engagement Ring
## 0.09 0.04 0.16
## Green Loans Household Expenses Large Purchases
## 0.05 1.65 1.17
## Medical/Dental Motorcycle RV
## 1.51 0.18 0.04
## Taxes Vacation Wedding Loans
## 0.75 0.65 0.79
##
## Not Available Debt Consolidation Home Improvement
## 0.01 76.79 5.70
## Business Personal Loan Student Use
## 3.22 0.00 0.00
## Auto Other Baby&Adoption
## 1.13 5.53 0.22
## Boat Cosmetic Procedure Engagement Ring
## 0.11 0.03 0.17
## Green Loans Household Expenses Large Purchases
## 0.05 1.88 1.17
## Medical/Dental Motorcycle RV
## 1.48 0.22 0.03
## Taxes Vacation Wedding Loans
## 0.72 0.74 0.79
##
## Not Available Debt Consolidation Home Improvement
## 0.02 70.05 7.19
## Business Personal Loan Student Use
## 4.50 0.00 0.00
## Auto Other Baby&Adoption
## 1.36 6.31 0.38
## Boat Cosmetic Procedure Engagement Ring
## 0.07 0.11 0.28
## Green Loans Household Expenses Large Purchases
## 0.08 2.36 1.21
## Medical/Dental Motorcycle RV
## 2.11 0.34 0.06
## Taxes Vacation Wedding Loans
## 1.35 1.10 1.12
As can be seen from the tables, the loans with 1 investor are even more concentrated on debt consolidation. As for the correlation between loan amount and the number of investors, for loans with more than 1 investor it is stronger than for the whole data set.
## [1] "All loans: 0.383" "2+ investors: 0.668"
## [1] -0.2762578
It seems that lower rates correspond to a higher number of investors. It may feel slightly counter-intuitive, for higher rates mean higher lender yield, yet higher rates are more often associated with higher risks. We can check for more details in the multivariate analysis. We can also exclude the loans with one investor, for they are specific to only the most recent period, and check the correlation again.
## [1] -0.4173609
The relationship seems to be stronger for the loans where more than one investor is involved.
As for terms, the distributions are also affected by the growing number of loans with single investors in 2012-2013. If these loans are excluded, we can see that longer terms tend on average to attract slightly more investors (presumably because of the larger loan amounts requested by borrowers).
Credit score is considered one of the most important characteristics in the US banking system, being a compound indicator of a person’s financial behavior and responsibility.
We can estimate how Prosper’s preferences regarding the acceptable credit score changed over time. From the visualisation below we can see that in 2009 the company ceased to accept listings with a credit score worse than “Fair”. Also, after the relaunch the most frequent range changed from “Fair” to “Good”, which may mean that the company started choosing more reliable potential borrowers, especially in the first years after the relaunch.
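The comparison of correlations with and without single-investor loans can be sketched like this. The numbers are invented, the vector names are hypothetical, and Pearson correlation (the default of cor) is an assumption, since the report does not state which method was used.

```r
# Made-up loans: amount, number of investors and borrower rate
amount    <- c(4000, 10000, 15000, 25000, 12000, 20000, 3000, 8000)
investors <- c(1, 1, 1, 1, 150, 320, 40, 95)
rate      <- c(0.31, 0.18, 0.15, 0.09, 0.17, 0.10, 0.28, 0.22)

# Loans funded by more than one investor
multi <- investors > 1

cor_all        <- cor(amount, investors)
cor_multi      <- cor(amount[multi], investors[multi])
cor_rate_multi <- cor(rate[multi], investors[multi])
print(round(c(all = cor_all, multi = cor_multi, rate_multi = cor_rate_multi), 3))
```

In this toy example the single-investor loans weaken the amount-investors correlation because their investor count stays at 1 regardless of the amount, which is the effect described in the text.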
As can be seen from the chart below, the proportion of borrowers with “excellent” credit score was the highest in 2009-2010, in comparison with other years (for the listings of 2005 the information about borrowers’ credit scores is not available).
As for borrowers’ rates, the better the credit score, the lower the interest the borrower can expect to pay.
As for reported monthly income (on the plot above), the borrowers with better credit scores tend on average to have higher income (the top 1% is excluded from the plot for better representation of the most frequent values). This may be one of the reasons that they are on average approved for higher loan amounts (see the chart below).
investigation. How did the feature(s) of interest vary with other features in
the dataset?
The bivariate data exploration has shown that the company’s performance over time has gone through a set of stages, each having its specific characteristics in the variables:
- the early period from 2005 till the company’s website relaunch in the middle of 2009, with the growing number of listings and investors, but a lot of undetailed information on borrowers in the earliest years (with a number of borrowers with low credit scores) and a relatively high proportion of loans that resulted later in defaulted and charged-off status;
- the recovery stage from 2009 till 2011 that was characterised by lower number of new listings and loan amounts, the completion of 3-year loans of the first stage, the higher proportion of borrowers with very good and excellent credit scores and the exclusion of credit ratings worse than fair;
- the growth stage from 2011 with a dynamic increase of new listings and the number of investors per loan; the proportion of borrowers with good and fair credit scores also grew. In 2012-2013 a new type of investor might have been attracted, able to fund a loan single-handedly. Also, after the 1-year and 5-year loans were introduced in 2011, it became necessary to add one more dimension to the analysis, because the loans differ by term in average loan amounts, rates and number of investors.
(not the main feature(s) of interest)?
Exploring the borrowers’ characteristics, I found that the credit score appears to be a reasonable reflection of a person’s financial well-being and behavior, for it corresponds to higher monthly income and a higher proportion of completed loans in comparison with defaulted and charged-off loans. Therefore, borrowers with higher credit scores may expect to be approved for greater loan amounts with lower interest rates.
The strongest relationship was found between listing creation date and loan origination date, so if you are going to apply for a loan at Prosper you can expect the origination of the loan within 14 days in case of approval (or even earlier).
The other strong relationship is between the listing creation date and the loan’s closing date, which is determined first of all by the term of the loan. The introduction of 1-year loans caused some distortion to the model line, but for 3-year loans we may expect the closing date to come within the last year and a half of the term. The first half of the second year seems to be crucial for the loan status of a 3-year loan: if it wasn’t defaulted or charged off in this period, it is more likely to be completed.
There is also a moderate negative relationship between borrowers’ rates (which are equal to lenders’ yield plus fees) and the number of investors, especially for the loans with more than 1 investor. Also, lower rates correlate with higher credit scores, which may be the reason the lenders found these listings more attractive.
The following two plots are made to confirm the assumptions made in Closed Loans section above. On the first one we can see the number of defaulted loans during 2006-2007 and the dynamics of charged-off loans - rather high density in 2006-2009 and - after several years of rather scarce occurrence - the growing density in 2011-2014.
The second plot confirms the assumption about the loan terms: the upper straight line refers to 3-year loans, and the lower straight line, which begins in 2011, refers to 1-year loans. Closed 5-year loans appear in 2012-2014 with no trend, for most of such loans still have a few years before the end of the term.
Adding a trend line to each term separately, we can see that for 1-year loans the closing date may be expected slightly earlier than, but still close to, the end of the term, while for 3-year loans the closing date may be expected at about 1.5-2 years from the origination of the loan. However, the data on closed loans for the most recent periods distort the model because of the many loans that are still in progress. So modelling for 5-year loans is not quite reasonable yet.
It may be useful to truncate the period for 3-year loans to get a more accurate model, also taking into account the loan status.
Here we can see that the trend line on the plot for all 3-year loans was a compromise between the higher line of completed loans and the lower lines of defaulted and charged-off loans.
The borrowers with excellent credit were the only category that underwent only a slight decrease of the average loan amount in 2008-2010, while the good and lower categories experienced a more noticeable decline from 2007 to 2009, followed by the very good category in 2008.
Comparing the average borrower rates for different credit groups over time, we can state that the rates also changed: they grew for all groups in 2007-2008, then jumped higher for the good and fair credit groups in comparison with very good and excellent. The increase stopped in 2011, and the rates started to decrease (for the fair group, from 2010), getting back to the numbers of 2006.
As we can see from the chart above, better credit ratings tend to result in lower interest rates. Longer terms result in higher rates for most credit score groups on average, though the variability of the distribution is higher for the fair and good groups for 3-year loans (though it may be the result of the fluctuation in rates in 2009-2011).
Here we can see that the listings are clustered by loan amount depending on the monthly income and credit score of the borrowers: the higher the credit score and income, the greater the loan amount that may be approved, though the relationship is rather non-linear. A similar clustering can be seen if we change credit score to debt-to-income ratio.
The best combination for a loan amount of about $10,000-15,000 seems to be the debt-to-income ratio of about 0.25-0. and monthly income close to $5,000. Higher income may result in greater amount, while lower income may be bound to lower amounts, no matter the ratio.
The plot above shows the difference in the number of investors depending on the terms of loans. The 3-year and 5-year loans underwent a decrease in the median number of investors in 2011-2012, with a drop in 2013. I would expect it to be related to the company’s efforts to establish an audience of reliable borrowers in the prior two years. Another possible factor may be the overall improvement of the economic situation and the appearance of investors on Prosper who were not only willing to invest, but also capable of funding relatively large loan amounts on their own. The 1-year loans didn’t show the same trends, so these investors might have been interested predominantly in longer terms. The other reason might be the fact that in 2013 1-year loans were accepted only at the beginning of the year, so the distribution does not describe the whole year.
The scatterplot above gives more detailed information about the distributions in the previous section. Here we can divide the data into the same three stages as the overall company’s performance. We can see that during the recovery period the loans of lower amounts were funded mostly by a higher number of investors. The situation changed around 2012, when the greater amounts began to be funded by a lower number of investors. The change continued into 2013, when investors appeared who were able to fund alone the loans that before the relaunch were usually funded by several hundred people.
In addition to the relationship between BorrowerRate and the number of investors per loan discovered earlier, the loan amounts provide additional details. The loans with lower interest rates also tend to be of larger amounts. This reflects the fact that the lowest rates can typically be obtained mostly by borrowers with better credit ratings, who also tend to have higher income. This makes it easier for them to get a loan of a greater amount. A higher loan amount in peer-to-peer lending may require a greater number of investors, because of collective lending and the limited funds of each lender involved, but the characteristics of the borrowers of such loans make it a reliable investment. The higher amounts are more often seen in 3-year and 5-year loans, so the investment will also be long-term.
Also there is a dense group of investors who are interested in loans with comparatively higher interest rates of about 0.28-0.35, but since the loan amounts with such rates are usually lower, the number of investors involved is also relatively small.
Adding another dimension of credit score to the number of investors per loan, loan amounts and time, we can assume that the characteristics of borrowers are not the only important factor in getting a loan. After the relaunch and the recession mostly small loan amounts were funded, no matter the credit score (if not excellent). We may assume that it was the consequence not only of the recession, but also of the statistics of defaulted and charged-off loans demonstrated by Prosper’s borrowers in the earlier years. As time moved forward, a wider range of credit scores became acceptable for the higher loan amounts.
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
The multivariate analysis helped to distinguish the characteristics of the different stages of Prosper’s performance over time and to see how they influence each other. In the early stage, the number of listings and the loan amounts grew together with the number of investors per loan. Then came the recession, together with the relaunch of the company’s website with changed requirements for new borrowers, after a relatively high proportion of first-stage loans had ended in default or charge-off. This was also the time when the overall number of investors seems to have decreased (whether as a result of the poor outcome of earlier loans or of the economic crisis in general, we can’t tell). The more cautious approval of listings and the increase in rates were followed by a growing number of loans (and, apparently, of investors). The overall improvement of the economic situation must also be mentioned here.
One of the most surprising discoveries for me was the negative relationship between loan interest rates and the number of investors. Adding the loan amount variable made it possible to see this relationship in more detail, and I included it in the final section.
This plot describes several aspects of Prosper’s loan history at once, as the company went through three stages during the period in question:
- the early period from 2005 until the relaunch of the company’s website in the middle of 2009, with a growing number of listings and investors, but with a relatively high proportion of loans that later ended in defaulted or charged-off status;
- the recovery stage from 2009 to 2011, characterised by a lower number of new listings together with a growing proportion of completed loans and a decreasing share of defaulted and charged-off loans;
- the growth stage from 2011, with a dynamic increase in new listings and the introduction of two additional terms, 1 year and 5 years, which resulted in a higher proportion of loans in progress because of the growing number of 5-year loans.
We can also see that the company managed to decrease the number of defaulted loans since the earliest stage, and that loans that are past due to some extent make up only a small share of the loans in progress.
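The shares of loan statuses per period discussed above can also be tabulated directly. The following is only a minimal sketch, not the code behind the plot: it uses a small made-up data frame (the values are illustrative and not taken from the Prosper data), and the column name LoanOriginationDate is an assumption.

```r
# Toy data: illustrative values only, NOT taken from the Prosper data set.
# LoanOriginationDate is an assumed column name.
loans_toy <- data.frame(
  LoanOriginationDate = as.Date(c("2007-03-01", "2007-08-15", "2007-11-02",
                                  "2010-05-20", "2010-09-09",
                                  "2013-01-10", "2013-06-30", "2013-12-01")),
  LoanStatus = c("Completed", "Chargedoff", "Defaulted",
                 "Completed", "Completed",
                 "Current", "Current", "Completed")
)

# Extract the origination year as the time dimension.
loans_toy$Year <- format(loans_toy$LoanOriginationDate, "%Y")

# Count loans per year and status, then convert each row (year) to shares.
status_counts <- table(loans_toy$Year, loans_toy$LoanStatus)
status_share  <- prop.table(status_counts, margin = 1)

print(round(status_share, 2))
```

Each row of status_share sums to 1, so the years can be compared even when the number of loans per year differs a lot.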
The progress made during the recovery stage coincides with the temporary increase in rates for all credit ratings in 2008-2011 and the exclusion of borrowers with credit ratings below fair starting from 2009. While such measures are reasonable and could be expected during an economic recession, in peer-to-peer lending the presence and attraction of investors is itself an important factor. That is why I found the behavior of investors the most interesting aspect discovered.
This plot gives us some understanding of the other side of peer-to-peer lending: the investors. Here we can visually divide the data into the same three stages as for the company’s overall performance. We can see that during the recovery period, loans of lower amounts were funded mostly by a higher number of investors. I would also assume an overall decrease in the number of investors involved. The situation started to change in 2011. Around 2012, larger loans began to be funded by fewer investors. Here I would also expect an overall increase in the number of investors together with the number of loans. The change continued into 2013, when a number of investors appeared who were able to fund alone loans that, before the recession and the relaunch, were usually funded by several hundred investors.
For this plot I used BorrowerRate on the x-axis to visualise the relationship between what borrowers are expected to pay and how many investors are interested in such loans, since the lenders’ yield is based on the borrower’s rate minus servicing fees.
As can be seen from the plot, there is a negative correlation between the number of investors per loan and the interest rate paid by borrowers. Loans with lower interest rates also tend to be larger. This reflects the fact that the lowest rates are typically available to borrowers with better credit ratings, who also tend to have higher incomes and may therefore be approved for a higher loan amount. A larger loan in peer-to-peer lending may require a greater number of investors, because there is no single financial institution behind it, but the characteristics of the borrowers make such loans a reliable investment. The higher amounts are more often seen in 3-year and 5-year loans, so the investment is also long-term.
There is also a dense cluster of investors interested in loans with comparatively high interest rates of about 0.28-0.35, but since loan amounts at such rates are usually lower, the number of investors involved in each loan is also relatively small.
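The direction of the relationships described above can be checked numerically as well as visually. The following is a minimal sketch rather than the code used for the plot: the data frame is made up (the values are illustrative and not taken from the Prosper data), and the column names Investors and LoanOriginalAmount are assumptions. The ggplot2 part only runs if the package is installed.

```r
# Toy data: illustrative values only, NOT taken from the Prosper data set.
# Investors and LoanOriginalAmount are assumed column names.
loans_toy <- data.frame(
  BorrowerRate       = c(0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.33),
  Investors          = c(320, 250, 180, 120, 60, 45, 30),
  LoanOriginalAmount = c(25000, 20000, 15000, 10000, 5000, 4000, 3000)
)

# Spearman rank correlation depends only on ranks, so it is less
# sensitive to the strong skew typical of investor counts.
rho_rate_investors <- cor(loans_toy$BorrowerRate, loans_toy$Investors,
                          method = "spearman")
rho_amount_investors <- cor(loans_toy$LoanOriginalAmount,
                            loans_toy$Investors, method = "spearman")

# Optional scatter plot with loan amount mapped to colour.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(loans_toy,
              aes(x = BorrowerRate, y = Investors,
                  colour = LoanOriginalAmount)) +
    geom_point(alpha = 0.5)
}
```

On the real data the same two calls to cor() would show whether the sign of each relationship matches what the plot suggests; calling print(p) draws the scatter plot.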
This data set presents several challenges, in my opinion. The first is the number of variables: even narrowing them down for further exploration takes a lot of effort to identify the most promising ones. Another challenge comes from the rather long period covered by the data. On the one hand, the data change over time due to different circumstances, which makes the exploration even more interesting, but it also makes, for example, the analysis of averages less meaningful without the time dimension (or less reliable without verification against it). On the other hand, the approach to some variables also changes over time, which is especially noticeable in the changing levels of the categorical variables. Some of them would require rethinking and relabelling to be used in further analysis. This applies especially to the categorical characteristics of borrowers, which may add more to the trends discovered once the levels are rearranged to combine those with similar meanings into broader groups.
My favorite part was working with ggplot2 for visualisations; I appreciated the versatility it offers once you understand the structure of the plotting code. My main struggle was remembering to fit the code within the 80-character limit, mostly because of the long variable names in the data set.
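The regrouping of categorical levels mentioned above could be done in base R by assigning a named list to levels(). This is only a sketch under assumptions: the level names and the grouping below are hypothetical examples of an employment-status variable, not the actual levels or a grouping used in this project.

```r
# Hypothetical employment-status values; the grouping below is one
# possible choice for illustration, not the one used in the analysis.
status <- factor(c("Full-time", "Employed", "Part-time", "Self-employed",
                   "Not employed", "Retired", "Full-time"))

# Each list name is a new level; its value lists the old levels it absorbs.
levels(status) <- list(
  Employed     = c("Employed", "Full-time", "Part-time"),
  SelfEmployed = "Self-employed",
  NotWorking   = c("Not employed", "Retired")
)

print(table(status))
```

Any old level not mentioned in the list becomes NA, so it is worth checking sum(is.na(status)) after regrouping a real variable.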