Cross-Validation

Project Description

The main goal of this project is to identify and predict the loan status of borrowers. To check whether the dataset has time-series characteristics, two cross-validation methods (K-fold and TimeSplit) are also used. RandomForest is the main model used to predict and analyze the loan status. The dataset for model training includes 916567 rows and 10 columns, covering 2007 to 2017. Data can be found in my GitHub Repository.

Response Variable

Loan_Stat -> One of 3 statuses: Fully Paid, Charged Off, or Default

Explanatory Variables

Annual_Inc -> Annual income
Emp_Length -> Employment length
Dti -> The debt-to-income ratio of the borrower
Delinq_2yrs -> The number of times the borrower had been 30+ days past due on a payment in the past 2 years
Term -> Borrowing term
Grade -> Historical credit grade
Inq_Last_6mths -> The borrower's number of inquiries by creditors in the last 6 months
Purpose -> Purpose for borrowing

Feature Engineering

loan_stat
Fully_Paid -> 0
Default, Charged-Off -> 1

Grade
A, B, C, D, E, F, G -> 1, 2, 3, 4, 5, 6, 7

Purpose
Debt_Consolidation -> 1
Other -> 0

    # getting dummies for loan_status
    data_df['loan_status'] = data_df['loan_status']....
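To make the K-fold versus TimeSplit comparison from the project description concrete, here is a minimal sketch of how the two schemes could be scored side by side. It assumes scikit-learn (`KFold`, `TimeSeriesSplit`, `RandomForestClassifier`, `cross_val_score`) and uses small synthetic data in place of the loan dataset; the names `X`, `y`, `kfold`, and `tsplit` are illustrative and not taken from the project code. It also assumes the rows are sorted by date, which is what gives a time-ordered split its meaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Synthetic stand-in data (assumption): rows are taken to be sorted by issue date.
rng = np.random.default_rng(0)
n_rows = 300
X = rng.normal(size=(n_rows, 8))  # 8 explanatory variables, already numeric
# Binary target: 0 = Fully Paid, 1 = Default / Charged Off
y = (X[:, 0] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=0)

# Shuffled K-fold ignores row order: any row can land in any fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
# TimeSeriesSplit always trains on earlier rows and tests on later rows.
tsplit = TimeSeriesSplit(n_splits=5)

kfold_scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
tsplit_scores = cross_val_score(model, X, y, cv=tsplit, scoring="accuracy")

print("K-fold accuracy:    mean=%.3f std=%.3f" % (kfold_scores.mean(), kfold_scores.std()))
print("TimeSplit accuracy: mean=%.3f std=%.3f" % (tsplit_scores.mean(), tsplit_scores.std()))
```

One way to read the result: if the time-ordered scores are clearly lower than the shuffled K-fold scores, that hints that the relationship between features and loan status drifts over time. If the two are close, the data behaves more like an unordered sample. The gap is suggestive rather than conclusive, since the early time-ordered folds also train on fewer rows.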