Case Study for a Credit Scorecard Analysis
This example shows how to create a `creditscorecard` object, bin data, and display and plot binned data information. This example also shows how to fit a logistic regression model, obtain a score for the scorecard model, determine the probabilities of default, and validate the credit scorecard model using three different metrics.

Step 1. Create a `creditscorecard` object.
Use the `CreditCardData.mat` file to load the data (using a dataset from Refaat 2011). If your data contains many predictors, you can first use `screenpredictors` from Risk Management Toolbox™ to pare down a potentially large set of predictors to a subset that is most predictive of the credit scorecard response variable. You can then use this subset of predictors when creating the `creditscorecard` object.

When creating a `creditscorecard` object, by default, `'ResponseVar'` is set to the last column in the data (`'status'` in this example) and `'GoodLabel'` to the response value with the highest count (`0` in this example). The syntax for `creditscorecard` indicates that `'CustID'` is the `'IDVar'` to remove from the list of predictors. Also, while not demonstrated in this example, when creating a `creditscorecard` object you can use the optional name-value pair argument `'WeightsVar'` to specify observation (sample) weights, or `'BinMissingData'` to bin missing data.

Perform some initial data exploration. Inquire about predictor statistics for the categorical variable `'ResStatus'` and plot the bin information for `'ResStatus'`. This bin information contains the frequencies of “Good” and “Bad,” and bin statistics. Avoid having bins with frequencies of zero because they lead to infinite or undefined (`NaN`) statistics. Use the `modifybins` or `autobinning` functions to bin the data accordingly.

For numeric data, a common first step is 'fine classing': binning the data into several bins, defined with a regular grid. To illustrate this point, use the predictor `'CustIncome'`.
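The following sketch shows these steps, assuming the toolbox's `CreditCardData.mat` file is on the path; the fine-classing grid values are illustrative:

```matlab
load CreditCardData                               % provides the table "data"
sc = creditscorecard(data,'IDVar','CustID');      % 'status' is the response by default

% Explore the categorical predictor 'ResStatus'
bininfo(sc,'ResStatus')
plotbins(sc,'ResStatus')

% Fine-class 'CustIncome' on a regular grid (grid values are illustrative)
sc = modifybins(sc,'CustIncome','CutPoints',20000:5000:60000);
bininfo(sc,'CustIncome')
```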
Step 2a. Automatically bin the data.
Use the `autobinning` function to perform automatic binning for every predictor variable, using the default `'Monotone'` algorithm with default algorithm options. After the automatic binning step, every predictor bin must be reviewed using the `bininfo` and `plotbins` functions and fine-tuned. A monotonic, ideally linear, trend in the Weight of Evidence (WOE) is desirable for credit scorecards because this translates into linear points for a given predictor. The WOE trends can be visualized using `plotbins`.
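A minimal sketch of this step, continuing from the `sc` object created above:

```matlab
% Automatically bin every predictor with the default 'Monotone' algorithm
sc = autobinning(sc);

% Review the binned predictors and their WOE trends
plotbins(sc,'ResStatus')
plotbins(sc,'CustAge')
```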
Unlike the initial plot of `'ResStatus'` when the scorecard was created, the new plot for `'ResStatus'` shows an increasing WOE trend. This is because the `autobinning` function, by default, sorts the order of the categories by increasing odds. These plots show that the `'Monotone'` algorithm does a good job of finding monotone WOE trends for this dataset. To complete the binning process, it is necessary to make only a few manual adjustments for some predictors using the `modifybins` function.
Step 2b. Fine-tune the bins using manual binning.
Common steps to manually modify bins are:
- Use the `bininfo` function with two output arguments, where the second argument contains the binning rules.
- Manually modify the binning rules using the second output argument from `bininfo`.
- Set the updated binning rules with `modifybins`, and then use `plotbins` or `bininfo` to review the updated bins.
For example, based on the plot for `'CustAge'` in Step 2a, bins number 1 and 2 have similar WOEs, as do bins number 5 and 6, so they can be merged using the steps outlined above. Likewise, based on the corresponding plots, it is best to merge bins 3, 4, and 5 for `'CustIncome'`, bins 2 and 3 for `'TmWBank'`, and bins 2 and 3 for `'AMBalance'`, because in each case the merged bins have similar WOEs (see the sketch below). Once this fine-tuning is completed, the bins for all predictors have close-to-linear WOE trends.
Step 3. Fit a logistic regression model.
The `fitmodel` function fits a logistic regression model to the WOE data. `fitmodel` internally bins the training data, transforms it into WOE values, maps the response variable so that `'Good'` is `1`, and fits a linear logistic regression model. By default, `fitmodel` uses a stepwise procedure to determine which predictors should be in the model.
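A minimal sketch:

```matlab
% Fit a stepwise logistic regression model on the WOE-transformed data;
% the second output is the underlying linear model object
[sc,mdl] = fitmodel(sc);
```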
Step 4. Review and format scorecard points.
After fitting the logistic model, the points are by default unscaled and come directly from the combination of WOE values and model coefficients. The `displaypoints` function summarizes the scorecard points. This is a good time to modify the bin labels, if that is of cosmetic interest; to do so, use `modifybins` to change the bin labels. Points are usually scaled and often rounded as well. To do this, use the `formatpoints` function. For example, you can set a target level of points corresponding to a target odds level and also set the required points-to-double-the-odds (PDO).
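A sketch of this step; the scaling targets (500 points at odds of 2, with a PDO of 50) are illustrative:

```matlab
% Unscaled points straight from the fitted model
p1 = displaypoints(sc);

% Scale: 500 points at odds of 2, and 50 points to double the odds
% (target values are illustrative)
sc = formatpoints(sc,'PointsOddsAndPDO',[500 2 50]);
p2 = displaypoints(sc);
```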
Step 5. Score the data.
The `score` function computes the scores for the training data. An optional data input can also be passed to `score`, for example, validation data. The points per predictor for each customer are provided as an optional output.
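A minimal sketch:

```matlab
% Score the training data; the optional second output contains
% the points per predictor for each customer
[Scores,Points] = score(sc);
```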
Step 6. Calculate the probability of default.
To calculate the probability of default, use the `probdefault` function. Then define the probability of being “Good” and plot the predicted odds versus the formatted scores. Visually verify that the target points and target odds match and that the points-to-double-the-odds (PDO) relationship holds.
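A sketch of this step, assuming `Scores` from Step 5; the plotting details are illustrative:

```matlab
% Probability of default, probability of "Good", and implied odds
pd = probdefault(sc);
ProbGood = 1 - pd;
PredictedOdds = ProbGood ./ pd;

% Plot predicted odds against the formatted scores
figure
plot(Scores,PredictedOdds,'.')
xlabel('Score')
ylabel('Predicted odds')
```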
Step 7. Validate the credit scorecard model using the CAP, ROC, and Kolmogorov-Smirnov statistic.
The `creditscorecard` class supports three validation methods: the Cumulative Accuracy Profile (CAP), the Receiver Operating Characteristic (ROC), and the Kolmogorov-Smirnov (K-S) statistic. For more information on CAP, ROC, and KS, see Cumulative Accuracy Profile (CAP), Receiver Operating Characteristic (ROC), and Kolmogorov-Smirnov statistic (KS).
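A minimal sketch:

```matlab
% Validate the scorecard on the training data with all three metrics
[Stats,T] = validatemodel(sc,'Plot',{'CAP','ROC','KS'});
disp(Stats)
```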
See Also

`autobinning` | `bindata` | `bininfo` | `compact` | `creditscorecard` | `displaypoints` | `fitmodel` | `formatpoints` | `modifybins` | `modifypredictor` | `plotbins` | `predictorinfo` | `probdefault` | `score` | `setmodel` | `validatemodel`
Related Examples
- Credit Rating by Bagging Decision Trees (Statistics and Machine Learning Toolbox)
(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)
In our data science course this morning, we used random forests to improve prediction on the German Credit Dataset. Almost all variables in the dataset are coded as numeric, but most of them are actually factors, so let us convert the categorical variables to factors. Let us then create our training/calibration and validation/testing datasets, with a 1/3-2/3 split.
The first model we can fit is a logistic regression on selected covariates. Based on that model, it is possible to draw the ROC curve and to compute the AUC (on the validation dataset).
An alternative is to consider a logistic regression on all explanatory variables
We might overfit here, and we should observe that on the ROC curve.
There is a slight improvement here, compared with the previous model, where only five explanatory variables were considered.
Consider now a regression tree (on all covariates). We can visualize the fitted tree and then draw the ROC curve for that model.
As expected, a single tree has lower performance compared with a logistic regression. A natural idea is then to grow several trees using some bootstrap procedure and to aggregate their predictions. Here this model is (slightly) better than the logistic regression. In fact, if we create many training/validation samples and compare the AUCs, we observe that, on average, random forests perform better than logistic regressions.