Bankruptcy Prediction Challenge 💸

Authors: Yassine Yousfi, Mostafa Bouziane

(a.k.a. Back-propagated boys)

👍

Model evaluation

In [28]:
def shuffle_split_pimped(clf, X_train, y_train, n):
    ...  # body omitted on the slide
    return r  # r: per-split scores, averaged later via .mean()
  • Ensures ~20% of positive samples in the test fold
  • Ensures ~3% of positive samples in the train fold
  • Gives accurate estimates of the leaderboard score (real-life score)
  • ~1000 positive samples in the original dataset → limited number of splits (~7/8)
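
The body is omitted on the slide; below is a minimal sketch of what such a split function could look like, assuming numpy arrays, ROC-AUC as the per-split score, and that a fixed share of the positives is reserved for the test fold while the train fold stays near the original ~3% positive rate. The name, signature and sampling scheme are illustrative, not the authors' exact code.

import numpy as np
from sklearn.metrics import roc_auc_score

def shuffle_split_sketch(clf, X, y, n, test_pos_frac=0.2, seed=0):
    """Hypothetical re-creation of shuffle_split_pimped (not the original)."""
    rng = np.random.RandomState(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    scores = []
    for _ in range(n):
        # Reserve ~20% of the positive samples for the test fold
        test_pos = rng.choice(pos, size=int(test_pos_frac * len(pos)), replace=False)
        # Sample negatives so the test fold mimics the real ~3% positive rate
        n_test_neg = int(len(test_pos) * (1 - 0.03) / 0.03)
        test_neg = rng.choice(neg, size=min(n_test_neg, len(neg)), replace=False)
        test_idx = np.concatenate([test_pos, test_neg])
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    return np.array(scores)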

Feature engineering

In [27]:
# Compare the NaN-feature distribution for bankrupt vs. healthy companies
f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))
ax1.hist(feature_nan[y_train == 1], bins=30, density=True, facecolor='red', alpha=0.55)
ax1.set_title("Bankrupt NaN")
ax2.hist(feature_nan[y_train == 0], bins=30, density=True, facecolor='blue', alpha=0.55)
ax2.set_title("Healthy NaN")
plt.show()
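
feature_nan is defined in an earlier cell that isn't shown; it is presumably the per-company count of missing values. A possible construction, assuming X_train is a pandas DataFrame with NaNs left in place:

feature_nan = X_train.isna().sum(axis=1).values  # number of missing features per sample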

Parameter optimisation

In [7]:
from hyperopt import fmin, hp, tpe
from xgboost import XGBClassifier

def objective(params):
    # Cast hyperopt's samples to the types/precision XGBoost expects
    params = {
        'max_depth': int(params['max_depth']),
        'gamma': "{:.4f}".format(params['gamma']),
        'colsample_bytree': '{:.4f}'.format(params['colsample_bytree']),
        'scale_pos_weight': int(params['scale_pos_weight']),
        'n_estimators': int(params['n_estimators']),
        'learning_rate': '{:.4f}'.format(params['learning_rate']),
        'subsample': '{:.4f}'.format(params['subsample']),
    }
    clf = XGBClassifier(n_jobs=4, eval_metric="auc", **params)
    # Mean AUC over 8 custom shuffle splits; negated because fmin minimises
    score = shuffle_split_pimped(clf, X_train, y_train, 8).mean()
    print("Score {:.4f} params {}".format(score, params))
    return -score

space = {
    'max_depth': hp.quniform('max_depth', 3, 12, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.7, 1.0),
    'gamma': hp.uniform('gamma', 0.0, 0.6),
    'scale_pos_weight': hp.quniform('scale_pos_weight', 10, 100, 1),
    'n_estimators': hp.quniform('n_estimators', 200, 400, 20),
    'learning_rate': hp.uniform('learning_rate', 0.04, 0.15),
    'subsample': hp.uniform('subsample', 0.7, 1.0),
}

best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=10)
Score 0.9196 params {'max_depth': 4, 'gamma': '0.0824', 'colsample_bytree': '0.8314', 'scale_pos_weight': 94, 'n_estimators': 380, 'learning_rate': '0.0641', 'eta': '0.0315', 'subsample': '0.7748', 'num_boost_round': 60}

Best params:

  • max_depth: 5
  • gamma: 0.5819
  • colsample_bytree: 0.8125
  • scale_pos_weight: 86
  • n_estimators: 240
  • learning_rate: 0.1027
  • subsample: 0.8511
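
For completeness, refitting a single model on the full training set with these values might look like the following (X_test is a stand-in for the challenge's evaluation features, not a name from the original notebook):

from xgboost import XGBClassifier

best_params = {
    'max_depth': 5,
    'gamma': 0.5819,
    'colsample_bytree': 0.8125,
    'scale_pos_weight': 86,
    'n_estimators': 240,
    'learning_rate': 0.1027,
    'subsample': 0.8511,
}
final_clf = XGBClassifier(n_jobs=4, eval_metric="auc", **best_params)
final_clf.fit(X_train, y_train)
y_pred = final_clf.predict_proba(X_test)[:, 1]  # probabilities for the AUC leaderboard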

Small improvement

Repeat N_rounds times (see the sketch after this list):

  • Fit XGBOOST on X_train
  • Predict
  • Create N_feat random new features based on {+, -, x, ÷} operations and stack them onto X_train
  • Fit XGBOOST on the new X_train
  • Compute feature importances (normalised)
  • Keep only the features with importance > Thresh
  • Update X_train
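
A condensed sketch of this loop, assuming X_train is a numpy array of floats; N_ROUNDS, N_FEAT and THRESH are illustrative values, and the intermediate scoring with shuffle_split_pimped is omitted for brevity.

import numpy as np
from xgboost import XGBClassifier

OPS = [np.add, np.subtract, np.multiply, np.divide]
N_ROUNDS, N_FEAT, THRESH = 3, 50, 0.005  # illustrative values
rng = np.random.RandomState(0)
X_cur = np.asarray(X_train, dtype=float)

for _ in range(N_ROUNDS):
    # Create N_FEAT random {+, -, x, ÷} combinations of existing columns
    new_cols = []
    for _ in range(N_FEAT):
        i, j = rng.randint(X_cur.shape[1], size=2)
        op = OPS[rng.randint(len(OPS))]
        with np.errstate(divide='ignore', invalid='ignore'):
            col = op(X_cur[:, i], X_cur[:, j])
        col[~np.isfinite(col)] = np.nan  # XGBoost treats NaN as missing
        new_cols.append(col)
    X_cur = np.column_stack([X_cur] + new_cols)

    # Fit XGBOOST on the augmented matrix and keep only informative columns
    clf = XGBClassifier(n_jobs=4, eval_metric="auc")
    clf.fit(X_cur, y_train)
    keep = clf.feature_importances_ > THRESH  # importances are already normalised
    X_cur = X_cur[:, keep]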

Why does it work?

Our dataset is composed of ratios, and {+, -, x, ÷} combinations of ratios sometimes make sense.

Drawback?

Easy overfitting.

👎

SMOTE + ENN

SMOTE (Synthetic Minority Over-Sampling Technique)

For each point p in the minority class S:

  • Compute its k nearest neighbors in S

  • Randomly choose r ≤ k of the neighbors (with replacement)

  • Choose a random point along the lines joining p and each of the r selected neighbors

  • Add these synthetic points to the dataset with class S
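
Concretely, each synthetic point is an interpolation x_new = p + λ·(x_nn − p), with λ drawn uniformly from [0, 1].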

ENN (Edited Nearest Neighbor)

  • Remove any example whose class label differs from the class of at least two of its three nearest neighbors
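
This SMOTE + ENN combination is available off-the-shelf in imbalanced-learn; a minimal sketch (the library choice is an assumption, the authors' exact call isn't shown, and SMOTE requires missing values to be imputed first):

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN

resampler = SMOTEENN(
    smote=SMOTE(k_neighbors=5, random_state=0),
    enn=EditedNearestNeighbours(n_neighbors=3),  # the 3-NN editing rule above
    random_state=0,
)
X_res, y_res = resampler.fit_resample(X_train, y_train)  # resample the training fold only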

Drawbacks

  • Curse of dimensionality (d = 65): neighbors end up quite far apart
  • Removing/adding points near the decision boundary

🕵️

Potential improvements

  • Investigate the folds where XGBOOST didn't score well and possibly fit another classifier on them
  • Further investigate the random-feature method (tune N_rounds, N_feat, Thresh, etc.)
  • Feature selection > SMOTE