Fraud Detection with GBM

This tutorial shows how to train a gradient boosting machine (GBM) model to detect credit card fraud using Ludwig's creditcard_fraud dataset.

Load data

First, let's download the dataset from Kaggle.

!ludwig datasets download creditcard_fraud

This command downloads the credit card fraud dataset from Kaggle.

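The creditcard_fraud dataset contains over 284K records, each with 30 numeric features (Time, V1 through V28, and Amount) and a binary label, Class, indicating whether the transaction is fraudulent.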

from ludwig.benchmarking.utils import load_from_module
from ludwig.datasets import creditcard_fraud

df = load_from_module(creditcard_fraud, {'name': 'Class', 'type': 'binary'})

This loads the dataset into a Pandas DataFrame and adds a reproducible train/validation/test split that is stratified on the output feature.

df.groupby('split').Class.value_counts()

split	Class	count
0	0	178180
	1	349
1	0	19802
	1	34
2	0	86333
	1	109
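In Ludwig's split column, 0, 1, and 2 denote the training, validation, and test sets. The counts above show how rare the fraud class is; the short sketch below recomputes the totals and per-split fraud rates from those counts. The dictionary and variable names are illustrative and not part of Ludwig.

```python
# Counts copied from the groupby output above: {split: (non_fraud, fraud)}.
split_counts = {
    "train (0)": (178180, 349),
    "validation (1)": (19802, 34),
    "test (2)": (86333, 109),
}

total_rows = sum(neg + pos for neg, pos in split_counts.values())
total_fraud = sum(pos for _, pos in split_counts.values())

# Fraction of fraudulent transactions in each split.
fraud_rates = {
    name: pos / (neg + pos) for name, (neg, pos) in split_counts.items()
}

for name, rate in fraud_rates.items():
    print(f"{name}: fraud rate {rate:.4%}")
print(f"overall: {total_rows} rows, {total_fraud} fraudulent")
```

Fraud makes up less than 0.2% of every split, which is worth keeping in mind when reading the evaluation metrics later on.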

Training

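Next, let's create a Ludwig configuration that defines our machine learning task. In this configuration, we specify that we want to train a GBM model, list the input and output features, and set some of the model's hyperparameters. For more details on the available trainer parameters, see the user guide.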

import yaml

config = yaml.safe_load(
"""
model_type: gbm

input_features:
  - name: Time
    type: number
  - name: V1
    type: number
  - name: V2
    type: number
  - name: V3
    type: number
  - name: V4
    type: number
  - name: V5
    type: number
  - name: V6
    type: number
  - name: V7
    type: number
  - name: V8
    type: number
  - name: V9
    type: number
  - name: V10
    type: number
  - name: V11
    type: number
  - name: V12
    type: number
  - name: V13
    type: number
  - name: V14
    type: number
  - name: V15
    type: number
  - name: V16
    type: number
  - name: V17
    type: number
  - name: V18
    type: number
  - name: V19
    type: number
  - name: V20
    type: number
  - name: V21
    type: number
  - name: V22
    type: number
  - name: V23
    type: number
  - name: V24
    type: number
  - name: V25
    type: number
  - name: V26
    type: number
  - name: V27
    type: number
  - name: V28
    type: number
  - name: Amount
    type: number

output_features:
  - name: Class
    type: binary

trainer:
  num_boost_round: 300
  lambda_l1: 0.00011379587942715957
  lambda_l2: 8.286477350867434
  bagging_fraction: 0.4868130193152093
  feature_fraction: 0.462444410839139
  evaluate_training_set: false
"""
)
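All 30 input features share the same type, so an equivalent configuration could also be built programmatically. The following is a minimal sketch that constructs the same settings as a plain Python dictionary (yaml.safe_load above also returns a dictionary). The names feature_names and config_dict are illustrative.

```python
# Feature names as listed in the YAML config above: Time, V1..V28, Amount.
feature_names = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]

config_dict = {
    "model_type": "gbm",
    "input_features": [
        {"name": name, "type": "number"} for name in feature_names
    ],
    "output_features": [{"name": "Class", "type": "binary"}],
    # Trainer values copied verbatim from the YAML config above.
    "trainer": {
        "num_boost_round": 300,
        "lambda_l1": 0.00011379587942715957,
        "lambda_l2": 8.286477350867434,
        "bagging_fraction": 0.4868130193152093,
        "feature_fraction": 0.462444410839139,
        "evaluate_training_set": False,
    },
}

print(len(config_dict["input_features"]), "input features")
```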

Now that the data and configuration are set up, we can train our GBM model with the following code:

import logging
from ludwig.api import LudwigModel

model = LudwigModel(config, logging_level=logging.INFO)
train_stats, preprocessed_data, output_directory = model.train(df)

Evaluation

Once training is complete, we can evaluate the model's performance on the test set with the model.evaluate method:

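# preprocessed_data holds the processed training, validation, and test sets plus the training set metadata
train, valid, test, metadata = preprocessed_data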

evaluation_statistics, predictions, output_directory = model.evaluate(test, collect_overall_stats=True)

ROC AUC

evaluation_statistics['Class']["roc_auc"]

0.9429567456245422

Accuracy

evaluation_statistics['Class']["accuracy"]

0.9995435476303101
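Because fraud is so rare in this dataset, accuracy on its own says little about how well fraud is detected. As a point of comparison, the sketch below computes the accuracy of a trivial classifier that labels every test transaction as non-fraud, using the test-split class counts from the split table earlier in this tutorial. The variable names are illustrative.

```python
# Test-split class counts from the split table earlier in this tutorial.
test_non_fraud = 86333
test_fraud = 109

# A trivial classifier that labels every transaction as non-fraud is
# correct on every non-fraud row and wrong on every fraud row.
baseline_accuracy = test_non_fraud / (test_non_fraud + test_fraud)
baseline_recall_on_fraud = 0 / test_fraud

print(f"always-non-fraud accuracy: {baseline_accuracy:.6f}")
print(f"always-non-fraud recall on fraud: {baseline_recall_on_fraud:.1f}")
```

The trivial baseline already reaches an accuracy close to 99.9% while catching no fraud at all, so ROC AUC, precision, and recall are more informative for this task.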

Precision, recall, and F1 score

evaluation_statistics['Class']["overall_stats"]

{'token_accuracy': 0.9995435633656935,
 'avg_precision_macro': 0.9689512098036177,
 'avg_recall_macro': 0.8917086143188491,
 'avg_f1_score_macro': 0.9268520044110913,
 'avg_precision_micro': 0.9995435633656935,
 'avg_recall_micro': 0.9995435633656935,
 'avg_f1_score_micro': 0.9995435633656935,
 'avg_precision_weighted': 0.9995435633656935,
 'avg_recall_weighted': 0.9995435633656935,
 'avg_f1_score_weighted': 0.9995230814612599,
 'kappa_score': 0.8537058585299842}
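In the output above, the micro-averaged precision and recall equal token_accuracy, while the macro averages are lower. With one label per example, micro averaging pools all predictions and reduces to accuracy, whereas macro averaging weights both classes equally and so reflects weaker performance on the rare fraud class. The sketch below illustrates this with hypothetical confusion-matrix counts; they are not the results of the model trained above.

```python
# Hypothetical counts for illustration only; not taken from the model above.
tn, fp, fn, tp = 990, 2, 3, 5
total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Per-class precision and recall, treating each class in turn as "positive".
precision_fraud = tp / (tp + fp)
recall_fraud = tp / (tp + fn)
precision_legit = tn / (tn + fn)
recall_legit = tn / (tn + fp)

# Macro average: unweighted mean over the two classes.
macro_precision = (precision_fraud + precision_legit) / 2
macro_recall = (recall_fraud + recall_legit) / 2

# Micro average: pool correct predictions and totals across both classes.
micro_precision = (tp + tn) / ((tp + fp) + (tn + fn))
micro_recall = (tp + tn) / ((tp + fn) + (tn + fp))

print(f"accuracy={accuracy:.4f}")
print(f"micro precision={micro_precision:.4f}, micro recall={micro_recall:.4f}")
print(f"macro precision={macro_precision:.4f}, macro recall={macro_recall:.4f}")
```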

Visualization

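In addition to evaluating the model with metrics such as accuracy, precision, and recall, it helps to visualize the results. Ludwig provides several visualization options, including confusion matrices and ROC curves.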

from ludwig import visualize

Confusion matrix

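We can use Ludwig's visualize.confusion_matrix function to create a confusion matrix, which breaks the model's predictions down into true positives, true negatives, false positives, and false negatives. The following code displays a confusion matrix plot of the model's performance; because normalize=True is passed, the cells show normalized values rather than raw counts.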

visualize.confusion_matrix(
    [evaluation_statistics],
    model.training_set_metadata,
    'Class',
    top_n_classes=[2],
    model_names=[''],
    normalize=True
)

[Figures: confusion matrix; confusion matrix entropy]
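To make the effect of normalization concrete, here is a minimal sketch of row normalization. It assumes that rows correspond to the true class and that each row is divided by its total; the counts are hypothetical, and Ludwig's exact normalization is not shown in this tutorial.

```python
# Hypothetical raw counts: rows = true class, columns = predicted class.
raw_matrix = [
    [990, 2],  # true non-fraud: predicted non-fraud, predicted fraud
    [3, 5],    # true fraud:     predicted non-fraud, predicted fraud
]


def normalize_rows(matrix):
    """Divide each row by its sum so that every non-empty row adds up to 1."""
    normalized = []
    for row in matrix:
        row_total = sum(row)
        normalized.append(
            [count / row_total if row_total else 0.0 for count in row]
        )
    return normalized


normalized_matrix = normalize_rows(raw_matrix)
for row in normalized_matrix:
    print([round(value, 4) for value in row])
```

Under this assumption, the diagonal entries are the per-class recall, which keeps the rare fraud class readable next to the dominant non-fraud class.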

ROC curve

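We can also create an ROC curve, which plots the true positive rate against the false positive rate at different classification thresholds, with the following code: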

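# 'Class_mZFLky' is the preprocessed Class column as named in this run; the suffix may differ in yours (check test.to_df().columns).
visualize.roc_curves(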
    [predictions['Class_probabilities']],
    test.to_df()['Class_mZFLky'],
    test.to_df(),
    'Class_mZFLky',
    '1',
    model_names=["Credit Card Fraud"],
    output_directory='visualization',
    file_format='png'
)

[Figure: ROC curve]
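To show what the curve and the ROC AUC metric represent, the following self-contained sketch computes ROC points and the area under them with the trapezoidal rule. It is a simplified illustration, not Ludwig's implementation; the labels and scores are made-up toy values, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points from sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example that shares this score.
        last_of_tie = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: 1 = fraud, 0 = non-fraud, with made-up predicted probabilities.
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.20, 0.10, 0.05]

toy_points = roc_points(toy_labels, toy_scores)
toy_auc = area_under_curve(toy_points)
print(f"toy ROC AUC: {toy_auc:.4f}")
```

The area equals the fraction of fraud/non-fraud pairs in which the fraudulent example receives the higher score, which is why ROC AUC does not depend on any single classification threshold.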

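We hope these visualizations help you understand how our GBM model performs at detecting credit card fraud. For more information on the visualization options available in Ludwig, see the documentation.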

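Thank you for following this tutorial on training a GBM model on the credit card fraud dataset. We hope it was useful and gave you a better understanding of how to use GBM models in your own machine learning projects.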

If you have any questions or feedback, feel free to reach out to our community!