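⇅ Number Features

Preprocessing

Number features are transformed directly into a float-valued vector of length n (where n is the size of the dataset) and added to the HDF5 file under a key that reflects the name of the column in the dataset. The JSON metadata file holds no additional information about them.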

preprocessing:
    missing_value_strategy: fill_with_const
    normalization: zscore
    outlier_strategy: null
    fill_value: 0.0
    outlier_threshold: 3.0

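Parameters

missing_value_strategy (default: fill_with_const): The strategy to follow when a number column contains a missing value. Options: fill_with_const (fill with a constant), fill_with_mode (fill with the mode), bfill (backward fill), ffill (forward fill), drop_row (drop the row), fill_with_mean (fill with the mean). See Missing Value Strategy for details.
normalization (default: zscore): The normalization strategy to use for this number feature. If the value is null, no normalization is performed. Options: zscore (z-score standardization), minmax (min-max scaling), log1p (log1p transform), iq (interquartile range normalization), null (no normalization). See Normalization for details.
outlier_strategy (default: null): Determines how outliers in the dataset are handled. In most cases, replacing outliers with the column mean (fill_with_mean) is sufficient, but in others the outliers may be damaging enough to merit dropping the entire row (drop_row). In some cases the best way to handle outliers is to leave them in the data, which is the behavior when this parameter is left as null. Options: fill_with_const (fill with a constant), fill_with_mode (fill with the mode), bfill (backward fill), ffill (forward fill), drop_row (drop the row), fill_with_mean (fill with the mean), null (leave outliers in place).
fill_value (default: 0.0): The value that replaces missing values when missing_value_strategy is fill_with_const.
outlier_threshold (default: 3.0): The number of standard deviations from the mean beyond which a value is considered an outlier. The 3-sigma rule in statistics says that when data is normally distributed, about 95% of it lies within two standard deviations of the mean and more than 99% lies within three (see the 68–95–99.7 rule). A value farther from the mean than that is therefore very likely an outlier, and it may distort learning by influencing the model disproportionately.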
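
The interaction between outlier_threshold and outlier_strategy can be illustrated with a minimal sketch in plain Python. It is not Ludwig's implementation: the helper names `find_outliers` and `fill_outliers_with_mean` are hypothetical, and the choices of the population standard deviation and of a replacement mean computed over the whole column (outliers included) are assumptions made for illustration.

```python
import statistics


def find_outliers(values, outlier_threshold=3.0):
    """Return the indices of values lying more than `outlier_threshold`
    standard deviations from the mean (hypothetical helper)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std
    if std == 0:
        return []  # a constant column has no outliers
    return [
        i for i, v in enumerate(values)
        if abs(v - mean) / std > outlier_threshold
    ]


def fill_outliers_with_mean(values, outlier_threshold=3.0):
    """Sketch of a fill_with_mean-style strategy: replace every flagged
    value with the column mean (assumption: mean over all values)."""
    flagged = set(find_outliers(values, outlier_threshold))
    mean = statistics.fmean(values)
    return [mean if i in flagged else v for i, v in enumerate(values)]


# Twenty ordinary values and one extreme value.
column = [10.0] * 20 + [1000.0]
outlier_indices = find_outliers(column)        # only the last index is flagged
cleaned = fill_outliers_with_mean(column)
```

With the default threshold of 3.0, only the extreme value is flagged; leaving outlier_strategy as null corresponds to skipping the replacement step entirely.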

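Preprocessing parameters can also be defined once and applied to all number input features using the Type-Global Preprocessing section.

Normalization

The techniques available for normalizing number feature types.

Options

null: No normalization is performed.
zscore: The mean and standard deviation are computed, and values are shifted and scaled to have zero mean and a standard deviation of 1.
minmax: The minimum is subtracted from each value, and the result is divided by the difference between the maximum and the minimum.
log1p: The value returned is the natural log of 1 plus the original value. Note: log1p is intended only for positive values; mathematically, ln(1 + x) is undefined for x ≤ -1.
iq: The median is subtracted from each value, and the result is divided by the interquartile range (IQR), i.e. the 75th percentile value minus the 25th percentile value. The resulting data has a median of zero and an interquartile range of 1. This is useful if your feature has large outliers, since the normalization won't be skewed by those values.

The best normalization technique to use depends on the distribution of your data, but zscore is a good place to start in many cases.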
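
The four options can be written out in a few lines of standard-library Python. This is a sketch of the formulas described above, not Ludwig's own code: the function names mirror the option names for readability, the use of the population standard deviation and of the "inclusive" quantile method are assumptions, and constant columns (which would divide by zero) are not handled.

```python
import math
import statistics


def zscore(values):
    """Shift and scale to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std
    return [(v - mean) / std for v in values]


def minmax(values):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log1p(values):
    """Natural log of 1 plus each value."""
    return [math.log1p(v) for v in values]


def iq(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# A column with one large outlier.
column = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled_by_iq = iq(column)          # the bulk of the data stays within [-1, 1]
scaled_by_minmax = minmax(column)  # the bulk is squeezed close to 0
```

On this column the outlier compresses the first four values into a narrow band near 0 under minmax, while under iq they remain evenly spread around the median, which is the robustness property noted for iq.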

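Input Features

Number features have two encoders. One encoder (passthrough) returns the raw numerical values coming from the input placeholders as its output. Inputs are of size b while outputs are of size b x 1, where b is the batch size. The other encoder (dense) passes the raw numerical values through fully connected layers. In this case, inputs of size b are transformed to size b x h, where h is the output size of the fully connected stack.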
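
The shape behavior of the two encoders can be seen in a small plain-Python sketch. It only illustrates the b to b x 1 and b to b x h mappings described above and is not how Ludwig implements them: `passthrough_encoder` and `dense_encoder` are hypothetical names, the dense sketch uses a single layer with ReLU, and the weights are random placeholders rather than trained parameters.

```python
import random


def passthrough_encoder(batch):
    """Turn a batch of b numbers into b rows of one value each (b x 1)."""
    return [[float(x)] for x in batch]


def dense_encoder(batch, weights, biases):
    """One fully connected layer with ReLU: b numbers -> b rows of h values.

    Each input is a scalar, so the weight matrix has shape 1 x h and is
    stored here as a flat list of h weights.
    """
    return [
        [max(0.0, w * x + c) for w, c in zip(weights, biases)]
        for x in batch
    ]


rng = random.Random(0)
h = 4                                              # illustrative output size
weights = [rng.uniform(-1.0, 1.0) for _ in range(h)]
biases = [0.0] * h                                 # zeros, like the default bias_initializer

batch = [0.5, -1.2, 3.0]                           # b = 3
raw = passthrough_encoder(batch)                   # 3 rows x 1 column
hidden = dense_encoder(batch, weights, biases)     # 3 rows x 4 columns
```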

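The encoder parameters specified at the feature level are:

tied (default null): The name of the input feature to tie the weights of the encoder with. It must be the name of a feature of the same type and with the same encoder parameters.

Example number feature entry in the input features list: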

name: number_column_name
type: number
tied: null
encoder: 
    type: dense

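The available encoder parameters:

type (default passthrough): The possible values are passthrough and dense. passthrough outputs the raw numerical values unaltered. dense passes the raw numerical values through a stack of fully connected layers.

Encoder type and encoder parameters can also be defined once and applied to all number input features using the Type-Global Encoder section.

Encoders

Passthrough Encoder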

encoder:
    type: passthrough

The passthrough encoder has no additional parameters.

Dense Encoder

encoder:
    type: dense
    dropout: 0.0
    output_size: 256
    norm: null
    num_layers: 1
    activation: relu
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    fc_layers: null

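Parameters

dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout rate is the probability that an element is zeroed out (0.0 means no dropout).
output_size (default: 256): Size of the output of the feature.
norm (default: null): Default normalization applied at the beginning of fully connected layers. Options: batch (batch normalization), layer (layer normalization), ghost (ghost normalization), null (no normalization). See Normalization for details.
num_layers (default: 1): Number of stacked fully connected layers to apply. Increasing the number of layers adds capacity to the model, enabling it to learn more complex feature interactions.
activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
norm_params (default: null): Default parameters passed to the norm module.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers, and the content of each dictionary determines the parameters of a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from a dictionary, the default provided as a standalone parameter is used instead.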
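
The fallback rule for fc_layers (a per-layer value wins, otherwise the standalone parameter applies) can be sketched as a dictionary merge. The helper `resolve_fc_layers` is hypothetical and written only to illustrate the rule as described above; it assumes that, when fc_layers is null, num_layers identical layers are built from the standalone parameters.

```python
PER_LAYER_KEYS = [
    "activation", "dropout", "norm", "norm_params",
    "output_size", "use_bias", "bias_initializer", "weights_initializer",
]


def resolve_fc_layers(encoder_config):
    """Return one fully specified parameter dict per layer (hypothetical helper)."""
    # Standalone parameters act as defaults for every layer.
    defaults = {key: encoder_config.get(key) for key in PER_LAYER_KEYS}

    fc_layers = encoder_config.get("fc_layers")
    if fc_layers is None:
        # Assumption: without fc_layers, num_layers default layers are used.
        fc_layers = [{} for _ in range(encoder_config.get("num_layers", 1))]

    # Values given in a layer's own dictionary override the defaults.
    return [{**defaults, **layer} for layer in fc_layers]


encoder_config = {
    "type": "dense",
    "output_size": 256,
    "activation": "relu",
    "dropout": 0.0,
    "fc_layers": [
        {"output_size": 128, "dropout": 0.1},  # overrides two defaults
        {"output_size": 64},                   # inherits dropout 0.0
    ],
}
layers = resolve_fc_layers(encoder_config)
```

Here the list has two entries, so two layers are stacked; the first uses its own dropout of 0.1, and the second falls back to the standalone dropout of 0.0.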

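Output Features

Number features can be used when a regression needs to be performed. Only one decoder is available for number features: a (potentially empty) stack of fully connected layers, followed by a projection to a single number.

Example number output feature using default parameters: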

name: number_column_name
type: number
reduce_input: sum
dependencies: []
reduce_dependencies: sum
loss:
    type: mean_squared_error
decoder:
    type: regressor

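Parameters

reduce_input (default sum): Defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (the second if you count the batch dimension). Available values: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension).
dependencies (default []): The output features this one depends on. For a detailed explanation, see Output Feature Dependencies.
reduce_dependencies (default sum): Defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (the second if you count the batch dimension). Available values: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension).
loss (default {type: mean_squared_error}): A dictionary containing the loss type. Options: mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, root_mean_squared_error, root_mean_squared_percentage_error, huber. See Loss Functions for details.
decoder (default: {"type": "regressor"}): The decoder for the desired task. Options: regressor. See Decoders for details.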
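
The reduction modes shared by reduce_input and reduce_dependencies can be sketched for a single example whose input is a matrix (a list of vectors along the first dimension). The function `reduce_first_dim` is a hypothetical illustration of the modes listed above, not Ludwig's implementation, and it ignores the batch dimension.

```python
def reduce_first_dim(rows, mode="sum"):
    """Reduce a list of equal-length vectors along the first dimension."""
    columns = list(zip(*rows))  # group values by position within each vector
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode in ("mean", "avg"):
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "concat":
        # Join the vectors end to end into one longer vector.
        return [value for row in rows for value in row]
    if mode == "last":
        return list(rows[-1])
    raise ValueError(f"unknown reduce mode: {mode}")


# Three vectors of size 2 along the first dimension.
matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
summed = reduce_first_dim(matrix, "sum")       # [9.0, 6.0]
joined = reduce_first_dim(matrix, "concat")    # six values
```

All modes except concat produce a vector the same size as each row; concat produces a vector whose size grows with the length of the first dimension.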

Decoders

Regressor

decoder:
    type: regressor
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    fc_activation: relu
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null
    use_bias: true
    weights_initializer: xavier_uniform
    bias_initializer: zeros

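Parameters

num_fc_layers (default: 0): Number of fully connected layers if fc_layers is not specified. Increasing the number of layers adds capacity to the model, enabling it to learn more complex feature interactions.
fc_output_size (default: 256): Output size of the fully connected stack.
fc_norm (default: null): Default normalization applied at the beginning of fully connected layers. Options: batch (batch normalization), layer (layer normalization), ghost (ghost normalization), null (no normalization). See Normalization for details.
fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout rate is the probability that an element is zeroed out (0.0 means no dropout).
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers, and the content of each dictionary determines the parameters of a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from a dictionary, the default provided as a standalone parameter is used instead.
fc_use_bias (default: true): Whether the layers in the fc_stack use a bias vector. Options: true, false.
fc_weights_initializer (default: xavier_uniform): The weights initializer to use for the layers in the fc_stack.
fc_bias_initializer (default: zeros): The bias initializer to use for the layers in the fc_stack.
fc_norm_params (default: null): Default parameters passed to the norm module.
use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.

Decoder type and decoder parameters can also be defined once and applied to all number output features using the Type-Global Decoder section.

Loss Functions

Mean Squared Error (MSE)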

loss:
    type: mean_squared_error
    weight: 1.0

Parameters

weight (default: 1.0): Weight of the loss.

Mean Absolute Error (MAE)

loss:
    type: mean_absolute_error
    weight: 1.0

Parameters

weight (default: 1.0): Weight of the loss.

Mean Absolute Percentage Error (MAPE)

loss:
    type: mean_absolute_percentage_error
    weight: 1.0

Parameters

weight (default: 1.0): Weight of the loss.

Root Mean Squared Error (RMSE)

loss:
    type: root_mean_squared_error
    weight: 1.0

Parameters

weight (default: 1.0): Weight of the loss.

Root Mean Squared Percentage Error (RMSPE)

loss:
    type: root_mean_squared_percentage_error
    weight: 1.0

Parameters

weight (default: 1.0): Weight of the loss.

Huber Loss

loss:
    type: huber
    weight: 1.0
    delta: 1.0

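Parameters

weight (default: 1.0): Weight of the loss.
delta (default: 1.0): Threshold at which the loss switches between delta-scaled L1 and L2 loss.

Loss functions and their parameters can also be defined once and applied to all number output features using the Type-Global Loss section.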
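
For reference, the six losses can be written out using their standard textbook definitions. This is a sketch in plain Python rather than Ludwig's code: the function names are illustrative, the percentage errors are returned as fractions (not multiplied by 100) by assumption, targets equal to zero are not handled, and the Huber form follows the common definition that is quadratic up to delta and linear beyond it.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction of the target."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def rmspe(y_true, y_pred):
    """Root mean squared percentage error, as a fraction of the target."""
    squared = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def huber(y_true, y_pred, delta=1.0):
    """Quadratic for errors up to delta, delta-scaled linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        error = abs(t - p)
        if error <= delta:
            total += 0.5 * error ** 2
        else:
            total += delta * (error - 0.5 * delta)
    return total / len(y_true)


targets = [2.0, 4.0]
predictions = [1.0, 6.0]
huber_value = huber(targets, predictions, delta=1.0)
```

With these targets and predictions the absolute errors are 1 and 2, so the first falls in the quadratic region of the Huber loss and the second in the linear region.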

Metrics

The metrics that are calculated every epoch and are available for number features are mean_squared_error, mean_absolute_error, root_mean_squared_error, root_mean_squared_percentage_error and the loss itself. You can set any of them as the validation_metric in the training section of the configuration if you set validation_field to the name of a number feature.