⇅ Text Features

Preprocessing

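Text features are an extension of sequence features. Text inputs are processed by a tokenizer, which maps the raw text input into a sequence of tokens. Each unique token is assigned an integer ID. Using this mapping, each text string is converted first to a sequence of tokens and then to a sequence of integers.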

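The list of tokens and their integer representations (the vocabulary) is stored in the metadata of the model. For a text output feature, the same mapping is used to post-process predictions back into text.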
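The mapping described above can be sketched in a few lines of plain Python. This is only an illustration, not Ludwig's implementation: the function names `build_vocab`, `encode` and `decode` are hypothetical, whitespace splitting stands in for the real tokenizer, and placing the special symbols at the first IDs is an assumption of this sketch.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(texts, most_common=20000):
    """Map each token to an integer ID, keeping only the most frequent tokens."""
    counts = Counter(tok for text in texts for tok in text.split())
    # Assumption of this sketch: special symbols take the first IDs.
    idx2str = [PAD, UNK] + [tok for tok, _ in counts.most_common(most_common)]
    return {tok: i for i, tok in enumerate(idx2str)}

def encode(text, str2idx):
    """Text -> token sequence -> integer sequence; unseen tokens map to <UNK>."""
    return [str2idx.get(tok, str2idx[UNK]) for tok in text.split()]

def decode(ids, str2idx):
    """Integer sequence -> text, using the same mapping in reverse."""
    idx2str = {i: tok for tok, i in str2idx.items()}
    return " ".join(idx2str[i] for i in ids)

vocab = build_vocab(["the cat sat", "the dog sat down"])
ids = encode("the bird sat", vocab)   # "bird" was never seen, so it becomes <UNK>
print(ids, "->", decode(ids, vocab))
```

Because encoding and decoding share one dictionary, the same stored vocabulary can serve both input preprocessing and output post-processing.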

preprocessing:
    tokenizer: space_punct
    max_sequence_length: 256
    missing_value_strategy: fill_with_const
    most_common: 20000
    lowercase: false
    fill_value: <UNK>
    ngram_size: 2
    padding_symbol: <PAD>
    unknown_symbol: <UNK>
    padding: right
    cache_encoder_embeddings: false
    vocab_file: null
    sequence_length: null
    prompt:
        template: null
        task: null
        retrieval:
            type: null
            index_name: null
            model_name: null
            k: 0

Parameters

tokenizer (default: space_punct): Defines how to map the raw string content of the dataset column to a sequence of elements. Options: space, space_punct, ngram, characters, underscore, comma, untokenized, stripped, english_tokenize, english_tokenize_filter, english_tokenize_remove_stopwords, english_lemmatize, english_lemmatize_filter, english_lemmatize_remove_stopwords, italian_tokenize, italian_tokenize_filter, italian_tokenize_remove_stopwords, italian_lemmatize, italian_lemmatize_filter, italian_lemmatize_remove_stopwords, spanish_tokenize, spanish_tokenize_filter, spanish_tokenize_remove_stopwords, spanish_lemmatize, spanish_lemmatize_filter, spanish_lemmatize_remove_stopwords, german_tokenize, german_tokenize_filter, german_tokenize_remove_stopwords, german_lemmatize, german_lemmatize_filter, german_lemmatize_remove_stopwords, french_tokenize, french_tokenize_filter, french_tokenize_remove_stopwords, french_lemmatize, french_lemmatize_filter, french_lemmatize_remove_stopwords, portuguese_tokenize, portuguese_tokenize_filter, portuguese_tokenize_remove_stopwords, portuguese_lemmatize, portuguese_lemmatize_filter, portuguese_lemmatize_remove_stopwords, dutch_tokenize, dutch_tokenize_filter, dutch_tokenize_remove_stopwords, dutch_lemmatize, dutch_lemmatize_filter, dutch_lemmatize_remove_stopwords, greek_tokenize, greek_tokenize_filter, greek_tokenize_remove_stopwords, greek_lemmatize, greek_lemmatize_filter, greek_lemmatize_remove_stopwords, norwegian_tokenize, norwegian_tokenize_filter, norwegian_tokenize_remove_stopwords, norwegian_lemmatize, norwegian_lemmatize_filter, norwegian_lemmatize_remove_stopwords, lithuanian_tokenize, lithuanian_tokenize_filter, lithuanian_tokenize_remove_stopwords, lithuanian_lemmatize, lithuanian_lemmatize_filter, lithuanian_lemmatize_remove_stopwords, danish_tokenize, danish_tokenize_filter, danish_tokenize_remove_stopwords, danish_lemmatize, danish_lemmatize_filter, danish_lemmatize_remove_stopwords, polish_tokenize, polish_tokenize_filter, polish_tokenize_remove_stopwords, polish_lemmatize, polish_lemmatize_filter, polish_lemmatize_remove_stopwords, romanian_tokenize, romanian_tokenize_filter, romanian_tokenize_remove_stopwords, romanian_lemmatize, romanian_lemmatize_filter, romanian_lemmatize_remove_stopwords, japanese_tokenize, japanese_tokenize_filter, japanese_tokenize_remove_stopwords, japanese_lemmatize, japanese_lemmatize_filter, japanese_lemmatize_remove_stopwords, chinese_tokenize, chinese_tokenize_filter, chinese_tokenize_remove_stopwords, chinese_lemmatize, chinese_lemmatize_filter, chinese_lemmatize_remove_stopwords, multi_tokenize, multi_tokenize_filter, multi_tokenize_remove_stopwords, multi_lemmatize, multi_lemmatize_filter, multi_lemmatize_remove_stopwords, sentencepiece, clip, gpt2bpe, bert, hf_tokenizer.
max_sequence_length (default: 256): The maximum length (number of tokens) of the sequence. Sequences longer than this value are truncated. Useful as a stopgap measure when sequence_length is set to None. If None, the maximum sequence length is inferred from the training dataset.
missing_value_strategy (default: fill_with_const): The strategy to follow when there is a missing value in a text column. Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row. See Missing Value Strategy for details.
most_common (default: 20000): The maximum number of most common tokens in the vocabulary. If the data contains more than this amount, the most infrequent symbols are treated as unknown.
lowercase (default: false): If true, converts the string to lowercase before tokenizing. Options: true, false.
fill_value (default: <UNK>): The value to replace missing values with when missing_value_strategy is fill_with_const.
ngram_size (default: 2): The size of the ngram when using the ngram tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).
padding_symbol (default: <PAD>): The string used as the padding symbol for sequence features. Ignored for features using huggingface encoders, which have their own vocabulary.
unknown_symbol (default: <UNK>): The string used as the unknown symbol for sequence features. Ignored for features using huggingface encoders, which have their own vocabulary.
padding (default: right): The direction of the padding. Options: left, right.
cache_encoder_embeddings (default: false): For pretrained encoders, compute the encoder embeddings during preprocessing, which speeds up training considerably. Only supported when encoder.trainable=false. Options: true, false.
vocab_file (default: null): Filepath string to a UTF-8 encoded file containing the sequence's vocabulary. On each line, the first string up to \t or \n is considered a word.
sequence_length (default: null): The desired length (number of tokens) of the sequence. Sequences longer than this value are truncated and sequences shorter than this value are padded. If None, the sequence length is inferred from the training dataset.
prompt:
prompt.template (default: null): The template to use for the prompt. Must contain at least one of the columns from the input dataset or __sample__ as a variable surrounded by curly brackets {} to indicate where to insert the current feature. Multiple columns can be inserted, e.g.: The {color} {animal} jumped over the {size} {object}, where every term in curly brackets is a column in the dataset. If a task is specified, the template must also contain the __task__ variable. If retrieval is specified, the template must also contain the __context__ variable. If no template is provided, a default is used based on the retrieval settings, and a task must be set in the config.
prompt.task (default: null): The task to use for the prompt. Required if template is not set.
prompt.retrieval (default: {"type": null})
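To make the interaction between sequence_length and padding concrete, here is a minimal sketch in plain Python. The helper name `pad_or_truncate` is hypothetical, and two details are assumptions of this sketch rather than documented behavior: truncation keeps the leading tokens, and the padding symbol has ID 0.

```python
def pad_or_truncate(ids, sequence_length, pad_id=0, padding="right"):
    """Force a list of token IDs to exactly `sequence_length` entries."""
    ids = ids[:sequence_length]                      # truncate sequences that are too long
    fill = [pad_id] * (sequence_length - len(ids))   # padding needed for short sequences
    return ids + fill if padding == "right" else fill + ids

print(pad_or_truncate([5, 6, 7], 5))                   # [5, 6, 7, 0, 0]
print(pad_or_truncate([5, 6, 7], 5, padding="left"))   # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))          # [1, 2, 3, 4]
```

After this step every sequence in a batch has the same length, which is what allows a batch to be stacked into a single b x s tensor for the encoders described below.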

Preprocessing parameters can also be defined once and applied to all text input features using the Type-Global Preprocessing section.

Note

If a text feature's encoder specifies a huggingface model, then the tokenizer for that model will be used automatically.

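Input Features

The encoder parameters specified at the feature level are:

tied (default: null): Name of another input feature to tie the encoder weights with. It must be the name of a feature of the same type with the same encoder parameters.

Example text feature entry in the input features list: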

name: text_column_name
type: text
tied: null
encoder:
    type: bert
    trainable: true

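Parameters

type (default: parallel_cnn): Encoder to use for the input text feature. The available encoders include the encoders used for sequence features as well as pre-trained text encoders from the huggingface transformers library: albert, auto_transformer, bert, camembert, ctrl, distilbert, electra, flaubert, gpt, gpt2, longformer, roberta, t5, mt5, transformer_xl, xlm, xlmroberta, xlnet.

Encoder type and encoder parameters can also be defined once and applied to all text input features using the Type-Global Encoder section.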

Encoders

Embed Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["Aggregation\n Reduce\n Operation"];
  C --> ...;

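The embed encoder simply maps each token in the input sequence to an embedding, creating a b x s x h tensor where b is the batch size, s is the length of the sequence and h is the embedding size. The tensor is reduced along the s dimension to obtain a single vector of size h for each element of the batch. If you want to output the full b x s x h tensor, you can specify reduce_output: null.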
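The lookup-then-reduce behavior can be illustrated with nested Python lists standing in for tensors. This is a sketch under stated assumptions, not the encoder's actual code: `embed_and_reduce` is a hypothetical helper, the embedding table is filled with random values, and only the sum and mean reductions are shown.

```python
import random

def embed_and_reduce(batch, table, reduce_output="sum"):
    """batch: b x s token IDs; table: vocab_size x h embedding matrix."""
    embedded = [[table[tok] for tok in seq] for seq in batch]   # b x s x h
    if reduce_output is None:
        return embedded                                         # keep the full tensor
    reduced = []
    for seq in embedded:
        vec = [sum(col) for col in zip(*seq)]                   # collapse the s dimension
        if reduce_output == "mean":
            vec = [v / len(seq) for v in vec]
        reduced.append(vec)
    return reduced                                              # b x h

random.seed(0)
vocab_size, h = 70, 4
table = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(vocab_size)]
batch = [[12, 7, 43, 65, 23, 4, 1], [3, 3, 9, 0, 0, 0, 0]]      # b = 2, s = 7

out = embed_and_reduce(batch, table)                            # shape 2 x 4
full = embed_and_reduce(batch, table, reduce_output=None)       # shape 2 x 7 x 4
print(len(out), len(out[0]), "|", len(full), len(full[0]), len(full[0][0]))
```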

encoder:
    type: embed
    dropout: 0.0
    embedding_size: 256
    representation: dense
    weights_initializer: uniform
    reduce_output: sum
    embeddings_on_cpu: false
    embeddings_trainable: true
    pretrained_embeddings: null

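Parameters

dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized as one-hot encodings. Options: dense, sparse.
weights_initializer (default: uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
embeddings_on_cpu (default: false): Whether to keep the embedding matrix in regular memory and have the CPU perform the lookups. By default, embedding matrices are stored in GPU memory when a GPU is used, since that allows faster access, but in some cases the matrix may be too large to fit. Setting this to true places the embedding matrix in regular memory and uses the CPU for embedding lookup, which slows the process slightly because of data transfer between CPU and GPU memory. Options: true, false.
embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. This can be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.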

Parallel CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  C --> D1["1D Conv\n Width 2"] --> E1["Pool"];
  C --> D2["1D Conv\n Width 3"] --> E2["Pool"];
  C --> D3["1D Conv\n Width 4"] --> E3["Pool"];
  C --> D4["1D Conv\n Width 5"] --> E4["Pool"];
  E1 --> F["Concat"] --> G["Fully\n Connected\n Layers"] --> H["..."];
  E2 --> F;
  E3 --> F;
  E4 --> F;

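The parallel CNN encoder is inspired by Yoon Kim's Convolutional Neural Network for Sentence Classification. It works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passing the embeddings through a number of parallel 1D convolutional layers with different filter sizes (by default 4 layers with filter sizes 2, 3, 4 and 5), followed by max pooling and concatenation. The single vector that concatenates the outputs of the parallel convolutional layers is then passed through a stack of fully connected layers and returned as a b x h tensor, where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, you can specify reduce_output: null.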
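The convolve, pool, concatenate pattern can be shown with a deliberately tiny example. The sketch below is not the encoder's implementation: each branch has a single filter whose weights are all ones (a real layer learns num_filters filters), no padding is applied, and the names `conv1d` and `parallel_cnn` are hypothetical.

```python
def conv1d(seq, width):
    """One toy filter with all-ones weights slid over an s x h sequence (no padding)."""
    return [
        sum(sum(vec) for vec in seq[i:i + width])
        for i in range(len(seq) - width + 1)
    ]

def parallel_cnn(seq, filter_sizes=(2, 3, 4, 5)):
    """Run one convolution per filter size, max-pool each over the sequence, concatenate."""
    return [max(conv1d(seq, width)) for width in filter_sizes]

# One sequence of s = 6 embeddings with h = 2.
seq = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
features = parallel_cnn(seq)
print(features)   # [6.0, 7.0, 8.0, 10.0]
```

Max pooling removes the sequence dimension, so the concatenated vector has a fixed length (number of branches times number of filters) regardless of s.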

encoder:
    type: parallel_cnn
    dropout: 0.0
    embedding_size: 256
    num_conv_layers: null
    output_size: 256
    activation: relu
    filter_size: 3
    norm: null
    representation: dense
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    conv_layers: null
    pool_function: max
    pool_size: null
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    pretrained_embeddings: null
    num_filters: 256

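Parameters

dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
num_conv_layers (default: null): The number of parallel convolutional layers when conv_layers is null.
output_size (default: 256): The default output_size to use for each layer.
activation (default: relu): The default activation function to use for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
filter_size (default: 3): Size of the 1D convolutional filter, i.e. the width of the filter.
norm (default: null): The default normalization to use for each layer. Options: batch, layer, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized as one-hot encodings. Options: dense, sparse.
use_bias (default: true): Whether to use a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to keep the embedding matrix in regular memory and have the CPU perform the lookups. By default, embedding matrices are stored in GPU memory when a GPU is used, since that allows faster access, but in some cases the matrix may be too large to fit. Setting this to true places the embedding matrix in regular memory and uses the CPU for embedding lookup, which slows the process slightly because of data transfer between CPU and GPU memory. Options: true, false.
embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. This can be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of parallel convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default specified as an encoder parameter is used instead. If both conv_layers and num_conv_layers are null, a default list is assigned to conv_layers with the value [{filter_size: 2}, {filter_size: 3}, {filter_size: 4}, {filter_size: 5}], matching the four parallel layers described above.
pool_function (default: max): Pooling function to use. max selects the maximum value; any of average, avg, or mean computes the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
pool_size (default: null): The default pool_size to use for each layer that does not specify one in conv_layers. It indicates the size of the max pooling performed along the s sequence dimension after the convolution operation.
norm_params (default: null): Parameters used if norm is either batch or layer.
num_fc_layers (default: null): Number of parallel fully connected layers to use.
fc_layers (default: null): A list of dictionaries containing the parameters for each fully connected layer.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.
num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1D convolution.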

Stacked CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["1D Conv Layers\n Different Widths"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

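The stacked CNN encoder is inspired by Xiang Zhang et al.'s Character-level Convolutional Networks for Text Classification. It works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passing the embeddings through a stack of 1D convolutional layers with different filter sizes (by default 6 layers with filter sizes 7, 7, 3, 3, 3 and 3), followed by an optional final pool and a flatten operation. The single flattened vector is then passed through a stack of fully connected layers and returned as a b x h tensor, where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, set the pool_size of all your conv_layers to null and specify reduce_output: null. If pool_size has a value other than null and reduce_output is null, the returned tensor has shape b x s' x h, where s' is the width of the output of the last convolutional layer.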
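How pooling shrinks s into s' can be estimated with simple arithmetic. The sketch below rests on assumptions that are not stated in the text: convolutions use stride 1 with same padding (so they preserve the width), pooling uses same padding, and the pooling stride equals pool_size when pool_strides is null. The helper name `output_width` is hypothetical, and the layer list is the default quoted in the conv_layers parameter description.

```python
import math

# Default layer list as quoted in the conv_layers parameter description.
DEFAULT_CONV_LAYERS = [
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": 3},
]

def output_width(s, conv_layers):
    """Estimate s' after the stack, under the assumptions listed in the lead-in."""
    for layer in conv_layers:
        pool_size = layer.get("pool_size")
        if pool_size is not None:
            s = math.ceil(s / pool_size)   # 'same'-padded pooling with stride == pool_size
    return s

print(output_width(256, DEFAULT_CONV_LAYERS))              # 10 under these assumptions
print(output_width(256, [{"pool_size": None}] * 6))        # 256: no pooling keeps s intact
```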

encoder:
    type: stacked_cnn
    dropout: 0.0
    num_conv_layers: null
    embedding_size: 256
    output_size: 256
    activation: relu
    filter_size: 3
    strides: 1
    norm: null
    representation: dense
    conv_layers: null
    pool_function: max
    pool_size: null
    dilation_rate: 1
    pool_strides: null
    pool_padding: same
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    num_filters: 256
    padding: same
    pretrained_embeddings: null

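Parameters

dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256): The default output_size to use for each layer.
activation (default: relu): The default activation function to use for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
filter_size (default: 3): Size of the 1D convolutional filter, i.e. the width of the filter.
strides (default: 1): Stride length of the convolution.
norm (default: null): The default normalization to use for each layer. Options: batch, layer, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized as one-hot encodings. Options: dense, sparse.
conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default specified as an encoder parameter is used instead. If both conv_layers and num_conv_layers are null, a default list is assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
pool_function (default: max): Pooling function to use. max selects the maximum value; any of average, avg, or mean computes the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
pool_size (default: null): The default pool_size to use for each layer that does not specify one in conv_layers. It indicates the size of the max pooling performed along the s sequence dimension after the convolution operation.
dilation_rate (default: 1): Dilation rate to use for dilated convolution.
pool_strides (default: null): Factor to scale down.
pool_padding (default: same): Padding to use. Options: valid, same.
use_bias (default: true): Whether to use a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to keep the embedding matrix in regular memory and have the CPU perform the lookups. By default, embedding matrices are stored in GPU memory when a GPU is used, since that allows faster access, but in some cases the matrix may be too large to fit. Setting this to true places the embedding matrix in regular memory and uses the CPU for embedding lookup, which slows the process slightly because of data transfer between CPU and GPU memory. Options: true, false.
embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. This can be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
norm_params (default: null): Parameters used if norm is either batch or layer.
num_fc_layers (default: null): Number of parallel fully connected layers to use.
fc_layers (default: null): A list of dictionaries containing the parameters for each fully connected layer.
num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1D convolution.
padding (default: same): Padding to use. Options: valid, same.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.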

Stacked Parallel CNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> C["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  C --> D1["1D Conv\n Width 2"] --> E["Concat"];
  C --> D2["1D Conv\n Width 3"] --> E;
  C --> D3["1D Conv\n Width 4"] --> E;
  C --> D4["1D Conv\n Width 5"] --> E;
  E --> F["..."];
  F --> G1["1D Conv\n Width 2"] --> H["Concat"];
  F --> G2["1D Conv\n Width 3"] --> H;
  F --> G3["1D Conv\n Width 4"] --> H;
  F --> G4["1D Conv\n Width 5"] --> H;
  H --> I["Pool"] --> J["Fully\n Connected\n Layers"] --> K["..."];

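The stacked parallel CNN encoder is a combination of the parallel CNN and the stacked CNN encoders, where each layer of the stack is composed of parallel convolutional layers. It works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passing the embeddings through a stack of several parallel 1D convolutional layers with different filter sizes, followed by an optional final pool and a flatten operation. The single flattened vector is then passed through a stack of fully connected layers and returned as a b x h tensor, where h is the output size of the last fully connected layer. If you want to output the full b x s x h tensor, you can specify reduce_output: null.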

encoder:
    type: stacked_parallel_cnn
    dropout: 0.0
    embedding_size: 256
    output_size: 256
    activation: relu
    filter_size: 3
    norm: null
    representation: dense
    num_stacked_layers: null
    pool_function: max
    pool_size: null
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: sum
    norm_params: null
    num_fc_layers: null
    fc_layers: null
    stacked_layers: null
    num_filters: 256
    pretrained_embeddings: null

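Parameters

dropout (default: 0.0): Dropout rate applied to the embedding. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256): The default output_size to use for each layer.
activation (default: relu): The default activation function to use for each layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
filter_size (default: 3): Size of the 1D convolutional filter, i.e. the width of the filter.
norm (default: null): The default normalization to use for each layer. Options: batch, layer, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized as one-hot encodings. Options: dense, sparse.
num_stacked_layers (default: null): If stacked_layers is null, this is the number of elements in the stack of parallel convolutional layers.
pool_function (default: max): Pooling function to use. max selects the maximum value; any of average, avg, or mean computes the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
pool_size (default: null): The default pool_size to use for each layer that does not specify its own. It indicates the size of the max pooling performed along the s sequence dimension after the convolution operation.
use_bias (default: true): Whether to use a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to keep the embedding matrix in regular memory and have the CPU perform the lookups. By default, embedding matrices are stored in GPU memory when a GPU is used, since that allows faster access, but in some cases the matrix may be too large to fit. Setting this to true places the embedding matrix in regular memory and uses the CPU for embedding lookup, which slows the process slightly because of data transfer between CPU and GPU memory. Options: true, false.
embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. This can be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
reduce_output (default: sum): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
norm_params (default: null): Parameters used if norm is either batch or layer.
num_fc_layers (default: null): Number of parallel fully connected layers to use.
fc_layers (default: null): A list of dictionaries containing the parameters for each fully connected layer.
stacked_layers (default: null): A nested list of lists of dictionaries containing the parameters of the stack of parallel convolutional layers. The length of the list determines the number of stacked parallel convolutional layers, the length of each sub-list determines the number of parallel convolutional layers, and the content of each dictionary determines the parameters for a specific layer.
num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1D convolution.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.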

RNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["RNN Layers"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

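The RNN encoder works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passing the embeddings through a stack of recurrent layers (by default 1 layer), followed by a reduce operation that by default returns only the last output, although other reduce functions can be used. If you want to output the full b x s x h tensor, where h is the output size of the last RNN layer, you can specify reduce_output: null.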
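The recurrence and the default last reduction can be sketched with a plain tanh (Elman) cell for a single sequence. This is an illustration only: the function `rnn_encode`, the hand-picked weight matrices and the omission of bias terms are assumptions of this sketch, and lstm or gru cells add gating on top of the same loop.

```python
import math

def rnn_encode(embedded, w_in, w_rec, reduce_output="last"):
    """Run a tanh RNN over one s x e sequence of embeddings.

    Returns the last state (size h) or, if reduce_output is None, all s states.
    """
    state_size = len(w_rec)
    state = [0.0] * state_size
    outputs = []
    for x in embedded:
        # New state depends on the current input and on the previous state.
        state = [
            math.tanh(
                sum(w_in[j][k] * x[k] for k in range(len(x)))
                + sum(w_rec[j][k] * state[k] for k in range(state_size))
            )
            for j in range(state_size)
        ]
        outputs.append(state)
    return outputs if reduce_output is None else outputs[-1]

embedded = [[0.5, -0.5], [1.0, 0.0]]      # s = 2, embedding size 2
w_in = [[1.0, 0.0], [0.0, 1.0]]           # hypothetical input weights
w_rec = [[0.5, 0.0], [0.0, 0.5]]          # hypothetical recurrent weights

last = rnn_encode(embedded, w_in, w_rec)                      # vector of size h = 2
full = rnn_encode(embedded, w_in, w_rec, reduce_output=None)  # s x h
print(last, len(full))
```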

encoder:
    type: rnn
    dropout: 0.0
    cell_type: rnn
    num_layers: 1
    state_size: 256
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    recurrent_dropout: 0.0
    activation: tanh
    fc_activation: relu
    recurrent_activation: sigmoid
    representation: dense
    unit_forget_bias: true
    recurrent_initializer: orthogonal
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    bidirectional: false
    pretrained_embeddings: null

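Parameters

dropout (default: 0.0): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
cell_type (default: rnn): The type of recurrent cell to use. For a reference on the differences between the cells, see torch.nn Recurrent Layers. Options: rnn, lstm, gru.
num_layers (default: 1): The number of stacked recurrent layers.
state_size (default: 256): The size of the state of the RNN.
embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256): The default output_size to use for each layer.
norm (default: null): The default normalization to use for each layer. Options: batch, layer, ghost, null.
num_fc_layers (default: 0): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
activation (default: tanh): The default activation function. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized as one-hot encodings. Options: dense, sparse.
unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization. Options: true, false.
recurrent_initializer (default: orthogonal): The initializer for the recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
use_bias (default: true): Whether to use a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to keep the embedding matrix in regular memory and have the CPU perform the lookups. By default, embedding matrices are stored in GPU memory when a GPU is used, since that allows faster access, but in some cases the matrix may be too large to fit. Setting this to true places the embedding matrix in regular memory and uses the CPU for embedding lookup, which slows the process slightly because of data transfer between CPU and GPU memory. Options: true, false.
embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. This can be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
norm_params (default: null): Default parameters passed to the norm module.
fc_layers (default: null): A list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default provided as a standalone parameter is used instead.
bidirectional (default: false): If true, two recurrent networks perform encoding in the forward and backward directions and their outputs are concatenated. Options: true, false.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.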

CNN RNN Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C1["CNN Layers"];
  C1 --> C2["RNN Layers"];
  C2 --> D["Fully\n Connected\n Layers"];
  D --> ...;

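The cnnrnn encoder works by first mapping the input token sequence b x s (where b is the batch size and s is the length of the sequence) into a sequence of embeddings, then passing the embeddings through a stack of convolutional layers (by default 2), followed by a stack of recurrent layers (by default 1), and finally a reduce operation that by default returns only the last output, although other reduce functions can be used. If you want to output the full b x s x h tensor, where h is the output size of the last RNN layer, you can specify reduce_output: null.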

encoder:
    type: cnnrnn
    dropout: 0.0
    conv_dropout: 0.0
    cell_type: rnn
    num_conv_layers: null
    state_size: 256
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    recurrent_dropout: 0.0
    activation: tanh
    filter_size: 5
    strides: 1
    fc_activation: relu
    recurrent_activation: sigmoid
    conv_activation: relu
    representation: dense
    conv_layers: null
    pool_function: max
    pool_size: null
    dilation_rate: 1
    pool_strides: null
    pool_padding: same
    unit_forget_bias: true
    recurrent_initializer: orthogonal
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    num_filters: 256
    padding: same
    num_rec_layers: 1
    bidirectional: false
    pretrained_embeddings: null

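Parameters

dropout (default: 0.0): Dropout rate. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
conv_dropout (default: 0.0): The dropout rate for the convolutional layers.
cell_type (default: rnn): The type of recurrent cell to use. For a reference on the differences between the cells, see torch.nn Recurrent Layers. Options: rnn, lstm, gru.
num_conv_layers (default: null): The number of stacked convolutional layers when conv_layers is null.
state_size (default: 256): The size of the state of the RNN.
embedding_size (default: 256): The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size is the number of unique strings appearing in the training set input column plus the number of special tokens (<UNK>, <PAD>, <SOS>, <EOS>).
output_size (default: 256): The default output_size to use for each layer.
norm (default: null): The default normalization to use for each layer. Options: batch, layer, ghost, null.
num_fc_layers (default: 0): Number of parallel fully connected layers to use. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
recurrent_dropout (default: 0.0): The dropout rate for the recurrent state.
activation (default: tanh): The default activation function to use. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
filter_size (default: 5): Size of the 1D convolutional filter, i.e. the width of the filter.
strides (default: 1): Stride length of the convolution.
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
recurrent_activation (default: sigmoid): The activation function to use in the recurrent step. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
conv_activation (default: relu): The default activation function to use for each convolutional layer. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
representation (default: dense): Representation of the embedding. dense means the embeddings are initialized randomly, sparse means they are initialized as one-hot encodings. Options: dense, sparse.
conv_layers (default: null): A list of dictionaries containing the parameters of all the convolutional layers. The length of the list determines the number of stacked convolutional layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, num_filters, filter_size, strides, padding, dilation_rate, use_bias, pool_function, pool_padding, pool_size, pool_strides, bias_initializer, weights_initializer. If any of those values is missing from the dictionary, the default specified as an encoder parameter is used instead. If both conv_layers and num_conv_layers are null, a default list is assigned to conv_layers with the value [{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}].
pool_function (default: max): Pooling function to use. max selects the maximum value; any of average, avg, or mean computes the mean value. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
pool_size (default: null): The default pool_size to use for each layer that does not specify one in conv_layers. It indicates the size of the max pooling performed along the s sequence dimension after the convolution operation.
dilation_rate (default: 1): Dilation rate to use for dilated convolution.
pool_strides (default: null): Factor to scale down.
pool_padding (default: same): Padding to use. Options: valid, same.
unit_forget_bias (default: true): If true, add 1 to the bias of the forget gate at initialization. Options: true, false.
recurrent_initializer (default: orthogonal): The initializer for the recurrent matrix weights. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity.
use_bias (default: true): Whether to use a bias vector. Options: true, false.
bias_initializer (default: zeros): Initializer for the bias vector. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
weights_initializer (default: xavier_uniform): Initializer for the weight matrix. Options: uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity. Alternatively, it is possible to specify a dictionary with a key type that identifies the type of initializer and other keys for its parameters, e.g. {type: normal, mean: 0, stddev: 0}. For a description of the parameters of each initializer, see torch.nn.init.
embeddings_on_cpu (default: false): Whether to keep the embedding matrix in regular memory and have the CPU perform the lookups. By default, embedding matrices are stored in GPU memory when a GPU is used, since that allows faster access, but in some cases the matrix may be too large to fit. Setting this to true places the embedding matrix in regular memory and uses the CPU for embedding lookup, which slows the process slightly because of data transfer between CPU and GPU memory. Options: true, false.
embeddings_trainable (default: true): If true, embeddings are trained during the training process; if false, embeddings are fixed. This can be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when representation is dense; sparse one-hot encodings are not trainable. Options: true, false.
reduce_output (default: last): How to reduce the output tensor along the s sequence length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, None, null.
norm_params (default: null): Default parameters passed to the norm module.
fc_layers (default: null): A list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default provided as a standalone parameter is used instead.
num_filters (default: 256): Number of filters, and by consequence number of output channels of the 1D convolution.
padding (default: same): Padding to use. Options: valid, same.
num_rec_layers (default: 1): The number of stacked recurrent layers.
bidirectional (default: false): If true, two recurrent networks perform encoding in the forward and backward directions and their outputs are concatenated. Options: true, false.
pretrained_embeddings (default: null): Path to a file containing pretrained embeddings. By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the file is loaded, only the embeddings whose labels are present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all other embeddings plus some random noise to make them different from each other. This parameter has effect only if representation is dense.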

Transformer Encoder

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["emb_12\nemb__7\nemb_43\nemb_65\nemb_23\nemb__4\nemb__1"];
  B --> C["Transformer\n Blocks"];
  C --> D["Fully\n Connected\n Layers"];
  D --> ...;

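The transformer encoder implements a stack of transformer blocks, replicating the architecture introduced in the Attention is all you need paper, and adds an optional stack of fully connected layers at the end.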
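The core operation inside each transformer block is scaled dot-product self-attention. The sketch below shows it for a single head on one sequence, in plain Python. It is a simplification for illustration and not the encoder's code: the learned query, key and value projections are replaced by the identity, and the feed-forward sublayer, residual connections and normalization are left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Single-head scaled dot-product self-attention over an s x d sequence.

    Simplification: queries, keys and values are all the raw input vectors.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # s = 3, d = 2
attended = self_attention(seq)
print(len(attended), len(attended[0]))       # the s x d shape is preserved
```

Since the output keeps one vector per position, a reduce operation such as reduce_output: last is still needed to obtain a single vector per sequence.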

encoder:
    type: transformer
    dropout: 0.1
    num_layers: 1
    embedding_size: 256
    output_size: 256
    norm: null
    num_fc_layers: 0
    fc_dropout: 0.0
    hidden_size: 256
    transformer_output_size: 256
    fc_activation: relu
    representation: dense
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    embeddings_on_cpu: false
    embeddings_trainable: true
    reduce_output: last
    norm_params: null
    fc_layers: null
    num_heads: 8
    pretrained_embeddings: null

参数

dropout (默认值: 0.1) : Transformer 块的 dropout 率。增加 dropout 是一种常见的正则化形式，用于对抗过拟合。dropout 表示元素被置零的概率（0.0 表示没有 dropout）。
num_layers (默认值: 1) : Transformer 层数。
embedding_size (默认值: 256) : 最大嵌入大小。对于 dense 表示，实际大小将是 min(词汇表大小, 嵌入大小)，对于 sparse 编码，实际大小将正好是 词汇表大小，其中 词汇表大小 是训练集输入列中出现的唯一字符串数加上特殊 token 的数量（<UNK>, <PAD>, <SOS>, <EOS>）。
output_size (默认值: 256) : 将用于每一层的默认 output_size。
norm (默认值: null) : 将用于每一层的默认范数。选项：batch, layer, ghost, null。
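num_fc_layers (default: 0) : Number of fully connected layers stacked after the Transformer blocks. Increasing the number of layers adds capacity to the model, enabling it to learn more complex feature interactions.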
fc_dropout (默认值: 0.0) : 应用于全连接层的默认 dropout 率。增加 dropout 是一种常见的正则化形式，用于对抗过拟合。dropout 表示元素被置零的概率（0.0 表示没有 dropout）。
hidden_size (默认值: 256): Transformer 块内隐藏表示的大小。通常与 embedding_size 相同，但如果两个值不同，则将在第一个 Transformer 块之前添加一个投影层。
transformer_output_size (默认值: 256): Transformer 块中自注意力后的全连接层的大小。通常与 hidden_size 和 embedding_size 相同。
fc_activation (默认值: relu): 应用于全连接层输出的默认激活函数。选项：elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null。
representation (默认值: dense): 嵌入的表示。dense 表示嵌入是随机初始化的，sparse 表示嵌入被初始化为 one-hot 编码。选项：dense, sparse。
use_bias (默认值: true): 是否使用偏置向量。选项：true, false。
bias_initializer (默认值: zeros): 偏置向量的初始化器。选项：uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity。或者，也可以指定一个字典，其中包含一个标识初始化器类型的键 type 和其他参数键，例如 {type: normal, mean: 0, stddev: 0}。有关每个初始化器参数的描述，请参见 torch.nn.init。
weights_initializer (默认值: xavier_uniform): 权重矩阵的初始化器。选项：uniform, normal, constant, ones, zeros, eye, dirac, xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal, orthogonal, sparse, identity。或者，也可以指定一个字典，其中包含一个标识初始化器类型的键 type 和其他参数键，例如 {type: normal, mean: 0, stddev: 0}。有关每个初始化器参数的描述，请参见 torch.nn.init。
embeddings_on_cpu (默认值: false): 是否强制将嵌入矩阵放在常规内存中，并由 CPU 解析它们。默认情况下，如果使用 GPU，嵌入矩阵存储在 GPU 内存中，因为它允许更快的访问，但在某些情况下，嵌入矩阵可能太大。此参数强制将嵌入矩阵放在常规内存中，并使用 CPU 进行嵌入查找，由于 CPU 和 GPU 内存之间的数据传输，这会略微降低速度。选项：true, false。
embeddings_trainable (默认值: true): 如果为 true，则在训练过程中训练嵌入，如果为 false，则嵌入固定。加载预训练嵌入时避免微调它们可能很有用。此参数仅在 representation 为 dense 时有效；sparse one-hot 编码不可训练。选项：true, false。
reduce_output (default: last): How to reduce the output tensor along the sequence-length dimension if the rank of the tensor is greater than 2. Options: last, sum, mean, avg, max, concat, attention, none, null.
norm_params (默认值: null): 传递给 norm 模块的默认参数。
fc_layers (默认值: null): 包含所有全连接层参数的字典列表。列表的长度决定了堆叠的全连接层数，每个字典的内容决定了特定层的参数。每个可用的层参数包括：activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer 和 weights_initializer。如果字典中缺少这些值中的任何一个，将使用作为独立参数提供的默认值。
num_heads (默认值: 8): 每个 Transformer 块中的注意力头数。
pretrained_embeddings (默认值: null): 包含预训练嵌入的文件的路径。默认情况下，dense 嵌入是随机初始化的，但此参数允许指定包含 GloVe 格式嵌入的文件路径。加载包含嵌入的文件时，只保留词汇表中存在的标签的嵌入，其他嵌入将被丢弃。如果词汇表包含在嵌入文件中没有匹配项的字符串，它们的嵌入将使用所有其他嵌入的平均值加上一些随机噪声进行初始化，以使它们彼此不同。此参数仅在 representation 为 dense 时有效。
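
The reduce_output options listed above can be illustrated with a minimal pure-Python sketch. It is not Ludwig's implementation: the function name reduce_sequence is hypothetical, padding is ignored, and the learned attention reduction is left out because it needs trainable weights.

```python
from typing import List, Optional, Union

Vector = List[float]


def reduce_sequence(seq: List[Vector], mode: Optional[str]) -> Union[Vector, List[Vector]]:
    """Collapse a [sequence_length, hidden_size] list of vectors along the sequence axis.

    Hypothetical helper for illustration only; it ignores padding tokens.
    """
    if mode in (None, "none", "null"):
        return seq  # no reduction: keep the full sequence
    if mode == "last":
        return seq[-1]  # vector at the final sequence position
    if mode == "concat":
        return [x for vec in seq for x in vec]  # flatten all positions into one vector
    columns = list(zip(*seq))  # one tuple per hidden dimension
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode in ("mean", "avg"):
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unsupported reduce_output mode: {mode!r}")


example = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]  # 3 positions, hidden size 2
summed = reduce_sequence(example, "sum")
```

Every mode except concat and null returns a vector of size hidden_size; concat returns sequence_length * hidden_size values, so its output size depends on the sequence length.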

Huggingface 编码器¶

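All Hugging Face-based text encoders are configured with the following parameters:

pretrained_model_name_or_path (default: the default model path of the chosen Hugging Face encoder, e.g. bert-base-uncased for BERT): Either the name of a model or a path where the model was downloaded. For details on the available variants, refer to the Hugging Face documentation.
reduce_output (default: cls_pooled): Defines how to reduce the output tensor along the sequence-length dimension if the rank of the tensor is greater than 2. Available values are: cls_pooled, sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension) and null (does not reduce and returns the full tensor).
trainable (default: false): If true, the weights of the encoder are trained; otherwise they are kept frozen.

Note

Any hyperparameter of any Hugging Face encoder can be overridden. Check the Hugging Face documentation to see which parameters are used by which models.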

name: text_column_name
type: text
encoder: bert
trainable: true
num_attention_heads: 16 # Instead of 12
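
When overriding num_attention_heads as in the example above, it helps to check that the hidden size still splits evenly across heads, since BERT-style attention requires hidden_size to be a multiple of the number of heads. The sketch below is a hypothetical pre-flight check, not part of Ludwig or transformers; the value 768 is the BERT hidden_size default documented in the BERT section.

```python
def head_dim(hidden_size: int, num_attention_heads: int) -> int:
    """Return the per-head dimension, or raise if the heads do not divide the hidden size.

    Hypothetical helper for illustration only.
    """
    if num_attention_heads <= 0:
        raise ValueError("num_attention_heads must be positive")
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) is not a multiple of "
            f"num_attention_heads ({num_attention_heads})"
        )
    return hidden_size // num_attention_heads


default_dim = head_dim(768, 12)   # the documented BERT defaults
override_dim = head_dim(768, 16)  # the override used in the example above
```

With 16 heads each head works on 48 dimensions instead of 64; a value such as 10 would be rejected because 768 is not divisible by it.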

AutoTransformer¶

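The auto_transformer encoder automatically instantiates the model architecture for the specified pretrained_model_name_or_path. Unlike the other Hugging Face encoders, auto_transformer does not provide a default value for pretrained_model_name_or_path; this is its only mandatory parameter. See the Hugging Face AutoModels documentation for more details.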

encoder:
    type: auto_transformer
    pretrained_model_name_or_path: bert
    trainable: false
    reduce_output: sum
    pretrained_kwargs: null
    adapter: null

参数

pretrained_model_name_or_path (默认值: null) : 预训练模型的名称或路径。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。选项：last, sum, mean, avg, max, concat, attention, none, None, null。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。
adapter (默认值: null): 是否使用参数高效微调。

ALBERT¶

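The albert encoder loads a pretrained ALBERT model (default albert-base-v2) using the Hugging Face transformers package. ALBERT is similar to BERT, with significantly lower memory usage and somewhat faster training time: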

encoder:
    type: albert
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: albert-base-v2
    reduce_output: cls_pooled
    embedding_size: 128
    hidden_size: 768
    num_hidden_layers: 12
    num_hidden_groups: 1
    num_attention_heads: 12
    intermediate_size: 3072
    inner_group_num: 1
    hidden_act: gelu_new
    hidden_dropout_prob: 0.0
    attention_probs_dropout_prob: 0.0
    max_position_embeddings: 512
    type_vocab_size: 2
    initializer_range: 0.02
    layer_norm_eps: 1.0e-12
    classifier_dropout_prob: 0.1
    position_embedding_type: absolute
    pad_token_id: 0
    bos_token_id: 2
    eos_token_id: 3
    pretrained_kwargs: null
    adapter: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: albert-base-v2): 预训练模型的名称或路径。
reduce_output (默认值: cls_pooled): 将张量序列缩减为单个张量的方法。
embedding_size (默认值: 128): 词汇表嵌入的维度。
hidden_size (默认值: 768): 编码器层和 pooler 层的维度。
num_hidden_layers (默认值: 12): Transformer 编码器中的隐藏层数。
num_hidden_groups (默认值: 1): 隐藏层的组数，同一组中的参数共享。
num_attention_heads (默认值: 12): Transformer 编码器中每个注意力层的注意力头数。
intermediate_size (默认值: 3072): Transformer 编码器中“中间”（通常称为前馈）层的维度。
inner_group_num (默认值: 1): 注意力和 ffn 的内部重复次数。
hidden_act (默认值: gelu_new): 编码器和 pooler 中的非线性激活函数（函数或字符串）。选项：gelu, relu, silu, gelu_new。
hidden_dropout_prob (默认值: 0.0): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
attention_probs_dropout_prob (默认值: 0.0): 注意力概率的 dropout 率。
max_position_embeddings (默认值: 512): 该模型可能使用的最大序列长度。通常将其设置为较大的值（例如 512、1024 或 2048）。
type_vocab_size (默认值: 2): 调用 AlbertModel 或 TFAlbertModel 时传递的 token_type_ids 的词汇表大小。
initializer_range (默认值: 0.02): 用于初始化所有权重矩阵的 truncated_normal_initializer 的标准差。
layer_norm_eps (默认值: 1e-12): 层归一化层使用的 epsilon。
classifier_dropout_prob (默认值: 0.1): 附加分类器的 dropout 率。
position_embedding_type (默认值: absolute): 选项：absolute, relative_key, relative_key_query。
pad_token_id (默认值: 0): 用作填充的 token ID。
bos_token_id (默认值: 2): 序列开始 token ID。
eos_token_id (默认值: 3): 序列结束 token ID。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。
adapter (默认值: null): 是否使用参数高效微调。
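
One source of ALBERT's lower memory use is that embedding_size (128) is smaller than hidden_size (768): tokens are embedded in the small space and then projected up. The sketch below compares the embedding parameter counts with and without this factorization. The vocabulary size of 30000 is an assumption (it is not listed in the parameters above), bias terms are ignored, and the function name is hypothetical.

```python
from typing import Tuple


def embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> Tuple[int, int]:
    """Return (factorized, unfactorized) embedding parameter counts, ignoring biases."""
    # Factorized: a vocab x embedding table plus an embedding -> hidden projection.
    factorized = vocab_size * embedding_size + embedding_size * hidden_size
    # Unfactorized: a single vocab x hidden table, as in BERT-style models.
    unfactorized = vocab_size * hidden_size
    return factorized, unfactorized


# vocab_size=30000 is an assumed value; 128 and 768 are the documented defaults.
factorized, unfactorized = embedding_params(30000, 128, 768)
```

Under these assumptions the factorized embedding needs about 3.9 million parameters instead of about 23 million. Sharing layer parameters within a group (num_hidden_groups) reduces the count further and is not modeled here.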

BERT¶

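The bert encoder loads a pretrained BERT model (default bert-base-uncased) using the Hugging Face transformers package. BERT is a bidirectional Transformer pretrained with a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.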

encoder:
    type: bert
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: bert-base-uncased
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 512
    classifier_dropout: null
    reduce_output: cls_pooled
    hidden_size: 768
    num_hidden_layers: 12
    num_attention_heads: 12
    intermediate_size: 3072
    hidden_act: gelu
    type_vocab_size: 2
    initializer_range: 0.02
    layer_norm_eps: 1.0e-12
    pad_token_id: 0
    gradient_checkpointing: false
    position_embedding_type: absolute
    pretrained_kwargs: null
    adapter: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: bert-base-uncased): 预训练模型的名称或路径。
hidden_dropout_prob (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
attention_probs_dropout_prob (默认值: 0.1): 注意力概率的 dropout 率。
max_position_embeddings (默认值: 512): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
classifier_dropout (默认值: null): 分类头的 dropout 率。
reduce_output (默认值: cls_pooled): 将张量序列缩减为单个张量的方法。
hidden_size (默认值: 768): 编码器层和 pooler 层的维度。
num_hidden_layers (默认值: 12): Transformer 编码器中的隐藏层数。
num_attention_heads (默认值: 12): Transformer 编码器中每个注意力层的注意力头数。
intermediate_size (默认值: 3072): Transformer 编码器中“中间”（通常称为前馈）层的维度。
hidden_act (默认值: gelu): 编码器和 pooler 中的非线性激活函数（函数或字符串）。选项：gelu, relu, silu, gelu_new。
type_vocab_size (默认值: 2): 调用 BertModel 或 TFBertModel 时传递的 token_type_ids 的词汇表大小。
initializer_range (默认值: 0.02): 用于初始化所有权重矩阵的 truncated_normal_initializer 的标准差。
layer_norm_eps (默认值: 1e-12): 层归一化层使用的 epsilon。
pad_token_id (默认值: 0): 用作填充的 token ID。
gradient_checkpointing (默认值: false): 是否使用梯度检查点。选项：true, false。
position_embedding_type (默认值: absolute): 位置嵌入类型。选项：absolute, relative_key, relative_key_query。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。
adapter (默认值: null): 是否使用参数高效微调。

CamemBERT¶

The camembert encoder loads a pretrained CamemBERT model (default camembert-base) using the Hugging Face transformers package. CamemBERT is pretrained on a large French web-crawled text corpus.

encoder:
    type: camembert
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: camembert-base
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 514
    classifier_dropout: null
    reduce_output: sum
    hidden_size: 768
    hidden_act: gelu
    initializer_range: 0.02
    adapter: null
    num_hidden_layers: 12
    num_attention_heads: 12
    intermediate_size: 3072
    type_vocab_size: 1
    layer_norm_eps: 1.0e-05
    pad_token_id: 1
    gradient_checkpointing: false
    position_embedding_type: absolute
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: camembert-base): 预训练模型的名称或路径。
hidden_dropout_prob (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
attention_probs_dropout_prob (默认值: 0.1): 注意力概率的 dropout 率。
max_position_embeddings (默认值: 514): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
classifier_dropout (默认值: null): 分类头的 dropout 率。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。
hidden_size (默认值: 768): 编码器层和 pooler 层的维度。
hidden_act (默认值: gelu): 编码器和 pooler 中的非线性激活函数（函数或字符串）。选项：gelu, relu, silu, gelu_new。
initializer_range (默认值: 0.02): 用于初始化所有权重矩阵的 truncated_normal_initializer 的标准差。
adapter (默认值: null): 是否使用参数高效微调。
num_hidden_layers (默认值: 12): Transformer 编码器中的隐藏层数。
num_attention_heads (默认值: 12): Transformer 编码器中每个注意力层的注意力头数。
intermediate_size (默认值: 3072): Transformer 编码器中“中间”（通常称为前馈）层的维度。
type_vocab_size (默认值: 1): 调用 BertModel 或 TFBertModel 时传递的 token_type_ids 的词汇表大小。
layer_norm_eps (默认值: 1e-05): 层归一化层使用的 epsilon。
pad_token_id (默认值: 1): 用作填充的 token ID。
gradient_checkpointing (默认值: false): 是否使用梯度检查点。选项：true, false。
position_embedding_type (默认值: absolute): 位置嵌入类型。选项：absolute, relative_key, relative_key_query。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。

DeBERTa¶

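The DeBERTa encoder improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data. In DeBERTa V3, the authors further improved the efficiency of DeBERTa using ELECTRA-style pretraining with gradient-disentangled embedding sharing. Compared to DeBERTa, the V3 version significantly improves model performance on downstream tasks.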

encoder:
    type: deberta
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: sileod/deberta-v3-base-tasksource-nli
    hidden_size: 1536
    num_hidden_layers: 24
    num_attention_heads: 24
    intermediate_size: 6144
    hidden_act: gelu
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 512
    type_vocab_size: 0
    initializer_range: 0.02
    layer_norm_eps: 1.0e-12
    relative_attention: true
    max_relative_positions: -1
    pad_token_id: 0
    position_biased_input: false
    pos_att_type:
    - p2c
    - c2p
    pooler_hidden_size: 1536
    pooler_dropout: 0
    pooler_hidden_act: gelu
    position_buckets: 256
    share_att_key: true
    norm_rel_ebd: layer_norm
    adapter: null
    pretrained_kwargs: null
    reduce_output: sum

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: sileod/deberta-v3-base-tasksource-nli): 预训练模型的名称或路径。
hidden_size (默认值: 1536): 编码器层和 pooler 层的维度。
num_hidden_layers (默认值: 24): Transformer 编码器中的隐藏层数。
num_attention_heads (默认值: 24): Transformer 编码器中每个注意力层的注意力头数。
intermediate_size (默认值: 6144): Transformer 编码器中“中间”（通常称为前馈）层的维度。
hidden_act (默认值: gelu): 编码器和 pooler 中的非线性激活函数（函数或字符串）。选项：gelu, relu, silu, tanh, gelu_fast, mish, linear, sigmoid, gelu_new。
hidden_dropout_prob (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
attention_probs_dropout_prob (默认值: 0.1): 注意力概率的 dropout 率。
max_position_embeddings (默认值: 512): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
type_vocab_size (默认值: 0): token_type_ids 的词汇表大小。
initializer_range (默认值: 0.02): 用于初始化所有权重矩阵的 truncated_normal_initializer 的标准差。
layer_norm_eps (默认值: 1e-12): 层归一化层使用的 epsilon。
relative_attention (默认值: true): 是否使用相对位置编码。选项：true, false。
max_relative_positions (default: -1): The range of relative positions, [-max_relative_positions, max_relative_positions]. The default of -1 uses the same value as max_position_embeddings.
pad_token_id (默认值: 0): 用于填充 input_ids 的值。
position_biased_input (默认值: false): 是否将绝对位置嵌入添加到内容嵌入中。选项：true, false。
pos_att_type (default: ["p2c", "c2p"]): The type of relative position attention. It can be a combination of ['p2c', 'c2p'], e.g. ['p2c'] or ['p2c', 'c2p'].
pooler_hidden_size (默认值: 1536): pooler 层的隐藏层大小。
pooler_dropout (默认值: 0): pooler 层的 dropout 率。
pooler_hidden_act (默认值: gelu): pooler 中的激活函数（函数或字符串）。选项：gelu, relu, silu, tanh, gelu_fast, mish, linear, sigmoid, gelu_new。
position_buckets (默认值: 256): 用于每个注意力层的桶数。
share_att_key (默认值: true): 是否跨层共享注意力键。选项：true, false。
norm_rel_ebd (默认值: layer_norm): 相对嵌入的归一化方法。选项：layer_norm, none。
adapter (默认值: null): 是否使用参数高效微调。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。选项：cls_pooled, last, sum, mean, max, concat, attention, null。

DistilBERT¶

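The distilbert encoder loads a pretrained DistilBERT model (default distilbert-base-uncased) using the Hugging Face transformers package. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.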

encoder:
    type: distilbert
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: distilbert-base-uncased
    dropout: 0.1
    max_position_embeddings: 512
    attention_dropout: 0.1
    activation: gelu
    reduce_output: sum
    initializer_range: 0.02
    qa_dropout: 0.1
    seq_classif_dropout: 0.2
    adapter: null
    sinusoidal_pos_embds: false
    n_layers: 6
    n_heads: 12
    dim: 768
    hidden_dim: 3072
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: distilbert-base-uncased): 预训练模型的名称或路径。
dropout (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
max_position_embeddings (默认值: 512): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
attention_dropout (默认值: 0.1): 注意力概率的 dropout 率。
activation (默认值: gelu): 编码器和 pooler 中的非线性激活函数（函数或字符串）。如果为字符串，支持 'gelu', 'relu', 'silu' 和 'gelu_new'。选项：gelu, relu, silu, gelu_new。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。
initializer_range (默认值: 0.02): 用于初始化所有权重矩阵的 truncated_normal_initializer 的标准差。
qa_dropout (默认值: 0.1): 在问答模型 DistilBertForQuestionAnswering 中使用的 dropout 概率。
seq_classif_dropout (默认值: 0.2): 在序列分类和多项选择模型 DistilBertForSequenceClassification 中使用的 dropout 概率。
adapter (默认值: null): 是否使用参数高效微调。
sinusoidal_pos_embds (默认值: false): 是否使用正弦位置嵌入。选项：true, false。
n_layers (默认值: 6): Transformer 编码器中的隐藏层数。
n_heads (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
dim (默认值: 768): 编码器层和 pooler 层的维度。
hidden_dim (默认值: 3072): Transformer 编码器中“中间”（通常称为前馈）层的大小。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。

ELECTRA¶

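The electra encoder loads a pretrained ELECTRA model using the Hugging Face transformers package. ELECTRA is a pretraining approach that trains two Transformer models: a generator and a discriminator. The generator's role is to replace tokens in a sequence, so it is trained as a masked language model. The discriminator, which is the model of interest here, tries to identify which tokens in the sequence were replaced by the generator.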

encoder:
    type: electra
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: google/electra-small-discriminator
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 512
    classifier_dropout: null
    reduce_output: sum
    embedding_size: 128
    hidden_size: 256
    hidden_act: gelu
    initializer_range: 0.02
    adapter: null
    num_hidden_layers: 12
    num_attention_heads: 4
    intermediate_size: 1024
    type_vocab_size: 2
    layer_norm_eps: 1.0e-12
    position_embedding_type: absolute
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: google/electra-small-discriminator): 预训练模型的名称或路径。
hidden_dropout_prob (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
attention_probs_dropout_prob (默认值: 0.1): 注意力概率的 dropout 率。
max_position_embeddings (默认值: 512): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
classifier_dropout (默认值: null): 分类头的 dropout 率。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。
embedding_size (default: 128): Dimensionality of the token embeddings.
hidden_size (默认值: 256): 编码器层和 pooler 层的维度。
hidden_act (默认值: gelu): 编码器和 pooler 中的非线性激活函数（函数或字符串）。选项：gelu, relu, silu, gelu_new。
initializer_range (默认值: 0.02): 用于初始化所有权重矩阵的 truncated_normal_initializer 的标准差。
adapter (默认值: null): 是否使用参数高效微调。
num_hidden_layers (默认值: 12): Transformer 编码器中的隐藏层数。
num_attention_heads (默认值: 4): Transformer 编码器中每个注意力层的注意力头数。
intermediate_size (默认值: 1024): Transformer 编码器中“中间”（即前馈）层的维度。
type_vocab_size (默认值: 2): 调用 ElectraModel 或 TFElectraModel 时传递的 token_type_ids 的词汇表大小。
layer_norm_eps (默认值: 1e-12): 层归一化层使用的 epsilon。
position_embedding_type (默认值: absolute): 位置嵌入类型。选项：absolute, relative_key, relative_key_query。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。

FlauBERT¶

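The flaubert encoder loads a pretrained FlauBERT model (default flaubert/flaubert_small_cased) using the Hugging Face transformers package. FlauBERT has an architecture similar to BERT and is pretrained on a large French-language corpus.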

encoder:
    type: flaubert
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: flaubert/flaubert_small_cased
    dropout: 0.1
    reduce_output: sum
    pre_norm: true
    layerdrop: 0.2
    emb_dim: 512
    n_layers: 6
    n_heads: 8
    attention_dropout: 0.1
    gelu_activation: true
    sinusoidal_embeddings: false
    causal: false
    asm: false
    n_langs: 1
    use_lang_emb: true
    max_position_embeddings: 512
    embed_init_std: 0.02209708691207961
    init_std: 0.02
    layer_norm_eps: 1.0e-06
    bos_index: 0
    eos_index: 1
    pad_index: 2
    unk_index: 3
    mask_index: 5
    is_encoder: true
    mask_token_id: 0
    lang_id: 0
    pretrained_kwargs: null
    adapter: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: flaubert/flaubert_small_cased): 预训练模型的名称或路径。
dropout (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。
pre_norm (默认值: true): 是否在每个层中自注意力后的前馈层之前或之后应用层归一化 (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)。选项：true, false。
layerdrop (默认值: 0.2): 训练期间丢弃层的概率 (Fan et al., Reducing Transformer Depth on Demand with Structured Dropout. ICLR 2020)。
emb_dim (默认值: 512): 编码器层和 pooler 层的维度。
n_layers (默认值: 6): Transformer 编码器中的隐藏层数。
n_heads (默认值: 8): Transformer 编码器中每个注意力层的注意力头数。
attention_dropout (默认值: 0.1): 注意力机制的 dropout 概率。
gelu_activation (默认值: true): 是否使用 gelu 激活函数而不是 relu。选项：true, false。
sinusoidal_embeddings (默认值: false): 是否使用正弦位置嵌入而不是绝对位置嵌入。选项：true, false。
causal (默认值: false): 模型是否应以因果方式工作。因果模型使用三角形注意力掩码，以便仅关注左侧上下文，而不是双向上下文。选项：true, false。
asm (默认值: false): 是否使用自适应对数 softmax 投影层而不是线性层作为预测层。选项：true, false。
n_langs (默认值: 1): 模型处理的语言数量。对于单语模型设置为 1。
use_lang_emb (默认值: true): 是否使用语言嵌入。某些模型使用附加的语言嵌入，有关如何使用它们的信息，请参阅多语言模型页面。选项：true, false。
max_position_embeddings (默认值: 512): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
embed_init_std (默认值: 0.02209708691207961): 用于初始化嵌入矩阵的 truncated_normal_initializer 的标准差。
init_std (默认值: 0.02): 用于初始化除嵌入矩阵外的所有权重矩阵的 truncated_normal_initializer 的标准差。
layer_norm_eps (默认值: 1e-06): 层归一化层使用的 epsilon。
bos_index (默认值: 0): 词汇表中句子开始 token 的索引。
eos_index (默认值: 1): 词汇表中句子结束 token 的索引。
pad_index (默认值: 2): 词汇表中填充 token 的索引。
unk_index (默认值: 3): 词汇表中未知 token 的索引。
mask_index (默认值: 5): 词汇表中掩码 token 的索引。
is_encoder (默认值: true): 初始化模型是否应为 Transformer 编码器或解码器，如 Vaswani 等人所述。选项：true, false。
mask_token_id (默认值: 0): 在 MLM 上下文中生成文本时识别掩码 token 的模型无关参数。
lang_id (默认值: 0): 模型使用的语言 ID。此参数用于在给定语言中生成文本。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。
adapter (默认值: null): 是否使用参数高效微调。

GPT¶

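The gpt encoder loads a pretrained GPT model (default openai-gpt) using the Hugging Face transformers package. GPT is a causal (unidirectional) Transformer pretrained using language modeling on a large corpus with long-range dependencies, the Toronto Book Corpus.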

encoder:
    type: gpt
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: openai-gpt
    reduce_output: sum
    initializer_range: 0.02
    adapter: null
    n_positions: 40478
    n_ctx: 512
    n_embd: 768
    n_layer: 12
    n_head: 12
    afn: gelu
    resid_pdrop: 0.1
    embd_pdrop: 0.1
    attn_pdrop: 0.1
    layer_norm_epsilon: 1.0e-05
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: openai-gpt): 预训练模型的名称或路径。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。
initializer_range (默认值: 0.02): 用于初始化所有权重矩阵的 truncated_normal_initializer 的标准差。
adapter (默认值: null): 是否使用参数高效微调。
n_positions (默认值: 40478): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
n_ctx (默认值: 512): 因果掩码的维度（通常与 n_positions 相同）。
n_embd (默认值: 768): 嵌入和隐藏状态的维度。
n_layer (默认值: 12): Transformer 编码器中的隐藏层数。
n_head (默认值: 12): Transformer 编码器中每个注意力层的注意力头数。
afn (默认值: gelu): 编码器和 pooler 中的非线性激活函数（函数或字符串）。选项：gelu, relu, silu。
resid_pdrop (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
embd_pdrop (默认值: 0.1): 嵌入的 dropout 率。
attn_pdrop (默认值: 0.1): 注意力的 dropout 率。
layer_norm_epsilon (默认值: 1e-05): 层归一化层中使用的 epsilon。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。

GPT2¶

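The gpt2 encoder loads a pretrained GPT-2 model (default gpt2) using the Hugging Face transformers package. GPT-2 is a causal (unidirectional) Transformer pretrained using language modeling on a large text corpus of about 40 GB.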

encoder:
    type: gpt2
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: gpt2
    reduce_output: sum
    initializer_range: 0.02
    adapter: null
    n_positions: 1024
    n_ctx: 1024
    n_embd: 768
    n_layer: 12
    n_head: 12
    n_inner: null
    activation_function: gelu_new
    resid_pdrop: 0.1
    embd_pdrop: 0.1
    attn_pdrop: 0.1
    layer_norm_epsilon: 1.0e-05
    scale_attn_weights: true
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path (默认值: gpt2): 预训练模型的名称或路径。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。
initializer_range (默认值: 0.02): 用于初始化所有权重矩阵的 truncated_normal_initializer 的标准差。
adapter (默认值: null): 是否使用参数高效微调。
n_positions (默认值: 1024): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
n_ctx (默认值: 1024): 因果掩码的维度（通常与 n_positions 相同）。
n_embd (默认值: 768): 嵌入和隐藏状态的维度。
n_layer (默认值: 12): Transformer 编码器中的隐藏层数。
n_head (默认值: 12): Transformer 编码器中每个注意力层的注意力头数。
n_inner (默认值: null): 内部前馈层的维度。如果为 None，则将其设置为 n_embd 的 4 倍。
activation_function（默认值：gelu_new）：激活函数，从列表 ['relu', 'silu', 'gelu', 'tanh', 'gelu_new'] 中选择。选项：relu, silu, gelu, tanh, gelu_new。
resid_pdrop (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
embd_pdrop (默认值: 0.1): 嵌入的 dropout 率。
attn_pdrop (默认值: 0.1): 注意力的 dropout 率。
layer_norm_epsilon（默认值：1e-05）：用于层归一化层的 epsilon。
scale_attn_weights（默认值：true）：通过除以 sqrt(hidden_size) 来缩放注意力权重。选项：true, false。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。
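
The n_inner fallback described above (null means four times n_embd) can be written as a one-line rule. This is a minimal sketch with a hypothetical function name, not the library's code.

```python
from typing import Optional


def resolve_n_inner(n_embd: int, n_inner: Optional[int] = None) -> int:
    """Return the feed-forward inner dimension, defaulting to 4 * n_embd when unset."""
    return 4 * n_embd if n_inner is None else n_inner


default_inner = resolve_n_inner(768)         # n_inner: null with the default n_embd
explicit_inner = resolve_n_inner(768, 2048)  # an explicit override is kept as given
```

With the default n_embd of 768, the feed-forward layer is therefore 3072 wide unless n_inner is set explicitly.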

Longformer¶

The longformer encoder loads a pretrained Longformer model (default allenai/longformer-base-4096) using the Hugging Face transformers package. Longformer is a good choice for longer text, as it supports sequences of up to 4096 tokens.

encoder:
    type: longformer
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: allenai/longformer-base-4096
    max_position_embeddings: 4098
    reduce_output: cls_pooled
    attention_window: 512
    sep_token_id: 2
    adapter: null
    type_vocab_size: 1
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path（默认值：allenai/longformer-base-4096）：预训练模型的名称或路径。
max_position_embeddings（默认值：4098）：此模型可能使用的最大序列长度。通常设置为一个较大的值以防万一（例如，512、1024 或 2048）。
reduce_output (默认值: cls_pooled): 将张量序列缩减为单个张量的方法。
attention_window（默认值：512）：每个 token 周围的注意力窗口大小。如果是一个整数，则所有层使用相同的大小。要为每个层指定不同的窗口大小，请使用 List[int]，其中 len(attention_window) == num_hidden_layers。
sep_token_id（默认值：2）：分隔 token 的 ID，用于从多个序列构建一个序列时使用。
adapter (默认值: null): 是否使用参数高效微调。
type_vocab_size（默认值：1）：调用 LongformerEncoder 时传入的 token_type_ids 的词汇表大小。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。
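
The attention_window rule above (a single integer for all layers, or one value per layer) can be sketched as a small normalization step. The function name is hypothetical, and the layer count of 12 used in the example is an assumption, since num_hidden_layers is not listed among the parameters in this section.

```python
from typing import List, Sequence, Union


def expand_attention_window(
    attention_window: Union[int, Sequence[int]], num_hidden_layers: int
) -> List[int]:
    """Return one attention window size per layer, validating list inputs."""
    if isinstance(attention_window, int):
        # A single integer applies the same window to every layer.
        return [attention_window] * num_hidden_layers
    windows = list(attention_window)
    if len(windows) != num_hidden_layers:
        raise ValueError(
            f"expected {num_hidden_layers} window sizes, got {len(windows)}"
        )
    return windows


# 12 layers is an assumed value for illustration.
uniform = expand_attention_window(512, 12)
per_layer = expand_attention_window([32] * 6 + [512] * 6, 12)
```

A per-layer list like the second call gives lower layers a narrow local window and upper layers a wider one; whether that helps is task dependent.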

RoBERTa¶

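The roberta encoder loads a pretrained RoBERTa model (default roberta-base) using the Hugging Face transformers package. RoBERTa is a replication of BERT pretraining that can match or exceed the performance of BERT. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with larger mini-batches and learning rates.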

encoder:
    type: roberta
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: roberta-base
    reduce_output: cls_pooled
    eos_token_id: 2
    adapter: null
    pad_token_id: 1
    bos_token_id: 0
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path（默认值：roberta-base）：预训练模型的名称或路径。
reduce_output (默认值: cls_pooled): 将张量序列缩减为单个张量的方法。
eos_token_id（默认值：2）：序列结束 token ID。
adapter (默认值: null): 是否使用参数高效微调。
pad_token_id (默认值: 1): 用作填充的 token ID。
bos_token_id（默认值：0）：序列开始 token ID。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。

T5¶

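The t5 encoder loads a pretrained T5 model (default t5-small) using the Hugging Face transformers package. T5 (Text-to-Text Transfer Transformer) is pretrained on a large text dataset crawled from the web and shows good transfer performance on multiple tasks.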

encoder:
    type: t5
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: t5-small
    num_layers: 6
    dropout_rate: 0.1
    reduce_output: sum
    d_ff: 2048
    adapter: null
    d_model: 512
    d_kv: 64
    num_decoder_layers: 6
    num_heads: 8
    relative_attention_num_buckets: 32
    layer_norm_eps: 1.0e-06
    initializer_factor: 1
    feed_forward_proj: relu
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path（默认值：t5-small）：预训练模型的名称或路径。
num_layers（默认值：6）：Transformer 编码器中的隐藏层数量。
dropout_rate（默认值：0.1）：所有 dropout 层的比率。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。
d_ff（默认值：2048）：每个 T5Block 中间前馈层的大小。
adapter (默认值: null): 是否使用参数高效微调。
d_model（默认值：512）：编码器层和池化层的大小。
d_kv（默认值：64）：每个注意力头的键、查询、值投影的大小。d_kv 必须等于 d_model // num_heads。
num_decoder_layers（默认值：6）：Transformer 解码器中的隐藏层数量。如果未设置，将使用 num_layers 的值。
num_heads（默认值：8）：Transformer 编码器中每个注意力层的注意力头数量。
relative_attention_num_buckets（默认值：32）：每个注意力层使用的桶数量。
layer_norm_eps (默认值: 1e-06): 层归一化层使用的 epsilon。
initializer_factor（默认值：1）：初始化所有权重矩阵的因子（应保持为 1，内部用于初始化测试）。
feed_forward_proj（默认值：relu）：使用的前馈层类型。应为 'relu' 或 'gated-gelu' 之一。T5v1.1 使用 'gated-gelu' 前馈投影。原始 T5 使用 'relu'。选项：relu, gated-gelu。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。

TransformerXL¶

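The transformer_xl encoder loads a pretrained Transformer-XL model (default transfo-xl-wt103) using the Hugging Face transformers package. The model adds a novel positional encoding scheme that improves understanding and generation of long-form text, up to thousands of tokens. Transformer-XL is a causal (unidirectional) Transformer with relative positional (sinusoidal) embeddings that can reuse previously computed hidden states to attend to a longer context (memory). The model also uses adaptive softmax inputs and outputs (tied).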

encoder:
    type: transformer_xl
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: transfo-xl-wt103
    dropout: 0.1
    reduce_output: sum
    adaptive: true
    adapter: null
    cutoffs:
    - 20000
    - 40000
    - 200000
    d_model: 1024
    d_embed: 1024
    n_head: 16
    d_head: 64
    d_inner: 4096
    div_val: 4
    pre_lnorm: false
    n_layer: 18
    mem_len: 1600
    clamp_len: 1000
    same_length: true
    proj_share_all_but_first: true
    attn_type: 0
    sample_softmax: -1
    dropatt: 0.0
    untie_r: true
    init: normal
    init_range: 0.01
    proj_init_std: 0.01
    init_std: 0.02
    layer_norm_epsilon: 1.0e-05
    eos_token_id: 0
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path（默认值：transfo-xl-wt103）：预训练模型的名称或路径。
dropout (默认值: 0.1): 嵌入、编码器和 pooler 中所有全连接层的 dropout 概率。
reduce_output (默认值: sum): 将张量序列缩减为单个张量的方法。
adaptive（默认值：true）：是否使用自适应 softmax。选项：true, false。
adapter (默认值: null): 是否使用参数高效微调。
cutoffs（默认值：[20000, 40000, 200000]）：自适应 softmax 的截止值。
d_model（默认值：1024）：模型隐藏状态的维度。
d_embed（默认值：1024）：嵌入的维度。
n_head（默认值：16）：Transformer 编码器中每个注意力层的注意力头数量。
d_head（默认值：64）：模型头部的维度。
d_inner（默认值：4096）：FF 中的内部维度。
div_val（默认值：4）：自适应输入和 softmax 的除数。
pre_lnorm（默认值：false）：是否在块中对输入而不是输出应用 LayerNorm。选项：true, false。
n_layer（默认值：18）：Transformer 编码器中的隐藏层数量。
mem_len（默认值：1600）：保留的先前头部的长度。
clamp_len（默认值：1000）：在 clamp_len 后使用相同的位置嵌入。
same_length（默认值：true）：是否对所有 token 使用相同的注意力长度。选项：true, false。
proj_share_all_but_first（默认值：true）：为 True 则共享除第一个以外的所有投影，为 False 则不共享。选项：true, false。
attn_type（默认值：0）：注意力类型。0 表示 Transformer-XL，1 表示 Shaw et al，2 表示 Vaswani et al，3 表示 Al Rfou et al。
sample_softmax（默认值：-1）：抽样 softmax 中的样本数量。
dropatt（默认值：0.0）：注意力概率的 dropout 比率。
untie_r（默认值：true）：是否解除相对位置偏差的绑定。选项：true, false。
init（默认值：normal）：要使用的参数初始化器。
init_range（默认值：0.01）：参数通过 U(-init_range, init_range) 初始化。
proj_init_std (default: 0.01): Projection parameters are initialized by N(0, proj_init_std).
init_std（默认值：0.02）：参数通过 N(0, init_std) 初始化。
layer_norm_epsilon (默认值: 1e-05): 层归一化层中使用的 epsilon。
eos_token_id（默认值：0）：序列结束 token ID。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。
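
The cutoffs and div_val parameters above work together: cutoffs split the vocabulary into frequency clusters, and div_val shrinks the embedding width of each successive cluster. The sketch below shows that bookkeeping under two assumptions: token IDs are sorted from most to least frequent, and cluster i uses d_embed // div_val**i dimensions. The function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def cluster_index(token_id: int, cutoffs: Sequence[int]) -> int:
    """Return the cluster a token ID falls into (0 is the head cluster of frequent tokens)."""
    return bisect_right(cutoffs, token_id)


def cluster_embedding_dims(d_embed: int, div_val: int, num_clusters: int) -> List[int]:
    """Return the embedding width per cluster, divided by div_val for each later cluster."""
    return [d_embed // (div_val ** i) for i in range(num_clusters)]


cutoffs = [20000, 40000, 200000]  # documented default
clusters = [cluster_index(t, cutoffs) for t in (0, 19999, 20000, 40000, 200000)]
dims = cluster_embedding_dims(d_embed=1024, div_val=4, num_clusters=len(cutoffs) + 1)
```

Three cutoffs produce four clusters, so with the defaults rare tokens get much narrower embeddings than the 20000 most frequent ones.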

XLMRoBERTa¶

The xlmroberta encoder loads a pretrained XLM-RoBERTa model (default xlm-roberta-base) using the Hugging Face transformers package. XLM-RoBERTa is a multilingual model similar to BERT, trained on 100 languages. It is based on Facebook's RoBERTa model released in 2019, and is a large multilingual model trained on 2.5TB of filtered CommonCrawl data.

encoder:
    type: xlmroberta
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: xlm-roberta-base
    reduce_output: cls_pooled
    max_position_embeddings: 514
    type_vocab_size: 1
    adapter: null
    pad_token_id: 1
    bos_token_id: 0
    eos_token_id: 2
    add_pooling_layer: true
    pretrained_kwargs: null

参数

use_pretrained (默认值: true) : 是否使用模型的预训练权重。如果为 false，模型将从头开始训练，这计算成本非常高。选项：true, false。
trainable (默认值: false) : 是否在您的数据集上微调模型。选项：true, false。
pretrained_model_name_or_path（默认值：xlm-roberta-base）：预训练模型的名称或路径。
reduce_output (默认值: cls_pooled): 将张量序列缩减为单个张量的方法。
max_position_embeddings (默认值: 514): 该模型可能使用的最大序列长度。通常将其设置为较大的值以防万一（例如 512、1024 或 2048）。
type_vocab_size（默认值：1）：传入的 token_type_ids 的词汇表大小。
adapter (默认值: null): 是否使用参数高效微调。
pad_token_id (默认值: 1): 用作填充的 token ID。
bos_token_id（默认值：0）：序列开始 token ID。
eos_token_id（默认值：2）：序列结束 token ID。
add_pooling_layer（默认值：true）：是否向编码器添加池化层。选项：true, false。
pretrained_kwargs (默认值: null): 传递给预训练模型的附加 kwargs。

XLNet¶

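The xlnet encoder loads a pretrained XLNet model (default xlnet-base-cased) using the Hugging Face transformers package. XLNet is an extension of the Transformer-XL model, pretrained using an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order. XLNet outperforms BERT on a variety of benchmarks.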

encoder:
    type: xlnet
    use_pretrained: true
    trainable: false
    pretrained_model_name_or_path: xlnet-base-cased
    dropout: 0.1
    reduce_output: sum
    ff_activation: gelu
    initializer_range: 0.02
    summary_activation: tanh
    summary_last_dropout: 0.1
    adapter: null
    d_model: 768
    n_layer: 12
    n_head: 12
    d_inner: 3072
    untie_r: true
    attn_type: bi
    layer_norm_eps: 1.0e-12
    mem_len: null
    reuse_len: null
    use_mems_eval: true
    use_mems_train: false
    bi_data: false
    clamp_len: -1
    same_length: false
    summary_type: last
    summary_use_proj: true
    start_n_top: 5
    end_n_top: 5
    pad_token_id: 5
    bos_token_id: 1
    eos_token_id: 2
    pretrained_kwargs: null

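Parameters

use_pretrained (default: true): Whether to use the pretrained weights for the model. If false, the model is trained from scratch, which is very computationally expensive. Options: true, false.
trainable (default: false): Whether to fine-tune the model on your dataset. Options: true, false.
pretrained_model_name_or_path (default: xlnet-base-cased): Name or path of the pretrained model.
dropout (default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
reduce_output (default: sum): The method used to reduce a sequence of tensors down to a single tensor.
ff_activation (default: gelu): The non-linear activation function (function or string) in the encoder and pooler. If a string, 'gelu', 'relu', 'silu' and 'gelu_new' are supported. Options: gelu, relu, silu, gelu_new.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer used to initialize all weight matrices.
summary_activation (default: tanh): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
summary_last_dropout (default: 0.1): Used in the sequence classification and multiple choice models.
adapter (default: null): Whether to use parameter-efficient fine-tuning.
d_model (default: 768): Dimensionality of the encoder layers and the pooler layer.
n_layer (default: 12): Number of hidden layers in the Transformer encoder.
n_head (default: 12): Number of attention heads for each attention layer in the Transformer encoder.
d_inner (default: 3072): Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
untie_r (default: true): Whether to untie the relative position biases. Options: true, false.
attn_type (default: bi): The attention type used by the model. Currently only 'bi' is supported. Options: bi.
layer_norm_eps (default: 1e-12): The epsilon used by the layer normalization layers.
mem_len (default: null): The number of tokens to cache. Key/value pairs that were already computed in a previous forward pass are not recomputed.
reuse_len (default: null): The number of tokens in the current batch to be cached and reused in the future.
use_mems_eval (default: true): Whether the model should use the recurrent memory mechanism in evaluation mode. Options: true, false.
use_mems_train (default: false): Whether the model should use the recurrent memory mechanism in training mode. Options: true, false.
bi_data (default: false): Whether to use a bidirectional input pipeline. Usually set to True during pretraining and False during fine-tuning. Options: true, false.
clamp_len (default: -1): Clamp all relative distances larger than clamp_len. Setting this attribute to -1 means no clamping.
same_length (default: false): Whether to use the same attention length for each token. Options: true, false.
summary_type (default: last): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models. Options: last, first, mean, cls_index, attn.
summary_use_proj (default: true): Options: true, false.
start_n_top (default: 5): Used in the SQuAD evaluation script.
end_n_top (default: 5): Used in the SQuAD evaluation script.
pad_token_id (default: 5): The ID of the token to use as padding.
bos_token_id (default: 1): The beginning-of-sequence token ID.
eos_token_id (default: 2): The end-of-sequence token ID.
pretrained_kwargs (default: null): Additional kwargs to pass to the pretrained model.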

LLM Encoder¶

graph LR
  A["12\n7\n43\n65\n23\n4\n1"] --> B["Pretrained\n LLM"];
  B --> C["Last\n Hidden\n State"];
  C --> ...;

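The LLM encoder processes text with a pretrained LLM (e.g., llama-2-7b) and passes the LLM's last hidden state to the combiner. As with the LLM model type, adapter-based fine-tuning and quantization can be configured, and any combiner or decoder parameters are bundled with the adapter weights.

Example config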

encoder:
  type: llm
  base_model: meta-llama/Llama-2-7b-hf
  adapter:
    type: lora
  quantization:
    bits: 4

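Parameters

Base Model¶

The base_model parameter specifies the pretrained large language model that serves as the foundation of your custom LLM.

For more information about the base_model parameter, see here.

Adapter¶

LoRA¶

LoRA is a simple yet effective method for parameter-efficient fine-tuning of pretrained language models. It works by adding a small number of trainable parameters to the model, which are used to adapt the pretrained parameters to the downstream task. This allows the model to be fine-tuned with fewer training examples, and can even be used to fine-tune models on tasks that have no training data.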

adapter:
    type: lora
    r: 8
    dropout: 0.05
    target_modules: null
    use_rslora: false
    use_dora: false
    alpha: 16
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    bias_type: none

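r (default: 8): LoRA attention dimension.
dropout (default: 0.05): The dropout probability for LoRA layers.
target_modules (default: null): List of module names, or a regular expression over module names, to replace with LoRA. For example, ['q', 'v'] or '.decoder.(SelfAttention|EncDecAttention).*(q|v)$'. Defaults to targeting the query and value matrices of all self-attention and encoder-decoder attention layers.
use_rslora (default: false): When set to True, uses Rank-Stabilized LoRA, which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it has been shown to work better. Otherwise, the original default value of lora_alpha/r is used. Paper: https://arxiv.org/abs/2312.03732. Options: true, false.
use_dora (default: false): Enables Weight-Decomposed Low-Rank Adaptation (DoRA). This technique decomposes the weight update into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Currently, DoRA only supports non-quantized linear layers. DoRA introduces more overhead than pure LoRA, so merging the weights for inference is recommended. For more information, see https://arxiv.org/abs/2402.09353. Options: true, false.
alpha (default: null): The alpha parameter for LoRA scaling. Defaults to 2 * r.
pretrained_adapter_weights (default: null): Path to pretrained weights.
postprocessor:
postprocessor.merge_adapter_into_base_model (default: false): Indicates whether the fine-tuned LoRA weights are merged into the base LLM so that the complete fine-tuned model can be used and/or persisted, and later reused as a single model on load (instead of having to load the base model and the fine-tuned weights separately). Options: true, false.
postprocessor.progressbar (default: false): Indicates whether to show a progress bar for the unload and merge process. Options: true, false.
bias_type (default: none): Bias type for LoRA. Options: none, all, lora_only.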
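
The interaction between r, alpha and use_rslora described above can be checked with a few lines of Python. This is a minimal sketch, not the library's code: lora_scaling is a hypothetical helper name, and it only reproduces the two formulas quoted in the parameter list (lora_alpha/r and lora_alpha/math.sqrt(r)) together with the documented 2 * r fallback for alpha.

```python
import math


def lora_scaling(alpha, r, use_rslora=False):
    """Return the adapter scaling factor for a given alpha and rank r.

    Hypothetical helper for illustration only; it mirrors the formulas
    quoted in the parameter descriptions above.
    """
    if alpha is None:
        # Documented fallback: alpha defaults to 2 * r.
        alpha = 2 * r
    if use_rslora:
        # Rank-Stabilized LoRA: alpha / sqrt(r)
        return alpha / math.sqrt(r)
    # Original LoRA: alpha / r
    return alpha / r


# With the defaults shown in the config above (r=8, alpha=16):
print(lora_scaling(16, 8))                   # plain LoRA scaling
print(lora_scaling(16, 8, use_rslora=True))  # rank-stabilized scaling
print(lora_scaling(None, 8))                 # alpha falls back to 2 * r
```

With plain LoRA the factor shrinks in proportion to 1/r as the rank grows, while the rank-stabilized variant shrinks only with 1/sqrt(r), which is the difference the use_rslora flag toggles.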

AdaLoRA¶

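AdaLoRA is an extension of LoRA that allows the model to adapt the pretrained parameters to the downstream task in a task-specific manner. It does this by adding a small number of trainable parameters to the model, which are used to adapt the pretrained parameters to the downstream task. This allows the model to be fine-tuned with fewer training examples, and can even be used to fine-tune models on tasks that have no training data.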

adapter:
    type: adalora
    r: 8
    dropout: 0.05
    target_modules: null
    use_rslora: false
    use_dora: false
    alpha: 16
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false
    bias_type: none
    target_r: 8
    init_r: 12
    tinit: 0
    tfinal: 0
    delta_t: 1
    beta1: 0.85
    beta2: 0.85
    orth_reg_weight: 0.5
    total_step: null
    rank_pattern: null

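r (default: 8): LoRA attention dimension.
dropout (default: 0.05): The dropout probability for LoRA layers.
target_modules (default: null): List of module names, or a regular expression over module names, to replace with LoRA. For example, ['q', 'v'] or '.decoder.(SelfAttention|EncDecAttention).*(q|v)$'. Defaults to targeting the query and value matrices of all self-attention and encoder-decoder attention layers.
use_rslora (default: false): When set to True, uses Rank-Stabilized LoRA, which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it has been shown to work better. Otherwise, the original default value of lora_alpha/r is used. Paper: https://arxiv.org/abs/2312.03732. Options: true, false.
use_dora (default: false): Enables Weight-Decomposed Low-Rank Adaptation (DoRA). This technique decomposes the weight update into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Currently, DoRA only supports non-quantized linear layers. DoRA introduces more overhead than pure LoRA, so merging the weights for inference is recommended. For more information, see https://arxiv.org/abs/2402.09353. Options: true, false.
alpha (default: null): The alpha parameter for LoRA scaling. Defaults to 2 * r.
pretrained_adapter_weights (default: null): Path to pretrained weights.
postprocessor:
postprocessor.merge_adapter_into_base_model (default: false): Indicates whether the fine-tuned LoRA weights are merged into the base LLM so that the complete fine-tuned model can be used and/or persisted, and later reused as a single model on load (instead of having to load the base model and the fine-tuned weights separately). Options: true, false.
postprocessor.progressbar (default: false): Indicates whether to show a progress bar for the unload and merge process. Options: true, false.
bias_type (default: none): Bias type for LoRA. Options: none, all, lora_only.
target_r (default: 8): Target LoRA matrix dimension. The target average rank of the incremental matrices.
init_r (default: 12): Initial LoRA matrix dimension. The initial rank of each incremental matrix.
tinit (default: 0): The number of initial fine-tuning warmup steps.
tfinal (default: 0): The number of final fine-tuning steps.
delta_t (default: 1): The time interval between two budget allocations. The step interval of rank allocation.
beta1 (default: 0.85): The EMA hyperparameter used for sensitivity smoothing.
beta2 (default: 0.85): The EMA hyperparameter used for uncertainty quantification.
orth_reg_weight (default: 0.5): The coefficient of the orthogonality regularization.
total_step (default: null): The total number of training steps, which should be specified before training.
rank_pattern (default: null): The rank allocated to each weight matrix by the RankAllocator.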

IA3¶

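Infused Adapter by Inhibiting and Amplifying Inner Activations, or IA3, is a method that adds three learned vectors l_k, l_v and l_ff to rescale, respectively, the keys and values of the self-attention and encoder-decoder attention layers and the intermediate activation of the position-wise feed-forward network. These learned vectors are the only trainable parameters during fine-tuning, so the original weights remain frozen. Working with learned vectors (rather than learning low-rank updates to a weight matrix, as LoRA does) keeps the number of trainable parameters much smaller.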
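
The rescaling step described above amounts to an element-wise multiplication of each activation vector by a learned vector. The following is a minimal pure-Python sketch under that reading. The helper names ia3_scale, ia3_param_count and lora_param_count are hypothetical, the numbers are made up for illustration, and real implementations operate on framework tensors inside the attention and feed-forward modules.

```python
def ia3_scale(activations, learned_vector):
    """Rescale every row of `activations` element-wise by `learned_vector`."""
    return [[a * l for a, l in zip(row, learned_vector)] for row in activations]


def ia3_param_count(d_out):
    """One learned scaling vector per adapted module: d_out parameters."""
    return d_out


def lora_param_count(d_in, d_out, r):
    """LoRA adds two low-rank matrices, (r x d_in) and (d_out x r)."""
    return r * (d_in + d_out)


# Two key vectors of width 3 and a learned vector l_k (illustrative values).
keys = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
l_k = [1.0, 0.5, 2.0]
print(ia3_scale(keys, l_k))

# Trainable parameters for one 768 x 768 projection (illustrative sizes).
print(ia3_param_count(768))
print(lora_param_count(768, 768, r=8))
```

For a single 768-wide projection the sketch gives 768 trainable values for the scaling vector versus 12288 for a rank-8 LoRA update, which illustrates the reduction the paragraph refers to.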

adapter:
    type: ia3
    target_modules: null
    feedforward_modules: null
    fan_in_fan_out: false
    modules_to_save: null
    init_ia3_weights: true
    pretrained_adapter_weights: null
    postprocessor:
        merge_adapter_into_base_model: false
        progressbar: false

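target_modules (default: null): The names of the modules to apply (IA)^3 to.
feedforward_modules (default: null): The names of the modules to be treated as feed-forward modules, as in the original paper. For these modules, the (IA)^3 vectors are multiplied with the input instead of the output. feedforward_modules must be a name, or a subset of the names, present in target_modules.
fan_in_fan_out (default: false): Set this to True if the layer to replace stores its weight as (fan_in, fan_out). For example, gpt-2 uses Conv1D, which stores weights as (fan_in, fan_out), so this should be set to True. Options: true, false.
modules_to_save (default: null): List of modules, apart from the (IA)^3 layers, to be set as trainable and saved in the final checkpoint.
init_ia3_weights (default: true): Whether to initialize the vectors in the (IA)^3 layers. Defaults to True. Options: true, false.
pretrained_adapter_weights (default: null): Path to pretrained weights.
postprocessor:
postprocessor.merge_adapter_into_base_model (default: false): Indicates whether the fine-tuned adapter weights are merged into the base LLM so that the complete fine-tuned model can be used and/or persisted, and later reused as a single model on load (instead of having to load the base model and the fine-tuned weights separately). Options: true, false.
postprocessor.progressbar (default: false): Indicates whether to show a progress bar for the unload and merge process. Options: true, false.

For more information about adapter configuration, see here.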

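Quantization¶

Note

Quantized fine-tuning currently requires using adapter: lora. In-context learning does not have this restriction.

Note

Quantization is currently only supported with backend: local.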

quantization:
    bits: 4
    llm_int8_threshold: 6.0
    llm_int8_has_fp16_weight: false
    bnb_4bit_compute_dtype: float16
    bnb_4bit_use_double_quant: true
    bnb_4bit_quant_type: nf4

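bits (default: 4): The quantization level to apply to weights on load. Options: 4, 8.
llm_int8_threshold (default: 6.0): This corresponds to the outlier threshold for outlier detection as described in the LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale paper: https://arxiv.org/abs/2208.07339. Any hidden state value above this threshold is considered an outlier, and the operation on those values is done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but for large models there are some exceptional systematic outliers that are distributed very differently. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude around 5, but beyond that there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
llm_int8_has_fp16_weight (default: false): This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning because the weights do not have to be converted back and forth for the backward pass. Options: true, false.
bnb_4bit_compute_dtype (default: float16): This sets the computational type, which might be different from the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups. Options: float32, float16, bfloat16.
bnb_4bit_use_double_quant (default: true): This flag is used for nested quantization, where the quantization constants from the first quantization are quantized again. Options: true, false.
bnb_4bit_quant_type (default: nf4): This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options: fp4, nf4.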
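
To make the role of llm_int8_threshold more concrete, the sketch below flags which feature columns of a small hidden-state matrix contain at least one value whose magnitude exceeds the threshold. It is a simplified illustration only: outlier_columns is a hypothetical helper, the values are made up, and the column-wise grouping is an assumption about how the mixed-precision split is organized, not the actual quantization kernel.

```python
def outlier_columns(hidden_states, threshold=6.0):
    """Return indices of columns holding any value with |value| > threshold.

    Hypothetical helper: flagged columns stand in for the part of the
    computation that would stay in fp16, the rest for the int8 part.
    """
    n_cols = len(hidden_states[0])
    return [
        j
        for j in range(n_cols)
        if any(abs(row[j]) > threshold for row in hidden_states)
    ]


# Two token positions, four hidden features (illustrative values).
hidden = [
    [0.3, -1.2, 7.5, 2.0],
    [1.1, 0.4, -0.2, -3.4],
]

print(outlier_columns(hidden))                 # default threshold of 6.0
print(outlier_columns(hidden, threshold=3.0))  # a lower threshold flags more columns
```

Lowering the threshold moves more of the computation to the higher-precision path, which is the trade-off behind the advice to use a lower value for less stable models.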

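For more information about the quantization parameters, see here.

Model Parameters¶

For more information about the model initialization parameters, see here.

Output Features¶

Text output features are a special case of sequence features, so all options of sequence features are available for text features as well.

Text output features can be used for either tagging (classifying each token of an input sequence) or text generation (generating text by repeatedly sampling from the model). There are two decoders available for these tasks, named tagger and generator respectively.

Example text output feature using default parameters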

name: text_column_name
type: text
reduce_input: null
dependencies: []
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
    confidence_penalty: 0
    robust_lambda: 0
    class_weights: 1
    class_similarities_temperature: 0
decoder:
    type: generator

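Parameters

reduce_input (default sum): Defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (the second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the sequence dimension), last (returns the last vector of the sequence dimension).
dependencies (default []): The other output features this one depends on. For a detailed explanation, refer to Output Feature Dependencies.
reduce_dependencies (default sum): Defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (the second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the sequence dimension), last (returns the last vector of the sequence dimension).
loss (default {type: softmax_cross_entropy, class_similarities_temperature: 0, class_weights: 1, confidence_penalty: 0, robust_lambda: 0}): A dictionary containing a loss type. The only available loss type for text features is softmax_cross_entropy. See Loss for details.
decoder (default: {"type": "generator"}): Decoder for the desired task. Options: generator, tagger. See Decoders for details.

Decoder type and decoder parameters can also be defined once and applied to all text output features using the Type-Global Decoder section. Loss and loss-related parameters can also be defined once in the same way.

Decoders¶

Generator¶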

graph LR
  A["Combiner Output"] --> B["Fully\n Connected\n Layers"];
  B --> C1["RNN"] --> C2["RNN"] --> C3["RNN"];
  GO(["GO"]) -.-o C1;
  C1 -.-o O1("Output");
  O1 -.-o C2;
  C2 -.-o O2("Output");
  O2 -.-o C3;
  C3 -.-o END(["END"]);
  subgraph DEC["DECODER.."]
  B
  C1
  C2
  C3
  end

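In the case of generator, the decoder is a (potentially empty) stack of fully connected layers, followed by an RNN that generates outputs by feeding on its own previous predictions. It produces a tensor of size b x s' x c, where b is the batch size, s' is the length of the generated sequence and c is the number of classes, followed by a softmax_cross_entropy. During training, teacher forcing is adopted, meaning the list of targets is provided as both inputs and outputs (shifted by 1). At evaluation time, greedy decoding (generating one token at a time and feeding it as input for the next step) is performed by beam search, with a beam size of 1 by default. In general, a generator expects an input tensor of shape b x h, where h is a hidden dimension. The h vectors are (after an optional stack of fully connected layers) fed into the RNN generator. One exception is when the generator uses attention: in that case the expected input tensor has size b x s x h, which is the output of a sequence, text or time series input feature without reduced outputs, or the output of a sequence-based combiner. If a b x h input is provided to a generator decoder using an RNN with attention, an error is raised during model building.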
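
The difference between teacher forcing and greedy decoding described above can be sketched without any model at all. In the sketch below, the <GO> and <END> symbols, the helper names and the toy transition table standing in for the RNN are all assumptions made for illustration; they are not the decoder's actual implementation.

```python
GO, END = "<GO>", "<END>"  # hypothetical start and end symbols


def teacher_forcing_pair(target_tokens):
    """Build decoder inputs and expected outputs, offset by one position."""
    decoder_input = [GO] + target_tokens
    decoder_target = target_tokens + [END]
    return decoder_input, decoder_target


def greedy_decode(step_fn, max_len=10):
    """Generate one token at a time, feeding each prediction back in.

    `step_fn(prev_token)` returns a dict mapping token -> probability.
    Picking the arg-max at each step is beam search with beam size 1.
    """
    generated, prev = [], GO
    for _ in range(max_len):
        probs = step_fn(prev)
        prev = max(probs, key=probs.get)
        if prev == END:
            break
        generated.append(prev)
    return generated


# Toy stand-in for the RNN: next-token probabilities given the previous token.
TABLE = {
    GO: {"a": 0.7, "b": 0.3},
    "a": {"b": 0.6, END: 0.4},
    "b": {END: 0.9, "a": 0.1},
}

print(teacher_forcing_pair(["a", "b"]))
print(greedy_decode(TABLE.__getitem__))
```

During training the decoder always sees the ground-truth previous token (the first list), whereas at evaluation time it only sees what it generated itself, which is why the two code paths differ.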

decoder:
    type: generator
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    cell_type: gru
    num_layers: 1
    fc_activation: relu
    reduce_input: sum
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null

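Parameters

num_fc_layers (default: 0): Number of fully connected layers if fc_layers is not specified. Increasing the number of layers adds capacity to the model, enabling it to learn more complex feature interactions.
fc_output_size (default: 256): Output size of the fully connected stack.
fc_norm (default: null): Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null.
fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
cell_type (default: gru): Type of recurrent cell to use. Options: rnn, lstm, gru.
num_layers (default: 1): The number of stacked recurrent layers.
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
reduce_input (default: sum): How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (the second if you count the batch dimension). Options: sum, mean, avg, max, concat, last.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter is used instead.
fc_use_bias (default: true): Whether the layers in the fc_stack use a bias vector. Options: true, false.
fc_weights_initializer (default: xavier_uniform): The weights initializer to use for the layers in the fc_stack.
fc_bias_initializer (default: zeros): The bias initializer to use for the layers in the fc_stack.
fc_norm_params (default: null): Default parameters passed to the norm module.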

Tagger¶

graph LR
  A["emb[0]\n....\nemb[n]"] --> B["Fully\n Connected\n Layers"];
  B --> C["Projection\n....\nProjection"];
  C --> D["Softmax\n....\nSoftmax"];
  subgraph DEC["DECODER.."]
  B
  C
  D
  end
  subgraph COM["COMBINER OUT.."]
  A
  end

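In the case of tagger, the decoder is a (potentially empty) stack of fully connected layers, followed by a projection into a tensor of size b x s x c, where b is the batch size, s is the length of the sequence and c is the number of classes, followed by a softmax_cross_entropy. This decoder requires its input to be shaped as b x s x h, where h is a hidden dimension. Such an input is the output of a sequence, text or time series input feature without reduced outputs, or the output of a sequence-based combiner. If a b x h input is provided instead, an error is raised during model building.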

decoder:
    type: tagger
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    fc_activation: relu
    attention_embedding_size: 256
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null
    use_attention: false
    use_bias: true
    attention_num_heads: 8

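Parameters

num_fc_layers (default: 0): Number of fully connected layers if fc_layers is not specified. Increasing the number of layers adds capacity to the model, enabling it to learn more complex feature interactions.
fc_output_size (default: 256): Output size of the fully connected stack.
fc_norm (default: null): Default normalization applied at the beginning of fully connected layers. Options: batch, layer, ghost, null.
fc_dropout (default: 0.0): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element being zeroed out (0.0 means no dropout).
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
attention_embedding_size (default: 256): The embedding size of the multi-head self-attention layer.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter is used instead.
fc_use_bias (default: true): Whether the layers in the fc_stack use a bias vector. Options: true, false.
fc_weights_initializer (default: xavier_uniform): The weights initializer to use for the layers in the fc_stack.
fc_bias_initializer (default: zeros): The bias initializer to use for the layers in the fc_stack.
fc_norm_params (default: null): Default parameters passed to the norm module.
use_attention (default: false): Whether to apply a multi-head self-attention layer before prediction. Options: true, false.
use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
attention_num_heads (default: 8): The number of attention heads in the multi-head self-attention layer.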

Loss¶

Sequence Softmax Cross Entropy¶

loss:
    type: sequence_softmax_cross_entropy
    class_weights: null
    weight: 1.0
    robust_lambda: 0
    confidence_penalty: 0
    class_similarities: null
    class_similarities_temperature: 0
    unique: false

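Parameters

class_weights (default: null): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied with the loss of the data points that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like {class_a: 0.5, class_b: 0.7, ...}.
weight (default: 1.0): Weight of the loss.
robust_lambda (default: 0): Replaces the loss with (1 - robust_lambda) * loss + robust_lambda / c, where c is the number of classes. Useful in case of noisy labels.
confidence_penalty (default: 0): Penalizes overconfident predictions (low entropy) by adding an a * (max_entropy - entropy) / max_entropy term to the loss, where a is the value of this parameter. Useful in case of noisy labels.
class_similarities (default: null): If not null, a c x c matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if class_similarities_temperature is greater than 0. The ordering of the matrix follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too).
class_similarities_temperature (default: 0): The temperature parameter of the softmax that is performed on each row of class_similarities. The output of that softmax is used to determine the supervision vector to provide, instead of the one-hot vector that would otherwise be provided for each data point. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.
unique (default: false): If true, the loss is only computed for unique elements in the sequence. Options: true, false.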
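
The robust_lambda and confidence_penalty formulas quoted above are easy to evaluate by hand. The sketch below applies them to a single prediction over four classes. The function names are hypothetical and the probabilities are made up, so this illustrates the formulas rather than the library's loss implementation.

```python
import math


def robust_loss(loss, robust_lambda, num_classes):
    """(1 - robust_lambda) * loss + robust_lambda / c, as quoted above."""
    return (1 - robust_lambda) * loss + robust_lambda / num_classes


def confidence_penalty_term(probs, a):
    """a * (max_entropy - entropy) / max_entropy for one predicted distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    return a * (max_entropy - entropy) / max_entropy


probs = [0.7, 0.1, 0.1, 0.1]   # predicted distribution over c = 4 classes
base_loss = -math.log(probs[0])  # cross entropy when class 0 is the target

print(round(robust_loss(base_loss, robust_lambda=0.1, num_classes=4), 4))
print(round(confidence_penalty_term(probs, a=0.5), 4))
```

The penalty term is zero for a uniform prediction and reaches a for a one-hot prediction, so a directly bounds how much extra loss an overconfident prediction can incur.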

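Metrics¶

The metrics available for text features are the same as for sequence features:

sequence_accuracy: The rate at which the model predicted the correct sequence.
token_accuracy: The number of tokens correctly predicted divided by the total number of tokens in all sequences.
last_accuracy: Accuracy considering only the last element of the sequence. Useful to ensure special end-of-sequence tokens are generated or tagged.
edit_distance: Levenshtein distance: the minimum number of single-token edits (insertions, deletions or substitutions) required to change the predicted sequence into the ground truth.
perplexity: Perplexity is the inverse of the predicted probability for the ground truth sequence, normalized by the number of tokens. The lower the perplexity, the higher the probability of predicting the true sequence.
loss: The value of the loss function.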
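
The metric definitions above can be written out directly. The following is a minimal sketch in plain Python, assuming predictions and targets are lists of equal-length token ID sequences (as after padding). The function names simply mirror the metric names and are not the library's implementations.

```python
import math


def sequence_accuracy(preds, targets):
    """Fraction of sequences predicted exactly right."""
    exact = sum(1 for p, t in zip(preds, targets) if p == t)
    return exact / len(targets)


def token_accuracy(preds, targets):
    """Correctly predicted tokens divided by the total number of tokens."""
    correct = sum(
        1 for p, t in zip(preds, targets) for a, b in zip(p, t) if a == b
    )
    return correct / sum(len(t) for t in targets)


def edit_distance(pred, target):
    """Levenshtein distance over tokens, computed row by row."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def perplexity(true_token_probs):
    """Inverse probability of the true sequence, normalized by its length."""
    mean_log_prob = sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(-mean_log_prob)


preds = [[1, 2, 3], [4, 5, 6]]
targets = [[1, 2, 3], [4, 0, 6]]
print(sequence_accuracy(preds, targets))
print(token_accuracy(preds, targets))
print(edit_distance(["a", "b", "c"], ["a", "c"]))
print(perplexity([0.5, 0.5]))
```

Here perplexity takes the probabilities the model assigned to each ground truth token; the exponent of the negative mean log-probability equals the inverse sequence probability raised to the power 1/N.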

If validation_field names a sequence feature, any of the above can be set as the validation_metric in the training section of the configuration.