
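Dataset Zoo

The Ludwig Dataset Zoo provides datasets that can be plugged directly into a Ludwig model.

The simplest way to use a dataset is to reference it as a URI when specifying the training set: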

ludwig train --dataset ludwig://reuters ...

Any Ludwig dataset can be specified as a URI of the form ludwig://<dataset>.

Datasets can also be imported programmatically and loaded into a Pandas DataFrame with the .load() method:
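
As a rough illustration of the URI form, the sketch below extracts the dataset name from a ludwig://<dataset> string using only the standard library. The helper parse_ludwig_uri is hypothetical and is not part of Ludwig's API; Ludwig resolves these URIs internally, and this only shows what the form encodes.

```python
LUDWIG_PREFIX = "ludwig://"


def parse_ludwig_uri(uri: str) -> str:
    """Return the dataset name from a ludwig://<dataset> URI.

    Hypothetical helper for illustration only; not a Ludwig function.
    """
    if not uri.startswith(LUDWIG_PREFIX):
        raise ValueError(f"Not a Ludwig dataset URI: {uri!r}")
    name = uri[len(LUDWIG_PREFIX):]
    if not name:
        raise ValueError("The URI does not name a dataset")
    return name


# The URI from the command above names the 'reuters' dataset.
dataset_name = parse_ludwig_uri("ludwig://reuters")
```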

from ludwig.datasets import reuters

# Loads into single dataframe with a 'split' column:
dataset_df = reuters.load()

# Loads into split dataframes:
train_df, test_df, _ = reuters.load(split=True)

The ludwig.datasets API also provides functions to list, describe, and get datasets. For example:

import ludwig.datasets

# Gets a list of all available dataset names.
dataset_names = ludwig.datasets.list_datasets()

# Prints the description of the titanic dataset.
print(ludwig.datasets.describe_dataset("titanic"))

# Gets the titanic dataset object by name.
titanic = ludwig.datasets.get_dataset("titanic")

# Loads into single dataframe with a 'split' column:
dataset_df = titanic.load()

# Loads into split dataframes:
train_df, test_df, _ = titanic.load(split=True)

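Kaggle Datasets

Some datasets are hosted on Kaggle and require a Kaggle account. To use them, you need to set up Kaggle credentials in your environment. If the dataset is part of a Kaggle competition, you also need to accept the competition terms on the competition page.

To check programmatically, datasets have an .is_kaggle_dataset property.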
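
Before loading a dataset whose .is_kaggle_dataset property is true, it can help to confirm that credentials are present. The sketch below is a best-effort check under two assumptions about the Kaggle API client: that it reads the KAGGLE_USERNAME and KAGGLE_KEY environment variables, and that it otherwise looks for a kaggle.json file in the .kaggle directory under the home directory. The helper name kaggle_credentials_available is hypothetical and is not part of Ludwig or Kaggle.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def kaggle_credentials_available(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> bool:
    """Best-effort check for Kaggle credentials (hypothetical helper).

    Checks the KAGGLE_USERNAME / KAGGLE_KEY environment variables first,
    then falls back to <home>/.kaggle/kaggle.json.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    return (home / ".kaggle" / "kaggle.json").is_file()
```

The env and home parameters exist only so the check can be exercised without touching the real environment; called with no arguments, it inspects the current process environment and home directory.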

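Downloading, Processing, and Exporting

Datasets are first downloaded into LUDWIG_CACHE, which can be set with an environment variable and defaults to $HOME/.ludwig_cache.

Datasets are automatically loaded, processed, and re-saved as parquet files in the cache.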
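
The cache location rule described above can be sketched as follows. This is a minimal illustration, not Ludwig's own code: the function resolve_ludwig_cache is hypothetical, and it assumes an empty LUDWIG_CACHE value should be treated the same as an unset one.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_ludwig_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the dataset cache directory (hypothetical helper).

    Uses the LUDWIG_CACHE environment variable when it is set and non-empty,
    otherwise falls back to .ludwig_cache under the home directory.
    """
    env = os.environ if env is None else env
    override = env.get("LUDWIG_CACHE")
    if override:
        return Path(override)
    return Path.home() / ".ludwig_cache"


# Where datasets would be cached for the current process.
cache_dir = resolve_ludwig_cache()
```

Setting LUDWIG_CACHE before starting training is useful when the home directory is small or not shared, for example on a cluster node with a separate scratch disk.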

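To export a processed dataset, including any files it depends on, use the .export(output_directory) method. This is recommended if the dataset contains media files such as images or audio. File paths are relative to the working directory of the training process.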

from ludwig.datasets import twitter_bots

# Exports twitter bots dataset and image files to the current working directory.
twitter_bots.export(".")

End-to-end Example

Here is an end-to-end example of training a model on the MNIST dataset:

from ludwig.api import LudwigModel
from ludwig.datasets import mnist

# Initializes a Ludwig model
config = {
    "input_features": [{"name": "image_path", "type": "image"}],
    "output_features": [{"name": "label", "type": "category"}],
}
model = LudwigModel(config)

# Loads and splits MNIST dataset
training_set, test_set, _ = mnist.load(split=True)

# Exports the mnist image files to the current working directory.
mnist.export(".")

# Runs model training
train_stats, _, _ = model.train(training_set=training_set, test_set=test_set, model_name="mnist_model")

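Dataset Splits

All datasets in the Dataset Zoo come with a default train/validation/test split. When loading with split=False, the default split is returned (and is guaranteed to be the same every time). With split=True, Ludwig randomly re-splits the dataset.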

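Note

Some benchmark or competition datasets are released with the test set labels held out. In other words, the train and validation splits have labels, but the test set does not. Most Kaggle competition datasets have this kind of unlabeled test set.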

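Splits

  • Training set (train): Data to train on. Required, must have labels.
  • Validation set (validation): Subset of the dataset used for evaluation during training. Optional, must have labels.
  • Test set (test): Held out during model development, used for later testing. Optional, may not have labels.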
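
To make the three roles concrete, here is a minimal sketch that groups rows by a split column and checks whether each group carries labels. It uses plain Python dictionaries rather than a DataFrame so that it has no dependencies. The integer codes (0 = train, 1 = validation, 2 = test) are an assumption about how the split column is encoded, and the helper names and sample rows are made up for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Assumed encoding of the 'split' column: 0 = train, 1 = validation, 2 = test.
SPLIT_NAMES = {0: "train", 1: "validation", 2: "test"}


def partition_by_split(rows: List[Row]) -> Dict[str, List[Row]]:
    """Group rows by split name using each row's 'split' value."""
    parts: Dict[str, List[Row]] = {name: [] for name in SPLIT_NAMES.values()}
    for row in rows:
        parts[SPLIT_NAMES[int(row["split"])]].append(row)
    return parts


def has_labels(rows: List[Row], label: str) -> bool:
    """Return True only if rows is non-empty and every row has a label value."""
    return bool(rows) and all(row.get(label) is not None for row in rows)


# Made-up sample: the test row has its label held out.
rows = [
    {"text": "good film", "label": "positive", "split": 0},
    {"text": "dull plot", "label": "negative", "split": 1},
    {"text": "unseen review", "label": None, "split": 2},
]
parts = partition_by_split(rows)
can_evaluate_on_test = has_labels(parts["test"], "label")
```

A check like has_labels is worth running before computing test metrics, since an unlabeled test set can only be used to produce predictions, not scores.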

Zoo Datasets

Here is the list of currently available datasets:

Dataset Host Description
adult_census_income archive.ics.uci.edu https://archive.ics.uci.edu/ml/datasets/adult Whether a person makes over $50K a year.
allstate_claims_severity Kaggle https://www.kaggle.com/c/allstate-claims-severity
amazon_employee_access_challenge Kaggle https://www.kaggle.com/c/amazon-employee-access-challenge
agnews Github https://search.r-project.org/CRAN/refmans/textdata/html/dataset_ag_news.html
amazon_review_polarity S3 https://paperswithcode.com/sota/sentiment-analysis-on-amazon-review-polarity
amazon_reviews S3 https://s3.amazonaws.com/amazon-reviews-pds/readme.html
ames_housing Kaggle https://www.kaggle.com/c/ames-housing-data
bbc_news Kaggle https://www.kaggle.com/c/learn-ai-bbc
bnp_claims_management Kaggle https://www.kaggle.com/c/bnp-paribas-cardif-claims-management
connect4 Kaggle https://www.kaggle.com/c/connectx/discussion/124397
creditcard_fraud Kaggle https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
dbpedia S3 https://paperswithcode.com/dataset/dbpedia
electricity S3 Predict electricity demand from the day of the week and the outside temperature.
ethos_binary Github https://github.com/huggingface/datasets/blob/master/datasets/ethos/README.md
fever S3 https://arxiv.org/abs/1803.05355
flickr8k Github https://www.kaggle.com/adityajn105/flickr8k
forest_cover archive.ics.uci.edu https://archive.ics.uci.edu/ml/datasets/covertype
goemotions Github https://arxiv.org/abs/2005.00547
higgs archive.ics.uci.edu https://archive.ics.uci.edu/ml/datasets/HIGGS
ieee_fraud Kaggle https://www.kaggle.com/c/ieee-fraud-detection
imbalanced_insurance Kaggle https://www.kaggle.com/datasets/arashnic/imbalanced-data-practice
imdb Kaggle https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
insurance_lite Kaggle https://www.kaggle.com/infernape/fast-furious-and-insured
iris archive.ics.uci.edu https://archive.ics.uci.edu/ml/datasets/iris
irony Github https://github.com/bwallace/ACL-2014-irony
kdd_appetency kdd.org https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
kdd_churn kdd.org https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
kdd_upselling kdd.org https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
mnist yann.lecun.com http://yann.lecun.com/exdb/mnist/
mushroom_edibility archive.ics.uci.edu https://archive.ics.uci.edu/ml/datasets/mushroom
naval archive.ics.uci.edu https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/24098
noshow_appointments Kaggle https://www.kaggle.com/datasets/joniarroba/noshowappointments
numerai28pt6 Kaggle https://www.kaggle.com/numerai/encrypted-stock-market-data-from-numerai
ohsumed_7400 Kaggle https://www.kaggle.com/datasets/weipengfei/ohr8r52
ohsumed_cmu boston.lti.cs.cmu.edu http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW2/
otto_group_product Kaggle https://www.kaggle.com/c/otto-group-product-classification-challenge
poker_hand archive.ics.uci.edu https://archive.ics.uci.edu/ml/datasets/Poker+Hand
porto_seguro_safe_driver Kaggle https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
protein archive.ics.uci.edu https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2932-0
reuters_cmu boston.lti.cs.cmu.edu http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW2/
reuters_r8 Kaggle Reuters R8 subset of the Reuters 21578 dataset, from Kaggle.
rossmann_store_sales Kaggle https://www.kaggle.com/c/rossmann-store-sales
santander_customer_satisfaction Kaggle https://www.kaggle.com/c/santander-customer-satisfaction
santander_customer_transaction_prediction Kaggle https://www.kaggle.com/c/santander-customer-transaction-prediction
santander_value_prediction Kaggle https://www.kaggle.com/c/santander-value-prediction-challenge
sarcos gaussianprocess.org http://www.gaussianprocess.org/gpml/data/
sst2 nlp.stanford.edu https://paperswithcode.com/dataset/sst
sst3 nlp.stanford.edu Merges the very negative and negative classes, and the very positive and positive classes.
sst5 nlp.stanford.edu https://paperswithcode.com/dataset/sst
synthetic_fraud Kaggle https://www.kaggle.com/ealaxi/paysim1
temperature Kaggle https://www.kaggle.com/selfishgene/historical-hourly-weather-data
titanic Kaggle https://www.kaggle.com/c/titanic
walmart_recruiting Kaggle https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting
wmt15 Kaggle https://www.kaggle.com/dhruvildave/en-fr-translation-dataset
yahoo_answers S3 Question classification.
yelp_review_polarity S3 https://www.yelp.com/dataset Predict the polarity or sentiment of a Yelp review.
yelp_reviews S3 https://www.yelp.com/dataset
yosemite Github https://github.com/ourownstory/neural_prophet Yosemite temperatures dataset.

Adding Datasets

To add a dataset to the Ludwig Dataset Zoo, see the Add a Dataset guide.