This is the class for data sets served on OpenML.
This object can also be constructed using the sugar function odt().
mlr3 Integration
A mlr3::Task can be obtained by calling mlr3::as_task(). The target column must either be the default target (this is the default behaviour) or one of $feature_names. If the target is specified to be one of $feature_names, the default target is added to the features of the task.
A mlr3::DataBackend can be obtained by calling mlr3::as_data_backend(). Depending on the selected file type, the returned backend is a mlr3::DataBackendDataTable (arff) or mlr3db::DataBackendDuckDB (parquet). Note that a converted backend can contain columns beyond the target and the features (id column or ignore columns).
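The two conversions described above can be sketched as follows. This is a minimal illustration, assuming that the mlr3 and mlr3oml packages are installed and that the OpenML server is reachable. The helper name load_openml_task() is hypothetical and not part of either package.

```r
# Hypothetical helper (not part of mlr3oml): fetch an OpenML data set
# and convert it to both an mlr3 Task and an mlr3 DataBackend.
load_openml_task = function(id) {
  # odt() is the sugar function for OMLData$new()
  odata = mlr3oml::odt(id = id)

  # No target is specified, so the default target of the data set is used
  task = mlr3::as_task(odata)

  # DataBackendDataTable for arff, DataBackendDuckDB for parquet
  backend = mlr3::as_data_backend(odata)

  list(data = odata, task = task, backend = backend)
}
```

Nothing is downloaded when the function is defined. Calling it with a valid OpenML data set id triggers the download (or a cache lookup) and returns the three objects in a named list.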
Name conversion
Column names that don't comply with R's naming scheme are renamed (see base::make.names()).
This means that the names can differ from those on OpenML.
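To illustrate the kind of renaming involved, the following sketch applies base::make.names() to a few made-up column names. The names are hypothetical, and the exact arguments that the package passes to make.names() are not stated here, so this shows only the default behaviour of the base function.

```r
# Hypothetical column names, as they might appear on OpenML
oml_names = c("sepal length", "2nd_measure", "class", "if")

# Invalid characters become ".", a leading digit gets an "X" prefix,
# and reserved words such as "if" get a "." appended
r_names = make.names(oml_names)

print(r_names)
```

Names that are already syntactically valid, such as "class", pass through unchanged.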
File Format
The datasets on OpenML are stored either as (sparse) ARFF or as parquet.
When creating a new OMLData object, the constructor argument parquet switches between arff and parquet. Note that not all data files are necessarily available as parquet.
The option mlr3oml.parquet can be used to set a default.
If parquet is TRUE but no parquet file is available, "arff" will be used as a fallback.
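The default can be set once per session through the option. The sketch below uses only base R. The function resolve_parquet_default() is a hypothetical stand-in that mirrors the documented rule (option value, or FALSE if unset); it is not the package's own parquet_default().

```r
# Prefer parquet for this session; keep the old value for restoring later
old = options(mlr3oml.parquet = TRUE)

# Hypothetical stand-in for the documented default rule:
# use the option if it is set, otherwise FALSE
resolve_parquet_default = function() {
  isTRUE(getOption("mlr3oml.parquet", default = FALSE))
}

use_parquet = resolve_parquet_default()
print(use_parquet)

# Restore the previous setting
options(old)
```

Passing parquet explicitly to the constructor overrides the option for that object only.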
ARFF Files
This package comes with its own reader for ARFF files, based on data.table::fread().
For sparse ARFF files, and if the RWeka package is installed, the reader automatically falls back to the implementation in RWeka::read.arff().
References
Vanschoren J, van Rijn JN, Bischl B, Torgo L (2014). “OpenML.” ACM SIGKDD Explorations Newsletter, 15(2), 49–60. doi:10.1145/2641190.2641198 .
Super class
mlr3oml::OMLObject -> OMLData
Active bindings
qualities (data.table())
Data set qualities (performance values), downloaded from the JSON API response and converted to a data.table::data.table() with columns "name" and "value".
tags (character())
Returns all tags of the object.
parquet (logical(1))
Whether to use parquet.
data (data.table())
Returns the data (without the row identifier and ignored columns).
features (data.table())
Information about data set features (including target), downloaded from the JSON API response and converted to a data.table::data.table() with columns:
  "index" (integer()): Column position.
  "name" (character()): Name of the feature.
  "data_type" (factor()): Type of the feature: "nominal" or "numeric".
  "nominal_value" (list()): Levels of the feature, or NULL for numeric features.
  "is_target" (logical()): TRUE for target column, FALSE otherwise.
  "is_ignore" (logical()): TRUE if this feature should be ignored. Ignored features are removed automatically from the data set.
  "is_row_identifier" (logical()): TRUE if the column encodes a row identifier. Row identifiers are removed automatically from the data set.
  "number_of_missing_values" (integer()): Number of missing values in the column.
target_names (character())
Name of the default target, as extracted from the OpenML data set description.
feature_names (character())
Names of the features, as extracted from the OpenML data set description.
nrow (integer())
Number of observations, as extracted from the OpenML data set qualities.
ncol (integer())
Number of features (including targets), as extracted from the table of data set features. This excludes row identifiers and ignored columns.
license (character())
Returns the license of the dataset.
parquet_path (character())
Downloads the parquet file (or loads it from cache) and returns the path of the parquet file. Note that this also normalizes the names of the parquet file.
Methods
Method new()
Creates a new instance of this R6 class.
Usage
OMLData$new(
  id,
  parquet = parquet_default(),
  test_server = test_server_default()
)
Arguments
id (integer(1))
OpenML id for the object.
parquet (logical(1))
Whether to use parquet instead of arff. If parquet is not available, it will fall back to arff. Defaults to value of option "mlr3oml.parquet", or FALSE if not set.
test_server (character(1))
Whether to use the OpenML test server or public server. Defaults to value of option "mlr3oml.test_server", or FALSE if not set.
Method print()
Prints the object.
For a more detailed printer, convert to a mlr3::Task via as_task().
Examples
# For technical reasons, examples cannot be included in this R package.
# Instead, these are some relevant resources:
#
# Large-Scale Benchmarking chapter in the mlr3book:
# https://mlr3book.mlr-org.com/chapters/chapter11/large-scale_benchmarking.html
#
# Package Article:
# https://mlr3oml.mlr-org.com/articles/tutorial.html