This is the class for data sets served on https://openml.org/d.

mlr3 Integration

An mlr3::Task is returned by the method $task(). Alternatively, you can convert this object to an mlr3::DataBackend using mlr3::as_data_backend().

ARFF Files

This package comes with its own reader for ARFF files, based on data.table::fread(). For sparse ARFF files, the reader automatically falls back to RWeka::read.arff() if the RWeka package is installed.

References

Vanschoren J, van Rijn JN, Bischl B, Torgo L (2014). “OpenML.” ACM SIGKDD Explorations Newsletter, 15(2), 49–60. doi: 10.1145/2641190.2641198.

Public fields

id

(integer(1))
OpenML data id.

cache_dir

(logical(1) | character(1))
Stores the location of the cache for objects retrieved from https://openml.org. If set to FALSE, caching is disabled. The package qs is required for caching.

Active bindings

name

(character(1))
Name of the data set, as extracted from the data set description.

desc

(list())
Data set description (meta information), downloaded and converted from the JSON API response.

qualities

(data.table())
Data set qualities (performance values), downloaded from the JSON API response and converted to a data.table::data.table() with columns "name" and "value".

features

(data.table())
Information about data set features (including target), downloaded from the JSON API response and converted to a data.table::data.table() with columns:

  • "index" (integer()): Column position.

  • "name" (character()): Name of the feature.

  • "data_type" (factor()): Type of the feature: "nominal" or "numeric".

  • "nominal_value" (list()): Levels of the feature, or NULL for numeric features.

  • "is_target" (logical()): TRUE for target column, FALSE otherwise.

  • "is_ignore" (logical()): TRUE if this feature should be ignored. Ignored features are removed automatically from the data set.

  • "is_row_identifier" (logical()): TRUE if the column encodes a row identifier. Row identifiers are removed automatically from the data set.

  • "number_of_missing_values" (integer()): Number of missing values in the column.

data

(data.table())
Data as data.table::data.table(). Columns marked as row identifiers or marked with the ignore flag are automatically removed.

target_names

(character())
Name of the default target, as extracted from the OpenML data set description.

feature_names

(character())
Names of the features, as extracted from the OpenML data set description.

nrow

(integer())
Number of observations, as extracted from the OpenML data set qualities.

ncol

(integer())
Number of features (including targets), as extracted from the table of data set features. This excludes row identifiers and ignored columns.
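The relationship between the features table, target_names, feature_names and ncol can be illustrated with a small mock. This is only a sketch in base R: the table below is hand-written (it is not downloaded from OpenML), and the filtering logic is an assumption derived from the field descriptions above, not the package's actual implementation.

```r
# Hand-written stand-in for odata$features (hypothetical rows, base data.frame
# instead of a data.table to keep the sketch dependency-free).
features = data.frame(
  index = 1:4,
  name = c("row_id", "make", "price", "symboling"),
  data_type = factor(c("numeric", "nominal", "numeric", "nominal")),
  is_target = c(FALSE, FALSE, FALSE, TRUE),
  is_ignore = c(FALSE, FALSE, FALSE, FALSE),
  is_row_identifier = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Columns that survive: neither ignored nor row identifiers.
keep = !features$is_ignore & !features$is_row_identifier

ncol_mock = sum(keep)                                       # features incl. target
target_mock = features$name[keep & features$is_target]     # default target
feature_mock = features$name[keep & !features$is_target]   # remaining features
```

In this mock, the row identifier is dropped, so three columns remain: one target and two features.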

tags

(character())
Returns all tags of the data set.

Methods

Public methods


Method new()

Creates a new object of class OMLData.

Usage

OMLData$new(id, cache = getOption("mlr3oml.cache", FALSE))

Arguments

id

(integer(1))
OpenML data id.

cache

(logical(1) | character(1))
See field cache_dir for an explanation of possible values. Defaults to the value of option "mlr3oml.cache", or FALSE if not set.
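How the default for cache is resolved can be sketched with base R alone. The snippet only uses options() and getOption(), mirroring the default shown in the usage above; the cache path is a hypothetical location chosen for illustration, and what a value of TRUE maps to is not spelled out here, so it is not shown.

```r
# Unset the option and remember the previous state so it can be restored.
old = options(mlr3oml.cache = NULL)

# Option not set: the default argument evaluates to FALSE (caching disabled).
cache_unset = getOption("mlr3oml.cache", FALSE)

# Option set to a directory (hypothetical path under the session temp dir).
cache_path = file.path(tempdir(), "oml_cache")
options(mlr3oml.cache = cache_path)
cache_custom = getOption("mlr3oml.cache", FALSE)

# Restore the previous option state.
options(old)
```

Setting the option once, e.g. in a project's .Rprofile, avoids passing cache to every constructor call.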


Method print()

Prints the object. For a more detailed printer, convert to an mlr3::Task via $task().

Usage

OMLData$print()


Method quality()

Returns the value of a single OpenML data set quality.

Usage

OMLData$quality(name)

Arguments

name

(character(1))
Name of the quality to extract.
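Conceptually, this is a lookup by name in the two-column qualities table. The following is a rough sketch on a hand-written table, not the package's implementation: quality_mock() is a hypothetical helper, and the values are made up to match the dimensions printed in the Examples section.

```r
# Hand-written stand-in for odata$qualities (columns "name" and "value").
qualities = data.frame(
  name = c("NumberOfInstances", "NumberOfFeatures"),
  value = c(205, 26),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the value of a single quality,
# or NA if no quality with that name exists in the table.
quality_mock = function(qualities, name) {
  stopifnot(is.character(name), length(name) == 1L)
  qualities$value[match(name, qualities$name)]
}

n_obs = quality_mock(qualities, "NumberOfInstances")
missing_quality = quality_mock(qualities, "NoSuchQuality")
```

How the real method treats an unknown name is not documented here; returning NA is purely a choice made for this sketch.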


Method task()

Creates an mlr3::Task using the provided target column, defaulting to the default target attribute of the data set description. Note that if the target column is incorrectly encoded, e.g. as numeric 0/1 for classification, this will result in a task of the wrong type.

Usage

OMLData$task(target_names = NULL)

Arguments

target_names

(character())
Name(s) of the target columns, or NULL for the default columns.
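The encoding caveat can be illustrated on a small mock table. This is a sketch under stated assumptions: the data frame below is hand-written and merely stands in for odata$data, and recoding the column before building a task manually is one possible workaround, not a procedure prescribed by this package.

```r
# Hand-written stand-in for odata$data with a 0/1 target stored as numeric.
dat = data.frame(
  x = c(1.2, 3.4, 5.6, 7.8),
  y = c(0, 1, 1, 0)
)

# A numeric target would be interpreted as a regression target.
was_numeric = is.numeric(dat$y)

# Recode the target as a factor so it is treated as a class label.
dat$y = factor(dat$y, levels = c(0, 1))

# With mlr3 installed, a classification task could then be built by hand,
# e.g. mlr3::as_task_classif(dat, target = "y") (not run here).
```

Checking the column types of $data before calling $task() is a cheap way to catch this kind of mismatch early.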


Method clone()

The objects of this class are cloneable with this method.

Usage

OMLData$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples

# \donttest{
odata = OMLData$new(id = 9)
print(odata)
#> <OMLData:9:autos> (205x26)
print(odata$target_names)
#> [1] "symboling"
print(odata$feature_names)
#>  [1] "normalized.losses" "make"              "fuel.type"
#>  [4] "aspiration"        "num.of.doors"      "body.style"
#>  [7] "drive.wheels"      "engine.location"   "wheel.base"
#> [10] "length"            "width"             "height"
#> [13] "curb.weight"       "engine.type"       "num.of.cylinders"
#> [16] "engine.size"       "fuel.system"       "bore"
#> [19] "stroke"            "compression.ratio" "horsepower"
#> [22] "peak.rpm"          "city.mpg"          "highway.mpg"
#> [25] "price"
print(odata$tags)
#> [1] "study_1"  "study_41" "study_76" "uci"
print(odata$task())
#> <TaskClassif:autos> (205 x 26)
#> * Target: symboling
#> * Properties: multiclass
#> * Features (25):
#>   - fct (10): aspiration, body.style, drive.wheels, engine.location,
#>     engine.type, fuel.system, fuel.type, make, num.of.cylinders,
#>     num.of.doors
#>   - int (8): city.mpg, curb.weight, engine.size, highway.mpg,
#>     horsepower, normalized.losses, peak.rpm, price
#>   - dbl (7): bore, compression.ratio, height, length, stroke,
#>     wheel.base, width

# get a task via tsk():
if (requireNamespace("mlr3")) {
  mlr3::tsk("oml", data_id = 9)
}
#> <TaskClassif:autos> (205 x 26)
#> * Target: symboling
#> * Properties: multiclass
#> * Features (25):
#>   - fct (10): aspiration, body.style, drive.wheels, engine.location,
#>     engine.type, fuel.system, fuel.type, make, num.of.cylinders,
#>     num.of.doors
#>   - int (8): city.mpg, curb.weight, engine.size, highway.mpg,
#>     horsepower, normalized.losses, peak.rpm, price
#>   - dbl (7): bore, compression.ratio, height, length, stroke,
#>     wheel.base, width
# }