This is the class for data sets served on OpenML. This object can also be constructed using the sugar function odt().

mlr3 Integration

Name conversion

Column names that don't comply with R's naming scheme are renamed (see base::make.names()). This means that the names can differ from those on OpenML.
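The renaming rule can be tried out directly with base::make.names(). The following is a minimal sketch; the column names are hypothetical and only chosen to illustrate typical conversions, and whether the package applies further steps (such as enforcing uniqueness) is not covered here.

```r
# Hypothetical column names as they might appear on OpenML
openml_names = c("normalized-losses", "fuel type", "1st_owner", "price")

# base::make.names() replaces invalid characters with "." and
# prepends "X" to names that do not start with a letter or "."
r_names = make.names(openml_names)

# "normalized-losses" becomes "normalized.losses"
# "fuel type"         becomes "fuel.type"
# "1st_owner"         becomes "X1st_owner"
# "price"             is already valid and stays unchanged
data.frame(openml = openml_names, r = r_names)
```

Keeping such a lookup table around can help when results have to be mapped back to the names shown on the OpenML website.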

File Format

The datasets on OpenML are stored either as (sparse) ARFF or as parquet files. When creating a new OMLData object, the constructor argument parquet switches between the two formats. Not all data files are available as parquet. The option mlr3oml.parquet can be used to set a default. If parquet is TRUE but no parquet file is available, ARFF is used as a fallback.
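As a rough sketch of how the option and the constructor argument interact, assuming the mlr3oml package is installed and the OpenML server is reachable (data set id 9 is the one used in the Examples section):

```r
library(mlr3oml)

# Set the session-wide default and remember the previous value
old = options(mlr3oml.parquet = TRUE)

# No parquet argument given: the default is taken from the option
odata = OMLData$new(id = 9)
odata$parquet

# An explicit argument takes precedence over the option
odata_arff = OMLData$new(id = 9, parquet = FALSE)
odata_arff$parquet

# Restore the previous setting
options(old)
```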

ARFF Files

This package comes with its own reader for ARFF files, based on data.table::fread(). For sparse ARFF files, and if the RWeka package is installed, the reader automatically falls back to the implementation in RWeka::read.arff().

Parquet Files

For the handling of parquet files, we rely on duckdb and DBI.

References

Vanschoren J, van Rijn JN, Bischl B, Torgo L (2014). “OpenML.” ACM SIGKDD Explorations Newsletter, 15(2), 49-60. doi:10.1145/2641190.2641198.

Super class

mlr3oml::OMLObject -> OMLData

Active bindings

qualities

(data.table())
Data set qualities (performance values), downloaded from the JSON API response and converted to a data.table::data.table() with columns "name" and "value".

tags

(character())
Returns all tags of the object.

parquet

(logical(1))
Whether to use parquet.

data

(data.table())
Returns the data (without row identifier columns and ignored columns).

features

(data.table())
Information about data set features (including target), downloaded from the JSON API response and converted to a data.table::data.table() with columns:

  • "index" (integer()): Column position.

  • "name" (character()): Name of the feature.

  • "data_type" (factor()): Type of the feature: "nominal" or "numeric".

  • "nominal_value" (list()): Levels of the feature, or NULL for numeric features.

  • "is_target" (logical()): TRUE for target column, FALSE otherwise.

  • "is_ignore" (logical()): TRUE if this feature should be ignored. Ignored features are removed automatically from the data set.

  • "is_row_identifier" (logical()): TRUE if the column encodes a row identifier. Row identifiers are removed automatically from the data set.

  • "number_of_missing_values" (integer()): Number of missing values in the column.
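Since features is a data.table, the columns listed above can be queried with the usual data.table syntax. A minimal sketch, assuming mlr3oml and data.table are installed and the OpenML server is reachable:

```r
library(mlr3oml)
library(data.table)

odata = OMLData$new(id = 9)
feats = odata$features

# Name(s) of the target column(s)
feats[is_target == TRUE, name]

# Nominal features together with their levels
feats[data_type == "nominal", .(name, nominal_value)]

# Features that contain missing values
feats[number_of_missing_values > 0, .(name, number_of_missing_values)]

# Features that are dropped from the data (ignored or row identifiers)
feats[is_ignore == TRUE | is_row_identifier == TRUE, name]
```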

target_names

(character())
Name of the default target, as extracted from the OpenML data set description.

feature_names

(character())
Names of the features, as extracted from the OpenML data set description.

nrow

(integer())
Number of observations, as extracted from the OpenML data set qualities.

ncol

(integer())
Number of features (including targets), as extracted from the table of data set features. This excludes row identifiers and ignored columns.

license

(character())
Returns the license of the dataset.

parquet_path

(character())
Downloads the parquet file (or loads from cache) and returns the path of the parquet file. Note that this also normalizes the names of the parquet file.
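The returned path can be handed to duckdb to query the file without loading it into memory first. The following is only a sketch: it assumes that duckdb and DBI are installed, that a parquet file exists for the data set, and that the OpenML server is reachable. The SQL query is an illustration and not part of the package.

```r
library(mlr3oml)
library(DBI)

odata = OMLData$new(id = 9, parquet = TRUE)

# Downloads the file on first access, afterwards it is read from the cache
path = odata$parquet_path

# Open an in-memory duckdb connection and count the rows of the file
con = dbConnect(duckdb::duckdb())
n = dbGetQuery(con, sprintf("SELECT COUNT(*) AS n FROM read_parquet('%s')", path))
print(n)

# Close the connection and shut down the database instance
dbDisconnect(con, shutdown = TRUE)
```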

Methods


Method new()

Creates a new instance of this R6 class.

Usage

OMLData$new(
  id,
  cache = cache_default(),
  parquet = parquet_default(),
  test_server = test_server_default()
)

Arguments

id

(integer(1))
OpenML id for the object.

cache

(logical(1) | character(1))
See field cache for an explanation of possible values. Defaults to value of option "mlr3oml.cache", or FALSE if not set.
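A short sketch of the two ways to control caching described above, assuming mlr3oml is installed. Only the logical values are shown; the character form is not demonstrated because its exact requirements are explained with the cache field.

```r
library(mlr3oml)

# Session-wide default via the option; remember the previous value
old = options(mlr3oml.cache = TRUE)

# No cache argument given: the default is taken from the option
odata_cached = OMLData$new(id = 9)

# An explicit argument disables caching for this object only
odata_uncached = OMLData$new(id = 9, cache = FALSE)

# Restore the previous setting
options(old)
```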

parquet

(logical(1))
Whether to use parquet instead of arff. If parquet is not available, it will fall back to arff. Defaults to value of option "mlr3oml.parquet" or FALSE if not set.

test_server

(logical(1))
Whether to use the OpenML test server or public server. Defaults to value of option "mlr3oml.test_server", or FALSE if not set.


Method print()

Prints the object. For a more detailed printer, convert to an mlr3::Task via as_task().

Usage

OMLData$print()


Method quality()

Returns the value of a single OpenML data set quality.

Usage

OMLData$quality(name)

Arguments

name

(character(1))
Name of the quality to extract.
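A minimal sketch of looking up qualities, assuming mlr3oml is installed and the OpenML server is reachable. The quality names used here are assumptions based on common OpenML qualities; the "name" column of the qualities table lists the names that are actually available for a given data set.

```r
library(mlr3oml)

odata = OMLData$new(id = 9)

# Inspect the available qualities (columns "name" and "value")
head(odata$qualities)

# Extract single qualities by name (names assumed to exist for this data set)
odata$quality("NumberOfInstances")
odata$quality("NumberOfFeatures")
```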


Method clone()

The objects of this class are cloneable with this method.

Usage

OMLData$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples

try({
  library("mlr3")
  # OpenML Data object
  odata = OMLData$new(id = 9)
  # using sugar
  odata = odt(id = 9)
  print(odata)
  print(odata$target_names)
  print(odata$feature_names)
  print(odata$tags)

  # mlr3 conversion:
  task = as_task(odata)
  backend = as_data_backend(odata)
  class(backend)

  # get a task via tsk():
  tsk("oml", data_id = 9)

  # For parquet files
  if (requireNamespace("duckdb")) {
    odata = OMLData$new(id = 9, parquet = TRUE)
    # using sugar
    odata = odt(id = 9, parquet = TRUE)

    print(odata)
    print(odata$target_names)
    print(odata$feature_names)
    print(odata$tags)

    backend = as_data_backend(odata)
    class(backend)
    task = as_task(odata)
    task = tsk("oml", data_id = 9, parquet = TRUE)
    class(task$backend)
  }
}, silent = TRUE)
#> <OMLData:9:autos> (205x26)
#>  * Default target: symboling
#> [1] "symboling"
#>  [1] "normalized.losses" "make"              "fuel.type"        
#>  [4] "aspiration"        "num.of.doors"      "body.style"       
#>  [7] "drive.wheels"      "engine.location"   "wheel.base"       
#> [10] "length"            "width"             "height"           
#> [13] "curb.weight"       "engine.type"       "num.of.cylinders" 
#> [16] "engine.size"       "fuel.system"       "bore"             
#> [19] "stroke"            "compression.ratio" "horsepower"       
#> [22] "peak.rpm"          "city.mpg"          "highway.mpg"      
#> [25] "price"            
#> [1] "study_1"  "study_41" "study_76" "uci"     
#> Loading required namespace: duckdb
#> <OMLData:9:autos> (205x26)
#>  * Default target: symboling
#> [1] "symboling"
#>  [1] "normalized.losses" "make"              "fuel.type"        
#>  [4] "aspiration"        "num.of.doors"      "body.style"       
#>  [7] "drive.wheels"      "engine.location"   "wheel.base"       
#> [10] "length"            "width"             "height"           
#> [13] "curb.weight"       "engine.type"       "num.of.cylinders" 
#> [16] "engine.size"       "fuel.system"       "bore"             
#> [19] "stroke"            "compression.ratio" "horsepower"       
#> [22] "peak.rpm"          "city.mpg"          "highway.mpg"      
#> [25] "price"            
#> [1] "study_1"  "study_41" "study_76" "uci"     
#> [1] "DataBackendDuckDB" "DataBackend"       "R6"