These functions query data sets, tasks, flows, setups, runs, and evaluation measures from https://openml.org/d using simple filter criteria.

Usage

list_oml_data(
  data_id = NULL,
  data_name = NULL,
  number_instances = NULL,
  number_features = NULL,
  number_classes = NULL,
  number_missing_values = NULL,
  tag = NULL,
  limit = getOption("mlr3oml.limit", 5000L),
  ...
)

list_oml_evaluations(
  run_id = NULL,
  task_id = NULL,
  measures = NULL,
  tag = NULL,
  limit = getOption("mlr3oml.limit", 5000L),
  ...
)

list_oml_flows(
  uploader = NULL,
  tag = NULL,
  limit = getOption("mlr3oml.limit", 5000L),
  ...
)

list_oml_measures()

list_oml_runs(
  run_id = NULL,
  task_id = NULL,
  tag = NULL,
  flow_id = NULL,
  limit = getOption("mlr3oml.limit", 5000L),
  ...
)

list_oml_setups(
  flow_id = NULL,
  setup_id = NULL,
  tag = NULL,
  limit = getOption("mlr3oml.limit", 5000L),
  ...
)

list_oml_tasks(
  task_id = NULL,
  data_id = NULL,
  number_instances = NULL,
  number_features = NULL,
  number_classes = NULL,
  number_missing_values = NULL,
  tag = NULL,
  limit = getOption("mlr3oml.limit", 5000L),
  ...
)

Arguments

data_id

(integer())
Vector of data ids to restrict to.

data_name

(character(1))
Filter for name of data set.

number_instances

(integer())
Filter for number of instances.

number_features

(integer())
Filter for number of features.

number_classes

(integer())
Filter for number of labels of the target (only classification tasks).

number_missing_values

(integer())
Filter for number of missing values.

tag

(character())
Filter for tags. You can provide multiple tags as character vector.

limit

(integer())
Maximum number of records to return. Defaults to the value of option "mlr3oml.limit", or 5000 if that option is not set.

...

(any)
Additional (unsupported) filters, as named arguments.

run_id

(integer())
Vector of run ids to restrict to.

task_id

(integer())
Vector of task ids to restrict to.

measures

(character())
Vector of evaluation measures to restrict to.

uploader

(integer(1))
Filter for uploader.

flow_id

(integer(1))
Filter for flow id.

setup_id

(integer())
Vector of setup ids to restrict to.

Value

(data.table()) of results, or a null data.table if no record matches the filter criteria.
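Since a query without matches returns a null data.table rather than raising an error, calling code may want to check for rows before indexing into the result. The following is a minimal sketch of such a check; `has_matches()` and `first_data_id()` are hypothetical helper names introduced here for illustration and are not part of mlr3oml.

```r
# Hypothetical helper (not part of mlr3oml):
# TRUE if a listing result contains at least one row.
has_matches = function(tab) {
  nrow(tab) > 0L
}

# Hypothetical wrapper: id of the first data set matching `name`,
# or NA if the query returned no rows.
# Calling it requires the mlr3oml package and network access.
first_data_id = function(name) {
  tab = mlr3oml::list_oml_data(data_name = name, limit = 10L)
  if (!has_matches(tab)) {
    return(NA_integer_)
  }
  tab$data_id[1L]
}
```

For example, `first_data_id("titanic")` would return one of the data ids listed by the query, assuming the server is reachable.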

Details

Filter values are usually provided as single atomic values (typically integer or character). Provide a numeric vector of length 2 (c(l, u)) to find matches in the range \([l, u]\).

Note that only a subset of filters is exposed here. For a more feature-complete package, see the OpenML R package. Alternatively, you can pass additional filters via ... using the names of the official API, cf. https://www.openml.org/api_docs.
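As a sketch of passing an extra filter through `...`: the example below assumes that the data listing endpoint of the OpenML API accepts a filter named `status` with the value `"deactivated"`. This filter name is taken from the API documentation, not from this page, so treat it as an assumption and check it against https://www.openml.org/api_docs. The wrapper name `list_deactivated_data()` is hypothetical.

```r
# Hypothetical wrapper (not part of mlr3oml).
# Assumption: "status" is a valid filter name of the OpenML data listing API;
# it is forwarded unchanged through `...` of list_oml_data().
# Calling it requires the mlr3oml package and network access.
list_deactivated_data = function(limit = 10L) {
  mlr3oml::list_oml_data(
    status = "deactivated", # extra filter, passed via `...`
    limit = limit           # documented argument
  )
}
```

Because such filters are passed through without being checked by this package, a misspelled filter name will probably only surface as an error reported by the server.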

References

Casalicchio G, Bossek J, Lang M, Kirchhoff D, Kerschke P, Hofner B, Seibold H, Vanschoren J, Bischl B (2017). “OpenML: An R Package to Connect to the Machine Learning Platform OpenML.” Computational Statistics, 1–15. doi:10.1007/s00180-017-0742-2.

Vanschoren J, van Rijn JN, Bischl B, Torgo L (2014). “OpenML.” ACM SIGKDD Explorations Newsletter, 15(2), 49–60. doi:10.1145/2641190.2641198.

Examples

# \donttest{
### query data sets
# search for titanic data set
data_sets = list_oml_data(data_name = "titanic")
print(data_sets)
#>    data_id    name version status MajorityClassSize MaxNominalAttDistinctValues
#> 1:   40704 Titanic       2 active              1490                           2
#> 2:   40945 Titanic       1 active               809                           3
#> 3:   41265 Titanic       4 active                NA                          NA
#> 4:   42436 Titanic       5 active                NA                          NA
#> 5:   42437 titanic       6 active                NA                          NA
#> 6:   42438 Titanic       7 active                NA                          NA
#> 7:   42637 titanic       8 active                NA                          NA
#> 8:   42638 titanic       9 active               549                          NA
#>    MinorityClassSize NumberOfClasses NumberOfFeatures NumberOfInstances
#> 1:               711               2                4              2201
#> 2:               500               2               14              1309
#> 3:                NA               0                8              1307
#> 4:                NA               0                8               891
#> 5:                NA               0                8               891
#> 6:                NA               0                8               891
#> 7:                NA              NA                8               891
#> 8:               342               2                8               891
#>    NumberOfInstancesWithMissingValues NumberOfMissingValues
#> 1:                                  0                     0
#> 2:                               1309                  3855
#> 3:                                  0                     0
#> 4:                                  0                     0
#> 5:                                  0                     0
#> 6:                                  0                     0
#> 7:                                689                   689
#> 8:                                689                   689
#>    NumberOfNumericFeatures NumberOfSymbolicFeatures
#> 1:                       3                        1
#> 2:                       6                        3
#> 3:                       8                        0
#> 4:                       8                        0
#> 5:                       8                        0
#> 6:                       8                        0
#> 7:                       3                        5
#> 8:                       3                        5

# search for a reduced version
data_sets = list_oml_data(
  data_name = "titanic",
  number_instances = c(2200, 2300),
  number_features = 4
)
print(data_sets)
#>    data_id    name version status MajorityClassSize MaxNominalAttDistinctValues
#> 1:   40704 Titanic       2 active              1490                           2
#>    MinorityClassSize NumberOfClasses NumberOfFeatures NumberOfInstances
#> 1:               711               2                4              2201
#>    NumberOfInstancesWithMissingValues NumberOfMissingValues
#> 1:                                  0                     0
#>    NumberOfNumericFeatures NumberOfSymbolicFeatures
#> 1:                       3                        1

### search tasks for this data set
tasks = list_oml_tasks(data_id = data_sets$data_id)
print(tasks)
#>     task_id                             task_type data_id    name status
#>  1:  145769                            Clustering   40704 Titanic active
#>  2:  146230             Supervised Classification   40704 Titanic active
#>  3:  146528                        Learning Curve   40704 Titanic active
#>  4:  166588                            Clustering   40704 Titanic active
#>  5:  167099             Supervised Classification   40704 Titanic active
#>  6:  167486                        Learning Curve   40704 Titanic active
#>  7:  167844                        Learning Curve   40704 Titanic active
#>  8:  168202                        Learning Curve   40704 Titanic active
#>  9:  168615                        Learning Curve   40704 Titanic active
#> 10:  188610                            Clustering   40704 Titanic active
#> 11:  189623                        Learning Curve   40704 Titanic active
#> 12:  190235 Supervised Data Stream Classification   40704 Titanic active
#> 13:  210117                            Clustering   40704 Titanic active
#> 14:  211516                        Learning Curve   40704 Titanic active
#> 15:  231791                            Clustering   40704 Titanic active
#> 16:  252906                            Clustering   40704 Titanic active
#> 17:  293680                            Clustering   40704 Titanic active
#> 18:  293693                            Clustering   40704 Titanic active
#> 19:  316179                            Clustering   40704 Titanic active
#> 20:  337306                            Clustering   40704 Titanic active
#> 21:  358461                            Clustering   40704 Titanic active
#> 22:  360401                        Learning Curve   40704 Titanic active
#>     task_id                             task_type data_id    name status
#>     MajorityClassSize MaxNominalAttDistinctValues MinorityClassSize
#>  1:              1490                           2               711
#>  2:              1490                           2               711
#>  3:              1490                           2               711
#>  4:              1490                           2               711
#>  5:              1490                           2               711
#>  6:              1490                           2               711
#>  7:              1490                           2               711
#>  8:              1490                           2               711
#>  9:              1490                           2               711
#> 10:              1490                           2               711
#> 11:              1490                           2               711
#> 12:              1490                           2               711
#> 13:              1490                           2               711
#> 14:              1490                           2               711
#> 15:              1490                           2               711
#> 16:              1490                           2               711
#> 17:              1490                           2               711
#> 18:              1490                           2               711
#> 19:              1490                           2               711
#> 20:              1490                           2               711
#> 21:              1490                           2               711
#> 22:              1490                           2               711
#>     MajorityClassSize MaxNominalAttDistinctValues MinorityClassSize
#>     NumberOfClasses NumberOfFeatures NumberOfInstances
#>  1:               2                4              2201
#>  2:               2                4              2201
#>  3:               2                4              2201
#>  4:               2                4              2201
#>  5:               2                4              2201
#>  6:               2                4              2201
#>  7:               2                4              2201
#>  8:               2                4              2201
#>  9:               2                4              2201
#> 10:               2                4              2201
#> 11:               2                4              2201
#> 12:               2                4              2201
#> 13:               2                4              2201
#> 14:               2                4              2201
#> 15:               2                4              2201
#> 16:               2                4              2201
#> 17:               2                4              2201
#> 18:               2                4              2201
#> 19:               2                4              2201
#> 20:               2                4              2201
#> 21:               2                4              2201
#> 22:               2                4              2201
#>     NumberOfClasses NumberOfFeatures NumberOfInstances
#>     NumberOfInstancesWithMissingValues NumberOfMissingValues
#>  1:                                  0                     0
#>  2:                                  0                     0
#>  3:                                  0                     0
#>  4:                                  0                     0
#>  5:                                  0                     0
#>  6:                                  0                     0
#>  7:                                  0                     0
#>  8:                                  0                     0
#>  9:                                  0                     0
#> 10:                                  0                     0
#> 11:                                  0                     0
#> 12:                                  0                     0
#> 13:                                  0                     0
#> 14:                                  0                     0
#> 15:                                  0                     0
#> 16:                                  0                     0
#> 17:                                  0                     0
#> 18:                                  0                     0
#> 19:                                  0                     0
#> 20:                                  0                     0
#> 21:                                  0                     0
#> 22:                                  0                     0
#>     NumberOfInstancesWithMissingValues NumberOfMissingValues
#>     NumberOfNumericFeatures NumberOfSymbolicFeatures
#>  1:                       3                        1
#>  2:                       3                        1
#>  3:                       3                        1
#>  4:                       3                        1
#>  5:                       3                        1
#>  6:                       3                        1
#>  7:                       3                        1
#>  8:                       3                        1
#>  9:                       3                        1
#> 10:                       3                        1
#> 11:                       3                        1
#> 12:                       3                        1
#> 13:                       3                        1
#> 14:                       3                        1
#> 15:                       3                        1
#> 16:                       3                        1
#> 17:                       3                        1
#> 18:                       3                        1
#> 19:                       3                        1
#> 20:                       3                        1
#> 21:                       3                        1
#> 22:                       3                        1
#>     NumberOfNumericFeatures NumberOfSymbolicFeatures


# query runs for these tasks, then count the number of runs per task_id
runs = list_oml_runs(task_id = tasks$task_id)
runs[, .N, by = task_id]
#>    task_id  N
#> 1:  146230 35
# }
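The examples above do not cover `list_oml_evaluations()`. A possible follow-up, sketched under assumptions: the measure name `"predictive_accuracy"` is assumed to be among the names returned by `list_oml_measures()`, and `task_evaluations()` is a hypothetical wrapper name, not part of mlr3oml.

```r
# Hypothetical wrapper (not part of mlr3oml): evaluations for the given
# tasks, restricted to a single evaluation measure.
# Assumption: "predictive_accuracy" is a valid OpenML measure name;
# list_oml_measures() can be used to look up the available names.
# Calling it requires the mlr3oml package and network access.
task_evaluations = function(task_ids, measure = "predictive_accuracy", limit = 100L) {
  mlr3oml::list_oml_evaluations(
    task_id = task_ids,
    measures = measure,
    limit = limit
  )
}
```

With the objects from the examples above, `task_evaluations(146230L)` would request evaluations for the one task that had runs.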