List Data from OpenML
Source: R/list_oml_data.R, R/list_oml_evaluations.R, R/list_oml_flows.R, and 4 more
list_oml.Rd
These functions query data sets, tasks, flows, setups, runs, and evaluation measures from OpenML (https://www.openml.org/search?type=data&sort=runs&status=active) using simple filter criteria.
To find data sets for a specific task type, use list_oml_tasks(), which supports filtering by task type.
Usage
list_oml_data(
  data_id = NULL,
  data_name = NULL,
  number_instances = NULL,
  number_features = NULL,
  number_classes = NULL,
  number_missing_values = NULL,
  tag = NULL,
  limit = limit_default(),
  test_server = test_server_default(),
  ...
)

list_oml_evaluations(
  run_id = NULL,
  task_id = NULL,
  measures = NULL,
  tag = NULL,
  limit = limit_default(),
  test_server = test_server_default(),
  ...
)

list_oml_flows(
  uploader = NULL,
  tag = NULL,
  limit = limit_default(),
  test_server = test_server_default(),
  ...
)

list_oml_measures(test_server = test_server_default())

list_oml_runs(
  run_id = NULL,
  task_id = NULL,
  tag = NULL,
  flow_id = NULL,
  limit = limit_default(),
  test_server = test_server_default(),
  ...
)

list_oml_setups(
  flow_id = NULL,
  setup_id = NULL,
  tag = NULL,
  limit = limit_default(),
  test_server = test_server_default(),
  ...
)

list_oml_tasks(
  task_id = NULL,
  data_id = NULL,
  number_instances = NULL,
  number_features = NULL,
  number_classes = NULL,
  number_missing_values = NULL,
  tag = NULL,
  limit = limit_default(),
  test_server = test_server_default(),
  type = NULL,
  ...
)
Arguments
- data_id (integer())
  Vector of data ids to restrict to.
- data_name (character(1))
  Filter for name of data set.
- number_instances (integer())
  Filter for number of instances.
- number_features (integer())
  Filter for number of features.
- number_classes (integer())
  Filter for number of labels of the target (only classification tasks).
- number_missing_values (integer())
  Filter for number of missing values.
- tag (character())
  Filter for tags. You can provide multiple tags as a character vector.
- limit (integer())
  Limit the results to limit records. Default is the value of option "mlr3oml.limit", defaulting to 5000.
- test_server (logical(1))
  Whether to use the OpenML test server or public server. Defaults to the value of option "mlr3oml.test_server", or FALSE if not set.
- ... (any)
  Additional (unsupported) filters, as named arguments.
- run_id (integer())
  Vector of run ids to restrict to.
- task_id (integer())
  Vector of task ids to restrict to.
- measures (character())
  Vector of evaluation measures to restrict to.
- uploader (integer(1))
  Filter for uploader.
- flow_id (integer(1))
  Filter for flow id.
- setup_id (integer())
  Vector of setup ids to restrict to.
- type (character(1))
  The task type. Supported values are "classif", "regr", "surv" and "clust".
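The defaults for limit and test_server come from the options named above, so they can be set once per session rather than on every call. The following is a minimal sketch, assuming the mlr3oml package is installed and the OpenML server is reachable; the values 100 and 2 are illustrative only.

```r
# Set session-wide defaults through the options named in the argument docs.
old_opts = options(mlr3oml.limit = 100L, mlr3oml.test_server = FALSE)

# The value that limit = limit_default() is documented to fall back on.
current_limit = getOption("mlr3oml.limit", 5000L)

if (requireNamespace("mlr3oml", quietly = TRUE)) {
  # Needs network access; try() keeps the script running if the server is down.
  binary_tasks = try(
    mlr3oml::list_oml_tasks(type = "classif", number_classes = 2),
    silent = TRUE
  )
}

# Restore the previous option values.
options(old_opts)
```

Passing limit or test_server explicitly in a call still takes precedence over the option, since the option only supplies the default argument value.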
Details
Filter values are usually provided as single atomic values (typically integer or character).
Provide a numeric vector of length 2, c(l, u), to find matches in the range \([l, u]\).
Note that only a subset of filters is exposed here.
For a more feature-complete package, see the OpenML R package.
Alternatively, you can pass additional filters via ... using the names of the official API, cf. the REST tab of https://www.openml.org/apis.
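The difference between an exact filter and a range filter can be sketched as follows. This is an illustration under assumptions: the bounds and the limit are arbitrary, the filters are collected in a list only to make them easy to inspect, and the query itself needs the mlr3oml package and network access.

```r
# A single value asks for an exact match; a length-2 vector c(l, u) asks for
# all values in the closed range [l, u].
filters = list(
  number_instances = c(100, 500), # between 100 and 500 instances
  number_features = 4,            # exactly 4 features
  limit = 10                      # return at most 10 records
)

if (requireNamespace("mlr3oml", quietly = TRUE)) {
  # try() keeps the script running if the server cannot be reached.
  small_data = try(do.call(mlr3oml::list_oml_data, filters), silent = TRUE)
}
```

Filters that are not formal arguments would be passed the same way through ..., using the names listed in the REST API documentation.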
References
Casalicchio G, Bossek J, Lang M, Kirchhoff D, Kerschke P, Hofner B, Seibold H, Vanschoren J, Bischl B (2017). "OpenML: An R Package to Connect to the Machine Learning Platform OpenML." Computational Statistics, 1–15. doi:10.1007/s00180-017-0742-2.
Vanschoren J, van Rijn JN, Bischl B, Torgo L (2014). "OpenML." ACM SIGKDD Explorations Newsletter, 15(2), 49–60. doi:10.1145/2641190.2641198.
Examples
try({
  ### query data sets
  # search for titanic data set
  data_sets = list_oml_data(data_name = "titanic")
  print(data_sets)

  # search for a reduced version
  data_sets = list_oml_data(
    data_name = "titanic",
    number_instances = c(2200, 2300),
    number_features = 4
  )
  print(data_sets)

  ### search tasks for this data set
  tasks = list_oml_tasks(data_id = data_sets$data_id)
  print(tasks)

  # query runs, group by number of runs per task_id
  runs = list_oml_runs(task_id = tasks$task_id)
  runs[, .N, by = task_id]
}, silent = TRUE)
#> data_id name version status MajorityClassSize MaxNominalAttDistinctValues
#> 1: 40704 Titanic 2 active 1490 2
#> 2: 40945 Titanic 1 active 809 3
#> 3: 41265 Titanic 4 active NA NA
#> 4: 42436 Titanic 5 active NA NA
#> 5: 42437 titanic 6 active NA NA
#> 6: 42438 Titanic 7 active NA NA
#> 7: 42637 titanic 8 active NA NA
#> 8: 42638 titanic 9 active 549 NA
#> MinorityClassSize NumberOfClasses NumberOfFeatures NumberOfInstances
#> 1: 711 2 4 2201
#> 2: 500 2 14 1309
#> 3: NA 0 8 1307
#> 4: NA 0 8 891
#> 5: NA 0 8 891
#> 6: NA 0 8 891
#> 7: NA NA 8 891
#> 8: 342 2 8 891
#> NumberOfInstancesWithMissingValues NumberOfMissingValues
#> 1: 0 0
#> 2: 1309 3855
#> 3: 0 0
#> 4: 0 0
#> 5: 0 0
#> 6: 0 0
#> 7: 689 689
#> 8: 689 689
#> NumberOfNumericFeatures NumberOfSymbolicFeatures
#> 1: 3 1
#> 2: 6 3
#> 3: 8 0
#> 4: 8 0
#> 5: 8 0
#> 6: 8 0
#> 7: 3 5
#> 8: 3 5
#> data_id name version status MajorityClassSize MaxNominalAttDistinctValues
#> 1: 40704 Titanic 2 active 1490 2
#> MinorityClassSize NumberOfClasses NumberOfFeatures NumberOfInstances
#> 1: 711 2 4 2201
#> NumberOfInstancesWithMissingValues NumberOfMissingValues
#> 1: 0 0
#> NumberOfNumericFeatures NumberOfSymbolicFeatures
#> 1: 3 1
#> task_id task_type data_id name status
#> 1: 145769 Clustering 40704 Titanic active
#> 2: 146230 Supervised Classification 40704 Titanic active
#> 3: 146528 Learning Curve 40704 Titanic active
#> 4: 166588 Clustering 40704 Titanic active
#> 5: 167099 Supervised Classification 40704 Titanic active
#> 6: 167486 Learning Curve 40704 Titanic active
#> 7: 167844 Learning Curve 40704 Titanic active
#> 8: 168202 Learning Curve 40704 Titanic active
#> 9: 168615 Learning Curve 40704 Titanic active
#> 10: 188610 Clustering 40704 Titanic active
#> 11: 189623 Learning Curve 40704 Titanic active
#> 12: 190235 Supervised Data Stream Classification 40704 Titanic active
#> 13: 210117 Clustering 40704 Titanic active
#> 14: 211516 Learning Curve 40704 Titanic active
#> 15: 231791 Clustering 40704 Titanic active
#> 16: 252906 Clustering 40704 Titanic active
#> 17: 293680 Clustering 40704 Titanic active
#> 18: 293693 Clustering 40704 Titanic active
#> 19: 316179 Clustering 40704 Titanic active
#> 20: 337306 Clustering 40704 Titanic active
#> 21: 358461 Clustering 40704 Titanic active
#> 22: 360401 Learning Curve 40704 Titanic active
#> task_id task_type data_id name status
#> MajorityClassSize MaxNominalAttDistinctValues MinorityClassSize
#> 1: 1490 2 711
#> 2: 1490 2 711
#> 3: 1490 2 711
#> 4: 1490 2 711
#> 5: 1490 2 711
#> 6: 1490 2 711
#> 7: 1490 2 711
#> 8: 1490 2 711
#> 9: 1490 2 711
#> 10: 1490 2 711
#> 11: 1490 2 711
#> 12: 1490 2 711
#> 13: 1490 2 711
#> 14: 1490 2 711
#> 15: 1490 2 711
#> 16: 1490 2 711
#> 17: 1490 2 711
#> 18: 1490 2 711
#> 19: 1490 2 711
#> 20: 1490 2 711
#> 21: 1490 2 711
#> 22: 1490 2 711
#> MajorityClassSize MaxNominalAttDistinctValues MinorityClassSize
#> NumberOfClasses NumberOfFeatures NumberOfInstances
#> 1: 2 4 2201
#> 2: 2 4 2201
#> 3: 2 4 2201
#> 4: 2 4 2201
#> 5: 2 4 2201
#> 6: 2 4 2201
#> 7: 2 4 2201
#> 8: 2 4 2201
#> 9: 2 4 2201
#> 10: 2 4 2201
#> 11: 2 4 2201
#> 12: 2 4 2201
#> 13: 2 4 2201
#> 14: 2 4 2201
#> 15: 2 4 2201
#> 16: 2 4 2201
#> 17: 2 4 2201
#> 18: 2 4 2201
#> 19: 2 4 2201
#> 20: 2 4 2201
#> 21: 2 4 2201
#> 22: 2 4 2201
#> NumberOfClasses NumberOfFeatures NumberOfInstances
#> NumberOfInstancesWithMissingValues NumberOfMissingValues
#> 1: 0 0
#> 2: 0 0
#> 3: 0 0
#> 4: 0 0
#> 5: 0 0
#> 6: 0 0
#> 7: 0 0
#> 8: 0 0
#> 9: 0 0
#> 10: 0 0
#> 11: 0 0
#> 12: 0 0
#> 13: 0 0
#> 14: 0 0
#> 15: 0 0
#> 16: 0 0
#> 17: 0 0
#> 18: 0 0
#> 19: 0 0
#> 20: 0 0
#> 21: 0 0
#> 22: 0 0
#> NumberOfInstancesWithMissingValues NumberOfMissingValues
#> NumberOfNumericFeatures NumberOfSymbolicFeatures
#> 1: 3 1
#> 2: 3 1
#> 3: 3 1
#> 4: 3 1
#> 5: 3 1
#> 6: 3 1
#> 7: 3 1
#> 8: 3 1
#> 9: 3 1
#> 10: 3 1
#> 11: 3 1
#> 12: 3 1
#> 13: 3 1
#> 14: 3 1
#> 15: 3 1
#> 16: 3 1
#> 17: 3 1
#> 18: 3 1
#> 19: 3 1
#> 20: 3 1
#> 21: 3 1
#> 22: 3 1
#> NumberOfNumericFeatures NumberOfSymbolicFeatures
#> task_id N
#> 1: 146230 35