mlr3oml
This tutorial will give you a quick overview of the main features of
mlr3oml. If you are not familiar with OpenML, we recommend reading its documentation first, as we will
not explain the OpenML concepts in detail here. Further coverage of some
selected mlr3oml features can be found in the Large-Scale
Benchmarking chapter of the mlr3book. Note that
mlr3oml mainly supports downloading objects from
OpenML. Uploading is limited to datasets, tasks, and collections (see the Uploading section) and can otherwise be achieved, for example, through the website.
First, we will briefly cover the different OpenML objects that can be
downloaded using mlr3oml. Then we will show how to find
objects with certain properties on OpenML. Finally, we will quickly
discuss some further aspects of mlr3oml, including
caching, file formats, laziness, the logger, and the API key.
OpenML Objects
mlr3oml supports five different types of OpenML objects
that are listed below. All objects can be converted to their
corresponding mlr3 counterpart.
- OMLData represents an OpenML dataset. These are (usually tabular) datasets with additional meta-data, which includes e.g. a description of the dataset or a license. The most similar mlr3 class is the mlr3::DataBackend.
- OMLTask represents an OpenML task. This is a concrete problem specification on top of an OpenML dataset. While being similar to mlr3::Task objects, a major difference is that the OpenML task also contains the resampling splits and can therefore also be converted to an mlr3::Resampling.
- OMLFlow represents an OpenML flow. This is a reusable and executable representation of a machine learning pipeline or workflow. The closest mlr3 class is the Learner.
- OMLRun represents an OpenML run. An OpenML run refers to the execution of a specific machine learning flow on a particular task, recording all relevant information such as hyperparameters, performance metrics, and intermediate results. This is similar to an mlr3::ResampleResult object.
- OMLCollection represents an OpenML collection, which can either be a run collection or a task collection. These are container objects that allow bundling tasks (resulting in benchmarking suites) or runs (which can be used to represent benchmark experiments). There is no mlr3 counterpart for the former (other than a list of tasks), while the latter would correspond to an mlr3::BenchmarkResult.
Each object on OpenML has a unique identifier, by which it can be retrieved. We will now briefly show how to access and work with these objects.
Data
Below, we retrieve the dataset with ID 31, which is the credit-g data
and can be viewed online here.
Like in other mlr3 packages, sugar functions exist for the
construction of R6 classes. We always show both ways to
construct the objects.
library(mlr3oml)
library(mlr3)
oml_data = OMLData$new(id = 31)
# is the same as
oml_data = odt(id = 31)
oml_data
## <OMLData:31:credit-g> (1000x21)
##  * Default target: class
The full meta data can be accessed using the $desc
field. Some fields, such as the number of rows and columns can be
accessed directly.
# the usage licence
oml_data$desc$licence
## [1] "Public"
# the data dimension
c(n_rows = oml_data$nrow, n_cols = oml_data$ncol)
## n_rows n_cols
##   1000     21
Information about the features can be accessed through the
$features field. This includes information regarding the
data types, missing values, whether they should be ignored for learning
or whether they are the row identifier.
head(oml_data$features)
## Key: <index>
##    index            name data_type
##    <int>          <char>    <fctr>
## 1:     0 checking_status   nominal
## 2:     1        duration   numeric
## 3:     2  credit_history   nominal
## 4:     3         purpose   nominal
## 5:     4   credit_amount   numeric
## 6:     5  savings_status   nominal
##                                                                                   nominal_value
##                                                                                          <list>
## 1:                                                                0<=X<200,<0,>=200,no checking
## 2:                                                                                       [NULL]
## 3: all paid,critical/other existing credit,delayed previously,existing paid,no credits/all paid
## 4:                  business,domestic appliance,education,furniture/equipment,new car,other,...
## 5:                                                                                       [NULL]
## 6:                                          100<=X<500,500<=X<1000,<100,>=1000,no known savings
##    is_target is_ignore is_row_identifier number_of_missing_values
##       <lgcl>    <lgcl>            <lgcl>                    <int>
## 1:     FALSE     FALSE             FALSE                        0
## 2:     FALSE     FALSE             FALSE                        0
## 3:     FALSE     FALSE             FALSE                        0
## 4:     FALSE     FALSE             FALSE                        0
## 5:     FALSE     FALSE             FALSE                        0
## 6:     FALSE     FALSE             FALSE                        0
The data itself can be accessed using the $data field.
We only show a subset of the data here for readability.
oml_data$data[1:5, 1:3]
##    checking_status duration                 credit_history
##             <fctr>    <int>                         <fctr>
## 1:              <0        6 critical/other existing credit
## 2:        0<=X<200       48                  existing paid
## 3:     no checking       12 critical/other existing credit
## 4:              <0       42                  existing paid
## 5:              <0       24             delayed previously
We can convert this object to an mlr3::DataBackend using
the as_data_backend() function.
backend = as_data_backend(oml_data)
backend
##
## ── <DataBackendDataTable> (1000x22) ────────────────────────────────────────────
##  checking_status duration                 credit_history             purpose
##           <fctr>    <int>                         <fctr>              <fctr>
##               <0        6 critical/other existing credit            radio/tv
##         0<=X<200       48                  existing paid            radio/tv
##      no checking       12 critical/other existing credit           education
##               <0       42                  existing paid furniture/equipment
##               <0       24             delayed previously             new car
##      no checking       36                  existing paid           education
##  credit_amount   savings_status employment installment_commitment
##          <int>           <fctr>     <fctr>                  <int>
##           1169 no known savings        >=7                      4
##           5951             <100     1<=X<4                      2
##           2096             <100     4<=X<7                      2
##           7882             <100     4<=X<7                      2
##           4870             <100     1<=X<4                      3
##           9055 no known savings     1<=X<4                      2
##     personal_status other_parties residence_since property_magnitude   age
##              <fctr>        <fctr>           <int>             <fctr> <int>
##         male single          none               4        real estate    67
##  female div/dep/mar          none               2        real estate    22
##         male single          none               3        real estate    49
##         male single     guarantor               4     life insurance    45
##         male single          none               4  no known property    53
##         male single          none               4  no known property    35
##  other_payment_plans  housing existing_credits                job
##               <fctr>   <fctr>            <int>             <fctr>
##                 none      own                2            skilled
##                 none      own                1            skilled
##                 none      own                1 unskilled resident
##                 none for free                1            skilled
##                 none for free                2            skilled
##                 none for free                1 unskilled resident
##  num_dependents own_telephone foreign_worker  class ..row_id
##           <int>        <fctr>         <fctr> <fctr>    <int>
##               1           yes            yes   good        1
##               1          none            yes    bad        2
##               2          none            yes   good        3
##               2          none            yes   good        4
##               2          none            yes    bad        5
##               2           yes            yes   good        6
## [...] (994 rows omitted)
Because this specific dataset has a default target in its meta data,
we can also directly convert it to an mlr3::Task.
# the default target
oml_data$target_names
## [1] "class"
# convert the OpenML data to an mlr3 task
task = as_task(oml_data)
With either the backend or the task, we are
now in mlr3 land again, and can work with the objects as
usual.
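As a minimal sketch of what this can look like, the task is resampled below with a decision tree. The learner (`classif.rpart`, which needs the rpart package), the 3-fold cross-validation, and the accuracy measure are assumptions chosen only for illustration. The code needs an internet connection to fetch the dataset.

```r
library(mlr3)
library(mlr3oml)

# Recreate the mlr3 task from the OpenML dataset with ID 31
task = as_task(odt(id = 31))

# Assumption: a simple decision tree learner for illustration
learner = lrn("classif.rpart")

# 3-fold cross-validation on the converted task
rr = resample(task, learner, rsmp("cv", folds = 3))

# Classification accuracy, aggregated over the folds
rr$aggregate(msr("classif.acc"))
```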
Task
Below, we access the OpenML task with ID 261, which is a classification task built on top of the credit-g data used above. Its associated resampling is a 2/3 holdout split.
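The call that creates this object is not shown above. A minimal sketch, assuming that the class constructor `OMLTask$new()` and the sugar function `otsk()` follow the same pattern as for datasets, and using the variable name `oml_task` that the rest of this section relies on:

```r
library(mlr3oml)

# Construct the OpenML task through the R6 class
oml_task = OMLTask$new(id = 261)
# is the same as
oml_task = otsk(id = 261)

# Printing the object requires some meta data to be downloaded
oml_task
```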
## <OMLTask:261>
##  * Type: Supervised Classification
##  * Data: credit-g (id: 31; dim: 1000x21)
##  * Target: class
##  * Estimation: holdout (id: 6; test size: 33%)
The OpenML data that the task is built on top of can be accessed
through $data.
oml_task$data
## <OMLData:31:credit-g> (1000x21)
##  * Default target: class
We can also access the target columns and the features. Note that this target can differ from the default target shown in the previous section.
oml_task$target_names
## [1] "class"
oml_task$feature_names
##  [1] "checking_status"        "duration"               "credit_history"
##  [4] "purpose"                "credit_amount"          "savings_status"        
##  [7] "employment"             "installment_commitment" "personal_status"       
## [10] "other_parties"          "residence_since"        "property_magnitude"    
## [13] "age"                    "other_payment_plans"    "housing"               
## [16] "existing_credits"       "job"                    "num_dependents"        
## [19] "own_telephone"          "foreign_worker"
The associated resampling splits can be accessed using
$task_splits.
oml_task$task_splits
##         type rowid repeat.  fold
##       <fctr> <int>   <int> <int>
##    1:   TEST   490       0     0
##    2:   TEST   539       0     0
##    3:   TEST   694       0     0
##    4:   TEST   646       0     0
##    5:   TEST   150       0     0
##   ---                           
##  996:  TRAIN   875       0     0
##  997:  TRAIN   549       0     0
##  998:  TRAIN   195       0     0
##  999:  TRAIN   241       0     0
## 1000:  TRAIN   360       0     0
The conversion to an mlr3::Task is possible using the
as_task() converter.
# Convert OpenML task to mlr3 task
task = as_task(oml_task)
task
##
## ── <TaskClassif> (1000x21) ─────────────────────────────────────────────────────
## • Target: class
## • Target classes: good (positive class, 70%), bad (30%)
## • Properties: twoclass
## • Features (20):
##   • fct (13): checking_status, credit_history, employment, foreign_worker,
##   housing, job, other_parties, other_payment_plans, own_telephone,
##   personal_status, property_magnitude, purpose, savings_status
##   • int (7): age, credit_amount, duration, existing_credits,
##   installment_commitment, num_dependents, residence_since
The associated resampling can be obtained by calling
as_resampling().
# Convert OpenML task to mlr3 resampling
resampling = as_resampling(oml_task)
resampling
##
## ── <ResamplingCustom> : Custom Splits ──────────────────────────────────────────
## • Iterations: 1
## • Instantiated: TRUE
## • Parameters: list()
To simplify this, there exist "oml" tasks and
resamplings:
tsk("oml", task_id = 261)
##
## ── <TaskClassif> (1000x21) ─────────────────────────────────────────────────────
## • Target: class
## • Target classes: good (positive class, 70%), bad (30%)
## • Properties: twoclass
## • Features (20):
##   • fct (13): checking_status, credit_history, employment, foreign_worker,
##   housing, job, other_parties, other_payment_plans, own_telephone,
##   personal_status, property_magnitude, purpose, savings_status
##   • int (7): age, credit_amount, duration, existing_credits,
##   installment_commitment, num_dependents, residence_since
rsmp("oml", task_id = 261)
##
## ── <ResamplingCustom> : Custom Splits ──────────────────────────────────────────
## • Iterations: 1
## • Instantiated: TRUE
## • Parameters: list()
Flows and Runs
We can access the flow with ID 1068 as shown below:
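The constructing call is not shown here. A sketch, assuming that `OMLFlow$new()` and the sugar function `oflw()` mirror the dataset example. The variable name `flow` is an assumption made for illustration.

```r
library(mlr3oml)

# Construct the OpenML flow through the R6 class
flow = OMLFlow$new(id = 1068)
# is the same as
flow = oflw(id = 1068)

# Printing the object requires some meta data to be downloaded
flow
```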
## <OMLFlow:1068>
##  * Name: weka.J48
##  * Dependencies: Weka_3.7.12
Flows themselves only become interesting once they are applied to a task, the result of which is an OpenML run.
For example, the run with ID 169061 contains the result of applying the above flow to task 261 from above:
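The call that creates the run is likewise not shown. A sketch, assuming that `OMLRun$new()` and the sugar function `orn()` follow the same pattern as above, with the variable name `run` that is used in the remainder of this section:

```r
library(mlr3oml)

# Construct the OpenML run through the R6 class
run = OMLRun$new(id = 169061)
# is the same as
run = orn(id = 169061)

# Printing the object requires some meta data to be downloaded
run
```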
## <OMLRun:169061>
##  * Task: credit-g (id: 261)
##  * Flow: weka.J48 (id: 1068)
##  * Estimation: holdout (id: 6; test size: 33%)
# the corresponding flow and task can be accessed directly
run$flow
## <OMLFlow:1068>
##  * Name: weka.J48
##  * Dependencies: Weka_3.7.12
run$task
## <OMLTask:261>
##  * Type: Supervised Classification
##  * Data: credit-g (id: 31; dim: 1000x21)
##  * Target: class
##  * Estimation: holdout (id: 6; test size: 33%)
The results of this experiment are the predictions, as well as the evaluation of these predictions.
head(run$prediction)
##    repeat.  fold row_id confidence.good confidence.bad prediction correct
##      <int> <int>  <int>           <num>          <num>     <fctr>  <fctr>
## 1:       0     0    490        0.920000       0.080000       good    good
## 2:       0     0    539        0.885714       0.114286       good    good
## 3:       0     0    694        0.920000       0.080000       good    good
## 4:       0     0    646        0.727273       0.272727       good    good
## 5:       0     0    150        0.727273       0.272727       good    good
## 6:       0     0    970        0.166667       0.833333        bad    good
head(run$evaluation)
##                             name     value        array_data repeat  fold
##                           <char>     <num>            <list> <char> <int>
## 1:          area_under_roc_curve  0.657856 0.657856,0.657856   <NA>    NA
## 2:                  average_cost  0.000000                NA   <NA>    NA
## 3:                     f_measure  0.705442 0.795745,0.494737   <NA>    NA
## 4:                         kappa  0.290990                NA   <NA>    NA
## 5: kb_relative_information_score 53.759911                NA   <NA>    NA
## 6:           mean_absolute_error  0.342722                NA   <NA>    NA
OpenML runs can be converted to mlr3::ResampleResults
using the as_resample_result() function.
rr = as_resample_result(run)
rr
##
## ── <ResampleResult> with 2 resampling iterations ───────────────────────────────
##   task_id learner_id resampling_id iteration     prediction_test warnings
##  credit-g   oml.1068        custom         1 <PredictionClassif>        0
##  credit-g   oml.1068        custom         1 <PredictionClassif>        0
##  errors
##       0
##       0
Collection
Below, we access the OpenML-CC18, which is a curated collection of 72 OpenML classification tasks, i.e. a task collection.
cc18 = OMLCollection$new(id = 99)
# is the same as
cc18 = ocl(id = 99)
The ids of the tasks and datasets contained in this benchmarking
suite can be accessed through the fields $task_ids and
$data_ids respectively.
# the first 10 task ids
cc18$task_ids[1:10]
##  [1]  3  6 11 12 14 15 16 18 22 23
# the first 10 data ids
cc18$data_ids[1:10]
##  [1]  3  6 11 12 14 15 16 18 22 23
We can, e.g., create an mlr3::Task from the first of
these tasks as follows:
task1 = tsk("oml", task_id = cc18$task_ids[1])
task1
##
## ── <TaskClassif> (3196x37) ─────────────────────────────────────────────────────
## • Target: class
## • Target classes: won (positive class, 52%), nowin (48%)
## • Properties: twoclass
## • Features (36):
##   • fct (36): bkblk, bknwy, bkon8, bkona, bkspr, bkxbq, bkxcr, bkxwp, blxwp,
##   bxqsq, cntxt, dsopp, dwipd, hdchk, katri, mulch, qxmsq, r2ar8, reskd, reskr,
##   rimmx, rkxwp, rxmsq, simpl, skach, skewr, skrxp, spcop, stlmt, thrsk, wkcti,
##   wkna8, wknck, wkovl, wkpos, wtoeg
Listing
While we showed how to work with objects with known IDs, another
important question is how to find the relevant IDs. This can either be
achieved through the OpenML website
or through the REST API. To access the latter, mlr3oml
provides the following listing
functions:
- list_oml_data() - Find datasets
- list_oml_tasks() - Find tasks
- list_oml_flows() - Find flows
- list_oml_runs() - Find runs
As an example, we only show the usage of list_oml_tasks(), but all others work analogously.
We can, for example, subset the tasks contained in the CC-18 even further. Below, we only select tasks whose datasets have between 0 and 10 features.
cc18_filtered = list_oml_tasks(
  data_id = cc18$data_ids,
  number_features = c(0, 10)
)
cc18_filtered[1:5, c("task_id", "name")]
##    task_id                name
##      <int>              <char>
## 1:      11       balance-scale
## 2:      15            breast-w
## 3:      18 mfeat-morphological
## 4:      23                 cmc
## 5:      37            diabetes
Note that not all possible property specifications can be directly
queried on OpenML. As the resulting tables are data.tables
containing information about the datasets, they can be further filtered
using the usual data.table syntax.
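A short sketch of such a follow-up filter is given below. It assumes that the listing contains a `task_type` column in addition to the `task_id` and `name` columns shown above, and the variable names `tasks` and `classif_tasks` are chosen for illustration. The code needs an internet connection.

```r
library(mlr3oml)
library(data.table)

# Query the tasks built on CC-18 datasets with at most 10 features
cc18 = ocl(id = 99)
tasks = list_oml_tasks(
  data_id = cc18$data_ids,
  number_features = c(0, 10)
)

# Assumption: a task_type column is part of the listing
# Keep only supervised classification tasks and two columns
classif_tasks = tasks[task_type == "Supervised Classification", .(task_id, name)]

# Sort by task id and show the first rows
head(classif_tasks[order(task_id)])
```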
Uploading
You can currently upload datasets to OpenML or create tasks and collections using the functions:
- publish_data() to upload a dataset,
- publish_task() to create a task, and
- publish_collection() to create a collection.
For this, you need an API key.
API Key
All download operations supported by this package work without an API
key, but you might get rate limited without one. For uploading to
OpenML, you need an API key. The API key can be specified via the option
mlr3oml.api_key or the environment variable
OPENMLAPIKEY (where the former has precedence over the
latter). To obtain an API key, you must create an account on
OpenML.
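Both ways of specifying the key can be sketched as follows. The string `"<your-api-key>"` is a placeholder and must be replaced with the key from your OpenML account.

```r
# Variant 1: R option, which takes precedence over the environment variable
options(mlr3oml.api_key = "<your-api-key>")

# Variant 2: environment variable, e.g. for scripts or CI jobs
Sys.setenv(OPENMLAPIKEY = "<your-api-key>")

# Check which values are currently set
getOption("mlr3oml.api_key")
Sys.getenv("OPENMLAPIKEY")
```

To avoid storing the key in a script, the environment variable can instead be defined in a user-level `.Renviron` file, which R reads at startup.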
Further Aspects
Logging
mlr3oml has its own logger, which can be accessed using
lgr::get_logger("mlr3oml"). For more information about
logging in general (such as changing the logging threshold), we refer to
the corresponding section
in the mlr3book.
Laziness
All objects accessed through mlr3oml must be downloaded
from the OpenML server. This is done lazily, which means that data is
only downloaded when it is actually accessed. To show this, we change
the logging level, which was previously set to "warn" (to
keep the output clean), to "info".
logger = lgr::get_logger("mlr3oml")
logger$set_threshold("info")
oml_data = odt(31)
# to print the object, some meta data must be downloaded
oml_data
## <OMLData:31:credit-g> (1000x21)
##  * Default target: class
To download all information associated with an object, the
$download() method can be called. This can be useful to
ensure that all information is available offline. In this case, only the
actual underlying data is downloaded, as everything else was already
implicitly accessed above.
oml_data$download()
Caching
Caching of OpenML objects is controlled by the
mlr3oml.cache option, which can be set to TRUE,
FALSE (default), or a specific folder to be used as the
cache directory. When caching is enabled, many OpenML objects are also
available offline. Note that OpenML collections are not cached, as IDs
can be added or removed.
# Set a temporary directory as the cache folder
cache_dir = tempfile()
options(mlr3oml.cache = cache_dir)
odata = odt(31)
odata
## <OMLData:31:credit-g> (1000x21)
##  * Default target: class
# When accessing the data again, nothing has to be downloaded
# because the information is loaded from the cache
odata_again = odt(31)
odata_again
## <OMLData:31:credit-g> (1000x21)
##  * Default target: class
# set back the logger
logger$set_threshold("warn")
Data Types
The datasets on OpenML are available in two different formats, namely
arff
and parquet. The
former is used by default, but this default can be changed by setting
the mlr3oml.parquet option to TRUE. It is also
possible to specify this during construction of a specific OpenML
object.
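The global default can be switched as sketched below. The variable name `old` is chosen for illustration. Storing the previous value makes it easy to restore the original setting afterwards.

```r
# Switch the default file format to parquet and remember the old setting
old = options(mlr3oml.parquet = TRUE)

# OpenML objects constructed from here on default to parquet
getOption("mlr3oml.parquet")

# Restore the previous default
options(old)
```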
While the parquet format is more efficient, arff was the original format and might therefore be considered more stable. Moreover, minor differences between the two formats can occur for a given data ID, e.g. regarding the data type.
When converting an OMLData object to an
mlr3::DataBackend using the parquet file type, the
resulting backend is an mlr3db::DataBackendDuckDB
object. For the arff file format, the resulting backend is an mlr3::DataBackendDataTable.
library(mlr3db)
odata_pq = odt(id = 31, parquet = TRUE)
backend_pq = as_data_backend(odata_pq)
class(backend_pq)
## [1] "DataBackendDuckDB" "DataBackend"       "R6"
# compare with arff
odata_arff = odt(id = 31, parquet = FALSE)
backend_arff = as_data_backend(odata_arff)
class(backend_arff)
## [1] "DataBackendDataTable" "DataBackend"          "R6"
For more information on data backends, see the corresponding section
in the mlr3book.