task = TaskClassif$new("xxx") # Objects
task$new()                    # Methods
task$feature_names            # Fields
Contents
- Introduction
- Syntax
- Basic modeling
- Resampling
- Benchmarking
Introduction
Who am I?
- Graduate School of Public Health, SNU (2019.03 ~ 2021.02)
- Seoul National University Bundang Hospital (2021.06 ~ )
- Data (NHIS, MIMIC-IV, Registry data, KNHANES …)
- Comento mentor (R for healthcare) (2022.07 ~ )
ML framework in R
What is mlr3?
- mlr3: Machine Learning in R 3
- mlr3 & mlr3verse
Why mlr3?
- National Health Insurance System Data (NHIS-HEALS, NHIS-NSC)
- dplyr \(\rightarrow\) data.table
- Python : scikit-learn = R : ??
- mlr3: a data.table based package
Syntax
mlr3 vs tidymodels
Core 1. R6
Object Oriented Programming (OOP)
- Objects: foo = bar$new()
- Methods: $new()
- Fields: $baz
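As a minimal sketch of the R6 pattern these bullets describe (the `Counter` class, its `count` field, and its `add()` method are invented for illustration):

```r
library(R6)

# A toy R6 class: $new() constructs the object,
# $add() is a method, $count is a field
Counter = R6Class("Counter",
  public = list(
    count = 0,                  # field
    add = function(x = 1) {     # method
      self$count = self$count + x
      invisible(self)
    }
  )
)

counter = Counter$new()  # object
counter$add(5)           # method call
counter$count            # field access: 5
```

The same `object$method()` / `object$field` access pattern appears throughout mlr3's tasks, learners, and measures.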
Core 2. data.table
DT[i >= 10]           # filter rows
DT[, .(X, Y)]         # select columns
DT[, mean(X), by = Y] # aggregate by group
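A runnable version of the three idioms above (the toy table and its columns are invented for illustration):

```r
library(data.table)

# Toy data: a numeric column X grouped by Y
DT = data.table(i = 1:20, X = 1:20, Y = rep(c("a", "b"), 10))

DT[i >= 10]           # filter rows: keeps rows 10..20
DT[, .(X, Y)]         # select columns X and Y
DT[, mean(X), by = Y] # aggregate: mean of X per group of Y
```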
Utils 1. Dictionary
# Getting a specific object with `$get(key)`
mlr_learners$get("regr.rpart")
<LearnerRegrRpart:regr.rpart>: Regression Tree
* Model: -
* Parameters: xval=0
* Packages: mlr3, rpart
* Predict Types: [response]
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, selected_features, weights
# Searching objects with $keys()
mlr_measures$keys() |> head()
[1] "aic" "bic" "classif.acc" "classif.auc"
[5] "classif.bacc" "classif.bbrier"
# OR with `as.data.table()`
as.data.table(mlr_learners) |> head()
key | label | task_type | feature_types | packages | properties | predict_types |
---|---|---|---|---|---|---|
classif.cv_glmnet | NA | classif | logical, integer, numeric | mlr3 , mlr3learners, glmnet | multiclass , selected_features, twoclass , weights | response, prob |
classif.debug | Debug Learner for Classification | classif | logical , integer , numeric , character, factor , ordered | mlr3 | hotstart_forward, missings , multiclass , twoclass | response, prob |
classif.featureless | Featureless Classification Learner | classif | logical , integer , numeric , character, factor , ordered , POSIXct | mlr3 | featureless , importance , missings , multiclass , selected_features, twoclass | response, prob |
classif.glmnet | NA | classif | logical, integer, numeric | mlr3 , mlr3learners, glmnet | multiclass, twoclass , weights | response, prob |
classif.kknn | NA | classif | logical, integer, numeric, factor , ordered | mlr3 , mlr3learners, kknn | multiclass, twoclass | response, prob |
classif.lda | NA | classif | logical, integer, numeric, factor , ordered | mlr3 , mlr3learners, MASS | multiclass, twoclass , weights | response, prob |
Utils 2. Sugar functions
R6 class \(\rightarrow\) S3-type functions
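For example, the R6 dictionary access and the sugar function below construct the same learner (a sketch assuming mlr3 is loaded):

```r
library(mlr3)

# R6 dictionary access
learner1 = mlr_learners$get("classif.rpart")

# Equivalent sugar function
learner2 = lrn("classif.rpart")

# Both are LearnerClassifRpart objects
identical(class(learner1), class(learner2))  # TRUE
```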
Utils 3. mlr3viz
autoplot(): visualization
autoplot(pred)
autoplot(pred, type="roc")
Basic modeling
Ask ChatGPT!
1. Tasks
- Objects with data and metadata
- Default datasets
- Dictionary: mlr_tasks
- Sugar function: tsk()
# R6 method
# mlr_tasks$get("titanic")
# Sugar function
task = tsk("german_credit")
Or external data as task
- as_task_regr(): regression
- as_task_classif(): classification
- as_task_clust(): clustering
task_mtcars = as_task_regr(mtcars, target = "mpg")
task_mtcars
<TaskRegr:mtcars> (32 x 11)
* Target: mpg
* Properties: -
* Features (10):
- dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
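The classification counterpart works the same way; a sketch on the built-in iris data:

```r
library(mlr3)

# Convert a data.frame into a classification task;
# every non-target column becomes a feature
task_iris = as_task_classif(iris, target = "Species")

task_iris$nrow          # 150 rows
task_iris$feature_names # the four measurement columns
```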
Fields of tasks
::: {.fragment}
- Feature names
task$feature_names
[1] "age" "amount"
[3] "credit_history" "duration"
[5] "employment_duration" "foreign_worker"
[7] "housing" "installment_rate"
[9] "job" "number_credits"
[11] "other_debtors" "other_installment_plans"
[13] "people_liable" "personal_status_sex"
[15] "present_residence" "property"
[17] "purpose" "savings"
[19] "status" "telephone"
:::
- Target names
task$target_names
[1] "credit_risk"
- Target classes
task$class_names
[1] "good" "bad"
2. Learners
- ML algorithms
- Dictionary: mlr_learners
- Sugar function: lrn()
- regression (regr.~), classification (classif.~), and clustering (clust.~)
library(mlr3learners)
Extra learners
- available only through GitHub, not CRAN
- e.g., LightGBM
# remotes::install_github("mlr-org/mlr3extralearners@*release")
library(mlr3extralearners)
$train(), $predict()
Confusion matrix
prediction$confusion
truth
response good bad
good 184 45
bad 26 45
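The train/predict loop behind that confusion matrix might look like this sketch (the seed and the 70/30 split ratio are assumptions):

```r
library(mlr3)

task = tsk("german_credit")
learner = lrn("classif.rpart")

# Hold out 30% of rows for testing (assumed split)
set.seed(1)
split = partition(task, ratio = 0.7)

learner$train(task, row_ids = split$train)       # fit on training rows
prediction = learner$predict(task, row_ids = split$test)  # predict held-out rows

prediction$confusion  # cross-tab of truth vs. response
```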
Or with mlr3viz
autoplot(prediction)
Hyperparameter
# with learner
learner = lrn("classif.rpart", maxdepth = 1)
# Or
learner$param_set$set_values(xval = 2, maxdepth=3, cp=.5)
learner$param_set$values
$xval
[1] 2
$maxdepth
[1] 3
$cp
[1] 0.5
Setting hyperparameters
$param_set of learners
- each parameter's class, lower, upper
as.data.table(learner$param_set) |> head()
id | class | lower | upper | levels | nlevels | is_bounded | special_vals | default | storage_type | tags |
---|---|---|---|---|---|---|---|---|---|---|
cp | ParamDbl | 0 | 1 | NULL | Inf | TRUE | NULL | 0.01 | numeric | train |
keep_model | ParamLgl | NA | NA | TRUE, FALSE | 2 | TRUE | NULL | FALSE | logical | train |
maxcompete | ParamInt | 0 | Inf | NULL | Inf | FALSE | NULL | 4 | integer | train |
maxdepth | ParamInt | 1 | 30 | NULL | 30 | TRUE | NULL | 30 | integer | train |
maxsurrogate | ParamInt | 0 | Inf | NULL | Inf | FALSE | NULL | 5 | integer | train |
minbucket | ParamInt | 1 | Inf | NULL | Inf | FALSE | NULL | <environment: 0x10e5fea30> | integer | train |
3. Measures
- Evaluation of performance
- Dictionary: mlr_measures
- Sugar function: msr(), msrs()
- classif.~, regr.~
- $score()
as.data.table(mlr_measures) |> head()
key | label | task_type | packages | predict_type | task_properties |
---|---|---|---|---|---|
aic | Akaike Information Criterion | NA | mlr3 | NA | |
bic | Bayesian Information Criterion | NA | mlr3 | NA | |
classif.acc | Classification Accuracy | classif | mlr3 , mlr3measures | response | |
classif.auc | Area Under the ROC Curve | classif | mlr3 , mlr3measures | prob | twoclass |
classif.bacc | Balanced Accuracy | classif | mlr3 , mlr3measures | response | |
classif.bbrier | Binary Brier Score | classif | mlr3 , mlr3measures | prob | twoclass |
msr(): a single performance measure
measure = msr("classif.acc")
prediction$score(measure)
classif.acc
0.7633333
msrs(): multiple performance measures
# Multiple measurements
measures = msrs(c("classif.acc", "classif.ppv", "classif.npv", "classif.auc"))
prediction$score(measures)
classif.acc classif.ppv classif.npv classif.auc
0.7633333 0.8034934 0.6338028 0.7558730
Resampling
Concept of Resampling
- Split available data into multiple training and test sets
- Reliable performance estimates
- Prevents overfitting
tidymodels vs mlr3
- Dictionary: mlr_resamplings
- Sugar function: rsmp()
as.data.table(mlr_resamplings)
key | label | params | iters |
---|---|---|---|
bootstrap | Bootstrap | ratio , repeats | 30 |
custom | Custom Splits | NA | NA |
custom_cv | Custom Split Cross-Validation | NA | NA |
cv | Cross-Validation | folds | 10 |
holdout | Holdout | ratio | 1 |
insample | Insample Resampling | NA | 1 |
loo | Leave-One-Out | NA | NA |
repeated_cv | Repeated Cross-Validation | folds , repeats | 100 |
subsampling | Subsampling | ratio , repeats | 30 |
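A resampling object only defines the scheme; it yields concrete row indices once instantiated on a task. A small sketch:

```r
library(mlr3)

task = tsk("german_credit")
cv3 = rsmp("cv", folds = 3)

cv3$instantiate(task)       # draw the actual splits for this task
cv3$train_set(1) |> head()  # row ids in fold 1's training set
cv3$test_set(1) |> head()   # row ids in fold 1's test set
```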
resample(): initiate resampling
$aggregate(): aggregate resampling performance
task = tsk("german_credit")
learner = lrn("classif.ranger", predict_type="prob")
resample = rsmp("cv", folds=10)
rr = resample(task, learner, resample, store_models = TRUE)
measures = msrs(c("classif.acc","classif.ppv","classif.npv","classif.auc"))
rr$aggregate(measures)
classif.acc classif.ppv classif.npv classif.auc
0.7710000 0.7890524 0.6956910 0.7979774
Resampling result
autoplot(rr, type="boxplot", measure = msr("classif.acc"))
autoplot(rr, type="histogram", measure = msr("classif.acc"))
Benchmarking
tidymodels vs mlr3
Benchmarking
- Comparison of multiple learners on a single task (or multiple tasks).
benchmark_grid(): design a benchmark
tasks = tsks(c("german_credit", "sonar", "breast_cancer"))
learners = list(
lrn("classif.log_reg", predict_type="prob", id="LR"),
lrn("classif.rpart", predict_type="prob", id="DT"),
lrn("classif.ranger", predict_type="prob", id="RF")
)
cv5 = rsmp("cv", folds = 5)
design = benchmark_grid(
  tasks = tasks,
  learners = learners,
  resamplings = cv5)
benchmark(): execute benchmarking
bmr = benchmark(design)
measures = msrs(c("classif.acc", "classif.ppv", "classif.npv", "classif.auc"))
as.data.table(bmr$aggregate(measures))[, -c("nr", "resample_result", "resampling_id", "iters")] |> DT()
task_id | learner_id | classif.acc | classif.ppv | classif.npv | classif.auc |
---|---|---|---|---|---|
german_credit | LR | 0.7540000 | 0.7959935 | 0.6128794 | 0.7682786 |
german_credit | DT | 0.7220000 | 0.7720000 | 0.5715187 | 0.7009023 |
german_credit | RF | 0.7670000 | 0.7866093 | 0.6820459 | 0.7916496 |
sonar | LR | 0.7027875 | 0.7229497 | 0.6805154 | 0.7122449 |
sonar | DT | 0.7262485 | 0.7250771 | 0.7382659 | 0.7524838 |
sonar | RF | 0.8174216 | 0.8101012 | 0.8425397 | 0.9232502 |
breast_cancer | LR | 0.9252791 | 0.9361270 | 0.9195608 | 0.9418515 |
breast_cancer | DT | 0.9502362 | 0.9167371 | 0.9675106 | 0.9543396 |
breast_cancer | RF | 0.9751181 | 0.9549859 | 0.9860113 | 0.9938067 |
Result
task = tsk("german_credit")
learners = list(
lrn("classif.log_reg", predict_type="prob"),
lrn("classif.rpart", predict_type="prob"),
lrn("classif.ranger", predict_type="prob")
)
cv10 = rsmp("cv", folds=10)
design = benchmark_grid(
tasks = task,
learners = learners,
resamplings = cv10)
bmr = benchmark(design)
autoplot(bmr, measure = msr("classif.auc"))
ROC & PRC
autoplot(bmr, type = "roc")
autoplot(bmr, type = "prc")
More about mlr3
- Hyperparameter optimization
- Feature selection
- ML pipelines
Summary
mlr3: an R6, data.table based ML framework
- Sugar function + Dictionary
- Task, Learner, Measure
- Resampling
- Benchmarking
- Still in development (ver 0.16.0)
- A great textbook: mlr3book