Chapter 9 Notes

writing output

To write messages to the user, either cat() or message() can be used. Their differences:

  • cat() writes to standard output and message() writes to standard error. With some parallel processing backends, output sent to standard output by cat() does not appear in the RStudio IDE.

  • Unless coded otherwise, cat() uses the standard console formatting, whereas message() output is formatted differently by default (e.g. it is displayed in a different color in the RStudio IDE).

For this code:

library(emojifont)
message("Help! I'm trapped in the well!!")
cat("No, you're not", emoji('roll_eyes'), "\n")

the RStudio IDE output is:

[screenshot: RStudio IDE console output]

and the basic terminal output is:

[screenshot: terminal output]

This post may also be helpful in deciding between the two.


decoupling functions

For example, for some modeling method, the function foo() would be the top-level API that users see, and some other function (say, compute_foo_fit()) would do the computations. This allows different interfaces for specifying the model, each of which passes common data structures to compute_foo_fit().
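
A minimal sketch of this pattern (foo(), its methods, and compute_foo_fit() are hypothetical):

# the user-facing generic with two interfaces
foo <- function(x, ...) UseMethod("foo")

# formula interface: build the design matrix and outcome, then fit
foo.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  x  <- model.matrix(formula, mf)
  y  <- model.response(mf)
  compute_foo_fit(x, y, ...)
}

# x/y interface: pass the data through directly
foo.default <- function(x, y, ...) {
  compute_foo_fit(as.matrix(x), y, ...)
}

# every interface converges here; this function does the computations
compute_foo_fit <- function(x, y, ...) {
  # fitting code would go here
  structure(list(coef = numeric(ncol(x))), class = "foo_fit")
}

With this structure, both foo(mpg ~ wt, data = mtcars) and foo(mtcars["wt"], mtcars$mpg) return the same class of fit object.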


appropriate data structures

For example:

  • Categorical data should be in factor variables (as opposed to binary or integer representations).
  • Rectangular data structures that allow for mixed data types should always be used, even when the envisioned application would only make sense when the data are of a single type.
  • For strictly univariate response models, vector input is acceptable for the outcome argument.
  • Censored data should follow the survival::Surv convention.
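
A brief illustration of the factor and survival::Surv() conventions (the data here are purely illustrative):

library(survival)

# categorical predictors as factors, not integer or 0/1 codes, inside a
# rectangular structure that allows mixed types
dat <- data.frame(
  group = factor(c("a", "b", "a", "b")),
  time  = c(5, 10, 2, 8),
  event = c(1, 0, 1, 1)
)

# censored outcomes follow the survival::Surv() convention
y <- Surv(dat$time, dat$event)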


top-level design examples

For example:

  • Suppose a model can only fit numeric or two-class outcomes and uses maximum likelihood. Instead of providing the user with a distribution option that is either “Gaussian” or “Binomial”, determine this from the type of the outcome data (numeric or factor) and set it internally. This prevents the user from making a mistake that could have been avoided.

  • If a model parameter is bounded by some aspect of the data, such as the number of rows or columns, coerce bad values to this range (e.g. mtry = min(mtry, ncol(x))) with an accompanying warning when this is critical information.
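
A hedged sketch of both ideas (the function and its arguments are hypothetical):

foo <- function(x, y, mtry = floor(sqrt(ncol(x))), ...) {
  # infer the mode from the outcome type instead of asking the user for it
  mode <- if (is.factor(y)) "classification" else "regression"

  # coerce an out-of-range value instead of failing, and tell the user
  if (mtry > ncol(x)) {
    warning("`mtry` was ", mtry, " but only ", ncol(x),
            " predictors are available; using ", ncol(x), ".")
    mtry <- ncol(x)
  }

  # ... fit the model using `mode` and `mtry` ...
}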


avoid common parameter names

For example, control is an argument name used by many functions, so a function-specific argument name (e.g. foo_control) is advisable in these situations.
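
For instance, a small sketch with a specifically named control argument (all names are hypothetical):

# constructor for the control object
control_foo <- function(maxiter = 100L, tol = 1e-6) {
  list(maxiter = maxiter, tol = tol)
}

# the argument name is specific to this function, so it does not collide with
# the generic `control` arguments used by many other modeling functions
foo <- function(x, y, foo_control = control_foo(), ...) {
  stopifnot(is.list(foo_control))
  # ... use foo_control$maxiter and foo_control$tol while fitting ...
}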


Checking of ...

The ellipsis package (whose checks are now also available in rlang, e.g. rlang::check_dots_used()) can be used to verify that arguments passed through ... are actually used.
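
For instance, a minimal sketch using rlang (foo() is hypothetical):

library(rlang)

foo <- function(x, ...) {
  # signal a problem when something passed through ... is never used
  check_dots_used()
  mean(x, ...)
}

foo(1:10, trim = 0.1)      # fine: `trim` is used by mean()
try(foo(1:10, trm = 0.1))  # flagged: `trm` is never used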


standardize names

For example, many functions use some variation of level for confidence levels (as opposed to alpha). The names in Chapter 7 are preferable for the specific context.


discussing new names

A good venue for this discussion is RStudio Community.


avoid computing on call objects

Historically, the call object that results from a model fit was considered a good method to capture details of the existing model fit. This object could be parsed and manipulated for the purpose of continuing or revising the model between function calls.

This can be problematic because a call is evaluated relative to an environment, yet the call object itself carries no reference to the environment where it was originally evaluated. If the call is modified and re-evaluated, the current environment may be inappropriate, and errors can result because the required objects are not visible there.

Note that the internals of common modeling functions, such as lm, do exactly this. However, these manipulations occur within the function call so that the environment is the same.
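
A small illustration of the failure mode, using lm() and a hypothetical wrapper function:

fit_in_function <- function() {
  local_data <- mtcars[mtcars$cyl != 6, ]
  lm(mpg ~ wt, data = local_data)
}

fit <- fit_in_function()
fit$call
## lm(formula = mpg ~ wt, data = local_data)

# re-evaluating the stored call from the global environment fails because
# `local_data` only existed inside fit_in_function()
try(eval(fit$call))  # error: object 'local_data' not found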


minimal object retention

For example, if summary statistics are computed from the model object, derive them on the fly during the execution of the summary() method. Avoid saving these statistics in the model object unless they depend on ephemeral values that are generated during the model fit and would not otherwise be retained.
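
A minimal sketch (the foo class and its stored components are hypothetical):

# compute summary statistics when they are requested rather than storing them
# in the fitted object
summary.foo <- function(object, ...) {
  resid <- object$y - object$fitted
  list(
    rmse = sqrt(mean(resid^2)),
    mae  = mean(abs(resid))
  )
}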


consistent number of rows

This should enable the ability to bind the results with the original data:

cbind(data, predict(model, data))
# or, with dplyr loaded:
predict(model, data) %>% bind_cols(data)


extra tibble classes

Adding a new class to a tibble object might cause subsequent errors in dplyr operations. This code is a good example of how to maintain dplyr compatibility.


list column output

Some examples:

posterior distributions

If a posterior distribution is returned for each new sample (i.e. each row being predicted), each element of the list column can be a tibble with as many rows as there are draws in the posterior distribution.

multiple hyperparameters

When a predict method produces values over multiple tuning parameter values (e.g. glmnet), each list column element has a row for every combination of the tuning parameter values being varied (e.g. lambda in glmnet).

survivor probability predictions

In time-to-event models, where survivor probabilities are produced for a set of time points, the returned tibble has a column for .time. Other column names should conform to the standards here (e.g. .pred_lower if intervals are also returned). As an example from ?flexsurv::flexsurvreg:

library(flexsurv)
data(ovarian)
fitg <- flexsurvreg(formula = Surv(futime, fustat) ~ age, data = ovarian, dist = "gengamma")

For each new sample, this model can make probabilistic predictions at a number of user-specified time points.

preds <- summary(fitg, newdata = ovarian[1:2, "age", drop = FALSE], t = c(100, 200, 300))
preds
## age=72.3315 
##   time       est        lcl       ucl
## 1  100 0.7988680 0.51091386 0.9648924
## 2  200 0.4849180 0.15909221 0.7913010
## 3  300 0.2784226 0.02375807 0.6402882
## 
## age=74.4932 
##   time       est          lcl       ucl
## 1  100 0.7278338 0.3557049000 0.9594533
## 2  200 0.3856737 0.0408961893 0.7646142
## 3  300 0.1964611 0.0003635201 0.6010698

A list of data frames is not very helpful. A better representation is a data frame with a list column, where every element is a tibble holding that sample's predictions in a tidy format.
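
One way to construct such an object from the preds list above (a sketch):

tidy_preds <- tibble::tibble(
  .pred = lapply(
    preds,
    function(x)
      tibble::tibble(
        .time = x$time,
        .pred = x$est,
        .pred_lower = x$lcl,
        .pred_upper = x$ucl
      )
  )
)

The resulting object: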

tidy_preds
## # A tibble: 2 x 1
##   .pred           
##   <named list>    
## 1 <tibble [3 × 4]>
## 2 <tibble [3 × 4]>
# predictions on the first subject at 3 time points
tidy_preds$.pred[[1]]
## # A tibble: 3 x 4
##   .time .pred .pred_lower .pred_upper
##   <dbl> <dbl>       <dbl>       <dbl>
## 1   100 0.799      0.511        0.965
## 2   200 0.485      0.159        0.791
## 3   300 0.278      0.0238       0.640

If a single data frame is needed, it is easy to make the conversion:

tidyr::unnest(tidy_preds, cols = c(.pred))
## # A tibble: 6 x 4
##   .time .pred .pred_lower .pred_upper
##   <dbl> <dbl>       <dbl>       <dbl>
## 1   100 0.799    0.511          0.965
## 2   200 0.485    0.159          0.791
## 3   300 0.278    0.0238         0.640
## 4   100 0.728    0.356          0.959
## 5   200 0.386    0.0409         0.765
## 6   300 0.196    0.000364       0.601

quantile predictions

When using quantile regression, one might make the median the default quantile that is predicted. If multiple percentiles are requested, each element of .pred would be a tibble with a column for the predictions and another for the quantile (named .quantile).
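
A rough sketch of that structure, using quantreg::rq() purely for illustration:

library(quantreg)

taus <- c(0.1, 0.5, 0.9)
fit <- rq(mpg ~ wt, data = mtcars, tau = taus)

# the raw predictions come back as a matrix with one column per quantile;
# reshape them into the suggested list column format
raw <- predict(fit, newdata = head(mtcars))
quant_preds <- tibble::tibble(
  .pred = lapply(
    seq_len(nrow(raw)),
    function(i) tibble::tibble(.quantile = taus, .pred = unname(raw[i, ]))
  )
)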


where to parallelize?

For example, if a random forest model is being created, the parallelism could simultaneously fit the trees in the ensemble or parallelize the search for optimal splits within each tree. All other things being equal, the former is preferred.
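
For instance, a rough sketch of the coarse-grained approach using foreach (a bagged tree ensemble stands in for a random forest here):

library(foreach)
library(doParallel)
registerDoParallel(cores = 2)

# fit whole trees of the ensemble in parallel rather than parallelizing the
# split search within a single tree
trees <- foreach(i = 1:50, .packages = "rpart") %dopar% {
  in_bag <- sample(nrow(mtcars), replace = TRUE)
  rpart::rpart(mpg ~ ., data = mtcars[in_bag, ])
}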


controlling random numbers in parallel workers

One simple approach when parallelism is conducted using R subprocesses (e.g. via foreach or futures) is to pre-compute seeds and have a seed argument to the function being invoked. The function can immediately set the seed from the argument when it begins execution.
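
A minimal sketch of this approach with foreach (the worker function is hypothetical):

library(foreach)
library(doParallel)
registerDoParallel(cores = 2)

# pre-compute one seed per task in the main R session
seeds <- sample.int(10^6, 10)

fit_one <- function(seed) {
  # set the seed immediately so the result does not depend on which worker
  # runs the task or in what order
  set.seed(seed)
  mean(rnorm(100))
}

results <- foreach(s = seeds, .combine = c) %dopar% fit_one(s)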


predictions shouldn’t depend on new_data

Spline expansions and similar terms can behave oddly when their package has not been fully loaded: the prediction machinery may not recognize a term written as splines::ns() and will re-estimate the spline basis from the new data. For example:

fit <- lm(mpg ~ disp + hp + splines::ns(drat, 2), mtcars)

hp <- head(predict(fit, mtcars))
ph <- predict(fit, head(mtcars))

all.equal(hp, ph)

Here the two results are not equal; all.equal() reports a mean relative difference rather than TRUE.

Loading the splines package fixes this:

library(splines)
fit <- lm(mpg ~ disp + hp + ns(drat, 2), mtcars)

hp <- head(predict(fit, mtcars))
ph <- predict(fit, head(mtcars))

all.equal(hp, ph)
## [1] TRUE

A general way to check for errors like this is to check that predict() and head() commute.
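
A small helper along these lines (the name and interface are hypothetical) could be used in a package's tests:

# check that predicting then subsetting gives the same result as subsetting
# then predicting
check_predict_consistency <- function(fit, data, n = 6) {
  isTRUE(all.equal(
    head(predict(fit, data), n),
    predict(fit, head(data, n))
  ))
}

check_predict_consistency(fit, mtcars)  # TRUE for the corrected fit above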


models should have meaningful classes

This makes it easier to provide methods that are only meaningfully defined for some of the models accessible via the convenience interface. For example, consider generalized linear models: it makes sense to define hard class predictions for logistic regression but not for linear regression. Defining prediction methods specific to linear and logistic regression is easiest if each is represented by an R object with a different class.
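
A short sketch of this idea (the foo_glm() interface and its subclasses are hypothetical):

# give the fitted object a subclass that reflects the type of model that was fit
foo_glm <- function(x, y, ...) {
  subclass <- if (is.factor(y)) "foo_logistic" else "foo_linear"
  fit <- list(x = x, y = y)  # fitting results would go here
  class(fit) <- c(subclass, "foo_glm")
  fit
}

# hard class predictions only make sense for the logistic subclass
predict_class <- function(object, ...) UseMethod("predict_class")

predict_class.foo_logistic <- function(object, new_data, ...) {
  # placeholder: always predicts the first level of the outcome
  factor(rep(levels(object$y)[1], nrow(new_data)), levels = levels(object$y))
}

predict_class.default <- function(object, ...) {
  stop("Class predictions are only defined for classification models.")
}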
