Chapter 6 Model Predictions
- To be consistent with snake_case, `new_data` should be used instead of `newdata`.
- The function to produce predictions should be a class-specific `predict` method with arguments `object`, `new_data`, and possibly `type`. Other arguments, such as `level`, should be standardized. The main `predict` method can internally defer to separate, unexported functions (`predict_class`, etc.).
- `type` should also come from a set of pre-defined values such as

| type | application |
|---|---|
| numeric | numeric predictions |
| class | hard class predictions |
| prob | class probabilities, survivor probabilities |
| link | glm linear predictor |
| conf_int | confidence intervals |
| pred_int | prediction intervals |
| raw | direct access to prediction function |
| param_pred | predictions across tuning parameters |
| quantile | quantile predictions |

and should be validated using `match.arg()`.
- To determine whether to return standard errors for predictions, use a `std_error` argument that takes a `TRUE`/`FALSE` value. By default, do not report standard errors or other measures of uncertainty, as these can be expensive to compute. Clearly document whether any standard errors are for confidence or prediction intervals.
Other values of `type` should be assigned by consensus.
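The conventions above can be sketched as a small S3 method. This is only an illustration under stated assumptions: the class `my_model`, its constructor, and the helpers `predict_numeric()` and `predict_conf_int()` are hypothetical names, and the model simply wraps `lm()`. Only two of the `type` values from the table are implemented.

```r
# Hypothetical model class "my_model" that wraps lm(); every name here is illustrative.
my_model <- function(formula, data) {
  structure(list(fit = stats::lm(formula, data = data)), class = "my_model")
}

# Unexported helpers that the main predict method defers to.
predict_numeric <- function(object, new_data, std_error) {
  res <- stats::predict(object$fit, newdata = new_data, se.fit = std_error)
  if (!std_error) {
    return(data.frame(.pred = unname(res)))
  }
  # For lm(), se.fit is the standard error of the fitted mean
  # (i.e., it relates to confidence intervals, not prediction intervals).
  data.frame(.pred = unname(res$fit), .std_error = unname(res$se.fit))
}

predict_conf_int <- function(object, new_data, level) {
  res <- stats::predict(object$fit, newdata = new_data,
                        interval = "confidence", level = level)
  data.frame(.pred_lower = unname(res[, "lwr"]),
             .pred_upper = unname(res[, "upr"]))
}

# Class-specific predict method with standardized argument names.
predict.my_model <- function(object, new_data,
                             type = c("numeric", "conf_int"),
                             std_error = FALSE, level = 0.95, ...) {
  type <- match.arg(type)  # errors on values outside the pre-defined set
  switch(type,
    numeric  = predict_numeric(object, new_data, std_error),
    conf_int = predict_conf_int(object, new_data, level)
  )
}

model <- my_model(mpg ~ wt, data = mtcars)
predict(model, new_data = mtcars[1:3, ], type = "numeric", std_error = TRUE)
```

The sketch returns base data frames to stay dependency-free; a package following these conventions would convert the result to a tibble, for example with `tibble::as_tibble()`.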
6.1 Input Data
- If `new_data` is not supplied, an error should be thrown. It should not default to an archived version of the training set contained in the model object.
- The data requirements for `new_data` should be the same as those for the original model fit function.
- The model outcome should never be required to be in `new_data`.
- `new_data` should be tolerant of extra columns. For example, if all variables are in some data frame `dataset`, `predict(object, dataset)` should immediately know which variables are required for prediction, check for their presence, and select only those from `dataset` before proceeding.
- The prediction code should work whether `new_data` has multiple rows or a single row.
- Predictions should not depend on which observations are present in `new_data`.
- When novel factor levels appear in the test set for factor predictors, the default behavior should be to throw an informative error. For models that can reasonably make predictions on novel factor levels, users need to explicitly specify that they want this behavior, and it’s good practice to `message()` for these prediction cases.
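The input checks listed above could be collected in a single helper that runs before any prediction code. The following is a minimal sketch, not a prescribed implementation: `prepare_new_data()` is a hypothetical function, and it assumes the predictor names (`predictors`) and training factor levels (`lvls`) were stored when the model was fit.

```r
# Hypothetical helper; `predictors` and `lvls` are assumed to be saved at fit time.
prepare_new_data <- function(new_data, predictors, lvls = list()) {
  # Never fall back to the training set: a missing `new_data` is an error.
  if (missing(new_data) || is.null(new_data)) {
    stop("`new_data` must be supplied.", call. = FALSE)
  }

  # Check that every required predictor is present.
  missing_cols <- setdiff(predictors, names(new_data))
  if (length(missing_cols) > 0) {
    stop("`new_data` is missing required columns: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }

  # Keep only the required predictors; extra columns (including the outcome)
  # are ignored. drop = FALSE keeps a data frame for one row or one column.
  new_data <- new_data[, predictors, drop = FALSE]

  # Throw an informative error on factor levels not seen during training.
  for (col in names(lvls)) {
    novel <- setdiff(unique(as.character(new_data[[col]])), c(lvls[[col]], NA))
    if (length(novel) > 0) {
      stop("Novel levels in `", col, "`: ",
           paste(novel, collapse = ", "), call. = FALSE)
    }
    new_data[[col]] <- factor(new_data[[col]], levels = lvls[[col]])
  }
  new_data
}

# Usage: an extra column and a different column order are both tolerated.
dataset <- data.frame(extra = 1:2, g = c("b", "a"), x = c(1.5, 2.5))
prepare_new_data(dataset, predictors = c("x", "g"), lvls = list(g = c("a", "b")))
```

Re-encoding each factor with the training levels also helps keep predictions independent of which observations happen to be present in `new_data`.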
6.2 Return Values
- By default, `new_data` should not be returned by the prediction function.
- The return value is a tibble with the same number of rows as the data being predicted and in the same order. This implies that `na.action` should not affect the dimensions of the outcome object (i.e., it should be ignored). The class of the tibble can be overloaded to accommodate specialized methods as long as basic data frame functionality is maintained. For observations with missing data such that a prediction cannot be generated, we recommend returning `NA`.
- The return tibble can contain extra attributes for values relevant to the prediction (e.g. `level` for intervals), but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. `arrange`, `filter`, etc.). Columns of constant values (e.g. adding `level` as a column) should be avoided.
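One way to keep the row count and order fixed when some rows cannot be predicted is to predict only on the complete rows and write the results back into an `NA`-filled vector. The sketch below assumes a hypothetical wrapper `predict_with_na()` and a user-supplied `predict_fun` that returns one numeric prediction per row it receives. It returns a base data frame to avoid dependencies, whereas the convention calls for a tibble.

```r
# Hypothetical wrapper: `predict_fun` must return one numeric value per input row.
predict_with_na <- function(new_data, predict_fun) {
  ok <- stats::complete.cases(new_data)
  # Start from NA so rows without a prediction stay in place.
  pred <- rep(NA_real_, nrow(new_data))
  if (any(ok)) {
    pred[ok] <- predict_fun(new_data[ok, , drop = FALSE])
  }
  data.frame(.pred = pred)
}

# Usage with lm(): the second row has a missing predictor.
fit <- stats::lm(mpg ~ wt, data = mtcars)
new_data <- data.frame(wt = c(2.5, NA, 3.5))
res <- predict_with_na(new_data, function(d) unname(stats::predict(fit, newdata = d)))
nrow(res) == nrow(new_data)
```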
Specific cases:
- For univariate, numeric point estimates, the column should be named `.pred`. For multivariate numeric predictions (excluding probabilities), the columns should be named `.pred_{outcome name}`.
- Class predictions should be factors with the same levels as the original outcome and named `.pred_class`.
- For class probability predictions, the columns should be named the same as the factor levels, e.g., `.pred_{level}`, and there should be as many columns as factor levels.
- If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be `.pred_lower` and `.pred_upper`. If a standard error is produced, the column should be named `.std_error`. If intervals are produced for class probabilities, the levels should be included (e.g., `.pred_lower_{level}`).
- For predictions that are not simple scalars, such as distributions or non-rectangular structures, the `.pred` column should be a list-column.
- In cases where the outcome is being directly predicted, the predictions should be on the same scale as the outcome. The same would apply to associated interval estimates. This is equivalent to `type = "response"` for generalized linear models and the like. Reasonable exceptions include estimation of the standard error of prediction (perhaps occurring on the link-level/scale of the linear predictors).
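The naming rules for class predictions can be illustrated with a small formatter. This is a sketch under assumptions: `format_class_preds()` is a hypothetical helper that takes a probability matrix with one column per outcome level, in the same order as `lvls`. The usage part fits a binomial GLM and requests `type = "response"` so the probabilities are on the outcome scale rather than the link scale.

```r
# Hypothetical formatter: `prob` has one column per outcome level, ordered as `lvls`.
format_class_preds <- function(prob, lvls) {
  stopifnot(is.matrix(prob), ncol(prob) == length(lvls))
  prob_cols <- paste0(".pred_", lvls)
  out <- as.data.frame(prob)
  names(out) <- prob_cols
  # Hard class predictions keep the full set of original outcome levels.
  out$.pred_class <- factor(lvls[max.col(prob, ties.method = "first")],
                            levels = lvls)
  out[, c(".pred_class", prob_cols), drop = FALSE]
}

# Usage: a binomial GLM models the probability of the second factor level.
car_data <- transform(
  mtcars,
  am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
)
fit <- stats::glm(am ~ wt, data = car_data, family = stats::binomial())
p_manual <- unname(stats::predict(fit, newdata = car_data[1:4, ], type = "response"))
format_class_preds(cbind(1 - p_manual, p_manual), levels(car_data$am))
```

Interval columns for class probabilities would follow the same pattern, with names built as `paste0(".pred_lower_", lvls)` and `paste0(".pred_upper_", lvls)`.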