Chapter 3 Function Interfaces

We distinguish between “top-level”/“user-facing” APIs and “low-level”/“computational” APIs. The former are the interface between the users of the function (with their needs) and the code that does the estimation/training activities.

When creating model objects, the computational code that fits the model should be decoupled from the interface code used to specify the model. This allows different interfaces to be used to specify the model, each of which passes common data structures on to the computational code.
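
As a minimal sketch of this decoupling (not a prescribed implementation), the example below defines a generic with data frame and formula methods that both hand a numeric matrix to a single fitting routine. The names `simple_lm` and `simple_lm_impl` are hypothetical and exist only for illustration.

```r
# Hypothetical low-level fitting code: accepts only a numeric matrix and vector.
simple_lm_impl <- function(x, y) {
  # Add an intercept column and solve the least-squares problem via QR.
  x <- cbind(`(Intercept)` = 1, x)
  coefs <- qr.coef(qr(x), y)
  structure(list(coefficients = coefs), class = "simple_lm")
}

# User-facing generic with one method per interface.
simple_lm <- function(x, ...) {
  UseMethod("simple_lm")
}

simple_lm.default <- function(x, ...) {
  stop("`simple_lm()` is not defined for objects of class '",
       class(x)[1], "'.", call. = FALSE)
}

# Data frame interface: error trap bad arguments, encode, then hand off.
simple_lm.data.frame <- function(x, y, ...) {
  if (!all(vapply(x, is.numeric, logical(1)))) {
    stop("All predictors in `x` must be numeric; use the formula method ",
         "for categorical predictors.", call. = FALSE)
  }
  if (!is.numeric(y) || length(y) != nrow(x)) {
    stop("`y` must be a numeric vector with one value per row of `x`.",
         call. = FALSE)
  }
  simple_lm_impl(as.matrix(x), y)
}

# Formula interface: model.matrix() creates the dummy variables for the user.
simple_lm.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  x <- model.matrix(attr(mf, "terms"), mf)
  # The computational code adds its own intercept, so drop this one.
  x <- x[, colnames(x) != "(Intercept)", drop = FALSE]
  simple_lm_impl(x, model.response(mf))
}

fit_df <- simple_lm(mtcars[, c("wt", "hp")], mtcars$mpg)
fit_formula <- simple_lm(mpg ~ wt + hp, mtcars)
```

Both calls reach the same computational code, so a new interface (for example, a recipe method) would only need to produce the same matrix and vector.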

3.1 User-Facing Functions

  • At a minimum, the top-level model function should be a generic with methods for data frames, formulas, and possibly recipes. These methods error trap bad arguments, format/encode the data, then pass it along to the lower-level computational code.

  • Do not require users to create dummy variables from their categorical predictors. Provide a formula and/or recipe interface to your model to do this (see the next item). Other methods should error trap when they receive qualitative data that the computational code cannot use.

  • These top-level functions should use proper S3 dispatch (e.g. dispatch appropriately to foo.data.frame, foo.formula, and so on).

  • Only user-appropriate data structures should be accepted by the user-facing function. The underlying computational code should make appropriate transformations to computationally appropriate formats/encodings.

  • Design the top-level code for humans. This includes using sensible defaults and protecting against common errors. Design your top-level interface code so that people will not hate you.

  • Parameters that users will commonly modify should be main arguments to the top-level function. Others, especially those that control computational aspects of the fit, should be contained in a control object.

  • If your model passes ... to another modeling function, choose the names of your top-level function arguments so that they do not conflict with the argument names of the underlying function.

  • If possible, check that the arguments passed to ... are actual arguments of the function(s) receiving the ....

  • Common arguments should use standardized names.
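
The items on control objects and on checking ... can be sketched as follows. This is one possible approach under stated assumptions: `control_my_model()`, `check_dots()`, and `my_model()` are hypothetical names, and `stats::optim()` merely stands in for whatever underlying function receives the ....

```r
# Hypothetical control constructor for rarely changed computational settings.
control_my_model <- function(max_iter = 100L, tol = 1e-8) {
  stopifnot(
    is.numeric(max_iter), length(max_iter) == 1, max_iter >= 1,
    is.numeric(tol), length(tol) == 1, tol > 0
  )
  structure(
    list(max_iter = as.integer(max_iter), tol = tol),
    class = "control_my_model"
  )
}

# Hypothetical helper: every element of `dots` must be named and must match a
# formal argument of `fn` that the wrapper does not already set itself.
check_dots <- function(dots, fn, reserved = character()) {
  dot_names <- names(dots)
  if (length(dots) > 0 && (is.null(dot_names) || any(dot_names == ""))) {
    stop("All arguments passed via `...` must be named.", call. = FALSE)
  }
  allowed <- setdiff(names(formals(fn)), c("...", reserved))
  bad <- setdiff(dot_names, allowed)
  if (length(bad) > 0) {
    stop("Unknown argument(s) passed via `...`: ",
         paste0("`", bad, "`", collapse = ", "), call. = FALSE)
  }
  invisible(dots)
}

# Hypothetical top-level function: `penalty` is a main argument, computational
# settings live in `control`, and `...` is forwarded to stats::optim().
my_model <- function(x, y, penalty = 0, control = control_my_model(), ...) {
  stopifnot(inherits(control, "control_my_model"))
  # `par`, `fn`, and `control` are set below, so users may not pass them.
  check_dots(list(...), stats::optim, reserved = c("par", "fn", "control"))
  loss <- function(beta) sum((y - x %*% beta)^2) + penalty * sum(beta^2)
  stats::optim(
    par = rep(0, ncol(x)), fn = loss, ...,
    control = list(maxit = control$max_iter, reltol = control$tol)
  )
}

set.seed(1)
x <- cbind(1, rnorm(20))
y <- rnorm(20)
fit <- my_model(x, y, penalty = 0.1, method = "BFGS")
```

Without the check, a misspelled argument such as `metod = "BFGS"` would be forwarded by `optim()` to the loss function and fail there with a much less helpful message.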

3.2 Computational Functions

  • A test set should never be required when fitting a model.

  • If internal resampling is involved in the fitting process, there is a strong preference for using tidymodels infrastructure so that a common interface (and set of choices) can be used. If this cannot be done (e.g. the resampling occurs in C code), there should be some ability to pass in integer values that define the resamples. In this way, the internal sampling is reproducible. The same is true for other infrastructure (e.g. yardstick for performance measures).

  • When possible, do not reimplement computations that have been done well elsewhere (tidy or not). For example, kernel methods should use the infrastructure in kernlab, exponential family distribution computations should use those in ?family etc.

  • For modeling packages that use random numbers, setting the seed in R should control how random numbers are generated internally. At worst, a random number seed for non-R code (e.g. C, Java) should be an argument to the main modeling function.

  • Computational code should (almost) always use X[ , , drop = FALSE] when subsetting matrices (and non-tibble data frames) to make sure that the 2D structure is maintained.

  • Use try() or similar methods (e.g. tryCatch()) to catch errors from lower-level code so that failures are reported with intelligible error messages.

  • Using the call object for post-estimation activities is discouraged.
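
A few of the computational items above can be illustrated with base R alone. The helpers `safe_solve()` and `boot_mean()` are hypothetical and are meant only as a rough sketch of the ideas, not as recommended implementations.

```r
# Selecting one column without `drop = FALSE` silently returns a vector.
x <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
dim(x[, "a"])                # NULL
dim(x[, "a", drop = FALSE])  # 3 1

# Hypothetical wrapper that turns a low-level failure into a clearer message.
safe_solve <- function(x, y) {
  tryCatch(
    solve(crossprod(x), crossprod(x, y)),
    error = function(e) {
      stop("The model could not be fit; the predictors may be linearly ",
           "dependent. Original error: ", conditionMessage(e), call. = FALSE)
    }
  )
}

# Hypothetical resampling code that relies only on R's random number
# generator, so set.seed() makes the results reproducible.
boot_mean <- function(y, times = 25) {
  vapply(
    seq_len(times),
    function(i) mean(sample(y, replace = TRUE)),
    numeric(1)
  )
}

set.seed(123)
first <- boot_mean(mtcars$mpg)
set.seed(123)
second <- boot_mean(mtcars$mpg)
identical(first, second)
```

If the random numbers were instead generated in non-R code, the seed would need to be exposed as an argument of the main modeling function to get the same reproducibility.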