It uses 'tidyeval' and 'dplyr' to run multiple cycles of kmeans calculations, expressed in dplyr formulas, until the optimal centers are found.

simple_kmeans_db(df, ..., centers = 3, max_repeats = 100,
  initial_kmeans = NULL, safeguard_file = "kmeans.csv",
  verbose = TRUE)

Arguments

df

A local or remote data frame.

...

A list of variables to be used in the kmeans algorithm

centers

The number of centers. Defaults to 3.

max_repeats

The maximum number of cycles to run. Defaults to 100.

initial_kmeans

A local data frame with initial centroid values. Defaults to NULL.

safeguard_file

Each cycle will update a file specified in this argument with the current centers. Defaults to 'kmeans.csv'. Pass NULL if no file is desired.

verbose

Indicates whether a progress bar is displayed while the model is being fitted. Defaults to TRUE.

Details

Because each cycle is an independent 'dplyr' operation (or SQL operation, if using a remote source), the latest centroid data frame is saved to the parent environment as `current_kmeans` in case the process needs to be canceled and restarted later. Passing `current_kmeans` as `initial_kmeans` allows the operation to pick up where it left off.
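To make the idea of a "cycle" concrete, here is a minimal sketch of how one kmeans cycle could be written with dplyr verbs and repeated until the centers stop moving. It is an illustration only, not the package's implementation: `kmeans_cycle()`, the `row_id` column, the use of only `mpg` and `wt`, and the choice of the first three rows as starting centers are all assumptions. It also assumes dplyr 1.1.0 or later, for `cross_join()`.

```r
library(dplyr)

# Hypothetical helper: one assignment + update step, using dplyr verbs only.
kmeans_cycle <- function(df, centers) {
  df %>%
    cross_join(centers) %>%                            # pair every row with every center
    mutate(dist = (mpg - k_mpg)^2 + (wt - k_wt)^2) %>% # squared Euclidean distance
    group_by(row_id) %>%
    slice_min(dist, n = 1, with_ties = FALSE) %>%      # keep the nearest center per row
    ungroup() %>%
    group_by(k_center) %>%
    summarise(k_mpg = mean(mpg), k_wt = mean(wt), .groups = "drop")
}

# Data with a row identifier (assumed column name) and two variables.
data <- as_tibble(mtcars) %>%
  mutate(row_id = row_number()) %>%
  select(row_id, mpg, wt)

# Starting centers: the first three rows (an arbitrary choice for this sketch).
centers <- data %>%
  slice(1:3) %>%
  transmute(k_center = paste0("center_", row_number()), k_mpg = mpg, k_wt = wt)

# Repeat cycles until the centers no longer change, or 100 cycles have run.
for (i in seq_len(100)) {
  new_centers <- kmeans_cycle(data, centers)
  converged <- isTRUE(all.equal(as.data.frame(new_centers), as.data.frame(centers)))
  centers <- new_centers
  if (converged) break
}

centers
```

Because the only state carried from one cycle to the next is the small `centers` data frame, a run that is interrupted can be restarted from the last set of centers instead of from scratch, which is the idea behind `initial_kmeans`.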

Examples

library(dplyr)

mtcars %>%
  simple_kmeans_db(mpg, qsec, wt) %>%
  glimpse()
#> Observations: 32
#> Variables: 15
#> $ k_center <chr> "center_1", "center_1", "center_1", "center_1", "center_1"...
#> $ k_mpg <dbl> 20.64286, 20.64286, 20.64286, 20.64286, 20.64286, 20.64286...
#> $ k_qsec <dbl> 18.57357, 18.57357, 18.57357, 18.57357, 18.57357, 18.57357...
#> $ k_wt <dbl> 3.072143, 3.072143, 3.072143, 3.072143, 3.072143, 3.072143...
#> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2...
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4...
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 14...
#> $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 1...
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92...
#> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3....
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22...
#> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1...
#> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1...
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4...
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1...