K-Means Clustering Model
spark.kmeans.Rd
Fits a k-means clustering model against a SparkDataFrame, similarly to R's kmeans(). Users can call summary to print a summary of the fitted model, predict to make predictions on new data, and write.ml/read.ml to save/load fitted models.
Usage
spark.kmeans(data, formula, ...)
# S4 method for SparkDataFrame,formula
spark.kmeans(
data,
formula,
k = 2,
maxIter = 20,
initMode = c("k-means||", "random"),
seed = NULL,
initSteps = 2,
tol = 1e-04
)
# S4 method for KMeansModel
summary(object)
# S4 method for KMeansModel
predict(object, newData)
# S4 method for KMeansModel,character
write.ml(object, path, overwrite = FALSE)
Arguments
- data
a SparkDataFrame for training.
- formula
a symbolic description of the model to be fitted. Currently only a few formula operators are supported, including '~', '.', ':', '+', and '-'. Note that the response variable of formula is empty in spark.kmeans.
- ...
additional argument(s) passed to the method.
- k
number of centers.
- maxIter
maximum iteration number.
- initMode
the initialization algorithm chosen to fit the model.
- seed
the random seed for cluster initialization.
- initSteps
the number of steps for the k-means|| initialization mode. This is an advanced setting; the default of 2 is almost always enough. Must be > 0.
- tol
convergence tolerance of iterations.
- object
a fitted k-means model.
- newData
a SparkDataFrame for testing.
- path
the directory where the model is saved.
- overwrite
whether to overwrite the output path if it already exists. Default is FALSE, which means an exception is thrown if the output path exists.
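The arguments above can be combined as in the following minimal sketch. It is an illustration, not a recommended configuration: the Titanic data set, the chosen feature columns, the parameter values, and the temporary output path are all assumptions made for the example. As elsewhere on this page, the code is wrapped in if (FALSE) because it needs a running Spark session.

```r
if (FALSE) {
  library(SparkR)
  sparkR.session()

  # Illustrative training data (assumption: any SparkDataFrame would do)
  df <- createDataFrame(as.data.frame(Titanic))

  # The response side of the formula is left empty, as the
  # description of the formula argument notes.
  model <- spark.kmeans(
    df, ~ Class + Sex + Age + Freq,
    k = 3,                   # number of centers
    maxIter = 50,            # maximum iteration number
    initMode = "k-means||",  # initialization algorithm
    initSteps = 2,           # only relevant for k-means||; must be > 0
    seed = 1,                # fixed seed for cluster initialization
    tol = 1e-4               # convergence tolerance
  )

  # Hypothetical output location; overwrite = TRUE avoids the exception
  # that is thrown by default when the path already exists.
  path <- file.path(tempdir(), "kmeans-model")
  write.ml(model, path, overwrite = TRUE)
}
```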
Value
spark.kmeans returns a fitted k-means model.
summary returns summary information of the fitted model, which is a list. The list includes the model's k (the configured number of cluster centers), coefficients (model cluster centers), size (number of data points in each cluster), cluster (cluster centers of the transformed data), is.loaded (whether the model is loaded from a saved file), and clusterSize (the actual number of cluster centers; when using initMode = "random", clusterSize may not equal k).
predict returns the predicted values based on a k-means model.
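As a rough sketch of how the returned values might be inspected, the snippet below pulls the documented elements out of the summary list and looks at the prediction column used in the Examples. The data set, the feature columns, and the variable names (df, model, s, fitted) are assumptions made for illustration. The code needs a running Spark session, so it is wrapped in if (FALSE).

```r
if (FALSE) {
  library(SparkR)
  sparkR.session()

  # Illustrative data and model (assumptions for this example)
  df <- createDataFrame(as.data.frame(Titanic))
  model <- spark.kmeans(df, ~ Class + Sex + Age + Freq, k = 3, initMode = "random")

  # summary() returns a list; its elements are accessed with `$`
  s <- summary(model)
  s$k             # configured number of cluster centers
  s$clusterSize   # actual number of cluster centers; may differ from k
  s$coefficients  # model cluster centers
  s$size          # number of data points in each cluster
  s$is.loaded     # whether the model was loaded from a saved file

  # predict() returns a SparkDataFrame with a "prediction" column
  fitted <- predict(model, df)
  head(select(fitted, "Class", "prediction"))
}
```

Comparing s$k with s$clusterSize is one way to notice when random initialization produced fewer clusters than requested.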
Note
spark.kmeans since 2.0.0
summary(KMeansModel) since 2.0.0
predict(KMeansModel) since 2.0.0
write.ml(KMeansModel, character) since 2.0.0
Examples
if (FALSE) {
  sparkR.session()
  t <- as.data.frame(Titanic)
  df <- createDataFrame(t)
  model <- spark.kmeans(df, Class ~ Survived, k = 4, initMode = "random")
  summary(model)

  # fitted values on training data
  fitted <- predict(model, df)
  head(select(fitted, "Class", "prediction"))

  # save fitted model to input path
  path <- "path/to/model"
  write.ml(model, path)

  # can also read back the saved model and print
  savedModel <- read.ml(path)
  summary(savedModel)
}