Transformer()
Transformer
Abstract class for transformers that transform one dataset into another.
UnaryTransformer()
UnaryTransformer
Abstract class for transformers that take one input column, apply transformation, and output the result as a new column.
Estimator()
Estimator
Abstract class for estimators that fit models to data.
Model()
Model
Abstract class for models that are fitted by estimators.
Predictor()
Predictor
Estimator for prediction tasks (regression and classification).
PredictionModel()
PredictionModel
Model for prediction tasks (regression and classification).
Pipeline(*[, stages])
Pipeline
A simple pipeline, which acts as an estimator.
PipelineModel(stages)
PipelineModel
Represents a compiled pipeline with transformers and fitted models.
Param(parent, name, doc[, typeConverter])
Param
A param with self-contained documentation.
Params()
Params
Components that take parameters.
TypeConverters
Factory methods for common type conversion functions for Param.typeConverter.
Binarizer(*[, threshold, inputCol, …])
Binarizer
Binarize a column of continuous features given a threshold.
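The thresholding rule can be sketched in plain Python (this is not the PySpark implementation). The helper name `binarize` is hypothetical, and the sketch assumes that values strictly greater than the threshold map to 1.0 and all others to 0.0.

```python
def binarize(values, threshold=0.0):
    """Map each value to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


print(binarize([0.1, 0.5, 0.9], threshold=0.5))  # [0.0, 0.0, 1.0]
```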
BucketedRandomProjectionLSH(*[, inputCol, …])
BucketedRandomProjectionLSH
LSH class for Euclidean distance metrics.
BucketedRandomProjectionLSHModel([java_model])
BucketedRandomProjectionLSHModel
Model fitted by BucketedRandomProjectionLSH, where multiple random vectors are stored.
Bucketizer(*[, splits, inputCol, outputCol, …])
Bucketizer
Maps a column of continuous features to a column of feature buckets.
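A minimal plain-Python sketch of bucket assignment, not the PySpark implementation. The helper `bucketize` is hypothetical; it assumes each bucket covers the half-open range [x, y) between consecutive splits, except the last bucket, which also includes its upper bound.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket that contains value.

    splits must be sorted; n + 1 split points define n buckets.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value lies outside the split range")
    if value == splits[-1]:
        # The last bucket is closed on the right.
        return len(splits) - 2
    return bisect_right(splits, value) - 1


splits = [float("-inf"), 0.0, 10.0, float("inf")]
print([bucketize(v, splits) for v in (-5.0, 0.0, 9.9, 10.0)])  # [0, 1, 1, 2]
```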
ChiSqSelector(*[, numTopFeatures, …])
ChiSqSelector
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.
ChiSqSelectorModel([java_model])
ChiSqSelectorModel
Model fitted by ChiSqSelector.
CountVectorizer(*[, minTF, minDF, maxDF, …])
CountVectorizer
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
CountVectorizerModel([java_model])
CountVectorizerModel
Model fitted by CountVectorizer.
DCT(*[, inverse, inputCol, outputCol])
DCT
A feature transformer that takes the 1D discrete cosine transform of a real vector.
ElementwiseProduct(*[, scalingVec, …])
ElementwiseProduct
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided “weight” vector.
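The Hadamard product itself is simple enough to show in plain Python. The function name `hadamard` is hypothetical and the sketch uses lists rather than Spark vectors.

```python
def hadamard(vector, weights):
    """Multiply two equal-length vectors element by element."""
    if len(vector) != len(weights):
        raise ValueError("vector and weights must have the same length")
    return [v * w for v, w in zip(vector, weights)]


print(hadamard([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]))  # [0.0, 2.0, 6.0]
```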
FeatureHasher(*[, numFeatures, inputCols, …])
FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space).
HashingTF(*[, numFeatures, binary, …])
HashingTF
Maps a sequence of terms to their term frequencies using the hashing trick.
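The hashing trick can be illustrated without Spark. In this sketch the helper `hashing_tf` is hypothetical, and `zlib.crc32` is only a stand-in hash chosen because it is deterministic and in the standard library; it is not the hash function Spark uses, so the indices will not match Spark's output.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=16):
    """Count terms per hashed bucket; distinct terms may collide in a bucket."""
    counts = Counter(
        zlib.crc32(term.encode("utf-8")) % num_features for term in terms
    )
    return {index: float(count) for index, count in sorted(counts.items())}


tf = hashing_tf(["spark", "ml", "spark"], num_features=16)
print(tf)  # bucket index -> term frequency
```

No vocabulary has to be built or stored; the price is that two different terms can land in the same bucket.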
IDF(*[, minDocFreq, inputCol, outputCol])
IDF
Compute the Inverse Document Frequency (IDF) given a collection of documents.
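As a rough illustration, the smoothed form idf = log((m + 1) / (d(t) + 1)) can be computed directly, where m is the number of documents and d(t) the number of documents containing term t. The helper name `idf` is hypothetical, and the sketch assumes this smoothed formula is the one in use.

```python
import math


def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (d(t) + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))


print(idf(9, 9))  # 0.0, a term that appears in every document carries no weight
```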
IDFModel([java_model])
IDFModel
Model fitted by IDF.
Imputer(*[, strategy, missingValue, …])
Imputer
Imputation estimator for completing missing values, using the mean, median or mode of the columns in which the missing values are located.
ImputerModel([java_model])
ImputerModel
Model fitted by Imputer.
IndexToString(*[, inputCol, outputCol, labels])
IndexToString
A pyspark.ml.base.Transformer that maps a column of indices back to a new column of corresponding string values.
Interaction(*[, inputCols, outputCol])
Interaction
Implements the feature interaction transform.
MaxAbsScaler(*[, inputCol, outputCol])
MaxAbsScaler
Rescale each feature individually to range [-1, 1] by dividing by the largest absolute value in each feature.
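A plain-Python sketch of this rescaling on a small list of rows, not the PySpark implementation. The helper `max_abs_scale` is hypothetical, and leaving an all-zero feature at 0.0 is an assumption of this sketch.

```python
def max_abs_scale(rows):
    """Divide every feature (column) by that column's largest absolute value."""
    max_abs = [max(abs(x) for x in column) for column in zip(*rows)]
    return [
        [x / m if m != 0 else 0.0 for x, m in zip(row, max_abs)]
        for row in rows
    ]


rows = [[1.0, -8.0], [2.0, 4.0], [-4.0, 2.0]]
print(max_abs_scale(rows))  # [[0.25, -1.0], [0.5, 0.5], [-1.0, 0.25]]
```

Because values are only divided, never shifted, zeros stay zero.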
MaxAbsScalerModel([java_model])
MaxAbsScalerModel
Model fitted by MaxAbsScaler.
MinHashLSH(*[, inputCol, outputCol, seed, …])
MinHashLSH
LSH class for Jaccard distance.
MinHashLSHModel([java_model])
MinHashLSHModel
Model produced by MinHashLSH, where multiple hash functions are stored.
MinMaxScaler(*[, min, max, inputCol, outputCol])
MinMaxScaler
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.
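The linear rescaling can be sketched for a single feature column in plain Python. The helper `min_max_scale` and its `lo`/`hi` parameter names are hypothetical. Mapping a constant column to the midpoint of the target range is an assumption of this sketch.

```python
def min_max_scale(column, lo=0.0, hi=1.0):
    """Linearly map the observed range of column onto [lo, hi]."""
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        # Constant feature: fall back to the midpoint of the target range.
        return [0.5 * (hi + lo) for _ in column]
    scale = (hi - lo) / (e_max - e_min)
    return [(x - e_min) * scale + lo for x in column]


print(min_max_scale([1.0, 2.0, 3.0]))             # [0.0, 0.5, 1.0]
print(min_max_scale([2.0, 4.0, 6.0], -1.0, 1.0))  # [-1.0, 0.0, 1.0]
```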
MinMaxScalerModel([java_model])
MinMaxScalerModel
Model fitted by MinMaxScaler.
NGram(*[, n, inputCol, outputCol])
NGram
A feature transformer that converts the input array of strings into an array of n-grams.
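A minimal sketch of n-gram construction in plain Python. The helper `ngrams` is hypothetical; it assumes each n-gram is a space-delimited string of n consecutive tokens and that inputs shorter than n yield an empty list.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["a", "b", "c", "d"], n=2))  # ['a b', 'b c', 'c d']
print(ngrams(["a"], n=2))                 # []
```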
Normalizer(*[, p, inputCol, outputCol])
Normalizer
Normalize a vector to have unit norm using the given p-norm.
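The p-norm scaling can be written out in a few lines of plain Python. The helper `normalize` is hypothetical, and returning a zero vector unchanged is an assumption of this sketch.

```python
def normalize(vector, p=2.0):
    """Scale vector so that its p-norm equals 1."""
    if p == float("inf"):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


print(normalize([3.0, 4.0]))          # [0.6, 0.8]
print(normalize([1.0, -3.0], p=1.0))  # [0.25, -0.75]
```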
OneHotEncoder(*[, inputCols, outputCols, …])
OneHotEncoder
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
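The phrase "at most a single one-value" is easier to see in a sketch. The helper `one_hot` below is hypothetical and uses plain lists. It assumes the last category is dropped by default (the `drop_last` flag here is an illustrative stand-in), so that category is encoded as an all-zero vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category becomes the all-zero vector.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot(0, 3))                   # [1.0, 0.0]
print(one_hot(2, 3))                   # [0.0, 0.0]
print(one_hot(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]
```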
OneHotEncoderModel([java_model])
OneHotEncoderModel
Model fitted by OneHotEncoder.
PCA(*[, k, inputCol, outputCol])
PCA
PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
PCAModel([java_model])
PCAModel
Model fitted by PCA.
PolynomialExpansion(*[, degree, inputCol, …])
PolynomialExpansion
Perform feature expansion in a polynomial space.
QuantileDiscretizer(*[, numBuckets, …])
QuantileDiscretizer
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.
RobustScaler(*[, lower, upper, …])
RobustScaler
RobustScaler removes the median and scales the data according to the quantile range.
RobustScalerModel([java_model])
RobustScalerModel
Model fitted by RobustScaler.
RegexTokenizer(*[, minTokenLength, gaps, …])
RegexTokenizer
A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false).
RFormula(*[, formula, featuresCol, …])
RFormula
Implements the transforms required for fitting a dataset against an R model formula.
RFormulaModel([java_model])
RFormulaModel
Model fitted by RFormula.
SQLTransformer(*[, statement])
SQLTransformer
Implements the transforms that are defined by a SQL statement.
StandardScaler(*[, withMean, withStd, …])
StandardScaler
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
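A plain-Python sketch of standardizing one feature column. The helper `standardize` is hypothetical. It applies both centering and scaling and uses the sample standard deviation (n - 1 denominator); which steps the estimator applies by default, and leaving a constant column at 0.0, are assumptions not covered here.

```python
from statistics import mean, stdev


def standardize(column):
    """Subtract the column mean and divide by the sample standard deviation."""
    mu, sigma = mean(column), stdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


print(standardize([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```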
StandardScalerModel([java_model])
StandardScalerModel
Model fitted by StandardScaler.
StopWordsRemover(*[, inputCol, outputCol, …])
StopWordsRemover
A feature transformer that filters out stop words from input.
StringIndexer(*[, inputCol, outputCol, …])
StringIndexer
A label indexer that maps a string column of labels to an ML column of label indices.
StringIndexerModel([java_model])
StringIndexerModel
Model fitted by StringIndexer.
Tokenizer(*[, inputCol, outputCol])
Tokenizer
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
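The two steps, lowercasing and then splitting on whitespace, look like this in plain Python. The helper `tokenize` is hypothetical. Python's `str.split()` collapses runs of whitespace, which may not match Spark's handling of repeated spaces.

```python
def tokenize(text):
    """Lowercase the text, then split it on whitespace."""
    return text.lower().split()


print(tokenize("Hello Spark ML"))  # ['hello', 'spark', 'ml']
```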
UnivariateFeatureSelector(*[, featuresCol, …])
UnivariateFeatureSelector
Feature selector based on univariate statistical tests against labels.
UnivariateFeatureSelectorModel([java_model])
UnivariateFeatureSelectorModel
Model fitted by UnivariateFeatureSelector.
VarianceThresholdSelector(*[, featuresCol, …])
VarianceThresholdSelector
Feature selector that removes all low-variance features.
VarianceThresholdSelectorModel([java_model])
VarianceThresholdSelectorModel
Model fitted by VarianceThresholdSelector.
VectorAssembler(*[, inputCols, outputCol, …])
VectorAssembler
A feature transformer that merges multiple columns into a vector column.
VectorIndexer(*[, maxCategories, inputCol, …])
VectorIndexer
Class for indexing categorical feature columns in a dataset of Vector.
VectorIndexerModel([java_model])
VectorIndexerModel
Model fitted by VectorIndexer.
VectorSizeHint(*[, inputCol, size, …])
VectorSizeHint
A feature transformer that adds size information to the metadata of a vector column.
VectorSlicer(*[, inputCol, outputCol, …])
VectorSlicer
This class takes a feature vector and outputs a new feature vector with a subarray of the original features.
Word2Vec(*[, vectorSize, minCount, …])
Word2Vec
Word2Vec trains a model of Map(String, Vector).
Word2VecModel([java_model])
Word2VecModel
Model fitted by Word2Vec.
LinearSVC(*[, featuresCol, labelCol, …])
LinearSVC
This binary classifier optimizes the Hinge Loss using the OWLQN optimizer.
LinearSVCModel([java_model])
LinearSVCModel
Model fitted by LinearSVC.
LinearSVCSummary([java_obj])
LinearSVCSummary
Abstraction for LinearSVC Results for a given model.
LinearSVCTrainingSummary([java_obj])
LinearSVCTrainingSummary
Abstraction for LinearSVC Training results.
LogisticRegression(*[, featuresCol, …])
LogisticRegression
Logistic regression.
LogisticRegressionModel([java_model])
LogisticRegressionModel
Model fitted by LogisticRegression.
LogisticRegressionSummary([java_obj])
LogisticRegressionSummary
Abstraction for Logistic Regression Results for a given model.
LogisticRegressionTrainingSummary([java_obj])
LogisticRegressionTrainingSummary
Abstraction for multinomial Logistic Regression Training results.
BinaryLogisticRegressionSummary([java_obj])
BinaryLogisticRegressionSummary
Binary Logistic regression results for a given model.
BinaryLogisticRegressionTrainingSummary([…])
BinaryLogisticRegressionTrainingSummary
Binary Logistic regression training results for a given model.
DecisionTreeClassifier(*[, featuresCol, …])
DecisionTreeClassifier
Decision tree learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
DecisionTreeClassificationModel([java_model])
DecisionTreeClassificationModel
Model fitted by DecisionTreeClassifier.
GBTClassifier(*[, featuresCol, labelCol, …])
GBTClassifier
Gradient-Boosted Trees (GBTs) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features.
GBTClassificationModel([java_model])
GBTClassificationModel
Model fitted by GBTClassifier.
RandomForestClassifier(*[, featuresCol, …])
RandomForestClassifier
Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
RandomForestClassificationModel([java_model])
RandomForestClassificationModel
Model fitted by RandomForestClassifier.
RandomForestClassificationSummary([java_obj])
RandomForestClassificationSummary
Abstraction for RandomForestClassification Results for a given model.
RandomForestClassificationTrainingSummary([…])
RandomForestClassificationTrainingSummary
Abstraction for RandomForestClassification training results.
BinaryRandomForestClassificationSummary([…])
BinaryRandomForestClassificationSummary
BinaryRandomForestClassification results for a given model.
BinaryRandomForestClassificationTrainingSummary([…])
BinaryRandomForestClassificationTrainingSummary
BinaryRandomForestClassification training results for a given model.
NaiveBayes(*[, featuresCol, labelCol, …])
NaiveBayes
Naive Bayes Classifiers.
NaiveBayesModel([java_model])
NaiveBayesModel
Model fitted by NaiveBayes.
MultilayerPerceptronClassifier(*[, …])
MultilayerPerceptronClassifier
Classifier trainer based on the Multilayer Perceptron.
MultilayerPerceptronClassificationModel([…])
MultilayerPerceptronClassificationModel
Model fitted by MultilayerPerceptronClassifier.
MultilayerPerceptronClassificationSummary([…])
MultilayerPerceptronClassificationSummary
Abstraction for MultilayerPerceptronClassifier Results for a given model.
MultilayerPerceptronClassificationTrainingSummary([…])
MultilayerPerceptronClassificationTrainingSummary
Abstraction for MultilayerPerceptronClassifier Training results.
OneVsRest(*[, featuresCol, labelCol, …])
OneVsRest
Reduction of Multiclass Classification to Binary Classification.
OneVsRestModel(models)
OneVsRestModel
Model fitted by OneVsRest.
FMClassifier(*[, featuresCol, labelCol, …])
FMClassifier
Factorization Machines learning algorithm for classification.
FMClassificationModel([java_model])
FMClassificationModel
Model fitted by FMClassifier.
FMClassificationSummary([java_obj])
FMClassificationSummary
Abstraction for FMClassifier Results for a given model.
FMClassificationTrainingSummary([java_obj])
FMClassificationTrainingSummary
Abstraction for FMClassifier Training results.
BisectingKMeans(*[, featuresCol, …])
BisectingKMeans
A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modification to fit Spark.
BisectingKMeansModel([java_model])
BisectingKMeansModel
Model fitted by BisectingKMeans.
BisectingKMeansSummary([java_obj])
BisectingKMeansSummary
Bisecting KMeans clustering results for a given model.
KMeans(*[, featuresCol, predictionCol, k, …])
KMeans
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al.).
KMeansModel([java_model])
KMeansModel
Model fitted by KMeans.
KMeansSummary([java_obj])
KMeansSummary
Summary of KMeans.
GaussianMixture(*[, featuresCol, …])
GaussianMixture
GaussianMixture clustering.
GaussianMixtureModel([java_model])
GaussianMixtureModel
Model fitted by GaussianMixture.
GaussianMixtureSummary([java_obj])
GaussianMixtureSummary
Gaussian mixture clustering results for a given model.
LDA(*[, featuresCol, maxIter, seed, …])
LDA
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
LDAModel([java_model])
LDAModel
Latent Dirichlet Allocation (LDA) model.
LocalLDAModel([java_model])
LocalLDAModel
Local (non-distributed) model fitted by LDA.
DistributedLDAModel([java_model])
DistributedLDAModel
Distributed model fitted by LDA.
PowerIterationClustering(*[, k, maxIter, …])
PowerIterationClustering
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
array_to_vector(col)
array_to_vector
Converts a column of arrays of numeric type into a column of pyspark.ml.linalg.DenseVector instances.
vector_to_array(col[, dtype])
vector_to_array
Converts a column of MLlib sparse/dense vectors into a column of dense arrays.
Vector
DenseVector(ar)
DenseVector
A dense vector represented by a value array.
SparseVector(size, *args)
SparseVector
A simple sparse vector class for passing data to MLlib.
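The sparse layout, a size plus the indices and values of the non-zero entries, can be expanded to a dense list as follows. The helper `sparse_to_dense` is hypothetical and works on plain lists rather than on the SparseVector class itself.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


print(sparse_to_dense(4, [1, 3], [1.0, 5.5]))  # [0.0, 1.0, 0.0, 5.5]
```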
Vectors
Factory methods for working with vectors.
Matrix(numRows, numCols[, isTransposed])
Matrix
DenseMatrix(numRows, numCols, values[, …])
DenseMatrix
Column-major dense matrix.
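Column-major storage means the values of each column are laid out contiguously, so element (i, j) sits at offset i + j * numRows in the flat value array. The helper `dense_get` below is a hypothetical illustration of that indexing and ignores the isTransposed case.

```python
def dense_get(values, num_rows, i, j):
    """Look up element (i, j) of a column-major matrix stored as a flat list."""
    return values[i + j * num_rows]


# A 2 x 3 matrix whose columns are [1, 2], [3, 4] and [5, 6].
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dense_get(values, 2, 0, 1))  # 3.0
print(dense_get(values, 2, 1, 2))  # 6.0
```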
SparseMatrix(numRows, numCols, colPtrs, …)
SparseMatrix
Sparse Matrix stored in CSC format.
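In compressed sparse column (CSC) format, the entries of column j are stored at positions colPtrs[j] up to, but not including, colPtrs[j + 1] of the row-index and value arrays. The helper `csc_to_dense` is a hypothetical plain-Python sketch of that layout and ignores the transposed case.

```python
def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC-encoded matrix into a list of rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        # Entries of this column live in the half-open slice [start, end).
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[row_indices[k]][col] = values[k]
    return dense


# Column 0 holds 1.0 (row 0) and 2.0 (row 2); column 1 holds 3.0 (row 1).
print(csc_to_dense(3, 2, [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]))
# [[1.0, 0.0], [0.0, 3.0], [2.0, 0.0]]
```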
Matrices
ALS(*[, rank, maxIter, regParam, …])
ALS
Alternating Least Squares (ALS) matrix factorization.
ALSModel([java_model])
ALSModel
Model fitted by ALS.
AFTSurvivalRegression(*[, featuresCol, …])
AFTSurvivalRegression
Accelerated Failure Time (AFT) Model Survival Regression.
AFTSurvivalRegressionModel([java_model])
AFTSurvivalRegressionModel
Model fitted by AFTSurvivalRegression.
DecisionTreeRegressor(*[, featuresCol, …])
DecisionTreeRegressor
Decision tree learning algorithm for regression. It supports both continuous and categorical features.
DecisionTreeRegressionModel([java_model])
DecisionTreeRegressionModel
Model fitted by DecisionTreeRegressor.
GBTRegressor(*[, featuresCol, labelCol, …])
GBTRegressor
Gradient-Boosted Trees (GBTs) learning algorithm for regression. It supports both continuous and categorical features.
GBTRegressionModel([java_model])
GBTRegressionModel
Model fitted by GBTRegressor.
GeneralizedLinearRegression(*[, labelCol, …])
GeneralizedLinearRegression
Generalized Linear Regression.
GeneralizedLinearRegressionModel([java_model])
GeneralizedLinearRegressionModel
Model fitted by GeneralizedLinearRegression.
GeneralizedLinearRegressionSummary([java_obj])
GeneralizedLinearRegressionSummary
Generalized linear regression results evaluated on a dataset.
GeneralizedLinearRegressionTrainingSummary([…])
GeneralizedLinearRegressionTrainingSummary
Generalized linear regression training results.
IsotonicRegression(*[, featuresCol, …])
IsotonicRegression
Currently implemented using parallelized pool adjacent violators algorithm.
IsotonicRegressionModel([java_model])
IsotonicRegressionModel
Model fitted by IsotonicRegression.
LinearRegression(*[, featuresCol, labelCol, …])
LinearRegression
Linear regression.
LinearRegressionModel([java_model])
LinearRegressionModel
Model fitted by LinearRegression.
LinearRegressionSummary([java_obj])
LinearRegressionSummary
Linear regression results evaluated on a dataset.
LinearRegressionTrainingSummary([java_obj])
LinearRegressionTrainingSummary
Linear regression training results.
RandomForestRegressor(*[, featuresCol, …])
RandomForestRegressor
Random Forest learning algorithm for regression. It supports both continuous and categorical features.
RandomForestRegressionModel([java_model])
RandomForestRegressionModel
Model fitted by RandomForestRegressor.
FMRegressor(*[, featuresCol, labelCol, …])
FMRegressor
Factorization Machines learning algorithm for regression.
FMRegressionModel([java_model])
FMRegressionModel
Model fitted by FMRegressor.
ChiSquareTest
Conduct Pearson’s independence test for every feature against the label.
Correlation
Compute the correlation matrix for the input dataset of Vectors using the specified method.
KolmogorovSmirnovTest
Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous distribution.
MultivariateGaussian(mean, cov)
MultivariateGaussian
Represents a (mean, cov) tuple.
Summarizer
Tools for vectorized statistics on MLlib Vectors.
SummaryBuilder(jSummaryBuilder)
SummaryBuilder
A builder object that provides summary statistics about a given column.
ParamGridBuilder()
ParamGridBuilder
Builder for a param grid used in grid search-based model selection.
CrossValidator(*[, estimator, …])
CrossValidator
K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
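The fold arithmetic can be sketched in plain Python. The helper `k_fold_pairs` is hypothetical and assigns rows to folds round-robin so the output is deterministic; the real estimator partitions rows randomly.

```python
def k_fold_pairs(rows, k=3):
    """Split rows into k folds and return k (training, test) pairs."""
    folds = [rows[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        # Training data is every fold except the held-out one.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((train, test))
    return pairs


for train, test in k_fold_pairs(list(range(9)), k=3):
    print(len(train), len(test))  # 6 3 on each of the three lines
```

Every row appears in exactly one test set, so each candidate model is scored on data it was not trained on.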
CrossValidatorModel(bestModel[, avgMetrics, …])
CrossValidatorModel
CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data.
TrainValidationSplit(*[, estimator, …])
TrainValidationSplit
Validation for hyper-parameter tuning.
TrainValidationSplitModel(bestModel[, …])
TrainValidationSplitModel
Model from train validation split.
Evaluator()
Evaluator
Base class for evaluators that compute metrics from predictions.
BinaryClassificationEvaluator(*[, …])
BinaryClassificationEvaluator
Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column.
RegressionEvaluator(*[, predictionCol, …])
RegressionEvaluator
Evaluator for Regression, which expects input columns prediction, label and an optional weight column.
MulticlassClassificationEvaluator(*[, …])
MulticlassClassificationEvaluator
Evaluator for Multiclass Classification, which expects input columns: prediction, label, weight (optional) and probabilityCol (only for logLoss).
MultilabelClassificationEvaluator(*[, …])
MultilabelClassificationEvaluator
Evaluator for Multilabel Classification, which expects two input columns: prediction and label.
ClusteringEvaluator(*[, predictionCol, …])
ClusteringEvaluator
Evaluator for Clustering results, which expects two input columns: prediction and features.
RankingEvaluator(*[, predictionCol, …])
RankingEvaluator
Evaluator for Ranking, which expects two input columns: prediction and label.
FPGrowth(*[, minSupport, minConfidence, …])
FPGrowth
A parallel FP-growth algorithm to mine frequent itemsets.
FPGrowthModel([java_model])
FPGrowthModel
Model fitted by FPGrowth.
PrefixSpan(*[, minSupport, …])
PrefixSpan
A parallel PrefixSpan algorithm to mine frequent sequential patterns.
ImageSchema
Internal class for pyspark.ml.image.ImageSchema attribute.
_ImageSchema()
_ImageSchema
BaseReadWrite()
BaseReadWrite
Base class for MLWriter and MLReader.
DefaultParamsReadable
Helper trait for making simple Params types readable.
DefaultParamsReader(cls)
DefaultParamsReader
Specialization of MLReader for Params types.
DefaultParamsWritable
Helper trait for making simple Params types writable.
DefaultParamsWriter(instance)
DefaultParamsWriter
Specialization of MLWriter for Params types.
GeneralMLWriter()
GeneralMLWriter
Utility class that can save ML instances in different formats.
HasTrainingSummary
Base class for models that provide a training summary.
Identifiable()
Identifiable
Object with a unique ID.
MLReadable
Mixin for instances that provide MLReader.
MLReader()
Utility class that can load ML instances.
MLWritable
Mixin for ML instances that provide MLWriter.
MLWriter()
Utility class that can save ML instances.