pyspark.pandas.DataFrame.assign
DataFrame.assign(**kwargs: Any) → pyspark.pandas.frame.DataFrame
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
- Parameters
- **kwargs : dict of {str: callable, Series or Index}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change the input DataFrame (though pandas-on-Spark doesn’t check it). If the values are not callable (e.g. a Series or a literal), they are simply assigned.
- Returns
- DataFrame
A new DataFrame with the new columns in addition to all the existing columns.
Notes
Assigning multiple columns within the same assign is possible, but you cannot refer to newly created or modified columns within that call. Referring to such columns is supported in pandas for Python 3.6 and later but not in pandas-on-Spark. In pandas-on-Spark, all items are computed first, and then assigned.
Examples
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0
Where the value is a callable, evaluated on df:
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
Alternatively, the same result can be achieved by directly referencing an existing Series or sequence. You can also create multiple columns within the same assign:
>>> assigned = df.assign(temp_f=df['temp_c'] * 9 / 5 + 32,
...                      temp_k=df['temp_c'] + 273.15,
...                      temp_idx=df.index)
>>> assigned[['temp_c', 'temp_f', 'temp_k', 'temp_idx']]
          temp_c  temp_f  temp_k  temp_idx
Portland    17.0    62.6  290.15  Portland
Berkeley    25.0    77.0  298.15  Berkeley
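The evaluation order described in the Notes ("all items are computed first, and then assigned") can be illustrated with a small toy model. This is only a sketch: it uses plain Python dicts in place of DataFrames so that it runs without Spark, and the helper name `assign_all_at_once` is hypothetical, not part of pandas-on-Spark or its implementation.

```python
# Toy model only: a "frame" is a dict mapping column name -> list of values.
# assign_all_at_once is a hypothetical helper that mirrors the documented
# evaluation order; it is not how pandas-on-Spark is implemented.

def assign_all_at_once(frame, **kwargs):
    """Compute every value against the original frame, then add them all."""
    computed = {
        name: value(frame) if callable(value) else value
        for name, value in kwargs.items()
    }
    # Original columns come first; re-assigned names are overwritten.
    return {**frame, **computed}


def to_fahrenheit(f):
    return [c * 9 / 5 + 32 for c in f["temp_c"]]


def round_fahrenheit(f):
    return [round(v) for v in f["temp_f"]]


frame = {"temp_c": [17.0, 25.0]}

# A callable receives the original frame; a non-callable is assigned as is.
result = assign_all_at_once(
    frame,
    temp_f=to_fahrenheit,
    temp_k=[c + 273.15 for c in frame["temp_c"]],
)

# Referring to temp_f within the same call fails in this model, because the
# callables only ever see the original frame, which has no temp_f column.
try:
    assign_all_at_once(frame, temp_f=to_fahrenheit, temp_f_rounded=round_fahrenheit)
    same_call_worked = True
except KeyError:
    same_call_worked = False

# Chaining two calls works: the second call sees the output of the first.
chained = assign_all_at_once(
    assign_all_at_once(frame, temp_f=to_fahrenheit),
    temp_f_rounded=round_fahrenheit,
)
```

In this model, the dependent column can only be built by chaining two calls, so that the second call receives a frame that already contains `temp_f`. The exception raised by the toy model (`KeyError`) is a property of the dict-based sketch and should not be read as the error pandas-on-Spark raises.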