pyspark.RDD.join

RDD.join(other: pyspark.rdd.RDD[Tuple[K, U]], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, Tuple[V, U]]]

Return an RDD containing all pairs of elements with matching keys in self and other.

Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.

Performs a hash join across the cluster.

New in version 0.7.0.
Parameters
    other : pyspark.RDD
        another RDD of (key, value) pairs
    numPartitions : int, optional
        the number of partitions in the resulting RDD

Returns
    pyspark.RDD
        an RDD containing all pairs of elements with matching keys
See also

RDD.leftOuterJoin
RDD.rightOuterJoin
RDD.fullOuterJoin
Examples
>>> rdd1 = sc.parallelize([("a", 1), ("b", 4)])
>>> rdd2 = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(rdd1.join(rdd2).collect())
[('a', (1, 2)), ('a', (1, 3))]
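The join semantics above (one output pair per matching combination of values, so a key appearing twice in other yields two results) can be sketched in plain Python without a Spark cluster. This is a minimal illustration of what a hash join computes, not Spark's distributed implementation; the helper name local_join is hypothetical:

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def local_join(
    left: Iterable[Tuple[Hashable, object]],
    right: Iterable[Tuple[Hashable, object]],
) -> List[Tuple[Hashable, Tuple[object, object]]]:
    """Single-process sketch of RDD.join: build a hash index on the
    right side, then emit (k, (v, u)) for every matching pair."""
    right_index: Dict[Hashable, list] = {}
    for k, u in right:
        right_index.setdefault(k, []).append(u)
    # Keys missing from right_index produce no output (inner join).
    return [(k, (v, u)) for k, v in left for u in right_index.get(k, [])]


pairs = local_join([("a", 1), ("b", 4)], [("a", 2), ("a", 3)])
print(sorted(pairs))  # [('a', (1, 2)), ('a', (1, 3))]
```

This mirrors the doctest above: key "a" matches twice, key "b" has no partner in the right side and is dropped.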