Class RDD

object --+
         |
        RDD

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Instance Methods

__init__(self, jrdd, ctx, jrdd_deserializer)
x.__init__(...) initializes x; see help(type(x)) for signature

id(self)
A unique ID for this RDD (within its SparkContext).

__repr__(self)
repr(x)

context(self)
The SparkContext that this RDD was created on.

cache(self)
Persist this RDD with the default storage level (MEMORY_ONLY_SER).

persist(self, storageLevel)
Set this RDD's storage level to persist its values across operations after the first time it is computed.

unpersist(self)
Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.

checkpoint(self)
Mark this RDD for checkpointing.

isCheckpointed(self)
Return whether this RDD has been checkpointed.

getCheckpointFile(self)
Get the name of the file to which this RDD was checkpointed.

map(self, f, preservesPartitioning=False)
Return a new RDD by applying a function to each element of this RDD.

flatMap(self, f, preservesPartitioning=False)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
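
The difference between map and flatMap is easiest to see side by side. The sketch below models an RDD as a plain Python list, so it illustrates the semantics only and does not run on Spark; the helper names map_model and flat_map_model are hypothetical.

```python
from itertools import chain


def map_model(elements, f):
    """Model of map: exactly one output element per input element."""
    return [f(x) for x in elements]


def flat_map_model(elements, f):
    """Model of flatMap: f returns an iterable for each element, and the
    per-element results are concatenated into one flat sequence."""
    return list(chain.from_iterable(f(x) for x in elements))


lines = ["a b", "c"]
mapped = map_model(lines, lambda s: s.split(" "))
flattened = flat_map_model(lines, lambda s: s.split(" "))
print(mapped)     # [['a', 'b'], ['c']]
print(flattened)  # ['a', 'b', 'c']
```

With map the result keeps one list per input line; with flatMap the words end up in a single flat collection.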

mapPartitions(self, f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD.

mapPartitionsWithIndex(self, f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
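
As a rough illustration, the function passed to mapPartitionsWithIndex can be thought of as receiving a partition index and an iterator over that partition. The sketch below models partitions as a list of lists in plain Python; map_partitions_with_index_model and tag_with_index are hypothetical names, and nothing here runs on Spark.

```python
def map_partitions_with_index_model(partitions, f):
    """Apply f(index, iterator) to each partition and keep the results
    as the partitions of the new collection."""
    return [list(f(i, iter(part))) for i, part in enumerate(partitions)]


def tag_with_index(index, iterator):
    # Emit (partition index, element) for every element in the partition.
    for x in iterator:
        yield (index, x)


partitions = [[1, 2], [3], [4, 5]]
tagged = map_partitions_with_index_model(partitions, tag_with_index)
print(tagged)  # [[(0, 1), (0, 2)], [(1, 3)], [(2, 4), (2, 5)]]
```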

mapPartitionsWithSplit(self, f, preservesPartitioning=False)
Deprecated: use mapPartitionsWithIndex instead.

getNumPartitions(self)
Return the number of partitions in this RDD.

filter(self, f)
Return a new RDD containing only the elements that satisfy a predicate.

distinct(self)
Return a new RDD containing the distinct elements in this RDD.

sample(self, withReplacement, fraction, seed=None)
Return a sampled subset of this RDD (relies on numpy and falls back on default random generator if numpy is unavailable).

takeSample(self, withReplacement, num, seed=None)
Return a fixed-size sampled subset of this RDD (currently requires numpy).

union(self, other)
Return the union of this RDD and another one.

intersection(self, other)
Return the intersection of this RDD and another one.

__add__(self, other)
Return the union of this RDD and another one.

sortByKey(self, ascending=True, numPartitions=None, keyfunc=lambda x: x)
Sorts this RDD, which is assumed to consist of (key, value) pairs.

sortBy(self, keyfunc, ascending=True, numPartitions=None)
Sorts this RDD by the given keyfunc.

glom(self)
Return an RDD created by coalescing all elements within each partition into a list.

cartesian(self, other)
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
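
The definition of cartesian can be mirrored on plain Python lists with itertools.product. This is only a model of the result's contents (cartesian_model is a hypothetical name); the element order a real RDD would produce is not something this sketch claims.

```python
from itertools import product


def cartesian_model(left, right):
    """All pairs (a, b) with a taken from left and b taken from right."""
    return list(product(left, right))


pairs = cartesian_model([1, 2], ["x", "y"])
print(pairs)       # [(1, 'x'), (1, 'y'), (2, 'x'), (2, 'y')]
print(len(pairs))  # 4
```

The result has len(left) * len(right) elements, so the product of two large collections grows quickly.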

groupBy(self, f, numPartitions=None)
Return an RDD of grouped items.

pipe(self, command, env={})
Return an RDD created by piping elements to a forked external process.

foreach(self, f)
Applies a function to all elements of this RDD.

foreachPartition(self, f)
Applies a function to each partition of this RDD.

collect(self)
Return a list that contains all of the elements in this RDD.

reduce(self, f)
Reduces the elements of this RDD using the specified commutative and associative binary operator.

fold(self, zeroValue, op)
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value."
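
To see why the zero value has to be neutral, it can help to model the two-stage fold in plain Python. The sketch assumes the zero value is used once as the starting point for every partition and once more when the per-partition results are merged; fold_model is a hypothetical helper and partitions are modelled as a list of lists.

```python
from functools import reduce
from operator import add


def fold_model(partitions, zero_value, op):
    # Stage 1: fold each partition, starting from the zero value.
    per_partition = [reduce(op, part, zero_value) for part in partitions]
    # Stage 2: fold the per-partition results, again starting from the zero value.
    return reduce(op, per_partition, zero_value)


data = [[1, 2], [3, 4], [5]]
total = fold_model(data, 0, add)   # 0 is neutral for addition
skewed = fold_model(data, 1, add)  # 1 is not neutral for addition
print(total)   # 15
print(skewed)  # 19
```

In this model a non-neutral zero value is added once per partition plus once at the end, so the result would depend on how the data happens to be partitioned.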

aggregate(self, zeroValue, seqOp, combOp)
Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value."
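
Unlike fold, aggregate takes two functions, so the accumulator may have a different type from the elements. The following plain-Python sketch models that on a list of lists and computes a (sum, count) pair in one pass; aggregate_model, add_element and merge_accumulators are hypothetical names, and the code is a model of the semantics rather than Spark's implementation.

```python
from functools import reduce


def aggregate_model(partitions, zero_value, seq_op, comb_op):
    # seq_op folds one element into a partition-local accumulator.
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    # comb_op merges two accumulators from different partitions.
    return reduce(comb_op, per_partition, zero_value)


def add_element(acc, x):
    # acc is a (running sum, running count) pair.
    return (acc[0] + x, acc[1] + 1)


def merge_accumulators(a, b):
    return (a[0] + b[0], a[1] + b[1])


total, count = aggregate_model(
    [[1, 2], [3, 4], [5]], (0, 0), add_element, merge_accumulators
)
print(total, count)   # 15 5
print(total / count)  # 3.0
```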

max(self)
Find the maximum item in this RDD.

min(self)
Find the minimum item in this RDD.

sum(self)
Add up the elements in this RDD.

count(self)
Return the number of elements in this RDD.

stats(self)
Return a StatCounter object that captures the mean, variance and count of the RDD's elements in one operation.

histogram(self, buckets)
Compute a histogram using the provided buckets.

mean(self)
Compute the mean of this RDD's elements.

variance(self)
Compute the variance of this RDD's elements.

stdev(self)
Compute the standard deviation of this RDD's elements.

sampleStdev(self)
Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).

sampleVariance(self)
Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
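
The only difference between variance and sampleVariance (and between stdev and sampleStdev) is the divisor, N versus N-1. A small plain-Python sketch of that arithmetic, using a hypothetical helper variance_model and made-up data:

```python
import math


def variance_model(xs, sample=False):
    """Population variance (divide by N) or sample variance (divide by N-1)."""
    n = len(xs)
    mean = sum(xs) / n
    squared_dev = sum((x - mean) ** 2 for x in xs)
    return squared_dev / (n - 1 if sample else n)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
population_variance = variance_model(data)
sample_variance = variance_model(data, sample=True)
print(population_variance)             # 4.0
print(math.sqrt(population_variance))  # 2.0
print(sample_variance > population_variance)  # True
```

For this data the sum of squared deviations is 32, so the population variance is 32/8 and the sample variance is 32/7.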

countByValue(self)
Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.

top(self, num)
Get the top N elements from an RDD.

takeOrdered(self, num, key=None)
Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.
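
On a plain Python list, the behaviour described for top and takeOrdered corresponds closely to heapq.nlargest and heapq.nsmallest. The sketch below is such a model (top_model and take_ordered_model are hypothetical names) and says nothing about how Spark distributes the work.

```python
import heapq


def take_ordered_model(elements, num, key=None):
    """The num smallest elements, in ascending order of key(x)."""
    return heapq.nsmallest(num, elements, key=key)


def top_model(elements, num):
    """The num largest elements, in descending order."""
    return heapq.nlargest(num, elements)


data = [10, 1, 2, 9, 3, 4, 5, 6, 7]
print(take_ordered_model(data, 3))                    # [1, 2, 3]
print(take_ordered_model(data, 3, key=lambda x: -x))  # [10, 9, 7]
print(top_model(data, 3))                             # [10, 9, 7]
```

Negating the key turns the ascending selection into a descending one, which is why the last two calls agree.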

take(self, num)
Take the first num elements of the RDD.

first(self)
Return the first element in this RDD.

saveAsNewAPIHadoopDataset(self, conf, keyConverter=None, valueConverter=None)
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).

saveAsNewAPIHadoopFile(self, path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None)
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).

saveAsHadoopDataset(self, conf, keyConverter=None, valueConverter=None)
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).

saveAsHadoopFile(self, path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None, compressionCodecClass=None)
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).

saveAsSequenceFile(self, path, compressionCodecClass=None)
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the org.apache.hadoop.io.Writable types that we convert from the RDD's key and value types.

saveAsPickleFile(self, path, batchSize=10)
Save this RDD as a SequenceFile of serialized objects.

saveAsTextFile(self, path)
Save this RDD as a text file, using string representations of elements.

collectAsMap(self)
Return the key-value pairs in this RDD to the master as a dictionary.

keys(self)
Return an RDD with the keys of each tuple.

values(self)
Return an RDD with the values of each tuple.

reduceByKey(self, func, numPartitions=None)
Merge the values for each key using an associative reduce function.

reduceByKeyLocally(self, func)
Merge the values for each key using an associative reduce function, but return the results immediately to the master as a dictionary.

countByKey(self)
Count the number of elements for each key, and return the result to the master as a dictionary.

join(self, other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.

leftOuterJoin(self, other, numPartitions=None)
Perform a left outer join of self and other.

rightOuterJoin(self, other, numPartitions=None)
Perform a right outer join of self and other.
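
The three joins differ only in which keys survive and in how a missing match is represented. The sketch below models them on plain Python lists of (key, value) pairs; join_model is a hypothetical helper, and the use of None for the unmatched side is stated here as an assumption of the model, since the one-line summaries above do not spell it out.

```python
def join_model(left, right, how="inner"):
    """Join two lists of (key, value) pairs; how is 'inner', 'left' or 'right'."""
    left_by_key, right_by_key = {}, {}
    for k, v in left:
        left_by_key.setdefault(k, []).append(v)
    for k, w in right:
        right_by_key.setdefault(k, []).append(w)

    if how == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "left":
        keys = list(left_by_key)
    elif how == "right":
        keys = list(right_by_key)
    else:
        raise ValueError("how must be 'inner', 'left' or 'right'")

    result = []
    for k in keys:
        # A side with no value for this key contributes a single None.
        vs = left_by_key.get(k, [None])
        ws = right_by_key.get(k, [None])
        result.extend((k, (v, w)) for v in vs for w in ws)
    return result


x = [("a", 1), ("b", 4)]
y = [("a", 2), ("a", 3)]
print(join_model(x, y))           # [('a', (1, 2)), ('a', (1, 3))]
print(join_model(x, y, "left"))   # [('a', (1, 2)), ('a', (1, 3)), ('b', (4, None))]
print(join_model(y, x, "right"))  # [('a', (2, 1)), ('a', (3, 1)), ('b', (None, 4))]
```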

partitionBy(self, numPartitions, partitionFunc=portable_hash)
Return a copy of the RDD partitioned using the specified partitioner.

combineByKey(self, createCombiner, mergeValue, mergeCombiners, numPartitions=None)
Generic function to combine the elements for each key using a custom set of aggregation functions.
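
The roles of the three functions are easier to follow in a small model. The sketch below assumes createCombiner is called the first time a key is seen within a partition, mergeValue folds further values into that partition-local combiner, and mergeCombiners merges combiners for the same key across partitions. It is written in plain Python over a list of lists, and combine_by_key_model is a hypothetical name.

```python
def combine_by_key_model(partitions, create_combiner, merge_value, merge_combiners):
    # Step 1: build one combiner per key inside each partition.
    per_partition = []
    for part in partitions:
        combiners = {}
        for k, v in part:
            if k in combiners:
                combiners[k] = merge_value(combiners[k], v)
            else:
                combiners[k] = create_combiner(v)
        per_partition.append(combiners)

    # Step 2: merge combiners for the same key across partitions.
    merged = {}
    for combiners in per_partition:
        for k, c in combiners.items():
            merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged


partitions = [[("a", 1), ("b", 2)], [("a", 3), ("a", 5)]]
sum_count = combine_by_key_model(
    partitions,
    lambda v: (v, 1),                               # start a (sum, count) pair
    lambda c, v: (c[0] + v, c[1] + 1),              # add one value to a pair
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),  # merge two pairs
)
means = {k: s / n for k, (s, n) in sum_count.items()}
print(sum_count)  # {'a': (9, 3), 'b': (2, 1)}
print(means)      # {'a': 3.0, 'b': 2.0}
```

A per-key mean is a typical case where the combiner type, here a (sum, count) pair, differs from the value type.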

aggregateByKey(self, zeroValue, seqFunc, combFunc, numPartitions=None)
Aggregate the values of each key, using given combine functions and a neutral "zero value".

foldByKey(self, zeroValue, func, numPartitions=None)
Merge the values for each key using an associative function "func" and a neutral "zeroValue", which may be added to the result an arbitrary number of times and must not change the result (e.g., 0 for addition, or 1 for multiplication).

groupByKey(self, numPartitions=None)
Group the values for each key in the RDD into a single sequence.

flatMapValues(self, f)
Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.

mapValues(self, f)
Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.

groupWith(self, other, *others)
Alias for cogroup but with support for multiple RDDs.

cogroup(self, other, numPartitions=None)
For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
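
A minimal model of that description on plain Python lists of (key, value) pairs follows. cogroup_model is a hypothetical name, and the sketch returns a dictionary of plain lists purely for readability; it makes no claim about the container types Spark uses.

```python
def cogroup_model(left, right):
    """Map each key to (values from left, values from right)."""
    grouped = {}
    for k, v in left:
        grouped.setdefault(k, ([], []))[0].append(v)
    for k, w in right:
        grouped.setdefault(k, ([], []))[1].append(w)
    return grouped


x = [("a", 1), ("b", 4)]
y = [("a", 2)]
print(cogroup_model(x, y))  # {'a': ([1], [2]), 'b': ([4], [])}
```

A key present on only one side still appears in the result, paired with an empty list for the other side.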

sampleByKey(self, withReplacement, fractions, seed=None)
Return a subset of this RDD sampled by key (via stratified sampling).

subtractByKey(self, other, numPartitions=None)
Return each (key, value) pair in self that has no pair with matching key in other.

subtract(self, other, numPartitions=None)
Return each value in self that is not contained in other.

keyBy(self, f)
Creates tuples of the elements in this RDD by applying f.

repartition(self, numPartitions)
Return a new RDD that has exactly numPartitions partitions.

coalesce(self, numPartitions, shuffle=False)
Return a new RDD that is reduced into `numPartitions` partitions.

zip(self, other)
Zips this RDD with another one, returning key-value pairs made of the first element in each RDD, the second element in each RDD, etc.

zipWithIndex(self)
Zips this RDD with its element indices.
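
One way to picture the indices is as a running counter over the partitions. The sketch below assumes elements are numbered first by partition index and then by position within the partition; it models partitions as a list of lists in plain Python, and zip_with_index_model is a hypothetical name.

```python
def zip_with_index_model(partitions):
    """Pair every element with a running index that continues
    from one partition to the next."""
    result, index = [], 0
    for part in partitions:
        indexed = []
        for x in part:
            indexed.append((x, index))
            index += 1
        result.append(indexed)
    return result


indexed = zip_with_index_model([["a", "b"], ["c"], ["d"]])
print(indexed)  # [[('a', 0), ('b', 1)], [('c', 2)], [('d', 3)]]
```

Under this assumption, numbering a partition requires knowing how many elements precede it in earlier partitions.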

zipWithUniqueId(self)
Zips this RDD with generated unique Long ids.

name(self)
Return the name of this RDD.

setName(self, name)
Assign a name to this RDD.

toDebugString(self)
A description of this RDD and its recursive dependencies for debugging.

getStorageLevel(self)
Get the RDD's current storage level.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __sizeof__, __str__, __subclasshook__

Properties
Inherited from object: __class__
