Accessing OpenStack Swift from Spark
Spark’s support for Hadoop InputFormat allows it to process data in OpenStack Swift using the
same URI formats as in Hadoop. You can specify a path in Swift as input through a
URI of the form `swift://container.PROVIDER/path`. You will also need to set your
Swift security credentials, either in `core-site.xml` or via `SparkContext.hadoopConfiguration`.
The current Swift driver requires Swift to use the Keystone authentication method, or
its Rackspace-specific predecessor.
Configuring Swift for Better Data Locality
Although not mandatory, it is recommended to configure the Swift proxy server with the
`list_endpoints` middleware for better data locality. More information is
available in the OpenStack Swift documentation for `list_endpoints`.
Dependencies
The Spark application should include the `hadoop-openstack` dependency, which can
be done by including the `hadoop-cloud` module for the specific version of Spark used.
For example, with Maven, declare the dependency in the `pom.xml` file.
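A minimal sketch of such a declaration is given here. The artifact ID `spark-hadoop-cloud_${scala.binary.version}` and the properties `scala.binary.version` and `spark.version` are assumptions that are not stated in this document; check the coordinates published for your Spark distribution before relying on them.

```xml
<!-- Sketch only: goes inside the <dependencies> element of pom.xml. -->
<!-- Assumption: the hadoop-cloud module is published under the artifact ID
     below, with a Scala binary version suffix. Verify this for your release. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hadoop-cloud_${scala.binary.version}</artifactId>
  <!-- Assumption: spark.version is a property defined elsewhere in the POM
       and matches the Spark version the application runs on. -->
  <version>${spark.version}</version>
</dependency>
```

The version should match the Spark release in use so that the transitive `hadoop-openstack` dependency is compatible with it.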
Configuration Parameters
Create `core-site.xml` and place it inside Spark’s `conf` directory.
The main category of parameters that should be configured is the authentication parameters
required by Keystone.
The following table lists the Keystone parameters and whether each one is mandatory.
`PROVIDER` can be any (alphanumeric) name.
Property Name | Meaning | Required |
---|---|---|
`fs.swift.service.PROVIDER.auth.url` | Keystone Authentication URL | Mandatory |
`fs.swift.service.PROVIDER.auth.endpoint.prefix` | Keystone endpoints prefix | Optional |
`fs.swift.service.PROVIDER.tenant` | Tenant | Mandatory |
`fs.swift.service.PROVIDER.username` | Username | Mandatory |
`fs.swift.service.PROVIDER.password` | Password | Mandatory |
`fs.swift.service.PROVIDER.http.port` | HTTP port | Mandatory |
`fs.swift.service.PROVIDER.region` | Keystone region | Mandatory |
`fs.swift.service.PROVIDER.public` | Indicates whether to use the public (off cloud) or private (in cloud; no transfer fees) endpoints | Mandatory |
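For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester`
with password `testing` defined for tenant `test`. Then `core-site.xml`
should define each mandatory property from the table with `PROVIDER` replaced by `SparkTest`,
for instance `fs.swift.service.SparkTest.username` set to `tester`.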
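A minimal sketch of such a file follows. The tenant, username, and password come from the example above. The authentication URL, HTTP port, region name, and endpoint visibility are placeholder assumptions that must be replaced with the values of your own Keystone and Swift deployment.

```xml
<configuration>
  <!-- Assumption: Keystone listens locally; replace with your real auth URL. -->
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <!-- Assumption: placeholder port for the Swift HTTP endpoint. -->
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <!-- Assumption: placeholder Keystone region name. -->
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <!-- Assumption: use the public endpoints; set to false for private ones. -->
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <!-- Values taken from the example: tenant test, user tester, password testing. -->
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
```

With this configuration, data in a container named `mycontainer` (a hypothetical name) would be addressed as `swift://mycontainer.SparkTest/path`.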
Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`, and
`fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in
`core-site.xml` is not always a good approach.
We suggest keeping those parameters in `core-site.xml` only for testing purposes when running Spark
via `spark-shell`.
For job submissions, they should be provided via `sparkContext.hadoopConfiguration`.