Working With Data in S3
Amazon’s Simple Storage Service (S3) is a highly scalable, durable, general-purpose store that has been available since the original launch of Amazon Web Services (AWS) and is one of its most widely used services.
Whether you run a pyfora cluster in AWS or locally, pyfora lets you work with datasets stored in S3 in much the same way you would use files on your local disk.
Reading From S3
pyfora lets you treat files stored in S3 as if they were regular Python strings, even if they are much larger than the amount of memory available on any machine in your cluster.
The importS3Dataset() function creates a RemotePythonObject that represents the entire content of the specified file in S3 as a string of bytes, which can then be parsed into different data structures.
For example, to parse a CSV file in S3 into a pandas.DataFrame:
import pyfora
import pyfora.pandas_util

executor = pyfora.connect('http://<cluster_manager_address>:30000')
data_as_string = executor.importS3Dataset('bucket_name', 'path/to/file.csv')

with executor.remotely:
    data_frame = pyfora.pandas_util.read_csv_from_string(data_as_string)
    # data_frame is a pandas.DataFrame that lives in memory in the pyfora cluster
    num_of_rows = len(data_frame)
    # do stuff with data_frame...

print "Num of rows:", num_of_rows.toLocal().result()
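To make the "string of bytes parsed into a data structure" idea concrete, here is a minimal local sketch that uses only the Python standard library and runs without a pyfora cluster. It is not how pyfora.pandas_util.read_csv_from_string is implemented; the function name parse_csv_string and the sample data are hypothetical and exist only for illustration.

```python
import csv


def parse_csv_string(text):
    """Parse CSV text into a dict mapping column name -> list of string values.

    Hypothetical helper for illustration only. It splits on line boundaries,
    so it does not handle quoted fields that contain embedded newlines.
    """
    reader = csv.reader(text.splitlines())
    header = next(reader)
    columns = dict((name, []) for name in header)
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


# Stand-in for the content of a CSV file; in pyfora this string would come
# from importS3Dataset() instead of a literal.
sample = "id,price\n1,9.5\n2,12.0\n"
columns = parse_csv_string(sample)
num_of_rows = len(columns["id"])
print("Num of rows: %d" % num_of_rows)
```

The point of the sketch is the shape of the workflow: the whole file is one string, and parsing is an ordinary function applied to that string. In pyfora the same pattern applies, except that the string and the parsed result live in the cluster's memory rather than on your machine.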
Writing to S3
exportS3Dataset() is used to write strings into S3. For example:
import pyfora

executor = pyfora.connect('http://<cluster_manager_address>:30000')

with executor.remotely:
    large_string = 'lots of data ' * 10**9

executor.exportS3Dataset(large_string, 'bucket_name', 'path/to/file.txt')
AWS Credentials
To access private data in S3, the pyfora cluster must be given credentials with appropriate read
and/or write permissions to the buckets and keys being used.
The pyfora worker service reads AWS credentials from two environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
These are the same variables used by boto and the AWS CLI tools.
When launching pyfora services in docker containers, you can set these variables as part of the docker run command. For example:
docker run -d -e AWS_ACCESS_KEY_ID=<key> -e AWS_SECRET_ACCESS_KEY=<secret> ufora/service
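If you launch containers from a script, it can help to check that both variables are present before building the command. The following is a minimal sketch under that assumption; build_docker_run_command and REQUIRED_VARS are hypothetical names, not part of pyfora, and the only docker flags used are the ones shown in the command above.

```python
# Names of the environment variables the pyfora worker service reads.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def build_docker_run_command(env, image="ufora/service"):
    """Return the docker run argument list, or raise if a credential is missing.

    Hypothetical helper: `env` is any mapping of variable names to values,
    for example os.environ.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing AWS credentials: " + ", ".join(missing))
    command = ["docker", "run", "-d"]
    for name in REQUIRED_VARS:
        command += ["-e", "%s=%s" % (name, env[name])]
    command.append(image)
    return command


# Placeholder values, mirroring the <key> and <secret> placeholders above.
example_env = {"AWS_ACCESS_KEY_ID": "<key>", "AWS_SECRET_ACCESS_KEY": "<secret>"}
command = build_docker_run_command(example_env)
print(" ".join(command))
```

In a real launch script you would pass os.environ instead of example_env and hand the resulting list to subprocess.check_call. Building the command as a list rather than a single string avoids shell quoting problems if a secret contains special characters.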