Working With Data in S3

Amazon’s Simple Storage Service (S3) is a highly scalable, durable, general purpose store, that has been around since the original launch of Amazon Web Services (AWS), and is one of their most widely used services.

Whether you run a pyfora cluster in AWS or locally, pyfora lets you work with datasets stored in S3 in much the same way you would use files on your local disk.

Reading From S3

pyfora lets you treat files stored in S3 as if they are regular python strings even if they are much larger than amount of memory available on any machine in your cluster. The importS3Dataset() function creates a RemotePythonObject that represents the entire content of the specified file in S3 as a string of bytes, which can then be parsed into different data-structures.

For example, to parse a CSV file in S3 into a pandas.DataFrame:

import pyfora
import pyfora.pandas_util

executor = pyfora.connect('http://<cluster_manager_address>:30000')

data_as_string = executor.importS3Dataset('bucket_name', 'path/to/file.csv')
with executor.remotely:
    data_frame = pyfora.pandas_util.read_csv_from_string(data_as_string)

    # data_frame is a pandas.DataFrame that lives in memory in the pyfora cluster
    num_of_rows = len(data_frame)

    # do stuff with data_frame...

print "Num of rows:", num_of_rows.toLocal().result()

Writing to S3

exportS3Dataset() is used to write strings into S3. For example:

import pyfora

executor = pyfora.connect('http://<cluster_manager_address>:30000')

with executor.remotely:
    large_string = 'lots of data ' * 10**9

executor.exportS3Dataset(large_string, 'bucket_name', 'path/to/file.txt')

AWS Credentials

To access private data in S3, the pyfora cluster must be given credentials with appropriate read and/or write permissions to the buckets and keys being used. The pyfora worker service reads AWS credentials from two environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. These are the same variables used by boto and the AWS CLI tools.

When launching pyfora services in docker containers, you can set these variables as part of the docker run command. For example:

docker run -d -e AWS_ACCESS_KEY_ID=<key> -e AWS_SECRET_ACCESS_KEY=<secret> ufora/service