Linear Regression

This tutorial demonstrates using pyfora to:
  1. Load a large CSV file from Amazon S3
  2. Parse it into a pandas.DataFrame
  3. Run linear regression on the loaded DataFrame
  4. Download the regression coefficients and intercept back to Python

Important

The example below uses a large dataset: a 64GB CSV file that parses into 20GB of normally distributed, randomly generated floating-point numbers. It takes about 10 minutes to run on three c3.8xlarge instances in EC2.

You can use the pyfora_aws script installed with the pyfora package to easily set up a pyfora cluster in EC2 using either on-demand or spot instances.

If you prefer to try a much smaller version of this example, you can use the 5.2GB dataset iid-normal-floats-13mm-by-17.csv by changing the file name in line 9 below.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
 import pyfora
 from pyfora.pandas_util import read_csv_from_string
 from pyfora.algorithms import linearRegression

 print "Connecting..."
 executor = pyfora.connect('http://<cluster_manager>:30000')
 print "Importing data..."
 raw_data = executor.importS3Dataset('ufora-test-data',
                                  'iid-normal-floats-20GB-20-columns.csv').result()

 print "Parsing and regressing..."
 with executor.remotely:
     data_frame = read_csv_from_string(raw_data)
     predictors = data_frame.iloc[:, :-1]
     responses = data_frame.iloc[:, -1:]

     regression_result = linearRegression(predictors, responses)
     coefficients = regression_result[:-1]
     intercept = regression_result[-1]


 print 'coefficients:', coefficients.toLocal().result()
 print 'intercept:', intercept.toLocal().result()

If you have used pandas, the code above should look familiar. After connecting to a pyfora cluster using pyfora.connect() in line 6, we import a dataset from Amazon S3 in line 8 using importS3Dataset().

The value raw_data, obtained by calling result() on the return value of importS3Dataset(), is a RemotePythonObject that represents the entire dataset as a string. The data itself is loaded into the cluster's memory lazily, only when it is needed.
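To make "lazily loaded" concrete, here is a minimal sketch of a handle that defers loading until its value is first requested. It is an illustration only, not how pyfora implements RemotePythonObject: the class LazyValue, the function load_dataset and the sample text are all hypothetical.

```python
class LazyValue(object):
    """Hypothetical handle that loads its data the first time it is needed."""

    def __init__(self, loader):
        self._loader = loader   # callable that produces the data
        self._loaded = False
        self._value = None

    def get(self):
        # Run the loader only once, then reuse the cached value.
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value


load_calls = []  # records every time the loader actually runs


def load_dataset():
    # Stand-in for an expensive read; assumed for illustration.
    load_calls.append("loaded")
    return "0.1,0.2\n0.3,0.4\n"


handle = LazyValue(load_dataset)
calls_before_use = len(load_calls)  # creating the handle loads nothing

first = handle.get()   # triggers the load
second = handle.get()  # served from the cached value
```

Creating the handle is cheap; the cost of loading is paid only when the data is used, and only once.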

All the code inside the with executor.remotely: block that starts in line 12 is shipped to the cluster and executes remotely.

We use read_csv_from_string() to read the CSV in raw_data and produce a DataFrame.
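As a small local analogy for the parsing and column-splitting steps, the sketch below uses only the Python standard library rather than pyfora or pandas. It assumes a header-less, all-numeric CSV, which the tutorial does not state about the real dataset, and the sample text is made up.

```python
import csv
import io

# Made-up sample: three columns, no header row (an assumption).
csv_text = u"1.0,2.0,3.0\n4.0,5.0,6.0\n"

# Parse the CSV text held in a string into rows of floats.
rows = [[float(cell) for cell in record]
        for record in csv.reader(io.StringIO(csv_text))]

# Mirror data_frame.iloc[:, :-1] and data_frame.iloc[:, -1:]:
# every column but the last, and the last column kept as a one-item row.
predictors = [row[:-1] for row in rows]
responses = [row[-1:] for row in rows]
```

Slicing with -1: instead of indexing with -1 keeps the responses two-dimensional, just as iloc[:, -1:] returns a one-column DataFrame rather than a Series.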

Our regression fits a linear model that predicts the last column from all the preceding ones. linearRegression() returns an array that holds the linear model’s coefficients followed by its intercept as the last element.
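To show what "coefficients followed by the intercept" means, here is a small pure-Python sketch of ordinary least squares solved through the normal equations. The function fit_linear_model and the sample data are hypothetical; this is not pyfora's linearRegression() implementation, only an illustration of the same result layout on a tiny input.

```python
def fit_linear_model(predictors, responses):
    """Least-squares fit; returns the coefficients followed by the intercept."""
    # Append a constant 1.0 column so the last fitted weight is the intercept.
    rows = [[float(v) for v in row] + [1.0] for row in predictors]
    n = len(rows[0])

    # Build the augmented normal-equation system [X^T X | X^T y].
    system = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, responses))]
        for i in range(n)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(system[k][col]))
        system[col], system[pivot] = system[pivot], system[col]
        pivot_value = system[col][col]
        system[col] = [v / pivot_value for v in system[col]]
        for k in range(n):
            if k != col:
                factor = system[k][col]
                system[k] = [a - factor * b
                             for a, b in zip(system[k], system[col])]

    # The last column now holds the solution.
    return [system[i][n] for i in range(n)]


# Sample data generated exactly from y = 2*x1 - 3*x2 + 5.
sample_predictors = [[1, 2], [2, 1], [3, 5], [4, 3], [0, 1]]
sample_responses = [1, 6, -4, 4, 2]

weights = fit_linear_model(sample_predictors, sample_responses)
coefficients = weights[:-1]  # same slicing as regression_result[:-1]
intercept = weights[-1]      # same indexing as regression_result[-1]
```

Because the sample responses follow the model exactly, the fit recovers coefficients close to 2 and -3 and an intercept close to 5, up to floating-point rounding.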

In lines 22 and 23, outside the with executor.remotely: block, we bring some of the values computed remotely back into the local Python environment. Values assigned to variables inside the with executor.remotely: block are left in the pyfora cluster by default because they can be very large, much larger than the amount of memory available on your machine. Instead, they are represented locally by RemotePythonObject instances, which can be downloaded by calling their toLocal() method and then result() on the value it returns.
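The chained toLocal().result() call follows a request-then-wait shape. As an analogy only, the sketch below shows that shape with the concurrent.futures module from the Python 3 standard library. It does not use pyfora, and download_value and its return value are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def download_value():
    # Stand-in for fetching a remotely held value (hypothetical data).
    return [0.5, -1.25, 3.0]


with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(download_value)  # starts the work, returns a Future
    local_value = pending.result()         # blocks until the value is ready
```

Splitting the request from the wait lets a program start several downloads before blocking on any of them.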