Linear Regression
This tutorial demonstrates using pyfora to:
- Load a large CSV file from Amazon S3
- Parse it into a pandas.DataFrame
- Run linear regression on the loaded DataFrame
- Download the regression coefficients and intercept back to Python
Important
The example below uses a large dataset: a 64GB CSV file that parses into 20GB of normally distributed, randomly generated floating-point numbers. It takes about 10 minutes to run on three c3.8xlarge instances in EC2.
You can use the pyfora_aws script installed with the pyfora package to easily set up a pyfora cluster in EC2 using either on-demand or spot instances.
If you prefer to try a (much) smaller version of this example, you can use the 5.2GB dataset iid-normal-floats-13mm-by-17.csv by modifying line 9 below accordingly.
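To check the parsing and slicing logic before touching either S3 dataset, a tiny file of the same kind (rows of normally distributed floats) can be generated locally. The following is a minimal Python 3 sketch that runs without pyfora. The function name make_normal_csv and the choice to emit no header row are assumptions made for illustration; the layout of the real files is not described here.

```python
import csv
import io
import random


def make_normal_csv(rows, cols, seed=0):
    """Return CSV text holding rows x cols standard-normal floats (no header row)."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for _ in range(rows):
        writer.writerow([repr(rng.gauss(0.0, 1.0)) for _ in range(cols)])
    return buffer.getvalue()


# Build a 5 x 4 sample and parse it back to confirm its shape.
sample = make_normal_csv(rows=5, cols=4)
parsed = [[float(cell) for cell in line.split(",")] for line in sample.splitlines()]
print(len(parsed), "rows x", len(parsed[0]), "columns")
```

Because every column is independent noise, a regression on data like this should yield coefficients close to zero, which makes it a test of the plumbing rather than of the model.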
 1  import pyfora
 2  from pyfora.pandas_util import read_csv_from_string
 3  from pyfora.algorithms import linearRegression
 4
 5  print "Connecting..."
 6  executor = pyfora.connect('http://<cluster_manager>:30000')
 7  print "Importing data..."
 8  raw_data = executor.importS3Dataset('ufora-test-data',
 9                                      'iid-normal-floats-20GB-20-columns.csv').result()
10
11  print "Parsing and regressing..."
12  with executor.remotely:
13      data_frame = read_csv_from_string(raw_data)
14
15      predictors = data_frame.iloc[:, :-1]
16      responses = data_frame.iloc[:, -1:]
17
18      regression_result = linearRegression(predictors, responses)
19      coefficients = regression_result[:-1]
20      intercept = regression_result[-1]
21
22  print 'coefficients:', coefficients.toLocal().result()
23  print 'intercept:', intercept.toLocal().result()
If you have used pandas, the code above should look familiar.
After connecting to a pyfora cluster using pyfora.connect() in line 6, we import a dataset from Amazon S3 in line 8 using importS3Dataset().
The value raw_data returned from importS3Dataset() is a RemotePythonObject that represents the entire dataset as a string.
The data itself is lazily loaded into memory in the cluster when it is needed.
All the code inside the with executor.remotely: block that starts in line 12 is shipped to the cluster and executes remotely.
We use read_csv_from_string() to read the CSV in raw_data and produce a DataFrame.
Our regression fits a linear model to predict the last column from the prior ones.
The linearRegression() algorithm returns an array holding the linear model’s coefficients followed by its intercept.
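To make the shape of that result concrete, here is a small local sketch of an ordinary least squares fit that predicts the last column from the earlier ones and returns the coefficients followed by the intercept. It is plain Python 3 with no pyfora or pandas dependency. The names solve_linear_system and linear_regression are hypothetical, and solving the normal equations is an assumption chosen for brevity; it is not a description of how pyfora's linearRegression() is implemented.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def linear_regression(rows):
    """Fit the last column from the earlier ones; return coefficients + [intercept]."""
    # A constant 1.0 is appended to each predictor row so the last weight is the intercept.
    xs = [list(row[:-1]) + [1.0] for row in rows]
    ys = [row[-1] for row in rows]
    k = len(xs[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Data generated exactly from y = 2*a - 3*b + 0.5, so the fit should recover those values.
rows = [(a, b, 2 * a - 3 * b + 0.5) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]]
result = linear_regression(rows)
coefficients, intercept = result[:-1], result[-1]
print("coefficients:", [round(c, 6) for c in coefficients])
print("intercept:", round(intercept, 6))
```

The final two slices mirror the tutorial code: everything but the last element is treated as the coefficients, and the last element as the intercept.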
In lines 22 and 23, outside the with executor.remotely: block, we bring some of the values computed remotely back into the local Python environment.
Values assigned to variables inside the with executor.remotely: block are left in the pyfora cluster by default because they can be very large, much larger than the amount of memory available on your machine. Instead, they are represented locally using RemotePythonObject instances that can be downloaded using their toLocal() method.
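The chained toLocal().result() calls in lines 22 and 23 suggest a future-style interface: requesting a download returns immediately, and result() blocks until the value has arrived. The sketch below imitates that pattern locally with Python 3's concurrent.futures. RemoteHandle and to_local are hypothetical stand-ins invented for this illustration, with a background thread playing the role of the cluster; none of it is pyfora code.

```python
from concurrent.futures import ThreadPoolExecutor


class RemoteHandle(object):
    """Hypothetical stand-in for a value that stays elsewhere until requested."""

    def __init__(self, pool, compute):
        self._pool = pool
        self._compute = compute  # zero-argument callable that produces the value

    def to_local(self):
        # Hand back a Future right away; the value is produced in the background.
        return self._pool.submit(self._compute)


with ThreadPoolExecutor(max_workers=1) as pool:
    big_sum = RemoteHandle(pool, lambda: sum(range(1000000)))
    future = big_sum.to_local()  # returns without waiting for the value
    value = future.result()      # blocks until the value is available

print(value)
```

Keeping only a lightweight handle locally means nothing large crosses the network unless the caller explicitly asks for it, which is the design choice the paragraph above describes.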