This course covers the fundamentals of parallel machine learning algorithms, tailored specifically to learning tasks involving large datasets. The course reviews methods for dealing with both large and high-dimensional datasets, emphasizing distributed implementations. In addition to the theory behind statistical data analysis, the course offers a hands-on approach, using Spark as a development platform for parallel learning and the Massachusetts Green High Performance Computing Center (MGHPCC) as a programming environment. In detail, the course will cover:

- Apache Spark fundamentals, multi-threaded/cluster execution.
- Resilient distributed datasets (RDDs), map-reduce operations, persistence and iterative algorithms, lazy evaluation.
- Working with key-value pairs, joins.
- Convex sets and functions, convex optimization, gradient descent.
- Linear regression, Gauss-Markov theorem, generalized linear models, ridge and lasso regularization.
- Feature selection, cross-validation, bias-variance trade-off.
- Classification, logistic regression, loss functions. ROC curves and AUC.
- Stochastic gradient descent. Matrix and tensor factorization.
- Graph-parallel algorithms & sparsity.
- Perceptron algorithm & deep neural networks.

There will be four homework assignments, each with a programming component, as well as a midterm exam and a final course project. The grade breakdown is as follows:

- Homework: 40%
- Midterm exam: 30%
- Course project: 30%

- Karau, H., Konwinski, A., Wendell, P., and Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media. Available online at the NEU library.
- Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Available online.
- Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer. Available online.

All homework assignments use Apache Spark, with the Discovery Cluster as the computing environment. Knowledge of Python is recommended but not strictly required; the first few lectures cover enough Python to follow the rest of the course.

Students enrolled in the class can find additional information on the course's Blackboard website.