PySpark | Project
This document is designed to be read in parallel with the code in the pyspark-template-project repository. Together, these constitute what we consider to be a ‘best practices’ approach to writing ETL jobs using Apache Spark and its Python (‘PySpark’) APIs. This project addresses the following topics:
- how to pass configuration parameters to a PySpark job (sketched in the example below);
- how to handle dependencies on other modules and packages;
- how to structure ETL code in such a way that it can be easily tested and debugged; and
- what constitutes a ‘meaningful’ test for an ETL job.
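As a rough illustration of the first and third points, the sketch below shows one common pattern: a JSON config file shipped alongside the job via spark-submit, and the ETL logic split into small extract/transform/load functions. All file names, column names, and function signatures here are hypothetical assumptions for illustration, not taken from the repository.

```python
# etl_job.py -- a hypothetical, minimal sketch; not the repository's actual code.
import json

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col


def extract(spark: SparkSession, path: str) -> DataFrame:
    """Read the source data into a DataFrame."""
    return spark.read.parquet(path)


def transform(df: DataFrame, threshold: int) -> DataFrame:
    """Pure DataFrame-in, DataFrame-out step: easy to unit-test in isolation."""
    return df.filter(col("score") > threshold)


def load(df: DataFrame, path: str) -> None:
    """Write the transformed data back out."""
    df.write.mode("overwrite").parquet(path)


def main() -> None:
    spark = SparkSession.builder.appName("example_etl_job").getOrCreate()

    # Hypothetical config file, shipped to the cluster with e.g.
    #   spark-submit --files etl_config.json etl_job.py
    # so that it is available in the job's working directory at runtime.
    with open("etl_config.json") as f:
        config = json.load(f)

    data = extract(spark, config["input_path"])
    data = transform(data, config["threshold"])
    load(data, config["output_path"])
    spark.stop()


if __name__ == "__main__":
    main()
```

Because `transform` takes a DataFrame and returns a DataFrame without touching external storage, a ‘meaningful’ test can build a small input DataFrame in memory, run `transform`, and assert on the collected output.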
Find my notebook in my GitHub repository here: https://github.com/lp-dataninja/pyspark-example-project