PySpark | Project
This document is designed to be read in parallel with the code in the pyspark-template-project repository. Together, these constitute what we consider to be a ‘best practices’ approach to writing ETL jobs using Apache Spark and its Python (‘PySpark’) APIs. This project addresses the following topics:
- how to pass configuration parameters to a PySpark job (sketched in the example below);
- how to handle dependencies on other modules and packages;
- how to structure ETL code in such a way that it can be easily tested and debugged; and
- what constitutes a ‘meaningful’ test for an ETL job.
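As a rough illustration of the first and third points, the sketch below shows one common pattern: a JSON config file shipped alongside the job via spark-submit, and the ETL logic split into small extract/transform/load functions. All file names, column names, and function signatures here are hypothetical assumptions for illustration, not taken from the repository.

```python
# etl_job.py -- a hypothetical, minimal sketch; not the repository's actual code.
import json

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col


def extract(spark: SparkSession, path: str) -> DataFrame:
    """Read the source data into a DataFrame."""
    return spark.read.parquet(path)


def transform(df: DataFrame, threshold: int) -> DataFrame:
    """Pure DataFrame-in, DataFrame-out step: easy to unit-test in isolation."""
    return df.filter(col("score") > threshold)


def load(df: DataFrame, path: str) -> None:
    """Write the transformed data back out."""
    df.write.mode("overwrite").parquet(path)


def main() -> None:
    spark = SparkSession.builder.appName("example_etl_job").getOrCreate()

    # Hypothetical config file, shipped to the cluster with e.g.
    #   spark-submit --files etl_config.json etl_job.py
    # so that it is available in the job's working directory at runtime.
    with open("etl_config.json") as f:
        config = json.load(f)

    data = extract(spark, config["input_path"])
    data = transform(data, config["threshold"])
    load(data, config["output_path"])
    spark.stop()


if __name__ == "__main__":
    main()
```

Because `transform` takes a DataFrame and returns a DataFrame without touching external storage, a ‘meaningful’ test can build a small input DataFrame in memory, run `transform`, and assert on the collected output.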
Find my notebook in my GitHub repository here: https://github.com/lp-dataninja/pyspark-example-project