Data Engineering Best Practices

This page is still a work in progress and exists solely for my own reference.

Avoid Multiple SparkSessions and SparkContexts

Creating multiple SparkSessions and SparkContexts can cause problems, not least because only one SparkContext can be active per JVM. The best practice is to call SparkSession.builder.getOrCreate(), which returns the existing SparkSession if there is one and creates a new one otherwise.

# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# Create spark_session
spark_session = SparkSession.builder.getOrCreate()

# Print spark_session
print(spark_session)
