Essential Day-to-Day Commands for Data Engineers
This article is still a W.I.P. and exists mainly for my own reference.
What is this, man?
This is my command center for Spark, Scala, sbt, PostgreSQL, Docker, and SQL commands, streamlining my daily tasks and serving as a quick cheat sheet to make things faster and easier.
Let's go!
Creating and Activating Python Virtual Environments with venv
While there are many powerful tools available, such as conda, pyenv, and Poetry, each offering unique features, nothing beats the built-in venv module when it comes to getting a project up and running in a simple way.
python3 -m venv myenv
This approach keeps your project environment isolated from your system-wide Python installation and helps avoid dependency conflicts.
Step-by-Step Guide:
1. Create the Virtual Environment
In your project directory, run:
python3 -m venv myenv
This command creates a new directory named myenv containing a separate Python interpreter, libraries, and scripts.
2. Activate the Virtual Environment
To start using the virtual environment, activate it with:
source myenv/bin/activate
3. Work Within the Environment
With the environment activated, install packages and run your scripts without affecting your global Python setup. For example:
pip install requests
4. Deactivate the Virtual Environment
Once you're done working, simply run:
deactivate
This command returns you to your system's default Python interpreter.
Using Python's venv is a lightweight and effective way to maintain clean project environments. Whether you're juggling multiple projects or just want to avoid dependency conflicts, this simple approach has you covered.
While there are many advanced tools out there, sometimes the simplest method is the best for getting up and running quickly.
Scala sbt Project Commands
sbt (originally short for Simple Build Tool) is the standard build tool for Scala.
Create a new Scala project from a template using sbt
# creates a new Scala 3 project template
sbt new scala/scala3.g8
# creates a new Scala 2 project template
sbt new scala/scala-seed.g8
To list the main classes discovered in your sbt project
sbt "show discoveredMainClasses"
Commonly used sbt commands that are particularly relevant for Scala data engineering projects.
# Project setup and dependencies
sbt new scala/scala-seed.g8 # Create a new Scala project
sbt clean # Clean all generated files
sbt update # Update dependencies
sbt dependencyUpdates # Show available dependency updates (requires the sbt-updates plugin)
# Building and running
sbt compile # Compile the project
sbt run # Run the main class
sbt "runMain com.example.MainClass" # Run a specific main class
sbt package # Create a JAR file
sbt assembly # Create a fat JAR with all dependencies (requires the sbt-assembly plugin)
# Testing
sbt test # Run all tests
sbt "testOnly com.example.MyTest" # Run a specific test
sbt "testOnly *MyTest" # Run tests matching pattern
sbt coverage # Enable code coverage (requires the sbt-scoverage plugin)
sbt coverageReport # Generate coverage report
# Continuous development
sbt ~compile # Recompile on source changes
sbt ~test # Run tests on code changes
sbt console # Start Scala REPL with project classes
# Debugging and inspection
sbt dependencyTree # Show dependency tree
sbt scalafixAll # Run Scalafix rules and lints (requires the sbt-scalafix plugin)
sbt scalafmtCheck # Check code formatting (requires the sbt-scalafmt plugin)
sbt scalafmt # Format code automatically
# Multi-project builds
sbt "project core" # Switch to a specific subproject
sbt projects # List all subprojects
A few extra tips that are particularly useful for data engineering:
- Use sbt assembly when building jobs for Spark, as it creates a fat JAR with all dependencies included
- Add run / fork := true (the newer form of fork in run := true) to your build.sbt to avoid memory issues with large datasets
- The sbt console command is great for testing data transformations interactively
- Consider using the sbt-revolver plugin for faster development cycles with long-running services
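To make these tips concrete, here is a rough build.sbt sketch; the project name, organization, and all versions are placeholders I'm assuming, and the sbt-assembly and sbt-revolver plugins would also need to be declared in project/plugins.sbt.
// build.sbt (illustrative sketch; names and versions are placeholders)
ThisBuild / scalaVersion := "2.12.18"
ThisBuild / organization := "com.example"

lazy val root = (project in file("."))
  .settings(
    name := "spark-etl",
    // Spark is marked "provided" so the fat JAR stays slim and uses the cluster's runtime
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided",
    // Fork a separate JVM for `run` to avoid memory issues with large datasets
    run / fork := true,
    run / javaOptions += "-Xmx4g",
    // Name of the fat JAR produced by `sbt assembly` (sbt-assembly plugin)
    assembly / assemblyJarName := s"${name.value}-${version.value}.jar"
  )

// project/plugins.sbt would contain something like:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "<version>")
// addSbtPlugin("io.spray" % "sbt-revolver" % "<version>")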
Spark Scala Commands
A few useful Spark-Scala commands and examples commonly used in data engineering projects
// Required imports for the snippets below
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// SparkSession initialization
val spark = SparkSession.builder()
  .appName("MySparkApp")
  .master("local[*]") // Use "*" for all available cores
  .config("spark.some.config.option", "value")
  .getOrCreate()

// Enables the $"column" syntax used below
import spark.implicits._
// Reading data
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("path/to/file.csv")
// Writing data
df.write
.mode("overwrite") // or "append", "error", "ignore"
.partitionBy("column_name")
.parquet("output/path")
// Common transformations
df.select("col1", "col2")
df.filter($"age" > 25)
df.groupBy("department").agg(avg("salary"))
df.join(otherDf, Seq("id"), "left")
// Window functions
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec))
// UDFs (User Defined Functions)
val myUdf = udf((x: String) => x.toUpperCase)
df.withColumn("upper_name", myUdf($"name"))
// Performance optimization
df.cache() // Cache dataframe in memory
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()
// Spark SQL
df.createOrReplaceTempView("my_table") // register the DataFrame as a SQL view first
spark.sql("SELECT * FROM my_table WHERE age > 25")
// RDD operations (when needed)
val rdd = df.rdd
rdd.map(row => row.getString(0))
.filter(_.nonEmpty)
.collect()
// Debugging and analysis
df.show() // Display first 20 rows
df.printSchema() // Show schema
df.explain() // Show execution plan
df.describe().show() // Basic statistics
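Putting several of these snippets together, a minimal end-to-end job could look like the sketch below; the file paths and the age, department, and salary columns are illustrative assumptions, not a real dataset.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SimpleEtlJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SimpleEtlJob")
      .getOrCreate()

    // Read raw CSV data (path is a placeholder)
    val employees = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/employees.csv")

    // Average salary per department for employees over 25
    val avgSalaries = employees
      .filter(col("age") > 25)
      .groupBy("department")
      .agg(avg("salary").alias("avg_salary"))

    // Write the result as Parquet (path is a placeholder)
    avgSalaries.write
      .mode("overwrite")
      .parquet("output/avg_salaries")

    spark.stop()
  }
}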
Scala Spark Configuration Commands
// Memory configuration
// Note: executor/driver memory are static settings; they cannot be changed with
// spark.conf.set on a running session, so pass them to spark-submit or the builder:
SparkSession.builder()
  .config("spark.executor.memory", "4g")
  .config("spark.driver.memory", "2g")
  // ... then .getOrCreate() as shown earlier
// Partition control
spark.conf.set("spark.sql.shuffle.partitions", "200")
df.repartition(10) // Adjust number of partitions
// Broadcast joins for small tables
import org.apache.spark.sql.functions.broadcast
df1.join(broadcast(df2), "key")
// Dynamic partition pruning
spark.conf.set("spark.sql.dynamic.partition.pruning", "true")
Scala Spark Logging and Monitoring Commands
// Set log level
spark.sparkContext.setLogLevel("WARN") // Options: OFF, ERROR, WARN, INFO, DEBUG
// Get application UI URL
spark.sparkContext.uiWebUrl
// Monitor active jobs
spark.sparkContext.statusTracker
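As a small illustration of the status tracker, here is a minimal sketch that assumes an active SparkSession named spark:
// Poll the status tracker for jobs that are currently running
val tracker = spark.sparkContext.statusTracker
val activeJobIds = tracker.getActiveJobIds() // IDs of active jobs
println(s"Active Spark jobs: ${activeJobIds.mkString(", ")}")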
Spark-submit Commands
# Basic spark-submit
spark-submit \
--class com.example.MainClass \
--master yarn \
--deploy-mode cluster \
path/to/application.jar
# With specific resource configuration
spark-submit \
--class com.example.MainClass \
--master yarn \
--deploy-mode cluster \
--executor-memory 4G \
--executor-cores 2 \
--num-executors 3 \
--driver-memory 2G \
path/to/application.jar
# With additional configurations
spark-submit \
--class com.example.MainClass \
--master yarn \
--deploy-mode cluster \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.shuffle.partitions=200 \
path/to/application.jar arg1 arg2
# Running in local mode (useful for testing)
spark-submit \
--class com.example.MainClass \
--master local[*] \
path/to/application.jar
# With additional jars or files
spark-submit \
--class com.example.MainClass \
--master yarn \
--deploy-mode cluster \
--jars additional1.jar,additional2.jar \
--files config.properties \
path/to/application.jar
Key parameters explained:
--class: The main class of your application
--master: Cluster manager (yarn, local, spark://)
--deploy-mode: cluster (driver runs on the cluster) or client (driver runs locally)
--executor-memory: Memory per executor
--executor-cores: Cores per executor
--num-executors: Number of executors to launch
--driver-memory: Memory for the driver process
--jars: Additional JARs needed
--files: Additional files to be distributed
Spark Shell Interactive Commands
These commands are particularly useful when working in a production environment or when debugging Spark applications.
# Running Spark Shell (interactive)
spark-shell --master yarn \
--executor-memory 4G \
--executor-cores 2
# Running PySpark Shell
pyspark --master yarn \
--executor-memory 4G \
--executor-cores 2
# With logging configuration
spark-submit \
--class com.example.MainClass \
--master yarn \
--deploy-mode cluster \
--files log4j.properties \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
path/to/application.jar
# With Spark History Server configuration
spark-submit \
--class com.example.MainClass \
--master yarn \
--deploy-mode cluster \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs://history/directory \
path/to/application.jar
# With specific queue assignment (for YARN)
spark-submit \
--class com.example.MainClass \
--master yarn \
--deploy-mode cluster \
--queue production_queue \
path/to/application.jar
# View application logs
yarn logs -applicationId <application_id>
# List running applications
yarn application -list
# Kill a running application
yarn application -kill <application_id>
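Once a spark-shell session is open, a quick debugging pass often looks like the following sketch; the dataset path and the event_type column are assumptions for illustration.
// Inside spark-shell, `spark` and `sc` are pre-defined and spark.implicits._ is auto-imported
val df = spark.read.parquet("hdfs:///data/events") // placeholder path
df.printSchema()
df.count()
df.where($"event_type" === "error").show(10, truncate = false)
df.explain(true) // inspect logical and physical plans when debugging performance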
dbt Commands
# Project setup
dbt init project_name # Create a new dbt project
dbt deps # Install dependencies from packages.yml
# Building models
dbt run # Run all models
dbt run --models model_name # Run specific model
dbt run --models +model_name # Run model and all its dependencies
dbt run --models model_name+ # Run model and all its dependents
dbt run --models tag:staging # Run models with specific tag
dbt run --exclude model_name # Run all models except specified one
dbt run --full-refresh # Run with full refresh (rebuild from scratch)
# Testing
dbt test # Run all tests
dbt test --models model_name # Test specific model
dbt test --data # Run data (singular) tests only (legacy flag; newer dbt uses --select test_type:singular)
dbt test --schema # Run schema (generic) tests only (legacy flag; newer dbt uses --select test_type:generic)
# Documentation
dbt docs generate # Generate documentation
dbt docs serve # Serve documentation locally
# Debugging and analysis
dbt compile # Compile SQL without executing
dbt parse # Check for parsing errors
dbt --debug run --models model_name # Verbose (debug-level) output
dbt debug # Check configuration and connectivity
dbt run-operation my_macro # Run a macro
dbt run --profiles-dir path # Use alternate profiles directory
# State management
dbt source freshness # Check if source data is fresh
dbt snapshot # Run snapshot models for SCD Type 2
dbt seed # Load CSV files into database
# Environment management
dbt run --target prod # Run using production target
dbt run --vars '{"var": "value"}' # Set variables
PostgreSQL Commands
List Available PostgreSQL Versions (Homebrew):
brew formulae | grep postgresql@
Find All Tables in a Database:
SELECT
table_name
FROM
information_schema.tables
WHERE
table_schema = 'public';
Show Indexes on a Table:
SELECT
indexname,
indexdef
FROM
pg_indexes
WHERE
tablename = 'your_table_name';
Check Row Count of Each Table:
SELECT
relname AS table_name,
n_live_tup AS row_count
FROM
pg_stat_user_tables
ORDER BY
relname;
Backup a Database:
pg_dump -U your_username -F c -b -v -f "backup_name.backup" database_name
Docker & Container Management
List All Containers (Running and Stopped):
docker ps -a
Start a PostgreSQL Container with Specific Configurations:
docker run --name postgres-container -e POSTGRES_USER=user -e POSTGRES_PASSWORD=pass -d postgres
Attach to a Running Container:
docker exec -it container_name bash
Stop All Containers:
docker stop $(docker ps -aq)
Remove All Containers and Volumes:
docker rm $(docker ps -aq)
docker volume prune -f
Docker Image Commands
# Images
docker images # List all images
docker pull image_name # Pull an image
docker build -t name:tag . # Build image from Dockerfile
docker rmi image_id # Remove image
docker image prune -a # Remove all unused images
Docker Container Commands
# Containers
docker ps # List running containers
docker ps -a # List all containers
docker run -d -p 8080:80 image # Run container (detached, port mapping)
docker exec -it container_id bash # Enter container shell
docker stop container_id # Stop container
docker rm container_id # Remove container
docker logs -f container_id # Follow container logs
docker stats # Show live resource usage of running containers
Docker Compose Commands
# Docker Compose
docker-compose up -d # Start all services
docker-compose down # Stop and remove all containers
docker-compose logs -f # Follow logs for all services
docker-compose ps # List running services
docker-compose build # Build/rebuild services
docker-compose restart # Restart all services
Docker Network Commands
# Network
docker network ls # List networks
docker network create name # Create network
docker network connect name container # Connect container to network
docker network inspect name # Inspect network
Docker Volume Commands
# Volumes
docker volume ls # List volumes
docker volume create name # Create volume
docker volume rm name # Remove volume
docker volume prune # Remove unused volumes
Docker System Commands
# System
docker system df # Show docker disk usage
docker system prune # Remove unused data
docker system prune -a # Remove all unused data including unused images (add --volumes to also remove volumes)
Docker Registry Commands
# Registry
docker login # Log in to registry
docker push image_name # Push image to registry
docker tag source:tag target:tag # Tag image
Docker Debugging Commands
# Debugging
docker inspect container_id # Show container details
docker events # Show real time events
docker top container_id # Show running processes
docker diff container_id # Show changed files
Docker Environment and Configuration Commands
# Environment and configuration
docker run -e "VAR=value" image # Set environment variable
docker run --env-file=file.env image # Use env file
Data Engineering & SQL
Get Column Names of a Table:
SELECT
column_name,
data_type,
character_maximum_length,
is_nullable
FROM
information_schema.columns
WHERE
table_name = 'your_table';
Retrieve Duplicate Rows in a Table:
SELECT
column1,
column2,
COUNT(*) AS duplicate_count
FROM
your_table
GROUP BY
column1,
column2
HAVING
COUNT(*) > 1;
Query Top N Rows per Group (Window Function):
SELECT
    column1,
    column2
FROM (
    SELECT
        column1,
        column2,
        ROW_NUMBER() OVER (
            PARTITION BY
                column1
            ORDER BY
                column2 DESC
        ) AS row_num
    FROM
        your_table
) ranked
WHERE
    row_num <= N;
pgcli Commands
# Basic connection
pgcli -h hostname -U username -d database_name
pgcli postgres://username:password@hostname:5432/database_name
# Switch databases
\c database_name
# List things
\l # List all databases
\dt # List all tables
\du # List all users/roles
\dn # List all schemas
\df # List all functions
\dv # List all views
\di # List all indexes
# Describe objects
\d table_name # Describe table
\d+ table_name # Detailed table description
\dx # List installed extensions
# Output formats
\x # Toggle expanded display
\H # Toggle HTML output
\T [attributes] # Set HTML table tag attributes (used with \H; use \o to send output to a file)
\o filename # Send query results to file
# Special commands
\timing # Toggle timing of commands
\e # Edit command in external editor
\i filename # Execute commands from file
\? # Show help for all slash commands
# Quit
\q # Exit pgcli
Useful keyboard shortcuts in pgcli:
Ctrl + R: Reverse search in history
Ctrl + D: Exit
F3: Toggle multiline mode
Alt + Enter: Insert a newline without executing
Ctrl + E: Move cursor to end of line
Ctrl + A: Move cursor to start of line
Parquet Commands
parquet-tools schema file_name.parquet # Print the schema of a Parquet file
parquet-tools head -n 5 file_name.parquet # Show the first 5 records