
Essential Day-to-Day Commands for Data Engineers

This article is still a work in progress and exists mainly as a reference for my own use.

What is this, man?

This is my command center for Spark, Scala, sbt, dbt, PostgreSQL, Docker, and SQL: a quick cheat sheet that streamlines my daily tasks and makes things faster and easier to find.

Let's go!

Creating and Activating Python Virtual Environments with venv

While there are many powerful tools available, such as conda, pyenv, and Poetry, each offering unique features, nothing beats the built-in venv module when it comes to getting a project up and running simply.

This approach keeps your project environment isolated from your system-wide Python installation and helps avoid dependency conflicts.

Step-by-Step Guide:

1. Create the Virtual Environment

In your project directory, run:

python3 -m venv myenv

This command creates a new directory named myenv containing a separate Python interpreter, libraries, and scripts.

2. Activate the Virtual Environment

To start using the virtual environment, activate it with:

source myenv/bin/activate

3. Work Within the Environment

With the environment activated, install packages and run your scripts without affecting your global Python setup. For example:

pip install requests

4. Deactivate the Virtual Environment

Once you're done working, simply run:

deactivate

This command returns you to your system's default Python interpreter.

Using Python's venv is a lightweight and effective way to maintain clean project environments. Whether you're juggling multiple projects or just want to avoid dependency conflicts, this simple approach has you covered.

While there are many advanced tools out there, sometimes the simplest method is the best for getting up and running quickly.

Scala sbt Project Commands

sbt (originally short for "simple build tool") is the de facto standard build tool for Scala.

Create a new Scala 3 or Scala 2 project using sbt

 # creates a new Scala 3 project template
 sbt new scala/scala3.g8

 # creates a new Scala 2 project template
 sbt new scala/scala-seed.g8

To list the main classes discovered in your sbt project

sbt "show discoveredMainClasses"

Commonly used sbt commands that are particularly relevant for Scala data engineering projects.

# Project setup and dependencies
sbt new scala/scala-seed.g8             # Create a new Scala project
sbt clean                               # Clean all generated files
sbt update                              # Update dependencies
sbt dependencyUpdates                   # Show available dependency updates (sbt-updates plugin)

# Building and running
sbt compile                             # Compile the project
sbt run                                 # Run the main class
sbt "runMain com.example.MainClass"     # Run a specific main class
sbt package                             # Create a JAR file
sbt assembly                            # Create a fat JAR with all dependencies (sbt-assembly plugin)

# Testing
sbt test                                # Run all tests
sbt "testOnly com.example.MyTest"       # Run a specific test
sbt "testOnly *MyTest"                  # Run tests matching pattern
sbt coverage                            # Enable code coverage (sbt-scoverage plugin)
sbt coverageReport                      # Generate coverage report (sbt-scoverage plugin)

# Continuous development
sbt ~compile                            # Recompile on source changes
sbt ~test                               # Run tests on code changes
sbt console                             # Start Scala REPL with project classes

# Debugging and inspection
sbt dependencyTree                      # Show dependency tree
sbt "scalafixAll --check"               # Lint with Scalafix (sbt-scalafix plugin)
sbt scalafmtCheck                       # Check code formatting (sbt-scalafmt plugin)
sbt scalafmt                            # Format code automatically (sbt-scalafmt plugin)

# Multi-project builds
sbt "project core"                      # Switch to a specific subproject
sbt projects                            # List all subprojects

A few extra tips that are particularly useful for data engineering:

  • Use sbt assembly when building jobs for Spark, as it creates a fat JAR with all dependencies included
  • Add fork in run := true (or fork := true with the newer slash syntax) to your build.sbt to avoid memory issues with large datasets; see the build.sbt sketch after this list
  • The sbt console command is great for testing data transformations interactively
  • Consider using sbt-revolver plugin for faster development cycles with long-running services
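
As a rough illustration of these tips, here is a minimal build.sbt sketch. The project name, Scala/Spark versions, and plugin versions are placeholder assumptions, and the addSbtPlugin lines belong in project/plugins.sbt rather than build.sbt.

// build.sbt (sketch; names and versions are placeholder assumptions)
name := "my-spark-job"
scalaVersion := "2.12.18"

// Spark is usually marked "provided" so sbt assembly does not bundle it into the fat JAR
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"

// Fork a separate JVM for run/test and give it more heap,
// to avoid memory issues with large datasets
fork := true
javaOptions += "-Xmx4g"

// Resolve META-INF conflicts when sbt assembly merges dependency JARs
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

// project/plugins.sbt (plugin versions are placeholders)
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")
// addSbtPlugin("io.spray"     % "sbt-revolver" % "0.10.0")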

Spark Scala Commands

A few useful Spark-Scala commands and examples commonly used in data engineering projects

// Required imports for the snippets below
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// SparkSession initialization
val spark = SparkSession.builder()
  .appName("MySparkApp")
  .master("local[*]")    // Use "*" for all available cores
  .config("spark.some.config.option", "value")
  .getOrCreate()
import spark.implicits._  // Enables the $"column" syntax used below

// Reading data
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/file.csv")

// Writing data
df.write
  .mode("overwrite")  // or "append", "error", "ignore"
  .partitionBy("column_name")
  .parquet("output/path")

// Common transformations
df.select("col1", "col2")
df.filter($"age" > 25)
df.groupBy("department").agg(avg("salary"))
df.join(otherDf, Seq("id"), "left")

// Window functions
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec))

// UDFs (User Defined Functions)
val myUdf = udf((x: String) => x.toUpperCase)
df.withColumn("upper_name", myUdf($"name"))

// Performance optimization
df.cache()  // Cache dataframe in memory
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()

// Spark SQL
df.createOrReplaceTempView("my_table")  // Register the DataFrame as a SQL view first
spark.sql("SELECT * FROM my_table WHERE age > 25")

// RDD operations (when needed)
val rdd = df.rdd
rdd.map(row => row.getString(0))
  .filter(_.nonEmpty)
  .collect()

// Debugging and analysis
df.show()  // Display first 20 rows
df.printSchema()  // Show schema
df.explain()  // Show execution plan
df.describe().show()  // Basic statistics

Scala Spark Configuration Commands

// Memory configuration
// (static settings: pass them via spark-submit or SparkSession.builder at startup;
// changing them on an already-running session has no effect)
spark.conf.set("spark.executor.memory", "4g")
spark.conf.set("spark.driver.memory", "2g")

// Partition control
spark.conf.set("spark.sql.shuffle.partitions", "200")
df.repartition(10)  // Adjust number of partitions

// Broadcast joins for small tables
import org.apache.spark.sql.functions.broadcast
df1.join(broadcast(df2), "key")

// Dynamic partition pruning (enabled by default in Spark 3.0+)
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

Scala Spark Logging and Monitoring Commands

// Set log level
spark.sparkContext.setLogLevel("WARN")  // Options: OFF, ERROR, WARN, INFO, DEBUG

// Get application UI URL
spark.sparkContext.uiWebUrl

// Monitor active jobs
val tracker = spark.sparkContext.statusTracker
tracker.getActiveJobIds()    // IDs of currently running jobs

Spark-submit Commands

# Basic spark-submit
spark-submit \
  --class com.example.MainClass \
  --master yarn \
  --deploy-mode cluster \
  path/to/application.jar

# With specific resource configuration
spark-submit \
  --class com.example.MainClass \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4G \
  --executor-cores 2 \
  --num-executors 3 \
  --driver-memory 2G \
  path/to/application.jar

# With additional configurations
spark-submit \
  --class com.example.MainClass \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.shuffle.partitions=200 \
  path/to/application.jar arg1 arg2

# Running in local mode (useful for testing)
spark-submit \
  --class com.example.MainClass \
  --master local[*] \
  path/to/application.jar

# With additional jars or files
spark-submit \
  --class com.example.MainClass \
  --master yarn \
  --deploy-mode cluster \
  --jars additional1.jar,additional2.jar \
  --files config.properties \
  path/to/application.jar

Key parameters explained:

--class: The main class of your application
--master: Cluster manager (yarn, local, spark://)
--deploy-mode: cluster (driver runs on cluster) or client (driver runs locally)
--executor-memory: Memory per executor
--executor-cores: Cores per executor
--num-executors: Number of executors to launch
--driver-memory: Memory for driver process
--jars: Additional JARs needed
--files: Additional files to be distributed
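
For context, here is a minimal sketch of what the com.example.MainClass entry point referenced above might look like. The object name and app name match the examples in this article, while the argument handling (arg1/arg2 used as input and output paths) and the default paths are illustrative assumptions.

package com.example

import org.apache.spark.sql.SparkSession

object MainClass {
  def main(args: Array[String]): Unit = {
    // arg1 and arg2 from the spark-submit example above; defaults are placeholders
    val inputPath  = if (args.length > 0) args(0) else "input/path"
    val outputPath = if (args.length > 1) args(1) else "output/path"

    // No .master() here: the master comes from spark-submit (--master yarn or local[*])
    val spark = SparkSession.builder()
      .appName("MySparkApp")
      .getOrCreate()

    spark.read
      .option("header", "true")
      .csv(inputPath)
      .write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}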

Spark Shell Interactive Commands

These commands are particularly useful when working in a production environment or when debugging Spark applications.

# Running Spark Shell (interactive)
spark-shell --master yarn \
  --executor-memory 4G \
  --executor-cores 2

# Running PySpark Shell
pyspark --master yarn \
  --executor-memory 4G \
  --executor-cores 2

More spark-submit Options

# With logging configuration
spark-submit \
  --class com.example.MainClass \
  --master yarn \
  --deploy-mode cluster \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  path/to/application.jar

# With Spark History Server configuration
spark-submit \
  --class com.example.MainClass \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://history/directory \
  path/to/application.jar

# With specific queue assignment (for YARN)
spark-submit \
  --class com.example.MainClass \
  --master yarn \
  --deploy-mode cluster \
  --queue production_queue \
  path/to/application.jar

YARN Application Commands

# View application logs
yarn logs -applicationId <application_id>

# List running applications
yarn application -list

# Kill a running application
yarn application -kill <application_id>

dbt Commands

# Project setup
dbt init project_name           # Create a new dbt project
dbt deps                        # Install dependencies from packages.yml

# Building models
dbt run                         # Run all models
dbt run --models model_name     # Run specific model
dbt run --models +model_name    # Run model and all its dependencies
dbt run --models model_name+    # Run model and all its dependents
dbt run --models tag:staging    # Run models with specific tag
dbt run --exclude model_name    # Run all models except specified one
dbt run --full-refresh          # Run with full refresh (rebuild from scratch)

# Testing
dbt test                        # Run all tests
dbt test --models model_name    # Test specific model
dbt test --select test_type:singular    # Run singular (data) tests only
dbt test --select test_type:generic     # Run generic (schema) tests only

# Documentation
dbt docs generate               # Generate documentation
dbt docs serve                  # Serve documentation locally

# Debugging and analysis
dbt compile                     # Compile SQL without executing
dbt parse                       # Check for parsing errors
dbt --debug run --models model_name  # Verbose (debug-level) output
dbt debug                       # Check configuration and connectivity
dbt run-operation my_macro      # Run a macro
dbt run --profiles-dir path     # Use alternate profiles directory

# State management
dbt source freshness           # Check if source data is fresh
dbt snapshot                   # Run snapshot models for SCD Type 2
dbt seed                       # Load CSV files into database

# Environment management
dbt run --target prod          # Run using production target
dbt run --vars '{"var": "value"}' # Set variables

PostgreSQL Commands

List PostgreSQL Versions Available via Homebrew:

brew formulae | grep postgresql@

Find All Tables in a Database:

SELECT
  table_name
FROM
  information_schema.tables
WHERE
  table_schema = 'public';

Show Indexes on a Table:

SELECT
  indexname,
  indexdef
FROM
  pg_indexes
WHERE
  tablename = 'your_table_name';

Check Row Count of Each Table:

SELECT
  relname AS table_name,
  n_live_tup AS row_count
FROM
  pg_stat_user_tables
ORDER BY
  relname;

Backup a Database:

pg_dump -U your_username -F c -b -v -f "backup_name.backup" database_name

Docker & Container Management

List All Containers (Running and Stopped):

docker ps -a

Start a PostgreSQL Container with Specific Configurations:

docker run --name postgres-container -e POSTGRES_USER=user -e POSTGRES_PASSWORD=pass -d postgres

Attach to a Running Container:

docker exec -it container_name bash

Stop All Containers:

docker stop $(docker ps -aq)

Remove All Containers and Volumes:

docker rm $(docker ps -aq)
docker volume prune -f

Docker Image Commands

# Images
docker images                    # List all images
docker pull image_name          # Pull an image
docker build -t name:tag .      # Build image from Dockerfile
docker rmi image_id             # Remove image
docker image prune -a           # Remove all unused images

Docker Container Commands

# Containers
docker ps                       # List running containers
docker ps -a                    # List all containers
docker run -d -p 8080:80 image # Run container (detached, port mapping)
docker exec -it container_id bash  # Enter container shell
docker stop container_id        # Stop container
docker rm container_id          # Remove container
docker logs -f container_id     # Follow container logs
docker stats                    # Show live container resource usage

Docker Compose Commands

# Docker Compose
docker-compose up -d            # Start all services
docker-compose down            # Stop and remove all containers
docker-compose logs -f         # Follow logs for all services
docker-compose ps              # List running services
docker-compose build           # Build/rebuild services
docker-compose restart         # Restart all services

Docker Network Commands

# Network
docker network ls              # List networks
docker network create name     # Create network
docker network connect name container  # Connect container to network
docker network inspect name    # Inspect network

Docker Volume Commands

# Volumes
docker volume ls               # List volumes
docker volume create name      # Create volume
docker volume rm name         # Remove volume
docker volume prune           # Remove unused volumes

Docker System Commands

# System
docker system df              # Show docker disk usage
docker system prune          # Remove unused data
docker system prune -a --volumes  # Remove all unused data, including volumes

Docker Registry Commands

# Registry
docker login                  # Log in to registry
docker push image_name       # Push image to registry
docker tag source:tag target:tag  # Tag image

Docker Debugging Commands

# Debugging
docker inspect container_id   # Show container details
docker events                # Show real time events
docker top container_id      # Show running processes
docker diff container_id     # Show changed files

Docker Environment and Configuration Commands

# Environment and configuration
docker run -e "VAR=value" image        # Set environment variable
docker run --env-file=file.env image   # Use env file

Data Engineering & SQL

Get Column Names of a Table:

SELECT
  column_name,
  data_type,
  character_maximum_length,
  is_nullable
FROM
  information_schema.columns
WHERE
  table_name = 'your_table';

Retrieve Duplicate Rows in a Table:

SELECT
  column1,
  column2,
  COUNT(*) AS duplicate_count
FROM
  your_table
GROUP BY
  column1,
  column2
HAVING
  COUNT(*) > 1;

Query Top N Rows per Group (Window Function):

-- row_num cannot be referenced in the WHERE clause of the same query level,
-- so compute it in a subquery and filter in the outer query
SELECT
  column1,
  column2
FROM (
  SELECT
    column1,
    column2,
    ROW_NUMBER() OVER (
      PARTITION BY
        column1
      ORDER BY
        column2 DESC
    ) AS row_num
  FROM
    your_table
) AS ranked
WHERE
  row_num <= N;
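
For comparison, the same top-N-per-group pattern in Spark Scala; a minimal sketch assuming a DataFrame named df with the placeholder columns used above and a cutoff value n.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each column1 group by column2 descending, then keep the top n
val n = 3  // placeholder cutoff
val w = Window.partitionBy("column1").orderBy(col("column2").desc)
val topN = df
  .withColumn("row_num", row_number().over(w))
  .filter(col("row_num") <= n)
  .drop("row_num")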

pgcli Commands

# Basic connection
pgcli -h hostname -U username -d database_name
pgcli postgres://username:password@hostname:5432/database_name

# Switch databases
\c database_name

# List things
\l             # List all databases
\dt            # List all tables
\du            # List all users/roles
\dn            # List all schemas
\df            # List all functions
\dv            # List all views
\di            # List all indexes

# Describe objects
\d table_name  # Describe table
\d+ table_name # Detailed table description
\dx            # List installed extensions

# Output formats
\x             # Toggle expanded display
\H             # Toggle HTML output
\T format      # Change the output table format (e.g. ascii, csv)
\o filename    # Send query results to file

# Special commands
\timing        # Toggle timing of commands
\e             # Edit command in external editor
\i filename    # Execute commands from file
\?             # Show help for all slash commands

# Quit
\q             # Exit pgcli

Useful keyboard shortcuts in pgcli:

Ctrl + R: Reverse search in history
Ctrl + D: Exit
F3: Toggle multiline mode
Alt + Enter: Insert newline without executing
Ctrl + E: Move cursor to end of line
Ctrl + A: Move cursor to start of line

Parquet Commands

parquet-tools schema file_name.parquet       # Print the file's schema
parquet-tools head -n 5 file_name.parquet    # Show the first 5 records
