Python Libraries Every Data Engineer Should Know
Python libraries for beginner data engineers
1. Requests
Requests is a simple, yet elegant, HTTP library.
Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your PUT and POST data; nowadays, just use the json method!
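As a minimal sketch of the two conveniences described above, the snippet below prepares a GET and a POST request without sending anything over the network. The URL and payload are hypothetical and used only for illustration, and it assumes the `requests` package is installed.

```python
import json

import requests

# Hypothetical endpoint, used only for illustration; nothing is sent.
URL = "https://example.com/api/items"

# `params` is turned into the query string, so the URL is not built by hand.
get_request = requests.Request("GET", URL, params={"page": 2, "limit": 50}).prepare()

# `json` serializes the payload and sets the Content-Type header.
post_request = requests.Request("POST", URL, json={"name": "example"}).prepare()

print(get_request.url)
print(post_request.headers["Content-Type"])
print(json.loads(post_request.body))
```

In everyday use the same arguments are passed directly, for example `requests.get(URL, params=...)` or `requests.post(URL, json=...)`, which also send the request.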
2. Psycopg2 and similar database libraries
Psycopg2 is the most popular PostgreSQL database adapter for the Python programming language.
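Psycopg2 follows Python's DB-API 2.0 pattern of connection, cursor, execute, and fetch. The sketch below shows that pattern with the standard-library `sqlite3` module as a stand-in, since it needs no running database server. This is not Psycopg2 itself: with Psycopg2 you would open the connection with `psycopg2.connect(...)` and write placeholders as `%s` instead of `?`. The table and column names are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, standing in for a PostgreSQL connection.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table used only for this example.
cur.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")

# Parameterized inserts: values are passed separately from the SQL text.
cur.executemany("INSERT INTO events (name) VALUES (?)", [("signup",), ("login",)])
conn.commit()

# Parameterized query, then fetch the matching rows as tuples.
cur.execute("SELECT id, name FROM events WHERE name = ?", ("login",))
rows = cur.fetchall()
print(rows)

conn.close()
```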
3. Beautiful Soup and Scrapy
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
The official Beautiful Soup documentation is at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
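A small sketch of searching a parse tree follows. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser`. The HTML snippet is invented for the example, so no web page is fetched.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used in place of a downloaded page.
html = """
<html><body>
  <h1>Daily report</h1>
  <ul>
    <li class="metric"><a href="/rows">Rows loaded</a></li>
    <li class="metric"><a href="/errors">Errors</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the first matching tag.
title = soup.h1.get_text()

# find_all searches the whole tree; tags behave like dicts for attributes.
links = [a["href"] for a in soup.find_all("a")]
link_texts = [a.get_text() for a in soup.find_all("a")]

# Filter by CSS class with the class_ keyword.
metric_items = soup.find_all("li", class_="metric")

print(title, links, link_texts, len(metric_items))
```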
Scrapy is an open-source, collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
The official Scrapy documentation is at https://docs.scrapy.org/en/latest/
4. Datetime
The datetime module, part of the Python standard library, supplies classes for manipulating dates and times.
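A few operations that come up often in pipelines can be sketched as follows: parsing a timestamp, building a partition path, computing a lookback window, and converting time zones. The timestamp and the `+02:00` offset are arbitrary example values.

```python
from datetime import datetime, timedelta, timezone

# Parse an ISO 8601 timestamp that carries a UTC offset.
loaded_at = datetime.fromisoformat("2024-03-15T08:30:00+00:00")

# Format it as a date-based partition path.
partition = loaded_at.strftime("%Y/%m/%d")

# Date arithmetic: start of a 24-hour lookback window.
window_start = loaded_at - timedelta(hours=24)

# Convert to a fixed-offset time zone (UTC+2, chosen arbitrarily).
local_time = loaded_at.astimezone(timezone(timedelta(hours=2)))

print(partition)
print(window_start.isoformat())
print(local_time.isoformat())
```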
5. Virtualenv
Virtualenv is a tool for creating isolated virtual Python environments. Each environment has its own installation directories and doesn’t share libraries with other virtualenv environments (and optionally doesn’t access the globally installed libraries either).
Python libraries for intermediate data engineers
6. Airflow
Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows.
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
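To illustrate the idea of tasks arranged as a DAG and run in dependency order, here is a sketch that uses only the standard-library `graphlib` module (Python 3.9+). It is not Airflow's API and does no scheduling, retries, or parallel execution. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

executed = []  # records the order in which tasks ran


def make_task(name):
    """Return a dummy task that only records its own name."""
    def task():
        executed.append(name)
    return task


task_names = ("extract", "transform_orders", "transform_users", "load")
tasks = {name: make_task(name) for name in task_names}

# Each task maps to the set of tasks that must finish before it.
upstream = {
    "extract": set(),
    "transform_orders": {"extract"},
    "transform_users": {"extract"},
    "load": {"transform_orders", "transform_users"},
}

# static_order yields every task after all of its dependencies.
for name in TopologicalSorter(upstream).static_order():
    tasks[name]()

print(executed)
```

The two transform tasks do not depend on each other, so their relative order is not fixed. A scheduler such as Airflow's could run them at the same time on different workers.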
7. Boto3 and similar libraries to interact with cloud
Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.
Get started quickly using AWS with boto3, the AWS SDK for Python. Boto3 makes it easy to integrate your Python application, library, or script with AWS services including Amazon S3, Amazon EC2, Amazon DynamoDB, and more.
pip install boto3
8. Flask/Django
Flask is a micro web framework written in Python. It is used for developing web applications and is built on Werkzeug and Jinja2.
It is classified as a microframework because it does not require particular tools or libraries. It has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions.
Advantages of the Flask framework include:
- A built-in development server
- A fast built-in debugger
Django is a free and open-source, Python-based web framework that follows the model–template–views architectural pattern. It is a high-level Python web framework that encourages rapid development and clean, pragmatic design.
Django takes care of much of the hassle of web development, so you can focus on writing your app without needing to reinvent the wheel. It’s free and open source.
Advantages of the Django framework include:
- Ridiculously fast: Django was designed to help developers take applications from concept to completion as quickly as possible.
- Reassuringly secure: Django takes security seriously and helps developers avoid many common security mistakes.
- Exceedingly scalable: Some of the busiest sites on the web leverage Django’s ability to quickly and flexibly scale.
- Built by experienced developers
- Free and open source
Django is my favourite web framework
Python libraries for advanced data engineers
9. PySpark
10. PyArrow
11. Pandas
It is essential for any sort of data wrangling with Python.
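A typical small wrangling task might look like the sketch below: drop incomplete rows, normalize text, fix types, and aggregate. It assumes `pandas` is installed, and the city and sales values are invented for the example.

```python
import pandas as pd

# Invented messy input: inconsistent casing, stray whitespace,
# a missing city, and numbers stored as strings.
raw = pd.DataFrame(
    {
        "city": ["Oslo", "oslo ", "Bergen", None],
        "sales": ["10", "15", "7", "3"],
    }
)

clean = (
    raw.dropna(subset=["city"])  # drop rows without a city
    .assign(
        city=lambda df: df["city"].str.strip().str.title(),  # normalize names
        sales=lambda df: df["sales"].astype(int),  # strings to integers
    )
)

# Total sales per city as a plain dictionary.
totals = clean.groupby("city")["sales"].sum().to_dict()
print(totals)
```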
12. NumPy
It is essential for array math and any sort of algebra, especially if you want to dive deeper into ML.
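As a brief sketch, assuming "algebra" here mostly means linear algebra on arrays, the snippet below solves a small linear system and standardizes a vector without explicit loops. It assumes `numpy` is installed, and the numbers are arbitrary.

```python
import numpy as np

# Solve the linear system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Vectorized arithmetic: standardize b to zero mean and unit variance.
standardized = (b - b.mean()) / b.std()

print(x)
print(standardized)
```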
Optional, depending on need:
- MyPy/Pydantic: for static typing and data validation
- Pytest: for testing
- matplotlib and seaborn: for data visualization in Python
- Libraries for specific file formats, such as json, csv, and avro-python
- ML libraries like scikit-learn
- FastAPI as an alternative to Django/Flask
- Selenium
- argparse for scripting
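For the argparse item above, a minimal command-line interface for a loading script could be sketched like this. The argument names (`path`, `--table`, `--dry-run`) are hypothetical choices for illustration. The parser is given an explicit argument list so the example runs without a real command line.

```python
import argparse


def build_parser():
    """Build the parser for a hypothetical file-loading script."""
    parser = argparse.ArgumentParser(description="Load a file into a target table.")
    parser.add_argument("path", help="input file to load")
    parser.add_argument("--table", default="staging", help="target table name")
    parser.add_argument("--dry-run", action="store_true", help="parse only, write nothing")
    return parser


# In a real script, call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(["data.csv", "--table", "events", "--dry-run"])
print(args.path, args.table, args.dry_run)
```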