- Spark: Distributed map-reduce style framework for big data, especially data that doesn't fit into a single machine's memory. It parallelizes the work across a cluster.
- AWS: Amazon Web Services, a cloud platform offering many of the services used in software engineering (compute, storage, databases, and more).
- AWS Fargate: A launch type for ECS tasks where the compute shuts down automatically once the container exits. With the EC2 launch type, you have to turn off the machine yourself.
- AWS Lambda: Serverless functions, which can also be deployed from a Docker image. A function can be hooked up to API Gateway to make it act as an API endpoint.
- AWS RDS: Managed databases from AWS.
- ECS Task: A run of a predefined Docker image on ECS. It can be put on a cron-like schedule so that it starts at a specified time (you should configure your
- Parquet: Columnar file format, very efficient thanks to per-column compression, with the schema definition baked into the file.
- Dagster: Task orchestration framework with built-in pipeline validation.
- ETL: Stands for extract-transform-load. Essentially it means "moving data from A to B, with optional data wrangling in the middle."
- NLP: Natural language processing, i.e. using computers to work with human language. For instance, analyzing whether a message is positive or negative.
- Pandas: DataFrame wrangling library; think of it as programmable Excel.
- Postgres: RDBMS with good performance.
- Great Expectations: A framework for data validation.
- Docker: Virtualization via containers.
- Git: Version control.
- Kubernetes: Container orchestration system.
- Terraform: Infrastructure-as-code tool. Essentially, you use it to store a blueprint of your infra setup. If you were to move to another account, you could re-create the existing infra with one command. It also makes editing infra config easier, since it updates changed resources and cleans up removed ones automatically.
- PostGIS: GIS extension for Postgres.
- MLflow: A framework to track model parameters and outputs. Can also store model artifacts.
- Jupyter: Python notebook, used for exploring solutions before converting them to .py files.
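
To make the ETL entry above a bit more concrete, here is a minimal sketch of "moving data from A to B, with optional data wrangling in the middle." Everything in it is an assumption for illustration: the `RAW_CSV` sample data, the `scores` table, and the `extract` / `transform` / `load` helper names are hypothetical, and an in-memory SQLite database stands in for a real destination such as Postgres.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, an API, or a database.
RAW_CSV = """name,score
alice,10
bob,
carol,7
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing score, tidy names, convert types."""
    return [
        (row["name"].title(), int(row["score"]))
        for row in rows
        if row["score"].strip()
    ]


def load(records, conn):
    """Load: write the cleaned records into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", records)
    conn.commit()


# In-memory SQLite is only a stand-in for the real destination (e.g. Postgres).
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

result = conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()
print(result)
```

Keeping the three steps as separate functions is a design choice rather than a requirement: it makes each stage easy to test on its own, and it maps naturally onto the individual steps an orchestrator would schedule.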