To deploy Dagster to AWS, EC2 or ECS can host Dagit, RDS can store runs and events, and S3 can act as an IO manager.
To host Dagit or Dagster Daemon on a bare VM or in Docker on EC2 or ECS, see Running Dagster as a service.
You can use a hosted RDS PostgreSQL database for your Dagster run/events
data. You can do this by setting blocks in your
$DAGSTER_HOME/dagster.yaml
appropriately.
run_storage:
module: dagster_postgres.run_storage
class: PostgresRunStorage
config:
postgres_db:
username: { username }
password: { password }
hostname: { hostname }
db_name: { database }
port: { port }
event_log_storage:
module: dagster_postgres.event_log
class: PostgresEventLogStorage
config:
postgres_db:
username: { username }
password: { password }
hostname: { hostname }
db_name: { db_name }
port: { port }
schedule_storage:
module: dagster_postgres.schedule_storage
class: PostgresScheduleStorage
config:
postgres_db:
username: { username }
password: { password }
hostname: { hostname }
db_name: { db_name }
port: { port }
In this case, you'll want to ensure you provide the right connection strings for your RDS instance, and ensure that the node or container hosting Dagit is able to connect to RDS.
Be sure that this file is present, and DAGSTER_HOME is set, on the node where Dagit is running.
Note that using RDS for run and event log storage does not require that Dagit be running in the cloud. If you are connecting a local Dagit instance to a remote RDS storage, double check that your local node is able to connect to RDS.
To enable parallel computation (e.g., with the multiprocessing or Dagster celery executors), you will also need to configure persistent intermediate storage -- for instance, using an S3 bucket to store intermediates.
You'll first need to need to add S3 storage to your ModeDefinition
.
from dagster_aws.s3.resources import s3_resource
from dagster_aws.s3.system_storage import s3_plus_default_intermediate_storage_defs
from dagster import ModeDefinition
prod_mode = ModeDefinition(
name='prod',
intermediate_storage_defs=s3_plus_default_intermediate_storage_defs,
resource_defs={'s3': s3_resource}
)
Then, just add the following YAML block in your pipeline config:
intermediate_storage:
s3:
config:
s3_bucket: your-s3-bucket-name
The resource uses boto
under the hood, so if you are accessing your private buckets you will
need to provide the AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
environment variables
or follow one of the other boto authentication methods.
With this in place, your pipeline runs will store intermediates on S3 in
the location
s3://<bucket>/dagster/storage/<pipeline run id>/intermediates/<solid name>.compute