MDAA dataops-project module
npm install @aws-mdaa/dataops-project

The Data Ops Project CDK application deploys the resources required to support and perform data operations on top of a Data Lake, primarily using Glue Crawlers and Glue Jobs.

---
- Project KMS Key - Used to encrypt all project information at rest across all project resources.
  - Usage access granted to project data engineer and execution roles (by key policy)
  - Usage/Admin access granted to data admin role (by key policy)
- Project S3 Bucket - A storage location for project activities (scratch and temporary).
  - Read/write access granted (by prefix) to project data engineer, execution, and data admin roles (by bucket policy)
  - Used as temp location for all project Glue jobs
  - Used to deploy/stage all Glue job code
  - Can be used to store project-related derived data for downstream processing
- Glue Databases - A Glue Catalog database will be created for each project database specified in the config.
  - Can be used by project crawlers and jobs to store crawled/generated tables
- LakeFormation Grants - Grant access to project Glue databases and tables
  - Data lake location and read/write data lake permission grants can be automatically created for project execution and engineer roles
  - Data lake permission grants (read or write) can be configured on a per-database (and optionally per-table) basis for additional principals
  - If using LakeFormation across accounts, database resource links and resource link describe grants can be created across accounts (required for cross-account access)
  - When cross-account resource links are created, consumer accounts need KMS decrypt permissions on the Glue catalog KMS key (to read encrypted database metadata). The dataops-project module will attempt to grant these permissions automatically, but this only works if the KMS keys are managed within the same stack. If the KMS keys are managed by external stacks (e.g., glue-catalog-app), you must add the consumer account IDs to the kmsKeyConsumerAccounts configuration in those stacks.
- Project Glue Security Config - Security config which will be used by all jobs under the project
  - Ensures all job output, logging, and bookmark data is encrypted with the project KMS key
- Project Glue SecurityGroups - Security groups which can be used by Glue Connections or other project resources
  - All egress permitted by default
  - Self-referencing ingress rule added by default (allows all traffic within the security group, required by Glue)
  - All other ingress traffic denied by default
- Glue Connections - Glue connections for reuse across project jobs and crawlers
  - Network connections for VPC access
    - Can use either a project Security Group or an existing security group
  - JDBC connections for RDBMS access
    - Credentials should be stored in a secret and referenced using dynamic references
    - Note that secret rotation will break this configuration. If the secret is rotated, use a Network/VPC connection instead and read the credentials directly from the secret in the Glue job code
- Glue Custom Classifiers - Glue classifiers for reuse across project crawlers

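
The JDBC connection note above recommends reading credentials from the secret inside the Glue job code when the secret is rotated. A minimal Python sketch of that pattern follows. It assumes the secret stores a JSON document with `username` and `password` keys; that layout and the function name `load_db_credentials` are illustrative assumptions, not part of MDAA.

```python
import json


def load_db_credentials(secret_id, client=None):
    """Fetch database credentials from AWS Secrets Manager at job run time.

    Reading the secret on each run means a rotated password is picked up
    on the next run, unlike a value resolved once at deployment time.
    """
    if client is None:
        # Imported lazily so the function can be exercised with a stub
        # client outside of AWS.
        import boto3

        client = boto3.client("secretsmanager")

    response = client.get_secret_value(SecretId=secret_id)
    # Assumed secret layout: {"username": "...", "password": "..."}
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]
```

In a Glue job this would be called before opening the database connection, with the job attached to the VPC through a Network connection. The job's execution role would also need permission to read the secret.
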
---
Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:
```yaml
dataops-project: # Module Name can be customized
  module_path: '@aws-mdaa/dataops-project' # Must match module NPM package name
  module_configs:
    - ./dataops-project.yaml # Filename/path can be customized
```

The referenced module config file (./dataops-project.yaml) contains the project settings:

```yaml
# Arns for IAM roles which will be authoring code within the project
dataEngineerRoles:
  - arn: arn:{{partition}}:iam::{{account}}:role/sample-org-dev-instance1-roles-data-engineer
projectExecutionRoles:
  - arn: ssm:/sample-org/instance1/generated-role/glue-role/arn
  - id: generated-role-id:databrew

      # If true (default false), LakeFormation read grants will be automatically created
      # for the database for project data engineer roles
      createReadGrantsForDataEngineerRoles: true
      # If true (default false), LakeFormation read/write grants will be automatically created
      # for the database and its S3 Location for project execution roles
      createReadWriteGrantsForProjectExecutionRoles: true
      # Removing cross-account resource links for testing
      # createCrossAccountResourceLinkAccounts:
      #   - "12312412"
      # Optional - the name of the resource links to be generated
      # If not specified, defaults to the database name
      createCrossAccountResourceLinkName: 'testing'
      grants:
        # Each grant is keyed with a name which is unique within the context
        # of the database
        example_read_grant:
          # (Optional) Specify the database permissions level ("read", "write", "super")
          # Defaults to "read"
          databasePermissions: read
          # (Optional) Specify the table permissions level ("read", "write", "super")
          # Defaults to "read"
          tablePermissions: read
          # (Optional) - List of tables for which to create grants
          # If not specified, permissions are granted to all tables in the database.
          tables:
            - test-table
          # List of principal references in the "principals" section to which the permissions will be granted
          principals:
            # Each principal (principalArns key) must be named uniquely within the context of the database
            principalA:
              # Arn of IAM SAML IDP
              federationProviderArn: some-federation-provider-arn
              # Federated username
              federatedUser: some-user-name
            principalB:
              federationProviderArn: some-federation-provider-arn
              # Federated group
              federatedGroup: some-group-name
          # Can directly specify the principalArn.
          principalArns:
            principalC: some-other-role-arn
  # Condensed DB config
  test-database2:
    description: Test Database 2
    locationBucketName: some-bucket-name
    locationPrefix: data/test2
    lakeFormation:
      createSuperGrantsForDataAdminRoles: true
      createReadGrantsForDataEngineerRoles: true
      createReadWriteGrantsForProjectExecutionRoles: true
      # Removing cross-account resource links for testing
      # createCrossAccountResourceLinkAccounts:
      #   - "12312412"
      grants:
        example_condensed_read_grant:
          principalArns:
            principalA: arn:{{partition}}:iam::{{account}}:role/cross-account-role
  # A Database which will also create a Datazone Datasource (Requires the Datazone project to be configured in this config)
  test-database3:
    description: Test Datazone Datasource
    locationPrefix: data/test-database3
    createDatazoneDatasource: true
  # Verbatim DB Name Config
  test-database4:
    description: Test Database 4
    verbatimName: true
    locationBucketName: some-bucket-name
    locationPrefix: data/test4
    lakeFormation:
      createSuperGrantsForDataAdminRoles: true
      createReadGrantsForDataEngineerRoles: true
      createReadWriteGrantsForProjectExecutionRoles: true
      # Removing cross-account resource links for testing
      # createCrossAccountResourceLinkAccounts:
      #   - "12312412"
      grants:
        example_condensed_read_grant:
          principalArns:
            principalA: arn:{{partition}}:iam::{{account}}:role/cross-account-role
  # Iceberg Compliant DB Name Config
  test-database5:
    description: Test Database 5
    icebergCompliantName: true
    locationBucketName: some-bucket-name
    locationPrefix: data/test5
    lakeFormation:
      createSuperGrantsForDataAdminRoles: true
      createReadGrantsForDataEngineerRoles: true
      createReadWriteGrantsForProjectExecutionRoles: true
      # Removing cross-account resource links for testing
      # createCrossAccountResourceLinkAccounts:
      #   - "12312412"
      grants:
        example_condensed_read_grant:
          principalArns:
            principalA: arn:{{partition}}:iam::{{account}}:role/cross-account-role
  # Tag-Based Access Control Database Config
  test-database6:
    description: Test Database with Tag-Based Access Control
    locationBucketName: some-bucket-name
    locationPrefix: data/test6
    lakeFormation:
      createSuperGrantsForDataAdminRoles: true
      createReadWriteGrantsForProjectExecutionRoles: true
      # Assign specific tag values to this database
      databaseTagValues:
        - tagKey: environment
          tagValues: [dev]
        - tagKey: data_tier
          tagValues: [bronze]
        - tagKey: data_classification
          tagValues: [public]
      # Define tag-based grants using LF-Tag expressions
      tagBasedGrants:
        # Grant for development environment access
        dev_access:
          principalArns:
            dev-role: arn:{{partition}}:iam::{{account}}:role/dev-data-user
          permissions: [DESCRIBE, SELECT]
          resourceType: TABLE
          lfTagExpression:
            environment: [dev]
            data_tier: [bronze, silver]
        # Grant for production read access to public/internal data
        prod_read_access:
          principalArns:
            prod-reader: arn:{{partition}}:iam::{{account}}:role/prod-data-reader
          permissions: [DESCRIBE, SELECT]
          resourceType: TABLE
          lfTagExpression:
            environment: [prod]
            data_classification: [public, internal]
```
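
To make the tag-based grants in the sample config easier to reason about, here is a small Python sketch of how an LF-Tag expression is matched against a resource's tags. It assumes the usual Lake Formation semantics: every key in the expression must match (AND), and any one of the listed values for a key is enough (OR). The function name and the single-value-per-key tag model are illustrative assumptions, not MDAA or AWS APIs.

```python
def matches_lf_tag_expression(resource_tags, expression):
    """Return True if a resource's LF-Tags satisfy the expression.

    Every key in the expression must be present on the resource (AND),
    and the resource's value for that key must be one of the listed
    values (OR).
    """
    return all(
        key in resource_tags and resource_tags[key] in allowed_values
        for key, allowed_values in expression.items()
    )


# Tag values assigned to test-database6 via databaseTagValues in the sample config
database_tags = {
    "environment": "dev",
    "data_tier": "bronze",
    "data_classification": "public",
}

# lfTagExpression values from the two sample tagBasedGrants
dev_access = {"environment": ["dev"], "data_tier": ["bronze", "silver"]}
prod_read_access = {
    "environment": ["prod"],
    "data_classification": ["public", "internal"],
}

print(matches_lf_tag_expression(database_tags, dev_access))        # True
print(matches_lf_tag_expression(database_tags, prod_read_access))  # False
```

Under these assumptions, and if tables in test-database6 carry the same tag values as the database, the dev_access grant would apply to them and the prod_read_access grant would not, because the environment tag is dev rather than prod.
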