MDAA dataops-job module
npm install @aws-mdaa/dataops-job

The Data Ops Job CDK application is used to deploy the resources required to support and perform data operations on top of a Data Lake using Glue Jobs.
---
Glue Jobs - Glue Jobs will be created for each job specification in the configs
- Automatically configured to use project security config
- Can optionally be VPC bound (via Glue connection)
- Automatically configured to use project bucket as temp location
- Can use job templates to promote reuse/minimize config duplication
---
Add the following snippet to your `mdaa.yaml` under the `modules:` section of a domain/env in order to use this module:
```yaml
dataops-job: # Module Name can be customized
  module_path: '@aws-mdaa/dataops-job' # Must match module NPM package name
  module_configs:
    - ./dataops-job.yaml # Filename/path can be customized
```
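To show where the snippet sits, here is a minimal sketch of the surrounding `mdaa.yaml`. It assumes the usual `domains` / `environments` / `modules` nesting; the domain name `example-domain` and environment name `dev` are hypothetical placeholders, not values taken from this module.

```yaml
# Sketch only: domain and environment names below are hypothetical.
domains:
  example-domain: # hypothetical domain name
    environments:
      dev: # hypothetical environment name
        modules:
          dataops-job: # Module Name can be customized
            module_path: '@aws-mdaa/dataops-job' # Must match module NPM package name
            module_configs:
              - ./dataops-job.yaml # Filename/path can be customized
```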
Job configs can be templated so that one job definition is reused across multiple jobs that differ in only a few parameters (such as input/output paths). Templates can be stored separately from job configs, or together with them in the same file.
```yaml
projectName: dataops-project-test
templates:
  # An example job template. Can be referenced from other jobs. Will not itself be deployed.
  ExamplePythonTemplate:
    executionRoleArn: some-arn
    # (required) Command definition for the glue job
    command:
      # (required) Either of "glueetl" | "pythonshell"
      name: 'glueetl'
      # (optional) Python version. Either "2" or "3"
      pythonVersion: '3'
      # (required) Path to a .py file relative to the configuration.
      scriptLocation: ./src/glue/python/job.py
    # (required) Description of the Glue Job
    description: Example of a Glue Job using an inline script
    # (optional) List of connections for the glue job to use. Reference back to the connection name in the 'connections:' section of the project.yaml
    connections:
      - project:connections/connectionVpc
    # (optional) key: value pairs for the glue job to use. See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
    defaultArguments:
      --job-bookmark-option: job-bookmark-enable
    # (optional) Maximum concurrent runs. See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-ExecutionProperty
    executionProperty:
      maxConcurrentRuns: 1
    # (optional) Glue version to use as a string. See: https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
    glueVersion: '2.0'
    # (optional) Maximum capacity. See: MaxCapacity section: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html
    # Use maxCapacity or workerType, not both.
    #maxCapacity: 1
    # (optional) Maximum retries. See: MaxRetries section:
    maxRetries: 3
    # (optional) Number of minutes to wait before sending a job run delay notification.
    notificationProperty:
      notifyDelayAfter: 1
    # (optional) Number of workers to provision
    #numberOfWorkers: 1
    # (optional) Number of minutes to wait before considering the job timed out
    timeout: 60
    # (optional) Worker type to use. Any of: "Standard" | "G.1X" | "G.2X"
    # Use maxCapacity or workerType, not both.
    #workerType: Standard
  # An example job template. Can be referenced from other jobs. Will not itself be deployed.
  ExampleScalaTemplate:
    executionRoleArn: some-arn
    # (required) Command definition for the glue job
    # (optional) key: value pairs for the glue job to use. See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
    defaultArguments:
      --job-language: scala
    # (optional) Glue version to use as a string. See: https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
    glueVersion: '5.0'
jobs:
  # Job definitions below
  PythonJobOne: # Job Name
    template: 'ExamplePythonTemplate' # Reference a job template.
    defaultArguments:
      --Input: s3://some-bucket/some-location1
    allocatedCapacity: 2
    continuousLogging:
      # For allowed values, refer to https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_logs.RetentionDays.html
      # Possible values are: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653, and 0.
      logGroupRetentionDays: 3
  PythonJobTwo:
    template: 'ExamplePythonTemplate' # Reference a job template.
    defaultArguments:
      --Input: s3://some-bucket/some-location2
      --enable-spark-ui: 'true'
      --spark-event-logs-path: s3://some-bucket/spark-event-logs-path/JobTwo/
    allocatedCapacity: 20
    # (optional) List of all the helper scripts referenced in the main Glue ETL script.
    # These helper scripts are grouped at the immediate parent directory level, which results in a dedicated zip.
    # After deployment they sit alongside the main script, so they must be referenced directly by file name from the main Glue script.
    # Example (main.py)
    #   from core import core_function1, core_function2
    #   from helper_etl import helper_function1, helper_function2
    additionalScripts:
      - ./src/glue/python/helper_etl.py
      - ./src/glue/python/utils/core.py
    # (optional) List of additional files which will be available to the Glue Job next to the main script
    additionalFiles:
      - ./src/glue/scala/extra_file.txt
  # Job definitions below
  ScalaJobOne: # Job Name
    template: 'ExampleScalaTemplate' # Reference a job template.
    description: testing
    defaultArguments:
      --class: some.java.package.App
    allocatedCapacity: 2
    command:
      # (required) Either of "glueetl" | "pythonshell"
      name: 'glueetl'
      # (required) Path to a script file relative to the configuration.
      scriptLocation: ./src/glue/scala/App.scala
    # (optional) List of additional files which will be available to the Glue Job next to the main script
    additionalFiles:
      - ./src/glue/scala/extra_file.txt
    # (optional) List of additional jars which will be loaded into the Spark driver and executor JVMs for use
    # within the Scala script
    additionalJars:
      - ./src/glue/scala/lib/extra.jar
```
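To make the effect of templating concrete, the sketch below shows what `PythonJobOne` might resolve to once `ExamplePythonTemplate` is applied. This is an illustration under an assumption, not a statement of the module's exact merge behavior: it assumes job-level keys are layered over the template, with job values taking precedence and map-valued keys such as `defaultArguments` combined key by key.

```yaml
# Hypothetical effective definition of PythonJobOne (assumes a key-by-key merge
# of the job over ExamplePythonTemplate; the real merge rules may differ).
PythonJobOne:
  executionRoleArn: some-arn # from the template
  command: # from the template
    name: 'glueetl'
    pythonVersion: '3'
    scriptLocation: ./src/glue/python/job.py
  description: Example of a Glue Job using an inline script # from the template
  connections: # from the template
    - project:connections/connectionVpc
  defaultArguments:
    --job-bookmark-option: job-bookmark-enable # from the template
    --Input: s3://some-bucket/some-location1 # from the job
  executionProperty: # from the template
    maxConcurrentRuns: 1
  glueVersion: '2.0' # from the template
  maxRetries: 3 # from the template
  notificationProperty: # from the template
    notifyDelayAfter: 1
  timeout: 60 # from the template
  allocatedCapacity: 2 # from the job
  continuousLogging: # from the job
    logGroupRetentionDays: 3
```

Under the same assumption, `PythonJobTwo` would differ only in the values it sets itself: its `--Input` path, the Spark UI arguments, `allocatedCapacity`, and the additional scripts and files.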