Tasks
Tasks are an abstraction that represent some action to take at workflow execution time. Some examples are:
- call a service
- insert data into a database
- do some transformation on a dataset
- scrape a webpage
Task Types
There are four types of tasks currently supported:
- python
- bash
- csip
- http
Task Structure
A task has a common list of properties and structure amongst all the different types:
Property | Description | Type | Required |
---|---|---|---|
task | The task name, must be unique and a valid identifier (i.e. no spaces) | string | Y |
python | Either an inline python string, or a path to a python file | string | At least one kind is required |
bash | Either an inline bash string, or a path to a shell | string | At least one kind is required |
http | a http url to make a request against, can be to a rest api, or just a normal website | string | At least one kind is required |
csip | A url to a csip service | string | At least one kind is required |
inputs | a object describing inputs to call the task with, the values can reference from other tasks or workflow variables | object | N |
outputs | a list of outputs to capture from the tasks execution | string array | N |
config | an optional configuration object, its available properties depend on the kind of task | object | N |
Referencing Task data later in the workflow
Each orca task once completed persists its state for reuse later.
apiVersion: '1.0'
version: '0.1'
name: referencing task outputs
job:
- task: get_today
python: |
import datetime
today = datetime.datetime.utcnow()
outputs:
- today
- task: print_today
python: print(today)
inputs:
today: task.get_today.today
In this example we are referencing the output specified by the get_today
task by the string task.get_today.today
Task state is namespaced under the task.
directive and each tasks data is namespaced further under the task name.
Python Task
A python task is flexible and can be used in a number of different ways to suit different needs, these are: inline python external script * external module with function calls
Inline python example
This is the example we are familiar with at this point:
task: get_today
python: |
import datetime
today = datetime.datetime.utcnow()
outputs:
- today
Here we are just writing our python inline to the task, this can be useful for very simple things but for more complex
tasks, its recommended to either use a external script or a module to complete your python needs. Here we can retrieve
the today
variable for use later in our workflow, just like we would be able to if we were writing all of this in
a normal python script
External python script example
External scripts are useful for when you want to perform some action that does not have any inputs (right now external
scripts do not support injecting inputs, but we are actively working to change this)
given the python file
scrape.py
import bs4
import requests
import datetime
today = datetime.datetime.utcnow()
forecast = 0
file = 'nwm.t{0}z.short_range.channel_rt'
url = 'https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/nwm.{0}/short_range/'
formatted_url = url.format(today.strftime("%Y%m%d"))
html = requests.get(formatted_url, headers={'Content-Type': 'text/plain'}).content
soup = bs4.BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
fmt_file = file.format(str(forecast).zfill(2))
find_file = lambda f: f.get_text().startswith(fmt_file)
file_exists = len ( list ( filter ( find_file, a_tags ) ) ) > 0
print('file {0} is present ? = {1}'.format(fmt_file, file_exists))
apiVersion: '1.0'
version: '0.1'
name: 'scrape nomads for netcdf file'
job:
- task: scrape
python: ./scrape.py
outputs:
- file_exists
- if: task.scrape.file_exists
do:
- ...
External python modules
Another way to use the python task is to reference a python module and directly access functions in the module. To do this you must utilize a special configuration for the python task using the config object.
The python config object has the following properties available
property | description | type | required |
---|---|---|---|
callable | The python function in the module to call | string | Y |
returns | A name to assign the return value of the callable | string | Y |
Lets rewrite the example above to a module, and see how orca can utilize it
scrape.py
import bs4
import requests
def get_html(url, today):
formatted_url = url.format(today.strftime("%Y%m%d"))
return requests.get(formatted_url, headers={'Content-Type': 'text/plain'}).content
def scrape_html(url, today, forecast, file):
html = get_html(url, today)
soup = bs4.BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
fmt_file = file.format(str(forecast).zfill(2))
find_file = lambda f: f.get_text().startswith(fmt_file)
file_exists = len ( list ( filter ( find_file, a_tags ) ) ) > 0
print('file {0} is present ? = {1}'.format(fmt_file, file_exists))
return file_exists
apiVersion: '1.0'
version: '0.1'
name: 'check if netcdf file exists for current hour'
var:
forecast: 0
nomadsUrl: 'https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/nwm.{0}/short_range/'
fileName: 'nwm.t{0}z.short_range.channel_rt'
job:
- task: get_today
python: |
import datetime
today = datetime.datetime.utcnow()
- task: scrape
python: ./scrape.py
config:
callable: scrape_html
returns: current_file_exists
inputs:
url: var.nomadsUrl
today: task.get_today.today
forecast: var.forecast
file: var.fileName
outputs:
- current_file_exists
- if: current_file_exists
do:
- .....
In this example we introduced a config object that specifies which function to call, and maps the inputs defined on the task to the inputs defined in the callable function. Additionally we specified a name for the return value, this name can be any valid identifier. If the function returns nothing then returns is not required