Variables
Pipeline variables are defined at the top of a dataflow as name/value pairs and are set as jinja variables. This enables them to be referenced anywhere in the dataflow using double curly braces (e.g. “{{ varname }}”). At runtime, these references are replaced with the variable values.
Variables are stored in a dictionary in memory when a dataflow is run. Normally they are defined at the root of the YAML, at the top of the dataflow, but they can also be set in a separate config file.
To learn more about how jinja uses variables, see template variables.
Examples
variables:
  var1: a
  var2: 42
  var3: True
  var4: {"key": "value"}
  var5: '{{ kwargs.var5|default("somevalue", true) }}' # set default example
steps:
  - name: show_vars
    params:
      output: |
        "{{ var1 }}" == "a".
        {{ var2 }} == 42.
        {{ var3 }} == True.
        "{{ var4.key }}" == "value".
        "{{ var5 }}" == "somevalue"
Scoped Variables
Variables can also be defined in the context of a given stage, job or step. This may be used to provide overrides or local variables.
When referenced locally, just the variable key can be used. To reference variables from an external scope, the context prefix must be supplied.
The following example illustrates this:
jobs:
  - name: first_job
    variables:
      key: value
    steps:
      - name: first_step
        params:
          output: |
            "{{ key }}" == "value"
  - name: second_job
    steps:
      - name: another_step
        params:
          output: |
            "{{ jobs.first_job.variables.key }}" == "value"
Set Parent Variables
Setting output.set_parent_variables_enabled to True promotes variables to the parent.
For example:
steps:
  - action: set
    params:
      data:
        step_var1: A
        step_var2: B
    output:
      set_parent_variables_enabled: True
  - params:
      output: |
        "{{ step_var1 }}" == "A" # promoted from first_step
Scoped variables as aliases
The set parent variables feature is especially useful for creating shorthand aliases for commonly used values.
Context references are a useful way to access data from the namespace. However, the reference string, including the level and the name(s), can sometimes be long, which in turn makes the jinja expressions or context references that use it long as well.
Consider the following example: one step sets a value that will be used in multiple subsequent steps. A scoped variable is created in the parent job. The first step then sets it, using a key named the same as the variable, and promotes it to the parent variable. Any following steps can then reference it directly using just the variable name.
jobs:
  - name: myjob
    variables:
      myvar: ""
    steps:
      - name: set_value
        action: set
        params:
          data:
            step_var1: A
            step_var2: B
        output:
          set_parent_variables_enabled: True
          cache:
            - key: myvar
              source: "step_var1"
      - name: check_value
        condition: "{{ myvar == 'A' }}" # promoted from set_value
        params:
          output: |
            The variable 'myvar' was set by the set_value step!
Note a few things that were done in this more involved example:
- The variable “myvar” was declared in the job
  - This is not a requirement for this example, but it is a best practice. It is a helpful hint that the variable is available. Also, it can be used to provide a default value.
- The “key” name in the “cache” matches the variable name
  - This explicitly names the data returned in that cache target. So in this case, the cache target created is “steps.set_value.data.myvar”.
- The source is set to an attribute of the data
  - The value of “step_var1” is used to set “steps.set_value.data.myvar” (new in rvrdata-1.1.0)
The value from cache target “steps.set_value.data.myvar” is then promoted to the job variable “myvar” so that any subsequent steps can then use it.
Set Pipeline Variables
Setting output.set_pipeline_variables_enabled to True promotes variables to the pipeline level (new in rvrdata-1.1.0).
This is similar to set_parent_variables_enabled, but in this case variables are set not on the parent, but at the pipeline level.
This enables variables to be shared easily across jobs and steps.
For example:
jobs:
  - name: first_job
    steps:
      - action: set
        params:
          data:
            step_var1: A
            step_var2: B
        output:
          set_pipeline_variables_enabled: True
      - params:
          output: "{{ step_var1 }}"
  - name: second_job
    steps:
      - params:
          output: |
            "{{ step_var1 }}" == "A" # promoted from step in first_job
Input variables
Dataflow variable values can be input from the command line, from configuration files, and from environment variables.
Input variables are collected in a dictionary under the reserved key kwargs.
It is a best practice to avoid storing secrets such as user ids and passwords in dataflows. Since dataflows are often stored in source code repositories, they should be treated with the same best practices as any code. Keeping values that may change regularly separate from dataflows also avoids having to update the YAML files.
Example types of inputs
The following snippet shows the various styles of inputs:
name: dataflow_with_input_variables
variables:
  var1: "{{ kwargs.var1 }}" # keyword from config file or CLI or parent dataflow
  var2: "{{ kwargs.var2 | default('VALUE', true) }}" # keyword with default
  var3: "{{ 'VALUE' | env_override('ENVVAR') }}" # static value overridden by env var
  var4: "{{ kwargs.var4 | env_override('ENVVAR') }}" # kwarg overridden by env var
  # Example below shows keyword with default and environment variable override
  var5: "{{ kwargs.var5 | default('DEFAULTVALUE', true) | env_override('ENVVAR') }}"
Config File
Using a config file for common input variables is recommended. This avoids having to change values in multiple places and keeps updatable values separate from the logic in dataflows.
See rvrdata Config File for more detail on how to use a configuration file for common variables.
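As an illustration only, a config file supplying keyword values might contain simple key/value pairs; the actual file name, location, and layout are described in the rvrdata Config File documentation, and the keys and values below are hypothetical:
# hypothetical config entries; values supplied this way surface under kwargs
var1: value_from_config
var2: 42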
Command-line Keywords
Any value supplied in a command-line switch (a string prefixed with --) is converted into a keyword and is available in the kwargs dictionary.
See rvrdata CLI Input Keywords for more detail on how to feed variables in from the command line.
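For example, supplying --region us-east on the command line (region and us-east are illustrative names only; the exact command invocation is covered in the rvrdata CLI documentation) makes the value available as kwargs.region, which a dataflow can pick up like this:
variables:
  # populated from the command-line switch `--region us-east`
  region: "{{ kwargs.region | default('unset', true) }}"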
Setting defaults
It is a best practice to use default values for input variables.
See jinja built-in filters - default for more information about setting a default value.
If no defaults are available and an input keyword is required, there should be a step early in the dataflow that checks whether the required values are supplied and, if not, provides a user-friendly error message (see the sketch below).
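A minimal sketch of such a check, using only features shown elsewhere on this page (condition, kwargs and params.output); the step and variable names are illustrative, and the required filter described below is the mechanism for actually stopping the run:
steps:
  - name: check_required_inputs
    # runs only when var1 was not supplied or is empty
    condition: "{{ kwargs.var1 is not defined or kwargs.var1 == '' }}"
    params:
      output: |
        Missing required input 'var1'.
        Supply it with --var1 VALUE on the command line or via the config file.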
Environment Variables
Environment variables are a handy way to feed input variable values.
They can be used to avoid storing values in text files, or to share values with other tools via .env files.
Passing in values with environment variables also avoids having values visible on the command line.
Care should still be taken to manage environment variables properly so that security best practices are followed.
A custom jinja filter, env_override, can be used to pick up values from environment variables.
It takes the environment variable name as input and will override the supplied value with the value from the environment.
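A minimal sketch of the filter in use; APP_ENV and 'dev' are illustrative names only:
variables:
  # 'dev' is used unless the environment variable APP_ENV is set,
  # in which case its value replaces the static one
  app_env: "{{ 'dev' | env_override('APP_ENV') }}"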
Order of precedence for inputs
- The value from a configuration file is used first.
- If a command-line value is supplied for the same keyword, it overrides the config file value.
- Values can be overridden in the dataflow YAML using static values, or filters such as default and env_override (see the annotated example below).
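The following annotated variable sketches how this precedence plays out, assuming the filters are applied left to right as in standard jinja; myvar and MYVAR are illustrative names:
variables:
  # 1. the config file value (if any) populates kwargs.myvar
  # 2. a --myvar command-line value overrides the config file value
  # 3. default('fallback', true) applies only if neither source supplied a value
  # 4. env_override('MYVAR') replaces the result when MYVAR is set in the environment
  myvar: "{{ kwargs.myvar | default('fallback', true) | env_override('MYVAR') }}"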
Required variables
There is often a need to have required variables that must be set in order for a dataflow to run properly. A special filter named required() can be used to test if a variable has a value or not. If the required check fails, the dataflow will fail with a warning log message.
An optional message can be included in the check using the format required('message').
This is useful to provide a prompt of what is expected in the variable.
Returning current input values
By default, the required check does not display the variable's current value. This is important for sensitive input values, since they would otherwise appear in log messages.
To see the input value, enable the “show_val” option using the format required('msg', True).
Note that if the variable is not supplied or is an empty string, only a colon “:” will be displayed.
Testing variables with RegEx
Regular expressions can be supplied to test the input values. The supplied pattern is run through a match.
To use a regex pattern, use the format required('msg', False, patt), where “patt” is the regex string.
There are lots of great regex resources available on the web. Here are a few examples:
- Useful Regex Patterns
- Complete guide to regex
- regex101.com - use the python flavor
Note that when supplying a regex pattern, the match automatically starts from the start of the string. Also, the string supplied is a regular string, not a “raw” string, so escape characters will need to be double-escaped (i.e. use “\\” to denote the “\” character).
Required Examples
name: required_var_missing
notes: |
  The examples below illustrate how the required filter can be used.
  Input values can be supplied via the cli using the format `--key value` (e.g. --var01 SomeValue).
variables:
  # fail if no value provided for var01
  var01: "{{ kwargs.var01|required }}"
  # fail with supplied message if no value provided for var02
  var02: "{{ kwargs.var02|required('var02 cannot be blank') }}"
  # fail unless var03 is 1, 2, or 3
  var03: "{{ kwargs.var03|required('var03 must be a number between 1 and 3', False, '[1-3]$') }}"
  # fail unless var04 is 1, 2, or 3 and show the input value
  var04: "{{ kwargs.var04|required('var04 must be a number between 1 and 3', True, '[1-3]$') }}"
steps:
  - name: required_keyword_arguments_present
    params:
      output: "This step only runs if all required variables are supplied."
Step run time variables
In addition to pipeline variables, the most commonly used variables are the runtime variables linked to steps. For example “steps.STEPNAME.data” references the data generated as part of a step.
See steps runtime variables for a listing of the runtime variables created for steps.
Also note the dataflow namespace which is used to reference variables in specific steps or jobs.
The most efficient way to reference a variable is to use the __ref syntax.
When a single value is to be used in dataflow YAML, use the format “VARNAME__ref” to retrieve it.
See context references for keywords.
To see the step run-time variables, see the show_context feature.
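A minimal sketch of a step run-time reference, assuming a set step named get_settings whose data is read by a later step; the names and values are illustrative only (see context references for the equivalent __ref form):
steps:
  - name: get_settings
    action: set
    params:
      data:
        region: us-east
  - name: use_settings
    params:
      # reads the data generated by the previous step via steps.STEPNAME.data
      output: "{{ steps.get_settings.data.region }}"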