Variables
Pipeline variables are defined at the top of a dataflow as name/value pairs and are set as jinja variables. This enables them to be referenced anywhere in the dataflow using double curly braces (e.g. “{{ varname }}”). At runtime, these references are replaced with the variable values.
Variables are stored in a dictionary in memory when a dataflow is run. Normally they are defined at the root of the YAML, at the top of the dataflow, but they can also be set in a separate config file.
To learn more about how jinja uses variables, see template variables.
Examples
variables:
  var1: a
  var2: 42
  var3: True
  var4: {"key": "value"}
  var5: '{{ kwargs.var5|default("somevalue", true) }}' # set default example
steps:
  - name: show_vars
    params:
      output: |
        "{{ var1 }}" == "a".
        {{ var2 }} == 42.
        {{ var3 }} == True.
        "{{ var4.key }}" == "value".
        "{{ var5 }}" == "somevalue"
Scoped Variables
Variables can also be defined in the context of a given stage, job or step. This may be used to provide overrides or local variables.
When referenced locally, just the variable key can be used. To reference variables from an external scope, the context prefix must be supplied.
The following example illustrates this:
jobs:
  - name: first_job
    variables:
      key: value
    steps:
      - name: first_step
        params:
          output: |
            "{{ key }}" == "value"
  - name: second_job
    steps:
      - name: another_step
        params:
          output: |
            "{{ jobs.first_job.variables.key }}" == "value"
Set Parent Variables
Setting output.set_parent_variables_enabled to True promotes variables to the parent.
For example:
steps:
  - action: set
    params:
      data:
        step_var1: A
        step_var2: B
    output:
      set_parent_variables_enabled: True
  - params:
      output: |
        "{{ step_var1 }}" == "A" # promoted from first_step
Scoped variables as aliases
The set parent variables feature is especially useful for creating shorthand aliases for commonly used values.
Context references are a useful way to access data from the namespace. However, the reference string, including the level and the name(s), can sometimes be long, which in turn makes the jinja expressions or context references that use it long as well.
Consider the following example: one step sets a value that will be used in multiple subsequent steps. A scoped variable is created in the parent job. The first step then sets it, using a key named the same as the variable, and promotes it to the parent variable. Any following steps can then reference it directly using just the variable name.
jobs:
  - name: myjob
    variables:
      myvar: ""
    steps:
      - name: set_value
        action: set
        params:
          data:
            step_var1: A
            step_var2: B
        output:
          set_parent_variables_enabled: True
          cache:
            - key: myvar
              source: "step_var1"
      - name: check_value
        condition: "{{ myvar == 'A' }}" # promoted from set_value
        params:
          output: |
            The variable 'myvar' was set by the set_value step!
Note a few things that were done in this more involved example:
- The variable “myvar” was declared in the job
  - This is not a requirement for this example, but it is a best practice. It is a helpful hint that the variable is available. Also, it can be used to provide a default value.
- The “key” name in the “cache” matches the variable name
  - This explicitly names the data returned in that cache target. So in this case, the cache target created is “steps.set_value.data.myvar”.
- The source is set to an attribute of the data
  - The value of “step_var1” is used to set “steps.set_value.data.myvar” (new in rvrdata-1.1.0)
The value from cache target “steps.set_value.data.myvar” is then promoted to the job variable “myvar” so that any subsequent steps can then use it.
Set Pipeline Variables
Setting output.set_pipeline_variables_enabled to True promotes variables to the pipeline level (new in rvrdata-1.1.0).
This is similar to set_parent_variables_enabled, but in this case variables are set not on the parent, but at the pipeline level.
This enables variables to be shared easily across jobs and steps.
For example:
jobs:
  - name: first_job
    steps:
      - action: set
        params:
          data:
            step_var1: A
            step_var2: B
        output:
          set_pipeline_variables_enabled: True
      - params:
          output: "{{ step_var1 }}"
  - name: second_job
    steps:
      - params:
          output: |
            "{{ step_var1 }}" == "A" # promoted from step in first_job
Input variables
Dataflow variable values can be input from the command line, from configuration files, and from environment variables.
Input variables are collected in a dictionary under the reserved key kwargs.
It is a best practice to avoid storing secrets such as user ids and passwords in dataflows. Since dataflows are often stored in source code repositories, they should be treated with the same best practices as any code. Keeping values that may change regularly separate from dataflows also avoids having to update the YAML files.
Example types of inputs
The following snippet shows the various styles of inputs:
name: dataflow_with_input_variables
variables:
  var1: "{{ kwargs.var1 }}" # keyword from config file or CLI or parent dataflow
  var2: "{{ kwargs.var2 | default('VALUE', true) }}" # keyword with default
  var3: "{{ 'VALUE' | env_override('ENVVAR') }}" # static value overridden by env var
  var4: "{{ kwargs.var4 | env_override('ENVVAR') }}" # kwarg overridden by env var
  # Example below shows keyword with default and environment variable override
  var5: "{{ kwargs.var5 | default('DEFAULTVALUE', true) | env_override('ENVVAR') }}"
Config File
Using a config file for common input variables is recommended. This avoids having to change values in multiple places and keeps updatable values separate from the logic in dataflows.
See rvrdata Config File for more detail on how to use a configuration file for common variables.
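As an illustration only, a config file supplying keyword values might contain simple key/value pairs; the actual file name, location, and layout are described in the rvrdata Config File documentation, and the keys and values below are hypothetical:
# hypothetical config entries; values supplied this way surface under kwargs
var1: value_from_config
var2: 42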
Command-line Keywords
Any value supplied in a command-line switch (a string prefixed with --) is converted into a keyword and is available in the kwargs dictionary.
See rvrdata CLI Input Keywords for more detail on how to feed variables in from the command line.
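For example, supplying --region us-east on the command line (region and us-east are illustrative names only; the exact command invocation is covered in the rvrdata CLI documentation) makes the value available as kwargs.region, which a dataflow can pick up like this:
variables:
  # populated from the command-line switch `--region us-east`
  region: "{{ kwargs.region | default('unset', true) }}"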
Setting defaults
It is a best practice to use default values for input variables.
See jinja built-in filters - default for more information about setting a default value.
If no defaults are available and an input keyword is required, there should be a step early in the dataflow that checks whether the required values are supplied and, if not, provides a user-friendly error message (see the sketch below).
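A minimal sketch of such a check, using only features shown elsewhere on this page (condition, kwargs and params.output); the step and variable names are illustrative, and the required filter described below is the mechanism for actually stopping the run:
steps:
  - name: check_required_inputs
    # runs only when var1 was not supplied or is empty
    condition: "{{ kwargs.var1 is not defined or kwargs.var1 == '' }}"
    params:
      output: |
        Missing required input 'var1'.
        Supply it with --var1 VALUE on the command line or via the config file.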
Environment Variables
Environment variables are a handy way to feed input variable values.
They can be used to avoid storing values in text files, or to share values with other tools via .env files.
Passing in values with environment variables also avoids having values visible on the command line.
Care should still be taken to manage environment variables properly so that security best practices are followed.
A custom jinja filter, env_override, can be used to pick up values from environment variables.
It takes the environment variable name as input and will override the supplied value with the value from the environment.
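A minimal sketch of the filter in use; APP_ENV and 'dev' are illustrative names only:
variables:
  # 'dev' is used unless the environment variable APP_ENV is set,
  # in which case its value replaces the static one
  app_env: "{{ 'dev' | env_override('APP_ENV') }}"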
Order of precedence for inputs
- The value from a configuration file is used first.
- If a command-line value is supplied for the same keyword, it overrides the config file value.
- Values can be overridden in the dataflow YAML using static values, or filters such as default and env_override (see the annotated example below).
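The following annotated variable sketches how this precedence plays out, assuming the filters are applied left to right as in standard jinja; myvar and MYVAR are illustrative names:
variables:
  # 1. the config file value (if any) populates kwargs.myvar
  # 2. a --myvar command-line value overrides the config file value
  # 3. default('fallback', true) applies only if neither source supplied a value
  # 4. env_override('MYVAR') replaces the result when MYVAR is set in the environment
  myvar: "{{ kwargs.myvar | default('fallback', true) | env_override('MYVAR') }}"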
Required variables
There is often a need to have required variables that must be set in order for a dataflow to run properly. A special filter named required() can be used to test if a variable has a value or not. If the required check fails, the dataflow will fail with a warning log message.
An optional message can be included in the check using the format required('message').
This is useful to provide a prompt of what is expected in the variable.
Returning current input values
By default, the required check does not display the variable's current value. This is important for sensitive input values, since they would otherwise appear in log messages.
To see the input value, enable the “show_val” option using the format required('msg', True).
Note that if the variable is not supplied or is an empty string, only a colon “:” will be displayed.
Testing variables with RegEx
Regular expressions can be supplied to test the input values. The supplied pattern is run through a match.
To use a regex pattern, use the format required('msg', False, patt), where “patt” is the regex string.
There are lots of great regex resources available on the web. Here are a few examples:
- Useful Regex Patterns
- Complete guide to regex
- regex101.com - use the python flavor
Note that when supplying a regex pattern, the match automatically starts from the start of the string. Also, the string supplied is a regular string, not a “raw” string, so escape characters will need to be double-escaped (i.e. use “\\” to denote the “\” character).
Required Examples
name: required_var_missing
notes: |
  The examples below illustrate how the required filter can be used.
  Input values can be supplied via the cli using the format `--key value` (e.g. --var01 SomeValue).
variables:
  # fail if no value provided for var01
  var01: "{{ kwargs.var01|required }}"
  # fail with supplied message if no value provided for var02
  var02: "{{ kwargs.var02|required('var02 cannot be blank') }}"
  # fail unless var03 is 1, 2, or 3
  var03: "{{ kwargs.var03|required('var03 must be a number between 1 and 3', False, '[1-3]$') }}"
  # fail unless var04 is 1, 2, or 3 and show the input value
  var04: "{{ kwargs.var04|required('var04 must be a number between 1 and 3', True, '[1-3]$') }}"
steps:
  - name: required_keyword_arguments_present
    params:
      output: "This step only runs if all required variables are supplied."
Step run time variables
In addition to pipeline variables, the most commonly used variables are the runtime variables linked to steps. For example “steps.STEPNAME.data” references the data generated as part of a step.
See steps runtime variables for a listing of the runtime variables created for steps.
Also note the dataflow namespace which is used to reference variables in specific steps or jobs.
The most efficient way to reference a variable is to use the __ref syntax.
When a single value is to be used in dataflow YAML, use the format “VARNAME__ref” to retrieve it.
See context references for keywords.
To see the step run-time variables, see the show_context feature.
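A minimal sketch of a step run-time reference, assuming a set step named get_settings whose data is read by a later step; the names and values are illustrative only (see context references for the equivalent __ref form):
steps:
  - name: get_settings
    action: set
    params:
      data:
        region: us-east
  - name: use_settings
    params:
      # reads the data generated by the previous step via steps.STEPNAME.data
      output: "{{ steps.get_settings.data.region }}"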