Transforms

aggregate

Groups the table by the columns listed in group and computes the aggregate values listed in value. See Aggregate Functions below for the available aggregation functions.

aggregate value: sum(price),count() group: make,model
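For readers more used to general-purpose code, the grouping in the example above behaves roughly like the Python sketch below. The sample rows and the function name aggregate_sum_count are hypothetical; this only illustrates the semantics and is not how Trifacta executes the transform.

```python
from collections import defaultdict


def aggregate_sum_count(rows, group_cols, value_col):
    """Group rows by group_cols and return {group_key: (sum, count)}."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        totals[key][0] += row[value_col]  # running sum(value_col)
        totals[key][1] += 1               # running count()
    return {key: tuple(pair) for key, pair in totals.items()}


# Hypothetical sample data.
rows = [
    {"make": "honda", "model": "civic", "price": 20000},
    {"make": "honda", "model": "civic", "price": 22000},
    {"make": "honda", "model": "accord", "price": 30000},
]
result = aggregate_sum_count(rows, ["make", "model"], "price")
```

Each distinct (make, model) pair produces one output row, which is why the result is keyed by a tuple of the group columns.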

countpattern

Counts number of times a pattern occurs in a column.

// Counts occurrences of the word 'honda'.
countpattern col: text on: 'honda'
// Counts occurrences of any number using regular expressions.
countpattern col: text on: /[0-9]+/
// Counts occurrences of any number using Trifacta patterns.
countpattern col: text on: `#+`

delete

Deletes rows in a table that satisfy the supplied predicate function. The opposite of 'keep'.

delete row: (make == 'honda') && (price > 50000)

derive

Creates a new column by computing the supplied formula.

// Single row arithmetic.
derive value: price / 1000
// Create a new column with the mean of every value in a column.
derive value: mean(price)
// Create a new column with the mean of every value in a column, grouped by other columns.
derive value: mean(price) group: make,model
// Compute z-scores for all values in a column.
derive value: (price - mean(price)) / stdev(price)
// Compute z-scores for all values in a column, grouped by other columns.
derive value: (price - mean(price)) / stdev(price) group: make,model
// Normalize all values in a column to values between 0 and 1.
derive value: (price - min(price)) / (max(price) - min(price))
// Conditional logic.
derive value: (make == 'honda') ? 10 : 0
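The z-score and normalization formulas above can be checked by hand with a short Python sketch. It assumes the population standard deviation, as the stdev entry under Aggregate Functions states; the function names and sample prices are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """(value - mean) / population standard deviation, for every value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data.
prices = [10, 20, 30]
scores = z_scores(prices)
scaled = min_max(prices)
```

Both helpers divide by zero when every value in the column is identical, so a real script would need to decide how to handle a constant column.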

drop

Drops columns from the table.

drop col: make,price

extract

Extracts occurrences of a pattern into new columns.

// Extract text between 2nd and 4th characters.
extract col: text at: 2,4
// Extract occurrences of the word 'honda' up to 10 times.
extract col: text on: 'honda' limit: 10
// Extract occurrences of any number up to 10 times using regular expressions.
extract col: text on: /[0-9]+/ limit: 10
// Extract occurrences of any number up to 10 times using Trifacta patterns.
extract col: text on: `#+` limit: 10
// Extract occurrences of domains between http:// and /.
extract col: text after: 'http:\/\/' before: '\/'
// Extract occurrences of domains including http:// and /.
extract col: text from: 'http:\/\/' to: '\/'
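The difference between the last two examples (after/before excludes the delimiters, from/to includes them) can be sketched in Python as follows. The helper names and the sample URL are hypothetical, and the sketch returns only the first occurrence.

```python
import re


def extract_between(text, after, before):
    """Return the text strictly between the two delimiters, or None."""
    pattern = re.escape(after) + r"(.*?)" + re.escape(before)
    match = re.search(pattern, text)
    return match.group(1) if match else None


def extract_including(text, start, end):
    """Return the matched text including both delimiters, or None."""
    pattern = re.escape(start) + r".*?" + re.escape(end)
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Hypothetical sample value.
url = "http://example.com/index.html"
domain_only = extract_between(url, "http://", "/")
with_delimiters = extract_including(url, "http://", "/")
```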

highlight

Highlights rows in a table that satisfy the supplied predicate function.

highlight row: (make == 'honda') && (price > 50000)

header

Converts the first row of data to column headers.

header 

keep

Keeps only those rows in a table that satisfy the supplied predicate function. The opposite of 'delete'.

keep row: (make == 'honda') && (price > 50000)
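Because keep and delete are opposites, applying the same predicate to both partitions the table. A minimal Python sketch of that relationship, with hypothetical sample rows and helper names:

```python
def keep(rows, predicate):
    """Return only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def delete(rows, predicate):
    """Return only the rows for which predicate(row) is false."""
    return [row for row in rows if not predicate(row)]


def expensive_honda(row):
    # Mirrors: (make == 'honda') && (price > 50000)
    return row["make"] == "honda" and row["price"] > 50000


# Hypothetical sample data.
rows = [
    {"make": "honda", "price": 60000},
    {"make": "honda", "price": 20000},
    {"make": "acura", "price": 70000},
]
kept = keep(rows, expensive_honda)
remaining = delete(rows, expensive_honda)
```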

move

Moves a column.

move col: make after: model
move col: make before: model

rename

Renames a column.

rename col: make to: 'newColumnName'

replace

Replaces occurrences of a pattern with new values. 'global:' controls how many times the replace operation is performed within a single value: when global is true, all matches in a value are replaced; when it is false, only the first match is replaced.

// Delete all quotes.
replace col: * on: '\"' with: '' global: true
// Replace perrson with person
replace col: make on: 'perrson' with: 'person' global: false
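The effect of global can be illustrated with Python's re.sub, whose count argument plays a similar role. This is a sketch of the described behavior for literal strings only; the function name replace_text is hypothetical.

```python
import re


def replace_text(value, on, replacement, replace_all=False):
    """Replace the literal string `on`; all matches if replace_all, else the first."""
    count = 0 if replace_all else 1  # for re.sub, count=0 means "replace every match"
    # A function replacement avoids re.sub interpreting backslashes in `replacement`.
    return re.sub(re.escape(on), lambda _match: replacement, value, count=count)


all_quotes_removed = replace_text('"a" and "b"', '"', "", replace_all=True)
first_only = replace_text("perrson perrson", "perrson", "person")
```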

replaceposition

Replaces text between start and end positions with a new value.

// Delete the first two characters (replace them with an empty string).
replaceposition col: make start: 0 end: 2 with: ''

merge

Merges (i.e., concatenates) values together.

// Merge two columns together.
merge col: make,model

set

Overwrites values in a column by computing the supplied formula. See derive transform for more examples of formulas.

// Single row arithmetic.
set col: price value: price / 1000
// Conditional logic.
set col: make value: (make == 'honda') ? make : 'Other'

split

Splits a column on occurrences of a string or pattern, up to the number of times specified by limit, or at fixed character positions.

// Split column at the 2nd character.
split col: text at: 2,2
// Split column on commas 10 times.
split col: text on: ',' limit: 10
// Split column on commas that are not between double quotes.
split col: text on: ',' limit: 10 quote: '\"'
// Split column on a regular expression.
split col: text on: /[,;]/ limit: 10 quote: '\"'
// Split column on a Trifacta pattern.
split col: text on: `{alpha}` limit: 10 quote: '\"'
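The interaction between limit and quote can be sketched in Python as below. The sketch assumes a single-character delimiter and keeps the quote characters in the output; whether Trifacta strips them is not stated here, so treat that as an assumption. The function name is hypothetical.

```python
def split_outside_quotes(value, delimiter=",", quote='"', limit=10):
    """Split on delimiter at most `limit` times, ignoring delimiters inside quotes."""
    parts, current = [], []
    in_quotes = False
    splits = 0
    for ch in value:
        if ch == quote:
            in_quotes = not in_quotes  # toggle on every quote character
            current.append(ch)
        elif ch == delimiter and not in_quotes and splits < limit:
            parts.append("".join(current))
            current = []
            splits += 1
        else:
            current.append(ch)
    parts.append("".join(current))  # whatever remains after the last split
    return parts


quoted = split_outside_quotes('a,"b,c",d')
limited = split_outside_quotes("a,b,c", limit=1)
```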

splitrows

Splits text into multiple rows. Currently, you can only split on strings (patterns and regular expressions are not yet supported).

// Split rows on newlines.
splitrows col: column1 on: '\n'
// Split rows on newlines that are not between double quotes.
splitrows col: column1 on: '\r\n' quote: '\"'

Aggregate Functions

count

Computes the count of tuples.

max

Computes the maximum value in a column.

min

Computes the minimum value in a column.

mean

Computes the mean for values in a column.

sum

Computes the sum for values in a column.

stdev

Computes the population standard deviation of values in a column.

var

Computes the population variance of values in a column.

Logic Functions

and

Computes lhs && rhs.

delete row: (make == 'honda') && (model == 'maserati')

or

Computes lhs || rhs.

delete row: (make == 'honda') || (model == 'maserati')

not

Computes !operand.

delete row: !(make == 'honda')

Comparison Functions

in

Computes whether lhs is contained in the array rhs.

delete row: in(make, ['honda','maserati','lamborghini'])

matches

Computes whether a string contains a string or pattern.

// Delete rows where a column matches a specific string.
delete row: matches(make, 'honda')
// Delete rows where a column matches a regular expression.
delete row: matches(make, /A.+/)

lessThan

Computes whether lhs < rhs.

delete row: price < 100

lessThanEqual

Computes whether lhs <= rhs.

delete row: price <= 100

greaterThan

Computes whether lhs > rhs.

delete row: price > 100

greaterThanEqual

Computes whether lhs >= rhs.

delete row: price >= 100

equal

Computes whether lhs equals rhs.

delete row: price == 100

notEqual

Computes whether lhs does not equal rhs.

delete row: price != 100

Math Functions

abs

Computes the absolute value of a value.

derive value: abs(price)

add

Adds lhs to rhs.

derive value: price + 100

exp

Computes e raised to the power of a value.

derive value: exp(price)

subtract

Subtracts rhs from lhs.

derive value: price - 100

multiply

Multiplies lhs by rhs.

derive value: price * 100

log

Computes the logarithm of lhs to base rhs.

derive value: log(price, 10)

pow

Raises lhs to the power of rhs.

derive value: pow(price, 2)

divide

Divides lhs by rhs.

derive value: price / 100

ceil

Computes ceiling of operand.

derive value: ceil(price)

ln

Computes natural logarithm of a value.

derive value: ln(price)

sqrt

Computes square root of operand. This function is currently only available in the JavaScript runtime and not in Hadoop.

derive value: sqrt(price)

floor

Computes floor of operand.

derive value: floor(price)

Date Functions

date

Constructs a date from year, month and day.

// Delete rows where date is between Jan 1st, 2013 and Feb 15th, 2013
delete row: (date(2013, 1, 1) <= date) && (date <= date(2013, 2, 15))

time

Constructs a time from hour, minute and second.

// Delete rows where date is between 9am and 3pm
delete row: (time(9, 0, 0) <= date) && (date <= time(15, 0, 0))

Month

Computes the month (as an integer) of a date column.

derive value: month(date)

MonthName

Computes the name of month of a date column.

derive value: monthname(date)

Year

Computes the year of a date column.

derive value: year(date)

Day

Computes the day of a date column.

derive value: day(date)

Hour

Computes the hour of a time or datetime column.

derive value: hour(time)

Minute

Computes the minute of a time or datetime column.

derive value: minute(time)

Second

Computes the second of a time or datetime column.

derive value: second(time)

String Functions

Char

Converts number to unicode character.

// Converts 126 to ~
derive value: char(126)

Unicode

Produces the Unicode code point of the first character in a string.

// Converts '~' to 126
derive value: unicode('~')

Uppercase

Converts string to uppercase.

derive value: uppercase('honda')

Lowercase

Converts string to lowercase.

derive value: lowercase('honda')

Length

Calculates length of a string.

derive value: length('honda')

Trim

Trims whitespace from a string.

derive value: trim('honda')

Substring

Computes substring of another string.

// Pull out first 3 letters of a string
derive value: substring('honda', 0, 3)
// Pull out the 2nd through 5th characters of a string
derive value: substring('honda', 1, 5)

Left

Extracts first k characters of a string.

derive value: left('honda', 2)

Right

Extracts last k characters of a string.

derive value: right('honda', 2)
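Assuming a zero-based start index and an exclusive end index, which is what the first substring example suggests (0, 3 yields the first three letters), substring, left and right map onto Python slicing. The sketch below relies on that assumption and is not a statement of Trifacta's implementation.

```python
def substring(s, start, end):
    """Characters from index `start` (inclusive) up to `end` (exclusive)."""
    return s[start:end]


def left(s, k):
    """First k characters."""
    return s[:k]


def right(s, k):
    """Last k characters; k == 0 must be special-cased because s[-0:] is the whole string."""
    return s[-k:] if k > 0 else ""


first_three = substring("honda", 0, 3)
second_to_fifth = substring("honda", 1, 5)
```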

Find

Finds the index of the first occurrence of text.

// Find the token 'honda' in make (case-sensitive).
derive value: find(make, 'honda', false, 0)
// Find the token 'honda' in make and ignore case.
derive value: find(make, 'honda', true, 0)
// Find the token 'honda' in make after the third character.
derive value: find(make, 'honda', true, 2)
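Judging from the comments above, the third argument appears to be an ignore-case flag and the fourth a zero-based start index. Under that assumption, find can be approximated in Python as follows. Returning -1 when nothing is found is Python's convention; what Trifacta returns in that case is not stated here.

```python
def find(text, token, ignore_case=False, start=0):
    """Index of the first occurrence of token at or after `start`, or -1."""
    if ignore_case:
        text, token = text.lower(), token.lower()
    return text.find(token, start)


# Hypothetical sample values.
exact = find("my honda", "honda", False, 0)
any_case = find("My HONDA", "honda", True, 0)
case_mismatch = find("My HONDA", "honda", False, 0)
later = find("honda honda", "honda", True, 2)
```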

Type Functions

empty

Checks whether the values in the column(s) are empty.

// Delete rows where a column is empty
delete row: empty([make])

mismatched

Checks whether the values in a column do not match a given type.

// Delete rows that are integers or empty
keep row: mismatched(make, ['Integer'])

valid

Checks whether the values in a column match a given type.

// Delete rows that are not integers
keep row: valid(make, ['Integer'])

Text Matching

Trifacta supports three types of text matching clauses:

  • String literals are written using single quotes ('...') or double quotes ("..."). These will only match the provided string exactly.
  • Regular expressions are written using forward slashes (/.../). The regular expression syntax is based on JavaScript regular expressions.
  • Selection rules (called Trifacta patterns in the examples above) are written using backticks (`...`). They provide an easier-to-use, more readable alternative to regular expressions:

%                 match any character, exactly once
%?                match any character, zero or one times
%*                match any character, zero or more times
%+                match any character, one or more times
%{3}              match any character, exactly three times
%{3,5}            match any character, 3, 4, or 5 times
#, {digit}        digit character [0-9]
{alpha}           alpha character [A-Za-z]
{upper}           uppercase alpha character [A-Z]
{lower}           lowercase alpha character [a-z]
{phone}, {url}    extensible types, registered as regexp patterns
{[...]}           character class, match characters in brackets
{![...]}          negated class, match characters not in brackets
(...)             grouping, including captures
\#, \%, \?, \*, \+, \{, \}, \(, \), \', \n, \t    escaped character literals
|                 logical OR
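To relate a few of the tokens above to regular expressions, the Python sketch below lists rough equivalents based only on the character classes given in the table. The mapping is illustrative and incomplete; in particular, Python's "." does not match a newline by default, which may differ from %.

```python
import re

# Rough regular-expression equivalents of a few selection-rule tokens.
ROUGH_REGEX_EQUIVALENTS = {
    "%": r".",
    "#": r"[0-9]",
    "{digit}": r"[0-9]",
    "{alpha}": r"[A-Za-z]",
    "{upper}": r"[A-Z]",
}

# The rule #+ (one or more digits) corresponds roughly to /[0-9]+/.
digits_regex = ROUGH_REGEX_EQUIVALENTS["#"] + "+"
digit_runs = re.findall(digits_regex, "call 555 or 1234")

# The rule {upper}{alpha}+ corresponds roughly to /[A-Z][A-Za-z]+/.
name_regex = ROUGH_REGEX_EQUIVALENTS["{upper}"] + ROUGH_REGEX_EQUIVALENTS["{alpha}"] + "+"
names = re.findall(name_regex, "Honda and acura")
```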

Trifacta CLI Script

The Trifacta CLI script is used to programmatically run a script produced in Transformer. Each run of the Trifacta CLI script creates a new job. The CLI job appears on the Jobs page with the CLI Job icon.

Running the CLI script has the following requirements:

  • The CLI script must be able to access a running Trifacta instance. You can specify the host and port where the instance is running.
  • The CLI script can only access data sets that are stored on HDFS which must be accessible from the running Trifacta instance.
  • Currently, the CLI script cannot contain any Join or Lookup transformations.

To run the CLI script, run the command below from the top-level Trifacta directory. Replace the app_host parameter with the host and port of the running Trifacta instance and the user parameter with the email of a valid user on that instance. If no password is specified, you will be prompted to enter one.

python ./bin/execute_script_template.py --app_host=localhost:3005 --script=/path/to/template/file --data=hdfs://path/to/dataset --job_type=pig --monitor=on --user=username --password=password
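When the CLI script is launched from another program, the same command can be assembled as an argument list. The Python sketch below uses only the flags shown in the command above; the helper names build_command and run_job are hypothetical, and the placeholder values must be replaced with real ones.

```python
import subprocess


def build_command(app_host, script, data, user,
                  job_type="pig", monitor="on", password=None):
    """Assemble the CLI invocation as an argument list (no shell quoting needed)."""
    cmd = [
        "python", "./bin/execute_script_template.py",
        "--app_host=" + app_host,
        "--script=" + script,
        "--data=" + data,
        "--job_type=" + job_type,
        "--monitor=" + monitor,
        "--user=" + user,
    ]
    # Omitting --password leaves the CLI to prompt for it, as described above.
    if password is not None:
        cmd.append("--password=" + password)
    return cmd


def run_job(cmd, trifacta_dir):
    """Run from the top-level Trifacta directory; raises if the exit status is non-zero."""
    return subprocess.run(cmd, cwd=trifacta_dir, check=True)


cmd = build_command(
    app_host="localhost:3005",
    script="/path/to/template/file",
    data="hdfs://path/to/dataset",
    user="username",
)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths, and leaving the password out of the list keeps it out of process listings.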

For documentation on the Trifacta CLI parameters, run:

python ./bin/execute_script_template.py --help