Groups the table by the columns listed in 'group:' and computes the aggregate expressions listed in 'value:' for each group. See Aggregate Functions below for available aggregation functions.
aggregate value: sum(price),count() group: make,model
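As a rough illustration of what the aggregate example above computes, here is a minimal Python sketch. It is not Trifacta's implementation; the function name `aggregate` and the sample rows are made up for this example.

```python
from collections import defaultdict

def aggregate(rows, group_cols, value_col):
    """Group rows by group_cols; return {group key: (sum of value_col, row count)}."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        totals[key][0] += row[value_col]  # running sum(price)
        totals[key][1] += 1               # running count()
    return {key: tuple(v) for key, v in totals.items()}

# Hypothetical sample data.
rows = [
    {"make": "honda", "model": "civic", "price": 20000},
    {"make": "honda", "model": "civic", "price": 22000},
    {"make": "honda", "model": "accord", "price": 27000},
]
result = aggregate(rows, ["make", "model"], "price")
```

Each distinct (make, model) pair yields one output entry, which mirrors how the table shrinks to one row per group.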
Counts the number of times a pattern occurs in a column.
// Counts occurrences of the word 'honda'.
countpattern col: text on: 'honda'
// Counts occurrences of any number using regular expressions.
countpattern col: text on: /[0-9]+/
// Counts occurrences of any number using Trifacta patterns.
countpattern col: text on: `#+`
Deletes rows in a table that satisfy the supplied predicate function. The opposite of 'keep'.
delete row: (make == 'honda') && (price > 50000)
Creates a new column by computing the supplied formula.
// Single row arithmetic.
derive value: price / 1000
// Create a new column with the mean of every value in a column.
derive value: mean(price)
// Create a new column with the mean of every value in a column, grouped by other columns.
derive value: mean(price) group: make,model
// Compute z-scores for all values in a column.
derive value: (price - mean(price)) / stdev(price)
// Compute z-scores for all values in a column, grouped by other columns.
derive value: (price - mean(price)) / stdev(price) group: make,model
// Normalize all values in a column to values between 0 and 1.
derive value: (price - min(price)) / (max(price) - min(price))
// Conditional logic.
derive value: (make == 'honda') ? 10 : 0
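The grouped z-score formula above can be sketched in Python as follows. This is only an illustration under assumptions: it uses the population standard deviation (matching the description of stdev under the aggregate functions), and it returns None when a group's standard deviation is zero, which is a choice made for this sketch and not documented Trifacta behavior. The function name and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def zscores_by_group(rows, value_col, group_cols):
    """Return one z-score per row, computed within the row's group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_cols)].append(row[value_col])
    # Per-group mean and population standard deviation.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    scores = []
    for row in rows:
        mu, sigma = stats[tuple(row[c] for c in group_cols)]
        # Assumption: undefined z-scores (sigma == 0) are reported as None.
        scores.append((row[value_col] - mu) / sigma if sigma else None)
    return scores

# Hypothetical sample data.
rows = [
    {"make": "honda", "price": 10},
    {"make": "honda", "price": 20},
    {"make": "honda", "price": 30},
    {"make": "maserati", "price": 90},
]
scores = zscores_by_group(rows, "price", ["make"])
```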
Drops columns from the table.
drop col: make,price
Extracts occurrences of a pattern into new columns.
// Extract text between 2nd and 4th characters.
extract col: text at: 2,4
// Extract occurrences of the word 'honda' up to 10 times.
extract col: text on: 'honda' limit: 10
// Extract occurrences of any number up to 10 times using regular expressions.
extract col: text on: /[0-9]+/ limit: 10
// Extract occurrences of any number up to 10 times using Trifacta patterns.
extract col: text on: `#+` limit: 10
// Extract occurrences of domains between http:// and /.
extract col: text after: 'http:\/\/' before: '\/'
// Extract occurrences of domains including http:// and /.
extract col: text from: 'http:\/\/' to: '\/'
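The difference between the last two examples (after:/before: excludes the delimiters, from:/to: includes them) can be illustrated with Python regular expressions. This is a sketch of the semantics as described in the comments above, not of Trifacta's matching engine; the sample text is made up.

```python
import re

text = "see http://example.com/page for details"

# after: / before: -> the delimiters themselves are not part of the result,
# modeled here with a lookbehind and a lookahead.
between = re.search(r"(?<=http://).*?(?=/)", text).group(0)

# from: / to: -> the delimiters are part of the result.
including = re.search(r"http://.*?/", text).group(0)
```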
Highlights rows in a table that satisfy the supplied predicate function.
highlight row: (make == 'honda') && (price > 50000)
Converts the first row of data to column headers.
header
Keeps only those rows in a table that satisfy the supplied predicate function. The opposite of 'delete'.
keep row: (make == 'honda') && (price > 50000)
Moves a column.
move col: make after: model
move col: make before: model
Renames a column.
rename col: make to: 'newColumnName'
Replaces occurrences of a pattern with new values. 'global:' controls how many times the replace operation is performed within a single value: setting global to true replaces all matches in a value, not just the first one.
// Delete all quotes.
replace col: * on: '\"' with: '' global: true
// Replace perrson with person.
replace col: make on: 'perrson' with: 'person' global: false
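The effect of the global flag on a single value can be sketched with Python's re.sub, where count=1 replaces only the first match and count=0 replaces every match. The helper name `replace_value` is hypothetical, and the sketch handles literal string patterns only.

```python
import re

def replace_value(value, pattern, replacement, global_flag=False):
    """Replace a literal pattern in one value; all matches if global_flag is true."""
    count = 0 if global_flag else 1  # re.sub: 0 means "no limit"
    # A function replacement avoids backslash processing in the replacement text.
    return re.sub(re.escape(pattern), lambda m: replacement, value, count=count)

all_quotes_removed = replace_value('"a","b"', '"', "", global_flag=True)
first_quote_removed = replace_value('"a","b"', '"', "", global_flag=False)
```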
Replaces text between start and end positions with a new value.
// Replace the first two characters with an empty string.
replaceposition col: make start: 0 end: 2 with: ''
Merges (i.e., concatenates) values together.
// Merge two columns together.
merge col: make,model
Overwrites values in a column by computing the supplied formula. See derive transform for more examples of formulas.
// Single row arithmetic.
set col: price value: price / 1000
// Conditional logic.
set col: make value: (make == 'honda') ? make : 'Other'
Splits a value on occurrences of a string or pattern, up to the number of times specified by 'limit:'.
// Split column at the 2nd character.
split col: text at: 2,2
// Split column on commas 10 times.
split col: text on: ',' limit: 10
// Split column on commas that are not between double quotes.
split col: text on: ',' limit: 10 quote: '\"'
// Split column on a regular expression.
split col: text on: /[,;]/ limit: 10 quote: '\"'
// Split column on a Trifacta pattern.
split col: text on: `{alpha}` limit: 10 quote: '\"'
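To make the quote: and limit: options more concrete, here is a small Python sketch of splitting on a delimiter that is ignored between quote characters. It assumes a single-character delimiter and that quote characters are kept in the output; both are assumptions of this sketch, not documented Trifacta behavior, and `split_quoted` is a hypothetical name.

```python
def split_quoted(value, delimiter=",", quote='"', limit=10):
    """Split value on delimiter at most `limit` times, skipping quoted regions."""
    parts, current = [], []
    in_quotes = False
    splits = 0
    for ch in value:
        if ch == quote:
            in_quotes = not in_quotes  # toggle on every quote character
            current.append(ch)
        elif ch == delimiter and not in_quotes and splits < limit:
            parts.append("".join(current))
            current = []
            splits += 1
        else:
            current.append(ch)
    parts.append("".join(current))  # remainder after the last split
    return parts

fields = split_quoted('a,"b,c",d')
limited = split_quoted("a,b,c", limit=1)
```

With limit: 1, everything after the first delimiter stays in the final piece, which is why a limit caps the number of new columns.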
Splits text into multiple rows. Currently, you can only split on strings (patterns and regular expressions are not yet supported).
// Split rows on newlines.
splitrows col: column1 on: '\n'
// Split rows on line breaks that are not between double quotes.
splitrows col: column1 on: '\r\n' quote: '\"'
Computes the count of tuples.
Computes the maximum value in a column.
Computes the minimum value in a column.
Computes the mean for values in a column.
Computes the sum for values in a column.
Computes the population standard deviation of values in a column.
Computes the population variance of values in a column.
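Because the standard deviation and variance above are the population versions, they divide by n and not by n - 1. The following Python sketch shows the difference using the standard library; the sample values are made up.

```python
from statistics import pstdev, pvariance, stdev

prices = [10, 20, 30]

pop_var = pvariance(prices)  # population variance: divides by n
pop_sd = pstdev(prices)      # population standard deviation: sqrt(pop_var)
sample_sd = stdev(prices)    # sample standard deviation: divides by n - 1
```

For the same data, the population figure is always the smaller of the two, so results may differ from tools that default to the sample formula.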
Computes lhs && rhs.
delete row: (make == 'honda') && (model == 'maserati')
Computes lhs || rhs.
delete row: (make == 'honda') || (model == 'maserati')
Computes !operand.
delete row: !(make == 'honda')
Computes whether lhs is contained in the array rhs.
delete row: in(make, ['honda','maserati','lamborghini'])
Computes whether a string contains a string or pattern.
// Delete rows where a column matches a specific string.
delete row: matches(make, 'honda')
// Delete rows where a column matches a regular expression.
delete row: matches(make, /A.+/)
Computes whether lhs < rhs.
delete row: price < 100
Computes whether lhs <= rhs.
delete row: price <= 100
Computes whether lhs > rhs.
delete row: price > 100
Computes whether lhs >= rhs.
delete row: price >= 100
Computes whether lhs equals rhs.
delete row: price == 100
Computes whether lhs does not equal rhs.
delete row: price != 100
Computes the absolute value of a value.
derive value: abs(price)
Adds lhs to rhs.
derive value: price + 100
Computes e raised to the power of a value.
derive value: exp(price)
Subtracts rhs from lhs.
derive value: price - 100
Multiplies lhs by rhs.
derive value: price * 100
Computes logarithm of lhs base rhs.
derive value: log(price, 10)
Raises lhs to the power of rhs.
derive value: pow(price, 2)
Divides lhs by rhs.
derive value: price / 100
Computes ceiling of operand.
derive value: ceil(price)
Computes natural logarithm of a value.
derive value: ln(price)
Computes square root of operand. This function is currently only available in the JavaScript runtime and not in Hadoop.
derive value: sqrt(price)
Computes floor of operand.
derive value: floor(price)
Constructs a date from year, month and day.
// Delete rows where date is between Jan 1st, 2013 and Feb 15th, 2013.
delete row: (date(2013, 1, 1) <= date) && (date <= date(2013, 2, 15))
Constructs a time from hour, minute and second.
// Delete rows where the time is between 9am and 3pm.
delete row: (time(9, 0, 0) <= date) && (date <= time(15, 0, 0))
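The date-range predicate above is inclusive at both ends. A minimal Python sketch of the same filter, using the standard datetime module, might look like this; the function name and sample rows are hypothetical.

```python
from datetime import date

START, END = date(2013, 1, 1), date(2013, 2, 15)

def delete_in_range(rows, column="date"):
    """Drop rows whose date falls in [START, END], inclusive at both ends."""
    return [row for row in rows if not (START <= row[column] <= END)]

# Hypothetical sample data covering both boundaries.
rows = [
    {"id": 1, "date": date(2012, 12, 31)},
    {"id": 2, "date": date(2013, 1, 1)},
    {"id": 3, "date": date(2013, 2, 15)},
    {"id": 4, "date": date(2013, 2, 16)},
]
remaining = delete_in_range(rows)
```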
Computes the month (as an integer) of a date column.
derive value: month(date)
Computes the name of month of a date column.
derive value: monthname(date)
Computes the year of a date column.
derive value: year(date)
Computes the day of a date column.
derive value: day(date)
Computes the hour of a time or datetime column.
derive value: hour(time)
Computes the minute of a time or datetime column.
derive value: minute(time)
Computes the second of a time or datetime column.
derive value: second(time)
Converts number to unicode character.
// Converts 126 to ~
derive value: char(126)
Produce unicode code point for first character in a string.
// Converts '~' to 126
derive value: unicode('~')
Converts string to uppercase.
derive value: uppercase('honda')
Converts string to lowercase.
derive value: lowercase('honda')
Calculates length of a string.
derive value: length('honda')
Trims whitespace from a string.
derive value: trim('honda')
Computes substring of another string.
// Pull out the first 3 letters of a string.
derive value: substring('honda', 0, 3)
// Pull out the 2nd through 5th characters of a string.
derive value: substring('honda', 1, 5)
Extracts first k characters of a string.
derive value: left('honda', 2)
Extracts last k characters of a string.
derive value: right('honda', 2)
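The substring, left and right examples above map naturally onto Python slicing. This sketch assumes 0-based start positions and an exclusive end position, which is consistent with substring('honda', 0, 3) yielding the first 3 letters; the helper functions are illustrative only.

```python
def substring(s, start, end):
    """Characters from start (0-based, inclusive) up to end (exclusive)."""
    return s[start:end]

def left(s, k):
    """First k characters."""
    return s[:k]

def right(s, k):
    """Last k characters; the guard avoids s[-0:] returning the whole string."""
    return s[-k:] if k > 0 else ""

first_three = substring("honda", 0, 3)
first_two = left("honda", 2)
last_two = right("honda", 2)
```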
Finds the index of the first occurrence of text.
// Find the token 'honda' in make.
derive value: find(make, 'honda', false, 0)
// Find the token 'honda' in make and ignore case.
derive value: find(make, 'honda', true, 0)
// Find the token 'honda' in make after the third character.
derive value: find(make, 'honda', true, 2)
Checks whether the values in the column(s) are empty.
// Delete rows where a column is empty.
delete row: empty([make])
Checks whether the values in a column do not match a given type.
// Delete rows that are integers or empty.
keep row: mismatched(make, ['Integer'])
Checks whether the values in a column match a given type.
// Delete rows that are not integers.
keep row: valid(make, ['Integer'])
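The three checks above partition values into empty, valid, and mismatched. The Python sketch below shows one way to model that partition for the Integer type. The rules used here (whitespace-only counts as empty, anything Python's int() accepts counts as a valid integer) are assumptions of this sketch, not Trifacta's actual type validation.

```python
def is_empty(value):
    # Assumption: None and whitespace-only strings are treated as empty.
    return value is None or str(value).strip() == ""

def is_valid_integer(value):
    if is_empty(value):
        return False
    try:
        int(str(value).strip())
        return True
    except ValueError:
        return False

def is_mismatched_integer(value):
    # Mismatched means: not empty, but also not a valid integer.
    return not is_empty(value) and not is_valid_integer(value)

values = ["42", "-7", "honda", "", "3.5"]
kept_valid = [v for v in values if is_valid_integer(v)]
kept_mismatched = [v for v in values if is_mismatched_integer(v)]
```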
Trifacta supports three types of text matching clauses:
Selection rules are written using backticks (`...`). Selection rules provide an easier to use and more readable alternative to regular expressions:
% | match any character, exactly once |
%? | match any character, zero or one times |
%* | match any character, zero or more times |
%+ | match any character, one or more times |
%{3} | match any character, exactly three times |
%{3,5} | match any character, 3, 4, or 5 times |
#, {digit} | digit character [0-9] |
{alpha} | alpha character [A-Za-z] |
{upper} | uppercase alpha character [A-Z] |
{lower} | lowercase alpha character [a-z] |
{phone}, {url} | extensible types, registered as regexp patterns |
{[...]} | character class, match characters in brackets |
{![...]} | negated class, match characters not in brackets |
(...) | grouping, including captures |
\#, \%, \?, \*, \+, \{, \}, \(, \), \', \n, \t | escaped character literals |
| | logical OR |
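To show how a few of the tokens in the table relate to regular expressions, here is a small Python sketch that translates a subset of the pattern syntax into a regex. It covers only %, #, {digit}, {alpha}, {upper}, {lower} and the quantifiers; extensible types, character classes, grouping, escapes and OR are left out. The translator is an illustration written for this document, not how Trifacta compiles patterns.

```python
import re

# Subset of tokens from the table above and their regex equivalents.
TOKEN_TO_REGEX = {
    "#": "[0-9]",
    "%": ".",
    "{digit}": "[0-9]",
    "{alpha}": "[A-Za-z]",
    "{upper}": "[A-Z]",
    "{lower}": "[a-z]",
}

# Try longer tokens first so "{alpha}" is not read as a literal "{".
_TOKEN_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(TOKEN_TO_REGEX, key=len, reverse=True))
)
# Quantifiers share their syntax with regular expressions.
_QUANTIFIER_RE = re.compile(r"[?*+]|\{\d+(,\d+)?\}")

def pattern_to_regex(pattern):
    """Translate the supported subset of pattern syntax into a regex string."""
    out, i = [], 0
    while i < len(pattern):
        token = _TOKEN_RE.match(pattern, i)
        if token:
            out.append(TOKEN_TO_REGEX[token.group(0)])
            i = token.end()
            quantifier = _QUANTIFIER_RE.match(pattern, i)
            if quantifier:  # pass the quantifier through unchanged
                out.append(quantifier.group(0))
                i = quantifier.end()
        else:
            out.append(re.escape(pattern[i]))  # anything else is a literal
            i += 1
    return "".join(out)

digits = pattern_to_regex("#+")
any_three_to_five = pattern_to_regex("%{3,5}")
capitalized = pattern_to_regex("{upper}{lower}+")
```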
The Trifacta CLI script programmatically runs a script produced in Transformer. Each run of the CLI script creates a new job, which appears on the Jobs page marked with the CLI Job icon.
Running the CLI script has the following requirements:
To run the CLI script, run the command below from the top-level Trifacta directory. Replace the app_host parameter with the host and port of the running Trifacta instance, and the user parameter with the email of a valid user on that instance. If no password is specified, you will be prompted to enter one.
python ./bin/execute_script_template.py --app_host=localhost:3005 --script=/path/to/template/file --data=hdfs://path/to/dataset --job_type=pig --monitor=on --user=username --password=password
For documentation on the Trifacta CLI parameters, run:
python ./bin/execute_script_template.py --help
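If you want to launch the CLI from another Python program, a thin wrapper along the following lines may help. It only assembles the flags shown in the command above; the function names `build_command` and `run_job` and the sample argument values are hypothetical, and the wrapper assumes it is run with the top-level Trifacta directory as the working directory.

```python
import subprocess

def build_command(script, data, user, app_host="localhost:3005", password=None):
    """Assemble the argument list for execute_script_template.py."""
    cmd = [
        "python", "./bin/execute_script_template.py",
        f"--app_host={app_host}",
        f"--script={script}",
        f"--data={data}",
        "--job_type=pig",
        "--monitor=on",
        f"--user={user}",
    ]
    # Omit --password to let the CLI prompt for it interactively.
    if password is not None:
        cmd.append(f"--password={password}")
    return cmd

def run_job(trifacta_dir, **kwargs):
    """Run the CLI from trifacta_dir; raises CalledProcessError on failure."""
    return subprocess.run(build_command(**kwargs), cwd=trifacta_dir, check=True)

# Hypothetical argument values, for illustration only.
cmd = build_command(
    script="/path/to/template/file",
    data="hdfs://path/to/dataset",
    user="username",
)
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.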