Normalising Data with Talend tMap

The other day I got a particularly tricky set of data to transform. It consisted of a list of contracts each identifying an account and a list of contacts related to the contract and by implication the account. I needed to extract from this list the accounts and contacts and put them into a normalised table structure.

Here is what the data set I received looked like. I’ve shown a category of data in each column. In fact I had an Excel file where there were many columns in each category.

Contract

Account

Contact 1

Contact 2

Contact 3

Contact 15

 

Here is my target data model.

datmodel

 

So the contacts belong primarily to accounts but may have a relationship to any number of contracts.

Lets start by looking at the contacts. I wanted to normalise these out of the many columns and into a single set of contact columns. My target table looked like this

 

Contract Key

Account Key

Contact Key

Contact Full Name

Contact Mobile Phone

Contact Phone

Contact Email

Each set of source contact columns contained fields for the main attributes of the contact.

So for example the columns for Contact 1 looked like this:

Contact 1 Full Name

Contact 1 Mobile Phone

Contact 1 Phone

Contact 1 Email

So I needed a process that would look at each row of the data and break out some of the columns into new rows while copying down the common fields.

My tool of choice for this sort of processing is Talend. It is a freely available open source ETL tool that has a broad range of compatibility with databases, cloud platforms and file types. For this particular problem I used a tMap component in Talend. The input to the component was the source table with all the columns. The output consisted of fifteen joined tables, one for each contact. The first output table specified set the schema for the remaining tables and was defined to match the target table schema. For each contact the common fields are mapped using the Auto-Match button. The contact specific fields must be mapped manually as their column names will not match the output table column names. The manual mapping is a bit tedious to repeat fifteen times. The key here is to ensure that the output tables were joined together. This can be done for the second output table onwards by setting the ‘join to’ check box to true on the new output dialog. This is a useful feature that allows you to process the same row using different criteria into the output without resulting in multiple output tables. All the joined tables are actually output in a single output flow, the rows from each are effectively union-ed together. This is the equivalent then of a UNION of two or more SQL Statements in T-SQL. Usually you would not want to duplicate the input rows to the output like this and would make sure that the criteria each joined output table was mutually exclusive. For example if you were trying to apply default values to a record if an inner join failed. In this case though we get the correct behaviour because we need one row per contact which represents fifteen rows per input contract row.

 Once the output tables are all mapped the next step is to filter out any rows where all the contact columns are null. Each contract may have up to fifteen contacts associated to it so some will have fewer than that however the first map will not take account of this and so we will get fifteen contact rows for each contract regardless. So a second tMap component can be used to assess the contents of each of the Contact data columns. The Relational.ISNULL() and the String.equalsIgnoreCase() functions are useful along with the ternary operator ? : for this purpose. Here is an example expression for a single boolean variable that indicates that the Contact_1__Name column is null or blank. We are making use of the lazy evaluation in Java by placing the test for null first as if this is true the second criteria will not be evaluated. If we attempt to evaluate .equalsIgnoreCase on a null column value we will get a runtime exception.

Contact_1__NameIsBlank =

Relational.ISNULL(row1.Contact_1__Name) || row1.Contact_1_Name.equalsIgnoreCase(“”) ? true : false

This style of expression can be repeated for all of the contact fields on the row. Then we can bring them all together in a single boolean variable that indicates whether or not the contact columns are all blank. This single variable can then be used in the output table expression to filter out and ignore any rows where all the contact field are blank.

Once the empty contact rows are filtered out we can push all the remaining rows into the contact table.

However at this point we don’t have any keys for the contract or accounts associated with these contacts. We can go upstream of the process and add a contract key column to the input source data and pass that through onto the contacts. However the account key can’t be derived at this stage in the process. It will be necessary to process the contracts to extract the account data and then key this data before then reprocessing the contacts to work out which accounts they might belong to based on which contract they originated from.

So now l have three tables derived from the contract data where the account and contact data rows could be duplicated. My next task is to remove these duplicates.

 

Posted in Transformation Tagged with: , ,

Leave a Reply

Your email address will not be published. Required fields are marked *

*