ThatDataBlog

Where is my cloud ETL?

I’ve waited several years for someone else to set up a proper cloud ETL solution.

I’ve read Jamie Thomson’s excellent summary What would a cloud-based ETL tool look like? and The API Evangelist’s Bringing ETL to the Masses with APIs and held my breath waiting for someone else to sort it out.

So far I’ve waited in vain.

My ETL tool of choice is Talend, mainly because there is a completely free version that I can use without having to get my clients to fork out for a licence. My ideal would be to have all the visual programming functions of Talend in a cloud-based freemium tool (like dataloader.io but more advanced).

Recently I’ve been exploring IFTTT, Zapier and Workato, but they just aren’t there yet when it comes to the advanced data flows that I like to create.

Maybe one day I’ll get round to building one myself.

For the record I’ve also tried Pentaho, Clover ETL, Mule and Jitterbit but I’ve not warmed to them. Perhaps my mindset is fixed in Talend/SSIS mode and it will just take a bit of practice.

Posted in Cloud ETL, Transformation

How to see all SMS from several phones on one phone

1) Install and log in to IFTTT on each phone.
2) Create a recipe to log all SMS to a Google spreadsheet. Or use this recipe.
3) Set IFTTT to notify the user when the recipe is triggered

Result
All SMS received by any phone running IFTTT will be logged to the Google spreadsheet. A notification that the recipe was triggered will also be displayed on each phone, and that notification will contain the contents of the SMS.

Posted in Uncategorized

Building slowly changing dimensions from salesforce.com data

An interesting problem arose recently. A colleague was trying to build a type 2 slowly changing dimension (SCD 2) using data from the case history object in salesforce.com.

The primary problem is that the case history is an audit trail for selected case object fields and so does not record the full state of the case record at each transaction. So the first entry in the case history for any new record shows only that the record was created. There is no further metadata about the state of the record when it was inserted. It is only possible to calculate the initial state by taking the current state of the record and the case history records and replaying them in reverse.
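As a rough sketch of that replay in plain Java (the history row type here is just a hypothetical stand-in for the CaseHistory fields Field, OldValue, NewValue and CreatedDate, not the actual API):

import java.time.Instant;
import java.util.*;

// Hypothetical stand-in for a salesforce.com CaseHistory row.
class CaseHistoryRow {
    String field;        // which case field changed
    Object oldValue;     // value before the change
    Object newValue;     // value after the change
    Instant createdDate; // when the change happened
}

class CaseStateRebuilder {
    // Reconstruct the state the case had when it was inserted by starting
    // from its current field values and undoing each audited change,
    // newest change first.
    static Map<String, Object> initialState(Map<String, Object> currentState,
                                            List<CaseHistoryRow> history) {
        Map<String, Object> state = new HashMap<>(currentState);
        history.sort(Comparator.comparing((CaseHistoryRow h) -> h.createdDate).reversed());
        for (CaseHistoryRow h : history) {
            state.put(h.field, h.oldValue);
        }
        return state;
    }
}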

Usually SCD2 is used when taking regular snapshots of a source dataset, and the record start and end time stamps are derived from the execution time of the extraction process. In this case the need was to drive the record timestamps from the creation date-time fields of the case history records, so that users of the data could see all changes in the correct order. This had the effect of setting the grain of the time dimension to the second, because that was the finest resolution of the time stamps on the case history data.

That would not be easy to build, though, because the ETL process would have to create SCD2 records for each case history record. Those records would use the creation date-time of the case history row as both the start and end time stamps of the SCD2 record, and they would be bookended by the last-seen version of the case record and the most recent version. In effect the process would have to use a normal incremental load pattern to copy over the current end state of each record as a new SCD2 record, and then back-fill between this and the previous version all the intervening transactions found in the case history.

We haven’t attempted to implement it in this way yet. It would be preferable if we had some way of tracking changes to entire records in salesforce.com. That would probably require an Apex trigger to copy the entire row into a staging object within salesforce.com before our ETL process could extract the data. The obvious consequences of such an approach, though, would be a massive increase in the data storage required for the app and the performance impact of running a trigger that copies entire rows on every transaction. There may be other problems with this approach too.

The recent addition of Lightning components to salesforce.com might offer an alternative pattern for this and will be something to investigate further. Another avenue might be the Streaming API.

Posted in Salesforce.com

Extract, Load and Transform with Salesforce

Recently I had an interesting situation where I needed to load an account hierarchy daily from a single file. The tricky bit about this was that I couldn’t make any changes to the input source file as it came from a system that could not be changed in the timescales of my project.

My usual solution to this sort of data load would be a custom process built from Talend. However this time I could not change the load mechanism away from the Salesforce dataloader.

The approach I took was to implement an extract, load, transform process. I pointed the data load at a staging object that replicated the schema of the load file, then wrote a schedulable and batchable Apex class to transform the staging records into account records.

The transformation process had to work at three levels of grain aligned to each level in the account hierarchy. I used aggregate queries based on the top two layers to find the grandparents and parents that needed to be inserted or updated. Then the base level corresponding to each row in the staging table could be inserted or updated and at the same time linked to the correct parent.
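The real implementation was a schedulable, batchable Apex class using aggregate SOQL queries; purely as an illustration of the grouping logic, a plain-Java sketch (with hypothetical row and field names) looks something like this:

import java.util.*;

// Illustration only: the real job used aggregate SOQL queries inside a
// batchable Apex class. Row and field names here are hypothetical.
class StagingRow {
    String grandparentName;
    String parentName;
    String accountName;
}

class HierarchyTransform {
    static void transform(List<StagingRow> staging) {
        // Grain 1: distinct grandparents (the aggregate over the top layer).
        Set<String> grandparents = new LinkedHashSet<>();
        // Grain 2: distinct parents, each remembering its grandparent.
        Map<String, String> parentToGrandparent = new LinkedHashMap<>();
        for (StagingRow row : staging) {
            grandparents.add(row.grandparentName);
            parentToGrandparent.put(row.parentName, row.grandparentName);
        }
        // Upsert grandparents first, then parents linked to their grandparent,
        // then one base-level account per staging row linked to its parent.
    }
}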

I added some metadata to the staging records so an administrator could check any errors and fix them before using a custom button to reprocess them.

Using Apex code instead of the visual programming of Talend was a bit of a challenge; I had to translate the operations I wanted into Apex by hand.

A future improvement to my solution would be to use a custom setting to hold the field mappings and key definitions. I had to hard-code these in the Apex class to meet my time constraints. I would have preferred to make it easier to tweak the field mappings and to reuse the code.

Posted in Salesforce.com, Transformation

Fascinating genetic atlas of human admixture history

World Ancestry for North Italy

I just heard about this on the Science in Action podcast. It is a fascinating insight into historic events.

Posted in Visualisation

Salesforce have moved their documentation and Google hasn’t caught up yet.

Just a word of warning that Salesforce.com have moved all their documentation into a different location and format and Google’s indexes haven’t caught up yet. So if you do a search for something like ‘salesforce sObject methods’ pretty much all the results will lead you to a 404.

Here is a link to the new Force.com documentation page

And here is a direct link to the page for my example query above for sObject methods.

Posted in Salesforce.com

Talend tHash components are great but fiddly

The tHash component in Talend is a really useful way of caching data in a flow for use in a lookup, for self-joins and for loads of other useful things. However it has one fiddly drawback, which is that you have to make sure the schema of the input matches the output every time you use it. Wouldn’t it be great if you could just specify the schema on the tHash output component and have it propagate to all the tHash inputs that link to it?

Posted in Transformation

Normalising Data with Talend tMap

The other day I got a particularly tricky set of data to transform. It consisted of a list of contracts, each identifying an account and a list of contacts related to the contract and, by implication, to the account. I needed to extract the accounts and contacts from this list and put them into a normalised table structure.

Here is what the data set I received looked like. I’ve shown a category of data in each column. In fact I had an Excel file where there were many columns in each category.

Contract | Account | Contact 1 | Contact 2 | Contact 3 | ... | Contact 15

Here is my target data model.

(data model diagram)

So the contacts belong primarily to accounts but may have a relationship to any number of contracts.

Let’s start by looking at the contacts. I wanted to normalise these out of the many columns and into a single set of contact columns. My target table looked like this:

Contract Key | Account Key | Contact Key | Contact Full Name | Contact Mobile Phone | Contact Phone | Contact Email

Each set of source contact columns contained fields for the main attributes of the contact.

So for example the columns for Contact 1 looked like this:

Contact 1 Full Name

Contact 1 Mobile Phone

Contact 1 Phone

Contact 1 Email

So I needed a process that would look at each row of the data and break out some of the columns into new rows while copying down the common fields.

My tool of choice for this sort of processing is Talend. It is a freely available open source ETL tool with a broad range of compatibility with databases, cloud platforms and file types. For this particular problem I used a tMap component. The input to the component was the source table with all the columns. The output consisted of fifteen joined tables, one for each contact. The first output table set the schema for the remaining tables and was defined to match the target table schema. For each contact the common fields are mapped using the Auto-Match button. The contact-specific fields must be mapped manually as their column names will not match the output table column names. The manual mapping is a bit tedious to repeat fifteen times.

The key here is to ensure that the output tables are joined together. This can be done for the second output table onwards by setting the ‘join to’ check box to true on the new output dialog. This is a useful feature that allows you to process the same row using different criteria into the output without ending up with multiple output flows. All the joined tables are actually output in a single flow, with the rows from each effectively union-ed together, which is the equivalent of a UNION of two or more SQL statements in T-SQL. Usually you would not want to duplicate the input rows to the output like this, so you would make sure that the criteria for each joined output table were mutually exclusive, for example if you were trying to apply default values to a record when an inner join failed. In this case though we get the correct behaviour, because we need one row per contact, which means fifteen output rows per input contract row.
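For readers without Talend to hand, here is a rough plain-Java sketch of what those fifteen joined output tables amount to. The column names follow the example above and the row type is hypothetical:

import java.util.*;

// Rough sketch of the fifteen joined tMap output tables: each input contract
// row is projected once per contact block and the results are union-ed into
// a single flow. The row type and column names are illustrative only.
class ContactRow {
    String contractKey;  // common field copied down from the contract row
    String fullName;     // taken from the "Contact N Full Name" column
    String mobilePhone;
    String phone;
    String email;
}

class ContactUnpivot {
    // sourceRow maps source column names like "Contact 1 Full Name" to values.
    static List<ContactRow> unpivot(Map<String, String> sourceRow, String contractKey) {
        List<ContactRow> out = new ArrayList<>();
        for (int n = 1; n <= 15; n++) {
            ContactRow row = new ContactRow();
            row.contractKey = contractKey;
            row.fullName = sourceRow.get("Contact " + n + " Full Name");
            row.mobilePhone = sourceRow.get("Contact " + n + " Mobile Phone");
            row.phone = sourceRow.get("Contact " + n + " Phone");
            row.email = sourceRow.get("Contact " + n + " Email");
            out.add(row); // blank contact rows are filtered out in the next step
        }
        return out;
    }
}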

Once the output tables are all mapped, the next step is to filter out any rows where all the contact columns are null. Each contract may have up to fifteen contacts associated with it, so some will have fewer than that; the first map does not take account of this, so we get fifteen contact rows for each contract regardless. A second tMap component can therefore be used to assess the contents of each of the contact data columns. The Relational.ISNULL() and String.equalsIgnoreCase() functions are useful for this, along with the ternary operator ? :. Here is an example expression for a single boolean variable that indicates that the Contact_1_Name column is null or blank. We are making use of lazy (short-circuit) evaluation in Java by placing the test for null first: if it is true the second criterion will not be evaluated, whereas attempting to call .equalsIgnoreCase() on a null column value would throw a runtime exception.

Contact_1_NameIsBlank =
Relational.ISNULL(row1.Contact_1_Name) || row1.Contact_1_Name.equalsIgnoreCase("") ? true : false

This style of expression can be repeated for all of the contact fields on the row. Then we can bring them all together in a single boolean variable that indicates whether the contact columns are all blank. This single variable can then be used in the output table’s filter expression to ignore any rows where all the contact fields are blank.
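For example, the combined variable for the first contact might look like the expression below; the per-field flag names are illustrative, following the pattern above, and assume each field has its own IsBlank variable defined in the same way:

Contact_1_AllBlank =
Var.Contact_1_NameIsBlank && Var.Contact_1_MobilePhoneIsBlank && Var.Contact_1_PhoneIsBlank && Var.Contact_1_EmailIsBlank

The output table filter can then simply test that Var.Contact_1_AllBlank is false.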

Once the empty contact rows are filtered out we can push all the remaining rows into the contact table.

However, at this point we don’t have any keys for the contract or account associated with these contacts. We can go upstream of the process, add a contract key column to the input source data and pass that through onto the contacts. The account key, though, can’t be derived at this stage in the process. It will be necessary to process the contracts to extract the account data and key it, and then reprocess the contacts to work out which account each one belongs to based on the contract it originated from.
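A minimal sketch of that re-keying step in plain Java (in the real job this would be a Talend lookup rather than hand-written code, and the row and field names are hypothetical):

import java.util.*;

// Minimal sketch of the re-keying step: accounts are extracted from the
// contracts and keyed first, then each contact picks up its account key via
// the contract it originated from. Field names are illustrative only.
class ContactRekey {
    static void assignAccountKeys(List<Map<String, String>> contacts,
                                  List<Map<String, String>> accounts) {
        // Build a contract-key -> account-key lookup from the keyed accounts.
        Map<String, String> accountByContract = new HashMap<>();
        for (Map<String, String> account : accounts) {
            accountByContract.put(account.get("ContractKey"), account.get("AccountKey"));
        }
        // Each contact carries the contract key it was unpivoted from.
        for (Map<String, String> contact : contacts) {
            contact.put("AccountKey", accountByContract.get(contact.get("ContractKey")));
        }
    }
}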

So now I have three tables derived from the contract data, in which the account and contact rows could be duplicated. My next task is to remove these duplicates.

Posted in Transformation