Data Pipelines in the Cloud

This post is a musing on data pipeline tools. A high-level discussion is followed by three commercial products: StitchData, Alooma, and FiveTran. The list is incomplete and only represents the opinion of the author. 

A data pipeline is software that integrates with diverse internal and external data sources for the purpose of consolidation into a single source of truth, sometimes a data warehouse. The database is typically used for reporting, analytics, or it powers a product feature.  Building a data pipeline is a big commitment, even with great open source frameworks widely available and a team of enthusiastic backend developers. Once you have built a scaleable, fault tolerant architecture that never loses a byte, you need to keep up with API changes of existing integrations as well as add more integrations of new APIs as the business evolves and grows. 

Some APIs are complex, e.g. SAP or NetSuite, and require significant expertise. Keeping the data pipeline alive and healthy usually requires a dedicated team. Their work encompasses keeping up with mandatory API changes and adding new integrations, which can mean weeks of development work.


What to do when things go awry? Implementing the happy path, even with full scalability and fault tolerance, is not enough. There must be a clear contingency plan for failed transactions. Keeping a backup of the raw data is a good start, but also the order of arrival might be business critical. A frequent cause for failure are changing data formats over time. 

Error Logging and Notification

When the inevitable failure happens, how fast do you hear about it? Logging design and alert notification at all levels is essential. There needs to be an audit trail with enough information of the type of failure. The minimum type of notification is by email, although with the flood of today’s emails this may be ineffective. Failures can be short-lived and self-healing or require intervention such as resetting a credential or updating a payment method.

Speed and Scalability

The faster the better generally, to provide more timely information to the business. Tolerable lag time of data collection is a business decision. A lag time of 100 milli seconds will be a lot more expensive than 5 minutes or one hour.  A one day lag may be acceptable for daily reporting.  Scalability should be high enough to handle bursts but also small enough to be economic. Hosted cloud services like Amazon Kinesis are perfect solutions.  


External APIs evolve over time. Version upgrades, new data types, etc. and old APIs are phased out, to name a few of the things that will break your integrations. It is definitely  a chore to stay on top of all the API changes that you are integrated with. Besides commercial integration options for popular SaaS systems, it is good advice to look for a programmatic integration option (API, language specific connectors, webhooks).


Building a data pipeline is a complicated project and expensive to do right. The data pipeline often plays a supporting role to a business and is not core to the revenue generation itself. Therefore, it is hard to raise the funds and support for it. A data pipeline has high visibility when it’s broken and next to no visibility when it works. A 100% working data integration is taken for granted and probably necessary to enable smooth operation of other parts of a business, even though external integrations are a collaboration with outside partners and businesses and not always 100% predictable.


SaaS Data Pipelines (Data Normalization Tools)

This section presents the following three commercial solutions with a detailed feature matrix of them all at the end.


StitchData is the most inexpensive solution of the three, yet robust. It is written in Python and uses the Singer open source framework, which means you can build your own connector. Singer is a an open source framework for ETL. It adds credibility and transparency to Stitch. 

My personal experience of Stitch was very positive. Through their online help chat, their experts guided me through my questions in no time. Our noSQL data in Mongo with deeply nested child objects in JSON was ‘relationalized’ in third-normal form automatically and correctly with synthetic parent-child keys injected. Exact error messages are not visible in the UI, at least for Mongo connectors.
The UI is not as spectacular as Alooma’s, but gives a authorative, relaxed view of all data replication statuses. 


Alooma is a good solution to stream data from different sources into the data warehouse in realtime. The Alooma pipeline has a very clean ‘Code Engine’ and a ‘Mapper’ module for ETL and database mapping code in Python on the fly. The UI is very clean. The Re-stream Queue gives full control over failed data, makes it easy to fix transformations and mappings on the fly and retry. Alooma offers the ultimate flexibility in mapping and ETL in real-time. Its dashboard resembles a river of data flowing from left to right. Alooma is a better solution for systems with quickly changing data formats that needs would otherwise need a lot of manual intervention and coding. Pricing is higher and not transparent. 


Very easy to setup, automated pull replication from SaaS sources. FiveTran has the least configuration of all by deciding what to replicate and how to name it for you. It detects deleted records, if the source API allows it, and appends a boolean column for deleted rows. They pride themselves to be the only zero-maintenance solution. The downside of this out-of-the-box configuration is loss of fine grained control. Nested objects in noSQL documents are extracted to a JSON string, which is not native to SQL databases. FiveTran has great support, reportedly, with the most startup feel. Pricing is not transparent and negotiable.

Dec. 2017

StitchData Alooma FiveTran
Input Connectors 68 SaaS connectors
Advertising, Analytics, Automation, Business, CRM, Databases, Email, Payments, Support
68 SaaS connectors
62 data input connectors, e.g. NetSuite, Salesforce, Quickbooks, Stripe, …
Output Connectors (SQL DWH) Redshift
Google BigQueryPanoply
Redshift, Google BigQuery, Snowflake, MySQL, , MeMSQL,
Google BigQuery, Snowflake, MySQL, SQLServer, Azure, Panoply
Replication Frequency 1 min to 24 h continuous 15 min – 24 h
Custom mapping no yes no
Granularity of Replication Tables, fields, collections, endpoints Tables, files, fields Everything by default,
Detection of Deleted records no no yes
Extensible for custom connector Yes, own connectors feasible Custom webhook, REST API, file dump Webhook callbackI
Connector API Stitch API Alooma API None
Monitoring & Alert email Info, warnings, errors as they happen or in digests email
Dashboard Clean web UI Very visual UI Minimal UI
Latency 5 min and up immediate +15 min
High Availability yes yes yes
Uptime Monitor ? ? ?
Data backup Complete historical data S3 24 hours
noSQL Transforms nested JSON documents into relational schemas automatically Tailing OPLog of MongoDB Saves nested JSON objects into a String column
Security SOC-2 SOC-2, HIPAA, EU-US privacy shield SOC-2,


Trial 2 week 2 week Unknown
Free tier 5 M rows / month none none
100 million rows for $500
Very transparent
Negotiated, but min. starting at  $1000 / month Negotiated
Live Support Built in chat Slack channel Built in chat



Facebooktwittergoogle_pluslinkedinmailby feather

The Looker 5 Analytics Platform

Looker is the next-generation data platform that delivers analytics and business insights to individual users, departments, applications and the enterprise as a whole. Users can integrate the insights to support data-driven decision making.

The five year old startup from Santa Cruz, California, just recently presented its new release 5 at its user conference JOIN 2017 at the Palace of Fine Arts Theater in San Francisco, attended by 700 business intelligence enthusiasts (including myself).

A review of Looker v5 is at A few of the spectacular features are

  • Action Hub: one-click buttons to data points actionable by configuring custom action, like DropBox or Marketo integration, if something happens.
  • Blocks: pre-built visualization and data blocks offer ready-to-use public data like weather and demographic to join with your data. Many blocks are available for common use cases like web analytics.
  • Applications: entire business workflows for marketing, event stream analytics, IT operations, and growing.
  • 57 new statistical functions, more security features, new landing page options, among other things.

October 5, 2017 was the actual release day of Looker v5, see

Facebooktwittergoogle_pluslinkedinmailby feather

JOIN 2017 Conference – Vendor Notes

On Sep 13-15, 2017, the beautiful Palace of Fine Arts in San Francisco hosted the very successful JOIN 2017 conference by Looker. These are my notes on vendors.


Sponsored the free coffee bar. Thank you very much!

Google Cloud



Novel data warehouse on S3 and EC2 with higher scalability for the masses.




Predictive analytics algorithms off the shelf on your data.


ETL with many connectors. Not good with noSQL.


Data governance layer on top of many sources






Boutique consultancy doing only Looker.





clarity insights



All in-memory MySQL compliant SQL database.










Low cost data pipeline tool written in Python. SaaS model. Allows injection of custom code in most recent version.



Facebooktwittergoogle_pluslinkedinmailby feather

Save Money with Google AdWords

Growing a business usually starts with a great idea, building something, marketing, sales, and execution. Focusing on saving money on digital ad spend becomes a chore once everything is setup and running, statistically enough data has been generated, and if your money saving activities have a higher return than other business tasks. Of course, everybody would like to make the best decisions in PPC from the get go, but information is sparse and things are dynamic. Following PPC tasks should be applied periodically, at least monthly, being general best practices:

Axe the Poor Performers
Pause or delete poor performing keywords. Keywords with a high cost of acquisition and no conversion are a waste. Check if they should move over to the negative keywords.

Grow List of Negative Keywords
Check the search term reports, if your keywords are actually relevant to your product.
Negative keywords can be defined at the account, campaign, and ad group level, with different match type settings. There are always surprises in this report, i.e. wasted money.

Adjust Keyword Match Type
Similarly, a match type too broad, can pull in too much irrelevant traffic and waste money. Consider adjusting keyword match type from modified broad to phrase to exact, trading in depth of specificity against breadth of exposure. [Note: this is time consuming and only recommended if you know for sure.]

Adjust Keyword Bids
For manually managed keywords with high bids and clicks, consider if a second position bid would yield the same benefit. [clicks * conversion * (profit – cost), use bid simulator]

Re-organize Ad Groups
Expert opinions for the number of keywords per ad group are from 20 to 50, depending on context. If too many keywords are part of an ad group, it is harder for the creative of the actual ad to be specific. Smaller ad groups allow for a more specific creative, and people click more on highly specific ads than generic ads with no particular proposition.

Branded Name Campaign
Create a campaign only dedicated to your company name. These are the highest converting campaigns because people are already searching for you. Protected brand names also have no competition (meaning traffic is cheap). Google disallows protected brand names as keywords, unless you are the owner of it.

Target Anything
You can make a PPC campaign more specific by targeting it by demographics (age, language, etc.), geographic area, devices (mobile, iPhone, tablet, etc.) and such. Also the ad campaign can be throttled to specified hours, to save money.

Use A/B Testing
The Google AdWords A/B testing tool allows to set aside a percentage of traffic for campaign experiments. After enough data is collected, Google will show which version of the campaign generates better results.

Revise Landing Pages
A/B testing can also be used to optimize landing pages. But before doing that, it should be audited if it matches the advertised content, and if it loads fast. Snappy landing pages that are highly specific to the ad content yield a high quality factor which saves money.

Use Remarketing Campaigns
Build remarketing lists and advertise to those who already had exposure to your content. Click throughs are much higher resulting in higher quality score which saves money.

Use Ad Extensions
Enriching your ad with site links, addresses, call buttons and such, a viewer is more likely to click on your ad, thus increasing clicks and raising quality.




Think of these practices as eliminating waste, to achieve a more productive state of growth.

Facebooktwittergoogle_pluslinkedinmailby feather

Targeting Existing Customers with Google’s New Customer Match

In September 2015 Google introduced a new way of remarketing to already existing customers by way of uploaded email addresses.

This feature, called Customer Match, is described at
You find it under Shared Library / Audiences / +Remarketing List, see image below.

You upload a list of email addresses of customers for targeting them with ads while they are logged in and searching on Google, in GMail, or YouTube.

Targeting existing customers in essence tries to convince the same people to spend more money with you, as opposed to finding new customers.

Some useful scenarios for configuring the targeting criteria of Customer Match are:

  • terms related to the customer’s original transaction for cross selling
  • terms of the customers original transaction for repeat selling
  • searches for competitor names and products as a friendly reminder that you are also open for business and ready for the competition
  • high performing keywords for your business

Screen Shot 2015-11-01 at 9.30.49 AM

Google has extensive privacy protection measures of the uploaded emails, e.g. emails are hashed so they cannot be stolen during transmission.

In summary, Customer Match is yet another channel of Remarketing thanks to Google’s ubiquity.


More References:





Facebooktwittergoogle_pluslinkedinmailby feather