Data Pipelines in the Cloud

This post is a musing on data pipeline tools. A high-level discussion is followed by a look at three commercial products: StitchData, Alooma, and FiveTran. The list is incomplete and represents only the opinion of the author.

A data pipeline is software that integrates with diverse internal and external data sources and consolidates them into a single source of truth, often a data warehouse. That database is typically used for reporting and analytics, or it powers a product feature. Building a data pipeline is a big commitment, even with great open source frameworks widely available and a team of enthusiastic backend developers. Once you have built a scalable, fault-tolerant architecture that never loses a byte, you still need to keep up with API changes in existing integrations and add integrations for new APIs as the business evolves and grows.
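To make the shape of such a system concrete, here is a minimal sketch of the extract-and-load core of a pipeline in Python. The endpoint, credentials, response shape, and table are hypothetical placeholders, and a real pipeline would add batching, retries, and incremental state.

    # Minimal extract-and-load sketch: pull records from a (hypothetical)
    # JSON API and upsert them into a Postgres warehouse table.
    import requests
    import psycopg2

    def extract(api_url, api_key):
        resp = requests.get(api_url, headers={"Authorization": "Bearer " + api_key})
        resp.raise_for_status()
        return resp.json()["records"]          # assumed response shape

    def load(records, dsn):
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                for rec in records:
                    cur.execute(
                        "INSERT INTO customers (id, email, updated_at) "
                        "VALUES (%(id)s, %(email)s, %(updated_at)s) "
                        "ON CONFLICT (id) DO UPDATE SET "
                        "email = EXCLUDED.email, updated_at = EXCLUDED.updated_at",
                        rec)

    if __name__ == "__main__":
        records = extract("https://api.example.com/v1/customers", "SECRET")
        load(records, "dbname=warehouse user=etl")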

Some APIs, such as SAP or NetSuite, are complex and require significant expertise. Keeping the data pipeline alive and healthy usually requires a dedicated team, whose work encompasses keeping up with mandatory API changes and adding new integrations, either of which can mean weeks of development work.

Accountability

What to do when things go awry? Implementing the happy path, even with full scalability and fault tolerance, is not enough. There must be a clear contingency plan for failed transactions. Keeping a backup of the raw data is a good start, but the order of arrival might also be business-critical. A frequent cause of failure is data formats changing over time.
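As an illustration of what such a contingency plan can look like in code, here is a sketch that archives every raw payload before processing and diverts failures to a dead-letter location for later replay. The bucket name and key layout are assumptions, not a prescription.

    # Archive raw payloads (with arrival order encoded in the key) and
    # dead-letter any transaction that fails processing.
    import json
    import time
    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-pipeline-raw"  # hypothetical bucket

    def handle(payload, process):
        # Key encodes arrival time, so the archive preserves ordering.
        key = "raw/%017d-%s.json" % (int(time.time() * 1000), uuid.uuid4())
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode())
        try:
            process(payload)
        except Exception as exc:
            # Keep the payload and failure reason together for later replay.
            s3.put_object(
                Bucket=BUCKET,
                Key=key.replace("raw/", "dead-letter/", 1),
                Body=json.dumps({"payload": payload, "error": str(exc)}).encode())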

Error Logging and Notification

When the inevitable failure happens, how fast do you hear about it? Logging design and alert notification at all levels are essential. There needs to be an audit trail with enough information to classify the type of failure. The minimum form of notification is email, although with today's flood of email this may be ineffective. Failures can be short-lived and self-healing, or they can require intervention such as resetting a credential or updating a payment method.
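A standard-library sketch of this idea: a file-based audit trail for every event, plus email alerts for errors only. The hostnames and addresses are placeholders.

    # Audit trail on disk for all events; ERROR and above also go out by email.
    import logging
    from logging.handlers import SMTPHandler

    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.INFO)

    audit = logging.FileHandler("pipeline-audit.log")
    audit.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(audit)

    alerts = SMTPHandler(mailhost="smtp.example.com",
                         fromaddr="pipeline@example.com",
                         toaddrs=["oncall@example.com"],
                         subject="[pipeline] failure")
    alerts.setLevel(logging.ERROR)
    logger.addHandler(alerts)

    logger.info("replicated table=orders rows=1042")
    logger.error("credential expired for connector=netsuite")  # triggers an email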

Speed and Scalability

Generally, the faster the better, to provide more timely information to the business; but the tolerable lag time of data collection is a business decision. A lag time of 100 milliseconds will be a lot more expensive to achieve than 5 minutes or one hour, and a one-day lag may be perfectly acceptable for daily reporting. Capacity should be high enough to handle bursts but elastic enough to stay economical. Hosted cloud services like Amazon Kinesis are a natural fit here.
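For instance, publishing events to Kinesis with boto3 takes only a few lines; the stream name and event fields below are made up:

    # Push events into an Amazon Kinesis stream, which absorbs bursts and
    # lets consumers scale independently of producers.
    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def publish(event):
        kinesis.put_record(
            StreamName="pipeline-events",           # hypothetical stream
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event["source_id"]))   # keeps per-source ordering

    publish({"source_id": 42, "type": "order.created", "amount": 99.5})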

Flexibility

External APIs evolve over time: versions are upgraded, new data types appear, and old APIs are phased out, to name a few of the things that will break your integrations. Staying on top of all the API changes you are integrated with is definitely a chore. Besides the commercial integration options for popular SaaS systems, it is good advice to look for a programmatic integration option (API, language-specific connectors, webhooks).
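A webhook endpoint is often the simplest of these programmatic options, because the source pushes changes to you instead of you polling a versioned API. A minimal sketch using Flask (the framework choice and route are illustrative):

    # Minimal webhook receiver: the external system POSTs events here.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/webhooks/source-events", methods=["POST"])
    def receive():
        event = request.get_json(force=True)
        # Placeholder: hand the event to the pipeline for processing.
        print("received", event)
        return "", 204

    if __name__ == "__main__":
        app.run(port=8080)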

Summary

Building a data pipeline is a complicated project and expensive to do right. The data pipeline often plays a supporting role in a business and is not core to revenue generation itself, so it is hard to raise funds and support for it. A data pipeline has high visibility when it is broken and next to no visibility when it works. A 100% working data integration is taken for granted, even though external integrations are a collaboration with outside partners and businesses and are never fully predictable; yet it is quietly necessary for the smooth operation of other parts of the business.


SaaS Data Pipelines (Data Normalization Tools)

This section presents three commercial solutions, with a detailed feature matrix covering all of them at the end.

StitchData

StitchData is the least expensive solution of the three, yet robust. It is written in Python and built on Singer, an open source ETL framework, which means you can build your own connector. The open framework adds credibility and transparency to Stitch.
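Singer keeps the connector contract small: a ‘tap’ is just a program that writes SCHEMA, RECORD, and STATE messages as JSON lines to stdout. A toy tap, with a made-up ‘users’ stream, can therefore be this short:

    # Toy Singer tap: emits a schema, one record, and bookmark state.
    import json
    import sys

    def emit(message):
        sys.stdout.write(json.dumps(message) + "\n")

    emit({"type": "SCHEMA", "stream": "users",
          "key_properties": ["id"],
          "schema": {"type": "object",
                     "properties": {"id": {"type": "integer"},
                                    "email": {"type": "string"}}}})
    emit({"type": "RECORD", "stream": "users",
          "record": {"id": 1, "email": "a@example.com"}})
    emit({"type": "STATE", "value": {"users": {"last_id": 1}}})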

My personal experience of Stitch was very positive: through their online help chat, their experts guided me through my questions in no time. Our noSQL data in Mongo, with deeply nested child objects in JSON, was ‘relationalized’ into third normal form automatically and correctly, with synthetic parent-child keys injected. One drawback: exact error messages are not visible in the UI, at least for Mongo connectors.
The UI is not as spectacular as Alooma’s, but gives an authoritative, relaxed view of all data replication statuses.
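To illustrate what that ‘relationalization’ means, here is a sketch that mimics the idea (this is not Stitch’s actual code, and the synthetic column names are made up): a nested document becomes a parent row plus child rows joined by an injected key.

    # Split a nested Mongo document into a parent row and child rows.
    def relationalize(doc):
        parent = {k: v for k, v in doc.items() if not isinstance(v, list)}
        children = []
        for i, item in enumerate(doc.get("line_items", [])):
            child = dict(item)
            child["_parent_id"] = doc["_id"]  # injected synthetic key
            child["_sequence"] = i            # preserves array order
            children.append(child)
        return parent, children

    order = {"_id": "o1", "customer": "c9",
             "line_items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}
    parent_row, child_rows = relationalize(order)
    # parent_row -> orders table; child_rows -> orders__line_items table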


Alooma

Alooma is a good solution for streaming data from different sources into the data warehouse in real time. The Alooma pipeline has a ‘Code Engine’ for writing ETL code in Python on the fly and a ‘Mapper’ module for database mappings. The UI is very clean; the dashboard resembles a river of data flowing from left to right. The Re-stream Queue gives full control over failed data, making it easy to fix transformations and mappings on the fly and retry, so Alooma offers the ultimate flexibility in mapping and ETL in real time. It is the better solution for systems with quickly changing data formats that would otherwise need a lot of manual intervention and coding. Pricing is higher and not transparent.
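A sketch of what a Code Engine-style transform looks like: a Python function that receives each event as a dict and returns the modified event, or None to drop it. The field names below are invented for illustration.

    # Event transform in the style of Alooma's Code Engine.
    def transform(event):
        # Normalize a field whose upstream format changed.
        if isinstance(event.get("amount"), str):
            event["amount"] = float(event["amount"].lstrip("$"))
        # Drop internal test events instead of mapping them.
        if event.get("env") == "test":
            return None
        return event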


FiveTran

FiveTran is very easy to set up, with automated pull replication from SaaS sources. It requires the least configuration of the three, deciding for you what to replicate and how to name it. It detects deleted records, if the source API allows it, and appends a boolean column that flags deleted rows. They pride themselves on being the only zero-maintenance solution. The downside of this out-of-the-box approach is the loss of fine-grained control. Nested objects in noSQL documents are extracted to a JSON string, which is not native to SQL databases. Reportedly, FiveTran has great support, with the most startup feel of the three. Pricing is not transparent and is negotiated.
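In practice, that boolean column means downstream queries filter on a soft-delete flag rather than expecting physical deletes. A sketch against a Postgres-compatible warehouse; the _fivetran_deleted column name follows FiveTran’s convention as I understand it, and the schema and table are illustrative.

    # Query only live rows; deletes arrive as a soft-delete flag.
    import psycopg2

    conn = psycopg2.connect("dbname=warehouse user=analyst")
    with conn:
        with conn.cursor() as cur:
            cur.execute("SELECT id, email FROM salesforce.contact "
                        "WHERE NOT _fivetran_deleted")
            live_contacts = cur.fetchall()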

Feature matrix (Dec. 2017):

| Feature | StitchData | Alooma | FiveTran |
|---|---|---|---|
| Website | https://www.stitchdata.com | https://www.alooma.com/platform | https://www.fivetran.com |
| Input connectors | 68 SaaS connectors (advertising, analytics, automation, business, CRM, databases, email, payments, support) | 68 SaaS connectors, databases | 62 data input connectors, e.g. NetSuite, Salesforce, Quickbooks, Stripe, … |
| Output connectors (SQL DWH) | Redshift, Google BigQuery, Panoply, PostgreSQL, Snowflake | Redshift, Google BigQuery, Snowflake, MySQL, MemSQL, S3 | Redshift, Google BigQuery, Snowflake, MySQL, SQL Server, Azure, Panoply |
| Replication frequency | 1 min to 24 h | continuous | 15 min to 24 h |
| Custom mapping | no | yes | no |
| Granularity of replication | tables, fields, collections, endpoints | tables, files, fields | everything by default, tables |
| Detection of deleted records | no | no | yes |
| Extensible with custom connectors | yes, own connectors feasible | custom webhook, REST API, file dump | webhook callback |
| Connector API | Stitch API | Alooma API | none |
| Monitoring & alerts | email | info, warnings, errors as they happen or in digests | email |
| Dashboard | clean web UI | very visual UI | minimal UI |
| Latency | 5 min and up | immediate | 15 min and up |
| High availability | yes | yes | yes |
| Uptime monitor | ? | ? | ? |
| Data backup | complete historical data | S3 | 24 hours |
| noSQL handling | transforms nested JSON documents into relational schemas automatically | tails the MongoDB oplog | saves nested JSON objects into a string column |
| Security | SOC 2 | SOC 2, HIPAA, EU-US Privacy Shield | SOC 2, HIPAA |
| Trial | 2 weeks | 2 weeks | unknown |
| Free tier | 5 M rows / month | none | none |
| Pricing | $500 for 100 million rows; very transparent | negotiated, but starting at min. $1000 / month | negotiated |
| Live support | built-in chat | Slack channel | built-in chat |
