This post is a musing on data pipeline tools. A high-level discussion is followed by three commercial products: StitchData, Alooma, and FiveTran. The list is incomplete and only represents the opinion of the author.
A data pipeline is software that integrates with diverse internal and external data sources for the purpose of consolidation into a single source of truth, sometimes a data warehouse. The database is typically used for reporting, analytics, or it powers a product feature. Building a data pipeline is a big commitment, even with great open source frameworks widely available and a team of enthusiastic backend developers. Once you have built a scaleable, fault tolerant architecture that never loses a byte, you need to keep up with API changes of existing integrations as well as add more integrations of new APIs as the business evolves and grows.
Some APIs are complex, e.g. SAP or NetSuite, and require significant expertise. Keeping the data pipeline alive and healthy usually requires a dedicated team. Their work encompasses keeping up with mandatory API changes and adding new integrations, which can mean weeks of development work.
What to do when things go awry? Implementing the happy path, even with full scalability and fault tolerance, is not enough. There must be a clear contingency plan for failed transactions. Keeping a backup of the raw data is a good start, but also the order of arrival might be business critical. A frequent cause for failure are changing data formats over time.
Error Logging and Notification
When the inevitable failure happens, how fast do you hear about it? Logging design and alert notification at all levels is essential. There needs to be an audit trail with enough information of the type of failure. The minimum type of notification is by email, although with the flood of today’s emails this may be ineffective. Failures can be short-lived and self-healing or require intervention such as resetting a credential or updating a payment method.
Speed and Scalability
The faster the better generally, to provide more timely information to the business. Tolerable lag time of data collection is a business decision. A lag time of 100 milli seconds will be a lot more expensive than 5 minutes or one hour. A one day lag may be acceptable for daily reporting. Scalability should be high enough to handle bursts but also small enough to be economic. Hosted cloud services like Amazon Kinesis are perfect solutions.
External APIs evolve over time. Version upgrades, new data types, etc. and old APIs are phased out, to name a few of the things that will break your integrations. It is definitely a chore to stay on top of all the API changes that you are integrated with. Besides commercial integration options for popular SaaS systems, it is good advice to look for a programmatic integration option (API, language specific connectors, webhooks).
Building a data pipeline is a complicated project and expensive to do right. The data pipeline often plays a supporting role to a business and is not core to the revenue generation itself. Therefore, it is hard to raise the funds and support for it. A data pipeline has high visibility when it’s broken and next to no visibility when it works. A 100% working data integration is taken for granted and probably necessary to enable smooth operation of other parts of a business, even though external integrations are a collaboration with outside partners and businesses and not always 100% predictable.
SaaS Data Pipelines (Data Normalization Tools)
This section presents the following three commercial solutions with a detailed feature matrix of them all at the end.
StitchData is the most inexpensive solution of the three, yet robust. It is written in Python and uses the Singer open source framework, which means you can build your own connector. Singer is a an open source framework for ETL. It adds credibility and transparency to Stitch.
My personal experience of Stitch was very positive. Through their online help chat, their experts guided me through my questions in no time. Our noSQL data in Mongo with deeply nested child objects in JSON was ‘relationalized’ in third-normal form automatically and correctly with synthetic parent-child keys injected. Exact error messages are not visible in the UI, at least for Mongo connectors.
The UI is not as spectacular as Alooma’s, but gives a authorative, relaxed view of all data replication statuses.
Alooma is a good solution to stream data from different sources into the data warehouse in realtime. The Alooma pipeline has a very clean ‘Code Engine’ and a ‘Mapper’ module for ETL and database mapping code in Python on the fly. The UI is very clean. The Re-stream Queue gives full control over failed data, makes it easy to fix transformations and mappings on the fly and retry. Alooma offers the ultimate flexibility in mapping and ETL in real-time. Its dashboard resembles a river of data flowing from left to right. Alooma is a better solution for systems with quickly changing data formats that needs would otherwise need a lot of manual intervention and coding. Pricing is higher and not transparent.
Very easy to setup, automated pull replication from SaaS sources. FiveTran has the least configuration of all by deciding what to replicate and how to name it for you. It detects deleted records, if the source API allows it, and appends a boolean column for deleted rows. They pride themselves to be the only zero-maintenance solution. The downside of this out-of-the-box configuration is loss of fine grained control. Nested objects in noSQL documents are extracted to a JSON string, which is not native to SQL databases. FiveTran has great support, reportedly, with the most startup feel. Pricing is not transparent and negotiable.
|Input Connectors||68 SaaS connectors
Advertising, Analytics, Automation, Business, CRM, Databases, Email, Payments, Support
|68 SaaS connectors
|62 data input connectors, e.g. NetSuite, Salesforce, Quickbooks, Stripe, …|
|Output Connectors (SQL DWH)||Redshift
|Redshift, Google BigQuery, Snowflake, MySQL, , MeMSQL,
Google BigQuery, Snowflake, MySQL, SQLServer, Azure, Panoply
|Replication Frequency||1 min to 24 h||continuous||15 min – 24 h|
|Granularity of Replication||Tables, fields, collections, endpoints||Tables, files, fields||Everything by default,
|Detection of Deleted records||no||no||yes|
|Extensible for custom connector||Yes, own connectors feasible||Custom webhook, REST API, file dump||Webhook callbackI|
|Connector API||Stitch API||Alooma API||None|
|Monitoring & Alert||Info, warnings, errors as they happen or in digests|
|Dashboard||Clean web UI||Very visual UI||Minimal UI|
|Latency||5 min and up||immediate||+15 min|
|Data backup||Complete historical data||S3||24 hours|
|noSQL||Transforms nested JSON documents into relational schemas automatically||Tailing OPLog of MongoDB||Saves nested JSON objects into a String column|
|Security||SOC-2||SOC-2, HIPAA, EU-US privacy shield||SOC-2,
|Trial||2 week||2 week||Unknown|
|Free tier||5 M rows / month||none||none|
100 million rows for $500
|Negotiated, but min. starting at $1000 / month||Negotiated|
|Live Support||Built in chat||Slack channel||Built in chat|