Embedded Analytics Dashboard

"Powered by Looker" is Looker's term for embedding entire analytics dashboards into your own software product, i.e. white-labeling Looker as part of your own solution.

Many SaaS systems, especially enterprise-grade ones, need a reporting layer; analytics and reporting is one of the most essential aspects of a business system. Few people would argue that building your own reporting layer beats buying a best-of-breed one. Looker, recently added to the Gartner Magic Quadrant for Analytics and BI Platforms, is a strong candidate for white-labeling reporting into your product. This is the story of an embedded analytics dashboard built with Looker (5 min read).

Requirements

The project was to embed a set of canned user-activity reports together with specific business reports into a multi-tenant SaaS platform. Security and data privacy are of the utmost concern. Reporting needs to scale to tens of thousands of users and be as close to realtime as possible. The user experience must be seamless, i.e. no visible seam between the SaaS application and the reporting layer.

 

Data Pipeline and Data Warehouse

The origin of all business data is a clustered MongoDB database self-hosted in AWS. Data is extracted with StitchData, the cloud ETL tool described here.

StitchData allows you to create analytics data pipelines instantly without writing a single line of code. It has an intuitive realtime dashboard with up-to-date replication status for each pipeline. (Screenshot: a database connector in the StitchData dashboard.)

AWS Redshift is a fast, scalable, inexpensive data warehouse solution in the Amazon cloud. It is SQL based and can be used as the persistent datastore for the Looker BI platform. The data warehouse is the consolidating single source of truth where data from various sources is collected, cleaned, normalized, and brought into a usable form for reporting. This data pipeline is not quite realtime, but "near realtime": a StitchData connector can be set up to replicate new data as frequently as every 5 minutes (or as infrequently as once a day). There are more realtime solutions such as AWS Kinesis, Kinesis Firehose, and others, though they are usually more involved to set up and/or come at a higher price.

An interesting aspect of this setup is that it successfully transports MongoDB data (noSQL) over to AWS Redshift (SQL). StitchData unpacks the nested arrays of a JSON document into SQL child tables, injecting a synthetic foreign key for the parent-child relationship during replication. It correctly and automatically normalizes noSQL data into relational data. Other data pipeline providers (which I leave unnamed here) don't get past reading a nested JSON object out as a string, which means all follow-up work of parsing it into a useful data structure is on you.
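
To make the flattening concrete, here is a minimal sketch of what this looks like for a hypothetical sales document with a nested items array. The document shape (total, items, sku, qty) is invented for illustration; only the child table name (sales__items) and the synthetic key column (_sdc_source_key__id) correspond to what the LookML further down refers to.

# A hypothetical MongoDB document in the "sales" collection (illustrative only).
sale = {
    "_id": "5a1f3c",
    "client_id": 42,
    "total": 99.90,
    "items": [                        # nested array
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# After replication, Stitch produces two Redshift tables roughly like this:
#
#   sales
#     _id      client_id   total
#     5a1f3c   42          99.90
#
#   sales__items                      -- one row per array element
#     _sdc_source_key__id   sku     qty
#     5a1f3c                A-100   2
#     5a1f3c                B-200   1
#
# The synthetic column _sdc_source_key__id carries the parent's _id,
# so child rows can be joined back to the parent document.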

Security

Authentication to the hosted Looker account must be provided on the fly. The solution was a web service endpoint that returns the URL of the embedded Looker dashboard, customized to the user already logged into the SaaS application. The security context passed to Looker must correctly reflect the user's roles and permissions (authorization), and the Looker session must have the same duration as the user's SaaS session.
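
As a rough illustration, a backend endpoint along these lines can hand the embed URL to the front end. This is only a sketch: the framework choice (Flask), route, and session keys are assumptions, and build_signed_embed_url is a hypothetical helper whose signing logic is sketched in the section on building the iFrame URL further down.

# Hypothetical Flask endpoint returning a signed Looker embed URL for the
# user already authenticated in the SaaS application.
import time
from flask import Flask, jsonify, session

app = Flask(__name__)

@app.route("/api/reports/embed-url")
def embed_url():
    user = session["user"]
    # Match the Looker session_length to the time left in the SaaS session.
    session_length = int(session["expires_at"] - time.time())
    url = build_signed_embed_url(                              # sketched further down
        external_user_id=str(user["id"]),
        user_attributes={"client_id": user["client_id"]},      # drives the LookML access_filter
        permissions=["access_data", "see_user_dashboards", "see_looks"],
        session_length=session_length,
    )
    return jsonify({"url": url})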

 

Data Access Restriction in Multi-Tenant Environment

Looker allows additional required user attributes to be specified per user session. The sample LookML snippet below shows a left outer join of table sales__items with table sales (the join condition is sales__items._sdc_source_key__id = sales._id, a parent-child relationship introduced automatically by StitchData when unpacking the nested array items in the sales document).
Secondly, the access_filter parameter enforces an additional filter on sales.client_id to match the user attribute client_id. This provides the security context for a user in the multi-tenant SQL table.

explore: sales__items {
  join: sales {
    type: left_outer
    relationship: many_to_one
    sql_on: ${sales__items._sdc_source_key__id} = ${sales._id} ;;
  }
  access_filter: {
    field: sales.client_id
    user_attribute: client_id
  }
}

 

Building the URL for the iFrame

Step-by-step instructions for the embedded URL are at https://docs.looker.com/reference/embedding/sso-embed.
A few settings need to be changed in the Looker Admin console, and code needs to be written to produce a URL like the following:

https://analytics.mycompany.com/login/embed//embed/dashboards/1?
nonce="22b1ee700ef3dc2f500fb7"&
time=1407876784&
session_length=86400&
external_user_id="user-4"&
permissions=["access_data","see_user_dashboards","see_looks"]&
models=["model_one","model_two"]&
group_ids=[4,3]&
external_group_id="Allegra K"&
user_attributes={"vendor_id":17,"company":"acme"}&
access_filters={"model_one":{"vendor.id":17,"company.name":"acme"},
 "model_two":{"vendor.id":12,"company.name":"widgets-r-us"}}&
first_name="Alice"&
last_name="Jones"&
user_timezone="US/Pacific"&
force_logout_login=true&
signature=123456789ABCDEFGHIJKL
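
For reference, here is a minimal Python sketch of the signing step, modeled on Looker's published SSO embed examples (newline-joined parameters, HMAC-SHA1 with the embed secret, base64-encoded). The exact parameter set and field order should be verified against the documentation linked above; display-only parameters from the URL example (first_name, last_name, force_logout_login) are omitted, and host, embed path, and secret handling are assumptions.

# Sketch of Looker SSO embed URL signing; verify field order against the docs.
import base64
import hashlib
import hmac
import json
import os
import time
from urllib.parse import quote_plus

def build_signed_embed_url(external_user_id, permissions, session_length,
                           user_attributes=None, models=None, group_ids=None,
                           external_group_id="", access_filters=None,
                           host="analytics.mycompany.com",
                           embed_path="/embed/dashboards/1",
                           secret=os.environ.get("LOOKER_EMBED_SECRET", "")):
    # Every value is JSON-encoded before signing and before being URL-encoded.
    params = [
        ("nonce", json.dumps(os.urandom(16).hex())),
        ("time", json.dumps(int(time.time()))),
        ("session_length", json.dumps(session_length)),
        ("external_user_id", json.dumps(external_user_id)),
        ("permissions", json.dumps(permissions)),
        ("models", json.dumps(models or [])),
        ("group_ids", json.dumps(group_ids or [])),
        ("external_group_id", json.dumps(external_group_id)),
        ("user_attributes", json.dumps(user_attributes or {})),
        ("access_filters", json.dumps(access_filters or {})),
    ]

    # The string to sign is host, embed path, then the values above, newline-joined.
    string_to_sign = "\n".join([host, embed_path] + [v for _, v in params])
    signature = base64.b64encode(
        hmac.new(secret.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    ).decode().strip()

    query = "&".join(f"{k}={quote_plus(v)}" for k, v in params + [("signature", signature)])
    return f"https://{host}/login/embed/{quote_plus(embed_path)}?{query}"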

(Screenshot: example of a Looker dashboard that can be embedded as an iFrame into a SaaS application.)

 

Summary

Looker was used to deliver an embedded analytics and reporting solution for a large enterprise multi-tenant SaaS platform. A cloud analytics data pipeline was set up from MongoDB to AWS Redshift using StitchData without a line of code. Reports and dashboards were created in Looker on top of the AWS Redshift data warehouse. The URL of the top-level analytics dashboard, assigned to an iFrame of the SaaS application, was created by a backend API to fit the security context of the client company and the authorization level of the user.

Powered by Looker delivers a white-labeled solution with all the conveniences of a hosted service, such as automatic upgrades, along with the powerful features of Looker itself: report scheduling, data distribution, API integration, and access restriction, among others.

Lastly, it is very convenient from an operational point of view that new reports can be added to the reporting layer independently of the release schedule of the rest of the SaaS platform.

 

 


Data Pipelines in the Cloud

This post is a musing on data pipeline tools. A high-level discussion is followed by a look at three commercial products: StitchData, Alooma, and FiveTran. The list is incomplete and represents only the opinion of the author.

A data pipeline is software that integrates with diverse internal and external data sources in order to consolidate them into a single source of truth, often a data warehouse. That database is typically used for reporting and analytics, or it powers a product feature. Building a data pipeline is a big commitment, even with great open source frameworks widely available and a team of enthusiastic backend developers. Once you have built a scalable, fault-tolerant architecture that never loses a byte, you need to keep up with API changes in existing integrations as well as add integrations with new APIs as the business evolves and grows.

Some APIs, e.g. SAP or NetSuite, are complex and require significant expertise. Keeping the data pipeline alive and healthy usually requires a dedicated team whose work encompasses keeping up with mandatory API changes and adding new integrations, which can mean weeks of development work.

Accountability

What do you do when things go awry? Implementing the happy path, even with full scalability and fault tolerance, is not enough. There must be a clear contingency plan for failed transactions. Keeping a backup of the raw data is a good start, but the order of arrival might also be business critical. A frequent cause of failure is data formats changing over time.

Error Logging and Notification

When the inevitable failure happens, how fast do you hear about it? Logging design and alert notification at all levels are essential. There needs to be an audit trail with enough information about the type of failure. The minimum form of notification is email, although with the flood of today's email this may be ineffective. Failures can be short-lived and self-healing, or they can require intervention such as resetting a credential or updating a payment method.

Speed and Scalability

Generally, the faster the better, to provide more timely information to the business. The tolerable lag time of data collection is a business decision: a lag of 100 milliseconds will be a lot more expensive than 5 minutes or one hour, and a one-day lag may be acceptable for daily reporting. Scalability should be high enough to handle bursts but also small enough to be economical. Hosted cloud services like Amazon Kinesis are well suited for this.
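
As a tiny illustration of how little code hosted ingestion requires, here is a sketch that pushes one event to an Amazon Kinesis data stream with boto3; the stream name, region, and event payload are made up for illustration.

# Minimal sketch: push one event to a Kinesis data stream using boto3.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "user-4", "action": "report_viewed", "ts": 1407876784}
kinesis.put_record(
    StreamName="example-events",          # hypothetical stream
    Data=json.dumps(event).encode(),      # payload must be bytes
    PartitionKey=event["user_id"],        # controls shard assignment
)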

Flexibility

External APIs evolve over time: version upgrades, new data types, and old APIs being phased out are a few of the things that will break your integrations. It is a chore to stay on top of all the API changes you are integrated with. Besides commercial integration options for popular SaaS systems, it is good advice to look for a programmatic integration option (API, language-specific connectors, webhooks).

Summary

Building a data pipeline is a complicated project and expensive to do right. The data pipeline often plays a supporting role in a business and is not core to revenue generation itself, so it is hard to raise funds and support for it. A data pipeline has high visibility when it's broken and next to no visibility when it works. A fully working data integration is taken for granted and is necessary for the smooth operation of other parts of the business, even though external integrations are a collaboration with outside partners and are never fully predictable.

 

SaaS Data Pipelines (Data Normalization Tools)

This section presents three commercial solutions, with a detailed feature matrix at the end.

StitchData

StitchData is the least expensive solution of the three, yet robust. It is written in Python and uses Singer, an open source ETL framework, which means you can build your own connector. Singer adds credibility and transparency to Stitch.
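
To give an idea of what building your own connector involves, here is a toy Singer tap using the singer-python library. The stream name, schema, and records are invented for illustration, and a real tap would also handle state and incremental extraction.

# Toy Singer tap: emits a schema and a couple of records to stdout, which a
# Singer target (e.g. one loading into Redshift) can consume.
import singer

STREAM = "example_sales"
SCHEMA = {
    "properties": {
        "_id": {"type": "string"},
        "client_id": {"type": "integer"},
        "total": {"type": "number"},
    }
}

def main():
    singer.write_schema(STREAM, SCHEMA, key_properties=["_id"])
    records = [
        {"_id": "a1", "client_id": 42, "total": 99.90},
        {"_id": "a2", "client_id": 17, "total": 12.50},
    ]
    singer.write_records(STREAM, records)

if __name__ == "__main__":
    main()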

My personal experience with Stitch was very positive. Through their online help chat, their experts answered my questions in no time. Our noSQL data in Mongo, with deeply nested child objects in JSON, was 'relationalized' into third normal form automatically and correctly, with synthetic parent-child keys injected. One drawback: exact error messages are not visible in the UI, at least for Mongo connectors.
The UI is not as spectacular as Alooma's, but it gives an authoritative, relaxed view of all data replication statuses.


Alooma

Alooma is a good solution for streaming data from different sources into the data warehouse in realtime. The Alooma pipeline has a 'Code Engine' and a 'Mapper' module for writing ETL and database mapping code in Python on the fly. The UI is very clean, and its dashboard resembles a river of data flowing from left to right. The Re-stream Queue gives full control over failed data and makes it easy to fix transformations and mappings on the fly and retry. Alooma offers the most flexibility for realtime mapping and ETL, which makes it a better fit for systems with quickly changing data formats that would otherwise need a lot of manual intervention and coding. Pricing is higher and not transparent.
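
For flavor, a Code Engine transformation is essentially a Python function applied to each event in flight. The sketch below shows the general idea only; the exact function signature and the event fields are assumptions, not verified against Alooma's documentation.

# Sketch of an Alooma Code Engine style transformation (signature and field
# names are assumptions for illustration; consult Alooma's docs for the real contract).
def transform(event):
    # Drop test traffic (hypothetical field).
    if event.get("environment") == "test":
        return None  # returning None discards the event

    # Normalize a field on the fly before it reaches the warehouse.
    if "email" in event:
        event["email"] = event["email"].strip().lower()
    return event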


FiveTran

FiveTran is very easy to set up, with automated pull replication from SaaS sources. It requires the least configuration of the three, deciding what to replicate and how to name it for you. It detects deleted records, if the source API allows it, and appends a boolean column marking deleted rows. They pride themselves on being the only zero-maintenance solution. The downside of this out-of-the-box approach is a loss of fine-grained control: nested objects in noSQL documents are extracted to a JSON string, which is not native to SQL databases. FiveTran reportedly has great support, with the most startup feel. Pricing is not transparent and is negotiable.

Feature matrix (Dec. 2017):

| Feature | StitchData | Alooma | FiveTran |
|---|---|---|---|
| Website | https://www.stitchdata.com | https://www.alooma.com/platform | https://www.fivetran.com |
| Input connectors | 68 SaaS connectors (Advertising, Analytics, Automation, Business, CRM, Databases, Email, Payments, Support) | 68 SaaS connectors, databases | 62 data input connectors, e.g. NetSuite, Salesforce, Quickbooks, Stripe, … |
| Output connectors (SQL DWH) | Redshift, Google BigQuery, Panoply, PostgreSQL, Snowflake | Redshift, Google BigQuery, Snowflake, MySQL, MemSQL, S3 | Redshift, Google BigQuery, Snowflake, MySQL, SQL Server, Azure, Panoply |
| Replication frequency | 1 min to 24 h | continuous | 15 min to 24 h |
| Custom mapping | no | yes | no |
| Granularity of replication | Tables, fields, collections, endpoints | Tables, files, fields | Everything by default, tables |
| Detection of deleted records | no | no | yes |
| Extensible with custom connectors | Yes, own connectors feasible | Custom webhook, REST API, file dump | Webhook callback |
| Connector API | Stitch API | Alooma API | None |
| Monitoring & alerts | email | Info, warnings, errors as they happen or in digests | email |
| Dashboard | Clean web UI | Very visual UI | Minimal UI |
| Latency | 5 min and up | immediate | 15 min and up |
| High availability | yes | yes | yes |
| Uptime monitor | ? | ? | ? |
| Data backup | Complete historical data | S3 | 24 hours |
| noSQL handling | Transforms nested JSON documents into relational schemas automatically | Tails the MongoDB OpLog | Saves nested JSON objects into a String column |
| Security | SOC-2 | SOC-2, HIPAA, EU-US Privacy Shield | SOC-2, HIPAA |
| Trial | 2 weeks | 2 weeks | Unknown |
| Free tier | 5 M rows / month | none | none |
| Pricing | 100 million rows for $500; very transparent | Negotiated, min. starting at $1000 / month | Negotiated |
| Live support | Built-in chat | Slack channel | Built-in chat |

 

 
