Embedded Analytics Dashboard

Powered by Looker is the term coined by Looker for embedding entire analytics dashboards into your proprietary software, i.e. white labeling Looker as your own solution.

Many SaaS systems, especially enterprise-grade ones, need a reporting layer. Analytics and reporting is an important, if not essential, part of a business system, and few people would argue that building your own reporting layer is better than buying a best-of-breed one. Recently added to the Gartner Magic Quadrant for Analytics and BI Platforms, Looker is well suited for white labeling reporting into your product. This is the story of an embedded analytics dashboard powered by Looker (5 min read).

Requirements

The project was to provide a set of canned user activity reports together with specific business reports embedded into a multi-tenant SaaS platform. Security and data privacy are of the utmost concern. Reporting needs to scale to tens of thousands of users and be as close to realtime as possible. The user experience must be seamless, i.e. no hint of an interruption between system boundaries is allowed.

 

Data Pipeline and Data Warehouse

The origin of all business data is a clustered Mongo database self-hosted in AWS. Data is extracted with the cloud ETL tool StitchData.

StitchData allows creating analytics data pipelines instantly without writing a single line of code. It has an intuitive realtime dashboard with timely information on the replication status of each pipeline. [Screenshot: a StitchData database connector.]

AWS Redshift is a fast, scalable, inexpensive data warehouse solution in the Amazon cloud. It is SQL based and can be used as the persistent datastore for the Looker BI platform. The data warehouse is the consolidating single point of truth where data from various sources is collected, cleaned, normalized, and brought into a usable form for reporting. This data pipeline is not quite realtime, but "near realtime": a connector in StitchData can be set up to replicate new data as frequently as every 5 minutes or as infrequently as once a day. There are more realtime solutions, such as AWS Kinesis and Kinesis Firehose, though they are usually more involved to set up and/or come at a higher price.

An interesting aspect of this setup is that it successfully transports MongoDB data (noSQL) over to AWS Redshift (SQL). StitchData unpacks the nested arrays of a JSON document into child tables in SQL, injecting a synthetic foreign key for the parent-child relationship during replication. It correctly and automatically normalizes noSQL data into SQL tables. Other data pipeline providers (which I leave unnamed here) don't get past reading a nested JSON object out as a string, which leaves all the follow-up work of parsing it into a useful data structure to you.
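As an illustration (hypothetical document and field values, not actual Stitch output), a sales document with a nested items array ends up as a parent table and a child table:

// Hypothetical "sales" document in MongoDB (illustrative only)
var salesDoc = {
  _id: "5a1f0c...",                 // becomes sales._id in Redshift
  client_id: 42,
  items: [                          // nested array is unpacked into a child table
    { sku: "A-100", qty: 2 },
    { sku: "B-200", qty: 1 }
  ]
};

// Resulting Redshift tables (conceptually):
//   sales:        _id | client_id | ...
//   sales__items: _sdc_source_key__id (FK to sales._id) | sku | qty | ...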

Security

Authentication to the hosted Looker account must be provided on the fly. The solution was a web service endpoint that returns the URL of the embedded Looker dashboard, customized for the user already logged into the SaaS application. The user's security context must be adequate and correct according to the user's roles and permissions (authorization), and the Looker session must have the same duration as the SaaS session.

 

Data Access Restriction in Multi-Tenant Environment

Looker allows specifying additional required user attributes per user session. The sample LookML snippet below shows a left outer join of table sales__items with table sales (the join condition is sales__items._sdc_source_key__id = sales._id, a parent-child relationship introduced automatically by StitchData when unpacking the nested array items in the sales document).
Secondly, the access_filter parameter enforces an additional filter on sales.client_id to match the user attribute client_id. This provides the security context for a user in the multi-tenant SQL table.

explore: sales__items {
  join: sales {
    type: left_outer
    relationship: many_to_one
    sql_on: ${sales__items._sdc_source_key__id} = ${sales._id} ;;
  }
  access_filter: {
    field: sales.client_id
    user_attribute: client_id
  }
}

 

Building the URL for the iFrame

Step-by-step instructions for the embedded URL are at https://docs.looker.com/reference/embedding/sso-embed.
A few settings need to be changed in the Admin console. Code needs to be written to produce a URL like the following:

https://analytics.mycompany.com/login/embed//embed/dashboards/1?
nonce="22b1ee700ef3dc2f500fb7"&
time=1407876784&
session_length=86400&
external_user_id="user-4"&
permissions=["access_data","see_user_dashboards","see_looks"]&
models=["model_one","model_two"]&
group_ids=[4,3]&
external_group_id="Allegra K"&
user_attributes={"vendor_id":17,"company":"acme"}&
access_filters={"model_one":{"vendor.id":17,"company.name":"acme"},
 "model_two":{"vendor.id":12,"company.name":"widgets-r-us"}}&
first_name="Alice"&
last_name="Jones"&
user_timezone="US/Pacific"&
force_logout_login=true&
signature=123456789ABCDEFGHIJKL
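As a rough sketch, a backend endpoint could assemble and sign such a URL as shown below in Node.js. This assumes the HMAC-SHA1 signing scheme from Looker's SSO embed reference implementation; the exact field order of the string-to-sign, the hash algorithm, and names such as LOOKER_EMBED_SECRET or the user object are assumptions and must be verified against the docs linked above.

// Sketch of a signed-embed-URL builder (verify details against the Looker SSO embed docs).
const crypto = require('crypto');

function signedEmbedUrl(user) {
  const host = 'analytics.mycompany.com';
  const embedPath = '/login/embed/' + encodeURIComponent('/embed/dashboards/1');
  const params = {
    nonce: JSON.stringify(crypto.randomBytes(16).toString('hex')),
    time: JSON.stringify(Math.floor(Date.now() / 1000)),
    session_length: JSON.stringify(86400),                 // match the SaaS session duration
    external_user_id: JSON.stringify(user.id),
    permissions: JSON.stringify(['access_data', 'see_user_dashboards', 'see_looks']),
    models: JSON.stringify(['model_one']),
    group_ids: JSON.stringify([]),
    external_group_id: JSON.stringify(user.clientName),
    user_attributes: JSON.stringify({ client_id: user.clientId }),  // feeds the access_filter
    access_filters: JSON.stringify({}),
    first_name: JSON.stringify(user.firstName),
    last_name: JSON.stringify(user.lastName),
    force_logout_login: JSON.stringify(true),
  };

  // String to sign: host, embed path, then the signed params in the documented order.
  const stringToSign = [
    host, embedPath,
    params.nonce, params.time, params.session_length, params.external_user_id,
    params.permissions, params.models, params.group_ids, params.external_group_id,
    params.user_attributes, params.access_filters,
  ].join('\n');

  // The embed secret is generated in the Looker Admin console and kept server-side.
  params.signature = crypto
    .createHmac('sha1', process.env.LOOKER_EMBED_SECRET)
    .update(stringToSign)
    .digest('base64');

  const query = Object.entries(params)
    .map(([k, v]) => `${k}=${encodeURIComponent(v)}`)
    .join('&');

  return `https://${host}${embedPath}?${query}`;
}

The SaaS backend exposes this behind an authenticated endpoint and hands the resulting URL to the frontend, which assigns it to the iFrame's src.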

[Screenshot: example of a Looker dashboard that can be embedded as an iFrame into a SaaS application.]

 

Summary

Looker was used to deliver an embedded analytics and reporting solution for a large enterprise multi-tenant SaaS platform. A cloud analytics data pipeline was set up from MongoDB to AWS Redshift using StitchData, without a line of code. Reports and dashboards were created in Looker on top of the AWS Redshift data warehouse. The URL of the top-level analytics dashboard, assigned to an iFrame of the SaaS application, was created by a backend API to fit the security context of the client company and the authorization level of the user.

Powered by Looker delivers a white labeled solution with all the conveniences of a hosted service, like automatic upgrades, along with the powerful features of Looker itself, such as report scheduling, data distribution, API integration, and access restriction.

Lastly, from an operational point of view it is very convenient that new reports can be added to the reporting layer independently of the release schedule of the rest of the SaaS platform.

 

 


MongoDB 3.6 New Features

1. MongoDB Change Streams

Change streams benefit architectures with dependent business systems by informing downstream systems once data changes are durable. For example, change streams can save developers time when implementing Extract, Transform, and Load (ETL) services, cross-platform synchronization, collaboration functionality, and notification services: any real-time, event-driven data integration.

    # Open a change stream on the collection and block until the next change arrives (PyMongo)
    changeStream = db.inventory.watch()
    document = next(changeStream)


The five change operation types are INSERT | DELETE | REPLACE | UPDATE | INVALIDATE.

Change Streams

  1. Only require read access to the subscribed collection, on any node
  2. Come with a defined API
  3. Keep changes ordered across shards – the client receives operations in the correct order
  4. Are durable (survive failures)
  5. Are resumable via a resumeToken – pick up where you left off on the client side,
     but the client must persist the resumeToken (see the sketch after the example below),
     e.g. in PyMongo: changeStream = db.inventory.watch(resume_after=resumeToken)
  6. Offer the power of aggregation to target specific operation types
  7. Use the documentKey for sharding; uniqueness is enforced only at the shard level

Example

var coll = client.db('demo').collection('peeps');

// Open a change stream via the aggregation pipeline (the driver also exposes this as coll.watch()):
// include the full updated document, keep only updates and replaces, and drop the 'secret' field.
var cursor = coll.aggregate([
   { $changeStream: { fullDocument: 'updateLookup' } },
   { $match: { operationType: { $in: ['update', 'replace'] } } },
   { $project: { secret: 0 } }
]);
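To make point 5 above concrete, here is a minimal sketch of resuming a change stream after a restart with the Node.js driver; loadToken, saveToken, and handleChange are hypothetical application helpers.

// Resume a change stream from a previously persisted resume token (Node.js driver).
var savedToken = loadToken();              // hypothetical: read the last token from durable storage

var changeStream = coll.watch(
  [ { $match: { operationType: { $in: ['update', 'replace'] } } } ],
  savedToken ? { resumeAfter: savedToken } : {}
);

changeStream.on('change', function (change) {
  handleChange(change);                    // hypothetical downstream processing
  saveToken(change._id);                   // change._id is the resume token; persist it durably
});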

 

2. MongoDB Stitch

MongoDB Stitch is a backend as a service (BaaS) with declarative access control and offers ‘composability’ of data integration to popular third party systems. Availability is only on MongoDB Atlas so far. Stitch is an exciting new feature which will put Mongo back on the map to compete with other easy-to-use BaaS systems such as Firebase or Apigee.

Stitch allows building REST APIs and microservices declaratively. SDKs exist for JavaScript, Swift, and Java. The control layer lets you define which fields a user can read or write with simple JSON rules. Data privacy and compliance can be implemented this way, for read-through data access as well as for data aggregations.
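As an illustration of the idea (the shape of a rule only, not the exact Stitch syntax, with made-up field names), such a declarative rule ties document and field access to the calling user:

// Illustrative only -- conceptual shape of a declarative access rule, not actual Stitch syntax.
var usersCollectionRule = {
  read:  { owner_id: "<id of the calling user>" },   // document-level read filter
  write: { owner_id: "<id of the calling user>" },   // document-level write filter
  fields: {
    email:  { read: true,  write: false },           // field-level permissions
    secret: { read: false, write: false }
  }
};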

Built-in integrations make it simple for your app to use leading third party services and auth APIs, e.g. AWS, Facebook, Google, Twilio, and others. Your Stitch microservices can thus be defined with external integrations included. Think of data orchestration, or of safely providing a new service endpoint for new applications.

The Stitch execution cycle consists of:

  1. A request from the client is received
  2. The request is parsed and Stitch applies security rules
  3. Stitch orchestrates the database access and services
  4. Stitch aggregates data and applies rules
  5. The client receives the result

An API defined in Stitch contains the calling context, such as client IP address or user authentication, which can be used in the audit log for security.

Pricing is based on downloaded data size. Up to 25 GB per month is free; 100 GB per month is $75.

 

3. Aggregations in MongoDB

Mongo has a powerful aggregation pipeline that can take on any analytics task.

Input stream { } { } { }  --->  Result { } { }

Some of the better-known aggregation operators are $match, $unwind, $group, $sort, $skip, $limit, and $project.

https://docs.mongodb.com/manual/reference/operator/aggregation/ documents over 120 aggregation operators. According to Asya Kamsky, they can be categorized by their action into the following seven categories:

  1. Group and Transform ($group, $count)
  2. Aliases ($bucket)
  3. Special Input/Output ($facet, $collStats)
  4. Re-order ($sort)
  5. Transform ($project, $lookup)
  6. Decrease ($skip, $limit)
  7. Increase ($unwind)

The aggregation pipeline is a toolbox with tools for everything: operations on arrays, nested objects, string and date manipulation, logic, math, and statistics. There is a considerable learning curve, but the MongoDB University courses get you started for free.

Data aggregations and transformations (ETL) are absolutely possible in Mongo. Even switching to the more popular ELT approach, where data is transformed in place after it is loaded, is conceivable.

However, analytics in Mongo is not as widespread as data warehousing in SQL. I suspect the reasons are the extra verbosity of Mongo's JavaScript syntax (too many curly braces), the learning curve, and the fact that MongoDB is only one of many document databases, which are traditionally good for semi-structured or unstructured data, whereas fully structured data is a perfect fit for a SQL data warehouse.

Example

db.books.aggregate([
   { $unwind: "$subjects" },
   { $match: { "language": "English", "subjects": /^[ABC]/ } },
   { $group: { _id: "$subjects", count: { $sum: 1 } } },
   { $sort: { count: -1 } },
   { $limit: 3 }
])

Analytic question answered: "What are the top 3 subjects beginning with A, B, or C among English-language books?"

 

4. MongoDB Compass

Finally, MongoDB provides a GUI for its database to visually explore data, run ad hoc queries, edit data with full CRUD support, tune query performance, etc. It is available on Mac, Linux, and Windows and has an extensible architecture for plugins. There is a free community edition and a commercial version.

 

5. MongoDB Spark connector

Spark, the successor to Hadoop MapReduce, provides a framework for massively parallel processing. MongoDB now offers a Spark connector which allows heavyweight analytics directly on data from MongoDB. Spark is known for its parallelism, stream processing, caching layer, performance, interactive shell, APIs in Java, Scala, Python, and R, and its wide adoption.

While MongoDB comes with a powerful aggregation framework, the bridge to Spark makes it possible to do much more with the data, faster. The analytics pipeline with the Spark connector could look like this:

                    Mongo —> Spark —> Mongo (result)

 

For instance, Spark comes with the machine learning library MLlib.

MLlib has distributed algorithms for the classic ML problems of classification, regression, decision trees, recommenders, clustering, linear algebra, and many more.

The MongoDB Spark connector thus opens the road from your MongoDB data to many powerful ML algorithms.

 

 

 


Data Pipelines in the Cloud

This post is a musing on data pipeline tools. A high-level discussion is followed by a look at three commercial products: StitchData, Alooma, and FiveTran. The list is incomplete and only represents the opinion of the author.

A data pipeline is software that integrates diverse internal and external data sources for the purpose of consolidation into a single source of truth, sometimes a data warehouse. That database is typically used for reporting and analytics, or it powers a product feature. Building a data pipeline is a big commitment, even with great open source frameworks widely available and a team of enthusiastic backend developers. Once you have built a scalable, fault-tolerant architecture that never loses a byte, you need to keep up with API changes of existing integrations as well as add integrations for new APIs as the business evolves and grows.

Some APIs are complex, e.g. SAP or NetSuite, and require significant expertise. Keeping the data pipeline alive and healthy usually requires a dedicated team. Their work encompasses keeping up with mandatory API changes and adding new integrations, which can mean weeks of development work.

Accountability

What to do when things go awry? Implementing the happy path, even with full scalability and fault tolerance, is not enough. There must be a clear contingency plan for failed transactions. Keeping a backup of the raw data is a good start, but the order of arrival might also be business critical. A frequent cause of failure is data formats changing over time.

Error Logging and Notification

When the inevitable failure happens, how fast do you hear about it? Logging design and alert notification at all levels are essential. There needs to be an audit trail with enough information about the type of failure. The minimum form of notification is email, although with today's flood of email this may be ineffective. Failures can be short-lived and self-healing, or they can require intervention such as resetting a credential or updating a payment method.

Speed and Scalability

Generally the faster the better, to provide more timely information to the business. The tolerable lag time of data collection is a business decision. A lag of 100 milliseconds will be a lot more expensive than 5 minutes or one hour. A one-day lag may be acceptable for daily reporting. Scalability should be high enough to handle bursts but also elastic enough to stay economical. Hosted cloud services like Amazon Kinesis are a good fit here.

Flexibility

External APIs evolve over time: version upgrades, new data types, and old APIs being phased out are just a few of the things that will break your integrations. It is a chore to stay on top of all the API changes you depend on. Besides commercial integration options for popular SaaS systems, it is good advice to look for a programmatic integration option (API, language-specific connectors, webhooks).

Summary

Building a data pipeline is a complicated project and expensive to do right. The data pipeline often plays a supporting role to a business and is not core to revenue generation itself. Therefore, it is hard to raise funds and support for it. A data pipeline has high visibility when it is broken and next to no visibility when it works. A fully working data integration is taken for granted and is necessary for the smooth operation of other parts of the business, even though external integrations are a collaboration with outside partners and are never 100% predictable.

 

SaaS Data Pipelines (Data Normalization Tools)

This section presents the following three commercial solutions, with a detailed feature matrix at the end.

StitchData

StitchData is the least expensive solution of the three, yet robust. It is written in Python and uses the open source Singer framework for ETL, which means you can build your own connector. Singer adds credibility and transparency to Stitch.
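To illustrate what a custom connector produces, a Singer tap is just a program that writes SCHEMA, RECORD, and STATE messages as JSON lines to stdout. Taps are usually written in Python with the singer library; the sketch below uses plain Node.js only to keep the examples in this post in one language, with a made-up "users" stream.

// Minimal sketch of the Singer message format: JSON messages, one per line, on stdout.
var emit = function (msg) { process.stdout.write(JSON.stringify(msg) + '\n'); };

emit({ type: 'SCHEMA', stream: 'users',
       schema: { properties: { id: { type: 'integer' }, email: { type: 'string' } } },
       key_properties: ['id'] });

emit({ type: 'RECORD', stream: 'users', record: { id: 1, email: 'alice@example.com' } });

emit({ type: 'STATE', value: { users: { last_synced_id: 1 } } });  // bookmark for incremental runs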

My personal experience with Stitch was very positive. Through their online help chat, their experts guided me through my questions in no time. Our noSQL data in Mongo, with deeply nested child objects in JSON, was 'relationalized' into third normal form automatically and correctly, with synthetic parent-child keys injected. Exact error messages are not visible in the UI, at least for Mongo connectors.
The UI is not as spectacular as Alooma's, but gives an authoritative, relaxed view of all data replication statuses.


Alooma

Alooma is a good solution for streaming data from different sources into the data warehouse in realtime. The Alooma pipeline has a 'Code Engine' and a 'Mapper' module for writing ETL and database mapping code in Python on the fly. The UI is very clean. The Re-stream Queue gives full control over failed data and makes it easy to fix transformations and mappings on the fly and retry. Alooma offers the ultimate flexibility in realtime mapping and ETL. Its dashboard resembles a river of data flowing from left to right. Alooma is the better solution for systems with quickly changing data formats that would otherwise need a lot of manual intervention and coding. Pricing is higher and not transparent.


FiveTran

Very easy to set up, with automated pull replication from SaaS sources. FiveTran has the least configuration of the three, deciding what to replicate and how to name it for you. It detects deleted records, if the source API allows it, and appends a boolean column marking deleted rows. They pride themselves on being the only zero-maintenance solution. The downside of this out-of-the-box approach is the loss of fine-grained control. Nested objects in noSQL documents are extracted to a JSON string, which is not native to SQL databases. FiveTran reportedly has great support, with the most startup feel. Pricing is negotiable and not transparent.

Feature matrix (Dec. 2017):

| Feature | StitchData | Alooma | FiveTran |
|---|---|---|---|
| Website | https://www.stitchdata.com | https://www.alooma.com/platform | https://www.fivetran.com |
| Input connectors | 68 SaaS connectors (advertising, analytics, automation, business, CRM, databases, email, payments, support) | 68 SaaS connectors, databases | 62 data input connectors, e.g. NetSuite, Salesforce, Quickbooks, Stripe, … |
| Output connectors (SQL DWH) | Redshift, Google BigQuery, Panoply, PostgreSQL, Snowflake | Redshift, Google BigQuery, Snowflake, MySQL, MemSQL, S3 | Redshift, Google BigQuery, Snowflake, MySQL, SQL Server, Azure, Panoply |
| Replication frequency | 1 min to 24 h | continuous | 15 min to 24 h |
| Custom mapping | no | yes | no |
| Granularity of replication | tables, fields, collections, endpoints | tables, files, fields | everything by default, tables |
| Detection of deleted records | no | no | yes |
| Extensible for custom connectors | yes, own connectors feasible | custom webhook, REST API, file dump | webhook callback |
| Connector API | Stitch API | Alooma API | none |
| Monitoring & alerts | email | info, warnings, and errors as they happen or in digests | email |
| Dashboard | clean web UI | very visual UI | minimal UI |
| Latency | 5 min and up | immediate | 15+ min |
| High availability | yes | yes | yes |
| Uptime monitor | ? | ? | ? |
| Data backup | complete historical data | S3 | 24 hours |
| noSQL handling | transforms nested JSON documents into relational schemas automatically | tails the MongoDB oplog | saves nested JSON objects into a string column |
| Security | SOC 2 | SOC 2, HIPAA, EU-US Privacy Shield | SOC 2, HIPAA |
| Trial | 2 weeks | 2 weeks | unknown |
| Free tier | 5 M rows / month | none | none |
| Pricing | 100 million rows for $500; very transparent | negotiated, starting at min. $1000 / month | negotiated |
| Live support | built-in chat | Slack channel | built-in chat |

 

 


The Looker 5 Analytics Platform

Looker is the next-generation data platform that delivers analytics and business insights to individual users, departments, applications and the enterprise as a whole. Users can integrate the insights to support data-driven decision making.

The five-year-old startup from Santa Cruz, California, recently presented its new release 5 at its user conference JOIN 2017 at the Palace of Fine Arts Theater in San Francisco, attended by 700 business intelligence enthusiasts (including myself).

A review of Looker v5 is at https://www.cmswire.com/analytics/looker-unveils-an-amped-up-looker-5-at-join-2017. A few of the standout features are:

  • Action Hub: one-click actions attached to data points, configured as custom actions such as Dropbox or Marketo integrations that fire when something happens.
  • Blocks: pre-built visualization and data blocks offer ready-to-use public data such as weather and demographics to join with your own data. Many blocks are available for common use cases like web analytics.
  • Applications: entire business workflows for marketing, event stream analytics, IT operations, and more to come.
  • 57 new statistical functions, more security features, and new landing page options, among other things.

October 5, 2017 was the official release day of Looker 5; see https://looker.com/product/looker-5.


JOIN 2017 Conference – Vendor Notes

On Sep 13-15, 2017, the beautiful Palace of Fine Arts in San Francisco hosted the very successful JOIN 2017 conference by Looker. These are my notes on vendors.

AWS

Sponsored the free coffee bar. Thank you very much!

Google Cloud

 

Snowflake

Novel data warehouse on S3 and EC2 with higher scalability for the masses.

Segment

 

BigSquid

Predictive analytics algorithms off the shelf on your data.

FiveTran

ETL with many connectors. Not good with noSQL.

Matillion

Data governance layer on top of many sources

Unifi

 

panoply.io

 

bytecode.io

 

das42

Boutique consultancy doing only Looker.

denodo

 

databricks

 

clarity insights

 

MemSQL

All-in-memory, MySQL-compatible SQL database.

HEAP

 

Intricity

 

Productops

 

Qubole

 

Stitch

Low-cost data pipeline tool written in Python. SaaS model. The most recent version allows injection of custom code.

 

 
