AWS Aurora Serverless

AWS now offers Aurora Serverless [1], an on-demand, auto-scaling configuration of its MySQL-compatible Aurora database that scales up or down automatically based on load. With per-second billing it is very cost-effective and ideal for intermittent, unpredictable workloads.
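For a sense of how little setup is involved, here is a minimal sketch of provisioning a serverless Aurora cluster with boto3. The cluster identifier, credentials, and capacity values are illustrative assumptions, not details from the announcement.

import boto3

rds = boto3.client("rds")

# EngineMode="serverless" plus a ScalingConfiguration is what makes the
# cluster scale (and pause) automatically; all values below are examples.
rds.create_db_cluster(
    DBClusterIdentifier="demo-serverless-cluster",
    Engine="aurora",                      # MySQL-compatible Aurora
    EngineMode="serverless",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    ScalingConfiguration={
        "MinCapacity": 2,                 # Aurora capacity units (ACUs)
        "MaxCapacity": 16,
        "AutoPause": True,                # pause when idle to stop paying for compute
        "SecondsUntilAutoPause": 300,
    },
)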

The significance of this cannot be overstated. Scaling a database used to be a lengthy project of many months, if not years, carried out by highly technical introverts, also known as DBAs (database administrators). Amazon just automated away this tedious, expensive task, saving customers money while improving operational readiness. Although AWS is not the first to do this – the cloud data warehouse Snowflake [2] comes to mind – it leaves little reason not to use the AWS RDS service for everyday database needs in this day and age.


Choosing between DynamoDB vs MongoDB

This post compares DynamoDB vs. MongoDB, the two most popular noSQL databases.

NoSQL databases are developer friendly because data is stored in JSON-like documents that easily map to objects of a programming language. The documents (the equivalent of rows in SQL) can each have a different structure because no upfront schema definition is required; the data schema may change with each release. This is good news for developer agility. Queries are also fast because data is denormalized. However, noSQL can become a nightmare if you have to organize that data for something downstream like reporting or ERP integrations. In the realm of narrowly defined problems, data formats with a schema dominate: XML, REST, SQL.
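As a small illustration of that schema flexibility, the following sketch (using pymongo; the collection and field names are made up for this example) stores two differently shaped documents in the same collection:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client.demo.users

# Two documents with different structure in the same collection --
# no schema has to be declared or migrated first.
users.insert_one({"name": "Alice", "email": "alice@example.com"})
users.insert_one({"name": "Bob", "addresses": [{"city": "Berlin"}, {"city": "Boston"}]})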

So let’s assume we have our mind set on a noSQL solution for whatever reason. Then MongoDB and DynamoDB are the top contenders. There are other reviews on the internet [1], [2], but I found them to be biased depending on who wrote them.

Below is a table that compares both systems feature by feature.

DynamoDB vs. MongoDB

| Feature | DynamoDB | MongoDB |
|---|---|---|
| Description | Fully managed noSQL database in the AWS cloud | Most popular noSQL document database |
| Database model | Both document and key-value store | Document store |
| Website | https://aws.amazon.com/dynamodb | https://www.mongodb.com |
| Developer | Amazon, since 2012 | MongoDB Inc., since 2009 |
| License | Commercial | Open Source |
| Cloud hosted | yes | MongoDB Atlas |
| Implementation language | Java | C++ |
| Server OS | hosted | Linux, OS X, Solaris, Windows |
| Data scheme (model enforcement) | no | schema validation optional |
| Data types | String, Boolean, Byte, Date, Calendar, Long, Integer, Double, Float, BigDecimal, BigInteger | String, Boolean, Date, Timestamp, Integer, Double, Decimal, Object, Regular Expression, JavaScript, ObjectId, Undefined, Null, Binary data |
| Data size limit | 400 KB per item | 16 MB per document |
| Querying | by key and partial key | by key, ranges, faceted search, graph traversals, geospatial, aggregations |
| Index | hash or range-hash primary, plus 5 local (same partition) and 5 global indexes (eventual consistency) | primary and on any field, always strongly consistent; indexing strategies include compound, unique, array, partial, TTL, geospatial, sparse, hash, and text |
| Secondary index | yes | yes |
| API access | AWS DynamoDB REST API | REST API since 4.0 |
| Supported programming languages | .Net, ColdFusion, Erlang, Groovy, Java, JavaScript, Perl, PHP, Python, Ruby | ActionScript, C, C#, C++, Clojure, ColdFusion, D, Dart, Delphi, Erlang, Go, Groovy, Haskell, Java, JavaScript, Lisp, Lua, MatLab, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Scala, Smalltalk |
| Server-side scripts | no | JavaScript |
| Triggers | yes, via AWS Lambda | no, but Change Streams (since 3.6) can be used to build them |
| Partitioning | sharding | sharding |
| Map Reduce | tight integration with AWS EMR | built in |
| Consistency | eventual (default) and immediate | eventual and immediate (default) |
| Transactions | no | yes, since 4.0 |
| Concurrency | yes | yes |
| In-memory capabilities | DAX | in-memory storage engine since 3.2 |
| User access | users, groups and roles via AWS IAM | access rights for users and roles |
| Monitoring | AWS CloudWatch | MongoDB Atlas tracks 100+ metrics |
| Backup / DR options | on demand and continuous | continuous, queryable backups with point-in-time recovery (Atlas) |
| Throughput limits | 10k ops/sec, more available on request | limited by hardware |
| Pricing | based on throughput (read, write) and index size | based on memory and CPU |
| Free tier | 250 million requests/month, 25 GB storage | 512 MB storage (Atlas) |
| Publicly traded | AMZN | MDB |
| Prominent users | Amazon, Samsung, Snapchat, Netflix, New York Times, AdRoll, HTC, Dropcam, Twitch | Google, UPS, Facebook, Cisco, eBay, Bosch, Adobe, SAP, Forbes |

So which DB is right for you?

It depends, of course, on your situation. There is a good discussion on Stack Overflow [3], but things constantly evolve and some long-held opinions are no longer true.

Amazon DynamoDB

Amazon DynamoDB is a fully managed cloud NoSQL database service. You create your tables and set your throughput; after that, the heavy lifting is up to Amazon (managing the distributed database cluster, installing patches, configuration, partitioning, scaling, etc.).
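To illustrate how little there is to manage, here is a minimal boto3 sketch of the "create a table, set the throughput" step; the table name, key, and capacity numbers are illustrative assumptions:

import boto3

dynamodb = boto3.client("dynamodb")

# Declare the table and its throughput -- everything else (cluster
# management, partitioning, patching) is handled by AWS.
dynamodb.create_table(
    TableName="orders",
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)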

DynamoDB supports an event-driven architecture by integrating with AWS Lambda. The advantage here is (asynchronous) extensibility and integration with other systems. Fine-grained user access and control is achieved with AWS Identity and Access Management (IAM).
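A hedged sketch of that event-driven pattern: a Lambda function subscribed to the table's stream reacts to new items. The function body and the table wiring are assumptions for illustration; the event shape is the standard DynamoDB Streams record format.

def handler(event, context):
    # Invoked by AWS Lambda with a batch of DynamoDB Streams records.
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            new_item = record["dynamodb"]["NewImage"]   # DynamoDB-typed JSON
            # e.g. push the new item to a downstream system here
            print("new item:", new_item)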

DynamoDB regularly ranks in Gartner’s Magic Quadrant. With the recent DAX (Amazon DynamoDB Accelerator) feature, query times below one millisecond are only a one-click configuration away. This is a 10x performance improvement that opens up entirely new possibilities where such hyper-speed is of the essence.

MongoDB

MongoDB is very popular with developers. It provides the ‘M’ in the so-called MEAN stack (Mongo, Express, Angular, NodeJS). One developer can rapidly build an entire application with MEAN (or React, VueJS, and similar JS frameworks). However, over time people started complaining that MongoDB was missing features essential to enterprise-grade applications, for instance:
– no support for multi-document transactions
– no support for money type
– no support for schema validation (constraints)
– single write node (scalability bottleneck)

Over the last few years MongoDB addressed these concerns one by one. With 4.0 there is multi-document transaction support, which puts it on par with SQL databases. Earlier versions saw the introduction of schema validation and a Decimal128 data type to model money. The hosted service MongoDB Atlas takes care of the management aspect, so it is not much behind DynamoDB in this respect.
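For a flavor of the 4.0 transaction API, here is a minimal pymongo sketch; the connection string, database, and collection names are assumptions, and transactions require a replica set.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.shop

# Both writes commit or abort together (MongoDB 4.0+, pymongo 3.7+).
with client.start_session() as session:
    with session.start_transaction():
        db.orders.insert_one({"sku": "abc", "qty": 1}, session=session)
        db.inventory.update_one({"sku": "abc"}, {"$inc": {"qty": -1}}, session=session)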

Mongo has a myriad of advanced features, data types, language support, and indexing strategies that DynamoDB doesn’t. Its aggregation pipeline alone has over 100 operators for just about any use case imaginable. Mongo University offers free online courses.

Conclusion

Use DynamoDB if you have a robust use case with a simple data model, and if you care more about operational stability than sophistication. This is also a consideration if you are short on skills or staff.

Use MongoDB if you build a product with requirements that change rapidly and in unpredictable ways. MongoDB has superior features (16 MB record size, nested arrays and object graphs, various indexing strategies, many different ways to query and aggregate) and fewer limitations, so it can be used for more advanced projects. The quibbles from the past are no longer justified; with version 4.0 MongoDB has become an enterprise-grade database, like Oracle back in the day. And with the hosted Atlas solution you have an alternative to DynamoDB, Azure CosmosDB, or Google Bigtable, so you can focus more on the implementation.

 

References

[1] Comparing DynamoDB and MongoDB (MongoDB)

[2] Dynamo vs MongoDB, Panoply

[3] DynamoDB vs MongoDB NoSQL (Stack Overflow)


Embedded Analytics Dashboard

“Powered by Looker” is the term coined by Looker for embedding entire analytics dashboards into your proprietary software, i.e. white-labeling Looker as your own solution.

A lot of SaaS systems, especially enterprise-grade ones, need a reporting layer. Analytics and reporting is an important, if not the most essential, aspect of a business system, and few people would argue that building your own reporting layer is better than buying a best-of-breed one. Having recently been added to the Gartner Magic Quadrant for Analytics and BI Platforms, Looker is a perfect solution for white-labeling reporting into your product. This is the story of an embedded analytics dashboard built with Looker (5 min read).

Requirements

The project was to provide a set of canned user-activity reports together with specific business reports, embedded into a multi-tenant SaaS platform. Security and data privacy are of the utmost concern. Reporting needs to scale to tens of thousands of users and be as close to realtime as possible. The user experience must be seamless, i.e. no hint of an interruption between system boundaries is allowed.

 

Data Pipeline and Data Warehouse

The origin of all business data is a clustered Mongo database self-hosted in AWS. Data is extracted with StitchData, a cloud ETL tool.

StitchData allows you to create instant analytics data pipelines without writing a single line of code, and it has an intuitive realtime dashboard with timely information on the replication status of each pipeline.

AWS Redshift is a fast, scalable, inexpensive data warehouse solution in the Amazon cloud. It is SQL-based and can be used as the persistent datastore for the Looker BI platform. The data warehouse is the consolidating single point of truth where data from various sources is collected, cleaned, normalized, and brought into a usable form for reporting. This data pipeline is not quite realtime, but “near realtime”: a connector in StitchData can be set up to replicate new data as frequently as every 5 minutes (or as rarely as once a day). There are more realtime solutions such as AWS Kinesis and Kinesis Firehose, though they are usually more involved to set up and/or come at a higher price.

An interesting aspect of this setup is that it successfully transports MongoDB data (noSQL) over to AWS Redshift (SQL). StitchData unpacks the nested arrays of a JSON document into child tables in SQL, with a synthetic foreign key for the parent-child relationship injected during replication. It correctly and automatically normalizes noSQL data into SQL data. Other data pipeline providers (which I leave unnamed here) wouldn’t get past reading a nested JSON object out as a string, which means all the follow-up work of parsing it into a useful data structure is on you.
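As an illustration of that unpacking (the values and the child-table layout here are made up for this example; the _sdc_* key naming matches the LookML join shown further below):

# Illustration only: how a nested MongoDB document ends up as parent/child
# rows after replication.
sales_doc = {
    "_id": "s-100",
    "client_id": "acme",
    "items": [
        {"sku": "widget", "qty": 2},
        {"sku": "gadget", "qty": 1},
    ],
}

# Parent table "sales": one row per document
sales_row = {"_id": "s-100", "client_id": "acme"}

# Child table "sales__items": one row per array element, linked back to the
# parent via a synthetic foreign key injected during replication
sales_items_rows = [
    {"_sdc_source_key__id": "s-100", "sku": "widget", "qty": 2},
    {"_sdc_source_key__id": "s-100", "sku": "gadget", "qty": 1},
]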

Security

Authentication to the hosted Looker account must be provided on the fly. The solution was a web service endpoint that returns the URL of the embedded Looker dashboard, customized to the user already logged into the SaaS application. The security context of that URL must match the user’s roles and permissions (authorization), and the Looker session it creates must have the same duration as the user’s SaaS session.

 

Data Access Restriction in Multi-Tenant Environment

Looker allows specifying additional required user attributes per user session. The sample LookML snippet below shows a left outer join of table sales__items with table sales (the join condition is “sales__items._sdc_source_key__id = sales._id”, a parent-child relationship introduced automatically by StitchData when unpacking the nested array items of the sales document).
Secondly, the access_filter attribute enforces an additional filter on sales.client_id to match the user attribute client_id. This provides the security context for a user in the multi-tenant SQL table.

explore: sales__items {
  join: sales {
    type: left_outer
    relationship: many_to_one
    sql_on: ${sales__items._sdc_source_key__id} = ${sales._id} ;;
  }
  access_filter: {
    field: sales.client_id
    user_attribute: client_id
  }
}

 

Building the URL for the iFrame

Step-by-step instructions for the embedded URL are at https://docs.looker.com/reference/embedding/sso-embed.
A few settings need to be changed in the Admin console, and code needs to be written to produce a URL like the following:

https://analytics.mycompany.com/login/embed//embed/dashboards/1?
nonce="22b1ee700ef3dc2f500fb7"&
time=1407876784&
session_length=86400&
external_user_id="user-4"&
permissions=["access_data","see_user_dashboards","see_looks"]&
models=["model_one","model_two"]&
group_ids=[4,3]&
external_group_id="Allegra K"&
user_attributes={"vendor_id":17,"company":"acme"}&
access_filters={"model_one":{"vendor.id":17,"company.name":"acme"},
 "model_two":{"vendor.id":12,"company.name":"widgets-r-us"}}&
first_name="Alice"&
last_name="Jones"&
user_timezone="US/Pacific"&
force_logout_login=true&
signature=123456789ABCDEFGHIJKL
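The backend endpoint that produces such a URL can be sketched roughly as follows. This is an illustrative assumption, not the project’s actual code: the parameter names mirror the example URL above, the embed secret comes from the Looker admin console, and the signature is an HMAC over newline-joined parameter values – consult the Looker SSO embed docs linked above for the exact field list, order, and signing details.

import base64, hashlib, hmac, json, time, urllib.parse
from secrets import token_hex

LOOKER_HOST = "analytics.mycompany.com"        # assumption: your Looker host
EMBED_SECRET = "replace-with-embed-secret"     # assumption: from the Looker admin console

def build_embed_url(user, dashboard_id):
    # Values are JSON-encoded, as in the example URL above
    params = {
        "nonce": json.dumps(token_hex(8)),
        "time": json.dumps(int(time.time())),
        "session_length": json.dumps(86400),
        "external_user_id": json.dumps(str(user["id"])),
        "permissions": json.dumps(["access_data", "see_user_dashboards", "see_looks"]),
        "models": json.dumps(["model_one"]),
        "user_attributes": json.dumps({"client_id": user["client_id"]}),
        "first_name": json.dumps(user["first_name"]),
        "last_name": json.dumps(user["last_name"]),
        "force_logout_login": json.dumps(True),
    }
    embed_path = "/login/embed/" + urllib.parse.quote_plus("/embed/dashboards/%s" % dashboard_id)
    # Signature: HMAC-SHA1 over the host, path, and signed parameter values,
    # newline-joined; see the Looker docs for the authoritative list and order.
    to_sign = "\n".join([LOOKER_HOST, embed_path, params["nonce"], params["time"],
                         params["session_length"], params["external_user_id"],
                         params["permissions"], params["models"], params["user_attributes"]])
    params["signature"] = base64.b64encode(
        hmac.new(EMBED_SECRET.encode(), to_sign.encode(), hashlib.sha1).digest()).decode()
    return "https://%s%s?%s" % (LOOKER_HOST, embed_path, urllib.parse.urlencode(params))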

Example of a Looker dashboard that can be embedded as an iFrame into a SaaS application.

 

Summary

Looker was used to deliver an embedded analytics and reporting solution for a large enterprise multi-tenant SaaS platform. A cloud analytics data pipeline was set up from MongoDB to AWS Redshift using StitchData, without a line of code. Reports and dashboards were created in Looker on top of the AWS Redshift data warehouse. The URL of the top-level analytics dashboard, assigned to an iFrame of the SaaS application, was created by a backend API to fit the security context of the client company and the authorization level of the user.

Powered by Looker delivers a white-labeled solution with all the conveniences of a hosted service, such as automatic upgrades, along with the powerful features of Looker itself such as report scheduling, data distribution, API integration, and access restriction, among others.

Lastly, it is very nice from an operational point of view that new reports can be added to the reporting layer independently of the release schedule of the rest of the SaaS platform.

 

 


MongoDB 3.6 New Features

1. MongoDB Change Streams

Change streams can benefit architectures with dependent business systems by informing downstream systems once data changes are durable. For example, change streams can save developers time when implementing Extract, Transform, and Load (ETL) services, cross-platform synchronization, collaboration functionality, and notification services – any real-time, event-driven data integration.

    # Open a change stream on the inventory collection (PyMongo syntax)
    changeStream = db.inventory.watch()
    # Block until the next change event arrives
    document = next(changeStream)


The five change operation types (the operationType values) are insert | delete | replace | update | invalidate.

Change Streams

  1. Only require read access to the subscribed collection, on any node
  2. Come with a defined API
  3. Changes remain ordered across shards – the client receives operations in the correct order
  4. Durable (survive failures)
  5. Resumable via a resumeToken – pick up where you left off on the client side,
    but the client needs to persist the resumeToken:
    changeStream = db.inventory.watch(resume_after=resumeToken)
  6. Offer the power of aggregation to target specific operation types
  7. Use the documentKey for sharding; uniqueness is enforced only at the shard level

Example

var coll = client.db('demo').collection('peeps');

// Open a change stream via the aggregation pipeline: only updates and
// replacements are delivered, with the full document looked up and the
// 'secret' field stripped out.
var cursor = coll.aggregate([
   { $changeStream: { fullDocument: 'updateLookup' } },
   { $match: { operationType: { $in: ['update', 'replace'] } } },
   { $project: { secret: 0 } }
]);

 

2. MongoDB Stitch

MongoDB Stitch is a backend as a service (BaaS) with declarative access control that offers ‘composability’ of data integrations with popular third-party systems. It is available only on MongoDB Atlas so far. Stitch is an exciting new feature which puts Mongo back on the map to compete with other easy-to-use BaaS systems such as Firebase or Apigee.

Stitch allows building REST APIs and microservices declaratively. SDKs exist for Javascript, Swift, and Java. The control layer lets you define with simple JSON rules which fields a user can read or write. Data privacy and compliance can be implemented this way for read-through data access as well as for data aggregations.

Built-in integrations make it simple for your app to use leading third-party services and auth APIs, e.g. AWS, Facebook, Google, Twilio, and others. Your Stitch microservices can thus be defined with external integrations included. Think of data orchestration, or of safely providing a new service endpoint for new applications.

The Stitch execution cycle consists of:

  1. Request from client is received
  2. Request is parsed, Stitch applies security rules
  3. Stitch orchestrates the database access and services  
  4. Stitch aggregates data and applies rules
  5. Client receives the result

An API defined in Stitch contains the calling context, such as client IP address or user authentication, which can be used in the audit log for security.

Pricing is based on the amount of data downloaded. Up to 25 GB per month is free; 100 GB per month costs $75.

 

3. Aggregations in MongoDB

Mongo has a powerful aggregation pipeline that can take on any analytics task.

Input stream { } { } { } --> Result { } { }

Some of the better-known aggregation operators are $match, $unwind, $group, $sort, $skip, $limit, and $project.

https://docs.mongodb.com/manual/reference/operator/aggregation/ documents over 120 aggregation operators. According to Asya Kamsky they can be categorized by their action into the following seven categories:

  1. Group and Transform ($group, $count)
  2. Aliases ($bucket)
  3. Special Input/Output ($facet, $collStats)
  4. Re-order ($sort)
  5. Transform ($project, $lookup)
  6. Decrease ($skip, $limit)
  7. Increase ($unwind)

The aggregation pipeline is a toolbox full of tools for everything: operations on arrays, nested objects, string and date manipulation, logic, math, and statistics. There is a considerable learning curve, but the MongoDB University courses get you started for free.

Data aggregations and transformations (ETL) are absolutely possible in Mongo. Even switching to the more popular ELT approach, with data transformed in place after it is loaded, is conceivable.

However, analytics in Mongo is not as widespread as data warehousing in SQL. I suspect the reasons are the extra verbosity of Mongo’s Javascript syntax (too many curly braces), the learning curve, and the fact that MongoDB is only one of many document databases, which are traditionally good for semi-structured or unstructured data, whereas fully structured data is a perfect fit for a SQL data warehouse.

Example

db.books.aggregate([
   {$unwind: "$subjects"},
   {$match: {"language": "English", "subjects": /^[ABC]/}},
   {$group: {_id: "$subjects", count: {$sum: 1}}},
   {$sort: {count: -1}},
   {$limit: 3}
])

Analytic question answered: “Find the top 3 subjects beginning with A, B, or C among books in English.”

 

4. MongoDB Compass

Finally, MongoDB provides a GUI for its database to visually explore data, run ad-hoc queries, edit data with full CRUD support, performance-tune queries, etc. It is available on Mac, Linux, and Windows and has an extensible architecture for plugins. There is a free community edition and a commercial version.

 

5. MongoDB Spark connector

Spark, widely seen as the successor to Hadoop MapReduce, provides a framework for massively parallel processing. MongoDB now offers a Spark connector which allows heavyweight analytics to run directly on data from MongoDB. Spark is known for its parallelism, stream processing, caching layer, performance, interactive shell, APIs in Java, Scala, Python, and R, and its wide adoption.

While MongoDB comes with a powerful aggregation framework, the bridge to Spark allows you to do a lot more with the data, faster. The analytics pipeline with the Spark connector could look like this:

                    Mongo —> Spark —> Mongo (result)
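A rough PySpark sketch of that pipeline, assuming the mongo-spark-connector package is on the Spark classpath (e.g. via --packages org.mongodb.spark:mongo-spark-connector); the URIs, collection names, and the aggregation itself are illustrative:

from pyspark.sql import SparkSession

# Read from MongoDB, aggregate in Spark, write the result back to MongoDB.
spark = (SparkSession.builder
         .appName("mongo-analytics")
         .config("spark.mongodb.input.uri", "mongodb://localhost/demo.sales")
         .config("spark.mongodb.output.uri", "mongodb://localhost/demo.sales_summary")
         .getOrCreate())

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()    # Mongo --> Spark
summary = df.groupBy("client_id").count()                               # heavy lifting in Spark
summary.write.format("com.mongodb.spark.sql.DefaultSource") \
       .mode("append").save()                                           # Spark --> Mongo (result)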

 

For instance, Spark comes with the machine learning library MLlib.

MLlib has distributed algorithms for the classic ML problems: classification, regression, decision trees, recommenders, clustering, linear algebra, and many more.
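Continuing the hedged sketch above, a small MLlib example clustering the Mongo-sourced DataFrame; the column names and the choice of k are made up for illustration:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Assemble numeric columns into a feature vector and cluster customers
# into 3 groups.
features = VectorAssembler(inputCols=["order_count", "total_spend"], outputCol="features")
model = KMeans(k=3, featuresCol="features").fit(features.transform(df))
print(model.clusterCenters())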

The MongoDB Spark connector opens the road from your MongoDB data to many powerful ML algorithms.

 

 

 
