Choosing between DynamoDB and MongoDB

This post compares DynamoDB and MongoDB, the two most popular NoSQL databases.

NoSQL databases are developer friendly because data is stored in JSON-like documents that map easily to the objects of a programming language. Each document (the equivalent of a row in SQL) can have a different structure because no upfront schema definition is required, and the data schema may change with each release. This is good news for developer agility. Queries are also fast because data is typically denormalized. However, NoSQL can also become a nightmare if you have to organize that data for something downstream like reporting or ERP integrations. In the realm of narrowly defined problems, schema-based data formats still dominate: XML, REST, SQL.
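For example, here is a minimal sketch in Python of a hypothetical users collection (all names and fields invented): two documents of different shape living side by side.

# Two documents in the same collection with different structures -- no schema
# migration is needed when a later release adds a nested field.
user_v1 = {"_id": 1, "name": "Alice", "email": "alice@example.com"}
user_v2 = {
    "_id": 2,
    "name": "Bob",
    "email": "bob@example.com",
    "addresses": [{"city": "Berlin", "zip": "10115"}],  # added in a later release
}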

So let’s assume we have our mind set on a NoSQL solution for whatever reason. Then MongoDB and DynamoDB are the top contenders. There are other reviews on the internet [1], [2], but I found them to be biased depending on who wrote them.

Below is a table that compares both systems feature by feature.

DynamoDB vs. MongoDB

| | DynamoDB | MongoDB |
|---|---|---|
| Description | Fully managed NoSQL database in the AWS Cloud | Most popular NoSQL document database |
| Database model | Both document and key-value store | Document store |
| Website | https://aws.amazon.com/dynamodb | https://www.mongodb.com |
| Developer | Amazon, since 2012 | MongoDB Inc., since 2009 |
| License | Commercial | Open source |
| Cloud hosted | Yes | MongoDB Atlas |
| Implementation language | Java | C++ |
| Server OS | Hosted | Linux, OS X, Solaris, Windows |
| Data scheme (model enforcement) | No | Schema validation optional |
| Data types | String, Boolean, Byte, Date, Calendar, Long, Integer, Double, Float, BigDecimal, BigInteger | String, Boolean, Date, Timestamp, Integer, Double, Decimal, Object, Regular Expression, JavaScript, ObjectId, Undefined, Null, Binary data |
| Max record size | 400 KB per item | 16 MB per document |
| Querying | By key and partial key | By key, ranges, faceted search, graph traversals, geospatial, aggregations |
| Index | Hash or range-hash primary, plus 5 local (same partition) and 5 global indexes (eventual consistency) | Primary and on any field, always strongly consistent; strategies include compound, unique, array, partial, TTL, geospatial, sparse, hash, and text |
| Secondary index | Yes | Yes |
| API access | AWS DynamoDB REST API | REST API since 4.0 |
| Supported programming languages | .NET, ColdFusion, Erlang, Groovy, Java, JavaScript, Perl, PHP, Python, Ruby | ActionScript, C, C#, C++, Clojure, ColdFusion, D, Dart, Delphi, Erlang, Go, Groovy, Haskell, Java, JavaScript, Lisp, Lua, MatLab, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Scala, Smalltalk |
| Server-side scripts | No | JavaScript |
| Triggers | Yes, via AWS Lambda | No, but change streams (since 3.6) can be used to build them |
| Partitioning | Sharding | Sharding |
| Map-reduce | Tight integration with AWS EMR | Built in |
| Consistency | Eventual (default) and immediate | Eventual and immediate (default) |
| Transactions | No | Yes, since 4.0 |
| Concurrency | Yes | Yes |
| In-memory capabilities | DAX | In-memory storage engine since 3.2 |
| User access | Users, groups, and roles via AWS IAM | Access rights for users and roles |
| Monitoring | AWS CloudWatch | MongoDB Atlas tracks 100+ metrics |
| Backup and DR options | On demand and continuous | Continuous, queryable backups with point-in-time recovery (Atlas) |
| Throughput limits | 10k ops/sec, more available on request | Limited by hardware |
| Pricing | Based on throughput (read, write) and index size | Based on memory and CPU |
| Free tier | 250 million requests/month, 25 GB storage | 512 MB storage |
| Publicly traded | AMZN | MDB |
| Prominent customers | Amazon, Samsung, Snapchat, Netflix, New York Times, AdRoll, HTC, Dropcam, Twitch | Google, UPS, Facebook, Cisco, eBay, Bosch, Adobe, SAP, Forbes |

So which DB is right for you?

It depends, of course, on your situation. There is a good discussion on Stack Overflow [3], but things constantly evolve and some long-held opinions are no longer true.

Amazon DynamoDB

Amazon DynamoDB is a fully managed cloud NoSQL database service. Create your tables, set your throughput, and from then on the heavy lifting is up to Amazon: managing the distributed database cluster, installing patches, handling configuration, partitioning, scaling, and so on.
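As a minimal sketch in Python with boto3 (table name, key, and region are invented for illustration), creating a table and setting its throughput is essentially all the setup there is:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Create a table keyed on order_id with 5 read/write capacity units per second.
# AWS handles partitioning, replication, and patching behind the scenes.
dynamodb.create_table(
    TableName="orders",
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)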

DynamoDB supports an event-driven architecture by integrating with AWS Lambda. The advantage here is (asynchronous) extensibility and integration with other systems. Fine-grained user access and control is achieved with AWS Identity and Access Management (IAM).
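For example, a minimal sketch of a Lambda handler subscribed to a table's DynamoDB stream (the downstream action is hypothetical):

def handler(event, context):
    # Each record describes an INSERT, MODIFY, or REMOVE event on the table.
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"]["NewImage"]
            # Forward the new item to a downstream system, e.g. a search index.
            print("New item:", new_image)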

DynamoDB regularly ranks in Gartner’s Magic Quadrant. With its recent DAX (Amazon DynamoDB Accelerator) feature, query times below one millisecond are only a one-click configuration away. This is a 10x performance improvement that opens up entirely new possibilities where such hyper-speed is of the essence.

MongoDB

MongoDB is very popular with developers. It provides the ‘M’ in the so-called MEAN stack (Mongo, Express, Angular, NodeJS). One developer can build an entire application rapidly with MEAN (or with React, VueJS, and similar JS frameworks). However, over time people started complaining that MongoDB was missing features essential to enterprise-grade applications, for instance:
– no support for multi-document transactions
– no support for money type
– no support for schema validation (constraints)
– single write node (scalability bottleneck)

Over the last few years MongoDB addressed these concerns one by one. With 4.0 there is multi-document transaction support, which puts it on par with SQL databases. Earlier versions introduced schema validation and the Decimal128 data type to model money. The hosted service MongoDB Atlas takes care of the management aspect, so MongoDB is not far behind DynamoDB in this respect.
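A minimal pymongo sketch of both features (the accounts collection is invented; transactions require a replica set): two balance updates commit atomically, and Decimal128 models money without floating-point rounding errors.

from pymongo import MongoClient
from bson.decimal128 import Decimal128

client = MongoClient("mongodb://localhost:27017")  # must be a replica set for transactions
db = client.shop

with client.start_session() as session:
    with session.start_transaction():
        # Debit one account and credit another; both writes commit or neither does.
        db.accounts.update_one(
            {"_id": "alice"},
            {"$inc": {"balance": Decimal128("-10.00")}},
            session=session,
        )
        db.accounts.update_one(
            {"_id": "bob"},
            {"$inc": {"balance": Decimal128("10.00")}},
            session=session,
        )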

Mongo has a myriad of advanced features, data types, language support, and indexing strategies that DynamoDB lacks. Its aggregation pipeline alone has over 100 operators for just about any use case imaginable. Mongo University offers free online courses.
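To give a flavor, a short sketch of an aggregation pipeline (collection and field names invented): total revenue per customer, highest first.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").shop

pipeline = [
    {"$match": {"status": "completed"}},                                  # filter completed orders
    {"$group": {"_id": "$customer_id", "revenue": {"$sum": "$amount"}}},  # aggregate per customer
    {"$sort": {"revenue": -1}},                                           # highest revenue first
]
for row in db.orders.aggregate(pipeline):
    print(row["_id"], row["revenue"])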

Conclusion

Use DynamoDB if you have a robust use case with a simple data model and if you care more about operational stability than sophistication. This is also a consideration if you are short on skills or staff.

Use MongoDB if you build a product with requirements that change rapidly and in unpredictable ways. MongoDB has superior features (16 MB document size, nested arrays and object graphs, various indexing strategies, many different ways to query and aggregate) and fewer limitations. It can be used for more advanced projects. The quibbles from the past are no longer justified. With version 4.0, MongoDB has become an enterprise-grade database like Oracle back in the day. With the hosted Atlas solution you have an alternative to DynamoDB, Azure Cosmos DB, or Google Bigtable, and you can focus more on the implementation.

References

[1] Comparing DynamoDB and MongoDB (MongoDB)

[2] Dynamo vs. MongoDB (Panoply)

[3] DynamoDB vs. MongoDB NoSQL (Stack Overflow)


Embedded Analytics Dashboard

“Powered by Looker” is the term coined by Looker for embedding entire analytics dashboards into your proprietary software, i.e. white-labeling Looker as your own solution.

A lot of SaaS systems, especially enterprise-grade ones, need a reporting layer. Analytics and reporting are an important, if not the most essential, aspect of a business system. Few people would argue that building your own reporting layer is better than buying a best-of-breed one. Having recently been added to the Gartner Magic Quadrant for Analytics and BI Platforms, Looker is a perfect fit for white-labeling reporting into your product. This is the story of an embedded analytics dashboard built with Looker (5 min read).

Requirements

The project was to provide a set of canned user activity reports together with specific business reports, embedded into a multi-tenant SaaS platform. Security and data privacy are of the utmost concern. Reporting needs to scale to tens of thousands of users and be as close to realtime as possible. The user experience must be seamless, i.e. no hint of an interruption between system boundaries is allowed.

Data Pipeline and Data Warehouse

The origin of all business data is a clustered Mongo database self-hosted in AWS. Data is extracted with the cloud ETL tool StitchData.

StitchData allows the creation of instant analytics data pipelines without writing a single line of code. It has an intuitive realtime dashboard with timely information on the replication status of each pipeline.

AWS Redshift is a fast, scalable, inexpensive data warehouse solution in the Amazon cloud. It is SQL based and can be used as the persistent datastore for the Looker BI platform. The data warehouse is the consolidating single point of truth where data from various sources is collected, cleaned, normalized, and brought into a usable form for reporting. This data pipeline is not quite realtime, but “near realtime”: a connector in StitchData can be set up to replicate new data as frequently as every 5 minutes (or as rarely as once a day). There are more realtime solutions such as AWS Kinesis and Kinesis Firehose, though they are usually more involved to set up and/or come at a higher price.

An interesting aspect of this setup is that it successfully transports MongoDB data (NoSQL) over to AWS Redshift (SQL). StitchData unpacks the nested arrays of a JSON document into child tables in SQL, with a synthetic foreign key for the parent-child relationship injected during replication. It correctly and automatically normalizes NoSQL data into SQL data. Other data pipeline providers (which I leave unnamed here) wouldn’t get past reading a nested JSON object out as a string, which means all the follow-up work of parsing it into a useful data structure is on you.
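To illustrate with an invented sales document (all field names except _sdc_source_key__id are made up), the unpacking works roughly like this:

# A MongoDB document with a nested array...
sale = {
    "_id": "s-1001",
    "client_id": 42,
    "items": [
        {"sku": "A7", "qty": 2},
        {"sku": "B3", "qty": 1},
    ],
}

# ...is replicated as a parent and a child table in Redshift:
#
#   sales:         _id      | client_id
#                  "s-1001" | 42
#
#   sales__items:  _sdc_source_key__id | sku  | qty
#                  "s-1001"            | "A7" | 2
#                  "s-1001"            | "B3" | 1
#
# The synthetic foreign key _sdc_source_key__id links each child row back to
# its parent document.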

Security

Authentication to the hosted Looker account must be provided on the fly. The solution was a web service endpoint that provides the URL of the embedded Looker dashboard, customized to the user already logged into the SaaS application. The security context of the user must be adequate and correct according to the user’s roles and permissions (authorization), and the duration of the Looker session must match that of the initial SaaS session.

Data Access Restriction in Multi-Tenant Environment

Looker allows specification of additional required user attributes per user session. The sample LookML snippet below shows a left outer join of table sales__items with table sales (the join condition sales__items._sdc_source_key__id = sales._id reflects the parent-child relationship introduced automatically by StitchData when unpacking the nested array items in the sales document). Secondly, the access_filter attribute enforces an additional filter on sales.client_id to match the user attribute client_id. This provides the security context for a user in the multi-tenant SQL table.

explore: sales__items {
  join: sales {
    type: left_outer
    relationship: many_to_one
    sql_on: ${sales__items._sdc_source_key__id} = ${sales._id} ;;
  }
  access_filter: {
    field: sales.client_id
    user_attribute: client_id
  }
}

 

Building the URL for the iFrame

Step-by-step instructions for the embedded URL are at https://docs.looker.com/reference/embedding/sso-embed.
A few settings need to be changed in the Admin console, and code needs to be written to produce a URL like the following (a sketch of the signing code follows the example):

https://analytics.mycompany.com/login/embed//embed/dashboards/1?
nonce="22b1ee700ef3dc2f500fb7"&
time=1407876784&
session_length=86400&
external_user_id="user-4"&
permissions=["access_data","see_user_dashboards","see_looks"]&
models=["model_one","model_two"]&
group_ids=[4,3]&
external_group_id="Allegra K"&
user_attributes={"vendor_id":17,"company":"acme"}&
access_filters={"model_one":{"vendor.id":17,"company.name":"acme"},
 "model_two":{"vendor.id":12,"company.name":"widgets-r-us"}}&
first_name="Alice"&
last_name="Jones"&
user_timezone="US/Pacific"&
force_logout_login=true&
signature=123456789ABCDEFGHIJKL
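For illustration, here is a rough Python sketch of how such a URL can be generated. The exact list and order of fields in the string to sign is defined in the Looker documentation linked above, so treat this as a sketch to verify against the docs; the secret, host, and parameter values are placeholders.

import base64
import hashlib
import hmac
import json
import secrets
import time
from urllib.parse import quote_plus

EMBED_SECRET = "my-embed-secret"   # from the Looker admin console (placeholder)
HOST = "analytics.mycompany.com"
embed_path = "/login/embed/" + quote_plus("/embed/dashboards/1")

# Parameter values are JSON-encoded, as in the example URL above.
params = {
    "nonce": json.dumps(secrets.token_hex(16)),
    "time": json.dumps(int(time.time())),
    "session_length": json.dumps(86400),
    "external_user_id": json.dumps("user-4"),
    "permissions": json.dumps(["access_data", "see_user_dashboards", "see_looks"]),
    "models": json.dumps(["model_one"]),
    "access_filters": json.dumps({"model_one": {"vendor.id": 17}}),
    "first_name": json.dumps("Alice"),
    "last_name": json.dumps("Jones"),
    "force_logout_login": json.dumps(True),
}

# The signature is a base64-encoded HMAC-SHA1 over a newline-joined list of
# fields (host, path, nonce, time, session_length, external_user_id,
# permissions, models, access_filters -- verify the order against the docs).
string_to_sign = "\n".join([
    HOST, embed_path,
    params["nonce"], params["time"], params["session_length"],
    params["external_user_id"], params["permissions"], params["models"],
    params["access_filters"],
])
signature = base64.b64encode(
    hmac.new(EMBED_SECRET.encode(), string_to_sign.encode(), hashlib.sha1).digest()
).decode()

query = "&".join(f"{k}={quote_plus(v)}" for k, v in params.items())
url = f"https://{HOST}{embed_path}?{query}&signature={quote_plus(signature)}"
print(url)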

[Figure: example of a Looker dashboard that can be embedded as an iFrame into a SaaS application.]

 

Summary

Looker was used to deliver an embedded analytics and reporting solution for a large enterprise multi-tenant SaaS platform. A so-called cloud analytics data pipeline was set up from MongoDB to AWS Redshift using StitchData, without writing a line of code. Reports and dashboards were created in Looker on top of the AWS Redshift data warehouse. The URL of the top-level analytics dashboard, assigned to an iFrame of the SaaS application, was created by a backend API to fit the security context of the client company and the authorization level of the user.

Powered by Looker delivers a white-labeled solution with all the conveniences of a hosted service, like automatic upgrades, along with the powerful features of Looker itself, such as report scheduling, data distribution, API integration, and access restriction.

Lastly, it is very convenient from an operational point of view that new reports can be added to the reporting layer independently of the release schedule of the rest of the SaaS platform.