Sunday, January 8, 2017

Thoughts on Converting from SQL to NoSQL

From our previous posts, you know that while we do support conversion to NoSQL using ODM libraries, we invested significant effort building our own customized approach to managing cross-collection references.

We want to take this opportunity to clarify some important points, so you understand why we developed our own approach and how this approach can help you achieve your goals.

We also explain how you can use the 1Schema Database Converter to help perform this process quickly, easily, and correctly.

The Use-Case for Consideration:

We are focused on the use case where an engineering team is trying to optimize performance and scalability of a website or app by switching to a NoSQL database technology.

In this case, the promise of NoSQL technologies is that by storing larger clusters of information (including nested information), the amount of server processing necessary to render a page view can be minimized.

Ideally, you could render a page view with only 1 read from the database (although this goal is often not possible in practice).
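To make the trade-off concrete, here is a small hypothetical sketch (a blog-post page with made-up collection names, not 1Schema output) contrasting a normalized layout, which needs several reads, with a denormalized document that can render the page in a single read:

```python
# Normalized layout: rendering a post page needs three separate lookups
# (post, author, comments), mirroring SQL joins.
post = {"id": 1, "author_id": 7, "title": "Hello"}
author = {"id": 7, "name": "Ada"}
comments = [{"post_id": 1, "body": "Nice!"}]

# Denormalized layout: one document holds everything the page needs,
# so a single read suffices (at the cost of duplicating author data).
post_doc = {
    "id": 1,
    "title": "Hello",
    "author": {"id": 7, "name": "Ada"},   # copied from the authors collection
    "comments": [{"body": "Nice!"}],      # nested directly in the post
}

def render(doc):
    # Everything needed for the page view comes from this one document.
    return f'{doc["title"]} by {doc["author"]["name"]} ({len(doc["comments"])} comments)'

print(render(post_doc))
```

Note that the author's name now lives in two places; keeping those copies in sync is exactly the problem discussed below.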

The cost of this approach is that more disk space is used to store these data structures, as information is often duplicated in multiple places.

However, since disk space is relatively cheap, it can be extremely beneficial to sacrifice disk space to improve processing time.

If you believe in the benefits of this approach, you should try converting your SQL database using our custom document caching techniques, which solve the problems laid out below.

The Problem with this Use-Case:

Often, when SQL-based engineers start learning about NoSQL technologies, they see the ability to embed (or nest) information within a given document (i.e., the NoSQL equivalent of a row) as a really promising feature.

They come from a world of flat rows, where storing nested information requires creating a separate child table and creating a foreign key constraint between the two tables.

The idea of just adding nested information is often quite appealing.
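As an illustration (with hypothetical table and field names), the same one-to-many relationship looks quite different in the two worlds: SQL requires a child table joined back through a foreign key, while a document store simply nests the children:

```python
import sqlite3

# In SQL, nested data requires a child table plus a foreign key constraint.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE order_items (
        id INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES orders(id),
        sku TEXT
    );
""")
con.execute("INSERT INTO orders (id) VALUES (1)")
con.execute("INSERT INTO order_items (order_id, sku) VALUES (1, 'ABC'), (1, 'DEF')")

# In a document store, the same relationship is simply nested.
order_doc = {"id": 1, "items": [{"sku": "ABC"}, {"sku": "DEF"}]}

# Reassembling the SQL version requires a join; the document needs none.
rows = con.execute(
    "SELECT i.sku FROM orders o JOIN order_items i ON i.order_id = o.id"
).fetchall()
assert [r[0] for r in rows] == [item["sku"] for item in order_doc["items"]]
```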

However, the shine usually wears off when they start to realize that they need to nest the same information in multiple places.

To prevent this confusion, they often rework their NoSQL design into a more normalized structure that avoids duplication.


This lack of clarity really hurts NoSQL adoption, because this is the point where many engineers give up and revert to normalized schemas in traditional SQL databases.

So the question becomes: how do we
  1. Take advantage of NoSQL technologies to reduce server processing and increase performance and scalability, AND
  2. Still know what the heck is going on?

How 1Schema Solves This:

At 1Schema, we built our custom NoSQL converter from the ground-up to help you differentiate between data that is truly embedded within your root document (contained data) and data that you are copying there to improve performance (cached data).

NOTE: For a more detailed explanation of the difference between contained data and cached data, see this post...

All contained data actually exists within its parent document, so you know to go to the parent document to update it. In the 1Schema Converter, you specify a single "Parent Table" if you want your table to be contained within that parent table. The options for "Parent Table" are obtained by examining the Foreign Key Constraints specified for the table in question.
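As a sketch of what containment means for the data itself (hypothetical `users`/`addresses` tables and a hand-written helper, not the converter's actual code), designating `users` as the Parent Table of `addresses` folds each address row into its parent document:

```python
# Hypothetical SQL rows: 'addresses' has a foreign key (user_id) to 'users',
# and we designate 'users' as its Parent Table.
users = [{"id": 1, "name": "Ada"}]
addresses = [
    {"id": 10, "user_id": 1, "city": "London"},
    {"id": 11, "user_id": 1, "city": "Paris"},
]

def contain(parents, children, fk, field):
    """Nest each child row inside its parent document (contained data)."""
    docs = {p["id"]: {**p, field: []} for p in parents}
    for child in children:
        # The foreign key becomes implicit in the nesting, so drop it.
        row = {k: v for k, v in child.items() if k != fk}
        docs[child[fk]][field].append(row)
    return list(docs.values())

user_docs = contain(users, addresses, fk="user_id", field="addresses")
print(user_docs[0])
```

The contained rows now have exactly one home, so updates always go to the parent document.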

Conversely, all cached data is copied locally by our change management code, but actually exists elsewhere. All Foreign Key Constraints other than those used to specify the "Parent Table" can be used to configure caching.

Caching actually occurs when our change management code traverses an ID reference and copies the relevant information locally. Furthermore, the change management code will automatically update this data when the document in the foreign collection changes.
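The following is a minimal sketch of that idea, assuming hypothetical `products` and `orders` collections and a hand-rolled change handler (it is not 1Schema's generated code): the cached copy is stored alongside the ID reference, and a change to the source document is propagated to every cache:

```python
# Source-of-truth collection, keyed by ID.
products = {7: {"id": 7, "name": "Widget", "price": 9.99}}

# Each order keeps the ID reference AND a locally cached copy of the product.
orders = [{"id": 1, "product_id": 7, "product_cache": dict(products[7])}]

def on_product_change(product_id, updates):
    # Update the source of truth, then refresh every cached copy of it.
    products[product_id].update(updates)
    for order in orders:
        if order["product_id"] == product_id:
            order["product_cache"] = dict(products[product_id])

on_product_change(7, {"price": 12.50})
print(orders[0]["product_cache"])
```

Rendering an order page needs only the order document, yet the product data it displays stays current because the change handler rewrites the cache whenever the source changes.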

By using caching, you can accomplish the goal of larger clusters of information that reduce the processing necessary to render a page view, but prevent the confusion that duplication often causes within a data model.

The catch is that this cached data is not guaranteed to always be up-to-date, since it is copied from the place where it is actually stored.

However, our converter provides you with code that automatically updates your cached data, and you can configure this code to run as you desire.

Also, please note that to keep the converter's settings simple, caching is enabled or disabled per table; you cannot choose which individual foreign keys to use. It is all or nothing.

If you need more granularity over which ID references are used for caching, you should sign up for the beta release of our visual Schema Designer.

Concluding Thoughts:

At 1Schema, we do not advocate either normalization or denormalization... instead, we want to give you control of this choice on a case-by-case basis.

In fact, it can actually be quite beneficial to start with a normalized SQL schema and proceed from there.

Then you can intelligently choose when to denormalize specific parts of your data model, depending on the criteria important to you.
