How to select database for your next use case

Being architect in 21st century is a tough job with the rise of so many databases and varied kind of persistent needs. No single database technology can meet every use case or satisfy every latency requirement.

This blog acts as quick reference and provides considerations we have to keep in mind as architects for selecting persistence layer for our next use case.

Usually, first thing is to decide whether you need RDBMS versus NoSql solution. If you end up choosing nosql, next step is to find out which nosql solution fits your business use case.

Relational databases

RDBMS’s have been de facto persistence solutions in last few decades due to their strong durability and consistency features. They support ACID transactions (Atomicity, Isolation, Consistency, Duration) out of box and will continue to be used for cases where ACID features are required.

When to select RDBMS’s:

  • if the query and reporting needs are very critical OR if there is a need in complex queries over all the data set.
  • You’re building a low-volume, medium-complexity suite of applications that will evolve over time.
  • If You need normalization and do not want to duplicate lot of data
  • Not strict latency requirements (0.1 - 5 sec is acceptable?)
  • For financial application with strict data consistency guarantees.
  • If you have strict schema requirements or think you will benefit from normalization.

Criteria above is not hard criteria and lines are diminishing nowadays between RDBMS’es and NoSQL databases. ex. OrientDB is ACID compliant and way faster than MySQL. Also, RDBMS’s were hard to scale historically. It has changed a lot recently with all databases supporting easier to scale mechanisms.

  • MySQL, Maria DB, Sql Server, Oracle, VoltDB, Clustrix, Scalebase, NimbusDB, Megastore

NoSQL Databases

NoSQL databases became popular as RDBMS’s were not able to handle scale and type of data efficiently. As data size grew companies needed to store hundreds or thousands of terabytes (TB) data and relation databases were not meeting the need.

Nosql databases were introduced which handles massive volume of data and varied application needs like High availability, scalability, fault tolerance, operational simplicity. Additionally, No sql databases are usually Schema-less and hence easier to adopt to new requirements. Nosql solutions generally provide 2 of 3 of CAP theorem (Consistency, Availability, Partitioning).

When to select NoSql databases:

  • If application requires the ability to store varying attributes i.e. needs richer data model with schema less design
  • You need to handle massive volume of data
  • Your application has very low-latency requirements.
  • You need really high throughput i.e websites with high volumes of reads and writes
  • When your problem domain naturally fits into nosql solutions such as document databases, graph databases or you have petabytes of data for which you might require wide column databases.

There are multiple type of nosql solutions such as document databases, wide column databases, graph databases, in memory databases. Read further to find out which database best fit your needs.

  • Document Databases: document databases contains collections that contain semi-structured data, such as JSON or XML. ex. MongoDB, REDIS
  • Graph Databases: In graph database, the data model has nodes and edges, each of which may have properties. ex. Neo4J, FlockDB, HyperGraphDB, OrientDB
  • Wide Column Databases: Column family databases usually have an extended key/value data model in which the value data type can also be a sequence of key/value pairs common examples - BigTable, Cassandra
  • In Memory Databases: In memory databases, data model is key/value pair. Common examples are Redis.

NoSQL Document databases

Document databases are probably the most popular of the NoSQL databases because of their flexibility, performance and ease of use. If one can tolerate an “eventually consistent” model with limited atomicity and isolation, a document store is the right choice.

When to select Document databases:

  • If application requires the ability to store varying attributes along with large amounts of data
  • If you you have an incredibly data intensive workload ex. 5-10+ TB of data every year
  • Your application has very low-latency requirements.
  • You need really high throughput i.e websites with high volumes of reads and writes
  • If you can live without relational data.

Common Use cases

These databases are well-suited to a number of use cases, including:

  • Managing data types with variable attributes, such as blog posts and discussion threads, product catalogs, orders and similar entities.
  • Widely used in healthcare domain for storing electronic medical records such as patient records, hospitals records, MRIs, X-Rays etc. A web application may provide search and lookup for such objects based on multiple fields (say, a patient’s name, social security number, or birth date).
  • Many times document stores are implemented as a layer between an application and a relational database to hold the output of certain types of queries. For example, it might be convenient to aggregate information that is typically requested together such as a set of user preferences or name and address information and store it as one object.

SimpleDB, CouchDB, MongoDB, Terrastore, BaseX

  • CouchDB is used by several software and web applications, including many Facebook games and applications, internal use at the BBC, and Ubuntu One etc.
  • MongoDB is similar to CouchDB in that both are designed as document stores for JSON objects and, like Cassandra, is designed for replication and high-availability. MongoDB is widely used by likes of Craigslist, Shutterfly, SourceForge, the New York Times, and GitHub
  • BaseX is designed to store document objects in XML. It supports standard XML tools like XPath and Xquery and also includes a lightweight GUI. BaseX creates indexes, supports W3C recommendations and standards, ACID-safe transactions, large documents, and various APIs like REST/JAX-RX and XML:DB

Graph Databases

Graph databases are designed to store inter-connected data.

When to select Graph databases

Graph databases are well-suited for use cases where domain can be naturally modeled in a graph. ex. The relationships between people in social networks, The relationships between items and attributes in recommendation engines

Common Use cases

Facebook uses Graph Search, Google has the Knowledge Graph, and Twitter uses graphs to recommend people to follow. Thousands of other companies are using graphs for everything from recommendations, logistics, bio-informatics to network management.

  • Social Networks
  • Real-Time Recommendation Engines
  • Network and IT infrastructure management
  • Identity and access management
  • Geo Routing, BioInformatics, Gene Sequencing
  • Fraud Detection and money laundering
  • Master data management, Business process management, Insurance Risk analysis

Neo4J, HyperGraphDB, OrientDB, FlockDB

  • FlockDB is used by Twitter to store social graphs, such as who follows whom, and for some secondary indices. FlockDB is designed for online, low-latency, high throughput environments such as websites like Twitter; even then, it’s only for storing specific types of data.
  • Neo4j is used by likes of ebay, walmart etc. As per ebay, “Our Neo4j solution is literally thousands of times faster than the prior MySQL solution, with queries that require 10-100 times less code. At the same time, Neo4j allowed us to add functionality that was previously not possible”. As per walmart, “With Neo4j, we could substitute a heavy batch process with a simple and real-time graph database”. IRCC(Candiolo Cancer Institute) use MongoDB to store the raw, complex data and rely on Neo4j for all the rest: finding complex relationships, analyzing their experimental procedures, and modeling the genomic domain and complex semantics for genomic knowledge.

Wide column Databases

Wide column stores tend to be very efficient for databases that are mostly used for reading stored data, especially from very large data sets.

When to select column family databases

Column family databases are well-suited for use with:

  • Applications that are geographically distributed over multiple data centers
  • Applications that can tolerate some short-term inconsistency in replicas
  • Applications with dynamic fields
  • Applications with the potential for truly large volumes of data, such as hundreds of terabytes

Common Use cases

Several areas can use this kind of big data processing capability, such as:

  • Security analytics using network traffic and log data mode
  • Big Science, such as bioinformatics using genetic and proteomic data
  • Stock market analysis using trade data
  • Web-scale applications such as search
  • Social network services
  • Google demonstrated the capabilities of Cassandra running the Google Compute Engine. Google engineers deployed:
    • 330 Google Compute Engine virtual machines
    • 300 1 TB Persistent Disk volumes
    • Debian Linux
    • Datastax Cassandra 2.2
    • Data was written to two nodes (Quorum commit of 2)
    • 30 virtual machines to generate 3 billion records of 170 bytes each
    • With this configuration, the Cassandra cluster reached 1 million writes per second, with 95% completing in under 23 milliseconds. When one-third of the nodes were lost, the 1 million writes were sustained, but with higher latency.
  • Hbase is main database used by Hadoop. The main feature of HBase is its ability to host very large tables, on the scale of billions of rows across millions of columns. It is designed to host them on commodity hardware. HBase provides a RESTful web service interface that supports many formats and encodings and is optimized for real-time queries.
  • Google introduced Bigtable to address the needs of its services.
  • Riak is a fault-tolerant, distributed database designed for scalability and use in the cloud. Based on Dynamo paper published by Amazon.
  • Cassandra is a key/value store as well as a wide column store. Facebook developed Cassandra to back its Inbox Search service. Cassandra is being used by Facebook, Twitter, Reddit, and many others.

NoSQL in-memory databases

In memory Key/value stores are the simplest of the NoSQL databases.

Common Use cases

Key-value databases are used in a wide range of applications, such as the following:

  • Key/value stores are great for things like contents of a website shopping cart, user preference lists, a post in a social media site.
  • Most commonly used for high-traffic web sites as server side caching solution that need to make sure that a local user has quick and accurate access to the information, but where the information can take time to replicate to other database nodes or where there is heavy access to the database itself, but where users are not necessarily using the same data concurrently.
  • Also, used for Session management using distributed cache stores like Redis as an alternative to legacy session replication approach.
  • Storing configuration and user data information for mobile applications

Redis, Memcached and MemcachedDB, Scalaris, Tokyo Tyrant, Voldemort, Riak, Cassandra, BerkleyDB.

  • Redis is intended for applications where performance and flexibility are more important than persistence and absolute data integrity
  • Memcached is used by sites like Twitter, Reddit, YouTube, and Facebook, and it is also supported and often used by websites based on content management systems like Drupal and WordPress.

References


# Reference
1 http://neo4j.com/graphacademy/online-course-getting-started/
2 http://www.zdnet.com/article/rdbms-vs-nosql-how-do-you-pick/
3 http://blog.nahurst.com/visual-guide-to-nosql-systems
4 http://www.slideshare.net/maxdemarzi/graph-database-use-cases
5 http://searchdatamanagement.techtarget.com/feature/Key-criteria-for-choosing-different-types-of-NoSQL-databases
6 https://s3.amazonaws.com/daily-builds/RavenDBMythology-11.pdf

Version History


Date Description
2016-08-24 Initial Version