An Overview of NoSQL
Cheng Lei
Department of Electrical and Computer Engineering, University of Victoria, Canada
rexlei86@uvic.ca

I. Introduction

Relational databases dominated the database market, handling data storage and querying for decades, until NoSQL databases arose. The relational database management system (RDBMS) has unique advantages that have made it ubiquitous among corporations and developers. Having run for decades on different operating systems, it offers mature technology for maintenance, querying, and optimization, and there are many developers who have mastered these techniques as well as abundant resources for newcomers to learn them. However, the bottleneck of the RDBMS emerged as the Internet boomed. The Internet population has grown exponentially, and the data-access workload has become a heavy burden on servers. Users are distributed worldwide, which requires servers to be deployed worldwide so that latency stays low. This global distribution of servers improves efficiency and provides a more user-friendly experience. Although this architecture has its advantages, it also imposes limits on relational database management systems, which were not designed for distributed systems; this makes it difficult to guarantee data integrity. Meanwhile, the JSON data format has come into wide use in almost all kinds of Internet communication, and its structures have grown increasingly complicated, so that more storage space and more tables are required to accommodate them. Therefore, a new, well-designed class of database is called for to meet these requirements, and NoSQL is that kind of database. Because NoSQL is designed specifically for this purpose, it stores data flexibly: there is no need to create schemas in advance to define the data structures.
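As a small illustration of this schema-free flexibility, the sketch below stores JSON documents of different shapes side by side. The in-memory store and the `put`/`get` helpers are hypothetical, not the API of any real database.

```python
import json

# Hypothetical in-memory document store: no schema is declared up front,
# so records with different shapes can coexist (no ALTER TABLE needed).
store = {}

def put(doc_id, document):
    """Store any JSON-serializable document under a key."""
    store[doc_id] = json.loads(json.dumps(document))  # round-trip: must be valid JSON

def get(doc_id):
    """Look a document up by its ID."""
    return store.get(doc_id)

# Two documents with different structures in the same "collection".
put("u1", {"name": "Alice", "email": "alice@example.com"})
put("u2", {"name": "Bob", "tags": ["admin"], "address": {"city": "Victoria"}})
```

A relational table would force both records into one fixed set of columns; here each document simply carries whatever fields it has.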
Moreover, its design for distributed systems makes it easier to deploy databases globally while securing data integrity and consistency, and the design scales well to large datasets. Many NoSQL databases are already in production use, for instance MongoDB and Redis.

II. Data Models

NoSQL databases use aggregate data models to model their datasets. The aggregate data models consist of the key-value data model, the document data model, and the column-family stores model. An aggregate is defined as a collection of related objects treated as a unit; in particular, it is the unit for data manipulation and for managing consistency. Typically, aggregates are updated with atomic operations, and communication with the data store is expressed in terms of aggregates.

The key-value data model is strongly aggregate-oriented: the database is constructed primarily of aggregates, each with a key that is used to get at the data (the value). The aggregate is opaque to the database, which treats it as a big blob of mostly meaningless bits. This model provides access to aggregates through lookup by key.

The document data model, similar to the key-value model, treats each aggregate as a structured document, and each document has a document ID that serves as the key used to look up the aggregate. The difference from the key-value model is that a document database can see the structure of the aggregate, whereas the aggregate remains opaque to a key-value database.

The column-family stores concept comes from the structure of Google's BigTable. This model defines a table as a sparse matrix, or a two-level map. Each row, as a record, has a row key and several column keys.
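The three aggregate models can be sketched with plain Python dictionaries; the store names below are illustrative, not a real database API.

```python
# Key-value model: the value is an opaque blob; the only operation the
# database supports is lookup by key -- it cannot see inside the aggregate.
kv_store = {}
kv_store["user:42"] = b'{"name": "Alice", "orders": [1, 2]}'  # opaque bytes

# Document model: the database can see the aggregate's structure,
# so queries can reach inside the document.
doc_store = {
    "42": {"name": "Alice", "orders": [1, 2]},  # document ID -> document
}

def find_by_field(field, value):
    """A query the key-value model cannot express: match on an inner field."""
    return [d for d in doc_store.values() if d.get(field) == value]

# Column-family model: a two-level map -- row key, then column key.
cf_store = {
    "row-1": {"name": "Alice", "city": "Victoria"},
}
```

The `find_by_field` query highlights the key difference: the document store can filter on a field inside the aggregate, while the key-value store can only fetch the whole blob by its key.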
The row key (the first-level key) serves as a row identifier, while the column keys map to the column values (the second-level values). Each column can hold tuples with different types of data.

III. Distribution Models

As mentioned in the Introduction, NoSQL databases work well with distributed systems. Accordingly, several distribution models have been designed for this purpose: the single-server model, the sharding model, the master-slave replication model, the peer-to-peer replication model, and combinations of sharding with replication.

The single-server model, as the name suggests, deploys only one server to store the dataset, and all data operations happen on that server. This policy is simple, ensures data integrity and consistency, and avoids operation conflicts. However, the full load of all data operations falls on a single server, so it cannot minimize request latency or provide failover when hundreds of thousands of requests arrive from all over the world. Moreover, once the only server goes down, it can no longer provide service; worse still, the data is lost forever if the server's hard drive breaks.

Another distribution model is sharding. As the name suggests, it splits the dataset into parts and stores the different sub-datasets on different servers. The sharding strategy can be based on location, data category, or other rules. Each shard reads and writes its own data. Because data throughput is unevenly distributed across shards, some shards may be used inefficiently, so this policy has its own pros and cons.

Master-slave replication selects one server as the master node and treats the others as slave nodes. In the initial state, every node holds the full dataset. The master node has the authority to control the slave nodes.
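The sharding idea can be sketched as a simple hash-based routing rule; the shard names and the MD5-based rule below are assumptions for illustration (real systems may instead shard by location or data category, as noted above).

```python
import hashlib

# Illustrative shard list; in practice these would be separate servers.
SHARDS = ["server-a", "server-b", "server-c"]

def shard_for(key):
    """Route a key deterministically to one shard via its hash."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# Because the routing is deterministic, every read and write for the
# same key lands on the same shard, so each shard owns its own slice.
```

A deterministic rule like this is what lets each shard "read and write its own data" without coordinating with the others, though a skewed key distribution still produces the uneven load described above.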
All write requests go to the master node, which then propagates the changes to all the slaves; to balance the load across nodes, read requests can be served by either the master or the slaves. Unlike the single-server model, this model can continue service when the master goes down, since a slave node can be promoted to master if the master fails.

Peer-to-peer replication has a similar structure, except that there is no master node. Each node in this model holds the same dataset, has the same authority, handles both read and write requests, and communicates write requests to the other nodes to keep every node's dataset up to date.

The combination of sharding and master-slave replication absorbs the advantages of both. This model uses multiple masters, but each data item has only a single master, so it balances the read/write load across nodes and provides the ability to recover when some nodes fail. The combination of peer-to-peer replication and sharding is another approach: it replicates the dataset three times across the nodes, so that even if one node fails, the data on that node is still available and service continues.
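The peer-to-peer-plus-sharding scheme can be sketched as ring-style placement with a replication factor of three; the node names and the placement rule are illustrative assumptions, not a real system's algorithm.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3", "node-4"]
REPLICATION_FACTOR = 3  # three copies of each item, as described above

def replica_nodes(key):
    """Place a key on REPLICATION_FACTOR consecutive nodes of the ring."""
    start = int(hashlib.sha1(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

def read(key, failed=frozenset()):
    """A read succeeds as long as at least one replica is still alive."""
    live = [n for n in replica_nodes(key) if n not in failed]
    return live[0] if live else None
```

With three replicas per item, the loss of any single node leaves two live copies, which is why a failed node's data remains available and service continues.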