NoSQL Introduction

NoSQL stands for "Not Only SQL".

In modern computing systems, massive amounts of data are generated on the network every day.

A large portion of this data is handled by Relational Database Management Systems (RDBMS). The relational model proposed in E.F. Codd's 1970 paper "A relational model of data for large shared data banks" simplified data modeling and application programming.

Practical application has proven that the relational model is highly suitable for client-server programming, offering benefits far beyond expectations. Today, it is the dominant technology for storing structured data in network and business applications.

NoSQL is a database movement that was proposed early on and gained significant momentum in 2009. Its advocates promote non-relational data storage. Set against the ubiquity of relational databases, this idea introduces a fundamentally different way of thinking about data.

A database transaction is similar to a real-world transaction. It has the following four characteristics, known together as ACID:

1. A (Atomicity)

Atomicity is easy to understand: all operations within a transaction are either completed entirely or not at all. A transaction succeeds only if all its operations succeed. If any single operation fails, the entire transaction fails and needs to be rolled back.

For example, a bank transfer of 100 yuan from Account A to Account B involves two steps: 1) Withdraw 100 yuan from Account A; 2) Deposit 100 yuan into Account B. These two steps must either complete together or not complete at all. If only the first step is completed and the second step fails, 100 yuan will mysteriously disappear.
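
The transfer above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration rather than a production design: the `accounts` table, the `transfer` helper, and the starting balances are assumptions made for the example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; undo everything on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:  # step 2 failed: no such destination account
                raise ValueError(f"unknown account: {dst}")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical sample data for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 500), ("B", 200)])
conn.commit()

ok = transfer(conn, "A", "B", 100)           # both steps succeed
failed = transfer(conn, "A", "nobody", 100)  # step 2 fails, step 1 is rolled back
```

After both calls, Account A holds 400 and Account B holds 300: the withdrawal made by the failed transfer is rolled back, so no money disappears.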

2. C (Consistency)

Consistency is also relatively easy to understand: the database must always remain in a consistent state. The execution of a transaction does not alter the original consistency constraints of the database.

For example, if there is an integrity constraint a + b = 10, and a transaction changes the value of a, then b must also be changed so that after the transaction ends, a + b = 10 still holds. Otherwise, the transaction fails.
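
One way to express such a constraint is a `CHECK` clause, sketched here with Python's built-in `sqlite3` module. The `pair` table and the `try_update` helper are assumptions made for the example. Note that SQLite evaluates a `CHECK` constraint per statement, which is stricter than checking only when the transaction ends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The integrity constraint a + b = 10 is declared on the table itself.
conn.execute("CREATE TABLE pair (a INTEGER, b INTEGER, CHECK (a + b = 10))")
conn.execute("INSERT INTO pair VALUES (4, 6)")
conn.commit()

def try_update(conn, sql):
    """Run one update in a transaction; report whether it was accepted."""
    try:
        with conn:  # rolls back automatically if the constraint is violated
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

broke = try_update(conn, "UPDATE pair SET a = 7")        # a + b would be 13
kept = try_update(conn, "UPDATE pair SET a = 7, b = 3")  # a + b stays 10
row = conn.execute("SELECT a, b FROM pair").fetchone()
```

The first update is rejected and leaves the row untouched. The second changes both values together and is accepted, so `row` ends up as `(7, 3)`.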

3. I (Isolation)

Isolation means that concurrent transactions do not interfere with each other. If one transaction reads data that another transaction is modifying, the reader is not affected by those changes until the other transaction commits.

For example, if a transaction transferring 100 yuan from Account A to Account B is not yet complete and the holder of Account B queries the balance at that moment, the newly added 100 yuan is not visible.

4. D (Durability)

Durability means that once a transaction is committed, the changes it made will be permanently saved in the database, even if a system crash occurs.

A distributed system consists of multiple computers and communication software components connected via computer networks (local area networks or wide area networks).

A distributed system is a software system built on top of a network. It is precisely because of the characteristics of software that distributed systems possess high cohesion and transparency.

Therefore, the difference between networks and distributed systems lies more in the higher-level software (especially the operating system) rather than the hardware.

Distributed systems can be applied to various platforms such as PCs, workstations, local area networks, and wide area networks.

Advantages of distributed computing:

Reliability (Fault Tolerance):

An important advantage of distributed computing systems is reliability. The crash of one server's system does not affect the other servers.

Scalability:

In a distributed computing system, more machines can be added as needed.

Resource Sharing:

Sharing data is essential for applications such as banking and reservation systems.

Flexibility:

Because the system is very flexible, it is easy to install, implement, and debug new services.

Faster Speed:

Distributed computing systems can leverage the computational power of multiple computers, giving them faster processing speeds than a single computer working alone.

Open System:

Because it is an open system, the service can be accessed locally or remotely.

Higher Performance:

Compared to centralized computer network clusters, it can provide higher performance (and better cost-effectiveness).

Disadvantages of distributed computing:

Troubleshooting:

Troubleshooting and diagnosing problems is more difficult.

Software:

Limited software support is a major disadvantage of distributed computing systems.

Network:

Issues with network infrastructure, including: transmission problems, high load, information loss, etc.

Security:

The open nature of the system poses risks to data security and sharing in distributed computing systems.

NoSQL refers to non-relational databases. NoSQL is sometimes also an abbreviation for "Not Only SQL," a general term for database management systems that differ from traditional relational databases.

NoSQL is used for storing extremely large-scale data (e.g., Google or Facebook collect terabytes of data for their users daily). These types of data storage do not require a fixed schema and can be scaled horizontally without complex operations.
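
A minimal sketch of both ideas, assuming a hypothetical `ShardedStore` class: documents with different fields are stored side by side without any schema, and each key is assigned to one of several shards by hashing. The shards are plain dictionaries here; in a real deployment each would be a separate machine.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads keys over a fixed number of shards."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, document):
        self.shards[self._shard_for(key)][key] = document

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

store = ShardedStore(shard_count=4)
# No fixed schema: each document carries whatever fields it needs.
store.put("user:1", {"name": "Ann", "age": 31})
store.put("user:2", {"name": "Bob", "tags": ["admin"], "city": "Oslo"})
store.put("log:9", {"event": "login", "user": "user:1"})

total = sum(len(shard) for shard in store.shards)
```

Simple modulo hashing like this remaps most keys when the number of shards changes, which is why production systems commonly use consistent hashing or a similar scheme instead.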

Today, data is easy to access and collect through third-party platforms (such as Google and Facebook). User profiles, social networks, geographic locations, user-generated content, and user operation logs have grown exponentially. Mining this kind of user data is often a poor fit for SQL databases, whereas NoSQL databases were developed to handle data at this scale.

Social Network:

Each record: UserID1, UserID2

Separate records: UserID, first_name, last_name, age, gender, ...

Task: Find all friends of friends of friends of ... friends of a given user.
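
The social network task is a graph traversal. A minimal sketch, assuming the friendship records are available as `(UserID1, UserID2)` pairs: the function name `friends_within` and the sample edges are hypothetical, and a breadth-first search is used to collect everyone reachable within a given number of hops.

```python
from collections import deque

def friends_within(edges, start, max_hops):
    """Return {user: hops} for every user reachable within max_hops of start."""
    # Build an undirected adjacency map from the friendship records.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(user, ()):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)

    del distance[start]  # the user is not their own friend
    return distance

# Hypothetical friendship records: each pair is one (UserID1, UserID2) row.
edges = [(1, 2), (2, 3), (3, 4), (1, 5), (6, 7)]
reachable = friends_within(edges, start=1, max_hops=2)
```

For the sample data, `reachable` is `{2: 1, 5: 1, 3: 2}`: users 2 and 5 are direct friends and user 3 is a friend of a friend. In a relational database each additional hop typically requires another self-join on the friendship table, which is part of the motivation for graph databases.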

Wikipedia Page:

Large collection of documents

Combination of structured and unstructured data

Task: Retrieve all pages regarding athletics of Summer Olympics before 1950.

RDBMS

  • Highly organized structured data
  • Structured Query Language (SQL)
  • Data and relationships are stored in separate tables.
  • Data Manipulation Language, Data Definition Language
  • Strict consistency
  • Basic transactions

NoSQL

  • Stands for Not Only SQL
  • No declarative query language
  • No predefined schema
  • Key-value store, column store, document store, graph database
  • Eventual consistency rather than ACID properties
  • Unstructured and unpredictable data
  • CAP theorem
  • High performance, high availability, and scalability

The term NoSQL first appeared in 1998, referring to a lightweight, open-source relational database developed by Carlo Strozzi that did not provide SQL functionality.

In 2009, Johan Oskarsson from Last.fm initiated a discussion on distributed open-source databases. Eric Evans from Rackspace reintroduced the concept of NoSQL, which at that time mainly referred to non-relational, distributed database design patterns that did not provide ACID guarantees.

The "no:sql (east)" conference held in Atlanta in 2009 was a milestone, with the slogan "select fun, profit from real_world where relational=false;". Therefore, the most common interpretation of NoSQL is "non-relational," emphasizing the advantages of Key-Value Stores and document databases, rather than simply opposing RDBMS.

In computer science, the CAP theorem (also known as Brewer's theorem) states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (every request receives a response, whether success or failure)
  • Partition tolerance (the system continues to operate despite arbitrary information loss or failure of part of the system)

The core of CAP theory is: a distributed system cannot simultaneously satisfy consistency, availability, and partition tolerance perfectly; it can at best satisfy two of them well.

Based on the CAP theorem, NoSQL databases can therefore be divided into three categories, according to whether they satisfy CA, CP, or AP:

  • CA - Single-point clusters that satisfy consistency and availability, usually at the cost of scalability.
  • CP - Systems that satisfy consistency and partition tolerance, usually at the cost of performance.
  • AP - Systems that satisfy availability and partition tolerance, usually with weaker consistency guarantees.
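
The difference between CP and AP behaviour can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not how any particular database works: the `Cluster` and `Replica` classes are hypothetical, there are only two replicas, and a network partition is simulated with a flag.

```python
class Replica:
    def __init__(self):
        self.data = {}

class Cluster:
    """Two replicas with a switchable network partition (toy model)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.left, self.right = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, key, value):
        other = self.right if replica is self.left else self.left
        if self.partitioned:
            if self.mode == "CP":
                return False           # refuse the write to stay consistent
            replica.data[key] = value  # AP: accept locally, replicas diverge
            return True
        replica.data[key] = value      # healthy network: update both replicas
        other.data[key] = value
        return True

    def consistent(self):
        return self.left.data == self.right.data

def run(mode):
    """Write once normally, then once during a partition."""
    cluster = Cluster(mode)
    cluster.write(cluster.left, "x", 1)
    cluster.partitioned = True
    accepted = cluster.write(cluster.left, "x", 2)
    return accepted, cluster.consistent()

cp_result = run("CP")
ap_result = run("AP")
```

In this model `cp_result` is `(False, True)`: the write is rejected during the partition, so the replicas stay identical but the system is unavailable for writes. `ap_result` is `(True, False)`: the write is accepted, at the price of the two replicas temporarily disagreeing.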

Advantages:

  • High scalability
  • Distributed computing
  • Low cost
  • Architectural flexibility, semi-structured data
  • No complex relationships

Disadvantages:

  • No standardization
  • Limited query functionality (so far)
  • Eventual consistency is not intuitive for programming

BASE: Basically Available, Soft-state, Eventually Consistent. Defined by Eric Brewer.

BASE describes the relaxed availability and consistency requirements that NoSQL databases typically adopt:

  • Basically Available: the system remains available for most requests.
  • Soft state: also described as flexible transactions. "Soft state" can be understood as "connectionless," while "Hard state" is "connection-oriented."
  • Eventual Consistency: the data converges to a consistent state over time, which is also the ultimate goal of ACID.

| ACID | BASE |
| --- | --- |
| Atomicity | Basically Available |
| Consistency | Soft state |
| Isolation | Eventual consistency |
| Durability | |

| Type | Representative | Features |
| --- | --- | --- |
| Column Store | HBase, Cassandra, Hypertable | Data is stored by column. Well suited to structured and semi-structured data, easy to compress, and offers significant I/O advantages for queries that target one or a few columns. |
| Document Store | MongoDB, CouchDB | Stores document-type content, generally in a JSON-like format. Individual fields can be indexed, which provides some of the functionality of a relational database. |
| Key-Value Store | Tokyo Cabinet/Tyrant, Berkeley DB, MemcacheDB, Redis | Values are looked up quickly by key. The store generally does not care about the format of the value and accepts anything. (Redis includes additional functionality.) |
| Graph Store | Neo4J, FlockDB | Best suited to storing graph relationships, which perform poorly and are awkward to design and use in a traditional relational database. |
| Object Store | db4o, Versant | Operates on the database with syntax similar to object-oriented languages, storing and retrieving data as objects. |
| XML Database | Berkeley DB XML, BaseX | Stores XML data efficiently and supports XML query languages such as XQuery and XPath. |
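
To make the document store entry more concrete, here is a minimal in-memory sketch of indexing a field of JSON-like documents. The `DocumentStore` class and its methods are hypothetical and are not the API of MongoDB, CouchDB, or any other product listed above.

```python
from collections import defaultdict

class DocumentStore:
    """Toy document store with optional secondary indexes on chosen fields."""

    def __init__(self, indexed_fields):
        self.docs = {}
        # One index per field: field value -> set of document ids.
        self.indexes = {field: defaultdict(set) for field in indexed_fields}

    def insert(self, doc_id, doc):
        if doc_id in self.docs:
            raise KeyError(f"duplicate id: {doc_id}")
        self.docs[doc_id] = doc
        for field, index in self.indexes.items():
            if field in doc:  # documents may omit any field
                index[doc[field]].add(doc_id)  # indexed values must be hashable

    def get(self, doc_id):
        # Key-value style access: look the document up directly by its id.
        return self.docs.get(doc_id)

    def find(self, field, value):
        if field in self.indexes:
            ids = self.indexes[field].get(value, set())  # index lookup
        else:
            # No index on this field: fall back to scanning every document.
            ids = {i for i, d in self.docs.items() if d.get(field) == value}
        return sorted(ids)

store = DocumentStore(indexed_fields=["city"])
store.insert("u1", {"name": "Ann", "city": "Oslo"})
store.insert("u2", {"name": "Bob", "city": "Lima", "age": 40})
store.insert("u3", {"name": "Cy", "city": "Oslo"})

in_oslo = store.find("city", "Oslo")  # answered from the index
aged_40 = store.find("age", 40)       # answered by a full scan
```

Here `in_oslo` is `["u1", "u3"]` and `aged_40` is `["u2"]`. Looking a document up by id with `get` corresponds to the key-value access pattern, while `find` on an indexed field resembles a relational query on an indexed column.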

Many companies are now using NoSQL:

  • Google
  • Facebook
  • Mozilla
  • Adobe
  • Foursquare
  • LinkedIn
  • Digg
  • McGraw-Hill Education
  • Vermont Public Radio