mongoDB


MongoDB

MongoDB is a cross-platform, open-source, document-oriented NoSQL database. An enterprise distribution adds further security and management features. MongoDB is supported and steered by the company MongoDB, Inc.

What do we mean by NoSQL?

The NoSQL database category loosely incorporates all databases which do not have a rigid schema and are not queryable via a structured query language. They are normally designed specifically to scale easily to support very large datasets.

What is a document-based database?

Instead of storing information as a table with a series of records or rows, MongoDB stores data as searchable documents. A document is analogous to an object in object-oriented programming languages.

How does MongoDB represent documents?

MongoDB uses JSON to represent its documents. JSON is a simple textual serialization of an object as key-value pairs. For example, a bunch of apples could be represented as the following JSON string:
{
  "Name": "Bunch of apples",
  "Origin": "UK",
  "Colour": "Red",
  "Weight": 0.15,
  "PickDate": "2014-06-10T12:20:12Z",
  "Apples": [
    {
      "Name": "Apple",
      "Weight": 0.00068
    },
    {
      "Name": "Apple",
      "Weight": 0.00073
    },
    ...
  ]
}
JSON supports value types such as strings, numbers, booleans and arrays, and it allows objects to be nested to any depth. MongoDB stores documents on disk in a binary format called BSON, which adds further types such as dates and distinct integer and double types, and which is optimised for retrieval.
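As a rough illustration of how such a document looks to a program, the sketch below parses a trimmed copy of the apple document with Python's standard json module. No MongoDB server or driver is involved. The pick date travels as an ISO 8601 string and is converted explicitly, and the variable names are only illustrative.

```python
import json
from datetime import datetime, timezone

# The example document from the text, trimmed to two apples.
raw = """
{
  "Name": "Bunch of apples",
  "Origin": "UK",
  "Colour": "Red",
  "Weight": 0.15,
  "PickDate": "2014-06-10T12:20:12Z",
  "Apples": [
    {"Name": "Apple", "Weight": 0.00068},
    {"Name": "Apple", "Weight": 0.00073}
  ]
}
"""

doc = json.loads(raw)

# Nested values are reached by walking keys and list positions.
first_apple_weight = doc["Apples"][0]["Weight"]

# The date arrives as a plain string, so it is converted by hand here.
pick_date = datetime.strptime(
    doc["PickDate"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)

print(doc["Name"], first_apple_weight, pick_date.isoformat())
```

The parsed result is an ordinary dictionary with nested lists and dictionaries, which is the shape that language drivers generally hand back to application code.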

Document-based databases are schema-less in the sense that there is no requirement for any object stored in the database to have the same, or similar, structure or document keys.
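A small sketch of what schema-less means in practice, using a plain Python list as a stand-in for a collection. The sample documents and the find_with_key helper are hypothetical and are not part of any MongoDB API.

```python
# A "collection" modelled as a list of dicts (hypothetical sample data).
collection = [
    {"Name": "Bunch of apples", "Origin": "UK", "Weight": 0.15},
    {"Name": "Crate", "Material": "Wood"},          # no Origin or Weight
    {"Sku": 1042, "Tags": ["fruit", "seasonal"]},   # no Name at all
]


def find_with_key(docs, key):
    """Return the documents that contain the given key."""
    return [d for d in docs if key in d]


with_origin = find_with_key(collection, "Origin")
print(len(with_origin), "of", len(collection), "documents have an Origin")
```

Because nothing forces the documents to share keys, code that reads them has to tolerate missing fields.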

How does MongoDB efficiently search for data?

A user can programmatically specify indexes on document keys. Just like indexes in relational databases, they shorten query times by letting MongoDB locate documents matching the query constraints more rapidly. MongoDB allows up to 64 indexes per collection, and these can be compound, i.e., indexes that incorporate multiple document keys into a single index. Additionally, MongoDB can scale horizontally, distributing its data and the query load across multiple servers ('nodes'). MongoDB calls this process 'sharding'.
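To make the idea of single-key and compound indexes concrete, here is a toy sketch in plain Python. It keeps index entries in sorted order and finds matches by binary search. This only illustrates the principle and is not how MongoDB implements its indexes. The SortedIndex class and the sample documents are hypothetical.

```python
from bisect import bisect_left, bisect_right

docs = [
    {"Origin": "UK", "Colour": "Red", "Weight": 0.15},
    {"Origin": "FR", "Colour": "Green", "Weight": 0.20},
    {"Origin": "UK", "Colour": "Green", "Weight": 0.12},
    {"Origin": "ES", "Colour": "Red", "Weight": 0.18},
]


class SortedIndex:
    """Toy index: sorted key tuples alongside the matching document positions."""

    def __init__(self, documents, keys):
        pairs = sorted(
            (tuple(d[k] for k in keys), pos) for pos, d in enumerate(documents)
        )
        self.keys = [k for k, _ in pairs]
        self.positions = [p for _, p in pairs]

    def find(self, *values):
        """Return positions of documents whose indexed keys equal the values."""
        target = tuple(values)
        lo = bisect_left(self.keys, target)
        hi = bisect_right(self.keys, target)
        return self.positions[lo:hi]


by_origin = SortedIndex(docs, ["Origin"])                      # single-key index
by_origin_colour = SortedIndex(docs, ["Origin", "Colour"])     # compound index

print(by_origin.find("UK"))
print(by_origin_colour.find("UK", "Green"))
```

The lookup touches only a logarithmic number of index entries instead of scanning every document, which is the benefit the paragraph above describes.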

What is Sharding?

To support large data sets efficiently, MongoDB allows collections to be physically partitioned and spread across multiple hosts. Each collection can specify a set of keys, which every document in that collection must have, that are used to determine how the documents are distributed across the nodes. This shard key defines an index, of which each host in the cluster is allocated a portion. MongoDB supports up to 1024 shards in a single cluster, and new shards can be added over time as the data grows (without taking the database offline). For optimum performance, shard members should be co-located and the shard key should be chosen carefully.
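The following sketch shows one way a shard key can divide documents between hosts: each shard owns a contiguous range of key values, and a binary search over the range boundaries picks the owner. The split points, shard names and the shard_for function are assumptions made for illustration, not MongoDB internals.

```python
from bisect import bisect_right

# Hypothetical split points on a string shard key. Each shard owns one range:
#   shard0: key < "H"    shard1: "H" <= key < "P"    shard2: key >= "P"
split_points = ["H", "P"]
shards = ["shard0", "shard1", "shard2"]


def shard_for(shard_key_value):
    """Return the name of the shard whose range contains the key value."""
    return shards[bisect_right(split_points, shard_key_value)]


# Distribute a few documents by their "Name" shard key.
placement = {name: [] for name in shards}
for doc in [{"Name": "Apple"}, {"Name": "Mango"}, {"Name": "Pear"}, {"Name": "Hazelnut"}]:
    placement[shard_for(doc["Name"])].append(doc["Name"])

print(placement)
```

A key whose values cluster in one range would send most documents to a single shard, which is one reason the choice of shard key matters.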

How is the cluster accessed?

MongoDB ships with a small routing process (called mongos) that sits between the cluster and the client application. Its job is to route client requests into the cluster and then to manage the streaming of the response back to the client. If the query is predicated on the shard key then MongoDB can direct it to the particular node that is responsible for that part of the shard key index; otherwise, it must send it to all nodes in the cluster and aggregate the responses for the client. A cluster can have multiple routing processes and a typical deployment will have one or two per application process for redundancy.
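The difference between a targeted query and a broadcast query can be sketched with a toy router in plain Python. The shard contents, the customer_id shard key and the route function are hypothetical, and the real mongos process is considerably more involved.

```python
from bisect import bisect_right

# Hypothetical ranges on the shard key "customer_id":
#   shard 0: id < 100    shard 1: 100 <= id < 200    shard 2: id >= 200
SPLITS = [100, 200]
SHARD_DATA = [
    [{"customer_id": 7, "city": "Leeds"}, {"customer_id": 42, "city": "York"}],
    [{"customer_id": 150, "city": "Leeds"}],
    [{"customer_id": 230, "city": "Bath"}, {"customer_id": 999, "city": "Leeds"}],
]


def route(query):
    """Return (shards_contacted, matching_documents) for an equality query."""
    if "customer_id" in query:
        # The query names the shard key, so only one shard can hold matches.
        targets = [bisect_right(SPLITS, query["customer_id"])]
    else:
        # No shard key: every shard must be asked and the results merged.
        targets = list(range(len(SHARD_DATA)))
    results = []
    for shard in targets:
        for doc in SHARD_DATA[shard]:
            if all(doc.get(k) == v for k, v in query.items()):
                results.append(doc)
    return targets, results


print(route({"customer_id": 150}))
print(route({"city": "Leeds"})[0])
```

The first call contacts a single shard, while the second has to visit all three before it can answer.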

How can I secure MongoDB?

Security can be applied at the database level. Accounts can be created within the database to give users read-only or full read-write access. An imminent version of MongoDB will have Active Directory integration. Interprocess communications can be secured using TLS/SSL.

How does MongoDB deal with redundancy and backup?

Each shard in the MongoDB cluster can itself be a replica set. A replica set is a group of computers which, under normal load conditions, are kept in sync. Consistency across the entire replica set is not guaranteed, since writes are not (by default) synchronous across the entire replica set, but members become consistent over time. ACID-like updates are implemented by nominating a single primary member in each set to which all writes must be made and to which (by default) all reads are made. For applications where read consistency is not important, secondary members of the set can be used for reads, easing the load on the primary. If the original primary fails, a new primary is elected from among the surviving members.
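The behaviour described above can be mimicked with a deliberately simplified simulation: writes go to the primary, are copied to the secondaries later, and a new primary is chosen if the old one fails. The ReplicaSet class, its method names and its election rule (most data wins, ties broken by name) are assumptions for illustration and do not reflect MongoDB's actual replication or election protocol.

```python
class ReplicaSet:
    """Toy replica set: one primary, asynchronous copies to secondaries."""

    def __init__(self, names):
        self.members = {name: [] for name in names}
        self.primary = names[0]
        self.pending = []  # writes accepted by the primary but not yet copied

    def write(self, doc):
        """All writes go to the primary and are queued for replication."""
        self.members[self.primary].append(doc)
        self.pending.append(doc)

    def replicate(self):
        """Copy queued writes to every secondary (the 'eventually' step)."""
        for name, log in self.members.items():
            if name != self.primary:
                log.extend(self.pending)
        self.pending.clear()

    def read(self, member=None):
        """Read from the primary by default, or from a named secondary."""
        return list(self.members[member or self.primary])

    def fail_primary(self):
        """Drop the primary and promote the survivor holding the most data."""
        del self.members[self.primary]
        self.pending.clear()  # writes that were never copied are lost
        self.primary = min(
            self.members, key=lambda n: (-len(self.members[n]), n)
        )


rs = ReplicaSet(["a", "b", "c"])
rs.write({"x": 1})
stale = rs.read("b")     # secondary has not caught up yet
rs.replicate()
fresh = rs.read("b")     # now consistent with the primary
rs.fail_primary()
print(stale, fresh, rs.primary)
```

Reading from a secondary before replication returns stale data, which is the trade-off the text mentions for applications that read from secondaries.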

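Replica set members need not be located in the same physical location. The maximum set size depends on the version: the 2.x series allows up to 12 members, of which 7 can vote, and releases since 3.0 allow up to 50 members, still with at most 7 voting. Members can also be marked as delayed members, where writes are only propagated to these members after a specified time period.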

Snapshot backups of the database can be taken by temporarily taking the cluster formed by secondary replica set members offline and creating copies of the database files.

What APIs are available to use with MongoDB?

Client APIs are available for all popular programming languages, including Java, C#, C++, C and Python. These are open-source and maintained by MongoDB, Inc. (formerly 10gen), the company behind MongoDB.

Setup

Download the MongoDB distribution:

  • Windows:
    • I tested with a 2.x version and installed it in C:\work\mongoDB
    • Create the C:\work\mongoDB\dbpath directory for database storage; it is passed as the --dbpath command argument when starting the MongoDB server.
    • Run C:\work\mongoDB\bin\mongod.exe from the distribution folder.
  • Linux:
    • Download the latest stable distribution and unpack it to /work/mongodb/ (the MongoDB home folder).
    • Create the /work/mongodb/dbpath directory for its data storage.
    • Run MongoDB:
      cd /work/mongodb
      ./bin/mongod --dbpath 
      
    • If you are short of disk space, you can try running the MongoDB instance with the --smallfiles option:
      ./bin/mongod --dbpath  --smallfiles
      

MongoDB clients

On both Windows and Linux you can use the default mongo JavaScript shell that comes with the distribution.
It connects to localhost and the test database by default.
After connecting, you can run commands to experiment with it.
Help is available by typing help.

MongoDB also comes with a variety of tools that can be used to administer or monitor MongoDB instances. Most basic operations include connecting to a local or any remote MongoDB instance, viewing collections (Mongo term for tables), running queries, creating indices, dropping collections, etc.

Drivers

MongoDB provides drivers for a variety of programming languages. For Java you can use the MongoDB Java Driver, whose documentation covers getting started with simple applications.
There is also a Spring Data MongoDB project with a POJO-centric model for interacting with a MongoDB DBCollection and easily writing a Repository-style data access layer.
