Tuning Linux for MongoDB

What is MongoDB?
================



MongoDB is a document-based database. It is not schemaless, but the schema is very fluid. Every item in a document carries its own label, akin to a key-value pair, rather than just the value. Documents are not required to have entries for every possible field.



Key benefits: data redundancy, horizontal scale.



The two primary layouts for MongoDB deployments are replica sets and sharded clusters. Replica sets have a primary server and at least two replicas[0], all on similarly capable hardware. Each node has a full copy of the data. An election can happen automagically at any time, promoting any of the nodes to primary.



Replica set writes go to the primary. Reads can be distributed across the secondaries if the application design allows. Writes use vertical scaling. Need more bandwidth, more storage or more CPU? Provide more capable (virtual) hardware for each of the machines in the replica set.



For horizontal scaling, which means adding new systems rather than making the servers bigger, use a sharded cluster. Sharded clusters split the dataset across multiple replica sets and put proxies in front to route queries and responses. Each shard is essentially a replica set for the data on that shard. The proxies can also be scaled horizontally by adding new servers to distribute throughput.



This article covers highlights from my upcoming Tuning Linux for MongoDB talk at the Southern California Linux Expo (SCaLE). SCaLE is a large, community-run open source conference held March 5th through 8th. It is once again at the fantastic Pasadena Conference Center. SCaLE18x accepted my MongoDB talk as part of the sysadmin track. I’m glad to be in the same track as "What’s new in Sudo 1.9", "Command Line Efficiency" and "IPv6 Multicast and the Next Generation Internet", as I might be able to catch those.



MongoDB Daemons and Tuning
==========================



MongoDB’s two primary daemons are mongos and mongod. The mongos daemons are the proxies used by sharded clusters. The mongod daemons are the data servers, running on each replica set node. They fulfill data requests and perform background management operations.



Both daemons should have dedicated systems, whether they be bare metal, virtual machines or containers. For CPUs, generally choose more cores rather than faster cores. Also, turn off dynamic frequency scaling and verify that CPUs run at full throttle in both the BIOS and the operating system (see the power saving settings). CPUs that support the AES-NI extension can give performance improvements for encryption. Time is important, so make sure every node syncs its clock, for example by running ntpd.



Default kernel and operating system configurations often need to be adjusted in server environments. This is especially true for databases such as MongoDB.



For sharded clusters the mongos proxies provide several key functions where OS tuning is important.



-   TLS endpoint connection for the application servers



-   authentication



-   take requests from the application servers



-   route requests to the appropriate shards and nodes



-   aggregate and manipulate responses from the shards



-   return data to application servers



The mongos can have many TCP connections, process a lot of network IO and use a decent amount of CPU. They don’t do much disk IO.



For user limits, throw the door open for the mongos processes. mongos should be the priority process on the node. Increase processes/threads and open files, usually to 64000. Set the CPU time, memory and file size limits to unlimited. Set the soft and hard limits to the same values.



-   processes / threads (-u) = 64000



-   open files (-n) = 64000



-   cpu time (-t) = unlimited



-   virtual memory (-v) = unlimited



-   locked-in-memory (-l) = unlimited



-   file size (-f) = unlimited



-   memory size (-m) = unlimited
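As a rough way to audit the list above, here is a minimal Python sketch built on the standard `resource` module. The mapping from the `ulimit` flags to `RLIMIT_*` names is my own reading of the list, and `review_limits` is a hypothetical helper name. Note that it inspects the limits of the process running the script, so it only tells you something useful when run as the same user and under the same service settings as mongos.

```python
import resource

# Targets copied from the list above. None stands for "unlimited".
TARGETS = {
    "RLIMIT_NPROC": 64000,   # processes / threads (-u)
    "RLIMIT_NOFILE": 64000,  # open files (-n)
    "RLIMIT_CPU": None,      # cpu time (-t)
    "RLIMIT_AS": None,       # virtual memory (-v)
    "RLIMIT_MEMLOCK": None,  # locked-in-memory (-l)
    "RLIMIT_FSIZE": None,    # file size (-f)
    "RLIMIT_RSS": None,      # memory size (-m)
}


def meets_target(value, target):
    """True if a limit value is unlimited or at least the numeric target."""
    if value == resource.RLIM_INFINITY:
        return True
    return target is not None and value >= target


def current_limits():
    """Collect (soft, hard) pairs for the calling process."""
    limits = {}
    for name in TARGETS:
        constant = getattr(resource, name, None)  # some names are platform specific
        if constant is not None:
            limits[name] = resource.getrlimit(constant)
    return limits


def review_limits(limits):
    """Return a list of problems found in a {name: (soft, hard)} mapping."""
    problems = []
    for name, target in TARGETS.items():
        if name not in limits:
            continue
        soft, hard = limits[name]
        if not meets_target(soft, target):
            wanted = "unlimited" if target is None else target
            problems.append(f"{name}: soft limit {soft} is below {wanted}")
        if soft != hard:
            problems.append(f"{name}: soft ({soft}) and hard ({hard}) limits differ")
    return problems


if __name__ == "__main__":
    for problem in review_limits(current_limits()):
        print(problem)
```

An empty result means every limit in the list is at or above its target and the soft and hard values match.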



Every mongos needs to connect to each shard node, and it also takes the incoming connections from the application servers. Keepalive settings are usually kept low to reduce the number of idle connections. Increase the number of connections that can be queued and the number of half-open connections allowed. To help with the load on the latter, make sure applications pool connections. Increase the connection tracking limit if running a firewall on the mongos node. See the Security section for more context.



-   net.core.somaxconn (increase the value)



-   net.ipv4.tcp_max_syn_backlog (increase the value)



-   net.ipv4.tcp_fin_timeout (reduce the value)



-   net.ipv4.tcp_keepalive_intvl (reduce the value)



-   net.ipv4.tcp_keepalive_time (reduce the value)



-   net.netfilter.nf_conntrack_max, or net.ipv4.netfilter.ip_conntrack_max on older kernels (increase the value)
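Before changing any of these, it helps to see what the node is currently running with. The sketch below reads the keys from `/proc/sys` and prints each one next to the direction suggested in the list. It deliberately carries no target values, since the right numbers depend on your workload. The helper names are hypothetical, and I am assuming the connection tracking key goes by its current-kernel name, `net.netfilter.nf_conntrack_max`.

```python
from pathlib import Path

# Direction of change suggested in the list above. No target values are implied.
NETWORK_KEYS = {
    "net.core.somaxconn": "increase",
    "net.ipv4.tcp_max_syn_backlog": "increase",
    "net.ipv4.tcp_fin_timeout": "reduce",
    "net.ipv4.tcp_keepalive_intvl": "reduce",
    "net.ipv4.tcp_keepalive_time": "reduce",
    "net.netfilter.nf_conntrack_max": "increase",  # assumed current-kernel name
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root) / key.replace(".", "/")


def read_sysctl(key, root="/proc/sys"):
    """Return the current value as a string, or None if the key is absent."""
    try:
        return sysctl_path(key, root).read_text().strip()
    except OSError:
        return None


def report(root="/proc/sys"):
    """Build one line per key: current value plus the suggested direction."""
    lines = []
    for key, direction in NETWORK_KEYS.items():
        value = read_sysctl(key, root)
        shown = value if value is not None else "not present"
        lines.append(f"{key} = {shown} ({direction})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The conntrack key only exists when the connection tracking module is loaded, which is why a missing file is reported rather than treated as an error.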



Tuning for mongod
=================



As the primary data server, mongod does a lot of IO for both disk and network. It also has many TCP connections. In a replica set, the mongod nodes also take application connections. Upon startup, mongod provides some helpful feedback if it thinks some minimum standards haven’t been met.



Like with mongos, throw the door open for user limits. mongod should be the priority process on the node. Increase processes/threads and open files, usually to 64000. mongod uses a file descriptor per active data file. Set the CPU time, memory and file size limits to unlimited. Set the soft and hard limits to the same values.



-   processes / threads (-u) = 64000



-   open files (-n) = 64000



-   cpu time (-t) = unlimited



-   virtual memory (-v) = unlimited



-   locked-in-memory (-l) = unlimited



-   file size (-f) = unlimited



-   memory size (-m) = unlimited



For mongod, get the best, fastest (virtual) disks you can. Use XFS; ext4 works but is not preferred. Either way, mongod requires a filesystem that supports fsync(). Use a separate data partition. MongoDB documentation says to use RAID 10, but RAID 0 is faster and you have redundancy via the replicas.



If you can, put logs and mongod journals on other filesystems as well. The separate filesystems need to be on different sets of disks to be of any advantage. As a sysadmin, be prepared for lots of logs, especially if a node goes down. MongoDB logging can be "are we there yet" verbose and dump gigabytes of logs in minutes. Quickly ship them to centralized logging, aggressively rotate them or use circular logs, because default log rotation leads to a full /var/ partition.



mongod does mostly random reads, so reduce readahead. Between 8 and 32 is recommended. Set noatime for the mount. If using NFS, also set bg and nolock. Use the deadline scheduler on bare metal, the noop scheduler on containers and VMs.
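To check a data disk against these recommendations, a small sketch like the following can help. I am assuming the 8 to 32 range is counted in 512-byte sectors, the unit `blockdev --getra` reports, while `/sys/block/<dev>/queue/read_ahead_kb` reports kilobytes, so the range has to be converted. The function names are hypothetical.

```python
SECTOR_BYTES = 512  # unit used by `blockdev --getra` and `--setra`


def sectors_to_kb(sectors):
    """Convert a readahead value in 512-byte sectors to kilobytes."""
    return sectors * SECTOR_BYTES // 1024


def readahead_ok(read_ahead_kb, low_sectors=8, high_sectors=32):
    """Check a read_ahead_kb value against the 8 to 32 sector range."""
    return sectors_to_kb(low_sectors) <= read_ahead_kb <= sectors_to_kb(high_sectors)


def active_scheduler(text):
    """Pick the active scheduler out of /sys/block/<dev>/queue/scheduler.

    The kernel marks the active entry with brackets, for example
    "noop [deadline] cfq". A file holding a single word has no choice to mark.
    """
    tokens = text.split()
    for token in tokens:
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    if len(tokens) == 1:
        return tokens[0]
    return None
```

For example, `readahead_ok(128)` is false because a 128 KB readahead is well above the 16 KB that 32 sectors works out to. On newer kernels the multi-queue equivalents show up as `mq-deadline` and `none`, so compare against whichever names your kernel offers.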



It’s important to reduce vm.swappiness for databases. Setting vm.swappiness to 0 is no longer recommended, as kernel behavior was modified. We still want it as low as practical, so set vm.swappiness to 1.



Reduce vm.dirty_ratio, the percentage of system memory that can fill with dirty pages before the kernel forces processes to write them out to disk. Also reduce vm.dirty_background_ratio, the point where background flushes are triggered.
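The memory settings above can be collected into a sysctl fragment. This is a minimal sketch that renders one. The vm.swappiness value of 1 comes from the text, but the two dirty ratio defaults (15 and 5) are placeholders I picked purely for illustration, not recommendations from this article, so pick your own after testing. `render_vm_sysctl` is a hypothetical helper name.

```python
def render_vm_sysctl(swappiness=1, dirty_ratio=15, dirty_background_ratio=5):
    """Render vm.* settings in sysctl.conf "key = value" format.

    The dirty ratio defaults are illustrative placeholders only.
    """
    for name, value in (("dirty_ratio", dirty_ratio),
                        ("dirty_background_ratio", dirty_background_ratio)):
        if not 0 <= value <= 100:
            raise ValueError(f"vm.{name} is a percentage, got {value}")
    # Background flushing should start before writers are forced to flush.
    if dirty_background_ratio >= dirty_ratio:
        raise ValueError("vm.dirty_background_ratio must be below vm.dirty_ratio")
    settings = [
        ("vm.swappiness", swappiness),
        ("vm.dirty_ratio", dirty_ratio),
        ("vm.dirty_background_ratio", dirty_background_ratio),
    ]
    return "\n".join(f"{key} = {value}" for key, value in settings) + "\n"


if __name__ == "__main__":
    print(render_vm_sysctl(), end="")
```

The output can be saved under `/etc/sysctl.d/` and loaded with `sysctl --system`. The file name is up to you.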



Disable non-uniform memory architecture, NUMA, if your platform allows it. Run irqbalance and review `/proc/interrupts` to make sure IRQs are being balanced between multiple CPUs. Disable transparent hugepages on Red Hat and CentOS systems that enable it.
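Reading `/proc/interrupts` by eye gets tedious on machines with many cores. The sketch below totals the device interrupts handled by each CPU. It counts only rows with a numeric IRQ label and skips the named rows (such as `LOC` or `NMI`), on the assumption that those per-CPU counters are not something irqbalance distributes. `per_cpu_totals` is a hypothetical helper name.

```python
def per_cpu_totals(text):
    """Sum numbered-IRQ counts per CPU from the contents of /proc/interrupts."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue  # named rows are per-CPU or global counters
        counts = rest.split()[:len(cpus)]
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue  # row without a count for every CPU
        for index, count in enumerate(counts):
            totals[index] += int(count)
    return dict(zip(cpus, totals))


if __name__ == "__main__":
    with open("/proc/interrupts") as handle:
        for cpu, total in per_cpu_totals(handle.read()).items():
            print(f"{cpu}: {total}")
```

If one CPU carries nearly all of the total while the others sit near zero, that is the cue to look at irqbalance or at the affinity of the busy IRQs.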



Virtualization
==============



In virtual environments, it’s even more important to have rack awareness. Make sure not to place multiple nodes of a replica set on the same host system.



Like with hardware systems, deployments in virtual environments should choose fast storage options with reliable throughput. They should also reduce the risk of memory overcommitment by mapping and reserving full memory.



For KVM, OpenVZ and VMware, be cautious of memory overcommitment. Also, check affinity rules to keep replica set nodes from colocating to the same hardware.



In AWS environments, use enhanced networking and provisioned IOPS rather than ephemeral storage. Disable DVFS, CPU power saving modes and hyperthreading. Bind memory to a single socket.



For Azure, use premium storage and adjust your TCP idle timeout.



Security
========



Like with any other TLS service, monitor certificate expiration. Also test full-chain verification from the application perspective. Regularly review TLS versions you want to allow on your network and update both server and client configurations as appropriate.
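Both checks can be scripted from an application server. Here is a minimal sketch using Python's standard `ssl` module: the handshake goes through `ssl.create_default_context()`, which verifies the full chain and the hostname, and the certificate's `notAfter` field gives the expiry. The default port of 27017 and the optional `cafile` argument for a private CA are my assumptions about your setup, and the function names are hypothetical.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative once expired).

    not_after uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400.0


def fetch_not_after(host, port=27017, cafile=None, timeout=5.0):
    """Connect with full-chain and hostname verification, return notAfter.

    Raises ssl.SSLCertVerificationError if the chain does not verify.
    Pass cafile when the cluster uses a private CA (assumption).
    """
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_until_expiry(fetch_not_after("db1.example.com"))` from a cron job or a monitoring check gives a number to alert on. The hostname here is a placeholder.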



SELinux causes segfaults on MongoDB operations that require server-side JavaScript. Disable one or the other.



MongoDB firewalls can be restrictive as the application servers are on the edge and MongoDB should be isolated. Linux stateful firewalls use connection tracking to maintain state information on connections. Monitor the conntrack tables to verify the firewall has enough resources to track all of your TCP connections.
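As a starting point for that monitoring, the kernel exposes the current and maximum number of tracked connections under `/proc/sys/net/netfilter` when the conntrack module is loaded. The sketch below reads both and reports how full the table is. The 80 percent warning threshold is an arbitrary assumption of mine, and the function names are hypothetical.

```python
from pathlib import Path


def read_conntrack(root="/proc/sys/net/netfilter"):
    """Return (count, maximum), or None if conntrack is not loaded."""
    base = Path(root)
    try:
        count = int((base / "nf_conntrack_count").read_text())
        maximum = int((base / "nf_conntrack_max").read_text())
    except (OSError, ValueError):
        return None
    return count, maximum


def conntrack_status(count, maximum, warn_at=0.8):
    """Summarize table usage; warn_at is an assumed threshold."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    usage = count / maximum
    state = "WARN" if usage >= warn_at else "OK"
    return f"{state}: {count}/{maximum} ({usage:.0%})"


if __name__ == "__main__":
    values = read_conntrack()
    print(conntrack_status(*values) if values else "conntrack not available")
```

When the table fills up, the kernel starts dropping new connections, so it is worth alerting well before usage reaches the maximum.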



For a sharded cluster, the mongos servers need to allow incoming TCP connections from the application servers. They also need to allow TCP connections between each of the mongos servers and each mongod node.



If running MongoDB 3.4 or newer, you’ll have a config replica set. The mongos need to be able to establish TCP connections with the config servers.



The mongod need to accept incoming TCP connections from the mongos. They also need to create and accept new TCP connections to/from all the other mongod nodes.



In a replica set there are no mongos; applications talk directly to the mongod nodes. Therefore, replica set mongod nodes need to accept TCP connections from the application servers. The mongod nodes in a replica set also need to be able to create and accept new TCP connections to/from all the other mongod nodes.



Up until MongoDB 3.6 there was an option to have an HTTP API server running on a different port.



All nodes need to allow connections from the appropriate monitoring and admin hosts. For instance, mongostat is an essential tool and pulls data from all nodes. Don’t forget to allow NTP updates as well.



Footnotes
=========



[0] There are options for a single instance with no replicas, or to use an arbiter rather than a second replica. This article is about performance. If you need performance, run at least a full MongoDB replica set. The mongod sections of the article still apply to a single instance running mongod and to the mongod nodes of a replica set with an arbiter.



Resources
=========



-   UNIX ulimit Settings from MongoDB docs



-   MongoDB production notes from MongoDB docs



-   Tuning Linux for MongoDB by Tim Vaillancourt for Percona



-   Optimizing Your Linux Environment for MongoDB by Onyancha Brian Henry for severalnines