CephFS requires an additional component to coordinate client access and metadata; this component is called the Metadata Server, or MDS for short. Although the MDS serves metadata requests to and from the client, the actual data read and written still goes directly via the OSDs. This approach minimizes the impact of the MDS on the filesystem's performance for bulk data transfers, although smaller, I/O-intensive operations can start to be limited by MDS performance. The MDS currently runs as a single-threaded process, so it is recommended to run the MDS on hardware with the highest-clocked CPU possible.
The MDS has a local cache for storing hot portions of the CephFS metadata to reduce the amount of I/O going to the metadata pool; this cache is stored in local memory for performance and can be controlled by adjusting the mds_cache_memory_limit configuration option, which defaults to 1 GB.
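As a sketch, the cache limit could be raised in ceph.conf; the 4 GB figure below is purely illustrative, and note that the option takes its value in bytes:

```
[mds]
# Raise the MDS cache limit from the 1 GB default to 4 GB (value in bytes)
mds_cache_memory_limit = 4294967296
```

Be aware that this is a limit on the cache's metadata accounting, not a hard cap on the MDS process's total memory usage, so leave headroom on the host.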
CephFS utilizes a journal stored in RADOS, mainly for consistency reasons. The journal stores the stream of metadata updates from clients before they are flushed into the CephFS metadata store. If an MDS is terminated, the MDS that takes over the active role can replay the metadata events stored in the journal. This replay is an essential part of the MDS becoming active and will therefore block until it has completed. The process can be sped up by having a standby-replay MDS that is constantly replaying the journal and is thus ready to take over the active role in a much shorter time. If you have multiple active MDSes, note that while a pure standby MDS can act as a standby for any active MDS, a standby-replay MDS has to be assigned to a specific MDS rank.
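On recent Ceph releases, standby-replay is enabled per filesystem rather than per daemon; a minimal sketch, assuming a filesystem named cephfs:

```
# Allow standby daemons to act as standby-replay for the "cephfs" filesystem
ceph fs set cephfs allow_standby_replay true
```

With this set, eligible standby MDS daemons will be assigned to follow active ranks and continuously replay their journals.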
As well as the active and replaying states, an MDS can also be in several other states; the ones you are likely to see in the ceph status output are listed here for reference when operating a Ceph cluster with a CephFS filesystem. Each state is split into two parts: the part on the left side of the colon shows whether the MDS is up or down, and the part on the right side of the colon represents its current operational state:
- up:active: This is the normal desired state; as long as at least one MDS is in this state, clients can access the CephFS filesystem.
- up:standby: This can be a normal state as long as one MDS is up:active. In this state, an MDS is online but not playing any active part in the CephFS infrastructure. It will come online and replay the CephFS journal in the event that the active MDS goes offline.
- up:standby_replay: Like the up:standby state, an MDS in this state is available to become active in the event of an active MDS going offline. However, a standby_replay MDS is continuously replaying the journal of the MDS it has been configured to follow, meaning the failover time is greatly reduced. It should be noted that while a standby MDS can replace any active MDS, a standby_replay MDS can only replace the one it has been configured to follow.
- up:replay: In this state, an MDS has begun taking over the active role and is currently replaying the metadata stored in the CephFS journal.
- up:reconnect: If there were active client sessions when the active MDS went offline, the recovering MDS will try to re-establish client connections in this state until the client timeout is hit.
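The two-part state format described above can be illustrated with a short Python sketch; the function names here are hypothetical helpers for illustration, not part of any Ceph library:

```python
def parse_mds_state(state: str) -> tuple[str, str]:
    """Split an MDS state string such as 'up:active' into its two parts:
    the liveness ('up' or 'down') and the operational state."""
    liveness, _, operational = state.partition(":")
    return liveness, operational

def filesystem_available(states: list[str]) -> bool:
    """Clients can access CephFS as long as at least one MDS is up:active."""
    return any(parse_mds_state(s) == ("up", "active") for s in states)

print(parse_mds_state("up:standby_replay"))               # ('up', 'standby_replay')
print(filesystem_available(["up:replay", "up:standby"]))  # False: no MDS is active yet
```

For example, a cluster whose only MDS is still in up:replay would report the filesystem as unavailable until the journal replay completes and the MDS transitions to up:active.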
Although there are other states an MDS can be in, it is likely that during normal operations they will not be seen and so have not been included here. Please consult the official Ceph documentation for more details on all available states.