Distribution in GlusterFS is handled by the Distributed Hash Table (DHT), which is loaded on the client stack. All operations are driven by the clients, which are all equal. There are no metadata servers or special nodes that hold additional information about where files are or where they should go. Any additional information about files or directories is stored in extended attributes (xattrs). Xattrs are a filesystem feature that lets users associate metadata with files and directories as key-value pairs. There are two main DHT-related xattrs: linkto and layout.
DHT creates directories on all the bricks. When a directory is created, a layout range is assigned to it and stored in the extended attribute trusted.glusterfs.dht. The range runs from 0x00000000 to 0xffffffff, and each brick is assigned a specific subset of it. The layout is complete and healthy when the range 0x00000000 to 0xffffffff is distributed across the volume without any gaps or overlaps. Creating the directory and setting its layout are both part of the mkdir operation. The layout can be set in one of two ways:
- Homogeneous layout, where each brick gets an equal range. This is the default option.
- Heterogeneous (weighted) layout, where the range assigned to a brick is based on the size of the brick. Larger bricks get larger ranges, which increases the probability that data lands on them.
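The two options above can be sketched in a few lines of Python. This is only an illustration of how a 32-bit hash space could be divided, not the GlusterFS implementation; the function name `assign_layout` and the `brick_sizes` mapping are hypothetical, and brick sizes are assumed to be positive numbers.

```python
HASH_SPACE = 0x100000000  # number of hash values: 0x00000000 through 0xffffffff


def assign_layout(brick_sizes, weighted=False):
    """Return {brick: (start, stop)} with inclusive bounds covering the hash space.

    brick_sizes maps a (hypothetical) brick name to its size in any unit.
    With weighted=False the sizes are ignored and every brick gets an equal share.
    """
    names = list(brick_sizes)
    weights = [brick_sizes[n] if weighted else 1 for n in names]
    total = sum(weights)

    layout = {}
    start = 0
    cumulative = 0
    for name, weight in zip(names, weights):
        cumulative += weight
        # Exclusive end computed from the cumulative weight, so consecutive
        # ranges are contiguous and the last one ends exactly at 0xffffffff.
        end = HASH_SPACE * cumulative // total
        layout[name] = (start, end - 1)
        start = end
    return layout


if __name__ == "__main__":
    for brick, (lo, hi) in assign_layout({"b1": 1, "b2": 1, "b3": 2}, weighted=True).items():
        print(f"{brick}: 0x{lo:08x} - 0x{hi:08x}")
```

With `weighted=True`, a brick that is twice as large as its peers receives a range twice as wide, so roughly twice as many file names hash to it.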
Let's take an example. Consider a volume with three bricks (b1, b2, b3). If you create a directory dir1, the layout would look something like this:
# file: export/testvol/brick1/dir1
# file: export/testvol/brick2/dir1
# file: export/testvol/brick3/dir1
There are two main types of anomalies that can be seen in a layout:
- Holes – if a directory has no layout on a brick, it is called a hole. Without a layout on the directory, no files can be stored in it on that brick.
- Overlaps – all bricks must have exclusive layout ranges; if the ranges of two bricks intersect, it is an overlap.
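A health check for both anomalies can be sketched as follows. It treats a hole as the part of the hash range that no brick covers (which is what a brick without a layout leaves behind) and an overlap as a part covered more than once. The function name `find_anomalies` is hypothetical and the code is an illustration, not the check GlusterFS itself performs.

```python
MAX_HASH = 0xFFFFFFFF


def find_anomalies(ranges):
    """Given a list of inclusive (start, stop) ranges, return (holes, overlaps).

    Each returned item is itself an inclusive (start, stop) range.
    """
    holes, overlaps = [], []
    expected = 0  # lowest hash value not yet covered by any range
    for start, stop in sorted(ranges):
        if start > expected:
            # Nothing covers expected .. start-1.
            holes.append((expected, start - 1))
        elif start < expected:
            # This range re-covers values already assigned to an earlier one.
            overlaps.append((start, min(stop, expected - 1)))
        expected = max(expected, stop + 1)
    if expected <= MAX_HASH:
        holes.append((expected, MAX_HASH))
    return holes, overlaps


if __name__ == "__main__":
    # The middle third of the range is missing, as if one brick had no layout.
    print(find_anomalies([(0x00000000, 0x55555554), (0xAAAAAAAA, 0xFFFFFFFF)]))
```

A layout is complete and healthy in the sense used above exactly when both returned lists are empty.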
Unlike directories, a file has to be present on only one subvolume (subvol). Given a file, we compute its hash value and find the brick whose layout range contains that value. This brick is known as the hashed brick. The brick on which the data file actually exists is the cached brick. For a newly created file, the hashed and the cached brick will usually be the same. In the example above, a file created under the directory dir1 will be created on only one of the bricks.
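The lookup of the hashed brick can be sketched like this. The hash function here is a stand-in (`zlib.crc32`) chosen only because it yields a 32-bit value; it is not the hash GlusterFS uses, and hashing the file name alone is an assumption made for the illustration. The names `file_hash` and `hashed_brick` are hypothetical.

```python
import zlib


def file_hash(name):
    """Stand-in 32-bit hash of a file name (NOT the GlusterFS hash function)."""
    return zlib.crc32(name.encode("utf-8")) & 0xFFFFFFFF


def hashed_brick(name, layout):
    """Return the brick whose inclusive (start, stop) range contains the hash.

    layout is the parent directory's layout: {brick: (start, stop)}.
    Returns None if the hash falls into a hole.
    """
    h = file_hash(name)
    for brick, (start, stop) in layout.items():
        if start <= h <= stop:
            return brick
    return None


if __name__ == "__main__":
    dir1_layout = {
        "b1": (0x00000000, 0x55555554),
        "b2": (0x55555555, 0xAAAAAAA9),
        "b3": (0xAAAAAAAA, 0xFFFFFFFF),
    }
    print(hashed_brick("file1", dir1_layout))
```

Because the result depends only on the name and the directory's layout, every client arrives at the same brick without consulting a metadata server.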
However, when a file is renamed, the destination file’s hashed brick may differ from the source file’s hashed brick. In this case, instead of moving the entire data file to the new hashed brick, we create a linkto file. This is a 0-byte file created on the new hashed brick with mode ---------T (no permissions except for the sticky bit ‘T’). The linkto file acts as a pointer to the brick where the data file actually exists (which is still the old hashed brick). It carries an xattr called trusted.glusterfs.dht.linkto, which stores the name of the brick on which the data file actually exists. Now the brick on which the linkto file exists is the hashed brick, and the brick on which the actual data file exists is the cached brick. All file operations (fops) on the file first land on the hashed brick and are redirected to the cached brick by reading the linkto file’s xattr.
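The rename and redirect behaviour can be modelled with a small in-memory sketch. Bricks are plain dictionaries here, and the helpers `rename` and `read` are hypothetical names used only for illustration; no real GlusterFS API is involved, and the caller is assumed to already know which brick the new name hashes to.

```python
LINKTO_XATTR = "trusted.glusterfs.dht.linkto"  # key of the linkto xattr


def rename(bricks, cached, old_name, new_name, new_hashed):
    """Rename a file whose data lives on brick `cached`.

    bricks:     {brick: {file_name: {"data": bytes, "xattrs": dict}}}
    new_hashed: the brick that new_name hashes to.
    """
    entry = bricks[cached].pop(old_name)
    bricks[cached][new_name] = entry  # the data file stays on the cached brick
    if new_hashed != cached:
        # Zero-byte pointer file on the new hashed brick.
        bricks[new_hashed][new_name] = {"data": b"", "xattrs": {LINKTO_XATTR: cached}}


def read(bricks, hashed, name):
    """Land on the hashed brick first, then follow the linkto xattr if present."""
    entry = bricks[hashed][name]
    target = entry["xattrs"].get(LINKTO_XATTR)
    if target is not None:
        entry = bricks[target][name]
    return entry["data"]


if __name__ == "__main__":
    bricks = {"b1": {"old.txt": {"data": b"payload", "xattrs": {}}}, "b2": {}, "b3": {}}
    rename(bricks, cached="b1", old_name="old.txt", new_name="new.txt", new_hashed="b3")
    print(read(bricks, hashed="b3", name="new.txt"))
```

The rename costs one small file creation instead of a full data copy, at the price of an extra hop on later operations that start at the hashed brick.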