DHT – Distributed Hash Table

Distribution in GlusterFS is handled by DHT, the Distributed Hash Table translator, which is loaded on the client stack. All operations are driven by the clients, which are all equal. There are no metadata servers or special nodes holding additional information about where files are present or should go. Any such information about files or directories is stored in extended attributes, or xattrs. Xattrs are a filesystem feature that lets users associate metadata with files and directories; they store information as key-value pairs. There are mainly two DHT-related xattrs: the layout xattr and the linkto xattr.
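
As a minimal sketch, this is what reading a DHT xattr off a brick directory looks like with the Linux getxattr(2) call (the brick path is taken from the example further below; reading trusted.* xattrs requires root on the brick):

#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(void)
{
    unsigned char buf[64];
    ssize_t n;

    /* Brick-side path from the dir1 example below. */
    n = getxattr("/export/testvol/brick1/dir1",
                 "trusted.glusterfs.dht", buf, sizeof(buf));
    if (n < 0) {
        perror("getxattr");
        return 1;
    }

    printf("trusted.glusterfs.dht = 0x");
    for (ssize_t i = 0; i < n; i++)
        printf("%02x", buf[i]);
    printf("\n");
    return 0;
}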

DHT creates directories on all the bricks. When a directory is created, a layout range is assigned to it and stored in the extended attribute called trusted.glusterfs.dht. The range runs from 0x00000000 to 0xffffffff, and each brick is assigned a specific subset of it. The layout is complete and healthy when the whole range 0x00000000 to 0xffffffff is distributed across the volume without any gaps or overlaps. The directory creation and the setting of the layout are both part of the mkdir operation. While setting the layout we can do either:

  1. Homogeneous setting of the layout, where each brick gets an equal range (see the sketch after this list). This is the default option.
  2. Heterogeneous, weighted assignment, where the layout range given to each brick is based on its size. Larger bricks get larger layout ranges, increasing the probability of data landing on those bricks.
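
As a minimal sketch (not GlusterFS source), the homogeneous case amounts to dividing the 32-bit hash space evenly among the bricks, with the remainder going to the last brick; for three bricks this reproduces exactly the ranges shown in the example below:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t nbricks = 3;
    /* Range per brick: 2^32 divided by the brick count. */
    uint64_t chunk = ((uint64_t)1 << 32) / nbricks;

    for (uint32_t i = 0; i < nbricks; i++) {
        uint64_t start = (uint64_t)i * chunk;
        uint64_t stop  = (i == nbricks - 1) ? 0xffffffffULL
                                            : start + chunk - 1;
        printf("brick%u: 0x%08llx - 0x%08llx\n", i + 1,
               (unsigned long long)start, (unsigned long long)stop);
    }
    return 0;
}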

Let's take an example. Consider a volume with three bricks (b1, b2, b3). If you create a directory dir1, the layout would look something like this:

# file: export/testvol/brick1/dir1
trusted.gfid=0x3c3f4bc85885438f98825ef348a95984
trusted.glusterfs.dht=0x00000001000000000000000055555554

# file: export/testvol/brick2/dir1
trusted.gfid=0x3c3f4bc85885438f98825ef348a95984
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9

# file: export/testvol/brick3/dir1
trusted.gfid=0x3c3f4bc85885438f98825ef348a95984
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff

There are mainly two types of anomalies that can be seen w.r.t. the layout:

    1. Holes – if a directory on a brick has no layout range assigned to it, that is a hole. With no layout on the directory, no files can be stored on that brick.
    2. Overlaps – all bricks must have exclusive layout ranges; if the ranges of two bricks overlap, that is an overlap.

Unlike directories, a file has to be present on only one subvolume. Given a file, we compute its hash value and find the brick whose layout range the hash falls in. This brick is known as the hashed brick. The brick on which the data file actually exists is the cached brick. For a newly created file the hashed and the cached brick will usually be the same. Continuing the example above, a file created under dir1 will be created on only one of the bricks, as the sketch below shows.
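
GlusterFS hashes the file name with a Davies-Meyer based function; in this sketch a toy hash merely stands in for it, and the layout table is the dir1 layout from the example above:

#include <stdint.h>
#include <stdio.h>

struct range { uint32_t start; uint32_t stop; };

/* Toy djb2-style hash, standing in for GlusterFS's real hash. */
static uint32_t toy_hash(const char *name)
{
    uint32_t h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}

int main(void)
{
    /* Layout ranges taken from the dir1 example above. */
    struct range layout[3] = {
        { 0x00000000, 0x55555554 },
        { 0x55555555, 0xaaaaaaa9 },
        { 0xaaaaaaaa, 0xffffffff },
    };
    const char *fname = "file1";
    uint32_t h = toy_hash(fname);

    for (int i = 0; i < 3; i++)
        if (h >= layout[i].start && h <= layout[i].stop)
            printf("%s (hash 0x%08x) hashes to brick%d\n",
                   fname, h, i + 1);
    return 0;
}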

However, when renaming a file, the destination file's hashed brick may be different from the source file's hashed brick. In this case, instead of actually moving the entire data file to the new hashed brick, we create a linkto file. This is a 0-byte file created on the new hashed brick with mode ---------T (no permissions except the sticky bit 'T'). The purpose of the linkto file is to act as a pointer to the brick where the data file actually exists (which is still located on the old hashed brick). It carries an xattr called trusted.glusterfs.dht.linkto, which stores the name of the subvolume on which the data file actually exists. Now the brick on which the linkto file exists is the hashed brick, and the brick on which the actual data file exists is the cached brick. All fops on the file first land on the hashed brick and are redirected to the cached brick by reading the linkto file's xattr.
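
A rough sketch of how such a file can be recognized on a brick (is_linkto is a hypothetical helper, not GlusterFS source; again, reading trusted.* xattrs requires root on the brick):

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/xattr.h>

int is_linkto(const char *path)
{
    struct stat st;
    char target[256];
    ssize_t n;

    /* A linkto file is empty and has only the sticky bit set. */
    if (lstat(path, &st) != 0 || st.st_size != 0)
        return 0;
    if ((st.st_mode & 07777) != S_ISVTX)
        return 0;

    /* Its linkto xattr names the subvolume holding the data file. */
    n = getxattr(path, "trusted.glusterfs.dht.linkto",
                 target, sizeof(target) - 1);
    if (n < 0)
        return 0;
    target[n] = '\0';
    printf("%s points to subvolume %s\n", path, target);
    return 1;
}

int main(int argc, char **argv)
{
    if (argc == 2 && !is_linkto(argv[1]))
        printf("%s does not look like a linkto file\n", argv[1]);
    return 0;
}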

 


GlusterFS and types of volumes

As discussed in another post, a volume is a logical collection of bricks and most file operations happen on the volume. GlusterFS supports several types of volumes.

Some volumes are good for scaling storage size, some for improving performance and some for both.

  1. Distributed Glusterfs Volume – This is the default glusterfs volume, i.e. if you do not specify the type while creating a volume, a distributed volume is created. In this volume, files are spread across the bricks in the volume, with each file living on exactly one brick. There is no data redundancy. The purpose of such a volume is to easily scale the volume size. However, this also means that a brick failure leads to complete loss of the data on that brick, and one must rely on the underlying hardware for data-loss protection.
  2. Replicated Glusterfs Volume – This volume overcomes the data-loss problem of the distributed volume. Here an exact copy of the data is maintained on all bricks. The number of replicas in the volume is chosen by the user while creating the volume. One major advantage of such a volume is that even if one brick fails, the data can still be accessed from its replica brick. Such a volume is used for better reliability and data redundancy.
  3. Distributed Replicated Glusterfs Volume – This volume combines the advantages of both the distributed and the replicated volume: files are distributed across replicated sets of bricks. The number of bricks must be a multiple of the replica count, and the order in which the bricks are specified matters, since adjacent bricks become replicas of each other. So if there were eight bricks and a replica count of 2, the first two bricks become replicas of each other, then the next two, and so on; this volume is denoted 4×2. Similarly, with eight bricks and a replica count of 4, every four bricks become replicas of each other, and we denote this as a 2×4 volume (see the sketch after this list). This type of volume is used when both high availability through redundancy and storage scaling are required.
  4. Striped Glusterfs Volume – Consider a large file stored on a single brick and frequently accessed by many clients at the same time. This puts too much load on one brick and reduces performance. In a striped volume the data is divided into stripes before being stored in the bricks. The large file is divided into smaller chunks (as many as there are bricks in the volume) and each chunk is stored on a brick. The load is now distributed and the file can be fetched faster, but no data redundancy is provided.
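
A tiny sketch (with hypothetical brick names) of how adjacent bricks are grouped into replica sets for an eight-brick, replica-2 (4×2) volume:

#include <stdio.h>

int main(void)
{
    const char *bricks[] = { "s1:/b", "s2:/b", "s3:/b", "s4:/b",
                             "s5:/b", "s6:/b", "s7:/b", "s8:/b" };
    int nbricks = 8, replica = 2;

    /* Adjacent bricks, in the order given, form a replica set. */
    for (int set = 0; set < nbricks / replica; set++) {
        printf("replica set %d:", set);
        for (int r = 0; r < replica; r++)
            printf(" %s", bricks[set * replica + r]);
        printf("\n");
    }
    return 0;
}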

GlusterFS – a distributed file system

Looking at the topic, the first question that arises is what a distributed file system is and why we need it. A distributed file system gives us a way of storing and accessing files in a client/server architecture. We can use more than one server to store data and multiple clients (local or remote) that access data from these servers. A distributed file system organizes and displays files and directories from multiple servers as if they were stored on your local system, thereby presenting a simple interface to the user or application. The main advantages are that it is easier to distribute documents to multiple clients, and that it provides a centralized storage system so that client machines are not using their own resources to store the data.

Let's look at some GlusterFS terminology:

  1. A brick is the basic unit of storage, represented by an export directory on a machine.
  2. A storage pool is a trusted network of storage servers. You can dynamically add or remove nodes from the pool. Only nodes that are in a trusted storage pool can participate in volume creation.
  3. A volume is a logical collection of bricks. Most of the gluster management operations happen on the volume. Multiple volumes can be hosted on the same node.
  4. Translators are the building blocks of GlusterFS. All functionality is implemented as translators, which are stacked together to achieve the desired functionality.
  5. A client is the machine which mounts the volume (this may also be a server).

 

GlusterFS is a distributed file system capable of scaling to several petabytes and handling thousands of clients. GlusterFS clusters storage building blocks together over RDMA or TCP/IP and aggregates their disk and memory resources to manage data in a single global namespace. It does not need any special hardware to store data. It is based on a stackable user-space design.

 

GlusterFS is a user-space file system and hence uses FUSE (Filesystem in Userspace) to hook itself into the VFS layer. FUSE is a loadable kernel module that lets non-privileged users create their own file systems without editing kernel code. Gluster uses tried and tested disk file systems like ext4, XFS, etc. as the underlying filesystems to store the data.
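
To make the FUSE idea concrete, here is a minimal sketch along the lines of libfuse's classic hello.c example (assuming libfuse 2.x is installed): a read-only filesystem exposing a single file, /hello.

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

static const char *hello_str  = "Hello World!\n";
static const char *hello_path = "/hello";

/* Report attributes: "/" is a directory, "/hello" a read-only file. */
static int hello_getattr(const char *path, struct stat *stbuf)
{
    memset(stbuf, 0, sizeof(struct stat));
    if (strcmp(path, "/") == 0) {
        stbuf->st_mode  = S_IFDIR | 0755;
        stbuf->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        stbuf->st_mode  = S_IFREG | 0444;
        stbuf->st_nlink = 1;
        stbuf->st_size  = strlen(hello_str);
    } else
        return -ENOENT;
    return 0;
}

/* The root directory lists exactly one entry: hello. */
static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0); /* skip the leading '/' */
    return 0;
}

/* Serve reads out of the in-memory string. */
static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(hello_str);
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return size;
}

static struct fuse_operations hello_oper = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* fuse_main parses the mount point from argv and runs the loop. */
    return fuse_main(argc, argv, &hello_oper, NULL);
}

Compile with something like gcc hello.c -o hello `pkg-config fuse --cflags --libs`, then run ./hello /mnt/point.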

My talk at Vault Linux Filesystem and Storage

Recently I got an opportunity to present a talk at the Vault Linux Filesystem and Storage conference held at Raleigh, North Carolina. This was my first experience giving a talk and I got to learn many things. My talk was on GlusterFS and its distribution model; you can have a look at my slides at [1]. The conference was a two-day event and my talk was one of the first talks on day 1. For me this was a huge advantage, since I could finish my talk first and then enjoy the remaining talks without any worries. Though I had been preparing for a long time, I was pretty nervous once the talk started. The talk on the whole went smoothly, without anything bad happening (before the event I had imagined a variety of things that could go wrong, ranging from going blank to the mic not working). Having given one talk has made me much more confident about future talks, and I realized that the only way to overcome stage fright is to keep giving more and more talks. The talk also showed me some of my mistakes and has helped me improve. And I got to meet and interact with a lot of people.

Go Goa Gone…!

One of my friends once suggested to me that Goa is a must-visit place for bachelors. So, though I had been to Goa as a kid with my family and family friends (a huge group), I decided to go there with my 'bachelor' friends. We were a group of five (four girls and a guy :P) and set off on our Goa trip.

Day 0: We boarded our bus for Karwar, which is 40 km from south Goa. This was a pretty uneventful bus ride, if you exclude sleeping uncomfortably.

Day 1: On reaching Karwar, we took a bus to south Goa and finally reached our hotel alongside Palolem beach. We dumped our luggage, rented bikes and rode to Turtle beach, which was a few km from Palolem. Finally we got the feeling that we were in Goa. We had a great time riding bikes but were pretty tired and exhausted by the time we reached Turtle beach. There was a small restaurant nearby, so we decided to have brunch. Turtle beach has become one of my favorite beaches; it was quiet and clean, and we had a great time there. After three hours of playing at the beach we rode back, bathed, and went to Palolem beach to watch the sunset, which, I cannot emphasize enough, was beautiful. We spent some time walking, building sand castles and talking on the beach, and then crashed.

Day 2: The next day we got ready, bid farewell to south Goa and went to Dudhsagar falls, a couple of hours' drive from Palolem beach. After a small trek we reached the falls, which were amazing, and the water had a great de-tanning effect on us. After spending a couple of hours at the falls we returned and had food. We then proceeded towards north Goa. The drive was amazing, as we were singing all the way and taking in the scenery. We reached north Goa later that night.

Day 3: This was one of the most exciting days for us, as we were to go for water sports. We rented bikes and set off to Anjuna beach. There were various water sports to try there and obviously we tried most of them: jet-ski, banana ride, bumper ride and para-sailing. For me the highlight was when we got to jump into deep water after para-sailing; it was so much fun to just float there! We then played on the beach for almost three hours. By then we were completely hungry, so we found a shack, ordered some burgers and watched the sunset. After a couple of hours of relaxing we rode back. On reaching our hotel we quickly got ready and headed for a casino. Though I was very tired and had no interest in going to the casino, I actually enjoyed it a lot, and it turned out to be one of the highlights of the trip for me. Initially we just roamed from table to table having no idea what to do, but soon we found a game (which, sadly, I don't remember the name of) and got really interested in it. We played it for a while and even won for some time :). After around four hours we returned to our hotel.

Day 4: This was the last day of the trip, and though nobody mentioned it, we were all a bit sad. We got up a bit late, packed and checked out. We checked out Tito's lane for a bit, but it was hot, so we went to Baga beach and played there for a few hours. We came back to our hotel and changed again, returned our bikes and booked a taxi to Panjim to board our bus. Since we reached Panjim a little early, we went to a park, where we rested, talked and had some snacks. We then boarded our bus and immediately slept.

The next morning we reached Bangalore, and with heavy hearts and a lot of memories we said our goodbyes, with promises to travel again soon…

 

Implementing ls -l command in Linux

The bash shell gives us an interactive session in which the user can type various commands and get back results. One of the most commonly used commands is ls, which lists all the files in a particular directory. The ls command also has numerous variations, like:

ls -a : lists all files, including hidden ones

ls -l : lists file names along with other details like owner, permissions, size, last modification time etc.

ls -R : recursively lists files present in the directories inside the main directory

Given below is a program which is a simple implementation of the ls -l command:

#include <unistd.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    DIR *dp;
    struct dirent *dirp;
    char path[4096];

    if (argc != 2) {
        fprintf(stderr, "usage: %s directory\n", argv[0]);
        return 1;
    }

    if ((dp = opendir(argv[1])) == NULL) {
        perror(argv[1]);
        return 1;
    }

    while ((dirp = readdir(dp)) != NULL) {
        struct stat fileStat;

        /* Build the full path: d_name alone is relative to argv[1],
         * not necessarily to the current directory. */
        snprintf(path, sizeof(path), "%s/%s", argv[1], dirp->d_name);

        /* Use lstat (not stat) so that symbolic links are not
         * followed and S_ISLNK below can actually detect them. */
        if (lstat(path, &fileStat) != 0) {
            perror(path);
            continue;
        }

        printf("%s\n", dirp->d_name);
        printf("---------------------------\n");
        printf("File Size: \t\t%lld bytes\n", (long long)fileStat.st_size);
        printf("Number of Links: \t%ld\n", (long)fileStat.st_nlink);
        printf("File inode: \t\t%llu\n", (unsigned long long)fileStat.st_ino);

        printf("File Permissions: \t");
        printf((S_ISDIR(fileStat.st_mode)) ? "d" : "-");
        printf((fileStat.st_mode & S_IRUSR) ? "r" : "-");
        printf((fileStat.st_mode & S_IWUSR) ? "w" : "-");
        printf((fileStat.st_mode & S_IXUSR) ? "x" : "-");
        printf((fileStat.st_mode & S_IRGRP) ? "r" : "-");
        printf((fileStat.st_mode & S_IWGRP) ? "w" : "-");
        printf((fileStat.st_mode & S_IXGRP) ? "x" : "-");
        printf((fileStat.st_mode & S_IROTH) ? "r" : "-");
        printf((fileStat.st_mode & S_IWOTH) ? "w" : "-");
        printf((fileStat.st_mode & S_IXOTH) ? "x" : "-");
        printf("\n\n");

        printf("The file %s a symbolic link\n",
               (S_ISLNK(fileStat.st_mode)) ? "is" : "is not");
    }

    closedir(dp);
    return 0;
}

The opendir function returns a pointer to a DIR structure, which we pass to the readdir function. We then call readdir in a loop to read each directory entry. The readdir function returns a pointer to a dirent structure or, when it has finished with the directory, a null pointer.

X86 Assembly

There are three basic kinds of instructions:

  • Transfer data between memory and register:
    • Load data from memory to register – %reg = Mem[address]
    • Store register data into memory – Mem[address] = %reg
  • Perform arithmetic functions on register or memory data. Eg: c = a + b
  • Transfer Control:
    • Change the execution of program.
    • Unconditional jumps
    • Conditional branches

Moving data:
movX source, dest – where X is {b, w, l}

movl source, dest – move 4 bytes long word
movb source, dest – move 1 byte
movw source, dest – move 2 bytes word

Operands:

  • Immediate – constant integer data. Eg: $0x400, $-533. All constants are prefixed with $.
  • Register – eg: %eax, %edx
  • Memory – 4 consecutive bytes of memory, at the address held in a register. Eg: (%eax) – this means that %eax stores an address.

There can be various operand combinations like:
Source: Immediate
Destination: Register – movl $0x4, %eax – var a = 0x4
Destination: Memory – movl $0x4, (%eax) – *p = 0x4

Source: Register
Destination: Register – movl %edx, %eax – var a = b
Destination: Memory – movl %edx, (%eax) – *p = b

Source: Memory
Destination: Register – movl (%eax), %edx – var a = *p

We cannot do a memory-to-memory transfer in a single instruction, so we do a memory-to-register transfer followed by a register-to-memory transfer.
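
For example, to copy a word from one memory location to another (registers chosen only for illustration):

movl (%eax), %edx – temp = *p
movl %edx, (%ecx) – *q = temp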

Memory addressing modes:

  • Indirect: a register contains an address, and we use the register's content as the address into memory – Memory[Register[R]]
  • Displacement: a constant D is added to the base register – Memory[Register[R] + D]

The general form is D(Rb, Ri, S), where D is the constant displacement, Rb is the base register, Ri is the index register (any register except %esp or %eip) and S is the scale factor: 1, 2, 4 or 8 bytes. The effective address is D + Register[Rb] + S*Register[Ri].
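
For example (an illustrative instruction, not taken from any particular program):

movl 8(%ebp,%ecx,4), %eax – %eax = Memory[Register[%ebp] + 8 + 4*Register[%ecx]]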

Observations:

  • In the assembly code the instructions might be in a different order from the C code.
  • Some expressions require multiple instructions.
  • We get the exact same code after compiling.

Conditionals and Control flow:

There are four condition codes which can be set while evaluating an expression or executing a branch:
CF – carry flag
ZF – zero flag
SF – sign flag
OF – overflow flag
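
For example (some_label is a hypothetical jump target), a compare followed by a conditional branch:

cmpl %ebx, %eax – sets the flags based on %eax - %ebx without storing the result
je some_label – jumps if ZF is set, i.e. if %eax == %ebx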