Getting to know the file system

What is a file system?

They support many different file systems in known operating systems. Windows supports FAT, FAT32, and NTFS. Linux supports more than 10 file systems, such as EXT2, EXT3, EXT4, NFS, and NTFS.

We’re going to learn about file systems, so we need to know what is a file system? Wikipedia explains the file system this way:

In computing, a file system or filesystem (often abbreviated to fs) controls how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stops and the next begins. By separating the data into pieces and giving each piece a name, the data is easily isolated and identified. Taking its name from the way paper-based data management system is named, each group of data is called a “file”. The structure and logic rules used to manage the groups of data and their names is called a “file system”.

In a computer, the file system controls how data is accessed. Without a file system, the data placed in the storage medium would be a vast body of data, with no way to tell where one piece of data stops and the next begins. By dividing the data into pieces and giving each piece a name, the data can be easily isolated and identified. Each set of data is called a file, named after the paper management system. The management of a set of data and names through structural and logical rules is called a file system.

When using a file system in a computer, users do not need to care about how data is stored on disk or which blocks the data is stored. They only need to remember the directory and name of the file to access the data in the file. The user does not care which blocks of the disk are not being used. The operating system allocates and frees space on the disk, and only needs to know which files the data is written to.

Simply put, a file system is an abstract data structure that an operating system provides users with access to data, making it easy for users to store and find data.

What is a file?

File systems access files, so what are the files we use every day?

A file is a meaningful collection of information/data

Start with the Windows operating system, which we are most familiar with.

File properties

What properties does a file have?

  • File name: The file name is determined by the user who creates the file. Files with the same name cannot exist in the same directory.
  • Identifiers: Each file in a system has a unique identifier that is unreadable to the user, so identifiers are just internal names that the operating system uses to distinguish between files.
  • File type: Indicates the type of the file, for example, TXT and JPG.
  • Location: the path where the file is stored (for users) and the disk address (for the operating system).
  • Size: Specifies the size of the file.
  • Protected information: Access control information used to protect files.

The information listed above is the main information, but there are many other information, such as: file owner information, creation time, modification time, etc.

In fact, there is a space occupied property above, we can see that it is larger than the size property value. How does it relate to the size property? Why do they have different numbers? Which is the true size of the data? Why is the size of the data different from the space occupied by the data?

This depends on how the file system divides storage units. The storage unit of a disk is a sector (usually 512 bytes), while the storage unit of a file system is a cluster (usually 4k). These problems will be explained later.

Logical file structure

(1) Unstructured files (such as text files), consisting of some binary or character streams, also known as stream files. (2) structured files (such as database tables), composed of a group of similar records, also known as record files.

The directory structure

As we all know, there are many files, so how are files organized?

In a file system, there are not only files, but also directories. A directory is a special file. It can store not only files, but also other directories, and files are placed in a directory.

On a Windows operating system, the root directory is drive C and drive D. On a Linux operating system, the root directory is /. The so-called directory is actually the folder we are familiar with, the user can create a layer of directory, each layer of directory can store other directories, but also can store files, files in the system is through a layer of directories reasonably organized.

A directory is also a special kind of structured file.

The system calls

Since the file system is provided to users, the operating system should provide users with the following functions.

  • Create file:createSystem call.
  • Delete file:deleteSystem call.
  • Read the file:readSystem call to read file data from disk into memory.
  • Write file:writeSystem call to write file data from memory back to disk.
  • Open file:openSystem call, need to “open file” before reading/writing file.
  • Close the file:closeAfter reading/writing the file, you need to close the file.

These basic functions can be combined to form some complex functions, such as copying files, you can create a new file, then read other file memory into memory, and then write the data in memory to the new file.

Logical structure of the file

The logical structure of a file refers to how the file is organized internally as it appears to the user. The logical structure can be realized by different physical structures. For example, a linear list is a logical structure, which can be realized by using arrays or linked lists. Arrays are logically and physically continuous, while linked lists are logically continuous and physically discrete. Therefore, arrays can achieve random access, whereas linked lists cannot.

We learned in the previous section that the logical structure of files is divided into unstructured files and structured files.

Unstructured file: A file consists of a series of sequential binary streams or character streams, commonly text documents.

Structured files: Our focus is on structured files. It is composed of a group of similar records, and each record item is composed of a number of data items, common have, database table. Generally, each record has a data item as a key. Records can be divided into fixed-length records and variable-length records, depending on whether the length of each record is the same.

No structure file

Both UNIX and Windows use unstructured files because, in fact, the operating system doesn’t care what the contents of the file are. All the operating system sees are bytes, and what the file represents is interpreted by the user program. Seeing files as byte sequences gives the operating system great flexibility. User programs can add anything to files and name them in any convenient form. The operating system does not help, nor does it hinder.

Structured file

Sequential file

Records in a file are ordered (logically) one after the other, and may be of fixed or variable length. Individual records can be physically stored sequentially or chained.

How are the records in a sequential file arranged? Because different permutations can also affect the efficiency of the search. In a sequential file, the order of the records is of two forms:

  • String structure: The order between records is independent of the keyword and is usually determined by the time the records are stored.
  • Sequential structure: The order between records is arranged in keyword order.

Suppose you already know the starting address of the file, where the first record is stored. Now let’s think about two questions:

  1. Can the address of the ith record be found quickly, that is, can random access be realized?
  2. Can you quickly find the record location corresponding to a keyword?

First of all, if you use chain storage to implement sequential files, then the above two problems are inevitably impossible to implement. Because of the nature of chained storage, it can’t do that. So let’s think about whether sequential storage can be satisfied.

When using sequential storage, it is important to know that there are two types of records: fixed-length records and variable-length records. The following figure shows the structure of records stored sequentially.

According to the above figure, it can be concluded that variable length records cannot realize random access because of the uncertainty of record length. Random storage can be realized if the records of fixed length are physically continuous, and fast retrieval can be realized if the sequential structure of records can be guaranteed.

Index file

For variable-length records, the previous I-1 record must be searched before the i-th record is searched. In many cases, however, variable-length records must be used. So, there’s a new file structure, index files.

It speeds up file retrieval by creating an index table, and each record is an index item.

The index table itself is a sequential file of fixed-length records, so you can quickly find the index entry corresponding to the ith record. Keywords can be used as the index number. If the index is sorted by keywords, binary search by keywords can be supported. Because index files are fast retrieval speed, so it is mainly used for the occasion of high requirements on the timeliness of information processing.

In addition, you can create multiple index tables with multiple different data items. If you have studied databases, you can easily imagine that this is a common index in a database. You can create different indexes for the same table based on different fields to quickly find data.

Indexed sequential file

Let’s look at the disadvantages of index files. In an index file, each record in the file has an index entry corresponding to it. If the file record content is small, the index table occupies more space than the file itself, resulting in low disk utilization. Therefore, we should try to reduce the index entries in the index table.

As a result of the above problems, a new file storage structure – indexed sequential files emerged. As the name suggests, this is a combination of the idea of indexed files and sequential files.

In an indexed sequential file, an index table is also created for files, but unlike an index file, it does not create an index entry for each record, but for a group of records.

Data items in the index table do not need to be sorted by keyword, which facilitates insertion of new data items. Each data item in the index table points to a group. Each group is a sequential file, and the records in the group do not need to be sorted by keyword.

Index sequential files can reduce the space occupied by index tables, but it can be seen that index tables and file records are not sorted, so it is really fast for variable length records?

Assuming that there are 10,000 records in a sequential file, and that files are retrieved by keyword and can only be retrieved from scratch (the records here are indeterminate records), the average number of retrievals is 5000.

In the case of an indexed sequential file, 10000 entries are divided into 100 groups with 100 records in each group. We retrieve files based on keywords, first in the index table, because the index table is unordered, so we can only search from the first index item, which needs to be searched 50 times on average, and then based on the resulting address pointer, we can get the groups to be retrieved, because each group is assumed to be 100 records. And the 100 entries are unordered, so you can only start with the first one, and the average number of searches is 50. So the total number of lookups is 50 plus 50 is 100. Therefore, indexed sequential files are relatively faster than sequential files.

Multi-level index sequential file

As mentioned above, with 10000 records, indexed sequential files are relatively fast to retrieve. However, if there are 10^5^, 10^6^, or even more records, the retrieval speed will still be slow. You can create another index table based on the index table instead of using a single index table. Instead of creating an index table for records, create an index table for items in the index table.

Because multiple levels of indexes are created, the search should start from the top level index, then work down to the lower level index, group to the record group, and search for the corresponding record in the group.

So this is essentially a prototype of a B+ tree, except that the terms are unordered, whereas B+ trees are ordered. This file structure should be very easy to understand if you have studied the principles of indexing in a database. Conversely, it’s easy to learn the index structure of a database once you understand the file structure.

The physical structure of the file

With the help of the operating system, the data on the disk is presented to the computer in a readable form, and we don’t need to care about how the data is stored on the disk, where the data is stored on the disk, etc., these are all done by the operating system.

So what does file data look like on disk? Let’s find out!

File blocks

Storage units on a disk are divided into blocks, also known as sectors. Sectors are usually 512 bytes in size. This means that even if a piece of data is less than 512 bytes, it still occupies 512 bytes of disk space. (The structure of the disk will be described later.)

Almost all file systems store files in chunks of fixed size, usually 4K. If the sectors in the disk are 512 bytes and the block size of the file system is 4K, the storage unit of the file system is 8 sectors. This is another question mentioned earlier, what is the difference between file size and footprint? The file size is the actual size of the file, and the occupied space is because even if the actual size is not that large, the space is actually occupied, and other file data cannot use the space. So we write 1byte of data into the text, but it also takes 4K of space.

Note here that on the NTFS file system under Windows, if the file data is less than 1K to start with, disk blocks are not allocated for storage, but are stored in a file table. However, once the file data is greater than 1K, the disk space in 4K will be allocated for storage regardless of the size of the file in the future.

As with memory management, the logical addresses of files are divided into blocks for easy disk management. The logical address of the file is then (logical block number, in-block address). Users use logical addresses to operate files, and the operating system is responsible for mapping logical addresses to physical addresses.

File allocation mode

Different file systems have different ways of allocating disk space for files, each of which has advantages and disadvantages.

Continuous distribution

Continuous allocation requires that each file occupies a group of contiguous blocks on disk, which is relatively simple.

As can be seen from the figure above, the logical block numbers of the file are in the same order as the physical block numbers. In this way, random access can be realized. As long as the physical address of the first logical block is known, the physical address of the other logical blocks can be quickly accessed. So how does the operating system map logical blocks to physical blocks? In fact, files are stored in directories, and directories are structured files, so the records of file directories store information about all files in the directory, and each file or directory is a record. This information includes the starting block number of the file and the number of occupied blocks.

So how does the operating system map logical blocks to physical blocks? (Logical block ID, internal block ADDRESS) -> (Physical block ID, internal block address). You only need to know the physical block ID corresponding to the logical block id. The internal block address remains unchanged.

When a user accesses a file, the operating system finds the FCB directory entry based on the file identifier. The physical block id = start block ID + logical block ID. Of course, you also need to check whether the logical block number is valid, whether it exceeds the length, etc. Because physical block numbers can be computed directly from logical block numbers, sequential allocation supports both sequential and random access.

Because reading/writing files requires moving the head, it can take longer to move the head if two disk blocks are accessed far apart. Sequential allocation is used as the file allocation mode, so that the disk blocks of the file are adjacent, so that the file read/write speed is the fastest.

However, the disadvantages of continuous allocation are also obvious, such as poor file scalability, high external fragmentation and low disk utilization, as shown in the following figure:

If file A needs to apply for a new disk block, no free disk block is available behind the disk block occupied by file A. In this case, you have to find another continuous disk block that meets the requirements of file A and copy the contents of file A. As you can see from the figure, only the white area below is eligible, so the disk looks like the following:

At this point, there are two free areas, one with three consecutive free blocks and one with one free block. Suppose that a file needs four disk blocks. Although the disk has space for the file, there is no space for four consecutive disk blocks. Therefore, the file system cannot allocate the file. Compact arrangement takes a long time and costs a lot.

Distribution of chain

Chain allocation adopts the discrete allocation mode, which can allocate discrete disk blocks to files. It can be assigned in two ways: explicit links and implicit links.

Implicit linking

In an implicit link, only the address of the first disk block and the address of the last disk block occupied by the file are recorded in the directory entry. Then, a pointer pointing to the next disk block is stored in each disk block so that the next disk block can be found according to the pointer. If a new disk block needs to be allocated, use the pointer in the last disk block to point to the new disk block, and then modify the new disk block to be the last disk block.

Let’s consider a question: how do we convert the implementation logical block number to a physical block number using implicit linking?

The user specifies the logical block ID (I) to be accessed. The operating system needs to find the directory entry (FCB) of the file to be accessed. From the directory entry, you can know the start block number of the file, and then read the data of logical block 0 into the memory to know the physical block number of logical block 1, and then read the data of logical block 1 into the memory. Finally, you can find the logical block NUMBER I that the user needs to access. Accessing logical block I requires I + 1 disk I/O operations in total.

Conclusion: implicit link allocation can only be sequential access, not support random access, search efficiency is low.

Let’s consider another question: is implicit linking convenient for file expansion?

We know that the directory entry contains the physical address of the end block number, so if we want to expand the file, we just need to mount the newly allocated disk block number after the end block number, change the pointer of the end block number to the newly allocated disk block, and then modify the directory entry.

Conclusion: implicit link assignment is very convenient for file expansion. All free disk blocks can be used without fragmentation and high storage utilization.

Explicit links

To display a link, you explicitly place the Pointers used to link physical blocks in a Table called a FAT (File Allocation Table).

Only the start block number of each file is displayed in the record entry. Then, the system searches for the corresponding disk block in the FAT based on the start block number. Each disk has only one FAT, and the entire FAT is loaded into the memory when the disk is turned on and stored in the memory. FAT entries are physically contiguous and each entry has the same length, so the physical block number is implicit and can be computed without recording.

As can be seen from the figure above, file A is stored from disk block number 5, and then the next physical block number can be obtained by finding the entry with physical block number 5 in FAT, and so on, all physical block numbers of the whole file can be obtained.

How do you convert a logical block number to a physical block number using explicit linking?

The user first provides the logical block NUMBER I to be accessed, and the operating system finds the FCB corresponding to the file. The start block number can be obtained from FCB. If I > 0, the physical block number corresponding to logical block I can be found later in FAT. Conversion between logical and physical block numbers does not require disk I/O operations because FAT is in memory.

Conclusion: Explicit link allocation supports sequential access and random access (as long as the logical block number I is given to be accessed, it is not necessary to access the previous logical block 0 ~ I-1, directly from FAT). Because the block number conversion process does not require disk access, access is much faster than implicit joins.

In addition, explicit allocation will not generate disk fragmentation, high storage utilization, and good file expansion.

The index distribution

Index allocation allows files to be distributed among disk blocks discretely. The system creates an index table for each file that records the physical blocks corresponding to each logical block in the file. The disk blocks for index tables are called == index blocks ==, and the disk blocks for file data are called == data blocks ==.

Distinguish this from explicit links, where FAT is one per disk, and index allocation, where index tables are one per file.

And because the length of a logical block number (say, 4 bytes) is fixed, the logical block number can be computed without explicit recording. For example, the first 4byte indicates the physical block number corresponding to the first logical block number, the second 4byte indicates the physical block number corresponding to the second logical block number, and so on.

I come back to the original question, how do you convert logical block numbers to physical block numbers using indexed links?

First, the user needs to access the logical block number is given, I need to find the required operating system access to the file directory entry (FCB), can be found in FCB index blocks of physical block number, through the physical block number, you can find the index table, the index table into memory, and need to look for in the index table logical block I corresponding physical block.

Conclusion: sequential access and random access are supported, and file expansion is easy to implement (just allocate a free block to the file and add an index item). But index tables need to take up some space.

Now we need to consider another problem. If a disk block size is 512 bytes and an index entry size is 4 bytes, then a disk block can only record 128 index entries. However, if a file size is larger than 128 disk blocks, the index entries cannot be stored in a disk block. What should I do now?

All we could think of was, since one disk block wouldn’t fit, use two. Yes, that’s true, but what we need to consider is that when we use multiple disk blocks, how can we associate these disk blocks without reducing query efficiency?

There are three solutions to this problem: linked solutions, multi-tier indexes, and mixed indexes.

Link scheme

The linking scheme is similar to implicit linking in that a pointer to the next disk block is stored in the disk block and the next index block can be found through this pointer.

If the size of a disk block is 512 bytes and the index entry is 4 bytes, a disk block can record only 128 index entries. If the size of a file is 128 * 128KB = 16384KB = 16M, then the file occupies 128 * 128 * 2 disk blocks, which also requires 128 * 128 * 2 index items, so 64 index blocks are required to store the file. These index blocks are linked using a link scheme.

If you want to access the last logical block, you need to find the last index block, and each index block is linked by Pointers, so you must access the first 63 index blocks to get the data of the last logical block.

The biggest problem with this approach is slow access, so what can be improved?

Multilayer index

We know that when the link scheme is used to store index blocks, the access to index block I needs to access I-1 index blocks first, which leads to low access efficiency.

Since using link scheme is slow, then install our previous, can you consider using index?

Establish a multi-layer index, so that the first layer of the index block to the second layer of the index block, but also according to the size of the file to establish the third layer, the fourth index block.

Using the example above, assuming a disk block size of 512 bytes and an index entry size of 4 bytes, a disk block can only record 128 index entries. If the size of a file is 128 x 128KB = 16384KB = 16M, the file occupies 128 x 128 x 2 disk blocks, which requires 128 x 128 x 2 index items and 64 index blocks for storage. We use a layer 2 index to store this file.

Suppose we need to find the disk block whose logical number is 1002. First, we need to find the primary index block, that is, to find the directory entry of the file, and then read the primary index block into the memory according to the address of the index block recorded in the directory entry. According to calculation, The secondary index block to be accessed is 1002/128 = 7, and the index entry in the secondary index block to be accessed is 1002% 128 = 106, so the final physical block to be accessed is (7, 106), the 7th index entry in the primary index table, and the 106th physical block in the secondary index table. There are three disk I/O operations, namely, read file directory entry, read primary index table, and read secondary index table.

Hybrid index

A mixed index is a combination of multiple index allocation methods. A file’s top-level index contains both direct indexes (pointing directly to data blocks), primary indirect indexes, and secondary indirect indexes.

Because the top-level index table contains a portion of the disk block addresses, fewer I/O operations are required for small files to access the disk.

conclusion

The file system is an indispensable part of the operating system, shielding users from the details of the disk and providing users with easy access to data.