Next: Mount Parameters for Caching Up: An Enhanced Disk-Caching NFS Previous: Lookup Caching

Read Caching Implementation

The Linux 2.0.x NFS client translates directly between the VFS interface and the NFS protocol (which is basically an RPC-based version of the standard UNIX file system operations).8 We intercept readpage calls and attempt to pass them off to the caching filesystem (i.e., ext2fs on the local disk). See Figure [*] for details of how page reads are satisfied. Files are cached (as decided at the ``Should we cache file?'' decisions in the flowchart) only if they are larger than a specified minimum size (currently 1 page, 4 KB), smaller than a specified maximum size (currently 4096 pages, 16 MB), and not changed recently (currently, this timeout is 30 seconds). When a cached file is subsequently read, those reads are served from the local cache file instead of being sent to the server as read requests.
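
The three caching criteria above can be sketched as a single predicate. This is a minimal illustration in Python, not the kernel code: the function and constant names are hypothetical, and treating both size bounds as strict is an assumption drawn from the wording ``larger than'' and ``smaller than''.

```python
PAGE_SIZE = 4096                     # one page, 4 KB, as stated in the text
MIN_CACHE_BYTES = 1 * PAGE_SIZE      # "currently 1 page"
MAX_CACHE_BYTES = 4096 * PAGE_SIZE   # "currently 4096 pages, 16 MB"
RECENT_CHANGE_SECS = 30              # "currently, this timeout is 30 seconds"


def should_cache_file(size_bytes, mtime, now):
    """Return True if a file passes the three criteria described in the text.

    Assumption: both size bounds are strict; how the real client treats a
    file of exactly the minimum or maximum size is not stated in the source.
    """
    big_enough = size_bytes > MIN_CACHE_BYTES
    small_enough = size_bytes < MAX_CACHE_BYTES
    # "Not changed recently": the last modification is older than the timeout.
    stable = (now - mtime) > RECENT_CHANGE_SECS
    return big_enough and small_enough and stable
```

For instance, a 400 KB file last modified an hour ago would qualify under these assumptions, while the same file modified ten seconds ago would not.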

Figure: Read caching flowchart. Heavy boxes are decision points.

Note that we cannot avoid all server traffic even when reading unchanged files. The stateless nature of the NFS protocol requires us to confirm with the server that the file has not changed since we last read it. However, when reading large files, the majority of the disk traffic is due to many read RPCs, all of which are eliminated.9

All cached file data exists only in disk-based structures. Each remote file from which a page is cached has a corresponding local cache file. The name of the local file is the concatenation of the server name, the server's superblock number, and the inode number of the file being cached. For example, a local cache file called holden,2-23 stores page data from the remote file with inode number 23 on the server named ``holden'' in the partition whose superblock is on device 2. These files are all stored in a single-level directory structure (thus precluding disconnected operation, see section [*], on page [*]). We exploit ext2fs's ability to efficiently store a sparse file (i.e., one where only a small portion of the total blocks have had any data written to them), and store only those data blocks that the client has actually read (not whole files). Also note that after a read is satisfied out of the local disk cache, we send a setattr to the server to update its last-accessed time (atime) if the prior access was more than 30 seconds ago. Since the disk cache can persist for months or longer, it is important that the access times are accurate.10
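
The naming scheme can be illustrated with a short Python sketch. The helper names are hypothetical; only the format itself (server name, a comma, the superblock device number, a hyphen, the inode number) is taken from the holden,2-23 example above.

```python
def cache_file_name(server, sb_dev, ino):
    """Build a cache file name such as 'holden,2-23'."""
    return f"{server},{sb_dev}-{ino}"


def parse_cache_file_name(name):
    """Split a cache file name back into (server, sb_dev, ino).

    Splitting at the last comma keeps this working if the server name itself
    contains a hyphen (an assumption; the source does not discuss such names).
    """
    server, _, rest = name.rpartition(",")
    sb_dev, _, ino = rest.partition("-")
    return server, int(sb_dev), int(ino)
```

Because every cache file lives in one directory, the name alone has to identify the remote file uniquely, which is why all three components appear in it.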

Figure: Kernel data structures. Bold text denotes new data members, heavy boxes denote added data structures.

Obviously, files may not be read in their entirety (e.g., executables, which are paged in on demand). Thus, we must maintain in-kernel data structures to track which pages of each inode have been cached to local disk. We use a simple packed binary array representation, and also include the total number of pages and the number of valid pages, along with some information needed by nfsfillind to finish reading a file after the NFS inode may have left memory. See Figure [*] for details. When we recognize that a file has had its last page written to the local cache (done in constant time with the count of valid pages, not with the bitmap), we mark the cache file as complete using the u+x mode attribute bit, and can then deallocate the bitmap of valid pages.
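
A packed bitmap with a running count of valid pages might look like the following Python sketch. The class and method names are hypothetical and the layout is only an illustration of the idea, not the kernel structure shown in the figure.

```python
class CachedPageMap:
    """Tracks which pages of one remote file have been written to the cache."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.valid_pages = 0
        # One bit per page, packed eight pages to a byte.
        self.bits = bytearray((total_pages + 7) // 8)

    def is_complete(self):
        # Constant-time check using the counter, never a scan of the bitmap.
        return self.valid_pages == self.total_pages

    def is_cached(self, page):
        if not 0 <= page < self.total_pages:
            raise IndexError(page)
        if self.bits is None:          # bitmap already freed: file is complete
            return True
        return bool(self.bits[page >> 3] & (1 << (page & 7)))

    def mark_cached(self, page):
        """Record that a page was written; return True once the file is complete."""
        if not self.is_cached(page):
            self.bits[page >> 3] |= 1 << (page & 7)
            self.valid_pages += 1
            if self.is_complete():
                self.bits = None       # the bitmap is no longer needed
        return self.is_complete()
```

In the real client, completion is additionally recorded on disk through the u+x mode bit of the cache file; that step is outside the scope of this sketch.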

In order to support the relationship between an inode for a file on the server and the inode of the file on the local disk that is caching the remote file, we made two significant changes to the kernel's data structures: 1) all inodes now have a pointer to a structure describing the inode they are caching; and 2) all inodes can specify a clear_inode_hook to be called when that inode is chosen to be reused (``putting'' an inode to a 0 count does not remove it from memory). See Figure [*] for details.

Figure: The life-cycle of an inode, and its corresponding cache file's inode. From A to B a remote file is first accessed, thus assigning an inode structure to that file. From B to C the first page is read from that remote file, so another inode is assigned to the corresponding local cache file; the NFS inode keeps a pointer to the cache inode so read pages can be written to the local disk. We stay in state C while the file is open, and then move to either D or F when the closing of the remote file results in putting of the NFS inode. When the remote file is closed, we choose to discard the cached pages if we have read less than 10% of the file; that case corresponds to F, where we put the cache inode back on the free list, and can reuse both inodes (back to state A). If we read more than 10% of the file, we move instead from C to D. While in D, the nfsfillind reads subsequent pages of the file in the background until the local cache contains all of the remote file's data pages. When the cache file is complete, we progress to E, where we have put the cache inode back on the free list, and can then return to A after calling the appropriate inode-clearing hooks. States G and H represent a slight complication of the filling-in procedure when the inode that was used for the remote file needs to be reused.
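
The decision taken when the remote file is closed (C to D, or C to F) can be sketched as follows. The names are hypothetical and states G and H are omitted. The caption leaves the case of exactly 10% unspecified; sending it to D here is an assumption.

```python
from enum import Enum


class State(Enum):
    A = "no inodes assigned"
    B = "NFS inode assigned"
    C = "file open, cache inode attached"
    D = "closed, nfsfillind filling in"
    E = "cache file complete"
    F = "closed, cached pages discarded"


FILL_IN_PERCENT = 10   # threshold from the caption


def state_after_close(pages_read, total_pages):
    """Pick the state entered from C when the remote file is closed."""
    # Integer arithmetic avoids floating-point rounding at the threshold.
    if pages_read * 100 < FILL_IN_PERCENT * total_pages:
        return State.F   # too little was read: discard the cached pages
    return State.D       # enough was read: fill in the rest in the background
```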

Because files may change on the server (due to either our machine or another client on the system), we must also invalidate cache entries occasionally. When we notice that the NFS inode that we are caching has a new modification time or file size, we mark our local cache as invalid by turning off its u+x bit (if it was complete) and updating the in-kernel data structures (e.g., removing the bitmap, and resetting the count of valid pages to zero). If the cache file's inode is later cleared from memory without having read sufficiently many pages to justify filling in, the cache file is unlinked, and the space is reused.
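
The invalidation rule can be sketched in Python as below. The record type and function are hypothetical stand-ins for the in-kernel structures, and storing the new mtime and size after invalidating is an assumption about how later comparisons would be made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    mtime: int                           # server mtime when pages were cached
    size: int                            # server file size at that time
    complete: bool = False               # mirrors the u+x bit on the cache file
    valid_pages: int = 0
    bitmap: Optional[bytearray] = None   # None once complete or invalidated


def revalidate(entry, server_mtime, server_size):
    """Return True if the cached pages may still be used, else invalidate."""
    if entry.mtime == server_mtime and entry.size == server_size:
        return True
    # The file changed on the server: drop all cached state for it.
    entry.complete = False               # corresponds to turning off u+x
    entry.bitmap = None
    entry.valid_pages = 0
    # Assumption: remember the new attributes for subsequent checks.
    entry.mtime = server_mtime
    entry.size = server_size
    return False
```

Note that any change to mtime or size discards every cached page, which matches the all-or-nothing invalidation described in the text.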

Ideally we would be able to update the disk cache for local writes. However, the NFS protocol has no way of letting a host know that it is the only writer to the file.11 After changing only a single byte of a 4 MB file that exists in our local disk cache, all the NFS client can subsequently tell when it reads that file again is that the modification time (mtime) has changed. Since we cannot conclusively confirm that it was only our client that effected the change, we must invalidate all of the pages we had cached locally.12

Greg Badros