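Introduction
The disk manager in postgres, SMGR (the storage manager), exposes the interface for managing disk storage; its main implementation lives in md.c. The disk manager does not operate on disk files directly. Instead, it performs file operations through the VFD mechanism. Every access to a table stored on disk goes through the disk manager, which handles it in a uniform way.
VFD mechanism
On Linux, as on other operating systems, every open file is assigned a file descriptor (fd). The number of descriptors available is limited, however, and software such as a database, which accesses data files constantly, can easily exceed what the operating system allows. To address this, postgres borrows the idea of virtualization and extends the descriptor space with virtual file descriptors (VFDs), which it uses to manage its files.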
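1 Principle
The principle behind the VFD mechanism is not complicated: whenever a process opens a file, it can always be handed a virtual file descriptor. In essence this is one more layer of wrapping around operations on physical files.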
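A virtual file descriptor is a VFD data structure that records the real file descriptor the operating system assigned to the file. The maximum number of files a process can have open at once is still the limit set by the system, but because the process goes through the VFD mechanism, it behaves as if it could open arbitrarily many files. The VFD table is private to each process: if several processes operate on the same file, each process obtains its own VFD, backed by a real file descriptor in that process's own descriptor table, as shown in the figure below.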
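To make the idea concrete, here is a minimal sketch of a "virtual" descriptor. It is not the fd.c code: ToyVfd, toy_init, toy_evict, toy_access and TOY_CLOSED are hypothetical names invented for illustration, and error handling is reduced to returning -1. The point is only that the virtual handle stays valid while the kernel descriptor behind it comes and goes.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define TOY_CLOSED (-1)         /* hypothetical stand-in for VFD_CLOSED */

typedef struct ToyVfd
{
    int         fd;             /* real kernel descriptor, or TOY_CLOSED */
    const char *path;           /* remembered so the file can be reopened */
    int         flags;          /* open(2) flags used for (re)opening */
} ToyVfd;

/* Record how to open the file; no kernel descriptor is consumed yet. */
static void
toy_init(ToyVfd *v, const char *path, int flags)
{
    assert(path != NULL);
    v->fd = TOY_CLOSED;
    v->path = path;
    v->flags = flags;
}

/* Give the kernel descriptor back; the virtual descriptor stays valid. */
static void
toy_evict(ToyVfd *v)
{
    if (v->fd != TOY_CLOSED)
    {
        close(v->fd);
        v->fd = TOY_CLOSED;
    }
}

/* Return a usable kernel descriptor, reopening the file if necessary. */
static int
toy_access(ToyVfd *v)
{
    if (v->fd == TOY_CLOSED)
        v->fd = open(v->path, v->flags);
    return v->fd;               /* -1 if the (re)open failed */
}
```

A caller would keep the ToyVfd around and call toy_access before each read or write; whether the file was evicted in the meantime is invisible to it.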
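2 VFD data structures
The vfd struct is the virtual file descriptor. It records the real file descriptor and its state, the position of the next free VFD in the free list, the file name, and related information.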
    typedef struct vfd
    {
        int         fd;                 /* current FD, or VFD_CLOSED if none */
        unsigned short fdstate;         /* bitflags for VFD's state */
        ResourceOwner resowner;         /* owner, for automatic cleanup */
        File        nextFree;           /* link to next free VFD, if in freelist */
        File        lruMoreRecently;    /* doubly linked recency-of-use list */
        File        lruLessRecently;
        off_t       fileSize;           /* current size of file (0 if not temporary) */
        char       *fileName;           /* name of file, or NULL for unused VFD */
        /* NB: fileName is malloc'd, and must be free'd when closing the VFD */
        int         fileFlags;          /* open(2) flags for (re)opening the file */
        mode_t      fileMode;           /* mode to pass to open(2) */
    } Vfd;

    /* these are the assigned bits in fdstate: */
    #define FD_DELETE_AT_CLOSE  (1 << 0)    /* T = delete when closed */
    #define FD_CLOSE_AT_EOXACT  (1 << 1)    /* T = close at eoXact */
    #define FD_TEMP_FILE_LIMIT  (1 << 2)    /* T = respect temp_file_limit */
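Since fdstate is a set of bit flags, several properties can be stored in the one unsigned short and tested independently. The following sketch shows the usual bitwise idioms; the three macros are copied from the listing, while the helper functions fdstate_has, fdstate_set and fdstate_clear are hypothetical and do not exist in fd.c.

```c
#include <assert.h>
#include <stdbool.h>

#define FD_DELETE_AT_CLOSE  (1 << 0)    /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT  (1 << 1)    /* T = close at eoXact */
#define FD_TEMP_FILE_LIMIT  (1 << 2)    /* T = respect temp_file_limit */

/* True if every bit in mask is set in fdstate. */
static bool
fdstate_has(unsigned short fdstate, unsigned short mask)
{
    return (fdstate & mask) == mask;
}

/* Return fdstate with the bits in mask switched on. */
static unsigned short
fdstate_set(unsigned short fdstate, unsigned short mask)
{
    return (unsigned short) (fdstate | mask);
}

/* Return fdstate with the bits in mask switched off. */
static unsigned short
fdstate_clear(unsigned short fdstate, unsigned short mask)
{
    return (unsigned short) (fdstate & ~mask);
}
```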
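The VfdCache array holds the virtual file descriptors owned by the process. Values of type File are indexes into this array. VfdCache[0] serves only as a list header and is not a usable VFD.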
    /*
     * Virtual File Descriptor array pointer and size.  This grows as
     * needed.  'File' values are indexes into this array.
     * Note that VfdCache[0] is not a usable VFD, just a list header.
     */
    static Vfd *VfdCache;
    static Size SizeVfdCache = 0;
3 Private functions
API | description |
---|---|
Delete | delete a file from the Lru ring |
LruDelete | remove a file from the Lru ring and close its FD |
Insert | put a file at the front of the Lru ring |
LruInsert | put a file at the front of the Lru ring and open it |
ReleaseLruFile | Release an fd by closing the last entry in the Lru ring |
ReleaseLruFiles | Release fd(s) until we’re under the max_safe_fds limit |
AllocateVfd | grab a free (or new) file record (from VfdCache) |
FreeVfd | free a file record |
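4 VFD implementation details
As in the figure above, the first element of the LRU ring acts only as a header and carries no real data. Following the lruLessRecently links leads to VFDs that were used less and less recently, and following the lruMoreRecently links leads to VFDs that were used more and more recently.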
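4.1 LruDelete(File file): removes the specified VFD from the LRU ring and closes its kernel descriptor. The VFD itself remains allocated.
1 Look up the Vfd struct vfdP that corresponds to file in the VfdCache array;
2 Call the system function close to close the physical file, and set vfdP->fd to VFD_CLOSED;
3 Decrement nfile, the process-private counter of kernel descriptors held by VFDs;
4 Finally, call Delete to do the actual unlinking.
Delete is simple: it only updates the forward and backward links of the doubly linked LRU list.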
    vfdP = &VfdCache[file];
    VfdCache[vfdP->lruLessRecently].lruMoreRecently = vfdP->lruMoreRecently;
    VfdCache[vfdP->lruMoreRecently].lruLessRecently = vfdP->lruLessRecently;
    static void
    LruDelete(File file)
    {
        Vfd        *vfdP;

        Assert(file != 0);

        DO_DB(elog(LOG, "LruDelete %d (%s)",
                   file, VfdCache[file].fileName));

        vfdP = &VfdCache[file];

        /*
         * Close the file.  We aren't expecting this to fail; if it does, better
         * to leak the FD than to mess up our internal state.
         */
        if (close(vfdP->fd) != 0)
            elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
                 "could not close file \"%s\": %m", vfdP->fileName);
        vfdP->fd = VFD_CLOSED;
        --nfile;

        /* delete the vfd record from the LRU ring */
        Delete(file);
    }
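4.2 LruInsert(File file): inserts the specified VFD at the head of the LRU ring, reopening the file first if necessary.
1 Locate the VFD for file in VfdCache;
2 If the file is not open, first call ReleaseLruFiles so that the number of descriptors in use stays below the maximum safe number, max_safe_fds;
3 Call BasicOpenFilePerm to open the file. On success increment nfile; on failure return -1;
4 Finally, call Insert to do the actual linking.
Insert is slightly more involved than Delete: it links the specified VFD between VfdCache[0] and VfdCache[0].lruLessRecently, that is, at the most recently used end of the ring.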
    vfdP = &VfdCache[file];
    vfdP->lruMoreRecently = 0;
    vfdP->lruLessRecently = VfdCache[0].lruLessRecently;
    VfdCache[0].lruLessRecently = file;
    VfdCache[vfdP->lruLessRecently].lruMoreRecently = file;
    /* returns 0 on success, -1 on re-open failure (with errno set) */
    static int
    LruInsert(File file)
    {
        Vfd        *vfdP;

        Assert(file != 0);

        DO_DB(elog(LOG, "LruInsert %d (%s)",
                   file, VfdCache[file].fileName));

        vfdP = &VfdCache[file];

        if (FileIsNotOpen(file))
        {
            /* Close excess kernel FDs. */
            ReleaseLruFiles();

            /*
             * The open could still fail for lack of file descriptors, eg due to
             * overall system file table being full.  So, be prepared to release
             * another FD if necessary...
             */
            vfdP->fd = BasicOpenFilePerm(vfdP->fileName, vfdP->fileFlags,
                                         vfdP->fileMode);
            if (vfdP->fd < 0)
            {
                DO_DB(elog(LOG, "re-open failed: %m"));
                return -1;
            }
            else
            {
                ++nfile;
            }
        }

        /*
         * put it at the head of the Lru ring
         */
        Insert(file);

        return 0;
    }
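The index-based ring used by Delete and Insert can be tried out in isolation. The sketch below is a simplified model under stated assumptions: Node, ring, RING_SIZE, ring_delete, ring_insert and ring_lru are hypothetical names, the array has a fixed size, and no files are involved. The link updates follow the same pattern as the Delete and Insert snippets, with slot 0 as the header.

```c
#include <assert.h>

#define RING_SIZE 8             /* arbitrary size chosen for the sketch */

typedef struct Node
{
    int         more;           /* index of the next more recently used slot */
    int         less;           /* index of the next less recently used slot */
} Node;

/* Zero-initialized: slot 0 links to itself, which means "empty ring". */
static Node ring[RING_SIZE];

/* Unlink slot i from the ring (same link updates as Delete). */
static void
ring_delete(int i)
{
    assert(i != 0);
    ring[ring[i].less].more = ring[i].more;
    ring[ring[i].more].less = ring[i].less;
}

/* Link slot i at the most recently used end (same updates as Insert). */
static void
ring_insert(int i)
{
    assert(i != 0);
    ring[i].more = 0;
    ring[i].less = ring[0].less;
    ring[0].less = i;
    ring[ring[i].less].more = i;
}

/* Least recently used slot, or 0 if the ring is empty. */
static int
ring_lru(void)
{
    return ring[0].more;
}
```

Touching a slot that is already in the ring is modeled as ring_delete followed by ring_insert, which moves it to the most recently used end. The eviction candidate is always ring[0].more, which is the slot ReleaseLruFile passes to LruDelete in the listing that follows.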
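4.3 ReleaseLruFiles: closes the kernel descriptors of the least recently used VFDs. It only does any work once the number of descriptors in use (nfile + numAllocatedDescs + numExternalFDs) has reached max_safe_fds. It then calls ReleaseLruFile repeatedly until the total drops below the limit or there is nothing left to close.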
    /*
     * Release one kernel FD by closing the least-recently-used VFD.
     */
    static bool
    ReleaseLruFile(void)
    {
        DO_DB(elog(LOG, "ReleaseLruFile. Opened %d", nfile));

        if (nfile > 0)
        {
            /*
             * There are opened files and so there should be at least one used vfd
             * in the ring.
             */
            Assert(VfdCache[0].lruMoreRecently != 0);
            LruDelete(VfdCache[0].lruMoreRecently);
            return true;            /* freed a file */
        }
        return false;               /* no files available to free */
    }

    /*
     * Release kernel FDs as needed to get under the max_safe_fds limit.
     * After calling this, it's OK to try to open another file.
     */
    static void
    ReleaseLruFiles(void)
    {
        while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
        {
            if (!ReleaseLruFile())
                break;
        }
    }
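The loop condition uses >=, so when the loop finishes normally there is room for at least one more descriptor. The break covers the case where VFDs hold nothing that could be closed. A counter-only sketch makes both exits visible. All names here (toy_nfile, toy_other_fds, toy_max_safe_fds, toy_release_one, toy_release_until_safe) are hypothetical, the limit of 4 is arbitrary, and closing a file is modeled as decrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>

static int  toy_nfile = 0;          /* descriptors held by VFDs */
static int  toy_other_fds = 0;      /* stands in for numAllocatedDescs + numExternalFDs */
static int  toy_max_safe_fds = 4;   /* arbitrary limit for the sketch */

/* Model of ReleaseLruFile: "close" one VFD-held descriptor if there is one. */
static bool
toy_release_one(void)
{
    if (toy_nfile > 0)
    {
        --toy_nfile;
        return true;
    }
    return false;
}

/* Model of ReleaseLruFiles: release until strictly under the limit. */
static void
toy_release_until_safe(void)
{
    assert(toy_nfile >= 0);
    while (toy_nfile + toy_other_fds >= toy_max_safe_fds)
    {
        if (!toy_release_one())
            break;                  /* nothing left that we could close */
    }
}
```

With 6 VFD-held descriptors, 1 other descriptor and a limit of 4, the loop stops at 2 VFD-held descriptors (2 + 1 < 4). With 0 VFD-held descriptors and 5 others it exits immediately through the break.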
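4.4 AllocateVfd: takes the next available VFD from the free list, growing the VfdCache array when necessary. A caller therefore always obtains a Vfd slot unless the process runs out of memory (OOM).
1 If the free list is empty, double the size of the array (to at least 32 entries), initialize the new VFD entries and link them into the free list.
2 Take the slot index at the head of the free list: file = VfdCache[0].nextFree;
3 Advance the head of the free list: VfdCache[0].nextFree = VfdCache[file].nextFree;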
    static File
    AllocateVfd(void)
    {
        Index       i;
        File        file;

        DO_DB(elog(LOG, "AllocateVfd. Size %zu", SizeVfdCache));

        Assert(SizeVfdCache > 0);   /* InitFileAccess not called? */

        if (VfdCache[0].nextFree == 0)
        {
            /*
             * The free list is empty so it is time to increase the size of the
             * array.  We choose to double it each time this happens. However,
             * there's not much point in starting *real* small.
             */
            Size        newCacheSize = SizeVfdCache * 2;
            Vfd        *newVfdCache;

            if (newCacheSize < 32)
                newCacheSize = 32;

            /*
             * Be careful not to clobber VfdCache ptr if realloc fails.
             */
            newVfdCache = (Vfd *) realloc(VfdCache, sizeof(Vfd) * newCacheSize);
            if (newVfdCache == NULL)
                ereport(ERROR,
                        (errcode(ERRCODE_OUT_OF_MEMORY),
                         errmsg("out of memory")));
            VfdCache = newVfdCache;

            /*
             * Initialize the new entries and link them into the free list.
             */
            for (i = SizeVfdCache; i < newCacheSize; i++)
            {
                MemSet((char *) &(VfdCache[i]), 0, sizeof(Vfd));
                VfdCache[i].nextFree = i + 1;
                VfdCache[i].fd = VFD_CLOSED;
            }
            VfdCache[newCacheSize - 1].nextFree = 0;
            VfdCache[0].nextFree = SizeVfdCache;

            /*
             * Record the new size
             */
            SizeVfdCache = newCacheSize;
        }

        file = VfdCache[0].nextFree;

        VfdCache[0].nextFree = VfdCache[file].nextFree;

        return file;
    }
4.5 FreeVfd(File file): releases the VFD that corresponds to file and puts it at the head of the free list.
    static void
    FreeVfd(File file)
    {
        Vfd        *vfdP = &VfdCache[file];

        DO_DB(elog(LOG, "FreeVfd: %d (%s)",
                   file, vfdP->fileName ? vfdP->fileName : ""));

        if (vfdP->fileName != NULL)
        {
            free(vfdP->fileName);
            vfdP->fileName = NULL;
        }
        vfdP->fdstate = 0x0;

        vfdP->nextFree = VfdCache[0].nextFree;
        VfdCache[0].nextFree = file;
    }
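AllocateVfd and FreeVfd together form a free list threaded through the array itself, with slot 0 as its head. The following sketch reproduces that pattern on a stripped-down slot type so the behavior can be observed. Slot, slots, nslots, pool_init, pool_alloc and pool_free are hypothetical names. The minimum size is 4 rather than the 32 used in the listing, and an allocation failure is reported by returning 0 instead of raising an error.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Slot
{
    int         next_free;      /* next free slot index; 0 ends the list */
    int         in_use;         /* payload stand-in */
} Slot;

static Slot   *slots = NULL;
static size_t  nslots = 0;

/* Create the array with only the header slot. Returns 0 on success. */
static int
pool_init(void)
{
    slots = calloc(1, sizeof(Slot));
    if (slots == NULL)
        return -1;
    nslots = 1;
    return 0;
}

/* Pop a slot from the free list, doubling the array when it is empty. */
static int
pool_alloc(void)
{
    int         i;

    if (slots[0].next_free == 0)
    {
        size_t      newsize = nslots * 2;
        Slot       *grown;

        if (newsize < 4)
            newsize = 4;
        /* Keep the old pointer intact if realloc fails. */
        grown = realloc(slots, newsize * sizeof(Slot));
        if (grown == NULL)
            return 0;
        slots = grown;
        /* Chain the new entries together and hang them off the header. */
        for (size_t k = nslots; k < newsize; k++)
        {
            slots[k].next_free = (int) (k + 1);
            slots[k].in_use = 0;
        }
        slots[newsize - 1].next_free = 0;
        slots[0].next_free = (int) nslots;
        nslots = newsize;
    }
    i = slots[0].next_free;
    slots[0].next_free = slots[i].next_free;
    slots[i].in_use = 1;
    return i;
}

/* Push slot i back onto the head of the free list. */
static void
pool_free(int i)
{
    assert(i != 0);
    slots[i].in_use = 0;
    slots[i].next_free = slots[0].next_free;
    slots[0].next_free = i;
}
```

Because freed slots are pushed onto the head of the list, the most recently freed index is the next one handed out, and the array only grows when every slot is in use.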