Don't Panic

fsync is Costly, But Don't Avoid It

tang-hi — Tue, 20 Aug 2024 00:00:00 GMT

Last week, our newly developed service turned out to be very slow at processing real-time data. It took an unreasonably long time, up to 12 hours, to process just 8 million records. For readability, I've included a simplified version of the processing code below.

int
process() {
  // .........

  // open many files, store the file descriptor in an array
  int *fd = create_files(FILE_NUMBER, O_RDWR | O_CREAT);

  // each file write some bytes and fsync
  for (int i = 0; i < FILE_NUMBER; i++) {
    if (write(fd[i], Buffer, 200) != 200) {
      printf("Failed to write to file\n");
      return -1;
    }
    if (fsync(fd[i]) == -1) {
      printf("Failed to fsync file\n");
      return -1;
    }
  }
  

  if (close_files(fd, FILE_NUMBER) != 0) {
    printf("Failed to close files\n");
    return -1;
  }
  // ......
  return 0;
}

At first glance, this code pattern is commonly seen in data-oriented systems and shouldn't cause any performance degradation. However, after a thorough investigation, I discovered that the root cause of the slow performance was the fsync system call, which is essential for guaranteeing data integrity but can significantly slow down system processes.

Why is this the case? fsync is used so widely in storage systems that you'll find it in almost any codebase. So why is it a bottleneck for our service? To answer this question, I dug into fsync and wrote this post to walk through the details.

This post covers the following topics:

What is fsync?
How slow is fsync?
Why does it impact our system so much?
What can we do to alleviate its impact?
Lessons Learned

Please note: All experiments were conducted on my home PC. The measured numbers may vary due to differences in hardware and operating systems. However, the results and conclusions should remain consistent.

What is `fsync`?

The Linux fsync(2) man page provides a clear explanation:

fsync() transfers ("flushes") all modified in-core data of
(i.e., modified buffer cache pages for) the file referred to
by the file descriptor fd to the disk device (or other permanent
storage device) so that all changed information can be retrieved
even if the system crashes or is rebooted. This includes writing
through or flushing a disk cache if present. The call blocks until
the device reports that the transfer has completed.

In short, fsync ensures data written to a file is safely stored on the storage device. We can delve deeper by examining the code itself. Here's a snippet from the Linux kernel's ext4/fsync.c file. This code demonstrates how fsync works at a lower level.

int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
	int ret = 0, err;
	bool needs_barrier = false;
	struct inode *inode = file->f_mapping->host;
  // ......

	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		goto out;

	/*
	 *  The caller's filemap_fdatawrite()/wait will sync the data.
	 *  Metadata is in the journal, we wait for proper transaction to
	 *  commit here.
	 */
	ret = ext4_fsync_journal(inode, datasync, &needs_barrier);

issue_flush:
	if (needs_barrier) {
		err = blkdev_issue_flush(inode->i_sb->s_bdev);
		if (!ret)
			ret = err;
	}
out:
	err = file_check_and_advance_wb_err(file);
	if (ret == 0)
		ret = err;
	trace_ext4_sync_file_exit(inode, ret);
	return ret;
}

ext4_sync_file is the function that is invoked when you call fsync on an ext4 file. It involves three steps.

file_write_and_wait_range writes all dirty pages belonging to the file to the disk. At this point the data may still sit in the disk's volatile cache, so it can still be lost in a power outage.
ext4_fsync_journal waits for the journal transaction that holds the inode's metadata to commit. After this step, the file still exists even if the OS kernel crashes.
blkdev_issue_flush issues a flush operation to the block device and waits until it has finished. This tells the block device to move the contents of its volatile cache to persistent storage.

As long as the block device manufacturer adheres to this contract, fsync reliably ensures that your data is safe and sound.
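To make the contract concrete, here is a minimal sketch of a helper that only reports success once fsync has returned. The name write_durably and its error handling are my own assumptions, not code from the service described in this post.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path and return 0 only after
 * fsync() has reported success. Returns -1 on any failure. */
int write_durably(const char *path, const void *buf, size_t len) {
  assert(path != NULL && buf != NULL);

  int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd == -1)
    return -1;

  const char *p = buf;
  size_t left = len;
  while (left > 0) {
    ssize_t n = write(fd, p, left); /* may write fewer bytes than requested */
    if (n == -1) {
      close(fd);
      return -1;
    }
    p += n;
    left -= (size_t)n;
  }

  /* So far the data may only be in the page cache; fsync makes it durable. */
  if (fsync(fd) == -1) {
    close(fd);
    return -1;
  }
  return close(fd);
}
```

A call such as write_durably("/tmp/example.dat", data, size) returns 0 only when the data has reached the device. For a newly created file, the fsync(2) man page notes that the directory entry needs its own fsync on the parent directory, which this sketch leaves out.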

How Slow is `fsync`?

In this section, I'll use a simple C program to measure the speed of fsync. The program opens a file, calls fsync 10,000 times, and calculates the average fsync time. We'll also compare the time it takes to write 200 bytes to each file with and without fsync.

void
fsync_speed() {
  clr_pgcache();
  int fd = open("file", O_RDWR | O_CREAT, 0666);
  if (fd == -1) {
    printf("Failed to open file\n");
    return;
  }

  MEASURE_TIME_START();
  for (int i = 0; i < 10000; i++) {
    if (fsync(fd) == -1) {
      printf("Failed to fsync file\n");
      return;
    }
  }
  MEASURE_TIME_END();
  if (close(fd) == -1) {
    printf("Failed to close file\n");
    return;
  }
  if (remove("file") == -1) {
    printf("Failed to remove file\n");
    return;
  }
  unsigned long cost_ms = MEASURE_TIME_MS();
  // cost_ms is unsigned long, so print it with %lu
  printf("fsync avg time: %lu ms\n", cost_ms);
}

int
without_fsync() {
  if (clr_pgcache() != 0) {
    printf("Failed to clear page cache\n");
    return -1;
  }

  // open many files, store the file descriptor in an array
  int *fd = create_files(FILE_NUMBER, O_RDWR | O_CREAT);

  MEASURE_TIME_START();
  // each file write some bytes
  for (int i = 0; i < FILE_NUMBER; i++) {
    if (write(fd[i], Buffer, 200) != 200) {
      printf("Failed to write to file\n");
      return -1;
    }
  }
  MEASURE_TIME_END();

  if (close_files(fd, FILE_NUMBER) != 0) {
    printf("Failed to close files\n");
    return -1;
  }

  return MEASURE_TIME_US();
}

int
with_fsync() {
  if (clr_pgcache() != 0) {
    printf("Failed to clear page cache\n");
    return -1;
  }

  // open many files, store the file descriptor in an array
  int *fd = create_files(FILE_NUMBER, O_RDWR | O_CREAT);

  MEASURE_TIME_START();
  // each file write some bytes
  for (int i = 0; i < FILE_NUMBER; i++) {
    if (write(fd[i], Buffer, 200) != 200) {
      printf("Failed to write to file\n");
      return -1;
    }
    if (fsync(fd[i]) == -1) {
      printf("Failed to fsync file\n");
      return -1;
    }
  }
  MEASURE_TIME_END();

  if (close_files(fd, FILE_NUMBER) != 0) {
    printf("Failed to close files\n");
    return -1;
  }

  return MEASURE_TIME_MS();
}

The measurement results are shown below.

Test	Fsync	Write	Write(fsync)
Average time	2 ms	430 us	154 ms

As Jeff Dean suggested in his famous article 'Latency Numbers Every Programmer Should Know,' we can consider fsync equivalent to a disk seek. While fsync is relatively slow, it's the only way to guarantee our data is safely stored on the storage device.

Why does it impact our system so much?

As we saw in the previous section, fsync is relatively slow. However, nearly all storage systems, including databases, distributed file systems, and object storage, use fsync to ensure data integrity. How can they achieve high performance while using fsync? After careful investigation, I discovered that the issue lies not with fsync itself but with flaws in our design.

1. Too many files

To achieve high concurrency, we divided the entire dataset into multiple Index Files called Atomic Indexes. We initially believed this would reduce thread contention, allowing multiple threads to read their respective Atomic Indexes without interfering with each other. However, this approach introduced a significant problem: we needed to call fsync for each file when dumping in-memory data to disk. A typical service we maintain contains hundreds of Atomic Indexes, so the time spent waiting for fsync to complete can easily outweigh the benefits of high concurrency.

2. Misuse of `fsync`

fsync is a resource-intensive system call that should only be used for critical files that you cannot afford to lose. According to our design, the service replays the real-time data after a server crash. The Atomic Indexes can therefore be regenerated, which makes calling fsync on them unnecessary.

What can we do to alleviate its impact?

Although in my case simply removing the fsync call provided a significant performance improvement, I was curious about other strategies for mitigating the cost of fsync.

Direct I/O

One potential approach is to use Direct I/O, which bypasses the page cache and transfers data directly between the application and the block device. This could reduce the overhead of fsync, as there would be no dirty pages to flush. To validate this idea, I wrote the following code.

int
direct_write_fsync() {
  if (clr_pgcache() != 0) {
    printf("Failed to clear page cache\n");
    return -1;
  }

  // open many files, store the file descriptor in an array
  int *fd = create_files(FILE_NUMBER, O_RDWR | O_CREAT | O_DIRECT);

  MEASURE_TIME_START();
  // each file write some bytes
  for (int i = 0; i < FILE_NUMBER; i++) {
    if (write(fd[i], Buffer, 200) != 200) {
      printf("Failed to write to file\n");
      return -1;
    }
    if (fsync(fd[i]) == -1) {
      printf("Failed to fsync file\n");
      return -1;
    }
  }
  MEASURE_TIME_END();

  if (close_files(fd, FILE_NUMBER) != 0) {
    printf("Failed to close files\n");
    return -1;
  }

  return MEASURE_TIME_MS();
}

Test	Direct I/O	Write(fsync)
Average Time	153 ms	154 ms

Unfortunately, the experiment showed that Direct I/O did not provide a significant performance improvement. Although fsync no longer needs to flush the page cache in this scenario, Direct I/O has its own overhead and cannot fully benefit from delayed allocation. The time saved by skipping the page cache flush was offset by the slower Direct I/O writes, so the overall gain was negligible.

io_uring

Since fsync has to wait for data to move from the system cache to the disk, which can be a bottleneck, we can explore io_uring as a way to improve performance. Introduced in Linux kernel version 5.1, io_uring allows a program to submit asynchronous I/O operations. Kernel worker threads handle these operations and notify user space when they complete. I won't go into the details of io_uring here, but you can refer to the kernel documentation for further information. Below is the code I used to test the performance improvement from io_uring.

int
io_uring() {
  if (clr_pgcache() != 0) {
    printf("Failed to clear page cache\n");
    return -1;
  }

  int *fd = create_files(FILE_NUMBER, O_RDWR | O_CREAT);

  struct io_uring ring;
  if (setup_io_uring(&ring) != 0) {
    printf("Failed to setup io_uring\n");
    return -1;
  }

  MEASURE_TIME_START();
  for (int i = 0; i < FILE_NUMBER; i++) {
    submit_write_request(&ring, fd[i], 0, Buffer, BYTES_NUMBER);
    submit_fsync_request(&ring, fd[i]);
  }
  wait_for_all_operations(&ring, 2 * FILE_NUMBER);
  MEASURE_TIME_END();
  close_files(fd, FILE_NUMBER);
  return MEASURE_TIME_MS();
}

Test	Direct I/O	Write(fsync)	Io_uring
Average Time	153 ms	154 ms	30 ms

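The results show that io_uring improved performance significantly, primarily because it is asynchronous. Instead of waiting for each fsync call to return before issuing the next one, we submit all the I/O requests and then wait for them to complete together. Note that io_uring does not guarantee ordering between independent submissions, so an fsync that is meant to cover a particular write has to be linked to it (for example with IOSQE_IO_LINK).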

Lessons Learned

Avoid excessive file fragmentation.

Having too many files can quickly consume your operating system's file descriptors and introduce higher latency compared to reading from or writing to a single file. You can find the experimental code here. If your critical data is split across multiple files, the situation worsens: you'll spend significant time waiting for fsync to complete. Compared to the problems caused by dividing data into many files, the minor advantage for multithreading is insignificant.

Use fsync judiciously.

fsync is a resource-intensive system call. Use it sparingly but strategically. If you can't recover your data file once it's corrupted, fsync is essential to ensure its integrity. Otherwise, avoid using it and implement a mechanism to regenerate the file if necessary.

Harness the power of io_uring.

io_uring is a powerful and user-friendly tool. Consider incorporating it into your next project.

Designing Your Own Code Color Scheme

tang-hi — Sat, 29 Jun 2024 00:00:00 GMT

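I recently read 写给大家看的设计书 (the Chinese edition of The Non-Designer's Design Book), a popular introduction to design, and it gave me the idea of using what I had learned to design a color scheme for code.

The whole process is quite simple. If the goal is only a scheme that isn't ugly, it takes about 10 minutes once you understand the basic principles.

Prerequisites

The Color Wheel

The color wheel is built from three primary colors (red, green, and blue). If we place only these three colors on a ring, we get the first wheel below. Mixing each adjacent pair gives us six colors, and repeating this process yields the complete color wheel. <div style="display: flex; justify-content: center; align-items: center"> <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/color_wheel_1.png"/> </div> <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/color_wheel_2.png"/> </div> <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/color_wheel_4.jpg"/> </div> </div>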

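Shades and Tints

The color wheel above only describes hues. Adding black or white to a hue gives a shade or a tint, which gives us many more colors to work with. Here is a simple example (from left to right: the original hue, a shade, a tint). <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/dark-light.jpg"/> </div>

Complementary Colors

Colors opposite each other on the wheel are complementary. In design we usually use one as the main color and the other for emphasis. The example below gives an intuitive feel for the contrast between complementary colors. <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/complementary-color-banner.webp"/> </div>

Triadic Colors

On the color wheel we can pick three colors that are 120 degrees apart from each other; these form a triad. A triadic combination looks harmonious and pleasing. Red, yellow, and blue, for example, form a triad on the traditional red-yellow-blue wheel. Children's products often use this combination, and the most classic example is Superman. <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/superman.avif"/> </div>

Split-Complementary Colors

A split-complementary triad starts from one color on one side of the wheel and finds its complement, but instead of using the complement itself it uses the two colors on either side of it. This combination tends to give subtler color boundaries, as the example below shows. <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/split-complementry.jpg"/> </div>

Analogous Colors

Analogous colors are neighbors on the color wheel. Because they share a base color, the combination looks harmonious but lacks contrast, so designs usually add some contrasting color for visual effect. The painting below is an analogous combination. <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/analogous.jpg"/> </div>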

Designing the Scheme

With the basics covered, we can start designing our code color scheme.

The Color Configuration File

VSCode keeps its color configuration in settings.json. Press cmd + shift + P to open the VSCode command palette, then type "settings" to find the configuration file. <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/20240629163617.png"/> </div>

Then add the following option to settings.json

"editor.tokenColorCustomizations": {

    "[Default Light+]": {  // Default Light+ is the theme you are currently using
        "keywords": "#C752A5", // color for keywords
        "comments": "#A6B3E2", // color for comments
        "variables": "#3E4D19", // color for variables
        "functions": "#52A5C7", // color for functions
        "types": "#C46E4A",  // color for types
        "strings": "#C7526A", // color for strings
        "numbers": "#7F6D29", // color for numbers
    },
}

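Choosing Colors

I recommend the Figma color wheel tool, which makes it easy to find the colors you need.

We start by setting every color to black. <div style = "text align: left; width:auto; height:20%"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/20240629164926.png"/> </div>

Functions, Keywords, Variables

Because I think of code as being made of functions, I decided to use the function color as the main color. I also didn't want the scheme to feel too intense, so I went with blue and settled on #52A5C7. The other two colors of its triad are #C752A5 and #A5C752, which we can use for keywords and variables.
<div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/20240629170833.png"/> </div>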
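The triad above can be checked with a few lines of code. This is a minimal sketch that assumes the RGB color wheel, where a 120 degree hue rotation amounts to a cyclic shift of the three channels; the function name rotate_hue_120 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate the hue of a 0xRRGGBB color by +120 degrees.
 * On the RGB wheel this is a cyclic shift of the channels:
 * (R, G, B) -> (B, R, G), so pure red becomes pure green. */
uint32_t rotate_hue_120(uint32_t rgb) {
  assert(rgb <= 0xFFFFFFu);
  uint32_t r = (rgb >> 16) & 0xFFu;
  uint32_t g = (rgb >> 8) & 0xFFu;
  uint32_t b = rgb & 0xFFu;
  return (b << 16) | (r << 8) | g;
}
```

Applying it once to 0x52A5C7 gives 0xC752A5, and applying it again gives 0xA5C752, which are the two triad partners named in the text. A third rotation returns the original color.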

The result is shown below. <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/20240629171135.png"/> </div>

The variable color is clearly too bright, so we darken it. I ended up with the shade #3E4D19 for variables.

Types, Comments

Functions and types usually appear together, so for types I chose the complement of the main color. To keep comments from standing out, I chose a color analogous to the main color. From left to right: main color, complement, analogous color. <div style="display: flex; justify-content: center; align-items: center"> <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/20240629172449.png"/> </div> <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/20240629172515.png"/> </div> <div style = "text align: left"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/20240629172532.png"/> </div> </div>

The result is shown in the figure below.

To make the result more harmonious, I adjusted the brightness of these colors and ended up with #A6B3E2 for comments and #C46E4A for types.
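For reference, the unadjusted complement of the main color can be computed as well. The sketch below assumes the complement is a 180 degree hue rotation that keeps lightness and saturation, which for RGB means replacing each channel c with max + min - c; complement_hue is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Complement of a 0xRRGGBB color: rotate the hue by 180 degrees while
 * keeping the largest and smallest channel values (and so the lightness). */
uint32_t complement_hue(uint32_t rgb) {
  assert(rgb <= 0xFFFFFFu);
  uint32_t r = (rgb >> 16) & 0xFFu;
  uint32_t g = (rgb >> 8) & 0xFFu;
  uint32_t b = rgb & 0xFFu;
  uint32_t max = r > g ? (r > b ? r : b) : (g > b ? g : b);
  uint32_t min = r < g ? (r < b ? r : b) : (g < b ? g : b);
  uint32_t sum = max + min;
  /* Each result lies between min and max, so it still fits in 8 bits. */
  return ((sum - r) << 16) | ((sum - g) << 8) | (sum - b);
}
```

For the main color 0x52A5C7 this yields 0xC77452, which is close to the #C46E4A chosen for types after the brightness adjustment.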

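Strings, Numbers

So that strings, numbers, and functions sit well together, I used the split-complementary triad of the main color and adjusted the brightness accordingly. The final colors appear in the complete configuration below.

Complete Configuration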

"keywords": "#C752A5",
"comments": "#A6B3E2",
"variables": "#3E4D19",
"functions": "#52A5C7",
"types": "#C46E4A",
"strings": "#C7526A",
"numbers": "#7F6D29",

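Summary

There is of course much more to color theory, and what I have learned only scratches the surface. Still, it is satisfying to turn that knowledge into a code color scheme that looks decent. I hope this post helps you, and I hope you can design a color scheme of your own.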

Notes on Learning Options

tang-hi — Sun, 19 May 2024 00:00:00 GMT

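What Is an Option?

An option can be thought of as a right: it entitles you to buy or sell a stock at a specific price within a certain period of time.

That may be hard to grasp in the abstract, so let's work through a few examples.

Suppose there is a person, A, whose opinion is always the opposite of yours. When you think Alibaba's stock will rise, A thinks it will fall; when you think it will fall, A thinks it will rise. Each of you also wants the other to pay for being wrong. So the two of you devise a contract called an option.

Suppose Alibaba currently trades at 80 dollars.

1. You think it will rise, A thinks it will fall

In this case A makes you an offer: since you think Alibaba will keep rising, A sells you a call option with a strike price of 82 and a term of one month. Holding this option means that for the next month, however Alibaba's price moves, you can buy Alibaba shares from A for 82 dollars each. Buying the option does, of course, require you to pay a small premium up front.

So what happens after the deal is made?

a. Alibaba's price rises above 82 dollars

In this case you can buy Alibaba shares cheaply (buy from A for 82 dollars and immediately sell on the market for a profit). If the price is now T dollars, your gain is $$ T - 82 $$. Don't forget that we paid a premium up front to buy the option, so your actual profit is $$ T - 82 - premium $$

b. Alibaba's price does not rise above 82 dollars

In this case you would not buy Alibaba shares from A for 82 dollars (you let the option lapse), because that would be a losing trade. You lose the premium you paid for the option. A loses nothing and simply pockets your premium.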

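2. You think it will fall, A thinks it will rise

In this case A makes you an offer: since you think Alibaba will fall, A sells you a put option with a strike price of 78 and a term of one month. Holding this option means that for the next month, however Alibaba's price moves, you can sell Alibaba shares to A for 78 dollars each. Again, buying the option requires you to pay a small premium up front.

So what happens after the deal is made?

a. Alibaba's price falls below 78 dollars

In this case you can sell Alibaba shares above the market price (buy on the market and sell to A at the higher price for a profit). If the price is now T dollars, your gain is $$ 78 - T $$. Don't forget that we paid a premium up front to buy the option, so your actual profit is $$ 78 - T - premium $$

b. Alibaba's price stays above 78 dollars

In this case you would not sell to A for 78 dollars (you let the option lapse), because that too would be a losing trade. You lose the premium you paid for the option, and A pockets it.

The call option and put option above are what we mean by options. In practice options trade in contracts (one contract covers 100 shares), so one contract is the right to buy or sell 100 shares.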
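The four cases above reduce to two small payoff functions. This is a minimal sketch from the buyer's point of view, per share; the function names are hypothetical and the premium of 1 dollar used in the example is an assumed number, since the post does not give one.

```c
#include <assert.h>

/* Profit per share for the buyer of a call: exercise only when the
 * market price is above the strike, and always subtract the premium. */
double call_profit(double price, double strike, double premium) {
  assert(premium >= 0.0);
  double exercise_value = price > strike ? price - strike : 0.0;
  return exercise_value - premium;
}

/* Profit per share for the buyer of a put: exercise only when the
 * market price is below the strike, and always subtract the premium. */
double put_profit(double price, double strike, double premium) {
  assert(premium >= 0.0);
  double exercise_value = price < strike ? strike - price : 0.0;
  return exercise_value - premium;
}
```

With an assumed premium of 1 dollar, call_profit(85, 82, 1) is 2 and call_profit(80, 82, 1) is -1, while put_profit(70, 78, 1) is 7 and put_profit(80, 78, 1) is -1. The seller's profit is the negative of each value.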

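Note that selling an option does not require you to own the underlying stock. You can sell an option directly on the market and immediately collect the premium paid by the buyer, as long as you have enough margin.

What Are Options Good For?

Speculation

Compared with the stock itself, an option is more sensitive to price movements. Suppose Alibaba trades at 80 dollars and you buy one call contract with a strike price of 81 dollars for 1.2 dollars per share. When Alibaba rises to 83 dollars, the option is worth at least 2 dollars per share (83 - 81).

The table below compares the option with buying 100 shares of the stock.

Option/Stock	Principal	Value after the rise	Unrealized gain
Option	$120	$200	66.7%
Stock	$8000	$8300	3.75%

For the same move in the stock price, the option earns a much higher rate of return.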

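Hedging

Thanks to their high leverage, options are useful not only for speculation but also for hedging risk. The following strategies are common in the stock market.

Protective put

If you want to hold a stock but don't want to suffer too large a loss when it falls, you can buy a put option on it at the same time as you buy the stock.

Suppose you buy 100 shares of Alibaba at 80 dollars and also buy a put with a strike price of 78 dollars. Your maximum loss is then 2 dollars per share plus the premium, which puts a floor under your losses. The relationship between profit and stock price is shown in the figure below.

Covered call

If you have a take-profit target when you buy a stock, you can sell an equal number of call options at the same time. This strategy adds to your profit.

Suppose you buy 100 shares of Alibaba at 80 dollars with a target price of 85 dollars. You can sell calls on 100 shares with a strike price of 85 dollars. When the price reaches 85 dollars, you earn not only the 5 dollar difference per share but also the premium from selling the options. The drawback is that if the price keeps rising, anything above 85 dollars is no longer yours.

Straddle

When you expect a company's stock either to soar or to plunge (for example, because it faces an important lawsuit), you can profit by buying a call and a put with the same strike price and the same expiry date.

Suppose Alibaba trades at 80 dollars and you buy a call and a put, both with a strike price of 81 dollars. If Alibaba drops to 70 dollars, your put is worth 11 dollars per share, easily covering the premiums you paid. Likewise, if Alibaba rises to 90 dollars, the call is worth 9 dollars per share, which also covers the premiums.

But if the price stays flat or moves only a little, you lose the premiums you paid for the options.

Collar

This strategy puts both an upper and a lower limit on your portfolio.

Suppose you hold 100 shares of Alibaba at 80 dollars per share. By buying a protective put with a strike price of 70 dollars, you set a floor on the value of your shares. Buying the option costs a premium, though. If you don't want to pay it, you can sell a call worth about as much as the put. You have then set upper and lower limits on your shares at effectively zero cost.

Spread

This strategy applies when you find that two options with different strike prices are mispriced relative to each other (one is cheaper than it should be compared with the other). You buy the relatively cheap option and sell the relatively expensive one to capture the difference.

Suppose you find that the call with a strike price of 90 dollars is cheaper than the call with a strike price of 100 dollars. You can buy the 90 dollar call and sell the 100 dollar call. As soon as the stock price rises above 90 dollars, the position pays off. Note that "cheap" here refers to the premium paid for the option.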

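How Are Options Priced?

So how should a call option with a strike price of 90 dollars be priced? If the stock currently trades at 100 dollars, the option clearly has value: each share yields a 10 dollar difference. But what if the stock trades at 80 dollars? Is the option worthless? Of course not. Besides its intrinsic value, an option also has time value: as long as it has not expired, the stock might still take off.

Is there a reasonable way to price an option? A perfectly objective price is impossible; we can only price options under certain assumptions. Common methods are binomial option pricing and the Black-Scholes pricing formula.

Here we focus on binomial option pricing.

Suppose a stock is priced at 100 dollars and, by the end of the year, will either rise to 120 or fall to 90. The risk-free annual interest rate is 10%. What should the premium be for a call option on one share with a strike price of 110 dollars and one year to expiry?

First, we can work out the option's possible payoffs at expiry (10 dollars if the stock ends at 120, nothing if it ends at 90), shown in the figure below.

If we can build a portfolio whose year-end payoff matches the option's payoff, we can price the option. Consider this portfolio: one share of stock worth 100 dollars, financed partly by a loan of 81.82 dollars. Since 81.82 dollars of the purchase price is borrowed, we put in 18.18 dollars of our own money. At 10% interest, we have to repay 90 dollars at expiry. How does this portfolio pay off? It is worth 120 - 90 = 30 dollars if the stock rises and 90 - 90 = 0 if it falls, as the figure below shows.

The portfolio pays exactly three times what the option pays. Our own outlay is 18.18 dollars, so the price of the option is $$18.18 / 3 = 6.06$$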
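The replication argument can be written as a short function. This is a sketch of the one-period binomial model only; the function name and parameter names are my own, and it scales the portfolio down to the fraction of a share that matches one option instead of dividing by three at the end.

```c
#include <assert.h>

/* One-period binomial price of a call, found by replication:
 * hold `shares` of stock and borrow `loan` so that the portfolio
 * pays exactly what the option pays in both the up and down state. */
double binomial_call_price(double spot, double up, double down,
                           double strike, double rate) {
  assert(up > down && rate > -1.0);
  double payoff_up = up > strike ? up - strike : 0.0;
  double payoff_down = down > strike ? down - strike : 0.0;
  /* Shares needed so the portfolio's spread equals the option's spread. */
  double shares = (payoff_up - payoff_down) / (up - down);
  /* Loan whose repayment cancels the surplus value in the down state. */
  double loan = (shares * down - payoff_down) / (1.0 + rate);
  /* The option costs what the replicating portfolio costs today. */
  return shares * spot - loan;
}
```

Calling binomial_call_price(100, 120, 90, 110, 0.10) holds one third of a share and borrows about 27.27 dollars, which gives a price of about 6.06 dollars, in line with the 18.18 / 3 above.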

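The method described here is only a simplified version. Real-world pricing is far more complex and takes many more factors into account; interested readers can explore further on their own.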

[Translation] Binary quantization

tang-hi — Sat, 13 Apr 2024 00:00:00 GMT

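Original article

What is binary quantization?

Today's vector databases build large vector indexes and keep them in memory for search. This enables real-time queries, but the cost rises accordingly. Binary quantization (BQ) is a vector compression algorithm that trades memory footprint against query accuracy.

An analogy helps here. Imagine every vector you store is a home address. The address pinpoints where someone lives, down to the country, state, city, street, and even house number. The price of that precision is the memory needed to store, search, and read each address (the more detailed the address, the more memory it takes). In the same way, to locate a vector in a multi-dimensional space, you can think of the number in each dimension as the distance to travel along the direction of that dimension.

BQ compresses a vector by turning each dimension into 0 (negative) or 1 (positive) according to the sign of its number. That may sound hard to believe: if we throw away the actual number in each dimension, how can we still locate the vector? Unreliable as it sounds, BQ works surprisingly well on high-dimensional vectors. Let's see why!

<div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/bq_vecs1.png"/> </div> Binarization is not limited to vector compression; other fields use it too, computer vision for example. To binarize an image, replace each pixel with 1 if it exceeds some threshold and with 0 otherwise. The result is a black-and-white binary image. Detail is lost, but the image becomes much smaller.

Now consider what a sentence embedding looks like after binarization. In the figures below we turn the sentence "All your vector embeddings belong to you!" into a 384-dimensional vector. The first figure shows all 384 numbers of the vector, each a 32-bit float, as a heat map in which the value of each dimension determines the shade. The figure beneath it shows the same vector after thresholding, with every positive dimension turned into 1 (white) and every negative dimension into 0 (black). The result is a black-and-white heat map that looks a bit like a barcode. That is what binary quantization does to a vector! The resulting vector is much smaller, but a lot of detail is gone.

BQ simplifies the encoding by keeping only the direction of the vector. Each dimension is encoded with a single bit that says whether it is positive or negative. For example, a vector such as [12, 1, -100, 0.003, -0.001, 128, -1000, 0.0001] is compressed into a single byte, the binary sequence [1,1,0,1,0,1,0,1]. Going from a float32 to 1 bit per dimension shrinks each vector by a factor of 32. However, the original vector cannot be recovered from its BQ form, which makes this a lossy compression technique.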

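Details of working with binarized vectors

Distance between binarized vectors

First, how do we compute the distance between two binarized vectors? It's simple: since we only care about direction, we just check whether the bits agree in each dimension, that is, we count the bit positions in which the two vectors differ. Because this can be done with bitwise operations, it is much faster than computing the distance between non-binarized vectors.

For example, the vector [12, 1, -100, 0.03, -0.01, 128, -100, 0.01] compresses to 11010101 and a second vector [11, 4, -99, -0.01, 0.02, 130, -150, 0.02] compresses to 11001101. The distance between them is the number of differing bits (here, 2). This is simply the Hamming distance between the two vectors.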
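The example above can be reproduced with a short sketch. The function names are hypothetical, the code packs at most 64 dimensions into one word for brevity, and it treats exactly zero as 0, which the article does not specify. This is not how Weaviate implements BQ, only an illustration of the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack the signs of up to 64 dimensions into one word: 1 for positive,
 * 0 otherwise. The first dimension ends up in the highest used bit. */
uint64_t binary_quantize(const float *v, size_t dims) {
  assert(v != NULL && dims <= 64);
  uint64_t code = 0;
  for (size_t i = 0; i < dims; i++)
    code = (code << 1) | (v[i] > 0.0f ? 1u : 0u);
  return code;
}

/* Hamming distance: XOR the codes, then count the 1 bits that remain. */
unsigned hamming_distance(uint64_t a, uint64_t b) {
  uint64_t diff = a ^ b;
  unsigned count = 0;
  while (diff != 0) {
    diff &= diff - 1; /* clears the lowest set bit */
    count++;
  }
  return count;
}
```

Quantizing the two example vectors gives 0xD5 (11010101) and 0xCD (11001101), and hamming_distance returns 2 for them. A production implementation would use a hardware popcount instruction and operate on arrays of words.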

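The importance of data distribution for BQ

Unlike product quantization, which we covered earlier, BQ does not suit every kind of data; we explain why below. If you are working with normalized data, though, especially with the cosine metric, there is nothing to worry about, because Weaviate handles normalization for you seamlessly. Now let's discuss the effect of adding dimensions.

BQ with 1-dimensional vectors

In the picture below we plot, in red, the only two possible points (0, 1) of a normalized one-dimensional space. The quantizer assigns 1 to the positive vector and 0 to the negative vector.

Let's widen our view to two dimensions. Since we are considering normalized vectors, we expect all of them to lie on the circle of radius 1 centered at (0,0). The point is to understand how the quantizer splits the data into four regions: two possible bit values across two dimensions give $2^2 = 4$ regions.

Here the green region (encoded as 11) contains the points whose two dimensions are both positive, and the blue region (encoded as 00) contains the points whose two dimensions are both negative. The red region (encoded as 10) covers a positive first dimension and a negative second dimension, and the yellow region (encoded as 01) covers a negative first dimension and a positive second dimension.

Importantly, within a region the distance between any two points is zero. The distance between points in adjacent regions is 1, and the distance between points in opposite regions is 2.

This highlights the key role of data distribution. We have been using normalized data; normalization is not mandatory, but normalized data matches the scenario described above very well. So let's analyze a different case.

All of our data falls in the first quadrant. Every vector is therefore encoded as 11, and the vectors cannot be told apart. This shows how a poor data distribution can make binary quantization unusable. As noted, normalization is not mandatory, but choosing normalized data gives some guarantee about the distribution and helps binary quantization work.

If your data is not normalized, however, it becomes crucial to make sure the regions are balanced and divide the data logically. Consider the following example. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/bp-vecs12.png"/> </div>

In this case, binary quantization says the yellow point is farther from the red point and closer to the blue and green points. That holds for angle-based metrics such as cosine, but it contradicts the L2 metric, under which the yellow and red points are actually closer.

BQ with N-dimensional vectors

Now consider the amount of data and how well binary quantization can still represent vectors uniquely. Given the dimensionality and the number of vectors, we can estimate how many collisions to expect, where a collision means two different vectors that have the same representation after binary quantization. As the earlier example showed, binary quantization of two-dimensional vectors yields only four regions. So once there are more than four vectors, collisions are guaranteed and some distinct vectors become indistinguishable.

Fortunately, the number of regions grows exponentially with dimensionality. Each additional dimension doubles the number of regions, giving $2^d$ regions for $d$ dimensions and a far greater capacity to represent vectors. With $756$ dimensions, for instance, you already have a staggering $2^{756}$ regions; even with billions or trillions of vectors, collisions are almost impossible. At $1500$ dimensions the regions can easily accommodate any number of vectors without collisions. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/bp-vecs13.png"/> </div>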

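Performance gains from BQ

Let's revisit the advantages of binary quantization. We usually quantize to save memory, and here each number is encoded as 1 bit. Weaviate represents float vectors as float32 arrays, so this gives a 32x compression ratio, which is already commendable.

But binary quantization has a significant secondary benefit: the distance between quantized vectors can be computed with bitwise operations. All it takes is an XOR of the two binary arrays followed by a count of the 1 bits in the result. Go provides SIMD-optimized operations for these binary functions, so the computation is much faster than with the original vectors. But how much faster, exactly?

To answer this, we show brute-force search results with our binary quantization and with the original vectors. We ran 100 queries against 10,000 vectors at 768, 1536, and 4608 dimensions.

Dimensions	Latency, uncompressed (microseconds)	Latency, compressed (microseconds)	Recall
768d	1771.85	230.72 (13%)	0.745
1536d	3703.68	353.3 (9%)	0.744
4608d	16724.41	896.37 (5%)	0.757

The recall is not very high, but we can address that by over-fetching candidate neighbors and rescoring them. Note that the speedup becomes more pronounced as dimensionality grows. Brute-force search over 768-dimensional compressed vectors takes only 13% of the time needed for uncompressed vectors. Likewise, 1536-dimensional compressed vectors take only 9% of the time, and 4608-dimensional compressed vectors only 5%.

Normally we rely on building a graph for ANN search, because scanning millions of vectors one by one is impractical. With the time reduced this much, however, brute-force search becomes a viable option. At 768 dimensions, for example, a brute-force search over 1 million vectors takes only 23 milliseconds. Even the worst case (4608 dimensions) is now feasible at roughly 90 milliseconds.

So what is the bottom line? Can Weaviate give you lightning-fast brute-force search over your data? The answer depends on the size of your data and the search speed you expect.

Brute-force search has several advantages. First, you don't need an index, which saves the time needed to build one. Indexing in Weaviate is not especially slow, but brute-force search lets you skip the step entirely. Second, you no longer need to store neighbor lists, which saves further memory. In fact, if you choose to brute-force search directly from disk, memory usage becomes negligible: as little as 100MB is enough to host your application.

Weaviate recently introduced the Flat Index, which offers the option of brute-force searching from disk (the default behavior) or of keeping only the compressed data in memory and fetching a small number of full vectors from disk for the final ranking. Compared with the traditional HNSW index, both approaches load data faster and use less memory. If you need high performance, HNSW is still the first choice. Even so, the Flat Index is a cost-effective alternative with good performance. In addition, Weaviate now supports binary quantization (BQ) for use with both the Flat and HNSW indexes.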

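Indexing time improvements

Now let's look at some performance numbers. All experiments were run with the Go benchmark. At the end of the post we explain how to reproduce them with your own data. We use a medium-sized dataset from DBPedia with ADA002 (1536 dimensions) and Cohere v2 (4096 dimensions) embeddings, and start with indexing time.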

Dimensions	1536	4096
Flat index	5s	7s
Hnsw index	47s	1m36s
Hnsw index+BQ	21s	25s

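As mentioned earlier, the Flat Index does not need to index the data, so we only send the data to the server and store it. HNSW, in contrast, has to build an index. It is worth noting that HNSW indexing time also benefits noticeably from the compression.

Memory footprint improvements

Now let's discuss memory footprint. We distinguish between the Flat Index configurations because their footprints differ. With the plain Flat Index, all data is read from disk regardless of data size and of whether BQ is used. If we choose to cache the compressed data, it is kept in memory. Since no index is needed, the Flat Index uses less memory than HNSW+BQ. We also show the memory footprint of HNSW with and without BQ. In both of those cases you can expect memory usage to grow more or less linearly with the number of dimensions and vectors.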

Dimensions	1536	4096
Flat index	77MB	77MB
Flat index + BQ + Cache	141MB	183MB
Hnsw index	1.02GB	1.79GB
Hnsw index+BQ	214MB	297MB

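Latency analysis

Finally, let's look at QPS versus recall curves to compare the different setups. To generate the curves we varied the ef parameter for HNSW and the rescoringLimit parameter for the Flat Index. We used 10 concurrent cores to measure QPS.

Note the low QPS of the pure Flat Index scenario (the green dot in the lower right corner). In this setup we have to read every full vector from disk and brute-force search over the uncompressed vectors. The approach performs poorly, but it needs no memory allocation.

Next we take the same approach but add binary quantization (BQ). Here we read less data from disk, because we only access the compressed vectors (32 times smaller than the uncompressed ones). The brute-force search is also faster, because distances are computed with bitwise operations only. After the brute-force search we produce a candidate list and rescore it, which only requires reading a small number of full vectors to build the final result. This approach still works from disk while performing better. Note that it depends on how well the data suits BQ; otherwise the best recall may be out of reach. A sufficiently high rescoringLimit is also essential for good recall.

Finally, we tested the flat index with cached compressed vectors (the blue curve). This setup reaches between 600 and 1000 QPS. The memory footprint increases slightly, since the compressed vectors are kept in memory and only a small share of the vectors is fetched from disk.

Next we consider the higher-dimensional case. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/bq-vecs16.png"/> </div>

Given these results, a few points are worth considering. For a relatively small set of vectors (100,000), if you are aiming for very high recall, the performance difference between the flat-compressed-cached curve and HNSW is not very pronounced. One might argue that 100,000 vectors is not a lot, and that is a fair point. But consider combining this feature with multi-tenancy.

Weaviate keeps each tenant's data fully isolated. Suppose we have 1000 tenants with 100,000 vectors each. Surprisingly, the expected performance stays roughly the same, even though these 100 million vectors amount to a lot of data. Weaviate also supports fast tenant deactivation and lazy reactivation, which allows for a setup with excellent performance and a very low memory footprint, provided you have designed a robust architecture.

Now let's scale the numbers up further. For larger datasets, brute-force search scales linearly with data size. If we grow the dataset to 1,000,000 vectors, QPS will be roughly 10 times lower than shown here. Even with that extra latency, brute-force search remains a viable option for some applications.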

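PQ versus BQ

Now that Weaviate offers several quantization techniques, the question is which is better, PQ or BQ, and where each should be used. The decision depends on your specific data and requires running your own benchmarks. The next section provides code and instructions for doing so. The memory and performance experiments below are meant to make the PQ vs. BQ choice easier.

Note that the main advantage of BQ is not only the compression itself. The more efficient bitwise computation also plays an important part, which is why the flat+bq option discussed above is such a good choice: less data has to be read from disk, and the faster distance computation makes brute-force search in Weaviate faster.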

Index	Indexing Time	Memory Usage
HNSW	8m42s	6437.05MB
HNSW+PQ	21m25s	930.36MB
HNSW+BQ	3m43s	711.38MB
FLAT+BQ	54s	260.16MB
FLAT+BQ+CACHE	53s	352.99MB

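Note how much BQ shortens indexing time with HNSW.

Testing BQ with your own data

Here we provide code and instructions that help you reproduce the experiments above on your own data and find the best balance between recall, latency, and memory footprint.

We have included some very useful tools in this repository. To run these tests easily (or any test with your own data), store your data in hdf5 format, laid out as described in the ANN benchmarks. You can index the data with the Go benchmark tool. The tool gives you a better idea of QPS by using concurrent queries. It accepts several parameters that you can explore, but in our runs we used the following command: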

go run main.go ann-benchmark -v ~/Documents/datasets/dbpedia-100k-openai-ada002.hdf5 -d cosine --indexType flat

Note that the -d parameter sets the distance and --indexType switches between hnsw and flat.

To run with compression (BQ enabled):

go run main.go ann-benchmark -v ~/Documents/datasets/dbpedia-100k-openai-ada002.hdf5 -d cosine --indexType flat --bq enabled

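Note that the --bq parameter turns on compression.

Once you run the script, different metrics are printed in the terminal at the end of the run. Pay particular attention to QPS and recall. The results are saved in JSON format in a directory named results, in the same path as the script. You can then run the visualize.py script to generate the same graphs shown in this post. Your graph will be available as output.png in the same path as the script.

Happy compressing! 🚀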

DuckDB -- Floating-Point Compression

tang-hi — Sun, 03 Mar 2024 00:00:00 GMT

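Compressing floating-point numbers has long been a hard problem. Because of the way they are stored, neither the compression ratio nor the decompression speed has been very satisfying. DuckDB adopts the method proposed in the ALP paper, which improves on both fronts. This post describes how ALP compresses.

Prerequisites

IEEE 754 Double Representation

First, let's recall how a floating-point number is represented inside the computer. It has three parts

Sign bit (sign)
Exponent bits (exponent)
Fraction bits (fraction)

<div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/double-represent.png"/> </div> From these three parts, the value of the double is $$ double = (-1)^{sign} \times 2^{(exp-1023)} \times \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) $$

Compression

Floating-point numbers are hard to compress because their binary representation is split into three parts, so we cannot compress them as a single unit the way we can with integers.

Broadly, there are two ways to compress floating-point numbers

Convert the floats to integers and compress those
Compress the floats part by part

1. Converting floats to integers

The idea looks simple: multiply by a factor so that the fractional part to the right of the decimal point disappears.

Take 8.0605 as an example. Multiplying by $10^4$ turns it into an integer, and as long as we record the factor, we can restore the float when decompressing. $$ 80605 = 8.0605 * 10^4 \newline 8.0605 = 80605 * 10^{(-4)} $$

Because of the limited precision of floating-point representation, however, this conversion does not always give back the original value when decompressing, as the figure below shows. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/loss.png"/> </div> The decompressed value differs slightly from the value before compression. Such lossy compression is unacceptable for finance-related workloads, so we need a way to compress losslessly. The most common approach today is to increase the factor.

When we raise the factor to $10^7$, the compression becomes lossless. But this creates a new problem: the compression ratio drops, and in some extreme cases the result may be worse than not compressing at all.

2. Compressing in parts

When the first approach cannot compress effectively, we compress part by part. The observation behind this is that, within a set of floating-point numbers, the exponent bits have low variance, in other words their values are similar. We can therefore sample the data set and choose a split point. The left part holds the similar exponent bits, which we compress with dictionary encoding, and the right part is compressed with bit packing. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/alprd.png"/> </div> This approach also compresses effectively.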

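ALP

ALP uses the first approach. It first samples the array of floats to be compressed and determines a factor $10^e$ that lets most of the numbers be compressed losslessly. It also determines a second factor $10^{-f}$: if the first factor had to be large to preserve precision, the resulting integers end in many zeros, which wastes space, so a suitable $10^{-f}$ removes the trailing zeros. One might worry that multiplying by another factor introduces new error. According to the paper, it does not, because the paper uses its own efficient rounding routine, which is also well suited to SIMD acceleration.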

static const long long SWEET = (1ll << 51) + (1ll << 52);
long long fast_round(double d) {
  // Adding and subtracting SWEET rounds d to the nearest integer.
  return static_cast<long long>(d + SWEET - SWEET);
}

Taking 8.0605 as the example again, suppose the factors are $10^{14}$ and $10^{-10}$. We can test this with the following code

  #include <limits>
  #include <iomanip>
  #include <iostream>

  static const long long SWEET = (1ll << 51) + (1ll << 52);
  long long fast_round(double d) {
    // Adding and subtracting SWEET rounds d to the nearest integer.
    return static_cast<long long>(d + SWEET - SWEET);
  }

  int main()
  {
      double number = 8.0605;
      std::cout << "before compressed: ";
      std::cout << std::fixed << std::setprecision(std::numeric_limits<double>::digits10) << number << std::endl;
      // Encode: scale up by 10^14, then remove trailing zeros with 10^-10.
      long long compressed = fast_round(number * 1e14 * 1e-10);
      // Decode: apply the inverse factors in reverse order.
      double decompressed = static_cast<double>(compressed) * 1e10 * 1e-14;
      std::cout << "after compressed: ";
      std::cout << std::fixed << std::setprecision(std::numeric_limits<double>::digits10) << decompressed << std::endl;
      return 0;
  }

The test prints

before compressed: 8.060499999999999

after compressed: 8.060499999999999

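The ALP algorithm therefore works as follows

1. Sample the data to determine the factors $10^e$ and $10^{-f}$
2. Multiply every number in the array by the two factors from step 1 and check whether precision is lost

2.1. If no precision is lost, store the number as an integer

2.2. If precision is lost, store the number separately as an exception
3. Compress the integer array produced in step 2 with the FOR algorithm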
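The last step can be sketched as follows, assuming that FOR here means frame-of-reference encoding: subtract the smallest value from every integer and store the small differences with as few bits as possible. The function name for_encode is hypothetical, and the sketch only computes the differences and the required bit width; it leaves out the actual bit packing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame-of-reference encoding: store the minimum once as the base and
 * keep only each value's distance from it. Returns the number of bits
 * needed per delta (0 if all values are equal). */
unsigned for_encode(const int64_t *in, size_t n, int64_t *base,
                    uint64_t *deltas) {
  assert(in != NULL && base != NULL && deltas != NULL && n > 0);
  int64_t min = in[0], max = in[0];
  for (size_t i = 1; i < n; i++) {
    if (in[i] < min) min = in[i];
    if (in[i] > max) max = in[i];
  }
  *base = min;
  for (size_t i = 0; i < n; i++)
    deltas[i] = (uint64_t)in[i] - (uint64_t)min; /* unsigned: no overflow */
  unsigned width = 0;
  for (uint64_t range = (uint64_t)max - (uint64_t)min; range != 0; range >>= 1)
    width++;
  return width;
}
```

For the integers 80605, 80610 and 80600, the base is 80600, the deltas are 5, 10 and 0, and each delta fits in 4 bits instead of 64. Decoding adds the base back to each delta.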

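ALPRD

When ALP is not applicable (most numbers cannot be compressed losslessly, or the compression ratio is poor), we fall back on the part-by-part compression described earlier. The algorithm works as follows

Sample the data to determine the bit position (P) at which to split.
Compress the bits to the left of P with dictionary encoding
Compress the bits to the right of P with bit packing <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/alprd.png"/> </div>

Summary

ALP compresses arrays of floating-point numbers with a very simple and efficient algorithm. It achieves a good compression ratio and is also SIMD friendly, so it can take full advantage of the hardware to speed up both compression and decompression. This post is only a rough introduction to ALP; readers who want the full picture should read the original ALP paper.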

DuckDB -- table's file format

tang-hi — Tue, 12 Dec 2023 00:00:00 GMT

In this article, we will explore how DuckDB organizes its table structures. We'll focus solely on the table structures while disregarding any irrelevant details.

Background Information

Block Types

DuckDB stands out from other databases by storing all of its data within a single file. To manage this data effectively, DuckDB utilizes two types of blocks: MetaBlocks and DataBlocks.

A DataBlock represents an individual unit of stored information.
A MetaBlock, on the other hand, is one link in a list of blocks: its first 8 bytes hold the value of 'next_block_id'. When content is too large to fit in a single block, it is stored across such a list of blocks.

Field Reader

In certain cases, data is read from a Block through a mechanism called the "Field Reader". The Field Reader has nothing to do with the table's fields. Before reading the actual data, it first reads two key parameters: max_field_count and total_size.

max_field_count: Indicates the number of fields that will be retrieved afterwards.
total_size: Represents the overall size (in bytes) of the subsequent data retrieval process.

Segment Tree

A Segment is a block of data, and a data structure called the segment tree ("SegmentTree") manages these segments. Despite its name, a segment tree stores its segments in a vector rather than a tree. It locates a specific segment with binary search, so it assumes the segments are stored in order. One notable property of the segment tree is lazy loading: instead of reading all segments into memory at once, it retrieves individual segments from disk on demand.
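The lookup part can be sketched as a binary search over a sorted vector. This is a simplified illustration, not DuckDB's SegmentTree: the `Segment` struct and `find_segment` are hypothetical, and lazy loading is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical segment descriptor: covers rows [start, start + count).
struct Segment {
    uint64_t start;
    uint64_t count;
};

// Return the index of the segment that contains `row`, or -1 if none does.
// Assumes `segments` is sorted by `start`, as the text above requires.
inline std::ptrdiff_t find_segment(const std::vector<Segment> &segments, uint64_t row) {
    // First segment whose start is strictly greater than row.
    auto it = std::upper_bound(segments.begin(), segments.end(), row,
                               [](uint64_t r, const Segment &s) { return r < s.start; });
    if (it == segments.begin()) {
        return -1;  // row lies before the first segment
    }
    --it;  // last segment whose start is <= row
    if (row >= it->start + it->count) {
        return -1;  // row falls after the end of that segment
    }
    return it - segments.begin();
}
```

For segments covering rows [0, 100), [100, 150), and [150, 160), `find_segment` maps row 99 to index 0, row 100 to index 1, and row 160 to -1.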

File Structure

In this section, we will begin by introducing the file structure of DuckDB.

Looking at the diagram, we notice that DuckDB has three headers - but don't worry! These headers won't confuse us when it comes to understanding how tables are stored; they're just some extra information we'll briefly cover.

Let's start with the MainHeader:

Checksum: A handy way to verify data integrity.
Magic bytes: These special bytes confirm that this file belongs to DuckDB.
Version numbers: Keep track of software versions for compatibility purposes.
Flags: Indicate if you have permission to read or write on this database.

Now let's move on to DataBaseHeader:

Iteration: How many times things have been iterated over (processed repeatedly).
Meta block: The unique identifier for the first data block in storage.
Free list: Blocks ready for reuse, saving space and resources.
Block count: The total number of blocks in the database.

Now, let's take a look at how the actual data is stored. It consists of a schema count and ${schema_count} individual schemas. In DuckDB, think of a schema as a database.

Let's take a closer look at how schemas are stored in DuckDB databases. The first piece of information is the schema's name, followed by a count of each type it contains. Now let's briefly go over these different types together! For more detailed definitions, feel free to visit their official website.

Now, let's shift our attention specifically to tables and explore their structures further!

From the diagram, we can see that the first three items in the table are the catalog name, schema name, and table name. Together they identify which file (catalog), database (schema), and table this entry belongs to. The "constraints" field describes restrictions imposed on the table, such as Not Null or Unique. For now, let's not dive into that; instead, let's look at the 'Columns' section and the 'table data' field.

Columns

Within this section resides a collection of definitions for each individual column found within a given table.

The first attribute is the column count, the total number of columns in the table. It is followed by one definition for each column.

Now let's delve into what each attribute signifies:

column name - Field name
column type - Field type
expression - Expression, some fields are generated through expressions.
table Column type - This is different from the column type, it does not represent the field type, but only has two values: STANDARD and GENERATED. (Actually, I'm not quite sure about the meaning of this field, it probably indicates whether this field is generated or not)
compression type - Indicates the compression method used for this field.

Once we have obtained information regarding the types and characteristics of each column, we gain comprehensive knowledge about the overall structure and organization of our dataset. The remaining components consist of actual data entries and index-related information, which can be accessed through the table data field.

table data

Since indexes and table data are usually large in size, we don't store them directly here. Instead, we store pointers (block-id, offset) that point to their actual locations.

Let's explain each field using this diagram:

table data block: Pointer to the actual table data.
total rows: The number of rows in this table.
index num: The total number of indexes in this table.
index: Pointer to the index.

Now let's examine how table data block stores its actual structured information.

The initial storage contains metadata about a series of column data (the structure of the column data block will be explained later). The last two fields are easy to understand. The first one stores statistical information about the table, and the other one stores the number of 'row groups'. Now, let's address two questions: what is a 'row group', and why is the storage format different from before, which was <data-count, data, data,...data>, but now only stores a 'row group pointer'? What happens if there are more than 1 'row group'?

row group

We all know that OLAP generally uses columnar storage while OLTP uses row-based storage. Although columnar storage is superior in terms of reading and computation, when it comes to frequent insertions, deletions or updates, row-based storage outperforms columnar storage. Therefore, DuckDB has come up with a compromise here by grouping tuples and storing them in columns within each group. Currently, every 122880 tuples form one group.

Why only one row group pointer?

Because row groups are always stored sequentially by row number, and their blocks are stored as meta blocks, they can be managed through a SegmentTree that lazily loads subsequent row groups. When a later row group is needed, it is read by following the chain from the previous one. That's why we only need to store the block-id of the first block here.

Now let's take a look at the storage structure of row groups.

Row start: The starting row number of this row group.
Tuple count: The number of rows in this row group.
Column pointers: Since columns are stored within each row group vertically (column-wise), this pointer points to the actual storage address for each column.
Versions: I haven't looked into this field too closely; it should be related to MVCC.

Let's continue with the storage structure of column data blocks.

To our surprise, we discovered that this isn't actually where the data itself is stored; it still holds pointers instead! But why? Well, it turns out that the real column data resides in what's called "pure blocks." Unlike their counterparts known as "meta blocks," these pure blocks don't have a convenient way to keep track of all their contents using a simple list structure like before.

As usual, let's explain the meanings of each field:

row start: The starting row number of this data.
tuple count: The total number of rows stored.
block id: The block ID where the actual data resides.
offset: The offset within the block ID where the actual data resides.
compress: The compression method used for the data.
stat: Statistical information about this portion of data.

Now, we've finally arrived at the very heart of it all—the block housing our precious columnar data! However, please note that its storage format can vary depending on which compression technique was employed during processing. I'll briefly introduce a few common types here, but if you're curious about others, feel free to explore them further on your own!

Const Column

In a const column, every single value is identical. This means we don't need to store any actual data at all. Instead, we can simply retrieve the minimum value from the statistical information.

uncompress column

An uncompressed column is one with no compression applied. For fixed-size data types such as uint32 or uint8, each value can be read directly with no additional steps. Variable-length data types such as strings have no predetermined length, so a different approach called Dictionary Compression is used.

For strings, the first two fields give us the position of the dictionary:

dict_start = dict_end - dict_size
dict_end = dict_end
dict_size = dict_size

Here, we can consider the dictionary as a string pool and offsets as corresponding starting positions, where offsets[i] - offsets[i-1] represents the length. This might sound abstract, so let's take an example.

In this example, we have three strings: foo, bar, and duckdb.

We store these three strings in reverse order in a dictionary. The offset is relative to the "dict end". This allows us to locate the starting address of the corresponding string.

foo
head = dict_end - offset = dict_end - 3

length = 3 - 0 = 3
bar
head = dict_end - offset = dict_end - 6

length = 6 - 3 = 3
duckdb
head = dict_end - offset = dict_end - 12

length = 12 - 6 = 6
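The same arithmetic, written as a small sketch. It models the dictionary as a byte string whose last byte sits at "dict end" and the offsets as cumulative distances from that end; `read_dict_string` is a hypothetical helper, and the negative-offset case for oversized strings is not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Read the i-th string from a dictionary that grows backwards from its end.
// offsets[i] is the distance from dict end to the start of string i.
inline std::string read_dict_string(const std::string &dict,
                                    const std::vector<uint32_t> &offsets,
                                    std::size_t i) {
    assert(i < offsets.size());
    const uint32_t previous = (i == 0) ? 0 : offsets[i - 1];
    assert(offsets[i] >= previous && offsets[i] <= dict.size());
    const std::size_t length = offsets[i] - previous;   // offsets[i] - offsets[i-1]
    const std::size_t head = dict.size() - offsets[i];  // dict_end - offset
    return dict.substr(head, length);
}
```

For the example above the dictionary bytes are `duckdbbarfoo` with offsets `{3, 6, 12}`: string 0 starts 3 bytes before the end (`foo`), string 1 starts 6 bytes before the end (`bar`), and string 2 starts 12 bytes before the end (`duckdb`).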

Now, let's address a potential issue: what if a string exceeds the maximum allowed length? In such cases, we utilize negative offsets to indicate that these strings are longer than usual. To access them, we store their (block id, offset) pairs in our dictionary and retrieve them from another block where they are stored.

RLE column and bitpacking

RLE column is relatively simple, with the values stored in the front and the number of occurrences of each value stored in the back. They are separated by RLE count offset.
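A minimal sketch of that layout follows: values first, run lengths after them, with the count offset marking the boundary. The element widths (`int32_t` values, `uint16_t` counts) and both function names are assumptions chosen for the example, not DuckDB's actual format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Encode into one buffer: [value, value, ...][count, count, ...].
// count_offset receives the byte position where the counts begin.
inline std::vector<uint8_t> rle_encode(const std::vector<int32_t> &input, std::size_t &count_offset) {
    std::vector<int32_t> values;
    std::vector<uint16_t> counts;
    for (int32_t v : input) {
        if (!values.empty() && values.back() == v && counts.back() < UINT16_MAX) {
            ++counts.back();  // extend the current run
        } else {
            values.push_back(v);  // start a new run
            counts.push_back(1);
        }
    }
    count_offset = values.size() * sizeof(int32_t);
    std::vector<uint8_t> buffer(count_offset + counts.size() * sizeof(uint16_t));
    if (!values.empty()) {
        std::memcpy(buffer.data(), values.data(), count_offset);
        std::memcpy(buffer.data() + count_offset, counts.data(), counts.size() * sizeof(uint16_t));
    }
    return buffer;
}

// Decode by pairing the r-th value in the front with the r-th count in the back.
inline std::vector<int32_t> rle_decode(const std::vector<uint8_t> &buffer, std::size_t count_offset) {
    assert(count_offset % sizeof(int32_t) == 0);
    const std::size_t runs = count_offset / sizeof(int32_t);
    assert(buffer.size() >= count_offset + runs * sizeof(uint16_t));
    std::vector<int32_t> out;
    for (std::size_t r = 0; r < runs; ++r) {
        int32_t value;
        uint16_t count;
        std::memcpy(&value, buffer.data() + r * sizeof(int32_t), sizeof(value));
        std::memcpy(&count, buffer.data() + count_offset + r * sizeof(uint16_t), sizeof(count));
        out.insert(out.end(), static_cast<std::size_t>(count), value);
    }
    return out;
}
```

Encoding `{7, 7, 7, 1, 1, 9}` produces three runs, so the counts begin at byte offset 12.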

As for bitpack columns, I'll let curious readers delve into that topic themselves.

Dictionary column

If you understand how strings are stored in an "uncompress column", the "Dictionary column" is easy to follow. Here 'dict' keeps its original meaning, the 'index Buffer' plays the role of the 'offsets' described earlier, and 'bitpacking' records, for each row, which position in the index Buffer holds that row's value. The stored value is retrieved with dict.get(indexBuffer[bitpacking[i]]).

One important optimization technique employed here involves decompressing the dict during actual scanning. However, if it turns out that all data needs to be scanned, only decompressing bitpacking would suffice.

Last

In this article, we explored how tables are stored in DuckDB. Unlike other databases, DuckDB takes a unique approach by utilizing just one file to store all of its data (although I'm unsure whether this is advantageous or not). Designed as a single-machine database without distributed capabilities in mind, it aims to optimize performance through techniques such as lazy loading with row groups. Additionally, column data can be compressed using various formats for efficient storage.

Interesting Knowledge -- The Relationship Between CPU Utilization, Latency, and Throughput

tang-hi — Mon, 13 Nov 2023 00:00:00 GMT

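The Relationship Between CPU Utilization and Latency

When running online services, we often set a threshold on CPU utilization and scale the service out once it is exceeded. How should that threshold be chosen, and why do we need to scale out when CPU utilization reaches 50% (meaning the CPU is idle half the time)? Answering these questions takes more than an intuitive feel for CPU utilization and latency; we need a quantitative description.

1. Little's Law

Little's Law is an important theorem in queueing theory, proposed by John D. C. Little in 1961. It states that the average number of customers in a system equals the average arrival rate multiplied by the average time a customer spends in the system. Translated to computing, the average number of requests in the system equals the number of requests arriving per unit time multiplied by the average processing time. The formula is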

$$ L = \lambda W $$

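where:

$L$ : the average number of requests in the system, i.e. requests currently being processed plus those waiting

$\lambda$ : the number of requests arriving per unit time, i.e. QPS

$W$ : the average processing time of a single request, i.e. latency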

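2. Building the Model

With the theorem above in hand, we can model our service with a simple M/M/1 model:

We have only one server
The server handles one request at a time
Request arrivals follow a Poisson distribution
Request processing times follow an exponential distribution

The service is dynamic, so the number of pending requests changes from moment to moment, but it follows a Markov chain in which every state transition is a Poisson process.

So if the current state is $i$, it can only move to or from a neighboring state ($i \pm 1$). Because the system is in steady state, the rate of leaving a state equals the rate of entering it. Taking $j = i + 1$, the flow from $i$ up to $j$ balances the flow from $j$ down to $i$: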

$$ \lambda * P(X=i) = \mu * P(X=j) $$

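where:

$P(X=i)$ is the probability of state $i$, i.e. the probability that there are $i$ requests

$P(X=j)$ is the probability of state $j$, i.e. the probability that there are $j$ requests

$\lambda$ is the QPS

$\mu$ is the number of requests processed per unit time

From this equation we get the following recurrence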

$$ \lambda * P(X=0) = \mu * P(X=1)\newline \lambda * P(X=1) = \mu * P(X=2)\newline \lambda * P(X=2) = \mu * P(X=3)\newline .\newline .\newline $$

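From these equations we can derive

$$ P(X=i) = (\frac{\lambda}{\mu})^i * P(X=0) $$

Since all the probabilities sum to 1 (which requires $\lambda < \mu$)

$$ \sum_{i=0}^{\infty}P(X=i) = 1\newline (\frac{\lambda}{\mu})^0 * P(X=0) + (\frac{\lambda}{\mu})^1 * P(X=0) + ... + (\frac{\lambda}{\mu})^n * P(X=0) + ... = 1\newline P(X=0) = \frac{(\mu - \lambda)}{\mu} $$

Likewise, we can use this model to compute the average number of pending requests, i.e. those waiting in the queue, not counting the one being served:

$$ \sum_{i=1}^{\infty}P(X=i) * (i-1)\newline = P(X=0) * (\frac{\lambda}{\mu})^2 * (\frac{1}{(1-\frac{\lambda}{\mu})^2}) $$

Substituting $P(X=0)$ gives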

$$ L = \frac{\lambda^2}{\mu * (\mu - \lambda)} $$

Applying Little's Law to these pending requests, we get

$$ W = \frac{\lambda}{\mu * (\mu - \lambda)} $$

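The derivation is fairly long; you only need to remember the last formula, which relates latency to QPS and to the number of requests processed per unit time.

3. Expressing CPU Utilization

We can treat CPU utilization as the number of requests in a period divided by the maximum number of requests the CPU could process in that same period, which gives the following formula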

$$ \rho = \frac{\lambda * T}{\mu * T} = \frac{\lambda}{\mu} $$

Substituting this into the formula from section 2 gives the relationship between latency and CPU utilization

$$ W = \frac{\rho}{\mu*(1-\rho)} $$

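What does this curve actually look like? We can plot it with Wolfram.

The plot shows that latency grows rapidly as CPU utilization rises. The data below (which corresponds to $\mu = 100$) makes the relationship more concrete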

CPU usage	latency	latency increase
0.2	0.0025
0.3	0.0043	71%
0.4	0.0067	56%
0.5	0.01	50%
0.6	0.015	50%
0.7	0.023	56%
0.8	0.040	71%
0.9	0.090	125%
0.99	0.99	1000%
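The table can be reproduced from the formula $W = \frac{\rho}{\mu(1-\rho)}$ with a few lines of code. This is just a sketch for checking the numbers: it assumes $\mu = 100$ (the value the table's latencies imply), and `mm1_wait`, `LatencyRow`, and `latency_table` are names invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row of the table: utilization, latency, and growth versus the previous row.
struct LatencyRow {
    double rho;
    double latency;
    double increase_pct;
};

// W = rho / (mu * (1 - rho)), valid for 0 <= rho < 1.
inline double mm1_wait(double rho, double mu) {
    assert(mu > 0.0 && rho >= 0.0 && rho < 1.0);
    return rho / (mu * (1.0 - rho));
}

// Evaluate the formula for each utilization and compute the relative increase.
inline std::vector<LatencyRow> latency_table(const std::vector<double> &rhos, double mu) {
    std::vector<LatencyRow> rows;
    for (std::size_t i = 0; i < rhos.size(); ++i) {
        const double w = mm1_wait(rhos[i], mu);
        double increase = 0.0;  // the first row has nothing to compare against
        if (i > 0 && rows.back().latency > 0.0) {
            increase = (w / rows.back().latency - 1.0) * 100.0;
        }
        rows.push_back({rhos[i], w, increase});
    }
    return rows;
}
```

For instance, `latency_table({0.8, 0.9}, 100.0)` yields latencies of about 0.04 and 0.09, an increase of about 125%.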

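Every 10-point increase in CPU utilization raises latency by roughly 50% or more, and going from 80% to 90% utilization raises it by 125%. So we can draw the following conclusions:

Why do we sometimes need to scale out when CPU utilization reaches only 50%? Because latency does not grow linearly with CPU utilization.
Once CPU utilization reaches 80%, the service's load deserves serious attention.
If you want low latency, you need to keep CPU utilization low. With the code unchanged, high CPU utilization means high latency. (This reminds me of a cost-cutting project a former manager proposed: high utilization with low latency.)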

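The Relationship Between Throughput and Latency

Throughput and latency are a perennial topic in system tuning. What exactly is the relationship between them, and can we describe it with a formula as we did before?

1. Building the Model

First, we assume the following

We have only one server
The server handles one request at a time
Request arrivals follow a Poisson distribution
Every request takes the same processing time

Under these assumptions, the model can be illustrated by the following figure, where the horizontal axis is time and the vertical axis is the amount of pending work.

The next figure shows the more general case, where a new request may arrive before the previous one has finished.

Now we can compute the area of this figure, which can be done in two ways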

Method 1 $$ Area = T * (average ~~ height) \newline = T * (average ~~ wait ~~ time) \newline = T * W\newline $$

where:

$T$ is the length of the time period

$W$ is the average waiting time of a request

Method 2 $$ Area = (area ~ of ~ triangles) + (area ~ of ~ parallelograms)\newline = (number ~ of ~ requests) * (area ~ of ~ single ~ triangle + area ~ of ~ single ~ parallelogram) \newline = T*\lambda * ( (\frac{S^2}{2}) + S * W) $$

where:

$T$ is the length of the time period

$\lambda$ is the number of requests processed per unit time, i.e. throughput

$S$ is the processing time of each request

$W$ is the average waiting time of a request

Setting the two expressions equal

$$ T * W = T*\lambda * ( (\frac{S^2}{2}) + S * W) $$

we get

$$ W = \frac{\lambda*S^2}{2 * (1 - S\lambda)} $$

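If we fix S (the processing time of each request), we get the following plot of W as a function of $\lambda$, i.e. the relationship between latency and throughput.

From this formula we can draw the following conclusions

Latency rises with throughput, so low latency and high throughput cannot be had at the same time.
We also arrive at an intuitive result: high throughput requires high CPU utilization, while low latency requires low CPU utilization.
If you can halve the time it takes to process a request, you can double the throughput while keeping latency no worse than before (by the formula above, the waiting time is in fact halved).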
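The last point is easy to check numerically with the formula $W = \frac{\lambda S^2}{2(1 - S\lambda)}$. The sketch below is only a sanity check; `md1_wait` and `utilization` are names made up for this example, and the sample values of $\lambda$ and $S$ are arbitrary.

```cpp
#include <cassert>

// Average waiting time for fixed service time s and arrival rate lambda:
// W = lambda * s^2 / (2 * (1 - s * lambda)), valid while s * lambda < 1.
inline double md1_wait(double lambda, double s) {
    assert(lambda >= 0.0 && s > 0.0 && lambda * s < 1.0);
    return lambda * s * s / (2.0 * (1.0 - s * lambda));
}

// Fraction of time the server is busy.
inline double utilization(double lambda, double s) {
    return lambda * s;
}
```

With $\lambda = 50$ and $S = 0.01$ the waiting time is about 0.005. Halving $S$ to 0.005 and doubling $\lambda$ to 100 leaves utilization at 0.5 and gives a waiting time of about 0.0025, half of the earlier value.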

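Conclusion

Why do we sometimes need to scale out when CPU utilization reaches only 50%? Because latency does not grow linearly with CPU utilization.
Once CPU utilization reaches 80%, the service's load deserves serious attention.
If you want low latency, you need to keep CPU utilization low. With the code unchanged, high CPU utilization means high latency.
Latency rises with throughput, so low latency and high throughput cannot be had at the same time.
We also arrive at an intuitive result: high throughput requires high CPU utilization, while low latency requires low CPU utilization.
If you can halve the time it takes to process a request, you can double the throughput while keeping latency no worse than before.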

DuckDB -- MVCC and Insert, Delete, Update, Query

tang-hi — Fri, 04 Aug 2023 00:00:00 GMT

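DuckDB's MVCC implementation comes from a paper, but DuckDB simplifies it: its isolation level is not serializable but snapshot isolation, which greatly reduces implementation complexity. This article describes DuckDB's MVCC mechanism in detail, along with how insert, delete, update, and query are implemented.

Note:

DuckDB is the first database implementation I have read, so this article does not compare its MVCC with that of other databases.
This article does not dissect every implementation detail; the goal is to give you a complete picture of how it works, so that reading the source later is easier.

Prerequisites

State tracking in DuckDB

Every operation in DuckDB, whether insert, delete, update, or query, has a state that tracks the whole process. For a table scan, for example, a TableScanGlobalSourceState and a TableScanLocalSourceState track the query; they mainly record which row the scan has reached, how many rows remain, and so on.

The exact meaning of global and local differs somewhat per operator, as described later for each operation. Because DuckDB's table format breaks down as rowGroups -> rowGroup -> column -> segment, each of these units has its own state tracking it.

DuckDB's local storage

DuckDB's storage has two parts. One is the table, which represents the table's state on disk. The other is local storage, which holds the operations the current transaction has performed on the table (inserts, deletes, updates, and so on). Local storage is merged into the table only at commit. This has two benefits.
1. Higher transaction concurrency.
2. Rollback is almost free. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/mvcc-local-storage.png"/> </div>
DuckDB's MVCC granularity

DuckDB's MVCC works at Segment granularity: each portion of data in a column has a version info recording which transaction inserted it and which transaction deleted it. There is also an Update Segment recording its update versions. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/mvcc-singualrity.png"/> </div>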

MVCC

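DuckDB assigns two values to every newly created transaction.

transaction id (incrementing from 2^62)
start time (incrementing from 2)

The reason is that while a transaction is uncommitted, its transaction id is used as its commit id; only once it commits is the commit id set to the commit time. This guarantees that changes made by an uncommitted transaction cannot be seen by others.

I'll first describe the MVCC implementation in words, then work through an example to make it more concrete.

In words

For every Segment we maintain a linked list of version information. The version in each entry is initialized to the transaction id and updated to the commit id on commit (each entry stores the data as it was before that transaction's change).

When scanning data, we keep comparing each version with the current transaction's start time, and apply the saved version when both of the following conditions hold.

version_number > start_time

The version is either uncommitted or was committed after this transaction started, so we should restore the data to what it was before that change, i.e. apply the saved version.
version_number != transaction_id

The change was not made by the current transaction. A transaction must see its own changes, so we never restore data to the version before the current transaction's own change.

When modifying data, we modify it in place, save the pre-change data into the Undo Buffer, and insert it at the head of the linked list.

To allow columns to be compressed, DuckDB does not actually modify the data in place; instead it keeps a dummy node at the head of the linked list, and its "in-place" modification modifies that dummy node. This does not affect understanding MVCC, so you can simply think of DuckDB as modifying in place.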

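Example

Consider the following example.

We have a bank deposit table in which every customer's balance is 10, and four transactions running concurrently.

Txn1 Thomas transfers 1 to Larry
Txn2 Thomas transfers 1 to Tom
Txn3 computes the sum
Txn4 Thomas transfers 1 to Andy

Suppose Txn1 and Txn4 have committed while Txn2 and Txn3 are still running; Txn1 and Txn4 committed at T1 and T2, and Txn2 and Txn3 started at T3 and T4. Their transaction ids are very large numbers. The overall version info then looks like the figure below. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/mvcc-version-info.png"/> </div>

Each transaction has its own Undo Buffer, and each piece of version information is maintained in a linked list. Now consider how Txn3 executes.

When reading Thomas's balance, the value in the table is 10, but because Thomas's version info is not null we must walk the linked list to see whether a more suitable version exists.

The dummy node is applied directly; balance becomes 7
UndoBuffer Ty: since Ty > T3, balance becomes 8.
UndoBuffer T2: since T2 < T3, it is not applied.
UndoBuffer T1: since T1 < T3, it is not applied.

The final result is 8, which satisfies snapshot isolation.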
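The walk over the version chain can be sketched as below. This is a simplified model of the rule described above, not DuckDB's code: `UndoEntry`, `read_visible`, and the concrete timestamps used in the usage note are hypothetical, and each entry stores a single pre-change value instead of a vector of tuples.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using transaction_t = uint64_t;

// Per the article, transaction ids start at 2^62 so they exceed any commit time.
constexpr transaction_t FIRST_TRANSACTION_ID = transaction_t{1} << 62;

// One entry of the version chain: who made the change and the value before it.
struct UndoEntry {
    transaction_t version;  // transaction id while uncommitted, commit id afterwards
    int64_t before;         // the value as it was before that change
};

// Start from the current (in-place) value and walk the chain, newest entry first.
// An entry is applied when it is not visible to the reader:
//   version > start_time  (uncommitted, or committed after the reader started)
//   version != own id     (a transaction always sees its own changes)
inline int64_t read_visible(int64_t current, const std::vector<UndoEntry> &chain,
                            transaction_t start_time, transaction_t transaction_id) {
    int64_t value = current;
    for (const UndoEntry &entry : chain) {
        if (entry.version > start_time && entry.version != transaction_id) {
            value = entry.before;
        }
    }
    return value;
}
```

Modeling Thomas's balance with a current value of 7 and a chain of `{Txn2's id, 8}`, `{3, 9}`, `{2, 10}` (commit times 2 and 3 standing in for T1 and T2), a reader that started at time 5 gets 8, while Txn2 itself reads its own value 7.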

The remaining reads are left as an exercise for the reader. Next we look at DuckDB's insert, delete, update, and query.

Insert

The entry point for Insert is PhysicalInsert::Sink

if (!parallel) {
		// init global state if not initialized
		if (!gstate.initialized) {
			storage.InitializeLocalAppend(gstate.append_state, context.client);
			gstate.initialized = true;
		}

		// check if has some conflict with the rules such as UNIQUE, FOREIGN KEY, etc.
		idx_t updated_tuples = OnConflictHandling(table, context, lstate);
		gstate.insert_count += lstate.insert_chunk.size();
		gstate.insert_count += updated_tuples;
		storage.LocalAppend(gstate.append_state, table, context.client, lstate.insert_chunk, true);

		if (return_chunk) {
			gstate.return_collection.Append(lstate.insert_chunk);
		}
	} else {
		// add into local state's insert chunk
		D_ASSERT(!return_chunk);
		// parallel append
		if (!lstate.local_collection) {
			lock_guard<mutex> l(gstate.lock);
			auto &table_info = storage.info;
			auto &block_manager = TableIOManager::Get(storage).GetBlockManagerForRowData();
			lstate.local_collection =
			    make_uniq<RowGroupCollection>(table_info, block_manager, insert_types, MAX_ROW_ID);
			lstate.local_collection->InitializeEmpty();
			lstate.local_collection->InitializeAppend(lstate.local_append_state);
			lstate.writer = &gstate.table.GetStorage().CreateOptimisticWriter(context.client);
		}
		OnConflictHandling(table, context, lstate);

		auto new_row_group = lstate.local_collection->Append(lstate.insert_chunk, lstate.local_append_state);
		if (new_row_group) {
			lstate.writer->WriteNewRowGroup(*lstate.local_collection);
		}
	}

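The code shows that DuckDB's Insert has two modes

Parallel: each operator has its own storage and inserts in parallel; the results are merged into the global storage at Combine (merging costs far less than inserting, because it only needs to repoint pointers to the new location).
Non-parallel: each operator inserts directly into the global storage.

I'll only cover the non-parallel mode here. The insertion flow is the same and only the handling differs, so once you understand the non-parallel mode you also understand the parallel one.

Recall from the prerequisites that, besides its on-disk representation, every table in DuckDB has a Local Storage dedicated to the incremental operations that uncommitted transactions perform on the table. The format of this Local Storage is exactly the same as the table's. So the insertion flow is:

Find the RowGroup in the table to append to
Find the Column in the RowGroup to append to
Find the Segment in the Column to append to
Depending on the compression method the Segment uses, call the corresponding compression function to append the data to the Segment. The corresponding code excerpts are shown below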

// add into rowGroups
bool RowGroupCollection::Append(DataChunk &chunk, TableAppendState &state) {
	
	idx_t append_count = chunk.size();
	idx_t remaining = chunk.size();
	auto current_row_group = state.row_group_append_state.row_group;
		// check how much we can fit into the current row_group
	idx_t append_count =
		    MinValue<idx_t>(remaining, RowGroup::ROW_GROUP_SIZE - state.row_group_append_state.offset_in_row_group);
		if (append_count > 0) {
			// !! insert into row group
			current_row_group->Append(state.row_group_append_state, chunk, append_count);
		// skip....
}

// add into rowGroup
void RowGroup::Append(RowGroupAppendState &state, DataChunk &chunk, idx_t append_count) {
	// append to the current row_group
	// append into all column
	for (idx_t i = 0; i < GetColumnCount(); i++) {
		auto &col_data = GetColumn(i);
		col_data.Append(state.states[i], chunk.data[i], append_count);
	}
	// update row group append state
	state.offset_in_row_group += append_count;
}

// add into column
void ColumnData::AppendData(BaseStatistics &stats, ColumnAppendState &state, UnifiedVectorFormat &vdata, idx_t count) {
	
	while (true) {
		// append the data from the vector
		idx_t copied_elements = state.current->Append(state, vdata, offset, count);

		// we couldn't fit everything we wanted in the current column segment, create a new one
		{
			auto l = data.Lock();
			AppendTransientSegment(l, state.current->start + state.current->count);
			state.current = data.GetLastSegment(l);
			state.current->InitializeAppend(state);
		}
		// skip...
	}
}

// use compress function to add data into column
idx_t ColumnSegment::Append(ColumnAppendState &state, UnifiedVectorFormat &append_data, idx_t offset, idx_t count) {
	D_ASSERT(segment_type == ColumnSegmentType::TRANSIENT);
	if (!function.get().append) {
		throw InternalException("Attempting to append to a segment without append method");
	}
	return function.get().append(*state.append_state, *this, stats, append_data, offset, count);
}

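A few points in the code deserve attention

If a Segment runs out of space we create a new Segment, but its type is transientSegment. That means it is a temporary Segment: when memory is short it is written to a temporary file and its memory is released.
When a RowGroup fills up we flush it to disk, as if the RowGroup had already been added to the table. Otherwise, when the data being inserted is very large, we would have to write data to temporary files frequently, which could cause serious performance problems. By flushing early, a rollback only needs to mark the region as unused; the only downside is that the database file on disk may grow. If you're interested, see this PR.

After the data has been added to Local Storage, the Insert needs to be committed.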

string DuckTransaction::Commit(AttachedDatabase &db, transaction_t commit_id, bool checkpoint) noexcept {
    // skip...
	try {
		
		storage->Commit(commit_state, *this);
		undo_buffer.Commit(iterator_state, log, commit_id);
		if (log) {
			// commit any sequences that were used to the WAL
			for (auto &entry : sequence_usage) {
				log->WriteSequenceValue(*entry.first, entry.second);
			}
		}
		if (storage_commit_state) {
            // WAL Flush to DISK
			storage_commit_state->FlushCommit();
		}
		return string();
	} catch (std::exception &ex) {
		return ex.what();
	}
}

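The code shows that committing a transaction consists of three steps

storage commit
UndoBuffer commit
Flushing the WAL to disk

Storage Commit

This is relatively simple: iterate over the chunks in LocalStorage and append them to the table.

Note that every column in DuckDB has an insert_id and a delete_id describing which transaction added it and which transaction deleted it. The code calls this Version Info.

After the data is appended to the table, the append information is added to the UndoBuffer. The format is

UndoBuffer Commit

Iterate over the UndoBuffer in reverse order and handle each Entry according to its Undo Flag. For Insert:

Write the newly added data to the LOG
Change the corresponding version info in the table from the transaction id to the commit id

Flushing the WAL to disk

After WAL_FLUSH is written to the WAL, everything is flushed to disk. During a later Replay, a commit happens only when WAL_FLUSH is encountered. So if power is lost before the WAL reaches disk, the changes are invisible after restart even if the Storage/UndoBuffer commit had already happened.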

Delete

The entry point for Delete is PhysicalDelete::Sink

SinkResultType PhysicalDelete::Sink(ExecutionContext &context, DataChunk &chunk, OperatorSinkInput &input) const {
	auto &gstate = input.global_state.Cast<DeleteGlobalState>();
	auto &ustate = input.local_state.Cast<DeleteLocalState>();

	// get rows and
	auto &transaction = DuckTransaction::Get(context.client, table.db);
	auto &row_identifiers = chunk.data[row_id_index];

	// skip...
	gstate.deleted_count += table.Delete(tableref, context.client, row_identifiers, chunk.size());

	return SinkResultType::NEED_MORE_INPUT;
}

idx_t DataTable::Delete(TableCatalogEntry &table, ClientContext &context, Vector &row_identifiers, idx_t count) {
	while (pos < count) {
		idx_t start = pos;
		// transaction inserted tuples have row identifiers >= MAX_ROW_ID
		bool is_transaction_delete = ids[pos] >= MAX_ROW_ID;
		// figure out which batch of rows to delete now
		for (pos++; pos < count; pos++) {
			bool row_is_transaction_delete = ids[pos] >= MAX_ROW_ID;
			if (row_is_transaction_delete != is_transaction_delete) {
				break;
			}
		}
		idx_t current_offset = start;
		idx_t current_count = pos - start;

		Vector offset_ids(row_identifiers, current_offset, pos);
		if (is_transaction_delete) {
			// transaction-local delete
			// transaction add and transaction delete
			delete_count += local_storage.Delete(*this, offset_ids, current_count);
		} else {
			// regular table delete
			delete_count += row_groups->Delete(transaction, *this, ids + current_offset, current_count);
		}
	}
	return delete_count;
}

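The code shows that, unlike insert, delete removes rows from the table directly. However, delete distinguishes whether the rows to delete are transaction-local or belong to the table, i.e. whether they live in local storage or in the table. The rule is that transaction-local row ids are all at or above MAX_ROW_ID. (The deletion logic is the same in both cases, so we only need to look at one of them.)

First we need to find the Row Group to delete from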

idx_t RowGroupCollection::Delete(TransactionData transaction, DataTable &table, row_t *ids, idx_t count) {
	idx_t delete_count = 0;
	// delete is in the row groups
	// we need to figure out for each id to which row group it belongs
	// usually all (or many) ids belong to the same row group
	// we iterate over the ids and check for every id if it belongs to the same row group as their predecessor
	idx_t pos = 0;
	do {
		idx_t start = pos;
		auto row_group = row_groups->GetSegment(ids[start]);
		for (pos++; pos < count; pos++) {
			D_ASSERT(ids[pos] >= 0);
			// check if this id still belongs to this row group
			if (idx_t(ids[pos]) < row_group->start) {
				// id is before row_group start -> it does not
				break;
			}
			if (idx_t(ids[pos]) >= row_group->start + row_group->count) {
				// id is after row group end -> it does not
				break;
			}
		}
		delete_count += row_group->Delete(transaction, table, ids + start, pos - start);
	} while (pos < count);
	return delete_count;
}

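However, we do not actually delete the data. All we do is mark it as deleted, i.e. set the delete id of the affected rows to the current transaction id, indicating that they were deleted by the current transaction.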

void VersionDeleteState::Flush() {
	// no need to flush if there is nothing to flush
	if (count == 0) {
		return;
	}

	// it is possible for delete statements to delete the same tuple multiple times when combined with a USING clause
	// in the current_info->Delete, we check which tuples are actually deleted (excluding duplicate deletions)
	// this is returned in the actual_delete_count
	auto actual_delete_count = current_info->Delete(transaction.transaction_id, rows, count);
	delete_count += actual_delete_count;
	// we actually delete some tuples: push the delete into the undo buffer
	if (transaction.transaction && actual_delete_count > 0) {
		// now push the delete into the undo buffer, but only if any deletes were actually performed
		transaction.transaction->PushDelete(table, current_info, rows, actual_delete_count, base_row + chunk_row);
	}
	count = 0;
}

// delete according row
idx_t ChunkVectorInfo::Delete(transaction_t transaction_id, row_t rows[], idx_t count) {
	any_deleted = true;

	idx_t deleted_tuples = 0;
	for (idx_t i = 0; i < count; i++) {

		// already deleted
		if (deleted[rows[i]] == transaction_id) {
			continue;
		}

		// first check the chunk for conflicts
		if (deleted[rows[i]] != NOT_DELETED_ID) {
			// tuple was already deleted by another transaction
			throw TransactionException("Conflict on tuple deletion!");
		}
		// delete
		deleted[rows[i]] = transaction_id;
		rows[deleted_tuples] = rows[i];
		deleted_tuples++;
	}
	return deleted_tuples;
}

// add undo buffer
void DuckTransaction::PushDelete(DataTable &table, ChunkVectorInfo *vinfo, row_t rows[], idx_t count, idx_t base_row) {
	auto delete_info = reinterpret_cast<DeleteInfo *>(
	    undo_buffer.CreateEntry(UndoFlags::DELETE_TUPLE, sizeof(DeleteInfo) + sizeof(row_t) * count));
	delete_info->vinfo = vinfo;
	delete_info->table = &table;
	delete_info->count = count;
	delete_info->base_row = base_row;
	memcpy(delete_info->rows, rows, sizeof(row_t) * count);
}

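The code above shows that we assign the current transaction id to the corresponding elements of the deleted array and add a matching Entry to the UndoBuffer, i.e. write the deleted row numbers into the UndoBuffer.

As before, committing the transaction consists of three steps

storage commit

Storage commit was covered under Insert. Note that when scanning Local Storage we skip deleted data, so deleted data is never merged into the table.
UndoBuffer Commit
Flushing the WAL to disk

Exactly the same as for Insert.

Now let's look at UndoBuffer Commit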

case UndoFlags::DELETE_TUPLE: {
		// deletion:
		auto info = reinterpret_cast<DeleteInfo *>(data);

		// write delete info into wal log
		if (HAS_LOG && !info->table->info->IsTemporary()) {
			WriteDelete(*info);
		}

		// mark the tuples as committed
		info->vinfo->CommitDelete(commit_id, info->rows, info->count);
		break;
}

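As you can see, this is almost identical to Insert

Write the deleted row numbers to the LOG.
Change the corresponding version info in the table from the transaction id to the commit id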
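The life cycle of a delete marker (mark with the transaction id, then swap in the commit id) can be sketched as follows. This is a toy model written for this post, not DuckDB's ChunkVectorInfo: the `DeleteMarkers` class, its method names, and the `NOT_DELETED` constant are hypothetical, and a conflict is reported with a return value where the excerpt above throws.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using transaction_t = uint64_t;

// Hypothetical marker meaning "nobody has deleted this row".
constexpr transaction_t NOT_DELETED = std::numeric_limits<transaction_t>::max();

class DeleteMarkers {
public:
    explicit DeleteMarkers(std::size_t rows) : deleted_(rows, NOT_DELETED) {}

    // Mark a row as deleted by an uncommitted transaction.
    // Returns false if another transaction already deleted it (a conflict).
    bool mark(std::size_t row, transaction_t transaction_id) {
        assert(row < deleted_.size());
        if (deleted_[row] == transaction_id) return true;  // already deleted by us
        if (deleted_[row] != NOT_DELETED) return false;    // conflict
        deleted_[row] = transaction_id;
        return true;
    }

    // On commit, replace the transaction id with the commit id.
    void commit(std::size_t row, transaction_t commit_id) {
        assert(row < deleted_.size());
        deleted_[row] = commit_id;
    }

    // A row is hidden from a reader if the reader deleted it itself,
    // or if the delete was committed before the reader started.
    bool visible(std::size_t row, transaction_t start_time, transaction_t transaction_id) const {
        assert(row < deleted_.size());
        const transaction_t d = deleted_[row];
        return !(d == transaction_id || d < start_time);
    }

private:
    std::vector<transaction_t> deleted_;
};
```

Because uncommitted transaction ids are far larger than any start time, a row deleted by an uncommitted transaction stays visible to everyone else; once the marker becomes a commit id, only readers that started later stop seeing the row.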

Update

The entry point for Update is PhysicalUpdate::Sink.

SinkResultType PhysicalUpdate::Sink(ExecutionContext &context, DataChunk &chunk, OperatorSinkInput &input) const {
	//skip....
		table.Update(tableref, context.client, row_ids, columns, update_chunk);
	// skip...

}

void DataTable::Update(TableCatalogEntry &table, ClientContext &context, Vector &row_ids,
                       const vector<PhysicalIndex> &column_ids, DataChunk &updates) {
	// skip...
	auto ids = FlatVector::GetData<row_t>(row_ids);
	auto first_id = FlatVector::GetValue<row_t>(row_ids, 0);
	if (first_id >= MAX_ROW_ID) {
		// update is in transaction-local storage: push update into local storage
		auto &local_storage = LocalStorage::Get(context, db);
		local_storage.Update(*this, row_ids, column_ids, updates);
		return;
	}

	// update is in the row groups
	// we need to figure out for each id to which row group it belongs
	// usually all (or many) ids belong to the same row group
	// we iterate over the ids and check for every id if it belongs to the same row group as their predecessor
	row_groups->Update(transaction, ids, column_ids, updates);
}

As with delete, the row id tells us whether the update targets local storage or the table. Let's look at the Update logic itself.

// RowGroup Update
void RowGroup::Update(TransactionData transaction, DataChunk &update_chunk, row_t *ids, idx_t offset, idx_t count,
                      const vector<PhysicalIndex> &column_ids) {
	for (idx_t i = 0; i < column_ids.size(); i++) {
		auto column = column_ids[i];
		auto &col_data = GetColumn(column.index);

		if (offset > 0) {
			Vector sliced_vector(update_chunk.data[i], offset, offset + count);
			sliced_vector.Flatten(count);
			col_data.Update(transaction, column.index, sliced_vector, ids + offset, count);
		} else {
			col_data.Update(transaction, column.index, update_chunk.data[i], ids, count);
		}
	}
}
// Column Update
void ColumnData::Update(TransactionData transaction, idx_t column_index, Vector &update_vector, row_t *row_ids,
                        idx_t update_count) {
	lock_guard<mutex> update_guard(update_lock);
	if (!updates) {
		updates = make_uniq<UpdateSegment>(*this);
	}
	Vector base_vector(type);
	ColumnScanState state;
	auto fetch_count = Fetch(state, row_ids[0], base_vector);

	base_vector.Flatten(fetch_count);
	updates->Update(transaction, column_index, update_vector, row_ids, update_count, base_vector);
}

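From the code above, we again first find the RowGroup to update, then the ColumnData to update. Every ColumnData has an UpdateSegment that holds the data's historical versions, and the modification flow matches the MVCC scheme described earlier.

// @brief Update the segment with the given transaction data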
// @param transaction The transaction data
// @param column_index The index of the column to update
// @param update The vector containing the update data
// @param ids The row ids to update
// @param count The amount of ids to update
// @param base_data The original data of the column
void UpdateSegment::Update(TransactionData transaction, idx_t column_index, Vector &update, row_t *ids, idx_t count,
                           Vector &base_data) {
	// obtain an exclusive lock
	auto write_lock = lock.GetExclusiveLock();

	update.Flatten(count);
	// skip....
	if (root->info[vector_index]) {
		// there is already a version here, check if there are any conflicts and search for the node that belongs to
		// this transaction in the version chain
		auto base_info = root->info[vector_index]->info.get();
		
		auto node = base_info->next;
		while (node) {
			if (node->version_number == transaction.transaction_id) {
				// it has! use this node
				break;
			}
			node = node->next;
		}
		if (!node) {
			// this transaction has no node in the chain yet: create one
			// skip... (allocation of the new node)
			node->segment = this;
			node->vector_index = vector_index;
			node->N = 0;
			node->column_index = column_index;

			// insert the new node into the chain
			node->next = base_info->next;
			if (node->next) {
				node->next->prev = node;
			}
			node->prev = base_info;
			base_info->next = transaction.transaction ? node : nullptr;
		}
		// now perform the merge: node is this transaction's entry in the chain
		// (either found above or just created), so merge the update into it
		merge_update_function(base_info, base_data, node, update, ids, count, sel);
	} else {
		// there is no version info yet: create the top level update info and fill it with the updates
		auto result = make_uniq<UpdateNodeData>();

		result->info = make_uniq<UpdateInfo>();
		result->tuples = make_unsafe_uniq_array<sel_t>(STANDARD_VECTOR_SIZE);
		result->tuple_data = make_unsafe_uniq_array<data_t>(STANDARD_VECTOR_SIZE * type_size);
		result->info->tuples = result->tuples.get();
		result->info->tuple_data = result->tuple_data.get();
		result->info->version_number = TRANSACTION_ID_START - 1;
		result->info->column_index = column_index;
		InitializeUpdateInfo(*result->info, ids, sel, count, vector_index, vector_offset);
		// skip...
		InitializeUpdateInfo(*transaction_node, ids, sel, count, vector_index, vector_offset);

		// we write the updates in the update node data, and write the updates in the info
		initialize_update_function(transaction_node, base_data, result->info.get(), update, sel);

		result->info->next = transaction.transaction ? transaction_node : nullptr;
		result->info->prev = nullptr;
		transaction_node->next = nullptr;
		transaction_node->prev = result->info.get();
		transaction_node->column_index = column_index;

		transaction_node->Verify();
		result->info->Verify();

		root->info[vector_index] = std::move(result);
	}
}

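The code is long, but it really does one thing: it saves the pre-update data as an UndoBuffer entry and writes it to the UndoBuffer, then modifies the data in place (that is, base_info is updated), and finally links the entry in as base_info's next.

Finally, just as before, committing the transaction takes three steps:

storage commit
UndoBuffer Commit
Flush the WAL to disk

Only the UndoBuffer Commit differs: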

case UndoFlags::UPDATE_TUPLE: {
		// update:
		auto info = reinterpret_cast<UpdateInfo *>(data);
		if (HAS_LOG && !info->segment->column_data.GetTableInfo().IsTemporary()) {
			WriteUpdate(*info);
		}
		info->version_number = commit_id;
		break;
}

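As before, which columns changed is written to the WAL. Then the version number of the UpdateInfo is changed from the transaction id to the commit id, which marks the commit as successful.

Scan

Finally, let's look at Scan. With the groundwork above, Scan is relatively easy. The entry point of Scan is PhysicalTableScan::GetData.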

SourceResultType PhysicalTableScan::GetData(ExecutionContext &context, DataChunk &chunk,
                                            OperatorSourceInput &input) const {
	D_ASSERT(!column_ids.empty());
	auto &gstate = input.global_state.Cast<TableScanGlobalSourceState>();
	auto &state = input.local_state.Cast<TableScanLocalSourceState>();

	TableFunctionInput data(bind_data.get(), state.local_state.get(), gstate.global_state.get());
	function.function(context.client, data, chunk);

	return chunk.size() == 0 ? SourceResultType::FINISHED : SourceResultType::HAVE_MORE_OUTPUT;
}

static void TableScanFunc(ClientContext &context, TableFunctionInput &data_p, DataChunk &output) {
	// skip...
	do {
		if(/*skip....*/) {
		} else {
			// scan!!
			storage.Scan(transaction, output, state.scan_state);
		}
		// skip...
	} while (true);
}

void DataTable::Scan(DuckTransaction &transaction, DataChunk &result, TableScanState &state) {
	// scan the persistent segments
	// table_state covers the persistent data
	if (state.table_state.Scan(transaction, result)) {
		D_ASSERT(result.size() > 0);
		return;
	}

	// scan the transaction-local segments

	// this was added to the local storage
	auto &local_storage = LocalStorage::Get(transaction);
	local_storage.Scan(state.local_state, state.GetColumnIds(), result);
}

From the code we can see that Scan first scans the table and then the local storage. The table itself is scanned one RowGroup at a time, so we mainly look at how a RowGroup is scanned.

template <TableScanType TYPE>
void RowGroup::TemplatedScan(TransactionData transaction, CollectionScanState &state, DataChunk &result) {

	auto table_filters = state.GetFilters();
	const auto &column_ids = state.GetColumnIds();
	auto adaptive_filter = state.GetAdaptiveFilter();
	while (true) {
		
		idx_t current_row = state.vector_index * STANDARD_VECTOR_SIZE;
		// each time scan entire vector, unless remaining less than STANDARD_VECTOR_SIZE
		auto max_count = MinValue<idx_t>(STANDARD_VECTOR_SIZE, state.max_row_group_row - current_row);

		// second, scan the version chunk manager to figure out which tuples to load for this transaction
		idx_t count;
		SelectionVector valid_sel(STANDARD_VECTOR_SIZE);
		if (TYPE == TableScanType::TABLE_SCAN_REGULAR) {
			// get what is needed to scan in this vector
			// may be it's deleted by this transaction or inserted by other transaction
			count = state.row_group->GetSelVector(transaction, state.vector_index, valid_sel, max_count);
			if (count == 0) {
				// nothing to scan for this vector, skip the entire vector
				// increase state.vector_idx, and make every column skip ${count} vector data
				NextVector(state);
				continue;
			}
		} else {
			count = max_count;
		}
		// skip...
	}
}

Because the code is long, we go through it piece by piece. The most important line in the code above is

state.row_group->GetSelVector(transaction, state.vector_index, valid_sel, max_count);

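This call determines which rows of the current vector in this RowGroup are visible to our transaction: some rows may have been added by other transactions and should be invisible to us. We decide this by looking at the insert id and the delete id.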


	static bool UseInsertedVersion(transaction_t start_time, transaction_t transaction_id, transaction_t id) {
		return id < start_time || id == transaction_id;
	}

	static bool UseDeletedVersion(transaction_t start_time, transaction_t transaction_id, transaction_t id) {
		return !UseInsertedVersion(start_time, transaction_id, id);
	}

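For an insert: if it was committed before our start time, or it was inserted by this very transaction, the row is visible. For a delete: if it was committed at or after our start time (or is not committed yet) and it was not made by this transaction, then the delete does not apply to us, so the row is still visible.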
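To make the two rules concrete, here is a minimal sketch that combines them into a single visibility check. The two `Use...Version` functions are copied from the excerpt above; `RowIsVisible` and `kNotDeleted` are hypothetical names introduced only for this illustration and are not DuckDB's API. It assumes, as the MVCC description does, that the ids of running transactions are larger than any commit id.

```cpp
#include <cassert>
#include <cstdint>

using transaction_t = uint64_t;

// Hypothetical marker for "this row has never been deleted" (an assumption of this sketch).
constexpr transaction_t kNotDeleted = UINT64_MAX;

static bool UseInsertedVersion(transaction_t start_time, transaction_t transaction_id, transaction_t id) {
    return id < start_time || id == transaction_id;
}

static bool UseDeletedVersion(transaction_t start_time, transaction_t transaction_id, transaction_t id) {
    return !UseInsertedVersion(start_time, transaction_id, id);
}

// A row is visible when its insert is visible to us and its delete is not.
static bool RowIsVisible(transaction_t start_time, transaction_t transaction_id,
                         transaction_t insert_id, transaction_t delete_id) {
    assert(transaction_id != kNotDeleted); // the marker must never collide with a real id
    return UseInsertedVersion(start_time, transaction_id, insert_id) &&
           UseDeletedVersion(start_time, transaction_id, delete_id);
}
```

With a start time of 10 and a transaction id of 1000, a row inserted at commit 5 is visible, a row inserted at commit 12 is not, and a row deleted at commit 12 is still visible because that delete happened after we started.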

Once we know which tuples are visible, we can go ahead and read the data.

if (count == max_count && !table_filters) {
	// scan all vectors completely: full scan without deletions or table filters
	for (idx_t i = 0; i < column_ids.size(); i++) {
		const auto &column = column_ids[i];
		if (column == COLUMN_IDENTIFIER_ROW_ID) {
			// scan row id
			D_ASSERT(result.data[i].GetType().InternalType() == ROW_TYPE);
			result.data[i].Sequence(this->start + current_row, 1, count);
		} else {
			auto &col_data = GetColumn(column);
			if (TYPE != TableScanType::TABLE_SCAN_REGULAR) {
				col_data.ScanCommitted(state.vector_index, state.column_scans[i], result.data[i], ALLOW_UPDATES);
			} else {
				col_data.Scan(transaction, state.vector_index, state.column_scans[i], result.data[i]);
			}
		}
	}
}

If every row is visible and there is no filter, we simply read each column directly.

template <bool SCAN_COMMITTED, bool ALLOW_UPDATES>
idx_t ColumnData::ScanVector(TransactionData transaction, idx_t vector_index, ColumnScanState &state, Vector &result) {
	// we have got data in the table into the result
	// the total count in this result is scan count
	auto scan_count = ScanVector(state, result, STANDARD_VECTOR_SIZE);

	lock_guard<mutex> update_guard(update_lock);
	if (updates) {
		if (!ALLOW_UPDATES && updates->HasUncommittedUpdates(vector_index)) {
			throw TransactionException("Cannot create index with outstanding updates");
		}
		result.Flatten(scan_count);
		if (SCAN_COMMITTED) {
			updates->FetchCommitted(vector_index, result);
		} else {
			updates->FetchUpdates(transaction, vector_index, result);
		}
	}
	return scan_count;
}

// MVCC read
template <class T>
static void UpdatesForTransaction(UpdateInfo *current, transaction_t start_time, transaction_t transaction_id,
                                  T &&callback) {
	while (current) {
		if (current->version_number > start_time && current->version_number != transaction_id) {
			// these tuples were either committed AFTER this transaction started or are not committed yet, use
			// tuples stored in this version
			// update the corresponding data
			callback(current);
		}
		current = current->next;
	}
}

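The code above first reads the column's base data and then checks whether the column has any updates. If it does, it applies them according to the MVCC scheme described earlier.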
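The read path can be sketched as follows: start from the in-place (newest) data and, for every entry in the version chain that this transaction must not see, restore the old values stored in that entry. This is a simplified illustration, not DuckDB's implementation: `UndoEntry` and `ReadAsOf` are hypothetical names, the column holds plain `int` values, and the chain is assumed to be ordered from newest to oldest (new nodes are linked in right after the base node, as in the update code).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using transaction_t = uint64_t;

// Simplified stand-in for one undo entry in the version chain.
struct UndoEntry {
    transaction_t version_number;                   // commit id or transaction id of the writer
    std::vector<std::pair<size_t, int>> old_values; // (row index, value before that writer's update)
    const UndoEntry *next;                          // next (older) entry, or nullptr
};

// Returns the column as seen by a transaction with the given start time and id.
std::vector<int> ReadAsOf(std::vector<int> base, const UndoEntry *chain,
                          transaction_t start_time, transaction_t transaction_id) {
    for (const UndoEntry *cur = chain; cur != nullptr; cur = cur->next) {
        // Same condition as UpdatesForTransaction: committed after we started, or
        // still uncommitted, and not written by us.
        if (cur->version_number > start_time && cur->version_number != transaction_id) {
            for (const auto &entry : cur->old_values) {
                assert(entry.first < base.size());
                base[entry.first] = entry.second; // roll back to the older value
            }
        }
    }
    return base;
}
```

For example, if row 1 was changed from 2 to 20 by a transaction that committed at time 15, a reader that started at time 10 still sees 2, while a reader that started at time 20 sees 20.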

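If there is a filter, the rows are filtered by the filter conditions first, and the remaining columns are then fetched for the surviving rows in the same way as described above.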

if (table_filters) {
	D_ASSERT(adaptive_filter);
	D_ASSERT(ALLOW_UPDATES);
	for (idx_t i = 0; i < table_filters->filters.size(); i++) {
		auto tf_idx = adaptive_filter->permutation[i];
		auto col_idx = column_ids[tf_idx];
		auto &col_data = GetColumn(col_idx);
		col_data.Select(transaction, state.vector_index, state.column_scans[tf_idx], result.data[tf_idx],sel, approved_tuple_count, *table_filters->filters[tf_idx]);
	}
	for (auto &table_filter : table_filters->filters) {
		result.data[table_filter.first].Slice(sel, approved_tuple_count);
	}
}

//! Now we use the selection vector to fetch data for the other columns.
for (idx_t i = 0; i < column_ids.size(); i++) {
	// we fetch column data for all columns that were not used for filtered
	// skip...
	col_data.FilterScanCommitted(state.vector_index, state.column_scans[i], result.data[i], sel,approved_tuple_count, ALLOW_UPDATES);
}

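Finally, all the data that was read is returned.

Summary

MVCC and the insert, delete, update, and scan paths cover too much ground to describe exhaustively, so this post only sketches the overall outline. For every detail you still need to read the source code. If anything is unclear or poorly explained, feel free to leave a comment.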

DuckDB -- ART Index

tang-hi — Fri, 21 Jul 2023 00:00:00 GMT

Unlike many other databases, DuckDB does not use a B+ tree as its main index structure. Instead it uses ART (Adaptive Radix Tree). This post introduces that index.

ART(Adaptive Radix Tree)

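The ART index was proposed by Viktor Leis, Alfons Kemper, and Thomas Neumann. Its main difference from a B+ tree is that a B+ tree is disk-oriented while ART is memory-oriented, that is, an ART index has to be loaded into memory in full. DuckDB chose this index for the following reasons:

As memory keeps getting larger and cheaper, we can use a purely in-memory index, avoid disk I/O, and improve performance.
An ART index saves a great deal of space.
An ART index supports range queries.
An ART index performs well.

The rest of this post first introduces the ART data structure and then walks through DuckDB's code to show how ART is implemented.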

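Data Structure

Before discussing ART, let's look at the trie. (If you are not familiar with tries, see Trie.)

The advantage of a trie for lookups is that lookup time depends only on the length of the longest string and not on the number of strings stored, which is excellent when the data set is very large. Its drawback is wasted space: every inner node has to hold a fixed number of pointers even if it has very few children.

Take the root node in the figure: although it has only three children, it still has to keep null pointers for a, c, e, and so on. This wastes a lot of space. In addition, a trie only supports storing strings.

ART builds on the trie, fixing its drawbacks while keeping its advantages. Let's now look at the ART index.

For an index, we want two properties:

Fast lookups
Small space footprint

But if we use a trie as an index (ART is a trie variant), we face a trade-off. The more children an inner node may have (more space), the lower the tree (faster lookups). The fewer children an inner node may have (less space), the taller the tree (slower lookups).

ART consumes 8 bits of the key at each inner node (so a node has up to 256 children), which is exactly one byte. This avoids memory alignment issues and strikes a good balance between space and speed. The number of key bits an inner node consumes is called its span.

Even so, with sparse data, giving every node 256 child slots still wastes space. To address this, ART further divides inner nodes into the following four types, which we introduce in turn.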

Node4
Node16
Node48
Node256

Node4

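<div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/node4.png"/> </div> As the figure shows, a Node4 has two parts: a key array and a child array. The key array stores part of the key (one byte of it), and the child array stores the pointers to the corresponding children. Note that to support range queries, the key array must be kept sorted.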

Node16

<div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/node16.png"/> </div> Node16 is almost identical to Node4; the only difference is that it has 16 slots instead of 4.

Node48

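<div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/node48.png"/> </div> Like the nodes above, a Node48 has a key array and a child array. The difference is that its key array has 256 entries, so instead of searching for the matching entry we can use the value of the key byte directly as the index of the key slot. Each key slot stores an index that points to the position of the corresponding child in the child array, so the child array only needs 48 entries.

A lookup is simply child_array[key_array[key]].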
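The two-level indirection of Node48 can be sketched like this. It is an illustrative toy rather than DuckDB's Node48: the names `Node48Sketch`, `kEmptySlot`, `Insert`, and `Find` are assumptions made for this example, and growing to Node256 when the node is full is left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Marker meaning "no child for this key byte" (valid slots are 0..47).
constexpr uint8_t kEmptySlot = 48;

struct Node48Sketch {
    std::array<uint8_t, 256> child_index;  // indexed directly by the key byte
    std::array<const void *, 48> children; // only 48 pointer slots are needed
    uint8_t count;

    Node48Sketch() : count(0) {
        child_index.fill(kEmptySlot);
        children.fill(nullptr);
    }

    // Returns false when the node is full; a real ART would grow to a Node256 here.
    bool Insert(uint8_t key_byte, const void *child) {
        if (child_index[key_byte] != kEmptySlot) {
            children[child_index[key_byte]] = child; // overwrite the existing entry
            return true;
        }
        if (count == 48) {
            return false;
        }
        child_index[key_byte] = count;
        children[count++] = child;
        return true;
    }

    // The lookup from the text: child_array[key_array[key]].
    const void *Find(uint8_t key_byte) const {
        uint8_t slot = child_index[key_byte];
        return slot == kEmptySlot ? nullptr : children[slot];
    }
};
```

The 256-entry index array costs one byte per entry, which is why this layout is much smaller than 256 full pointers as long as the node has at most 48 children.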

Node256

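<div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/node256.png"/> </div> Node256 is the original inner-node representation of a trie: it needs only one array, the array index is the key byte, and each entry stores the pointer to the child.

The node types can be converted into one another: when the number of children exceeds a node's capacity it grows to the next larger type, and when the number of children becomes too small for its capacity it shrinks to the next smaller type.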

Leaf

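A leaf in ART stores the value for a key. There are three ways to represent leaves:

A dedicated leaf node type that only stores the value
The same node types as inner nodes, except that the child array stores values instead of pointers
If the value is small enough to be packed together with a pointer using bit manipulation, it can be stored directly in the inner node.

DuckDB uses the first approach.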

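Optimizations

Having solved ART's space problem, we would like to further speed up lookups by reducing the height of the tree. The paper describes two techniques, but we can in fact get both with one simple change: give every node a prefix.

lazy expansion <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/lazy-expansion.png"/> </div> This optimization is simple: we only need to let a leaf store several bytes. A chain of nodes that each have a single child can then be stored in the leaf as a whole, which reduces the height of the tree.
path compression <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/path-compress.png"/> </div> This optimization is similar to lazy expansion: we only need to let an inner node store several bytes. If everything below an inner node shares a common prefix, we store it in the node's prefix, and the key array only distinguishes the part where the keys differ. This also effectively reduces the height of the tree.

If this is not clear yet, don't worry: we will go through DuckDB's code later, which should make it clearer.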

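Key Transformation

So far we have only discussed strings. To be a widely used index, ART has to support other data types as well. ART treats a key as a stream of bytes, so to support range searches the keys must have one property: the order of the binary representations must match the semantic order of the type. That is, $$ \text{memcmp}(binary(x), binary(y)) < 0 \iff \text{x} < \text{y} $$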

$$ \text{memcmp}(binary(x), binary(y)) = 0 \iff \text{x} = \text{y} $$

$$ \text{memcmp}(binary(x), binary(y)) > 0 \iff \text{x} > \text{y} $$

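So some types need to be transformed:

unsigned integers

No transformation is needed beyond storing the bytes most significant byte first (big-endian), so that memcmp compares the most significant byte first.
signed integers

Flip the sign bit.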

Floating Point Numbers

static inline uint32_t EncodeFloat(float x) {
	uint64_t buff;

	//! zero
	if (x == 0) {
		buff = 0;
		buff |= (1u << 31);
		return buff;
	}
	// nan
	if (Value::IsNan(x)) {
		return UINT_MAX;
	}
	//! infinity
	if (x > FLT_MAX) {
		return UINT_MAX - 1;
	}
	//! -infinity
	if (x < -FLT_MAX) {
		return 0;
	}
	buff = Load<uint32_t>(const_data_ptr_cast(&x));
	if ((buff & (1u << 31)) == 0) { //! +0 and positive numbers
		buff |= (1u << 31);
	} else {          //! negative numbers
		buff = ~buff; //! complement 1
	}

	return buff;
}

Character Strings

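The Unicode Collation Algorithm (UCA) already defines how to do this.
Null

We can set this value to be one bit wider than the maximum width of the type.
Compound Keys

Concatenate the encodings of the component types.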
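As a small worked example of the signed-integer rule, the sketch below encodes a 32-bit signed integer so that `memcmp` on the encoded bytes agrees with numeric order. `EncodeInt32` and `CompareEncoded` are hypothetical helper names for this illustration, not DuckDB functions. Besides flipping the sign bit, the sketch writes the bytes most significant first, which is what makes a byte-wise comparison meaningful.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Encode a signed 32-bit integer as 4 bytes whose memcmp order equals numeric order.
std::array<uint8_t, 4> EncodeInt32(int32_t value) {
    uint32_t bits = static_cast<uint32_t>(value); // two's complement bit pattern
    bits ^= 0x80000000u;                          // flip the sign bit
    // store the most significant byte first (big-endian)
    return {{static_cast<uint8_t>(bits >> 24), static_cast<uint8_t>(bits >> 16),
             static_cast<uint8_t>(bits >> 8), static_cast<uint8_t>(bits)}};
}

// Returns <0, 0 or >0, like memcmp, for the encoded forms of x and y.
int CompareEncoded(int32_t x, int32_t y) {
    std::array<uint8_t, 4> a = EncodeInt32(x);
    std::array<uint8_t, 4> b = EncodeInt32(y);
    return std::memcmp(a.data(), b.data(), a.size());
}
```

Without the flip, every negative number would have its top bit set and would compare greater than every positive number. After the flip, INT32_MIN encodes to all zero bytes and INT32_MAX to all 0xFF bytes.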

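Source Code Walkthrough

In this section we read DuckDB's source code to see how the ART index is implemented. The relevant code is in art.cpp and art.hpp. We focus on Insert and Find and leave the other functions to the reader.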

Insert

bool ART::Insert(Node &node, const ARTKey &key, idx_t depth, const row_t &row_id) {

	if (!node.IsSet()) {
		// node is currently empty, create a leaf here with the key
		Leaf::New(*this, node, key, depth, row_id);
		return true;
	}

	if (node.DecodeARTNodeType() == NType::LEAF) {

		// add a row ID to a leaf, if they have the same key
		auto &leaf = Leaf::Get(*this, node);
		auto mismatch_position = leaf.prefix.KeyMismatchPosition(*this, key, depth);

		// identical equal
		if (mismatch_position == leaf.prefix.count && depth + leaf.prefix.count == key.len) {
			return InsertToLeaf(node, row_id);
		}

		// example:
		// prefix : hello
		// key[depth] : heel;
		// mismatch_position = 2
		// replace leaf with Node4 and store both leaves in it
		auto old_node = node;
		auto &new_n4 = Node4::New(*this, node);

		// new prefix
		// new_n4's prefix is he
		new_n4.prefix.Initialize(*this, key, depth, mismatch_position);

		// old_node's prefix shrinks to "lo"; key_byte is 'l', the byte at the mismatch position
		auto key_byte = old_node.GetPrefix(*this).Reduce(*this, mismatch_position);

		// add child
		Node4::InsertChild(*this, node, key_byte, old_node);

		Node leaf_node;
		Leaf::New(*this, leaf_node, key, depth + mismatch_position + 1, row_id);
		// add child
		Node4::InsertChild(*this, node, key[depth + mismatch_position], leaf_node);

		return true;
	}

	// handle prefix of inner node
	auto &old_node_prefix = node.GetPrefix(*this);
	if (old_node_prefix.count) {

		auto mismatch_position = old_node_prefix.KeyMismatchPosition(*this, key, depth);
		if (mismatch_position != old_node_prefix.count) {

			// prefix differs, create new node
			auto old_node = node;
			auto &new_n4 = Node4::New(*this, node);
			new_n4.prefix.Initialize(*this, key, depth, mismatch_position);

			auto key_byte = old_node_prefix.Reduce(*this, mismatch_position);
			Node4::InsertChild(*this, node, key_byte, old_node);

			Node leaf_node;
			Leaf::New(*this, leaf_node, key, depth + mismatch_position + 1, row_id);
			Node4::InsertChild(*this, node, key[depth + mismatch_position], leaf_node);

			return true;
		}
		depth += node.GetPrefix(*this).count;
	}

	// recurse
	D_ASSERT(depth < key.len);
	auto child = node.GetChild(*this, key[depth]);
	if (child) {
		bool success = Insert(*child, key, depth + 1, row_id);
		node.ReplaceChild(*this, key[depth], *child);
		return success;
	}

	// insert at position
	Node leaf_node;
	Leaf::New(*this, leaf_node, key, depth + 1, row_id);
	Node::InsertChild(*this, node, key[depth], leaf_node);
	return true;
}

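The code is fairly long, so let's first go over the parameters:

node is the node we are currently inserting into.
key is the key to insert.
depth is how many bytes of the key have already been processed. For example, if key is hello and depth is 2, then he is already stored in the ancestors of node and the byte we are handling now is the first l.
row_id is the value associated with the key.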

bool ART::Insert(Node &node, const ARTKey &key, idx_t depth, const row_t &row_id) {

	if (!node.IsSet()) {
		// node is currently empty, create a leaf here with the key
		Leaf::New(*this, node, key, depth, row_id);
		return true;
	}
}

If the current node is empty, we simply make it a leaf and store row_id in it. Note that lazy expansion is used here: all remaining unprocessed bytes of the key are stored in the leaf.

bool ART::Insert(Node &node, const ARTKey &key, idx_t depth, const row_t &row_id) {

	// .... skip
	if (node.DecodeARTNodeType() == NType::LEAF) {

		// add a row ID to a leaf, if they have the same key
		auto &leaf = Leaf::Get(*this, node);
		auto mismatch_position = leaf.prefix.KeyMismatchPosition(*this, key, depth);

		// identical equal
		if (mismatch_position == leaf.prefix.count && depth + leaf.prefix.count == key.len) {
			return InsertToLeaf(node, row_id);
		}

		// example:
		// prefix : hello
		// key[depth] : heel;
		// mismatch_position = 2
		// replace leaf with Node4 and store both leaves in it
		auto old_node = node;
		auto &new_n4 = Node4::New(*this, node);

		// new prefix
		// new_n4's prefix is he
		new_n4.prefix.Initialize(*this, key, depth, mismatch_position);

		// old_node's prefix shrinks to "lo"; key_byte is 'l', the byte at the mismatch position
		auto key_byte = old_node.GetPrefix(*this).Reduce(*this, mismatch_position);

		// add child
		Node4::InsertChild(*this, node, key_byte, old_node);

		Node leaf_node;
		Leaf::New(*this, leaf_node, key, depth + mismatch_position + 1, row_id);
		// add child
		Node4::InsertChild(*this, node, key[depth + mismatch_position], leaf_node);

		return true;
	}
	
	//skip....
}

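If we hit a leaf and the key is identical, we can insert row_id directly into that leaf. Otherwise we have to turn the leaf into an inner node and attach the parts that differ as leaves of that inner node, as shown in the figure below. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/leaf-insert.png"/> </div>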
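The split point in the leaf case is the first position where the stored prefix and the new key disagree. A minimal sketch of that computation, using `std::string` instead of DuckDB's Prefix and ARTKey types (`MismatchPosition` is a hypothetical helper that only mirrors the role of `KeyMismatchPosition`):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of leading bytes of `prefix` that match `key` starting at `depth`.
size_t MismatchPosition(const std::string &prefix, const std::string &key, size_t depth) {
    size_t pos = 0;
    while (pos < prefix.size() && depth + pos < key.size() && prefix[pos] == key[depth + pos]) {
        pos++;
    }
    return pos;
}
```

For the example in the code comments, the leaf prefix "hello" and the key "heel" at depth 0 first differ at position 2. The new Node4 therefore takes the prefix "he", the old leaf hangs below it under the byte 'l' and keeps "lo", and the new leaf hangs under the byte 'e' and keeps "l".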

bool ART::Insert(Node &node, const ARTKey &key, idx_t depth, const row_t &row_id) {

	// skip ....
	// handle prefix of inner node
	auto &old_node_prefix = node.GetPrefix(*this);
	if (old_node_prefix.count) {

		auto mismatch_position = old_node_prefix.KeyMismatchPosition(*this, key, depth);
		if (mismatch_position != old_node_prefix.count) {

			// prefix differs, create new node
			auto old_node = node;
			auto &new_n4 = Node4::New(*this, node);
			new_n4.prefix.Initialize(*this, key, depth, mismatch_position);

			auto key_byte = old_node_prefix.Reduce(*this, mismatch_position);
			Node4::InsertChild(*this, node, key_byte, old_node);

			Node leaf_node;
			Leaf::New(*this, leaf_node, key, depth + mismatch_position + 1, row_id);
			Node4::InsertChild(*this, node, key[depth + mismatch_position], leaf_node);

			return true;
		}
		depth += node.GetPrefix(*this).count;
	}

	// recurse
	D_ASSERT(depth < key.len);
	auto child = node.GetChild(*this, key[depth]);
	if (child) {
		bool success = Insert(*child, key, depth + 1, row_id);
		node.ReplaceChild(*this, key[depth], *child);
		return success;
	}

	// insert at position
	Node leaf_node;
	Leaf::New(*this, leaf_node, key, depth + 1, row_id);
	Node::InsertChild(*this, node, key[depth], leaf_node);
	return true;
}

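For an inner node there are two cases:

The node's prefix matches completely, for example "hello" and "hellopxxx". Then we only need to find the child and insert into it, as shown in the figure below.

The prefix differs somewhere. Then we create a new node and insert both nodes as its children, as shown in the figure below.

As we can see, once inner nodes and leaves can store multiple bytes, the optimizations described above come for free.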

Find

Node ART::Lookup(Node node, const ARTKey &key, idx_t depth) {

	while (node.IsSet()) {
		if (node.DecodeARTNodeType() == NType::LEAF) {
			auto &leaf = Leaf::Get(*this, node);

			// check if leaf contains key
			for (idx_t i = 0; i < leaf.prefix.count; i++) {
				if (leaf.prefix.GetByte(*this, i) != key[i + depth]) {
					return Node();
				}
			}
			return node;
		}
		auto &node_prefix = node.GetPrefix(*this);
		if (node_prefix.count) {
			for (idx_t pos = 0; pos < node_prefix.count; pos++) {
				if (key[depth + pos] != node_prefix.GetByte(*this, pos)) {
					// prefix mismatch, subtree of node does not contain key
					return Node();
				}
			}
			depth += node_prefix.count;
		}

		// prefix matches key, but no child at byte, does not contain key
		auto child = node.GetChild(*this, key[depth]);
		if (!child) {
			return Node();
		}

		// recurse into child
		node = *child;
		D_ASSERT(node.IsSet());
		depth++;
	}

	return Node();
}

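The lookup code is comparatively simple:

If we reach a leaf, check whether its prefix matches. If it does not, the key does not exist; if it does, return the leaf.
If we reach an inner node, check whether its prefix matches. If it does not, the key does not exist; if it does, continue the search in the corresponding child.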

Last

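This post introduced DuckDB's ART index. An ART tree is taller than a B+ tree, so for a disk-oriented index the B+ tree has the advantage. For an in-memory index, however, ART is more compact, its asymptotic lookup cost depends only on the key length, and it is possibly more cache friendly as well, since its nodes are smaller than B+ tree nodes and more of them fit in the cache. According to the experiments in the paper, it performs better than a B+ tree. Compared with a hash table, it supports range queries. For these reasons DuckDB uses ART as its main index.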

DuckDB -- Table Storage Format

tang-hi — Wed, 19 Jul 2023 00:00:00 GMT

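This post describes how DuckDB stores its tables. It only covers the table layout; anything not needed to understand it is skipped or mentioned only briefly.

Prerequisites

Block Type

Unlike other databases, DuckDB stores all of its information in a single file. Within the file, storage is managed in blocks, which come in two kinds: MetaBlock and DataBlock.

A DataBlock is just a plain block.
A MetaBlock is a block list: its first 8 bytes hold next_block_id, so content that does not fit into one block can be stored as a chain of blocks.

Field Reader

When reading data from a block we sometimes use a Field Reader. It has nothing to do with table fields; it simply means that before reading the data you first read max_field_count and total_size.

max_field_count is the number of fields that follow.
total_size is the total number of bytes that follow.

Segment Tree

A segment can be thought of as a chunk of data, and segments are managed by a SegmentTree. Despite its name, a SegmentTree internally keeps its segments in a vector and uses binary search to find a given segment, which requires the segments to be stored in order. Another feature of the SegmentTree is lazy loading: it does not read all the segments it manages up front, but only loads a segment from disk when it is needed.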
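The vector-plus-binary-search idea can be sketched as follows. This is an illustration under simplifying assumptions, not DuckDB's SegmentTree: `SegmentSketch` and `FindSegment` are hypothetical names, all segments are already loaded, and lazy loading is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A segment covers the rows [start, start + count).
struct SegmentSketch {
    uint64_t start;
    uint64_t count;
};

// Returns the index of the segment containing `row`, or -1 if no segment contains it.
// The vector must be sorted by `start`.
int64_t FindSegment(const std::vector<SegmentSketch> &segments, uint64_t row) {
    // first segment whose start is greater than row
    auto it = std::upper_bound(segments.begin(), segments.end(), row,
                               [](uint64_t r, const SegmentSketch &s) { return r < s.start; });
    if (it == segments.begin()) {
        return -1; // row lies before the first segment
    }
    --it; // last segment whose start is <= row
    if (row >= it->start + it->count) {
        return -1; // row lies past the end of that segment
    }
    return it - segments.begin();
}
```

Because the segments are sorted by their start row, one binary search is enough. With lazy loading, a lookup past the last loaded segment would first read further segments from disk and then repeat the search.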

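File Structure

In this section we look at the structure of a DuckDB file. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/header.png"/> </div>

The figure shows that DuckDB has three headers. They do not affect how tables are stored, so we only describe them briefly.

MainHeader
1. checksum: the checksum
2. magic bytes: identify the file as a DuckDB file
3. version number: the version number
4. flags: indicate whether the database is readable and writable
DataBaseHeader
1. iteration: the iteration count
2. meta block: the block id of the first block that stores the data
3. free list: blocks that can be reused
4. block count: the total number of blocks

Next comes the data itself. It consists of a schema count followed by ${schema_count} schemas, and our tables are stored inside the schemas. (In DuckDB a schema can be thought of as a database.) <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/overview.png"/> </div> Looking at the layout of a schema, the first field is the schema name, followed by the number of entries of each type that the schema owns. We only touch on these types briefly; if you are interested, see the definitions on the official website. Here we only care about the table data.

Let's continue with the structure of a table.

The figure shows that the first three fields of a table are catalog name, schema name, and table name. Together they identify which table of which database (schema) of which file (catalog) this is. The constraints field records the table's constraints, such as Not Null and Unique. We will not go into that part in this post and focus on the Columns and table data fields.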

Columns

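This field stores the definitions of the table's columns. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/column-define.png"/> </div> The first field stores the number of columns and is followed by the definition of each column. The fields of a column definition are:

column name: the name of the column
column type: the type of the column
expression: an expression, since some columns are generated from expressions
table Column type: unlike column type, this does not describe the data type. It has only two values, STANDARD and GENERATED. (I am not entirely sure what this field is for; it probably indicates whether the column is generated.)
compression type: the compression method used for the column

Once we know the column types we know the entire structure of the table. What remains is the actual data and the index data, and those are reached through the table data field.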

table data

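Because the indexes and the actual table data are usually large, they are not stored here directly. Instead we store pointers (block-id, offset) to the actual data.

We explain the fields with the help of this figure:

table data block: a pointer to the actual table data
total rows: the number of rows in the table
index num: the number of indexes on the table
index: a pointer to an index

Next, let's look at how the data behind table data block is laid out. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/row-group.png"/> </div>

It starts with the metadata of a series of column data (the structure of a column data block is described later). The next two fields are easy to understand: the first stores the table statistics, the other the number of row groups. This raises two questions. What is a row group? And why does the layout not follow the earlier pattern <data-count, data, data, ... data> but store only a single row group pointer? What happens when row group count is greater than 1?

row group

OLAP systems generally use columnar storage, while OLTP systems use row storage. Columnar storage is better for reading and computation, but row storage is better under frequent inserts, updates, and deletes. DuckDB therefore compromises: tuples are divided into groups, and within a group the data is stored by column. Currently every 122880 rows form one group.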
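As a bit of arithmetic on that number, the sketch below maps a row number to a row group and to a vector inside it. It assumes a vector size of 2048 (the default STANDARD_VECTOR_SIZE, which is not stated in this post) and assumes every row group before the target one is completely full; `Locate` and `RowLocation` are hypothetical names for this illustration only.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kRowGroupSize = 122880; // rows per row group, from the text
constexpr uint64_t kVectorSize = 2048;     // assumption: default STANDARD_VECTOR_SIZE

static_assert(kRowGroupSize % kVectorSize == 0, "a row group holds a whole number of vectors");

struct RowLocation {
    uint64_t row_group;        // which row group the row falls into
    uint64_t vector_index;     // which vector inside that row group
    uint64_t offset_in_vector; // position inside that vector
};

// Assumes all earlier row groups are completely full.
RowLocation Locate(uint64_t row_id) {
    uint64_t in_group = row_id % kRowGroupSize;
    return {row_id / kRowGroupSize, in_group / kVectorSize, in_group % kVectorSize};
}
```

Under these assumptions a row group holds exactly 60 vectors (122880 / 2048), which is why scans can walk a row group one whole vector at a time.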

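Why is there only one row group pointer?

Row groups are always stored in row-number order, and the blocks they are stored in are meta blocks, so they can be managed by a SegmentTree. The following row groups can then be loaded lazily by simply reading onward when needed, so only the block id of the first block has to be stored here.

Next, the layout of a row group:

row start: the first row number of the row group
tuple count: the number of rows in the row group
column pointers: because a row group is stored by column, these pointers point to where each column is actually stored
versions: I have not looked at this field closely; it should be related to MVCC.

Let's continue with the layout of a column data block. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/column-data-block.png"/> </div>

Surprisingly, this is still not where the data lives; it again holds pointers. Why? Because the actual column data is stored in pure blocks (plain DataBlocks), which, unlike meta blocks, do not form a block list, and every block has a fixed size. The data therefore has to be stored block by block, and the data pointers here play the role of the links of a block list.

As usual, here is what each field means:

row start: the first row number of this piece of data
tuple count: the number of rows stored
block id: the id of the block holding the actual data
offset: the offset of the actual data within that block
compress: the compression method used for the data
stat: the statistics of this piece of data

Now we arrive at the blocks that hold the column data. The format depends on the compression method. I briefly introduce a few here; if you are interested, you can look at the others yourself.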

Const Column

A const column is one in which all values are the same, so we do not need to store any data at all; we just take the min value from the statistics.

uncompress column

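An uncompress column is stored without compression. For fixed-size types such as uint32 and uint8 we simply read the values one after another. For variable-length types such as string, however, a different layout is used, namely Dictionary Compress (so much for no compression!).

For strings, the first two fields give the location of the dict: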

dict_start = dict_end - dict_size
dict_end = dict_end
dict_size = dict_size

Here the dict can be viewed as a string pool, an offset marks where a string starts, and offsets[i] - offsets[i-1] is its length. That is a bit abstract, so let's look at an example. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/string-compress.png"/> </div>

In this example there are three strings: foo, bar, and duckdb.

The three strings are stored in the dict in reverse order. Each offset is relative to dict_end, which lets us locate the first byte of each string.

foo
head = dict_end - offset = dict_end - 3
length = 3 - 0 = 3
bar
head = dict_end - offset = dict_end - 6
length = 6 - 3 = 3
duckdb
head = dict_end - offset = dict_end - 12
length = 12 - 6 = 6
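The same arithmetic as a small sketch: given the end of the dictionary and the offsets array, string i starts at dict_end - offsets[i] and has length offsets[i] - offsets[i-1]. `FetchString` is a hypothetical helper, the block is modeled as a `std::string`, and the negative-offset case for overlong strings is deliberately not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Strings are packed back to front: string i occupies
// [dict_end - offsets[i], dict_end - offsets[i-1]), with offsets[-1] taken as 0.
std::string FetchString(const std::string &block, size_t dict_end,
                        const std::vector<int32_t> &offsets, size_t i) {
    assert(i < offsets.size());
    int32_t prev = (i == 0) ? 0 : offsets[i - 1];
    int32_t cur = offsets[i];
    assert(prev >= 0 && cur >= prev); // negative offsets (overflow strings) are out of scope here
    assert(static_cast<size_t>(cur) <= dict_end && dict_end <= block.size());
    size_t head = dict_end - static_cast<size_t>(cur);
    size_t length = static_cast<size_t>(cur - prev);
    return block.substr(head, length);
}
```

With a block whose last 12 bytes are "duckdbbarfoo" and offsets {3, 6, 12}, index 0 yields "foo", index 1 yields "bar", and index 2 yields "duckdb", matching the worked example.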

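Remember that the blocks holding column data are all pure blocks. What if a string is too long? In that case the offset is stored as a negative number to mark the string as long, the corresponding position in the dict stores (block id, offset), and the string is read from another block.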

RLE column and bitpacking

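An RLE column is relatively simple: the values are stored first, followed by the number of times each value occurs. The two parts are separated at the RLE count offset.

The bitpack column is left to interested readers.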
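For the RLE layout described above, decoding just repeats each value by its count. The sketch below models the two parts as separate arrays instead of one byte buffer split at the RLE count offset; `DecodeRLE`, the `int32_t` value type, and the `uint16_t` count type are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// values[i] occurs counts[i] times in a row.
std::vector<int32_t> DecodeRLE(const std::vector<int32_t> &values, const std::vector<uint16_t> &counts) {
    assert(values.size() == counts.size());
    std::vector<int32_t> out;
    for (size_t i = 0; i < values.size(); i++) {
        // append counts[i] copies of values[i]
        out.insert(out.end(), static_cast<size_t>(counts[i]), values[i]);
    }
    return out;
}
```

For example, the values {7, 9} with the counts {3, 2} decode to 7, 7, 7, 9, 9.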

Dictionary column

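If you understood how strings are stored in an uncompress column, the Dictionary column is easy to understand as well. The dict has the same meaning as before, the index Buffer holds the offsets mentioned earlier, and the bitpacking section stores, for each row, which entry of the index Buffer its value corresponds to. The stored value is obtained with dict.get(indexBuffer[bitpacking[i]]).

One more optimization is worth noting: during an actual scan the dict is decoded first, and if the scan then turns out to cover all the data, only the bitpacking section still has to be decoded.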

Last

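This post described how DuckDB stores tables. Compared with other databases, DuckDB uses a single file for the whole database. (I do not know whether that is good or bad, but it is positioned as a single-node database that does not aim for distributed operation, so it may be fine.) It also uses row groups and loads them lazily to improve performance, and columns support several compression formats.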

LevelDB(3) -- Compaction and Versions

tang-hi — Wed, 21 Jun 2023 00:00:00 GMT

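This post covers LevelDB's compaction and the version management that goes with it.

Compact

From the earlier posts we know that LevelDB first writes new key-value pairs to the in-memory MemTable, and once the MemTable reaches a threshold it is dumped to disk. For write performance, LevelDB does not update in place; it marks records as deleted instead. As the data grows, so does the amount of useless data (old records marked as deleted) and the number of files. We therefore need to merge several small files into a larger one, which removes the useless data and reduces the number of files, improving query performance.

That merging reduces space usage is easy to see, but why does having fewer files improve performance? First, the more files there are, the more files a query has to look at. Second, compaction guarantees that the keys of files within one level do not overlap, so only one file per level has to be searched, and it also keeps the key overlap between levels as small as possible.

Let's look at how LevelDB implements Compact.

When Compaction Is Triggered

LevelDB tries to trigger a Compact in three situations:

When the DB has just been opened
When data is written
When data is read

The first case needs little explanation: on open, LevelDB checks whether the DB can be tidied up.

In the second case, if the MemTable is found to have reached its threshold before a write, the current MemTable has to be dumped to disk (this is also a form of compaction). The details of the dump are in LevelDB(1) -- Write :: Format and Generation of ldb Files.

In the third case, if a single lookup has to consult more than one file, there is key overlap between levels (keys within one level do not overlap, except in level-0, so a lookup that touches several files must involve several levels). For this case we record how many times the file has been looked up this way, and once a threshold is reached we try to compact it, so that later lookups may only need to search one file.

How is the lookup threshold of a file set? The code and its comments explain it well.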

      // We arrange to automatically compact this file after
      // a certain number of seeks.  Let's assume:
      //   (1) One seek costs 10ms
      //   (2) Writing or reading 1MB costs 10ms (100MB/s)
      //   (3) A compaction of 1MB does 25MB of IO:
      //         1MB read from this level
      //         10-12MB read from next level (boundaries may be misaligned)
      //         10-12MB written to next level
      // This implies that 25 seeks cost the same as the compaction
      // of 1MB of data.  I.e., one seek costs approximately the
      // same as the compaction of 40KB of data.  We are a little
      // conservative and allow approximately one seek for every 16KB
      // of data before triggering a compaction.
      f->allowed_seeks = static_cast<int>((f->file_size / 16384U));
      if (f->allowed_seeks < 100) f->allowed_seeks = 100;

      levels_[level].deleted_files.erase(f->number);
      levels_[level].added_files->insert(f);
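The arithmetic behind the comment: a seek costs about as much as compacting 40KB (10ms against 250ms for the 25MB of I/O needed per 1MB compacted), and the code conservatively budgets one seek per 16KB with a floor of 100. A minimal sketch of the same formula as a standalone function, where `AllowedSeeks` is a hypothetical name used only here:

```cpp
#include <cassert>
#include <cstdint>

// Same formula as the excerpt: one allowed seek per 16KB of file, but never fewer than 100.
int AllowedSeeks(uint64_t file_size) {
    int allowed = static_cast<int>(file_size / 16384U);
    if (allowed < 100) {
        allowed = 100;
    }
    return allowed;
}
```

A 2MB file gets 2097152 / 16384 = 128 allowed seeks, while a 1MB file would get only 64 and is therefore raised to the floor of 100.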

Implementation of Compaction

Let's look at the code first.

void DBImpl::BackgroundCall() {
  MutexLock l(&mutex_);
  // this flag was already set in DBImpl::MaybeScheduleCompaction
  assert(background_compaction_scheduled_);
  if (shutting_down_.Acquire_Load()) {
    // No more background work when shutting down.
  } else if (!bg_error_.ok()) {
    // No more background work after a background error.
  } else {
    // run the actual compaction
    BackgroundCompaction();
  }

  background_compaction_scheduled_ = false;

  // The previous compaction may have produced too many files in some level,
  // so schedule another compaction if it turns out to be needed.
  MaybeScheduleCompaction();
  background_work_finished_signal_.SignalAll();
}

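Here we can see that after a Compact we try to Compact again, because the previous Compact may have produced too many files in some level, which calls for another Compact.

The code below is the actual implementation of Compact.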

// this method is only called from DBImpl::BackgroundCall
void DBImpl::BackgroundCompaction() {
  // the caller must already hold mutex_ on entry
  mutex_.AssertHeld();

  // first compact the full memtable
  if (imm_ != nullptr) {
    CompactMemTable();
    return;
  }
    
  // if a compaction was triggered manually
  if (is_manual) {
    // ...
  } else {
    // otherwise pick the level to compact based on the statistics
    c = versions_->PickCompaction();
  }


  Status status;
  if (c == nullptr) {
    // nothing to compact
  } else if (!is_manual && c->IsTrivialMove()) {
    // no real compaction: just move the file from level to level+1
    assert(c->num_input_files(0) == 1);
    FileMetaData* f = c->input(0, 0);
    // remove the file from level
    c->edit()->DeleteFile(c->level(), f->number);
    // add the file to level+1
    c->edit()->AddFile(c->level() + 1, f->number, f->file_size,
                       f->smallest, f->largest);
    // apply the move
    status = versions_->LogAndApply(c->edit(), &mutex_);
    if (!status.ok()) {
      RecordBackgroundError(status);
    }
    VersionSet::LevelSummaryStorage tmp;
    Log(options_.info_log, "Moved #%lld to level-%d %lld bytes %s: %s\n",
        static_cast<unsigned long long>(f->number),
        c->level() + 1,
        static_cast<unsigned long long>(f->file_size),
        status.ToString().c_str(),
        versions_->LevelSummary(&tmp));
  } else {
    CompactionState* compact = new CompactionState(c);
    // do the compaction
    status = DoCompactionWork(compact);
    if (!status.ok()) {
      RecordBackgroundError(status);
    }
    // clean up the compaction state
    CleanupCompaction(compact);
    // release the input files used by the compaction
    c->ReleaseInputs();
    // delete obsolete files
    DeleteObsoleteFiles();
  }
  delete c;

  //....
}

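The code shows that Compact has two parts. If there is a MemTable waiting to be dumped to disk, it is compacted right away. I described that flow in detail in LevelDB(1) -- Write :: Format and Generation of ldb Files, so I will not repeat it here. In this post we focus on the second part, automatic Compact.

First we try to pick the level that needs to be compacted.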

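  Compaction* c;
  int level;

  const bool size_compaction = (current_->compaction_score_ >= 1);
  const bool seek_compaction = (current_->file_to_compact_ != nullptr);

  // We prefer compactions triggered by too much data in a level over
  // compactions triggered by too many seeks (FileMetaData->allowed_seeks).
  // This is done by checking the size first and the seek count second.

  // first check whether some level's size ratio has exceeded the limit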
  if (size_compaction) {
    level = current_->compaction_level_;
    assert(level >= 0);
    assert(level+1 < config::kNumLevels);
    c = new Compaction(options_, level);

    // find the first file in this level whose largest key is past compact_pointer_[level]
    for (size_t i = 0; i < current_->files_[level].size(); i++) {
      FileMetaData* f = current_->files_[level][i];
      if (compact_pointer_[level].empty() ||
          icmp_.Compare(f->largest.Encode(), compact_pointer_[level]) > 0) {
        // add this file to the level's set of files to compact
        c->inputs_[0].push_back(f);
        break;
      }
    }
    // If the set of files to compact for this level is empty (compact_pointer_[level]
    // lies past the last file of the level), wrap around to the beginning and add
    // the level's first file to the set.
    if (c->inputs_[0].empty()) {
      // Wrap-around to the beginning of the key space
      c->inputs_[0].push_back(current_->files_[level][0]);
    }
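  } else if (seek_compaction) { // next, check whether a file has been sought too often
    // (during Version::Get(), the first file that looked like it might contain the
    // target key but did not is recorded in the stats and later handled by
    // Version::UpdateStats()) and can therefore trigger a compaction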
    level = current_->file_to_compact_level_;
    c = new Compaction(options_, level);
    c->inputs_[0].push_back(current_->file_to_compact_);
  } else {
    return nullptr;
  }

  c->input_version_ = current_;
  c->input_version_->Ref();

  // level-0 files may overlap each other, so add all overlapping files to the set of files to compact
  if (level == 0) {
    InternalKey smallest, largest;
    GetRange(c->inputs_[0], &smallest, &largest);
    // Note that the next call will discard the file we placed in
    // c->inputs_[0] earlier and replace it with an overlapping set
    // which will include the picked file.
    // Note that the call below clears inputs_[0]. That is fine: the key range of
    // inputs_[0] was extracted beforehand, so the call picks the cleared file up again.
    current_->GetOverlappingInputs(0, &smallest, &largest, &c->inputs_[0]);
    assert(!c->inputs_[0].empty());
  }

  // Collect the level+1 files that overlap the chosen level inputs
  SetupOtherInputs(c);

  return c;

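Depending on what triggered the compaction, we use a different strategy to pick the input file:

Size-triggered: a level holds more data than its threshold. In this case we pick files round-robin: if the largest key of the previous compaction at this level was "A1", this time we pick the first file whose largest key is greater than "A1", wrapping around to the first file of the level when no such file is left.
Seek-triggered: a file has been probed too many times by queries. In this case we simply compact that file.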
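
The round-robin choice can be sketched in isolation. This is a minimal sketch, not LevelDB's code: `FileRange` and `PickRoundRobin` are hypothetical names, keys are plain strings compared bytewise, and the files of the level are assumed to be sorted and non-overlapping.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for FileMetaData: only the key range.
struct FileRange {
  std::string smallest;
  std::string largest;
};

// Returns the index of the file to compact next. `compact_pointer` is the
// largest key covered by the previous compaction at this level, or empty
// if the level has not been compacted yet.
std::size_t PickRoundRobin(const std::vector<FileRange>& files,
                           const std::string& compact_pointer) {
  assert(!files.empty());
  for (std::size_t i = 0; i < files.size(); i++) {
    if (compact_pointer.empty() || files[i].largest > compact_pointer) {
      return i;  // first file that reaches past the previous compaction
    }
  }
  return 0;  // pointer is past the last file: wrap around to the beginning
}
```

With files covering [a,c], [d,f] and [g,k], a pointer of "c" selects the second file, and a pointer of "k" wraps around to the first one.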

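Note that Level-0 gets special treatment: the key ranges of Level-0 files can overlap, so every Level-0 file whose range overlaps the picked file is added to the compaction inputs.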

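Once we have the files to compact, we look in the next level (level+1) for the files that overlap them, and compact both sets together as one unit, as shown in the figure below.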

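After collecting the level+1 files, we recompute the overall key range. In the figure above, the Level-x range starts out as (50-700); once the Level-x+1 files are added, it grows to (50-720). In some cases we can then pull more files from Level-x into the compaction without changing the number of Level-x+1 files involved, which makes the compaction more efficient. The number of Level-x+1 files must stay the same to keep this expansion from repeating without bound, which could end up pulling every file of Level-x and Level-x+1 into the compaction.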

Once the input files are decided, the compaction is carried out in one of two ways.

TrivialMove

return (num_input_files(0) == 1 && num_input_files(1) == 0 &&
          TotalFileSize(grandparents_) <=
              MaxGrandParentOverlapBytes(vset->options_));

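If level has exactly one input file, level+1 has no file overlapping it, and the number of bytes in level+2 that overlap it does not exceed the limit, the file can simply be moved to level+1 instead of being rewritten. The level+2 (grandparent) check is needed because moving a file that overlaps a lot of grandparent data would create a level+1 file that overlaps heavily with its own parent level (level+2), and compacting that file later would be very expensive.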

DoCompactionWork

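// The compaction itself does one thing:
// iterate over the input files; if an entry (in level-L or level-(L+1)) is a
// deletion marker, check whether the key still exists in level-(L+2) or above.
// If it does not, drop the marker; otherwise write it to the merged output.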
Status DBImpl::DoCompactionWork(CompactionState* compact) {
  const uint64_t start_micros = env_->NowMicros();
  // Time spent compacting imm_, subtracted from this compaction's stats
  int64_t imm_micros = 0;

  Log(options_.info_log,  "Compacting %d@%d + %d@%d files",
      compact->compaction->num_input_files(0),
      compact->compaction->level(),
      compact->compaction->num_input_files(1),
      compact->compaction->level() + 1);

  assert(versions_->NumLevelFiles(compact->compaction->level()) > 0);
  assert(compact->builder == nullptr);
  assert(compact->outfile == nullptr);
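  // If there are no snapshots, use the latest sequence number as the smallest snapshot
  if (snapshots_.empty()) {
    compact->smallest_snapshot = versions_->LastSequence();
  } else {
    // Otherwise use the sequence number of the oldest live snapshot.
    // Old as it is, it has not been released, so its view must stay readable.
    compact->smallest_snapshot = snapshots_.oldest()->sequence_number();
  }

  // Release the mutex before doing the actual compaction work
  mutex_.Unlock();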

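  // Build one merging iterator over all input files
  Iterator* input = versions_->MakeInputIterator(compact->compaction);
  // Position the iterator at the first entry
  input->SeekToFirst();
  Status status;
  ParsedInternalKey ikey;
  // The next three variables handle the same user key appearing in several
  // files (when level-0 is involved) or in several levels. Two typical cases:
  // 1. level-0 files may overlap; for the same user key, the later update
  //    has the larger sequence number.
  // 2. the lower and the higher level may overlap (in fact they must, or
  //    no compaction would be needed); for the same user key, the entry
  //    seen first has the larger sequence number.
  std::string current_user_key;
  bool has_current_user_key = false;
  // If a user key appears more than once, this records the sequence number
  // of the internal key seen at its previous occurrence.
  SequenceNumber last_sequence_for_key = kMaxSequenceNumber;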
  for (; input->Valid() && !shutting_down_.Acquire_Load(); ) {
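    // Give priority to a full memtable that is waiting to be compacted
    if (has_imm_.NoBarrier_Load() != nullptr) {
      const uint64_t imm_start = env_->NowMicros();
      mutex_.Lock();
      if (imm_ != nullptr) {
        // Flush the immutable memtable to disk
        CompactMemTable();
        // Wake up MakeRoomForWrite() if necessary
        background_work_finished_signal_.SignalAll();
      }
      mutex_.Unlock();
      imm_micros += (env_->NowMicros() - imm_start);
    }

    // The key about to be processed
    Slice key = input->key();
    // If the output built so far, ending just before this key, already
    // overlaps too much data in level+2, finish the current output file here;
    // this key marks the right edge of that file.
    // This check is confusing on a first read: it only becomes true after
    // some iterations of the loop, so read the code below first and then
    // come back to it.
    if (compact->compaction->ShouldStopBefore(key) &&
        compact->builder != nullptr) {
      // Finish the current output file and write it to disk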
      status = FinishCompactionOutputFile(compact, input);
      if (!status.ok()) {
        break;
      }
    }

    // Handle key/value, add to state, etc.
    bool drop = false;
    // Parse the internal key
    if (!ParseInternalKey(key, &ikey)) {
      // Do not hide error keys
      current_user_key.clear();
      has_current_user_key = false;
      last_sequence_for_key = kMaxSequenceNumber;
    } else {
      // First time this user key shows up in the iteration: remember it
      if (!has_current_user_key ||
          user_comparator()->Compare(ikey.user_key,
                                     Slice(current_user_key)) != 0) {
        current_user_key.assign(ikey.user_key.data(), ikey.user_key.size());
        has_current_user_key = true;
        // Reset the sequence number recorded for this user key.
        // Since this is its first occurrence, use the largest possible
        // sequence number, which guarantees the newest entry is never dropped.
        last_sequence_for_key = kMaxSequenceNumber;
      }

      // The previous entry for this user key has sequence <= smallest_snapshot,
      // so the current entry, which is older, must be < smallest_snapshot.
      // Every snapshot already sees the newer entry, so this one can be dropped.
      if (last_sequence_for_key <= compact->smallest_snapshot) {
        // Hidden by a newer entry for same user key
        drop = true;    // rule (A)
      } else if (ikey.type == kTypeDeletion &&
                 ikey.sequence <= compact->smallest_snapshot &&
                 compact->compaction->IsBaseLevelForKey(ikey.user_key)) {
        // For this user key:
        // (1) there is no data in higher levels (grandparent level and above)
        // (2) data in lower levels will have larger sequence numbers
        // (3) data in the levels being compacted here (level and level+1) with
        //     smaller sequence numbers will be dropped in the next few
        //     iterations of this loop (by rule (A) above).
        //
        // Therefore this deletion marker is obsolete and can be dropped.
        drop = true;
      }
      // With no snapshot, only the newest entry of each user key is kept.
      last_sequence_for_key = ikey.sequence;
    }
#if 0
    Log(options_.info_log,
        "  Compact: %s, seq %d, type: %d %d, drop: %d, is_base: %d, "
        "%d smallest_snapshot: %d",
        ikey.user_key.ToString().c_str(),
        (int)ikey.sequence, ikey.type, kTypeValue, drop,
        compact->compaction->IsBaseLevelForKey(ikey.user_key),
        (int)last_sequence_for_key, (int)compact->smallest_snapshot);
#endif

    // If the current entry is kept, write it to the output
    if (!drop) {
      // Open a new output file if necessary
      if (compact->builder == nullptr) {
        status = OpenCompactionOutputFile(compact);
        if (!status.ok()) {
          break;
        }
      }
      if (compact->builder->NumEntries() == 0) {
        // Nothing written yet and the input iterator runs in ascending
        // order, so the current key is the smallest key of this file
        compact->current_output()->smallest.DecodeFrom(key);
      }
      // In any case the current key is the largest key so far
      compact->current_output()->largest.DecodeFrom(key);
      // Write the entry to the sstable.
      // Note: the first occurrence of a user key (last_sequence_for_key ==
      // kMaxSequenceNumber) that is not a deletion is never dropped, even if
      // its sequence number is below smallest_snapshot. This looks odd at
      // first but is consistent: it is the newest entry for the key, so it is
      // the one that readers at smallest_snapshot and later must see.
      compact->builder->Add(key, input->value());

      // Finish and close the output file once it is large enough
      if (compact->builder->FileSize() >=
          compact->compaction->MaxOutputFileSize()) {
        status = FinishCompactionOutputFile(compact, input);
        if (!status.ok()) {
          break;
        }
      }
    }

    // Move on to the next key
    input->Next();
  }

  if (status.ok() && shutting_down_.Acquire_Load()) {
    status = Status::IOError("Deleting DB during compaction");
  }
  if (status.ok() && compact->builder != nullptr) {
    status = FinishCompactionOutputFile(compact, input);
  }
  if (status.ok()) {
    status = input->status();
  }
  delete input;
  input = nullptr;

  CompactionStats stats;
  stats.micros = env_->NowMicros() - start_micros - imm_micros;
  for (int which = 0; which < 2; which++) {
    for (int i = 0; i < compact->compaction->num_input_files(which); i++) {
      stats.bytes_read += compact->compaction->input(which, i)->file_size;
    }
  }
  for (size_t i = 0; i < compact->outputs.size(); i++) {
    stats.bytes_written += compact->outputs[i].file_size;
  }

  mutex_.Lock();
  stats_[compact->compaction->level() + 1].Add(stats);

  if (status.ok()) {
    status = InstallCompactionResults(compact);
  }
  if (!status.ok()) {
    RecordBackgroundError(status);
  }
  VersionSet::LevelSummaryStorage tmp;
  Log(options_.info_log,
      "compacted to: %s", versions_->LevelSummary(&tmp));
  return status;
}

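The compaction work itself is simple: build a single merging iterator over the input files and visit the entries in ascending order (recall how InternalKey is constructed: it guarantees that, for a given user key, we always read the newest entry first), writing the entries we keep into new files. Note that a deletion marker must not be dropped as soon as it is seen. It can only be dropped once we know that no higher level still holds data for that key. Otherwise a later read could fall through to the stale value in a higher level, and data that should have been deleted would become visible again.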
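
The two drop rules can be tried out on a toy input. This is a minimal sketch under stated assumptions: `Entry`, `CompactEntries` and the single `is_base_level` flag are hypothetical simplifications (LevelDB calls `IsBaseLevelForKey` per user key and uses a 56-bit maximum sequence number), and the input is assumed to be sorted by user key ascending, then sequence number descending.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum EntryType { kDeletion = 0, kValue = 1 };

struct Entry {
  std::string user_key;
  uint64_t sequence;
  EntryType type;
};

// Returns the entries that survive the compaction.
std::vector<Entry> CompactEntries(const std::vector<Entry>& input,
                                  uint64_t smallest_snapshot,
                                  bool is_base_level) {
  const uint64_t kMaxSequence = UINT64_MAX;  // simplification of kMaxSequenceNumber
  std::vector<Entry> output;
  std::string current_key;
  bool has_current_key = false;
  uint64_t last_sequence_for_key = kMaxSequence;
  for (const Entry& e : input) {
    if (!has_current_key || e.user_key != current_key) {
      // First (newest) occurrence of this user key.
      current_key = e.user_key;
      has_current_key = true;
      last_sequence_for_key = kMaxSequence;
    }
    bool drop = false;
    if (last_sequence_for_key <= smallest_snapshot) {
      drop = true;  // rule (A): hidden by a newer entry every snapshot can see
    } else if (e.type == kDeletion && e.sequence <= smallest_snapshot &&
               is_base_level) {
      drop = true;  // obsolete deletion marker: nothing left for it to hide
    }
    last_sequence_for_key = e.sequence;
    if (!drop) output.push_back(e);
  }
  return output;
}
```

With a live snapshot at sequence 6, the entries a@9 and a@5 both survive (the snapshot still needs a@5), while a@3 is dropped by rule (A).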

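From then on, the newest data is written to the new file. The file being built is finished and written to disk in either of two cases:

The file reaches its size threshold.
The number of bytes it overlaps in the grandparent level reaches its threshold (this reduces the cost of later compactions).

This completes the compaction.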

Version

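To wrap up LevelDB, let's look at versions. A Version can be seen as LevelDB's interface for managing files: whenever you need a file, or the files of a given level, you go through a Version.

Versioning in LevelDB consists of three parts:

VersionSet: maintains all Versions.
Version: one fixed version; it can be seen as a snapshot of the database.
VersionEdit: an incremental change to a version. Applying a VersionEdit to a Version produces a new Version.

Let's look at Version first.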

class Version {
 
  VersionSet* vset_;           
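  // The next two pointers link Versions into a circular doubly linked list
  // Next version in the list
  Version* next_;
  // Previous version in the list
  Version* prev_;
  // Number of live references to this version
  int refs_;

  // Core member: the level layout of this version,
  // i.e. the list of file metadata for every level of the db
  std::vector<FileMetaData*> files_[config::kNumLevels];

  // Next file to compact based on seek stats, and the level it lives in
  FileMetaData* file_to_compact_;
  int file_to_compact_level_;

  // Compaction score based on the level's size ratio.
  // A score below 1 means the limit is not reached and compaction is not really needed.
  // Computed by Finalize().
  double compaction_score_;
  // Level that should be compacted next, based on the size ratio.
  // Computed by Finalize().
  int compaction_level_;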
};

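As we can see, the most important thing a Version holds is the file metadata of every level; from it we can find the files that belong to each level. That is the main role of a Version: it defines the layout of the levels.

Next, let's look at VersionSet.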

class VersionSet {
  Env* const env_;
  const std::string dbname_;
  const Options* const options_;
  // Every user lookup (DBImpl::Get()) may need to read files on disk,
  // which calls for a cache to speed things up.
  // This member caches the Table instance of each sstable file so that
  // lookups do not have to re-read and re-parse the file every time.
  // The eviction policy currently in use is LRU.
  // The value comes from the DBImpl instance; see the VersionSet constructor.
  TableCache* const table_cache_;
  const InternalKeyComparator icmp_;
  uint64_t next_file_number_;
  uint64_t manifest_file_number_;
  // Sequence number of the most recent update (incremented one by one; a WriteBatch
  // holds a batch of updates and each update gets its own sequence number).
  // It is modified in DBImpl::Write.
  uint64_t last_sequence_;
  uint64_t log_number_;
  uint64_t prev_log_number_;  // 0 or backing store for memtable being compacted

  // Opened lazily
  // The current MANIFEST file
  WritableFile* descriptor_file_;
  // The MANIFEST uses the same format as the log file, so the writer is reused.
  // Each record is one serialized VersionEdit.
  log::Writer* descriptor_log_;
  // All Versions of this VersionSet are kept in a circular doubly linked list,
  // and each new Version is inserted right before dummy_versions_.
  // dummy_versions_.next_ initially points to itself (see the Version
  // constructor) and later points to the oldest version.
  Version dummy_versions_;
  // The current Version == dummy_versions_.prev_
  Version* current_;

  // Per-level key at which the next compaction at that level should start.
  // Either an empty string, or a valid InternalKey.
  std::string compact_pointer_[config::kNumLevels];

  // No copying allowed
  VersionSet(const VersionSet&);
  void operator=(const VersionSet&);
};

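As we can see, VersionSet holds the metadata of the whole database, such as the next file number and the latest sequence number, and it also keeps track of the newest Version. We can regard it as the database's metadata.

Finally, let's look at VersionEdit.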

class VersionEdit {

 private:
  friend class VersionSet;

  typedef std::set< std::pair<int, uint64_t> > DeletedFileSet;

  // Comparator name
  std::string comparator_;
  uint64_t log_number_;
  uint64_t prev_log_number_;
  // Next file number to allocate
  uint64_t next_file_number_;
  // Last sequence number used by a write
  SequenceNumber last_sequence_;
  bool has_comparator_;
  bool has_log_number_;
  bool has_prev_log_number_;
  bool has_next_file_number_;
  bool has_last_sequence_;

  // Start key of the next compaction for each level
  std::vector< std::pair<int, InternalKey> > compact_pointers_;
  // Files to remove from the current level layout
  DeletedFileSet deleted_files_;
  // Files to add to the current level layout (note the second member is not a pointer)
  std::vector< std::pair<int, FileMetaData> > new_files_;
};

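A VersionEdit stores an incremental change to the database: a delta that turns one Version into the next.

Let's tie these three classes together by following the database Open path. Most of the work in Open is done by Recover, so let's look at its code.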

Status DBImpl::Recover(VersionEdit* edit, bool *save_manifest) {
  mutex_.AssertHeld();

  // Ignore error from CreateDir since the creation of the DB is
  // committed only when the descriptor is created, and this directory
  // may already exist from a previous failed creation attempt.
  // Create the database directory (one directory per database)
  env_->CreateDir(dbname_);
  assert(db_lock_ == nullptr);
  // Lock the directory
  Status s = env_->LockFile(LockFileName(dbname_), &db_lock_);
  if (!s.ok()) {
    return s;
  }

  // If the CURRENT file (which names the current MANIFEST file) does not exist, create the database
  if (!env_->FileExists(CurrentFileName(dbname_))) {
    if (options_.create_if_missing) {
      // Create a new database
      s = NewDB();
      if (!s.ok()) {
        return s;
      }
    } else {
      // Report an error
      return Status::InvalidArgument(
          dbname_, "does not exist (create_if_missing is false)");
    }
  } else {
    if (options_.error_if_exists) {
      return Status::InvalidArgument(
          dbname_, "exists (error_if_exists is true)");
    }
  }
  //....
}

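Recover first locks the directory. If the directory has no CURRENT file, the database is new, so we simply create it.

The CURRENT file records the name of the latest MANIFEST file.

With the latest MANIFEST file name in hand, we can call VersionSet::Recover to read the MANIFEST file.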

  Builder builder(this, current_);

  {
    LogReporter reporter;
    reporter.status = &s;
    log::Reader reader(file, &reporter, true/*checksum*/, 0/*initial_offset*/);
    Slice record;
    std::string scratch;
    // Read the MANIFEST log record by record; each record is one VersionEdit
    while (reader.ReadRecord(&record, &scratch) && s.ok()) {
      VersionEdit edit;
      // Deserialize the record into a VersionEdit
      s = edit.DecodeFrom(record);
      if (s.ok()) {
        if (edit.has_comparator_ &&
            edit.comparator_ != icmp_.user_comparator()->Name()) {
          s = Status::InvalidArgument(
              edit.comparator_ + " does not match existing comparator ",
              icmp_.user_comparator()->Name());
        }
      }

      // Accumulate the VersionEdit in the builder, which can later merge all
      // these file changes with the current Version into a new Version in one pass.
      if (s.ok()) {
        builder.Apply(&edit);
      }

      if (edit.has_log_number_) {
        // Remember the newest log file number; later records carry newer numbers
        log_number = edit.log_number_;
        have_log_number = true;
      }

      if (edit.has_prev_log_number_) {
        prev_log_number = edit.prev_log_number_;
        have_prev_log_number = true;
      }

      if (edit.has_next_file_number_) {
        next_file = edit.next_file_number_;
        have_next_file = true;
      }

      if (edit.has_last_sequence_) {
        last_sequence = edit.last_sequence_;
        have_last_sequence = true;
      }
    }
  }

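The MANIFEST file uses the same format as the LOG file; the LOG file format is shown in the figure below.

The data inside each record is a serialized VersionEdit (a tag may appear more than once). <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/version-edit.png"/> </div>

After deserializing the VersionEdits, we can fold all of them into a single Version. In essence, this merges the new files, the deleted files and the compact pointers of each VersionEdit with the existing Version.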

void Apply(VersionEdit* edit) {
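    // Update compaction pointers:
    // copy each level's next compaction start key from the edit into the VersionSet
    for (size_t i = 0; i < edit->compact_pointers_.size(); i++) {
      const int level = edit->compact_pointers_[i].first;
      // Unlike the additions and deletions below, this modifies vset_ directly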
      vset_->compact_pointer_[level] =
          edit->compact_pointers_[i].second.Encode().ToString();
    }

    // Delete files:
    // record the edit's deleted files in levels_[].deleted_files
    const VersionEdit::DeletedFileSet& del = edit->deleted_files_;
    for (VersionEdit::DeletedFileSet::const_iterator iter = del.begin();
         iter != del.end();
         ++iter) {
      const int level = iter->first;
      const uint64_t number = iter->second;
      levels_[level].deleted_files.insert(number);
    }

    // Add new files:
    // record the edit's new files in levels_[].added_files
    for (size_t i = 0; i < edit->new_files_.size(); i++) {
      // The first member of the pair is the level
      const int level = edit->new_files_[i].first;
      // The second member of the pair is the FileMetaData
      FileMetaData* f = new FileMetaData(edit->new_files_[i].second);
      f->refs = 1;

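      // leveldb automatically compacts a file after a certain number of seeks.
      // Let's assume:
      //    (1) One seek costs 10ms
      //    (2) Writing or reading 1MB costs 10ms (100MB/s, a typical disk IO rate)
      //    (3) A compaction of 1MB does 25MB of IO:
      //        1MB read from level-L
      //        10-12MB of overlapping data read from level-(L+1) (boundaries may be misaligned)
      //        10-12MB written back to level-(L+1)
      // This implies that 25 seeks cost the same as the compaction of 1MB of
      // data (250ms each), i.e. one seek costs about as much as compacting
      // 40KB (= 1MB/25) of data. Reality is less ideal, so we are a little
      // conservative and assume one seek is worth compacting 16KB of data.
      // The number of seeks allowed before compaction is therefore
      // file size / 16KB. A file is at most 2MB, which allows at most
      // 128 seeks; going beyond the limit triggers a compaction.
      f->allowed_seeks = (f->file_size / 16384);
      // Allow at least 100 seeks.
      if (f->allowed_seeks < 100) f->allowed_seeks = 100;

      // TODO: can a file be in both the deleted set and the added set?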
      levels_[level].deleted_files.erase(f->number);
      levels_[level].added_files->insert(f);
    }
  }

  // Merge the files of the current version with the files added in the
  // builder, in key order, and append the result to the new Version v.
  void SaveTo(Version* v) {
    BySmallestKey cmp;
    cmp.internal_comparator = &vset_->icmp_;
    // For each level, from low to high, merge the file list of Version base_
    // with the files added for that level in Builder::levels_, and store the
    // result in the same level of Version v.
    for (int level = 0; level < config::kNumLevels; level++) {
      // Merge the set of added files with the set of pre-existing files.
      // Drop any deleted files. Store the result in *v.

      // Files of level-L in Version base_
      const std::vector<FileMetaData*>& base_files = base_->files_[level];
      std::vector<FileMetaData*>::const_iterator base_iter = base_files.begin();
      std::vector<FileMetaData*>::const_iterator base_end = base_files.end();
      // Files added to level-L, held by the builder
      const FileSet* added = levels_[level].added_files;
      v->files_[level].reserve(base_files.size() + added->size());
      // The two loops below merge the two lists in ascending key order
      // (a standard merge of two sorted lists).
      for (FileSet::const_iterator added_iter = added->begin();
           added_iter != added->end();
           ++added_iter) {
        // For each added file *added_iter, find the first base_ file that
        // sorts after it, then append all base_ files before that position
        // (both lists are sorted in ascending order) to v.
        // The search uses the BySmallestKey comparator.
        for (std::vector<FileMetaData*>::const_iterator bpos
                 = std::upper_bound(base_iter, base_end, *added_iter, cmp);
             base_iter != bpos; // equal means no remaining base_ file sorts before *added_iter
             ++base_iter) {
          // *base_iter sorts before *added_iter:
          // append it to the file list of this level in Version v
          MaybeAddFile(v, level, *base_iter);
        }

        // All base_ files smaller than *added_iter have been appended;
        // now append *added_iter itself to this level of Version v.
        MaybeAddFile(v, level, *added_iter);
      }

      // Add remaining base files
      for (; base_iter != base_end; ++base_iter) {
        MaybeAddFile(v, level, *base_iter);
      }
    }
  }
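
The effect of Apply followed by SaveTo can be sketched as "new version = base version - deleted files + added files". This is a minimal sketch, not the Builder itself: `SimpleVersion`, `SimpleEdit` and `ApplyEdits` are hypothetical names, a version is reduced to a set of file numbers per level, and the ordering by smallest key and the reference counting are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical simplification: level -> set of file numbers.
using SimpleVersion = std::map<int, std::set<uint64_t>>;

struct SimpleEdit {
  std::vector<std::pair<int, uint64_t>> deleted_files;  // (level, file number)
  std::vector<std::pair<int, uint64_t>> new_files;      // (level, file number)
};

// Builds a new version from `base` and a sequence of edits.
// `base` is left untouched, mirroring the fact that Versions are immutable.
SimpleVersion ApplyEdits(const SimpleVersion& base,
                         const std::vector<SimpleEdit>& edits) {
  SimpleVersion result = base;
  for (const SimpleEdit& edit : edits) {
    for (const auto& d : edit.deleted_files) result[d.first].erase(d.second);
    for (const auto& n : edit.new_files) result[n.first].insert(n.second);
  }
  return result;
}
```

For example, an edit describing a compaction of level-0 files 3 and 4 with level-1 file 1 into new level-1 files 5 and 6 deletes three files and adds two, leaving level 0 empty in the new version.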

The newly built Version is then added to the VersionSet and made current.

void VersionSet::AppendVersion(Version* v) {
  // Make "v" current
  assert(v->refs_ == 0);
  assert(v != current_);
  if (current_ != nullptr) {
    current_->Unref();
  }
  current_ = v;
  // current_ now references v, so increment v's reference count
  v->Ref();

  // Append to linked list
  // Insert v into the circular doubly linked list; the newest version is always the predecessor of dummy_versions_.
  v->prev_ = dummy_versions_.prev_;
  v->next_ = &dummy_versions_;
  v->prev_->next_ = v;
  v->next_->prev_ = v;
}

If you want to know how a new Version and a new MANIFEST are produced during compaction, have a look at the implementation of VersionSet::LogAndApply.

Overview

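This post covered compaction and versions in LevelDB. It does not spell out every detail; if you want the full picture, I recommend reading the relevant code. The goal here is to sketch the overall shape of LevelDB along with some of the more important details. I hope it helps you understand the LevelDB code.

LevelDB(2) -- Read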

tang-hi — Mon, 19 Jun 2023 00:00:00 GMT

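This post covers LevelDB's read path and the iterators it relies on.

Overview

Let's first look at the overall flow of a read in LevelDB.

We first try to read the KV from the MemTable. If it is not there, we read from the ImmutableMemTable. If it is still not found, we try to get the KV from the ${version}.ldb files.

The MemTable and the ImmutableMemTable have exactly the same structure. The only difference is that one is the MemTable currently in use, while the other has reached the flush threshold and is about to be written to disk. This post is therefore split into two parts:

Reading a KV from the Memtable.
Reading a KV from ${version}.ldb.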

Read From Memtable

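Recall the description of the Memtable in LevelDB(1) -- Write: the user's key is converted into an InternalKey before it is inserted. To keep keys consistent at query time, we must convert the lookup key into an InternalKey as well.

Here the sequence number is set to the maximum value (so that we get the newest data); if the user specified a snapshot, the snapshot's sequence number is used instead. The tag uses kValueTypeForSeek. Entries are sorted with the sequence number taken into account: when the user_key parts are equal, entries are ordered by tag (a seven-byte sequence number followed by a one-byte ValueType) in descending order, so a larger tag means a smaller internal_key. We therefore use the largest ValueType, which ensures that the first entry greater than or equal to k found by MemTable.Seek(k) (MemTable entries are sorted in ascending order) is the entry we are looking for.

With the InternalKey built, we can start querying the Memtable. The whole query interface of the Memtable is exposed through an iterator, so let's look at the iterator interface first.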

class Iterator {
   public:
    // Initialize an iterator over the specified list.
    // The returned iterator is not valid.
    explicit Iterator(const SkipList* list);

    // Returns true iff the iterator is positioned at a valid node.
    bool Valid() const;

    // Returns the key at the current position.
    // REQUIRES: Valid()
    const Key& key() const;

    // Advances to the next position.
    // REQUIRES: Valid()
    void Next();

    // Advances to the previous position.
    // REQUIRES: Valid()
    void Prev();

    // Advance to the first entry with a key >= target
    void Seek(const Key& target);

    // Position at the first entry in list.
    // Final state of iterator is Valid() iff list is not empty.
    void SeekToFirst();

    // Position at the last entry in list.
    // Final state of iterator is Valid() iff list is not empty.
    void SeekToLast();

   private:
    const SkipList* list_;
    Node* node_;
    // Intentionally copyable
  };

Here we mainly care about Seek.

template<typename Key, class Comparator>
inline void SkipList<Key,Comparator>::Iterator::Seek(const Key& target) {
  node_ = list_->FindGreaterOrEqual(target, nullptr); 
}

As we can see, it simply delegates to SkipList::FindGreaterOrEqual.

template<typename Key, class Comparator>
typename SkipList<Key,Comparator>::Node* 
SkipList<Key,Comparator>::FindGreaterOrEqual(const Key& key, Node** prev)
    const {
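  // head_ is the first node of the skip list's base list;
  // it stores no user data and only serves as a sentinel.
  Node* x = head_;
  // Every search starts at the highest index level and only drops to the
  // next, finer-grained level once the target range has been narrowed down.
  // Levels are counted from 0, so the highest level is GetMaxHeight() - 1.
  int level = GetMaxHeight() - 1;
  while (true) {
    // Next() performs a load with a memory barrier. Readers may run
    // concurrently with a writer without taking a lock, so the barrier is
    // what lets a reader observe a fully initialized node.
    Node* next = x->Next(level);
    if (KeyIsAfterNode(key, next)) {
      // key is after next: keep searching forward in this level
      x = next;
    } else {
      // key may be here.
      //
      // If key is smaller than every node in the SkipList, the node finally
      // returned is head_->Next(0) and every slot of prev holds the dummy head.
      // The caller must compare the returned node with its own key
      // to decide whether the target was actually found.
      if (prev != nullptr) prev[level] = x;
      if (level == 0) {
        // Found it. If key is larger than every node in the SkipList, next ends up nullptr.
        return next;
      } else {
        // The range is narrowed down but still too coarse: descend one level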
        level--;
      }
    }
  }
}

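The code is simple and easy to follow: start from the highest level; while the key is greater than the next node, keep moving forward; otherwise drop down one level; once at the lowest level, return that node.

After finding the node, we extract the user key from it according to the InternalKey format and compare it with the key the user asked for. If they differ, we report not found. If they match, we have found the newest value for that key, and we check whether the node's tag is kTypeValue or kTypeDeletion. With kTypeValue we have found the value; with kTypeDeletion the value has been deleted.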
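
The lookup can be reproduced with an ordered map standing in for the skip list. This is a minimal sketch under stated assumptions: `ToyMemTable`, `InternalKeyLess`, `PackTag`, `GetResult` and `Get` are hypothetical names, the internal key is modeled as a (user key, tag) pair instead of an encoded byte string, and kTypeValue is used as the seek type because it is the largest ValueType.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

enum ValueType : uint8_t { kTypeDeletion = 0x0, kTypeValue = 0x1 };

// tag = (sequence << 8) | type
inline uint64_t PackTag(uint64_t sequence, ValueType type) {
  return (sequence << 8) | type;
}

using ToyInternalKey = std::pair<std::string, uint64_t>;  // (user key, tag)

// User key ascending, then tag descending (newest entry first).
struct InternalKeyLess {
  bool operator()(const ToyInternalKey& a, const ToyInternalKey& b) const {
    if (a.first != b.first) return a.first < b.first;
    return a.second > b.second;
  }
};

// Hypothetical stand-in for the skip list.
using ToyMemTable = std::map<ToyInternalKey, std::string, InternalKeyLess>;

enum class GetResult { kFound, kDeleted, kNotInMemTable };

GetResult Get(const ToyMemTable& mem, const std::string& user_key,
              uint64_t snapshot, std::string* value) {
  // Seek to the first entry >= (user_key, snapshot, largest type), i.e. the
  // newest entry of user_key whose sequence number is <= snapshot.
  const ToyInternalKey lookup(user_key, PackTag(snapshot, kTypeValue));
  auto it = mem.lower_bound(lookup);
  if (it == mem.end() || it->first.first != user_key) {
    return GetResult::kNotInMemTable;
  }
  if ((it->first.second & 0xff) == kTypeDeletion) return GetResult::kDeleted;
  *value = it->second;
  return GetResult::kFound;
}
```

A reader at snapshot 6 that looks up a key written at sequences 1 and 5 and deleted at sequence 8 lands on the entry with sequence 5; a reader at a later snapshot lands on the deletion marker.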

As we can see, reading from the Memtable is fairly simple once the ordering of InternalKeys is understood.

Read From LDB FILE

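If the KV cannot be found in the Memtable, we have to look in the files. There are two cases: searching Level-0 files and searching files in Level-1 and above. The distinction exists because the key ranges of Level-0 files may overlap, for example file1 covering [0,100] and file2 covering [50,200], while the files in Level-1 and above never overlap. We therefore read them in different ways. The search starts at Level-0 and moves up level by level: the lower the level, the newer the data, so once we find the newest data there is no need to search any further.

1. Determine which files to read

The first step is to determine which files to read. As the figure below shows, this is done differently for Level-0 and for the other levels. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/determin-file.png"/> </div>

Because the key ranges of Level-0 files may overlap, we have to check the files one by one. For Level-x (x >= 1) the files are disjoint, so with the files sorted by key we can use binary search to speed up the lookup.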
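
The two selection strategies can be sketched side by side. This is a minimal sketch with hypothetical names (`KeyRange`, `CandidatesInLevel0`, `CandidateInSortedLevel`): keys are plain strings, ranges are inclusive, and the ordering of Level-0 candidates from newest to oldest is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a file's key range (inclusive on both ends).
struct KeyRange {
  std::string smallest;
  std::string largest;
};

// Level-0: ranges may overlap, so every file has to be checked.
std::vector<std::size_t> CandidatesInLevel0(const std::vector<KeyRange>& files,
                                            const std::string& key) {
  std::vector<std::size_t> hits;
  for (std::size_t i = 0; i < files.size(); i++) {
    if (files[i].smallest <= key && key <= files[i].largest) hits.push_back(i);
  }
  return hits;
}

// Level >= 1: files are sorted and disjoint, so binary search for the first
// file whose largest key is >= key. Returns files.size() if no file can
// contain the key.
std::size_t CandidateInSortedLevel(const std::vector<KeyRange>& files,
                                   const std::string& key) {
  auto it = std::lower_bound(
      files.begin(), files.end(), key,
      [](const KeyRange& f, const std::string& k) { return f.largest < k; });
  if (it == files.end() || key < it->smallest) return files.size();
  return static_cast<std::size_t>(it - files.begin());
}
```

In a sorted level at most one file can contain the key, which is why the second function returns a single index rather than a list.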

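2. Read the ldb file and look up the KV

Once the files to read are determined, we read them and check whether the key we are looking for is in the ldb file.

To read a file, we first try to find its cached Table in the TableCache. If it is not there, we read the file from disk, deserialize it into a Table, and insert that Table into the TableCache.

2.1 Implementation of TableCache

The caching policy LevelDB currently uses is LRU. Let's look at the Cache interface first.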

class LEVELDB_EXPORT Cache {
 public:
  Cache() = default;

  Cache(const Cache&) = delete;
  Cache& operator=(const Cache&) = delete;

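  // Destroys every entry by calling the deleter function that was
  // passed in when the entry was inserted.
  virtual ~Cache();

  // Opaque type of an entry stored in the cache; see LRUHandle for the implementation
  struct Handle { };

  /**
   * Inserts a <key, value> mapping into the cache and assigns it a
   * charge against the cache capacity; in practice the charge is the
   * number of bytes of the inserted data.
   *
   * Returns a handle that corresponds to the inserted mapping.
   * The caller must call this->Release(handle) when the mapping is no longer needed.
   *
   * When the inserted entry is no longer needed, its key and value are passed to the given deleter.
   * @param key key of the mapping to insert
   * @param value value of the mapping to insert
   * @param charge charge of the mapping to insert
   * @param deleter deleter of the mapping to insert
   * @return handle of the inserted mapping
   */
  virtual Handle* Insert(const Slice& key, void* value, size_t charge,
                         void (*deleter)(const Slice& key, void* value)) = 0;

  /**
   * Returns nullptr if the cache has no mapping for key.
   * Otherwise returns a handle that corresponds to the mapping.
   * The caller must call this->Release(handle) when the mapping is no longer needed.
   * @param key key of the mapping to look up
   * @return handle of the mapping
   */
  virtual Handle* Lookup(const Slice& key) = 0;


  /**
   * Releases a mapping whose handle was previously returned by Lookup.
   *
   * Requires: handle must not have been released yet.
   * Requires: handle must have been returned by a method on *this.
   * @param handle handle of the mapping, as returned by Lookup
   */
  virtual void Release(Handle* handle) = 0;

  /**
   * Returns the value encapsulated in a handle returned by a successful Lookup.
   *
   * Requires: handle must not have been released yet
   * Requires: handle must have been returned by a method on *this
   * @param handle
   * @return
   */
  virtual void* Value(Handle* handle) = 0;

  /**
   * If the cache contains a mapping for key, erases it.
   * Note that the underlying entry is kept around until all existing handles to it have been released.
   * @param key key of the mapping to erase
   */
  virtual void Erase(const Slice& key) = 0;

  /**
   * Returns a new numeric id.
   * May be used by multiple clients sharing the same cache to partition the key space.
   *
   * Typically a client allocates a new id at startup
   * and prepends the id to its cache keys.
   * @return
   */
  virtual uint64_t NewId() = 0;

  /**
   * Removes all cache entries that are not actively in use.
   * Memory-constrained applications may call this method to reduce memory usage.
   *
   * The default implementation does nothing; subclasses are strongly encouraged to override it.
   * A future release of leveldb may change this into a pure abstract method.
   */
  virtual void Prune() {}

  /**
   * Returns an estimate of the combined charges of all elements stored in the cache
   * @return
   */
  virtual size_t TotalCharge() const = 0;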

 private:
  void LRU_Remove(Handle* e);
  void LRU_Append(Handle* e);
  void Unref(Handle* e);

  struct Rep;
  Rep* rep_;
};

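The Cache interface is implemented by ShardedLRUCache. It maintains several cache shards, so concurrent accesses do not contend on one big lock; locking happens at a finer granularity, which improves performance under concurrency.

Let's look at the implementation of ShardedLRUCache::Insert.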

virtual Handle* Insert(const Slice& key, void* value, size_t charge,
                         void (*deleter)(const Slice& key, void* value)) {
    // Compute the hash
    const uint32_t hash = HashSlice(key);
    // Pick the shard based on the hash
    return shard_[Shard(hash)].Insert(key, hash, value, charge, deleter);
}

As the code shows, ShardedLRUCache only works out which shard the key belongs to; the actual logic is carried out by LRUCache.
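
The sharding idea can be sketched on its own: hash the key, use the top bits of the hash to pick a shard, and lock only that shard. This is a minimal sketch with assumptions: `ShardedMap` is a hypothetical class, `std::hash` stands in for LevelDB's own hash function, the shard count of 16 is chosen for illustration, and each shard is a plain map rather than an LRU cache.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

class ShardedMap {
 public:
  static constexpr int kNumShardBits = 4;
  static constexpr int kNumShards = 1 << kNumShardBits;

  // Maps a key to a shard index using the top kNumShardBits of a 32-bit hash.
  static uint32_t ShardOf(const std::string& key) {
    const uint32_t hash = static_cast<uint32_t>(std::hash<std::string>{}(key));
    return hash >> (32 - kNumShardBits);
  }

  void Put(const std::string& key, int value) {
    ShardData& s = shards_[ShardOf(key)];
    std::lock_guard<std::mutex> lock(s.mu);  // only this shard is locked
    s.data[key] = value;
  }

  bool Get(const std::string& key, int* value) {
    ShardData& s = shards_[ShardOf(key)];
    std::lock_guard<std::mutex> lock(s.mu);
    auto it = s.data.find(key);
    if (it == s.data.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  struct ShardData {
    std::mutex mu;
    std::unordered_map<std::string, int> data;
  };
  ShardData shards_[kNumShards];
};
```

Two threads touching keys in different shards never wait on each other, which is the benefit described above.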

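/**
 * Like Cache::Insert(), but takes an additional hash parameter.
 * The method is thread safe: multiple threads may insert into the same shard concurrently.
 *
 * @param key key of the entry to insert
 * @param hash hash of the entry to insert
 * @param value value of the entry to insert
 * @param charge charge of the entry to insert
 * @param deleter deleter of the entry to insert
 * @return handle of the inserted entry
 */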
Cache::Handle* LRUCache::Insert(
    const Slice& key, uint32_t hash, void* value, size_t charge,
    void (*deleter)(const Slice& key, void* value)) {
  MutexLock l(&mutex_);

  // Allocate space for the LRUHandle itself plus the actual key length.
  // The "-1" accounts for the one byte that LRUHandle already reserves
  // for key_data; without it, adding key.size() would allocate one byte too many.
  LRUHandle* e = reinterpret_cast<LRUHandle*>(
      malloc(sizeof(LRUHandle)-1 + key.size()));
  e->value = value;
  e->deleter = deleter;
  e->charge = charge;
  e->key_length = key.size();
  e->hash = hash;
  e->in_cache = false;
  // One reference for the handle returned to the caller.
  e->refs = 1;
  memcpy(e->key_data, key.data(), key.size());

  if (capacity_ > 0) {
    // One more reference for the cache itself.
    e->refs++;
    // The entry is now part of the shard
    e->in_cache = true;
    // Append the entry to the shard's in_use_ list
    LRU_Append(&in_use_, e);
    usage_ += charge;
    // Insert the entry into the hash table.
    // If the shard already holds an entry with the same key and the same hash,
    // the new entry replaces it and the old one is removed from the shard.
    FinishErase(table_.Insert(e));
  } else {
    // capacity_ == 0 means caching is turned off.
    // The assignment keeps the assert in key() from failing.
    e->next = nullptr;
  }
  // The loop below is where the LRU policy takes effect.
  // While the shard's usage exceeds its capacity and the lru_ list is not
  // empty, evict entries from the lru_ list (entries on it are not in use
  // by any client) until usage fits the capacity or the list is empty.
  while (usage_ > capacity_ && lru_.next != &lru_) {
    // Important: lru_.next is the least recently used entry
    LRUHandle* old = lru_.next;
    // Entries on the lru_ list are referenced only by the shard, not by any client
    assert(old->refs == 1);
    // Remove old from the shard completely
    bool erased = FinishErase(table_.Remove(old->key(), old->hash));
    if (!erased) {
      // to avoid unused variable when compiled NDEBUG
      assert(erased);
    }
  }

  // Reinterpret the LRUHandle as a Cache::Handle
  return reinterpret_cast<Cache::Handle*>(e);
}

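We first malloc a new LRUHandle and fill in its fields. We then append it to the in-use list and insert it into the hash table; if the key was cached before, the old entry is removed. Finally, if usage exceeds the capacity, we try to evict the least recently used entries.

Now let's look at the lookup code. It is quite simple: use the key to find the corresponding LRUHandle in the hash table and return it.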

Cache::Handle* LRUCache::Lookup(const Slice& key, uint32_t hash) {
  MutexLock l(&mutex_);
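  // table_ is a hash table holding pointers to all entries of this shard;
  // lookup is O(1).
  LRUHandle* e = table_.Lookup(key, hash);
  if (e != nullptr) {
    // On a hit, increment the entry's reference count,
    // because the caller is going to use the returned handle.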
    Ref(e); 
  }
  return reinterpret_cast<Cache::Handle*>(e);
}
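
The interplay of the list and the hash table can be sketched with standard containers. This is a minimal sketch, not LRUCache itself: `ToyLRUCache` is a hypothetical class that counts entries instead of charges, has no reference counting or separate in-use list, and is not thread safe.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class ToyLRUCache {
 public:
  explicit ToyLRUCache(std::size_t capacity) : capacity_(capacity) {}

  void Insert(const std::string& key, int value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      // Same key cached before: remove the old entry first.
      order_.erase(it->second);
      index_.erase(it);
    }
    order_.emplace_back(key, value);  // newest entry goes to the back
    index_[key] = std::prev(order_.end());
    // Evict from the front (least recently used) while over capacity.
    while (index_.size() > capacity_ && !order_.empty()) {
      index_.erase(order_.front().first);
      order_.pop_front();
    }
  }

  bool Lookup(const std::string& key, int* value) {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    // Move the entry to the back to mark it as most recently used.
    order_.splice(order_.end(), order_, it->second);
    *value = it->second->second;
    return true;
  }

 private:
  using Item = std::pair<std::string, int>;
  std::size_t capacity_;
  std::list<Item> order_;  // front = least recently used
  std::unordered_map<std::string, std::list<Item>::iterator> index_;
};
```

With a capacity of 2, inserting "a" and "b", reading "a", and then inserting "c" evicts "b", because "a" was used more recently.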

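2.2 Deserializing and searching an LDB file

The entry point for deserializing an LDB file is Table::Open. We will follow the LDB layout to understand how deserialization works.

We first read the Footer to get the positions and sizes of the index block and the filter index block. Using the index block's position and size, we then load the content of the index block (at this point we only verify integrity with the CRC and decompress the block according to its compress type; no further parsing is done). Next, we use the filter index block to load the filter block (here the filter block's base, offset and bloom value are all parsed). Note that data blocks are not parsed at all.

At this point the ldb file has been deserialized into a Table. Now let's look at the lookup.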

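// Look up the entry for k in the table.
// If the table has a filter, consult the filter first;
// if there is no filter, or the filter cannot rule the key out, search the
// data block and, on a hit, save the key/value through saver.
// Note that data blocks are read and parsed inside this method.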
Status Table::InternalGet(const ReadOptions& options, const Slice& k,
                          void* arg,
                          void (*saver)(void*, const Slice&, const Slice&)) {
  Status s;
  // Build an iterator over the data index block
  Iterator* iiter = rep_->index_block->NewIterator(rep_->options.comparator);
  // Find the first index entry >= k; that entry
  // is the handle of the target data block.
  iiter->Seek(k);
  if (iiter->Valid()) {
    // Take the BlockHandle of that data block
    Slice handle_value = iiter->value();
    FilterBlockReader* filter = rep_->filter;
    BlockHandle handle;
    // A filter makes this fast: if it says the key is definitely
    // absent, we can return right away.
    if (filter != nullptr &&
        handle.DecodeFrom(&handle_value).ok() &&
        !filter->KeyMayMatch(handle.offset(), k)) {
      // The filter of this data block does not match the key: definitely not present
    } else {
      // There is no filter, or the filter cannot rule the key out,
      // so we have to search the block itself.
      // Note that Open() parses no data block at all; the block is
      // read and parsed here, when it is actually needed.
      Iterator* block_iter = BlockReader(this, options, iiter->value());
      block_iter->Seek(k);
      if (block_iter->Valid()) {
        // Save the key/value found into the output parameter arg,
        // because the iterator is released below.
        (*saver)(arg, block_iter->key(), block_iter->value()); 
      }
      s = block_iter->status();
      delete block_iter;
    }
  }
  if (s.ok()) {
    s = iiter->status();
  }
  delete iiter;
  return s;
}

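We first look in the index block for the first entry greater than or equal to k, which quickly locates the data block. Before searching the data block, we can use the bloom filter to check whether the key may be in that data block. If it may be, we search the data block directly and return the result.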
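
The three steps (index lookup, filter check, data block search) can be sketched with in-memory containers. This is a minimal sketch under stated assumptions: `ToyBlock`, `ToyTable` and `TableGet` are hypothetical names, an exact `std::set` stands in for the bloom filter (so it has no false positives), and each index entry is simply a key that is greater than or equal to every key in its block.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

struct ToyBlock {
  std::map<std::string, std::string> entries;  // sorted key/value pairs
  std::set<std::string> filter;                // stand-in for a bloom filter
};

struct ToyTable {
  std::vector<ToyBlock> blocks;    // data blocks in key order
  std::vector<std::string> index;  // index[i] >= every key in blocks[i]
};

bool TableGet(const ToyTable& t, const std::string& key, std::string* value) {
  // 1. Index block: first entry >= key identifies the candidate data block.
  auto it = std::lower_bound(t.index.begin(), t.index.end(), key);
  if (it == t.index.end()) return false;
  const ToyBlock& block = t.blocks[it - t.index.begin()];
  // 2. Filter: skip the data block if the key is definitely absent.
  if (block.filter.count(key) == 0) return false;
  // 3. Data block: search for the key itself.
  auto e = block.entries.find(key);
  if (e == block.entries.end()) return false;
  *value = e->second;
  return true;
}
```

A real bloom filter can answer "maybe present" for a key that is absent, which is why step 3 still has to compare the key found in the data block.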

Overview

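This post covered the flow of a read in LevelDB. It does not spell out every detail; if you want the full picture, I recommend reading the relevant code. The goal here is to sketch the overall shape of LevelDB along with some of the more important details. I hope it helps you understand the LevelDB code.

LevelDB(1) -- Write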

tang-hi — Thu, 15 Jun 2023 00:00:00 GMT

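This post covers how LevelDB stores written data and the on-disk format of that data.

Overview

Let's first look at the overall write flow in LevelDB.

As the figure shows, LevelDB uses the classic WAL approach for writes: the write operation is first appended to the log file, then the actual data is written to the in-memory Memtable (implemented as a skip list). When the Memtable reaches its threshold, it is turned into an ImmutableMemTable, which is eventually flushed to disk for persistence.

The rest of this post therefore covers:

The format and generation of the log file
The implementation of the memtable
The format and generation of the ldb file

Generating a write operation

Before describing file formats and how the files are generated, we need to describe how a write operation is produced.

When the user calls DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value), we wrap the key and value in a WriteBatch. As the name suggests, a WriteBatch can hold many write operations. Its definition is shown below.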

class LEVELDB_EXPORT WriteBatch {
 public:
  WriteBatch();
  // skip ....
  // Intentionally copyable.
  WriteBatch(const WriteBatch&) = default; // default copy constructor
  WriteBatch& operator =(const WriteBatch&) = default; // default copy assignment
  // skip ....
  // Store the mapping "key->value" in the database.
  void Put(const Slice& key, const Slice& value);
  ~WriteBatch();
  // skip ....
 private:

  std::string rep_;  // See comment in write_batch.cc for the format of rep_
};

}  // namespace leveldb

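We can see that it is essentially a string: the Put interface appends each write operation to rep_. The in-memory layout of a WriteBatch is shown below. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/write-batch.png"/> </div>

As the figure shows, a WriteBatch consists of a SequenceNumber, a count, and count KV pairs. The first 8 bytes are the SequenceNumber (a unique, monotonically increasing number for write operations in LevelDB), and the next 4 bytes are the number of KV pairs stored. Each KV pair consists of a tag, the key size and key content, and the value size and value content.

The tag is an enum: kTypeDeletion marks a deletion and kTypeValue marks an insertion.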

enum ValueType {
  kTypeDeletion = 0x0,
  kTypeValue = 0x1
};
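To make the layout concrete, here is a minimal sketch that encodes a batch holding a single Put. It assumes fixed-width integers are little-endian and that the key and value sizes are stored as varint32, which is how I understand LevelDB's coding helpers to work. PutFixed, PutVarint32 and EncodeSinglePut are simplified stand-ins written for this post, not the real LevelDB functions.

```cpp
#include <cstdint>
#include <string>

// Append the low `bytes` bytes of v in little-endian order.
inline void PutFixed(std::string* dst, uint64_t v, int bytes) {
  for (int i = 0; i < bytes; i++) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Append v as a base-128 varint: 7 bits per byte, low-order group first,
// high bit set on every byte except the last.
inline void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Hypothetical helper: build the rep_ of a batch with one kTypeValue record.
inline std::string EncodeSinglePut(uint64_t sequence, const std::string& key,
                                   const std::string& value) {
  std::string rep;
  PutFixed(&rep, sequence, 8);          // 8-byte sequence number
  PutFixed(&rep, 1, 4);                 // 4-byte count of KV pairs
  rep.push_back(static_cast<char>(1));  // tag: kTypeValue
  PutVarint32(&rep, static_cast<uint32_t>(key.size()));
  rep.append(key);
  PutVarint32(&rep, static_cast<uint32_t>(value.size()));
  rep.append(value);
  return rep;
}
```

Encoding key k with value vv gives 18 bytes: 12 bytes of header, 1 tag byte, 2 bytes for the key and 3 bytes for the value.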

After the WriteBatch is created (at this point it contains only the KV pair supplied by the user), we create a Writer and store the WriteBatch in it.

struct DBImpl::Writer {
  Status status;
  WriteBatch* batch;
  bool sync;
  bool done;
  port::CondVar cv;

  explicit Writer(port::Mutex* mu) : cv(mu) { }
};

We then push the Writer onto a queue (a classic producer-consumer model). A Writer is taken out and executed only when it reaches the front of the queue.

writers_.push_back(&w);
while (!w.done && &w != writers_.front()) {
  w.cv.Wait();
}

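When the Writer is executed, we first make sure the MemTable still has enough room for the write; otherwise a compaction may be needed. (You can ignore this for now; it will be covered in detail when we discuss compaction. Here we simply assume there is always enough room.)

At this point we try to merge several WriteBatch objects into one and execute them together. The logic is shown in the code below.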

WriteBatch* DBImpl::BuildBatchGroup(Writer** last_writer) {
  mutex_.AssertHeld();
  assert(!writers_.empty());
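  // Take the writer at the front of the queue.
  Writer* first = writers_.front();
  // Take the batch that the front writer wants to write.
  WriteBatch* result = first->batch;
  assert(result != nullptr);

  // Compute the size of the front writer's batch.
  size_t size = WriteBatchInternal::ByteSize(first->batch);

  // Merging is supported, but with two restrictions:
  // 1. A sync write (writer.sync set) is never merged into a group led by a
  //    non-sync write; merging stops there and returns what was merged so far.
  // 2. To keep small writes from being delayed too long, the merged size
  //    is capped at 1MB.
  size_t max_size = 1 << 20;
  // If the front writer's batch is no larger than 128KB ...
  if (size <= (128<<10)) {
    // ... lower the cap to that size plus 128KB (at most 256KB).
    max_size = size + (128<<10);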
  }

  *last_writer = first;
  std::deque<Writer*>::iterator iter = writers_.begin();
  ++iter;  // Advance past "first"
  // iter walks the writers that come after first.
  for (; iter != writers_.end(); ++iter) {
    Writer* w = *iter;
    // A sync write is not merged into a non-sync group.
    if (w->sync && !first->sync) {
      // Do not include a sync write into a batch handled by a non-sync write.
      break;
    }

    if (w->batch != nullptr) {
      size += WriteBatchInternal::ByteSize(w->batch);
      if (size > max_size) {
        // Keep the batch group from growing too large.
        break;
      }

      // Append to *result
      if (result == first->batch) {
        // Do not modify the first writer's batch; merge the batches into the temporary tmp_batch_ instead.
        result = tmp_batch_;
        assert(WriteBatchInternal::Count(result) == 0);
        WriteBatchInternal::Append(result, first->batch);
      }
      WriteBatchInternal::Append(result, w->batch);
    }
    // last_writer points at the last writer that was merged.
    *last_writer = w;
  }
  return result;
}

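In short, we walk the pending WriteBatch objects, and a batch is merged as long as:

It does not require a sync write while the leading batch is non-sync.
Merging it would not push the WriteBatch size past max_size.

As soon as either condition is violated, merging stops.

After these steps, all preprocessing for the write is done and the actual write can begin.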

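Format and Generation of the Log File

Once the WriteBatch to be written has been built, we first write it to the log file. Internally the log file is organized into blocks, as shown in the figure below.

The log file is made up of blocks, each with a fixed size of 32KB. A block stores multiple records, each holding a WriteBatch or a fragment of one: the first four bytes are a checksum, the next two bytes are the length of the data, and the following byte is the type (these seven bytes are collectively called the Header). The rest is the data itself, that is, the rep_ of the WriteBatch.

If the space left in a block is too small to hold a Header, that is, fewer than 7 bytes remain, we pad the end of the block with zeros and write the data to a new block.

Now for the meaning of type in the Header. Consider the case where the space left in a block is too small to hold the whole WriteBatch. We then have to split the data into fragments stored in different blocks. So that a later reader can tell whether a record is complete, and when it has read the last fragment of one, we need type as a marker.

type is again an enum.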

enum RecordType {
  // Zero is reserved for preallocated files
  kZeroType = 0,

  kFullType = 1,

  // For fragments
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4
};

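Here:

kFullType means the data that follows is a complete record.
kFirstType means this is the first fragment of a split record; more must be read.
kMiddleType means this is a middle fragment of a split record; more must be read.
kLastType means this is the last fragment of a split record; nothing more needs to be read.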
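The splitting rule can be sketched as a small planning function. This is a simplified model of my own, not the LevelDB log writer: it computes only the fragment types and payload lengths and skips the checksum and the actual file I/O. PlanFragments and Fragment are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

constexpr size_t kBlockSize = 32768;  // fixed 32KB block
constexpr size_t kHeaderSize = 7;     // checksum (4) + length (2) + type (1)

enum RecordType { kZeroType = 0, kFullType = 1, kFirstType = 2, kMiddleType = 3, kLastType = 4 };

struct Fragment {
  RecordType type;
  size_t length;  // payload bytes carried by this fragment
};

// Plan how a payload of `size` bytes is split when writing starts
// `block_offset` bytes into the current block.
inline std::vector<Fragment> PlanFragments(size_t size, size_t block_offset) {
  std::vector<Fragment> out;
  bool begin = true;
  size_t left = size;
  do {
    if (kBlockSize - block_offset < kHeaderSize) {
      // Not enough room for a header: the tail is zero-padded and
      // writing continues at the start of a new block.
      block_offset = 0;
    }
    const size_t avail = kBlockSize - block_offset - kHeaderSize;
    const size_t n = left < avail ? left : avail;
    const bool end = (left == n);
    RecordType type = kMiddleType;
    if (begin && end) type = kFullType;
    else if (begin) type = kFirstType;
    else if (end) type = kLastType;
    out.push_back({type, n});
    block_offset += kHeaderSize + n;
    left -= n;
    begin = false;
  } while (left > 0);
  return out;
}
```

A 100-byte payload at the start of a block becomes a single kFullType record, while a 40000-byte payload becomes a kFirstType fragment of 32761 bytes followed by a kLastType fragment of 7239 bytes in the next block.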

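Once the WriteBatch data has been written to the log file, the first step of the write, logging, is complete.

Implementation of the MemTable

After the WriteBatch has been written to the log file, we can write the data into the MemTable.

We iterate over the (tag, key, value) entries in the WriteBatch and, depending on the tag, either add to or delete from the MemTable (a deletion is itself recorded as an entry tagged kTypeDeletion).

Because the MemTable is implemented as a skip list, and a skip list can only answer whether a key is present, not what value is associated with it, we need to convert the user's (key, value) into the key used internally by the skip list. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/internal-key.png"/> </div>

As the figure shows, the key and the value are separated by (sequence number, tag). The purpose is ordering: entries are sorted by key in ascending order, and entries with the same key are sorted by (sequence number, tag) in descending order. This way the newest version always comes first (the larger the (sequence number, tag), the newer the version).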
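The ordering rule can be written down as a comparator. The sketch below uses a plain struct instead of LevelDB's encoded byte string, and it assumes (sequence number, tag) is packed as (sequence << 8) | tag. InternalKey and InternalKeyLess are illustrative names for this post only.

```cpp
#include <cstdint>
#include <string>

// Simplified stand-in for an encoded internal key.
struct InternalKey {
  std::string user_key;
  uint64_t sequence;
  uint8_t tag;  // 0 = deletion, 1 = value
};

// Returns true when a must be ordered before b.
inline bool InternalKeyLess(const InternalKey& a, const InternalKey& b) {
  // User keys sort in ascending order.
  if (a.user_key != b.user_key) return a.user_key < b.user_key;
  // For equal user keys, the larger (sequence, tag) comes first,
  // so the newest version is encountered first.
  const uint64_t pa = (a.sequence << 8) | a.tag;
  const uint64_t pb = (b.sequence << 8) | b.tag;
  return pa > pb;
}
```

With this comparator, a seek to a user key lands on its most recent version, which is exactly what a read needs.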

The last step is to insert this internal key into the MemTable.

template<typename Key, class Comparator>
void SkipList<Key,Comparator>::Insert(const Key& key) {
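  // prev will hold the predecessor of key at each index level.
  Node* prev[kMaxHeight];
  // Find the first node that is greater than or equal to the target key;
  // key will be inserted just before it.
  // nullptr means every node currently in the SkipList is smaller than key.
  Node* x = FindGreaterOrEqual(key, prev);

  // x is the first node greater than or equal to the target key,
  // but leveldb does not allow inserting an entry with an equal key.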
  assert(x == nullptr || !Equal(key, x->key));

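  // Pick the number of index levels for the node being inserted.
  int height = RandomHeight();
  // Update the maximum number of index levels tracked by the SkipList.
  if (height > GetMaxHeight()) {
    // If the maximum changes, the new node becomes the tallest node, and
    // the predecessors computed above must be filled in for the new levels.
    for (int i = GetMaxHeight(); i < height; i++) {
      // On the newly created levels the predecessor of key is always the dummy head.
      prev[i] = head_;
    }
    //fprintf(stderr, "Change height from %d to %d\n", max_height_, height);

    // max_height_ can be modified here without synchronization, even with
    // concurrent reader threads. A reader that observes the new max_height_
    // will see either nullptr on the new levels of the dummy head (the
    // dummy head is created with the maximum height, 12 by default, so
    // there is no out-of-bounds access), or the new node x that the loop
    // below is about to assign.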
    max_height_.NoBarrier_Store(reinterpret_cast<void*>(height));
  }

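  // Create a new node for the data being inserted.
  x = NewNode(key, height);
  // Splice x between its predecessor and successor on every level.
  // On each level we first set x's successor without a barrier, since x is
  // not yet visible to other threads; the barrier that follows then makes
  // the earlier no-barrier writes visible to all threads (including the
  // NoBarrier_Store to arena_.memory_usage_ that NewNode may perform),
  // and finally x is linked in as the successor of its predecessor.
  for (int i = 0; i < height; i++) {
    // The loop has only these two steps and only the second one uses
    // synchronization. Even so, the write in the first step is visible
    // to other threads. Release-Acquire ordering guarantees this.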
    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
    prev[i]->SetNext(i, x);
  }
}

To follow the code above, let's first look at how LevelDB implements the skip list.

First, the definition of SkipList::Node:

template<typename Key, class Comparator>
struct SkipList<Key,Comparator>::Node {
  explicit Node(const Key& k) : key(k) { }

  Key const key;

  Node* Next(int n) {
    assert(n >= 0);
    return reinterpret_cast<Node*>(next_[n].Acquire_Load());
  }
  void SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].Release_Store(x);
  }

  Node* NoBarrier_Next(int n) {
    assert(n >= 0);
    return reinterpret_cast<Node*>(next_[n].NoBarrier_Load());
  }

  void NoBarrier_SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].NoBarrier_Store(x);
  }

 private:
  // Array of length equal to the node height. 
  // next_[0] is lowest level link.
  port::AtomicPointer next_[1];
};

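The code shows that Node has two member variables, key and next_. There is not much to say about key; next_ is the one to understand. Put simply, the length of next_ equals the height of the Node, and next_[i] is the Node's successor at level i. The figure below should make this clearer. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/node.png"/> </div>

Besides Node, we also need to know about SkipList::head_. This member is a dummy node, also of type Node, whose next_ array holds the first node of every level. The insertion process can therefore be described as follows:

Find the first node greater than or equal to the target key; key will be inserted just before it. nullptr means every node in the skip list is smaller than key. Along the way, record the node just smaller than key on each level.
Generate a random number of levels for the new node, which becomes its height.
Using the node just larger than it and the nodes just smaller than it found earlier, splice the new node into the skip list.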
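The three steps can be tried out with a stripped-down skip list. This is a single-threaded teaching sketch under my own assumptions: string keys, std::vector for the next pointers, and no atomics or arena, so it shows the splice logic only and none of LevelDB's lock-free read guarantees. TinySkipList is a hypothetical class.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

class TinySkipList {
 public:
  static constexpr int kMaxHeight = 12;

  TinySkipList() : head_(new Node("", kMaxHeight)), max_height_(1) {}
  ~TinySkipList() {
    // Level 0 links every node, so walking it frees the whole list.
    Node* n = head_;
    while (n != nullptr) {
      Node* next = n->next[0];
      delete n;
      n = next;
    }
  }
  TinySkipList(const TinySkipList&) = delete;
  TinySkipList& operator=(const TinySkipList&) = delete;

  void Insert(const std::string& key) {
    Node* prev[kMaxHeight];
    FindGreaterOrEqual(key, prev);  // step 1: predecessor on each level
    const int height = RandomHeight();  // step 2: random height
    // Levels that did not exist before have the dummy head as predecessor.
    for (int i = max_height_; i < height; i++) prev[i] = head_;
    if (height > max_height_) max_height_ = height;
    Node* x = new Node(key, height);
    for (int i = 0; i < height; i++) {  // step 3: splice in on every level
      x->next[i] = prev[i]->next[i];
      prev[i]->next[i] = x;
    }
  }

  bool Contains(const std::string& key) const {
    Node* x = FindGreaterOrEqual(key, nullptr);
    return x != nullptr && x->key == key;
  }

 private:
  struct Node {
    Node(const std::string& k, int h) : key(k), next(h, nullptr) {}
    std::string key;
    std::vector<Node*> next;  // next[i] is the successor at level i
  };

  // Returns the first node whose key is >= key (nullptr if none) and,
  // when prev is non-null, records the last smaller node on each level.
  Node* FindGreaterOrEqual(const std::string& key, Node** prev) const {
    Node* x = head_;
    int level = max_height_ - 1;
    while (true) {
      Node* next = x->next[level];
      if (next != nullptr && next->key < key) {
        x = next;  // keep moving right on this level
      } else {
        if (prev != nullptr) prev[level] = x;
        if (level == 0) return next;
        level--;  // drop down one level
      }
    }
  }

  // Grow the height by one level with probability 1/4 per step.
  static int RandomHeight() {
    int h = 1;
    while (h < kMaxHeight && (std::rand() % 4) == 0) h++;
    return h;
  }

  Node* head_;
  int max_height_;
};
```

Inserting b, a and c in that order and then asking for a returns true, while asking for d returns false, whatever heights the random generator happened to pick.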

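Once every (key, value, tag) in the WriteBatch has been inserted into the MemTable, the insertion itself is complete. What remains is to remove from the queue the Writers that were merged into the new WriteBatch and to wake up the Writer now at the front of the queue. The code should make this easy to follow.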

while (true) {
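    // The batches of [&w, last_writer] were merged and written to the log, so dequeue them.
    Writer* ready = writers_.front();
    writers_.pop_front();
    // &w never waited; it is the writer that performed the merged write,
    // so its status and done need no update, since they are not used.
    if (ready != &w) {
      // Pass the result of the merged write to each writer in the group.
      ready->status = status;
      ready->done = true;
      // Wake up the w.cv.Wait() at the entry of this method. Writers woken
      // here were merged into the front writer and written to the log together.
      // After waking up, they only need to check done and can return.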
      ready->cv.Signal();
    }
    // last_writer points at the last writer that was merged.
    if (ready == last_writer) break;
  }

  // If the writers_ queue is not empty, wake up the writer now at the front.
  if (!writers_.empty()) {
    // Wake up the next writer waiting to write.
    writers_.front()->cv.Signal();
  }

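Format and Generation of the ldb File

Once every (key, value, tag) in the WriteBatch has been inserted into the MemTable, the Put flow is finished. The format and generation of the ldb file properly belong to the topic of compaction, but while the MemTable is still fresh in mind we may as well cover the ldb file here. An ldb file is essentially a MemTable flushed to disk, so it does not feel out of place.

An ldb file consists of five parts:

data blocks
filter blocks
filter index block
index block
footer

The data blocks hold the KV data, the filter blocks hold the Bloom filters, the filter index block (called the metaindex block in the LevelDB source) holds the index pointing to the filter blocks, the index block holds the index pointing to the data blocks, and the footer holds the pointers to the filter index block and the index block.

Let's first look at the basic building block of the data block and the index block. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/block-layout.png"/> </div>

Both the data block and the index block are made up of KV pairs. In a data block the key is the skip list's internal key and the value is the value supplied by the user. In the index block the key is the last key of each data block and the value is that data block's handle.

Let's look at the structure of this Block in detail. Because the keys are sorted, we can store only the part of each key that differs from the previous key and save space. For each entry we therefore store, in order: the length of the prefix shared with the previous key, the length of the remaining key part to store, the length of the value, the stored key bytes, and the value bytes. The following example illustrates the idea.

As the figure shows, hello and hellz share hell, so for hellz we only need to store z.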
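Computing the three length fields for an entry takes only a few lines. The sketch below is a simplified illustration with names of my own (EntryHeader, ComputeHeader); it ignores restart points and the varint encoding of the lengths.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Length fields stored in front of each entry.
struct EntryHeader {
  size_t shared;      // bytes shared with the previous key
  size_t non_shared;  // key bytes that must actually be stored
  size_t value_size;  // bytes of the value
};

inline EntryHeader ComputeHeader(const std::string& last_key,
                                 const std::string& key,
                                 const std::string& value) {
  const size_t min_len = std::min(last_key.size(), key.size());
  size_t shared = 0;
  // Count how many leading bytes match the previous key.
  while (shared < min_len && last_key[shared] == key[shared]) shared++;
  return {shared, key.size() - shared, value.size()};
}
```

For the previous key hello and the new key hellz, the header is shared = 4 and non_shared = 1, so only the single byte z is written for the key.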

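Prefix compression within a block does not run on from the first entry all the way to the end. Instead, at a configurable interval a new starting point for prefix compression is set (the key of an entry that serves as a new starting point is stored in full rather than prefix-compressed). A restart is such a starting point, from which prefix compression begins anew. Before writing to the file, we also compress the KV pairs together with the restart data; the compression method is recorded as the compress type.

A single Block therefore consists of the KV pairs, the restart array, the length of the restart array, the compression type, and a CRC checksum. The data blocks and the index block are each built from such Blocks.

The filter block, in turn, holds Bloom filters generated according to the size of the data blocks; by default one Bloom filter is generated for every 2KB of data. Its storage layout is: <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/filter-block.png"/> </div>

It starts with a series of Bloom filters, followed by the array of offsets of those filters, then the offset of the offset array (used to locate the array, since its position varies), and finally the filter block's metadata (how much data block content corresponds to one Bloom filter).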
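Given that layout, finding the filter for a data block is a matter of dividing the block's file offset by the 2KB granularity. The sketch below assumes the metadata byte stores the base-2 logarithm of that granularity, 11 by default; kFilterBaseLg and FilterIndexFor are names chosen for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed default: one filter per 1 << 11 = 2048 bytes of data block offsets.
constexpr int kFilterBaseLg = 11;

// Position, within the offset array, of the filter that covers the data
// block starting at file offset `block_offset`.
inline size_t FilterIndexFor(uint64_t block_offset) {
  return static_cast<size_t>(block_offset >> kFilterBaseLg);
}
```

A data block that starts at offset 5000 uses filter number 2, because 5000 falls in the range 4096 to 6143.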

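The filter index block uses the Block format to store Key: "filter.$(filter.name)", Value: offset and size of filter block.

Finally, the format of the Footer is shown below. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/footer.png"/> </div>

The Footer stores the pointer to the filter index block, the pointer to the index block, and a magic number.

With the on-disk structure of each part covered, we can look at the ldb file as a whole and how the parts relate to each other, as shown below. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/ldb-overview.png"/> </div>

With the whole picture in mind, let's look at how the file is generated.

First a new TableBuilder is constructed, and the data in the MemTable is written to it in order.
The TableBuilder writes all the data into data blocks (in the Block format).
When a Block grows past 4KB, the finished data block is flushed to disk, a filter block is generated for it if possible, and an index handle is created and inserted into the index block. (Note that before the insertion into the index block, the key is shortened where possible; see Comparator::FindShortestSeparator in the code.)

Finally, once all data has been added, the data block, filter block, filter index block, index block, and footer are written in that order.

Overview

This article walked through LevelDB's write path and the format and generation of the related files. It does not cover every detail; if you want the full picture, I still recommend reading the code. The goal here is to sketch the overall shape of LevelDB along with some of the more important details, and I hope it helps you understand the LevelDB source.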

How Lucene Stores Its Forward Index

tang-hi — Tue, 23 May 2023 00:00:00 GMT

This article will introduce how Lucene 9.6 stores its forward index, to help readers better understand its internal workings.

A forward index, also known as a direct index, is a basic data structure in information retrieval systems. It stores the content and attributes of each document in the order of the documents, allowing the system to quickly access detailed information of any specified document. In Lucene, the storage mechanism of forward data is one of the key factors enabling it to efficiently perform full-text searches.

Since the main focus of this article is the storage format of the forward index on the disk, the preprocessing of the document and how the docID is obtained will be ignored.

What is a Forward Index

Simply put, a forward index is a structure that allows querying the corresponding document through docID. We can compare it to a key-value pair, where docID is the key and the document content is the value.

Therefore, the layout of Lucene's forward index on the disk must allow quick location of the document content through docID.

Building the Forward Index

The entry function for building the forward index is IndexingChain#processDocument (Lucene refers to the forward index as StoredFields).

 void processDocument(int docID, Iterable<? extends IndexableField> document) throws IOException {
   	
    startStoredFields(docID);
    try {
	  // skip .....
      docFieldIdx = 0;
      for (IndexableField field : document) {
        if (processField(docID, field, docFields[docFieldIdx])) {
          fields[indexedFieldCount] = docFields[docFieldIdx];
          indexedFieldCount++;
        }
        docFieldIdx++;
      }
    } finally {
      if (hasHitAbortingException == false) {
      	// skip ...
        // finish forward index
        finishStoredFields();
        
        // skip ...
      }
    }
  }

If we only focus on the processing of the forward index, we will find that Lucene does three things for the forward index:

Initialization based on docID.
Processing each field in the document.
Finalizing the forward index for this document.

If we are only interested in how the index is stored on the disk, we only need to pay attention to the last two points.

private boolean processField(int docID, IndexableField field, PerField pf) throws IOException {
    // skip....
    
    // Add stored fields
    if (fieldType.stored()) {
      StoredValue storedValue = field.storedValue();
      if (storedValue == null) {
        throw new IllegalArgumentException("Cannot store a null value");
      } else if (storedValue.getType() == StoredValue.Type.STRING
          && storedValue.getStringValue().length() > IndexWriter.MAX_STORED_STRING_LENGTH) {
        throw new IllegalArgumentException(
            "stored field \""
                + field.name()
                + "\" is too large ("
                + storedValue.getStringValue().length()
                + " characters) to store");
      }
      try {
        storedFieldsConsumer.writeField(pf.fieldInfo, storedValue);
      } catch (Throwable th) {
        onAbortingException(th);
        throw th;
      }
    }

    // skip...
  }

void writeField(FieldInfo info, StoredValue value) throws IOException {
    switch (value.getType()) {
      case INTEGER -> writer.writeField(info, value.getIntValue());
      case LONG -> writer.writeField(info, value.getLongValue());
      case FLOAT -> writer.writeField(info, value.getFloatValue());
      case DOUBLE -> writer.writeField(info, value.getDoubleValue());
      case BINARY -> writer.writeField(info, value.getBinaryValue());
      case STRING -> writer.writeField(info, value.getStringValue());
      default -> throw new AssertionError();
    }
  }

We can see that when processing the forward index, we use writeField to process each field in the document.

Let's see how Lucene handles fixed-length and variable-length fields.

  @Override
  public void writeField(FieldInfo info, double value) throws IOException {
    ++numStoredFieldsInDoc;
    final long infoAndBits = (((long) info.number) << TYPE_BITS) | NUMERIC_DOUBLE;
    bufferedDocs.writeVLong(infoAndBits);
    writeZDouble(bufferedDocs, value);
  }

  @Override
  public void writeField(FieldInfo info, BytesRef value) throws IOException {
    ++numStoredFieldsInDoc;
    final long infoAndBits = (((long) info.number) << TYPE_BITS) | BYTE_ARR;
    bufferedDocs.writeVLong(infoAndBits);
    bufferedDocs.writeVInt(value.length);
    bufferedDocs.writeBytes(value.bytes, value.offset, value.length);
  }

What the two methods have in common is that each call increments numStoredFieldsInDoc, which records how many fields are stored for the current document. Each call then appends the field's information to bufferedDocs (which can be thought of as an in-memory byte array).

The information for a field has three parts:

Field number (each field has a unique number)
Field data type
Field data, that is, the value of the field.

Because there are only a handful of field data types, Lucene packs the type code together with the field number into a single long:

final long infoAndBits = (((long) info.number) << TYPE_BITS) | NUMERIC_DOUBLE;

When the field is of fixed length, we will directly write it into bufferedDocs. But when the field is variable length, we will first write the number of bytes occupied by this value into bufferedDocs, and then write this value into bufferedDocs.
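The packing of field number and type into one value can be sketched outside Lucene. The snippet is written in C++ to match the other sketches in this feed, and it assumes the type code occupies the low 3 bits and that the code for a double is 5. Those two constants are my reading of the Lucene source and should be treated as assumptions, as should the helper names.

```cpp
#include <cstdint>

constexpr int kTypeBits = 3;            // assumed width of the type code
constexpr uint64_t kNumericDouble = 5;  // assumed type code for a double field

// Pack the field number into the high bits and the type code into the low bits.
inline uint64_t PackInfoAndBits(uint32_t field_number, uint64_t type_code) {
  return (static_cast<uint64_t>(field_number) << kTypeBits) | type_code;
}

// Recover the field number by dropping the type bits.
inline uint32_t FieldNumber(uint64_t info_and_bits) {
  return static_cast<uint32_t>(info_and_bits >> kTypeBits);
}

// Recover the type code by masking off everything but the low bits.
inline uint64_t TypeCode(uint64_t info_and_bits) {
  return info_and_bits & ((1ULL << kTypeBits) - 1);
}
```

Field number 7 with the double type packs to 61 (7 * 8 + 5), which fits in a single byte when written as a variable-length long.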

After processing all the fields in each document, we can consider that we have buffered this document in memory, and then we need to finalize the forward index, that is, flush it to the disk. The function for finalizing the forward index of the document is finishDocument.

@Override
public void finishDocument() throws IOException {
    if (numBufferedDocs == this.numStoredFields.length) {
      final int newLength = ArrayUtil.oversize(numBufferedDocs + 1, 4);
      this.numStoredFields = ArrayUtil.growExact(this.numStoredFields, newLength);
      endOffsets = ArrayUtil.growExact(endOffsets, newLength);
    }
    this.numStoredFields[numBufferedDocs] = numStoredFieldsInDoc;
    numStoredFieldsInDoc = 0;
    endOffsets[numBufferedDocs] = Math.toIntExact(bufferedDocs.size());
    ++numBufferedDocs;
    if (triggerFlush()) {
      flush(false);
    }
}

In this function, we will find that it does four things:

Record the number of fields that need to be stored in each document and save it in the array numStoredFields.
Record the write-in position of the last byte of this document and save it in the array endOffsets.
Record the number of documents currently stored in memory, saved in the variable numBufferedDocs.
Determine whether it is necessary to flush the documents in memory to the disk. If a flush is needed, it is performed.

The first three points follow directly from the code above. Next, we will focus on the fourth point.

When to Flush to Disk

private boolean triggerFlush() {
    return bufferedDocs.size() >= chunkSize
        || // chunks of at least chunkSize bytes
        numBufferedDocs >= maxDocsPerChunk;
  }

From the code, we can see that a flush to disk is triggered when either the memory used by the buffered documents or the number of buffered documents reaches its threshold.

Flushing to Disk

From here, we start to really understand how Lucene saves its forward data on the disk. Let's assume that we have cached three documents in memory.

private void flush(boolean force) throws IOException {
    // skip...
    numChunks++;
   
    // skip...

    // transform end offsets into lengths
    final int[] lengths = endOffsets;
    for (int i = numBufferedDocs - 1; i > 0; --i) {
      lengths[i] = endOffsets[i] - endOffsets[i - 1];
      assert lengths[i] >= 0;
    }
    final boolean sliced = bufferedDocs.size() >= 2L * chunkSize;
    final boolean dirtyChunk = force;
    // skip...
}

From the code, we can see that before actually writing to the disk, we still need to do some calculations in memory:

Increment the number of chunks written to the disk.
Convert the previously saved position of the last byte of each document (endOffsets) into the length of each document.
Determine whether to slice, sliced.
Determine whether it is a dirtyChunk.

The last two points can be ignored for now, just understand the first two.
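The conversion from end offsets to lengths is easy to check by hand. Here is a small sketch of the same in-place loop, written in C++ to match the other sketches in this feed; EndOffsetsToLengths is a name I made up for it.

```cpp
#include <vector>

// Convert end offsets into per-document lengths in place. Walking from the
// back means offsets[i - 1] is still an end offset when it is read.
inline void EndOffsetsToLengths(std::vector<int>& offsets) {
  for (int i = static_cast<int>(offsets.size()) - 1; i > 0; --i) {
    offsets[i] -= offsets[i - 1];
  }
}
```

With three buffered documents ending at byte 10, 25 and 40, the array becomes 10, 15, 15. The first entry is left alone because the first document starts at offset 0.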

The files we need to write to the disk are five in total:

fdt
fdm
fdx
seg-xx-doc_ids
seg-xx-file_pointers

Among them, the last two (seg-xx-doc_ids and seg-xx-file_pointers) are temporary files and will not appear in the final index. They only serve to hold data temporarily. The specific values of each variable in memory and the disk files that need to be written can be seen in the following figure.

First, we will write the number of documents saved in this chunk and the starting position of this chunk in the fdt file to the files seg-xx-doc_ids and seg-xx-file_pointers.


private void flush(boolean force) throws IOException {
    // skip...
    indexWriter.writeIndex(numBufferedDocs, fieldsStream.getFilePointer());
    //skip...
}

void writeIndex(int numDocs, long startPointer) throws IOException {
    assert startPointer >= previousFP;
    docsOut.writeVInt(numDocs);
    filePointersOut.writeVLong(startPointer - previousFP);
    previousFP = startPointer;
    totalDocs += numDocs;
    totalChunks++;
}

Notice that writeIndex stores the difference between consecutive file pointers rather than their actual values. File pointers never decrease, so each difference is no larger than the value itself, which helps compression: storing 3 takes far fewer bits than storing 100000.
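The saving from delta encoding is easy to measure with a variable-length integer writer. The sketch below, in C++ like the other sketches in this feed, uses the usual 7-bits-per-byte scheme that writeVLong follows; WriteVLong and EncodeDeltas here are my own simplified helpers, not Lucene classes.

```cpp
#include <cstdint>
#include <vector>

// Write v using 7 bits per byte, low-order group first; the high bit of a
// byte signals that more bytes follow.
inline void WriteVLong(std::vector<uint8_t>& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

// Encode a non-decreasing sequence of file pointers as deltas.
inline std::vector<uint8_t> EncodeDeltas(const std::vector<uint64_t>& pointers) {
  std::vector<uint8_t> out;
  uint64_t previous = 0;
  for (uint64_t p : pointers) {
    WriteVLong(out, p - previous);  // only the gap since the last pointer
    previous = p;
  }
  return out;
}
```

For the pointers 100000 and 100003, the deltas are 100000 and 3, which take 3 bytes and 1 byte. Writing both absolute values would take 6 bytes.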

The state after writing the files seg-xx-doc_ids and seg-xx-file_pointers is as shown in the following figure.

After writing the files seg-xx-doc_ids and seg-xx-file_pointers, we need to write the cached document content into the fdt file.


private void flush(boolean force) throws IOException {
    // skip...
    writeHeader(docBase, numBufferedDocs, numStoredFields, lengths, sliced, dirtyChunk);
    //skip...
    if (sliced) {
      // big chunk, slice it, using ByteBuffersDataInput ignore memory copy
      final int capacity = (int) bytebuffers.size();
      for (int compressed = 0; compressed < capacity; compressed += chunkSize) {
        int l = Math.min(chunkSize, capacity - compressed);
        ByteBuffersDataInput bbdi = bytebuffers.slice(compressed, l);
        compressor.compress(bbdi, fieldsStream);
      }
    } else {
      compressor.compress(bytebuffers, fieldsStream);
    }
}

private void writeHeader(
      int docBase,
      int numBufferedDocs,
      int[] numStoredFields,
      int[] lengths,
      boolean sliced,
      boolean dirtyChunk)
      throws IOException {
    final int slicedBit = sliced ? 1 : 0;
    final int dirtyBit = dirtyChunk ? 2 : 0;
    // save docBase and numBufferedDocs
    fieldsStream.writeVInt(docBase);
    fieldsStream.writeVInt((numBufferedDocs << 2) | dirtyBit | slicedBit);

    // save numStoredFields
    saveInts(numStoredFields, numBufferedDocs, fieldsStream);

    // save lengths
    saveInts(lengths, numBufferedDocs, fieldsStream);
}

We can see that we will write docBase, numBufferedDocs, dirtyBit, slicedBit, numStoredFields, lengths, and bufferedDocs into fdt.

docBase is the first DocID of this chunk.
numBufferedDocs is the total number of Docs cached in this chunk.
dirtyBit, slicedBit can be ignored for now.
The array numStoredFields is the number of fields to be stored for each Doc.
The array lengths is the length of each Doc.
The array bufferedDocs is the actual stored data of all Docs.
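The second value written by writeHeader folds three pieces of information into one integer. A sketch of the packing and its inverse, in C++ like the other sketches in this feed and with hypothetical names (ChunkToken, PackToken, UnpackToken), looks like this:

```cpp
#include <cstdint>

struct ChunkToken {
  uint32_t num_docs;  // number of documents buffered in the chunk
  bool dirty;         // chunk was flushed before reaching the thresholds
  bool sliced;        // chunk was compressed in several slices
};

// The two low bits carry the flags, the remaining bits carry the doc count.
inline uint32_t PackToken(const ChunkToken& t) {
  return (t.num_docs << 2) | (t.dirty ? 2u : 0u) | (t.sliced ? 1u : 0u);
}

inline ChunkToken UnpackToken(uint32_t token) {
  return {token >> 2, (token & 2u) != 0, (token & 1u) != 0};
}
```

A dirty, unsliced chunk with 3 documents is written as 14 (3 * 4 + 2), so a reader can recover the count and both flags from a single small variable-length integer.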

The state after writing fdt is as shown in the following figure.

The function flush has been fully introduced. This is how Lucene processes one Doc after another, first caching them in memory, and then flushing them to the disk when a certain number is cached.

Generating the Final Index File

When Lucene has processed all the documents, it will call finish to generate the final index file.

@Override
public void finish(int numDocs) throws IOException {
    if (numBufferedDocs > 0) {
      flush(true);
    } else {
      assert bufferedDocs.size() == 0;
    }
    if (docBase != numDocs) {
      throw new RuntimeException(
          "Wrote " + docBase + " docs, finish called with numDocs=" + numDocs);
    }
    indexWriter.finish(numDocs, fieldsStream.getFilePointer(), metaStream);
    metaStream.writeVLong(numChunks);
    metaStream.writeVLong(numDirtyChunks);
    metaStream.writeVLong(numDirtyDocs);
    CodecUtil.writeFooter(metaStream);
    CodecUtil.writeFooter(fieldsStream);
    assert bufferedDocs.size() == 0;
}

We can now understand what dirty means in flush. When the documents buffered in memory have not reached the flush condition but all documents have been processed, they must be flushed to disk forcibly, and the chunk is marked dirty. As for sliced, when bufferedDocs is very large it is cut into chunkSize-sized slices that are compressed one by one and written to the fdt file, which keeps compression effective.

After flushing all the cached Docs to the disk, we start generating the fdx and fdm files. We first focus on indexWriter.finish(numDocs, fieldsStream.getFilePointer(), metaStream);

void finish(int numDocs, long maxPointer, IndexOutput metaOut) throws IOException {
    if (numDocs != totalDocs) {
      throw new IllegalStateException("Expected " + numDocs + " docs, but got " + totalDocs);
    }
    CodecUtil.writeFooter(docsOut);
    CodecUtil.writeFooter(filePointersOut);
    IOUtils.close(docsOut, filePointersOut);

    // skip...
}

Lucene will first write a Footer to the files seg-xx-doc_ids and seg-xx-file_pointers to mark the completion of writing. Also, the Footer can protect the integrity of the file.

Then we will write into fdx and fdm.

void finish(int numDocs, long maxPointer, IndexOutput metaOut) throws IOException {
    //skip...

    try (IndexOutput dataOut =
        dir.createOutput(IndexFileNames.segmentFileName(name, suffix, extension), ioContext)) {
      CodecUtil.writeIndexHeader(dataOut, codecName + "Idx", VERSION_CURRENT, id, suffix);

      metaOut.writeInt(numDocs);
      metaOut.writeInt(blockShift);
      metaOut.writeInt(totalChunks + 1);
      metaOut.writeLong(dataOut.getFilePointer());

      try (ChecksumIndexInput docsIn = dir.openChecksumInput(docsOut.getName())) {
        CodecUtil.checkHeader(docsIn, codecName + "Docs", VERSION_CURRENT, VERSION_CURRENT);
        Throwable priorE = null;
        try {
          final DirectMonotonicWriter docs =
              DirectMonotonicWriter.getInstance(metaOut, dataOut, totalChunks + 1, blockShift);
          long doc = 0;
          docs.add(doc);
          for (int i = 0; i < totalChunks; ++i) {
            doc += docsIn.readVInt();
            docs.add(doc);
          }
          docs.finish();
          if (doc != totalDocs) {
            throw new CorruptIndexException("Docs don't add up", docsIn);
          }
        } catch (Throwable e) {
          priorE = e;
        } finally {
          CodecUtil.checkFooter(docsIn, priorE);
        }
      }
      dir.deleteFile(docsOut.getName());
      docsOut = null;

      metaOut.writeLong(dataOut.getFilePointer());
      try (ChecksumIndexInput filePointersIn = dir.openChecksumInput(filePointersOut.getName())) {
        CodecUtil.checkHeader(
            filePointersIn, codecName + "FilePointers", VERSION_CURRENT, VERSION_CURRENT);
        Throwable priorE = null;
        try {
          final DirectMonotonicWriter filePointers =
              DirectMonotonicWriter.getInstance(metaOut, dataOut, totalChunks + 1, blockShift);
          long fp = 0;

          for (int i = 0; i < totalChunks; ++i) {
            fp += filePointersIn.readVLong();
            filePointers.add(fp);
          }
          if (maxPointer < fp) {
            throw new CorruptIndexException("File pointers don't add up", filePointersIn);
          }
          filePointers.add(maxPointer);
          filePointers.finish();
        } catch (Throwable e) {
          priorE = e;
        } finally {
          CodecUtil.checkFooter(filePointersIn, priorE);
        }
      }
      dir.deleteFile(filePointersOut.getName());
      filePointersOut = null;

      metaOut.writeLong(dataOut.getFilePointer());
      metaOut.writeLong(maxPointer);

      CodecUtil.writeFooter(dataOut);
    }
}

We will first write numDocs, blockShift, totalChunks+1, dataOut.getFilePointer() into fdm:

numDocs is the total number of docs.
blockShift is the meta information used for decompression and compression.
totalChunks+1 is the total number of chunks plus one.
dataOut.getFilePointer() is the next position to be written in the fdx file.

Then the content of seg-xx-doc_ids is compressed and written into the fdx file, the metadata needed to decompress it is written into fdm, and the next write position in fdx is written into fdm. The content of seg-xx-file_pointers is handled the same way: it is compressed and written into fdx, its decompression metadata is written into fdm, and finally the next write position in fdx and the end position of fdt (maxPointer) are written into fdm. The final state is shown in the following figure.

<div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/fdx-finish.png"/> </div> From the diagram, we notice that the chunkSize after the Header of fdm is not reflected in the above code. This is because this variable is written when fdm is created.

After completing the above steps, we only need to write numChunks, numDirtyChunks, numDirtyDocs into fdm.

@Override
public void finish(int numDocs) throws IOException {
    //skip...
    metaStream.writeVLong(numChunks);
    metaStream.writeVLong(numDirtyChunks);
    metaStream.writeVLong(numDirtyDocs);
    CodecUtil.writeFooter(metaStream);
    CodecUtil.writeFooter(fieldsStream);
    assert bufferedDocs.size() == 0;
}

Finally, the complete index file is as shown in the following figure. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/finish-write.png"/> </div>

Overview

Finally, here is a schematic diagram of the index file and its relationships.

How Lucene Stores Its Forward Index

tang-hi — Tue, 23 May 2023 00:00:00 GMT

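This article introduces how Lucene 9.6 stores its forward index, to help readers better understand its internal workings.

A forward index, also known as a direct index, is a basic data structure in information retrieval systems. It stores the content and attributes of each document in document order, so the system can quickly fetch the details of any given document. In Lucene, the storage mechanism for forward data is one of the key factors that lets it run full-text search efficiently.

Since this article focuses on the on-disk storage format of the forward index, document preprocessing and how the docID is obtained are left out.

What is a Forward Index

Simply put, a forward index lets you look up a document by its docID. You can think of it as a key-value pair, where the docID is the key and the document content is the value.

Therefore, the on-disk layout of Lucene's forward index must make it possible to locate a document's content quickly from its docID.

Building the Forward Index

The entry function for building the forward index is IndexingChain#processDocument (Lucene refers to the forward index as StoredFields).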

 void processDocument(int docID, Iterable<? extends IndexableField> document) throws IOException {
   	
    startStoredFields(docID);
    try {
	  // skip .....
      docFieldIdx = 0;
      for (IndexableField field : document) {
        if (processField(docID, field, docFields[docFieldIdx])) {
          fields[indexedFieldCount] = docFields[docFieldIdx];
          indexedFieldCount++;
        }
        docFieldIdx++;
      }
    } finally {
      if (hasHitAbortingException == false) {
      	// skip ...
        // finish forward index
        finishStoredFields();
        
        // skip ...
      }
    }
  }

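If we focus only on how the forward index is handled, Lucene does three things for it:

Initialization based on docID.
Processing each field in the document.
Finalizing the forward index for this document.

If we only care about how the index is stored on disk, we only need to look at the last two.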

private boolean processField(int docID, IndexableField field, PerField pf) throws IOException {
    // skip....
    
    // Add stored fields
    if (fieldType.stored()) {
      StoredValue storedValue = field.storedValue();
      if (storedValue == null) {
        throw new IllegalArgumentException("Cannot store a null value");
      } else if (storedValue.getType() == StoredValue.Type.STRING
          && storedValue.getStringValue().length() > IndexWriter.MAX_STORED_STRING_LENGTH) {
        throw new IllegalArgumentException(
            "stored field \""
                + field.name()
                + "\" is too large ("
                + storedValue.getStringValue().length()
                + " characters) to store");
      }
      try {
        storedFieldsConsumer.writeField(pf.fieldInfo, storedValue);
      } catch (Throwable th) {
        onAbortingException(th);
        throw th;
      }
    }

    // skip...
  }

void writeField(FieldInfo info, StoredValue value) throws IOException {
    switch (value.getType()) {
      case INTEGER -> writer.writeField(info, value.getIntValue());
      case LONG -> writer.writeField(info, value.getLongValue());
      case FLOAT -> writer.writeField(info, value.getFloatValue());
      case DOUBLE -> writer.writeField(info, value.getDoubleValue());
      case BINARY -> writer.writeField(info, value.getBinaryValue());
      case STRING -> writer.writeField(info, value.getStringValue());
      default -> throw new AssertionError();
    }
  }

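As the code shows, the forward index is built by calling writeField on every field of the document.

Let's see how Lucene handles fixed-length and variable-length fields respectively.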

  @Override
  public void writeField(FieldInfo info, double value) throws IOException {
    ++numStoredFieldsInDoc;
    final long infoAndBits = (((long) info.number) << TYPE_BITS) | NUMERIC_DOUBLE;
    bufferedDocs.writeVLong(infoAndBits);
    writeZDouble(bufferedDocs, value);
  }

  @Override
  public void writeField(FieldInfo info, BytesRef value) throws IOException {
    ++numStoredFieldsInDoc;
    final long infoAndBits = (((long) info.number) << TYPE_BITS) | BYTE_ARR;
    bufferedDocs.writeVLong(infoAndBits);
    bufferedDocs.writeVInt(value.length);
    bufferedDocs.writeBytes(value.bytes, value.offset, value.length);
  }

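Whether the field is fixed-length or variable-length, writing a field increments numStoredFieldsInDoc by 1. This variable simply records how many fields have been stored for the current document. The field's information is then appended to bufferedDocs, which can be thought of as an in-memory byte array.

A field's information consists of three parts:

The field number (every field has a unique number).
The field's data type.
The field's data, i.e. its value.

Because there are only a handful of data types, Lucene packs the type together with the field number into a single long: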

final long infoAndBits = (((long) info.number) << TYPE_BITS) | NUMERIC_DOUBLE;

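A fixed-length value is written straight into bufferedDocs. For a variable-length value, the number of bytes it occupies is written first, followed by the value itself. The content of bufferedDocs is therefore a sequence of entries, each laid out as infoAndBits followed by the value (fixed-length), or as infoAndBits, the byte length, and then the bytes (variable-length).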
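The entry layout can be sketched in a few lines of C++. This is an illustration rather than Lucene's code: the constants `kTypeBits = 3` and `kNumericDouble = 5` and all helper names are assumptions made for the example. The variable-length integer format (7 payload bits per byte, high bit as continuation flag) mirrors what `writeVLong` produces.

```cpp
#include <cstdint>
#include <vector>

// Assumed values, for illustration only.
constexpr int kTypeBits = 3;
constexpr std::uint64_t kNumericDouble = 5;

// Pack the field number and the type code into one integer.
std::uint64_t packInfoAndBits(std::uint32_t fieldNumber, std::uint64_t typeCode) {
    return (static_cast<std::uint64_t>(fieldNumber) << kTypeBits) | typeCode;
}

// Recover the field number from a packed value.
std::uint32_t fieldNumberOf(std::uint64_t infoAndBits) {
    return static_cast<std::uint32_t>(infoAndBits >> kTypeBits);
}

// Recover the type code from a packed value.
std::uint64_t typeCodeOf(std::uint64_t infoAndBits) {
    return infoAndBits & ((std::uint64_t{1} << kTypeBits) - 1);
}

// Append v as a variable-length integer: 7 bits per byte, low bits first,
// with the high bit set on every byte except the last.
void writeVLong(std::vector<std::uint8_t>& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}
```

Small field numbers therefore cost a single byte for the combined number-and-type header.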

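Once all fields of a document have been processed, the document is buffered in memory. What remains is to finish the forward index for that document, which may flush the buffer to disk. The function that does this is finishDocument.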

@Override
public void finishDocument() throws IOException {
    if (numBufferedDocs == this.numStoredFields.length) {
      final int newLength = ArrayUtil.oversize(numBufferedDocs + 1, 4);
      this.numStoredFields = ArrayUtil.growExact(this.numStoredFields, newLength);
      endOffsets = ArrayUtil.growExact(endOffsets, newLength);
    }
    this.numStoredFields[numBufferedDocs] = numStoredFieldsInDoc;
    numStoredFieldsInDoc = 0;
    endOffsets[numBufferedDocs] = Math.toIntExact(bufferedDocs.size());
    ++numBufferedDocs;
    if (triggerFlush()) {
      flush(false);
    }
}

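This function does four things:

Record the number of stored fields of the document in the array numStoredFields.
Record the position of the last byte written for this document in the array endOffsets.
Update the number of documents currently buffered in memory, held in numBufferedDocs.
Decide whether the buffered documents need to be flushed to disk, and flush if so.

The first three should be clear from the figure and code above, so from here on we concentrate on the fourth.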

When to Flush to Disk

private boolean triggerFlush() {
    return bufferedDocs.size() >= chunkSize
        || // chunks of at least chunkSize bytes
        numBufferedDocs >= maxDocsPerChunk;
  }

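As the code shows, a flush is triggered as soon as either the number of buffered docs or the number of bytes they occupy in memory reaches its threshold.

Flushing to Disk

From here on we look at how Lucene actually stores its forward-index data on disk. Assume three documents are buffered in memory. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/flush-overview.png"/> </div>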

private void flush(boolean force) throws IOException {
    // skip...
    numChunks++;
   
    // skip...

    // transform end offsets into lengths
    final int[] lengths = endOffsets;
    for (int i = numBufferedDocs - 1; i > 0; --i) {
      lengths[i] = endOffsets[i] - endOffsets[i - 1];
      assert lengths[i] >= 0;
    }
    final boolean sliced = bufferedDocs.size() >= 2L * chunkSize;
    final boolean dirtyChunk = force;
    // skip...
}

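As the code shows, some computation still happens in memory before anything is written to disk:

Increment the number of chunks written to disk.
Convert the end position of each document's last byte (endOffsets) into the length of each document.
Decide whether the chunk needs to be sliced (sliced).
Decide whether it is a dirtyChunk.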
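The second step, turning end offsets into lengths, can be sketched as follows. The function name `endOffsetsToLengths` is hypothetical; the loop mirrors the one in `flush`, walking from the back so that each `offsets[i - 1]` is still an offset when it is read.

```cpp
#include <cstddef>
#include <vector>

// Convert cumulative end offsets into per-document lengths, in place.
// Iterating from the back keeps offsets[i - 1] intact until it is needed.
void endOffsetsToLengths(std::vector<int>& offsets) {
    for (std::size_t i = offsets.size(); i-- > 1;) {
        offsets[i] -= offsets[i - 1];
    }
}
```

With end offsets 10, 25 and 40 for three buffered documents, the array becomes 10, 15 and 15.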

The last two can be ignored for now; understanding the first two is enough. Five files are written to disk in total:

fdt
fdm
fdx
seg-xx-doc_ids
seg-xx-file_pointers

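Items 4 and 5 are temporary files. They do not appear among the final index files and only hold data temporarily. The values of the in-memory variables and the files to be written are shown in the figure below.

First, the number of documents in the chunk and the chunk's start position in the fdt file are written to seg-xx-doc_ids and seg-xx-file_pointers.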


private void flush(boolean force) throws IOException {
    // skip...
    indexWriter.writeIndex(numBufferedDocs, fieldsStream.getFilePointer());
    //skip...
}

void writeIndex(int numDocs, long startPointer) throws IOException {
    assert startPointer >= previousFP;
    docsOut.writeVInt(numDocs);
    filePointersOut.writeVLong(startPointer - previousFP);
    previousFP = startPointer;
    totalDocs += numDocs;
    totalChunks++;
}

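Note that filePointers are stored as differences rather than absolute values. Because the file pointers form a monotonically increasing sequence, storing the deltas keeps each stored value much smaller than the original, which helps compression: storing 100000 takes far more bits than storing 3. The state after writing seg-xx-doc_ids and seg-xx-file_pointers is shown in the figure below.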
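A minimal sketch of this delta encoding, with hypothetical helper names (`deltaEncode`, `deltaDecode`). It assumes the input never decreases, as is the case for chunk start pointers.

```cpp
#include <cstdint>
#include <vector>

// Store each pointer as the difference from its predecessor.
std::vector<std::uint64_t> deltaEncode(const std::vector<std::uint64_t>& pointers) {
    std::vector<std::uint64_t> deltas;
    deltas.reserve(pointers.size());
    std::uint64_t previous = 0;
    for (std::uint64_t p : pointers) {
        deltas.push_back(p - previous);  // non-negative: pointers never decrease
        previous = p;
    }
    return deltas;
}

// Rebuild the absolute pointers with a running sum.
std::vector<std::uint64_t> deltaDecode(const std::vector<std::uint64_t>& deltas) {
    std::vector<std::uint64_t> pointers;
    pointers.reserve(deltas.size());
    std::uint64_t current = 0;
    for (std::uint64_t d : deltas) {
        current += d;
        pointers.push_back(current);
    }
    return pointers;
}
```

Reading the temporary file back with a running sum, as `finish` does later, is the decoding half of this pair.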

After seg-xx-doc_ids and seg-xx-file_pointers have been written, the buffered document content is written to the fdt file.


private void flush(boolean force) throws IOException {
    // skip...
    writeHeader(docBase, numBufferedDocs, numStoredFields, lengths, sliced, dirtyChunk);
    //skip...
    if (sliced) {
      // big chunk, slice it, using ByteBuffersDataInput ignore memory copy
      final int capacity = (int) bytebuffers.size();
      for (int compressed = 0; compressed < capacity; compressed += chunkSize) {
        int l = Math.min(chunkSize, capacity - compressed);
        ByteBuffersDataInput bbdi = bytebuffers.slice(compressed, l);
        compressor.compress(bbdi, fieldsStream);
      }
    } else {
      compressor.compress(bytebuffers, fieldsStream);
    }
}

private void writeHeader(
      int docBase,
      int numBufferedDocs,
      int[] numStoredFields,
      int[] lengths,
      boolean sliced,
      boolean dirtyChunk)
      throws IOException {
    final int slicedBit = sliced ? 1 : 0;
    final int dirtyBit = dirtyChunk ? 2 : 0;
    // save docBase and numBufferedDocs
    fieldsStream.writeVInt(docBase);
    fieldsStream.writeVInt((numBufferedDocs << 2) | dirtyBit | slicedBit);

    // save numStoredFields
    saveInts(numStoredFields, numBufferedDocs, fieldsStream);

    // save lengths
    saveInts(lengths, numBufferedDocs, fieldsStream);
}

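So the following are written to fdt: docBase, numBufferedDocs, dirtyBit, slicedBit, numStoredFields, lengths and bufferedDocs.

docBase is the first DocID of the chunk.
numBufferedDocs is the number of docs buffered in the chunk.
dirtyBit and slicedBit can be ignored for now.
The array numStoredFields holds the number of stored fields of each doc.
The array lengths holds the length of each doc.
bufferedDocs holds the actual stored data of all the docs.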
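The way `writeHeader` folds numBufferedDocs and the two flags into one integer can be sketched like this. The struct `ChunkToken` and both function names are hypothetical; the bit layout follows the `(numBufferedDocs << 2) | dirtyBit | slicedBit` expression shown above.

```cpp
#include <cstdint>

// Decoded view of the packed header value (hypothetical helper type).
struct ChunkToken {
    std::uint32_t numDocs;
    bool dirty;
    bool sliced;
};

// Bit 0 carries sliced, bit 1 carries dirty, the remaining bits carry the doc count.
std::uint32_t packChunkToken(std::uint32_t numBufferedDocs, bool dirty, bool sliced) {
    const std::uint32_t slicedBit = sliced ? 1u : 0u;
    const std::uint32_t dirtyBit = dirty ? 2u : 0u;
    return (numBufferedDocs << 2) | dirtyBit | slicedBit;
}

// Reverse the packing, as a reader of the chunk header would.
ChunkToken unpackChunkToken(std::uint32_t token) {
    return ChunkToken{token >> 2, (token & 2u) != 0, (token & 1u) != 0};
}
```

For three buffered docs in a dirty, unsliced chunk, the packed value is 14.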

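The state after writing fdt is shown in the figure below.

That covers all of flush. This is how Lucene processes docs one by one: they are buffered in memory first and flushed to disk once enough have accumulated.

Generating the Final Index Files

Once Lucene has processed all documents, it calls finish to generate the final index files.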

@Override
public void finish(int numDocs) throws IOException {
    if (numBufferedDocs > 0) {
      flush(true);
    } else {
      assert bufferedDocs.size() == 0;
    }
    if (docBase != numDocs) {
      throw new RuntimeException(
          "Wrote " + docBase + " docs, finish called with numDocs=" + numDocs);
    }
    indexWriter.finish(numDocs, fieldsStream.getFilePointer(), metaStream);
    metaStream.writeVLong(numChunks);
    metaStream.writeVLong(numDirtyChunks);
    metaStream.writeVLong(numDirtyDocs);
    CodecUtil.writeFooter(metaStream);
    CodecUtil.writeFooter(fieldsStream);
    assert bufferedDocs.size() == 0;
}

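This function explains what dirty means in flush: when the buffered docs have not reached the flush threshold but there are no more documents to process, they must be force-flushed to disk, and in that case dirty is set to true. As for sliced, when bufferedDocs is very large it is split into slices so that compression stays effective; each slice is compressed and written to fdt separately.

After all buffered docs have been flushed to disk, the fdx and fdm files are generated. Let's look first at indexWriter.finish(numDocs, fieldsStream.getFilePointer(), metaStream);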

void finish(int numDocs, long maxPointer, IndexOutput metaOut) throws IOException {
    if (numDocs != totalDocs) {
      throw new IllegalStateException("Expected " + numDocs + " docs, but got " + totalDocs);
    }
    CodecUtil.writeFooter(docsOut);
    CodecUtil.writeFooter(filePointersOut);
    IOUtils.close(docsOut, filePointersOut);

    // skip...
}

Lucene first writes a footer to seg-xx-doc_ids and seg-xx-file_pointers to mark them as complete. The footer also protects the integrity of the files.

It then writes to fdx and fdm:

void finish(int numDocs, long maxPointer, IndexOutput metaOut) throws IOException {
    //skip...

    try (IndexOutput dataOut =
        dir.createOutput(IndexFileNames.segmentFileName(name, suffix, extension), ioContext)) {
      CodecUtil.writeIndexHeader(dataOut, codecName + "Idx", VERSION_CURRENT, id, suffix);

      metaOut.writeInt(numDocs);
      metaOut.writeInt(blockShift);
      metaOut.writeInt(totalChunks + 1);
      metaOut.writeLong(dataOut.getFilePointer());

      try (ChecksumIndexInput docsIn = dir.openChecksumInput(docsOut.getName())) {
        CodecUtil.checkHeader(docsIn, codecName + "Docs", VERSION_CURRENT, VERSION_CURRENT);
        Throwable priorE = null;
        try {
          final DirectMonotonicWriter docs =
              DirectMonotonicWriter.getInstance(metaOut, dataOut, totalChunks + 1, blockShift);
          long doc = 0;
          docs.add(doc);
          for (int i = 0; i < totalChunks; ++i) {
            doc += docsIn.readVInt();
            docs.add(doc);
          }
          docs.finish();
          if (doc != totalDocs) {
            throw new CorruptIndexException("Docs don't add up", docsIn);
          }
        } catch (Throwable e) {
          priorE = e;
        } finally {
          CodecUtil.checkFooter(docsIn, priorE);
        }
      }
      dir.deleteFile(docsOut.getName());
      docsOut = null;

      metaOut.writeLong(dataOut.getFilePointer());
      try (ChecksumIndexInput filePointersIn = dir.openChecksumInput(filePointersOut.getName())) {
        CodecUtil.checkHeader(
            filePointersIn, codecName + "FilePointers", VERSION_CURRENT, VERSION_CURRENT);
        Throwable priorE = null;
        try {
          final DirectMonotonicWriter filePointers =
              DirectMonotonicWriter.getInstance(metaOut, dataOut, totalChunks + 1, blockShift);
          long fp = 0;

          for (int i = 0; i < totalChunks; ++i) {
            fp += filePointersIn.readVLong();
            filePointers.add(fp);
          }
          if (maxPointer < fp) {
            throw new CorruptIndexException("File pointers don't add up", filePointersIn);
          }
          filePointers.add(maxPointer);
          filePointers.finish();
        } catch (Throwable e) {
          priorE = e;
        } finally {
          CodecUtil.checkFooter(filePointersIn, priorE);
        }
      }
      dir.deleteFile(filePointersOut.getName());
      filePointersOut = null;

      metaOut.writeLong(dataOut.getFilePointer());
      metaOut.writeLong(maxPointer);

      CodecUtil.writeFooter(dataOut);
    }
}

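First, numDocs, blockShift, totalChunks+1 and dataOut.getFilePointer() are written to fdm:

numDocs: the total number of docs.
blockShift: metadata used for compression and decompression.
totalChunks+1: the total number of chunks plus one.
dataOut.getFilePointer(): the next write position in fdx.

Next, the content of seg-xx-doc_ids is compressed and written to fdx, the metadata needed to decompress it is written to fdm, and finally the next write position in fdx is written to fdm. seg-xx-file_pointers is handled the same way: its content is compressed and written to fdx, the decompression metadata goes to fdm, and the next write positions of fdx and fdt are written to fdm. The final state is shown in the figure below. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/fdx-finish.png"/> </div> Note that the chunkSize that follows the Header of fdm in the figure does not appear in the code above; that value is written when fdm is created.

After these steps, only numChunks, numDirtyChunks and numDirtyDocs remain to be written to fdm.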

@Override
public void finish(int numDocs) throws IOException {
    //skip...
    metaStream.writeVLong(numChunks);
    metaStream.writeVLong(numDirtyChunks);
    metaStream.writeVLong(numDirtyDocs);
    CodecUtil.writeFooter(metaStream);
    CodecUtil.writeFooter(fieldsStream);
    assert bufferedDocs.size() == 0;
}

The complete index files are shown in the figure below. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/finish-write.png"/> </div>

Overview

Finally, here is an overview of the index files and how they relate to each other. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/overview.png"/> </div>

C++ Memory Model

tang-hi — Mon, 15 May 2023 00:00:00 GMT

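This post summarizes what I learned while exploring the C++ memory model and memory ordering, which I got curious about. I hope to present it in a way that makes the C++ memory model easy to understand. Rather than jumping straight to what the memory model does, we will look at what it needs to achieve, how things were done before it existed, and how it absorbed those earlier practices into a model of its own.

The Cause of Data Races

When a data race shows up, the simplest fix is to protect everything with one big lock. But why does a lock guarantee that no data race occurs? What exactly does a lock do to our code that makes the data race go away? That is the question this post explores.

Compiler Optimization

Two code examples show why compiler optimizations can cause data races.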

int Value = 0;
int IsPublished = 0;

void sendValue(int x)
{
    Value = x + 2;
    IsPublished = 1 ;
}

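Look at the first example. It does two things:

Set the value of Value.
Set IsPublished = 1.

Suppose another thread keeps reading IsPublished and reads Value once IsPublished == 1. In that case the other thread may read Value as 0! Why? Let's look at the assembly generated for this code (unless stated otherwise, the compiler is gcc 9.5 with -O2 -std=c++11).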

sendValue(int):
        mov     DWORD PTR IsPublished[rip], 1             ## first set isPublished = 1
        add     edi, 2
        mov     DWORD PTR Value[rip], edi
        ret

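The assembly runs in the opposite order to the source: IsPublished = 1 is set first, and Value is set afterwards. Another thread can therefore see IsPublished = 1 while Value has not been set yet, which is a data race. The reason is that the compiler is free to reorder code for performance, as long as the result of single-threaded execution is the same as without the reordering. This is known as the as-if rule. IsPublished and Value are two unrelated variables, and swapping the two stores does not change the single-threaded result, so the compiler is allowed to do it, even though in a multithreaded program it leads to a data race.

Now for the second example.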

int value = 90;
void foo() {

    value = 100;               // A                
    while(value == 100)
    {
        // do something
    }
    // exit loop
}

void end() {
    value = 99;				  // B
}

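Suppose there are two threads, one running foo and one running end. We expect that if A executes before B (assume both are atomic), the thread running foo will leave the loop. In practice foo may never finish. As before, let's see what the assembly says.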

foo():
        mov     DWORD PTR value[rip], 100
.L21:
        jmp     .L21                 # loop forever
end():
        mov     DWORD PTR value[rip], 99
        ret

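The compiled foo is essentially an infinite loop: after setting value to 100 it loops forever and never checks value again. The cause is again the as-if rule: the previous statement already set value to 100, so from a single-threaded point of view the loop can be turned into an infinite loop without wasting time re-checking value.

These two examples show that compiler optimizations improve single-threaded efficiency but can break multithreaded correctness, so C++ memory ordering has to solve this problem. This section does not go into how C++ memory ordering does it; it describes how the problem was handled before. The C++ memory orders are introduced properly in a later section.

How to Fix It

There are two ways to stop the compiler from optimizing:

Use a compiler barrier.
Use the volatile keyword.

Compiler barrier

Inserting asm volatile ("" ::: "memory") into the code guarantees that:

Code after the compiler barrier is not moved before the barrier by the compiler.
Code before the compiler barrier is not moved after the barrier by the compiler.

Here is the assembly for the first example after adding asm volatile ("" ::: "memory"). The instruction for IsPublished = 1; now comes after the one for Value = x + 2;, so the unwanted optimization is suppressed! (A compiler barrier only takes effect at compile time.)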

# void sendValue(int x)
# {
#     Value = x + 2;
#     asm volatile("" ::: "memory");
#     IsPublished = 1 ;
# } 

sendValue(int):
        add     edi, 2
        mov     DWORD PTR Value[rip], edi
        mov     DWORD PTR IsPublished[rip], 1
        ret

volatile

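Declaring a variable volatile tells the compiler that the variable may be modified from outside the program (this is used a lot in embedded code). The compiler then gives the following guarantees for that variable:

The variable is always read from memory, never from a register.
The compiler does not optimize away statements that involve the variable.
The compiler does not reorder accesses to volatile variables relative to each other (this covers all variables marked volatile, not just this one).

Let's see how the two examples change with volatile.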

# Example 1
# volatile int Value;
# volatile int IsPublished = 0;
 
# void sendValue(int x)
# {
#     Value = x + 2;
#     IsPublished = 1 ;
# } 
sendValue(int):
        add     edi, 2
        mov     DWORD PTR Value[rip], edi
        mov     DWORD PTR IsPublished[rip], 1
        ret
        
# Example 2
# volatile int value = 90;
# void foo() {

#     value = 100;
 
#     while(value == 100)
#     {
#         // do something
#     }
#     // exit loop
# }

# void end() {
#     value = 99;
# }

foo():
        mov     DWORD PTR value[rip], 100
.L21:
        mov     eax, DWORD PTR value[rip]
        cmp     eax, 100
        je      .L21
        ret
end():
        mov     DWORD PTR value[rip], 99
        ret

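In the first example, the compiler no longer reorders the two stores because they are declared volatile. Both variables have to be volatile because a volatile access and a non-volatile access may still be reordered with each other; only accesses that are both volatile keep their order. In the second example, the compiler makes sure every read of value goes to memory and does not optimize away the accesses to the volatile variable.

CPU Out-of-Order Execution

We reuse the first example here, this time with the compiler barrier in place.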

int Value = 0;
int IsPublished = 0;

void sendValue(int x)
{
    Value = 1 + x;
    asm volatile("" ::: "memory");
    IsPublished = 1 ;
}

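Even with the compiler barrier, at run time a reader can still see IsPublished = 1 while Value is 0. The CPU may also execute your instructions out of order, as long as the as-if rule holds, so the store IsPublished = 1 may become visible before the store to Value. (This happens on weakly ordered architectures such as ARM; x86 does not reorder two stores with each other.)

How to Fix It

To force the CPU to execute instructions in a particular order, we insert a memory barrier into the code. Only the memory barriers of the x86 architecture are covered here; there are three of them.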

mfence

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible

lfence

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. This serializing operation guarantees that every load instruction that precedes in program order the LFENCE instruction is globally visible before any load instruction that follows the LFENCE instruction is globally visible

sfence

Performs a serializing operation on all store-to-memory instructions that were issued prior the SFENCE instruction. This serializing operation guarantees that every store instruction that precedes in program order the SFENCE instruction is globally visible before any store instruction that follows the SFENCE instruction is globally visible.

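mfence: every load and store before the mfence is visible to the loads and stores after it. That may sound abstract, so look at the figure below.

Satisfying this formal definition is actually simple:

Loads and stores before the mfence may be reordered among themselves.
Loads and stores after the mfence may be reordered among themselves.
No load or store may be reordered across the mfence.

Meeting these three conditions is enough to satisfy the definition of mfence. No proof is given here; interested readers can try to prove it themselves.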

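lfence: every load before the lfence is visible to the loads after it. Again, see the figure below.

To satisfy the formal definition:

Loads and stores before the lfence may be reordered among themselves.
Loads and stores after the lfence may be reordered among themselves.
A load-load pair may not be reordered across the lfence; load-store, store-load and store-store pairs may.

sfence: every store before the sfence is visible to the stores after it. Again, see the figure below.

To satisfy the formal definition:

Loads and stores before the sfence may be reordered among themselves.
Loads and stores after the sfence may be reordered among themselves.
A store-store pair may not be reordered across the sfence; load-store, store-load and load-load pairs may.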

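The C++ Memory Model

We introduce the C++ memory orders from the easiest to the hardest.

RELAXED

This is the loosest memory order (and the fastest), i.e. the one with the fewest constraints. Put simply, it only guarantees that the operation itself is atomic and makes no guarantee about ordering: neither compiler reordering nor CPU out-of-order execution is restricted, and if either one is unrestricted there is no ordering guarantee at all. It therefore suits counters, such as the reference count of shared_ptr (though not the decrement that leads to destruction).

Typical usage: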

atomic<int> ref;
ref.store(1, memory_order_relaxed);

atomic_thread_fence(memory_order_relaxed); // this statement has no effect
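As a slightly fuller sketch of the counting use case, the function below (the name `countWithRelaxed` is made up for this example) lets several threads bump one counter with relaxed increments. Atomicity alone is enough here because nothing else is published through the counter, and `join` provides the synchronization needed before the final read.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each worker increments the shared counter; no ordering with other memory is needed.
int countWithRelaxed(int numThreads, int incrementsPerThread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&counter, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();  // join synchronizes with the end of each worker
    }
    return counter.load(std::memory_order_relaxed);
}
```

No increment is lost, so four threads doing 10000 increments each yield 40000.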

SEQUENTIALLY-CONSISTENT

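This is the strictest memory order (and the slowest).

Atomic operation

Used on an atomic operation, this order guarantees atomicity and additionally guarantees that the atomic operations of all threads can be laid out in one execution order (for example thread A first, then thread B, then thread A again). This order is global: every thread observes the same one. Put simply:

The code executes in the order you wrote it.
The order of memory reads and writes is the same for all threads.

This is also the most intuitive memory order.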

std::atomic<bool> x,y;
std::atomic<int> z;
void write_x()
{
   x.store(true,std::memory_order_seq_cst);     // A
}
void write_y()
{
    y.store(true,std::memory_order_seq_cst);   // B
}
void read_x_then_y()
{
    while(!x.load(std::memory_order_seq_cst));   // C
    if(y.load(std::memory_order_seq_cst))		// E
        ++z;
}
void read_y_then_x()
{
    while(!y.load(std::memory_order_seq_cst));  // D
    if(x.load(std::memory_order_seq_cst))       // F
        ++z;
}
int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join();
    b.join();
    c.join();
    d.join();
    assert(z.load()!=0); // always true!
}

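From this code we know:

When C finishes, A has already happened.
When D finishes, B has already happened.
When E happens, C and A have already happened.
When F happens, D and B have already happened.

If E reads false, B must come after E in the global order, so A comes before B. If F reads false, A must come after F, so B comes before A. Both cannot hold in a single global order, so E and F cannot both read false, and the assert always holds.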

Fence

When std::memory_order_seq_cst is used with std::atomic_thread_fence, it is equivalent (on x86) to inserting an mfence into the code; see the description of mfence above.

ACQUIRE_RELEASE

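The strictness of this memory order is in the middle (and so is its performance).

Atomic operation

With this order there is no longer a single global execution order (thread A may see thread C's operations in the order ABCD while thread B sees them as BDAC). Threads can only synchronize through acquire and release.

So what do acquire and release mean?

As the figure shows, acquire does not allow later instructions to be reordered before it, and release does not allow earlier instructions to be reordered after it. As a pair they form a critical section: acquire can be seen as lock and release as unlock.

How do the two synchronize threads? Simple: if an acquire load reads the value stored by a release store, the two threads are synchronized, and every memory operation before the release is visible after the acquire.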
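The Value / IsPublished example from the beginning of this post can be rewritten with a release store and an acquire load. This is a minimal sketch; the function name `publishAndRead` and the local variable names are chosen for illustration.

```cpp
#include <atomic>
#include <thread>

// The producer writes a plain int and publishes it with a release store;
// the consumer spins on an acquire load and then reads the plain int.
int publishAndRead(int x) {
    int value = 0;  // ordinary, non-atomic payload
    std::atomic<bool> isPublished{false};
    int observed = -1;

    std::thread consumer([&] {
        while (!isPublished.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = value;  // the write to value happens-before this read
    });
    std::thread producer([&] {
        value = x + 2;
        isPublished.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Once the acquire load sees true, the write to value made before the release store is guaranteed to be visible, so the consumer can never read the stale 0.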

Now look at the following example.

std::atomic<bool> x,y;
std::atomic<int> z;
void write_x()
{
   x.store(true,std::memory_order_release);     // A
}
void write_y()
{
    y.store(true,std::memory_order_release);   // B
}
void read_x_then_y()
{
    while(!x.load(std::memory_order_acquire));   // C
    if(y.load(std::memory_order_acquire))		// E
        ++z;
}
void read_y_then_x()
{
    while(!y.load(std::memory_order_acquire));  // D
    if(x.load(std::memory_order_acquire))       // F
        ++z;
}
int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join();
    b.join();
    c.join();
    d.join();
    assert(z.load()!=0); // maybe failed!
}

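With SEQUENTIALLY-CONSISTENT the assert always holds, but with acquire-release it may fail.

A synchronizes with C and B synchronizes with D, but E is not synchronized with B, nor F with A, so both may still read false and the assert fails. (Intuitively, the two reader threads are allowed to observe the two independent stores in different orders.)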

Fence

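When std::memory_order_acquire is used with std::atomic_thread_fence, it forbids LOADs and STOREs after the fence from being reordered before it, and forbids LOADs before the fence from being reordered after it.

When std::memory_order_release is used with std::atomic_thread_fence, it forbids LOADs and STOREs before the fence from being reordered after it, and forbids STOREs after the fence from being reordered before it.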
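A minimal sketch of the two fences working as a pair, assuming a made-up function name `publishWithFences`. The flag itself is accessed with relaxed operations, and the ordering comes entirely from the fences.

```cpp
#include <atomic>
#include <thread>

// The release fence orders the write to value before the relaxed store to ready;
// the acquire fence orders the relaxed load of ready before the read of value.
int publishWithFences(int x) {
    int value = 0;
    std::atomic<bool> ready{false};
    int observed = -1;

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        observed = value;
    });
    std::thread producer([&] {
        value = x + 2;
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    producer.join();
    consumer.join();
    return observed;
}
```

The fences establish the same happens-before relationship that a release store and an acquire load on the flag would.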

Overview

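My understanding of things like the CPU's MESI protocol is still limited, so this post stays fairly shallow. I will continue it once I have fully understood CPU cache coherence protocols.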

Effective cpp

tang-hi — Mon, 15 May 2023 00:00:00 GMT

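1. View C++ as a federation of languages

C++ can be seen as made up of C, Object-Oriented C++, Template C++ and the STL. Look at them separately, and when your code is in a particular sub-language, follow the conventions of that sub-language.

2. Prefer const, enum and inline to #define

A name defined with #define may be replaced by the preprocessor before the compiler ever sees it, so it never enters the symbol table. That makes debugging hard, and it can also bloat the object code because the value may be duplicated many times.

For constants, use: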

const double PI = 3.14;
const char* const NAME = "Tang donghai";
const std::string NAME("Tang donghai");

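Class-specific constants

class Const {
    static const int FOUR = 4; // integral type
    static constexpr const char* NAME = "NAME"; // non-integral type, or define it in the implementation file
};

// If you need to take its address, add this to the implementation file:
// const int Const::FOUR;

Some simple functions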

#define CALL_WITH_MAX(a, b) f ((a) > (b) ? (a) : (b))

template <typename T>
inline void callWithMax(const T& a, const T& b) {
	f(a > b ? a : b);
}

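3. Use const whenever possible

If a variable, parameter or function should not change anything, make it const.

const to the left of the asterisk means the pointed-to data is constant; const to the right means the pointer itself is constant.

const int* a; // *a cannot change
int* const a; // a cannot change

const std::vector<int>::iterator iter;  =====> T* const // watch out especially when typedefs are involved
std::vector<int>::const_iterator citer; =====> const T*

If a function returns by value, it is best to return a const value.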

Rational operator+ (Rational& a, Rational& b); // bad
(a + b) = c; // ok

const Rational operator+ (Rational& a, Rational& b); // good
(a + b) = c; // wrong!

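If a member function does not modify the object, declare it const; functions can be overloaded on const.
To get logical constness, declare a data member mutable so that it can still be modified inside a const function.
When you need both a const and a non-const version of a function, implement the non-const one in terms of the const one: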

class TextBook {
  public:
    const char& operator[](std::size_t position) const {
        //...
        //...
        //...
        return text[position];
    }
    
    char& operator[](std::size_t position) {
        return const_cast<char&>(static_cast<const TextBook&>(*this)[position]);
    }
};
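The mutable point from the list above can be illustrated with a cached value. This is a sketch with made-up names (`TextBlock`, `cachedLength_`, `lengthIsValid_`): the const member function fills in the cache on first use without changing the logical state of the object.

```cpp
#include <cstddef>
#include <string>
#include <utility>

class TextBlock {
public:
    explicit TextBlock(std::string text) : text_(std::move(text)) {}

    // Logically const: the observable text never changes,
    // only the cached bookkeeping does.
    std::size_t length() const {
        if (!lengthIsValid_) {
            cachedLength_ = text_.size();
            lengthIsValid_ = true;
        }
        return cachedLength_;
    }

private:
    std::string text_;
    mutable std::size_t cachedLength_ = 0;  // may be modified in const functions
    mutable bool lengthIsValid_ = false;
};
```

Without mutable, the two assignments inside length() would not compile.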

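4. Make sure objects are initialized before they are used

Initialize built-in types by hand.
Data members are initialized in the order they are declared; they can be initialized at the point of declaration.
The initialization order of non-local statics in different translation units is unspecified. Turn them into local statics inside functions and rely on the function call to guarantee initialization.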

static Global global;
||
||
||
\/
Global& getGlobal() {
    static Global global;
    return global;
}

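5. Know what functions C++ silently writes and calls

Default constructor
1. The user did not declare any constructor.
2. All data members and base classes have default constructors.
Copy constructor
1. The user did not provide one.
2. The base classes and members are copyable.
3. The base classes and members have destructors.
4. The user did not declare a move constructor or move assignment operator.
Copy assignment operator
1. The user did not provide one.
2. All members are copy-assignable, i.e. there are no reference members and no const members of non-class type.
3. The user did not declare a move constructor or move assignment operator.
Move constructor
1. The user did not provide one.
2. The user did not declare a copy constructor, move assignment operator, copy assignment operator or destructor.
3. The non-static members and base classes are movable, and the base classes have destructors.
Move assignment operator
1. The user did not provide one.
2. The user did not declare a copy constructor, move constructor, copy assignment operator or destructor.
3. The non-static members and base classes are movable, and the base classes have destructors.
4. There are no non-static members of reference or const type.
Destructor
1. The user did not provide one.
2. All non-static members are destructible.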

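6. Explicitly disallow the compiler-generated functions you do not want

Use = delete; to explicitly reject the functions the compiler would otherwise generate.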
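A minimal sketch of a class that rejects copying; the class name `Uncopyable` is used only for illustration. The static_asserts show that the compiler indeed treats the type as non-copyable.

```cpp
#include <type_traits>

class Uncopyable {
public:
    Uncopyable() = default;
    Uncopyable(const Uncopyable&) = delete;             // no copy construction
    Uncopyable& operator=(const Uncopyable&) = delete;  // no copy assignment
};

static_assert(!std::is_copy_constructible<Uncopyable>::value,
              "copy construction is rejected at compile time");
static_assert(!std::is_copy_assignable<Uncopyable>::value,
              "copy assignment is rejected at compile time");
```

Any attempt to copy an Uncopyable object now fails at compile time with a message pointing at the deleted function.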

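7. Declare destructors virtual in polymorphic base classes

If a class has virtual functions, declare its destructor virtual. Otherwise, deleting a derived object through a base-class pointer is undefined behavior; in practice the derived destructor may not run, which can leak resources.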
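A small sketch with made-up types (`Shape`, `Triangle`) shows the well-behaved case: because the base destructor is virtual, destroying the object through a base-class pointer runs the derived destructor too.

```cpp
#include <memory>

struct Shape {
    virtual ~Shape() = default;  // virtual: safe to delete through Shape*
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    explicit Triangle(int& destroyed) : destroyed_(destroyed) {}
    ~Triangle() override { ++destroyed_; }  // records that it ran
    int sides() const override { return 3; }

private:
    int& destroyed_;
};

// Returns how many Triangle destructors ran when the object
// was destroyed through a pointer to its base class.
int destroyThroughBase() {
    int destroyed = 0;
    {
        std::unique_ptr<Shape> shape(new Triangle(destroyed));
    }
    return destroyed;
}
```

Removing the virtual keyword from ~Shape would make the deletion inside unique_ptr undefined behavior.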

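8. Prevent exceptions from leaving destructors

If a destructor can throw, it may well throw again while another exception is already propagating (during stack unwinding), and the program then terminates immediately.

If an operation may throw, wrap it in an ordinary member function that users can call, and let the destructor call it as a fallback.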

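DBConn::~DBConn() {
    if (!closed) {
        try {
            db.close();
        } catch(...) {
            // swallow the exception (ideally log it)
        }
    }
}

class DBConn {
public:
    void close() {
        db.close();
        closed = true;
    }
};

Users are given the chance to call close themselves. If they do not and rely on the destructor instead, a destructor that swallows the exception should be the expected behavior.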

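9. Never call virtual functions during construction or destruction

When a constructor runs, the base-class constructor executes first. During that time the derived part has not been constructed yet, so a virtual call resolves to the base-class version. The same applies to destructors.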

class Base {
public:
	Base() {
		hello();    // calls Base::hello(), never Derived::hello()
	}
	virtual void hello();
};

class Derived : public Base {
public:
	Derived() {
		
	}
	void hello() override {
		///....
	}
};

10. Have operator= return a reference to *this

This is the convention in the C++ world.

Widget& operator=(const Widget& rhs) {
	//....
	return *this;
}

11. Handle assignment to self in operator=

You have to consider whether both sides are the same object. Consider the following code:

Widget&
Widget::operator=(const Widget& rhs) {
	delete pb;              // bad!!!! if this == &rhs, rhs.pb is deleted too
	pb = new XXX(*rhs.pb);
	return *this;
}

Check whether they are the same object, for example like this:

Widget&
Widget::operator=(const Widget& rhs) {
	if (this == &rhs) return *this;
	delete pb;           // ok
	pb = new XXX(*rhs.pb);
	return *this;
}

Or use copy-and-swap:

Widget&
Widget::operator=(const Widget& rhs) {
	Widget temp(rhs);
	swap(temp);
	return *this;
}
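For reference, here is a self-contained sketch of copy-and-swap. The Widget below is a toy that owns a single int through a raw pointer (the member name `pb_` and the accessor `value()` are assumptions for this example).

```cpp
#include <utility>

class Widget {
public:
    explicit Widget(int v) : pb_(new int(v)) {}
    Widget(const Widget& rhs) : pb_(new int(*rhs.pb_)) {}
    ~Widget() { delete pb_; }

    // Swapping raw pointers cannot throw.
    void swap(Widget& other) noexcept { std::swap(pb_, other.pb_); }

    Widget& operator=(const Widget& rhs) {
        Widget temp(rhs);  // copy first; if this throws, *this is untouched
        swap(temp);        // take the copy, hand the old data to temp
        return *this;      // temp's destructor frees the old data
    }

    int value() const { return *pb_; }

private:
    int* pb_;
};
```

Self-assignment needs no special case: the copy is made before anything is released, so the old data is still alive while it is read.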

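12. Copy all parts of an object

Not much to say: when copying, do not forget anything, and a derived class must not forget its base class!

class Derived : public Base {
public:
	Derived(const Derived& derived) : Base(derived), xxx(derived.xxx) {}
	Derived& operator=(const Derived& derived) {
		//..........
		Base::operator=(derived);
		//..........
		return *this;
	}
};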

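13. Use objects to manage resources

Manage resources with RAII, and keep Item 8 in mind: do not let exceptions escape the destructor of a resource-managing class.

14. Think carefully about copying behavior in resource-managing classes

Copying an RAII object means dealing with the resource it manages, so the copying semantics of the resource determine the copying behavior of the RAII object.

In general, an RAII object takes one of these approaches:

Prohibit copying, e.g. a mutex lock.
Reference-count the resource and release it when the count drops to 0, e.g. shared_ptr.
Transfer ownership, e.g. unique_ptr.

15. Provide access to raw resources in resource-managing classes

In general there are two ways.

Provide an explicit get interface: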

class A {
    data_ptr* get() const;
};

Provide an implicit conversion:
```
class A {
    operator B() const;
};
```

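An implicit conversion is more natural than an explicit one but increases the chance of misuse. I prefer the explicit interface.

16. Use the same form in corresponding uses of new and delete

An object created with new must be destroyed with delete; an array created with new [] must be destroyed with delete [].

17. Store newed objects in smart pointers in standalone statements

Consider the following call: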

process(std::shared_ptr<Widget>(new Widget), processor());

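For a statement like this, the compiler (before C++17) may choose any evaluation order, as long as new Widget runs before the shared_ptr constructor.

So the following order is possible:

new Widget
processor()
shared_ptr's ctor

If step 2 throws, we have a memory leak.

So for exception safety, put the newed object into the smart pointer in a statement of its own.

That is:

auto p = std::shared_ptr<Widget>(new Widget);
process(p, processor());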
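An alternative that avoids the raw new altogether is std::make_shared, which allocates and wraps the object in one step. The sketch below uses stand-in definitions of `Widget`, `processor` and `process`, plus a hypothetical wrapper `safeCall`, since the originals are not shown.

```cpp
#include <memory>
#include <stdexcept>

struct Widget {
    int id = 0;
};

// Stand-in for a computation that may throw.
int processor(bool fail) {
    if (fail) {
        throw std::runtime_error("processor failed");
    }
    return 1;
}

// Stand-in consumer: reports how many owners the widget has, plus the priority.
int process(std::shared_ptr<Widget> w, int priority) {
    return static_cast<int>(w.use_count()) + priority;
}

int safeCall(bool fail) {
    auto widget = std::make_shared<Widget>();  // owned before anything can throw
    try {
        return process(widget, processor(fail));
    } catch (const std::runtime_error&) {
        return -1;  // widget is still released when it goes out of scope
    }
}
```

Whether or not processor throws, the Widget is owned by a shared_ptr from the moment it exists, so nothing can leak.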

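18. Make interfaces easy to use correctly and hard to use incorrectly

Making an interface hard to misuse takes many restrictions (ideally enforced by the compiler).
Keep interfaces consistent with the built-in types.
Apply Item 13: use objects to manage resources.

19. Treat class design as type design

Designing a class means introducing a new type into the system: how should objects be created and destroyed, how does initialization differ from assignment, and so on.

20. Prefer pass-by-reference-to-const to pass-by-value

The point of this item is to reduce copying, but with RVO it is not always necessary, and for built-in types pass-by-value may well be faster.

21. Don't try to return a reference when you must return an object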

const A operator*(const A& lhs, const A& rhs) { // fine, returns a copy
    A a;
    // ...
    return a;
}

//------------------------------------------
const A& operator*(const A& lhs, const A& rhs) {
    A a;
    // ...
    return a;           // error! dangling reference!
}
//------------------------------------------
const A& operator*(const A& lhs, const A& rhs) {
    static A a;
    // a = ...
    return a; /// error!!!!
}

if ((a1 * a2) == (a3 * a4)) // always true: both sides refer to the same static object

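22. Declare data members private

Declaring members private preserves encapsulation and the freedom to change them later.

Encapsulation can be measured by the amount of code affected when you remove the member.

Judged this way, public (all code that uses it) and protected (all code that inherits from it) are equally unencapsulated.

So declare data members private whenever possible.

23. Prefer non-member non-friend functions to member functions

As in Item 22, every member or friend function is one more function that can access the private members, which means the encapsulation of the class goes down (more code can touch the private members).

So where possible, use non-member non-friend functions instead of member functions, and put the non-member functions of a class into separate headers by category to reduce compilation dependencies.

When turning a member function into a non-member, do not reach for friend first, since a friend has the same encapsulation as a member. Aim for a non-member non-friend function.

24. Declare non-member functions when type conversions should apply to all parameters

Consider multiplication: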

const Rational operator*(const Rational& lhs, const Rational& rhs); // 1


const Rational operator*(const Rational& rhs); // 2

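1 is better than 2, because both arguments are eligible for implicit conversion.

25. Consider support for a non-throwing swap

First, swap should not throw: exception-safe code relies heavily on swap, so do not write a swap that can throw.

How do you write an efficient custom swap?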

template<typename T>
class Efficient {
public:
    void swap(Efficient& a) noexcept {
        // efficient
    }
    
    
};

template <typename T>
void swap(Efficient<T>& lhs, Efficient<T>& rhs) {
    lhs.swap(rhs);
}

namespace std {
    template<>
    void swap<Widget>(Widget& lhs, Widget& rhs) {
        
    }
}

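To write an efficient custom swap:

Define a public member function that implements the actual logic.
Define a non-member function template that calls the member function.
If what you define is a class rather than a class template, you can fully specialize std::swap.

26. Postpone variable definitions as long as possible

Define a variable only when you actually need it, especially when the class has a costly constructor, to avoid paying for pointless construction.

27. Minimize casting

Cast as little as possible. Casts are not free and may well generate extra machine code.

If you must cast, prefer the new-style casts: static_cast, dynamic_cast, ...

28. Avoid returning handles to object internals

Avoid leaking private internals through references, pointers and the like. Sometimes it is unavoidable; if users should not be able to modify the data, make the return type const, and make sure the handle stays valid for its whole lifetime.

29. Strive for exception-safe code

Always make sure that even when an exception is thrown, every member and object is left in a valid state (the basic guarantee).

The strong guarantee: the state is either as it was before the call, or the call succeeded completely.

Use smart pointers to manage memory obtained with new, and use copy-and-swap to provide these guarantees.

30. Understand the ins and outs of inlining

Apply inline only to small, frequently called functions.

31. Minimize compilation dependencies between files

I have no real experience with this one yet.

32. Make sure public inheritance models is-a

Everything that applies to the base class must also apply to the derived class, because every derived-class object is also a base-class object. This one will probably sink in with more experience.

33. Avoid hiding inherited names

Suppose you have a base class: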

class Base {
 public:
    void mf1();
    void mf1(int x);
    void mf1(int x, int y);
};

You want to write a derived class that redefines some of these functions:

class Derived : public Base{
  public:
    void mf1();
};

But this hides the other mf1 overloads of Base. If you still want to use them, add a using declaration:

class Derived : public Base {
  public:
    using Base::mf1;         // use this!!
    void mf1();
};

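If you only want to inherit some of the base-class functions (for example with private inheritance), use a forwarding function:

class Derived : private Base {
    void mf1() {           // hides the inherited name
        Base::mf1();      // forwards to Base internally
    }
};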

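34. Differentiate between inheritance of interface and inheritance of implementation

Inheritance covers both the interface of a member function and its implementation.

Declaring a function pure virtual means derived classes inherit only the interface, not an implementation.
Declaring a function virtual means derived classes inherit the interface together with a default implementation.
Declaring a function non-virtual means derived classes inherit the interface together with a mandatory implementation.

Sometimes, though, we worry that later developers will forget to override the default virtual implementation.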

class Airplane {
public:
    virtual void fly() = 0;
protected:
    void defaultFly();
};

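The default implementation lives in defaultFly and fly is pure virtual, so later developers cannot forget to implement fly.

35. Consider alternatives to virtual functions

Some alternatives to virtual functions:

Use function pointers and let the caller choose the behavior.
Use NVI: a public non-virtual function calls a private virtual function.

36. Never redefine an inherited non-virtual function

Do not redefine an inherited non-virtual function. First, it violates OOP principles; second, callers may use it incorrectly, for example: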

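class Base {
public:
    void mf1();
};


class Derived : public Base {
public:
    void mf1();   // redefines the inherited non-virtual function
};

Derived x;
Base* b = &x;
Derived* d = &x;
b->mf1(); // calls Base::mf1
d->mf1(); // calls Derived::mf1, although both point to the same object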

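37. Never redefine a function's inherited default parameter value

Default parameter values are statically bound, unlike virtual functions, which are dynamically bound. Redefining an inherited default parameter value therefore leads to surprising behavior.

class Base {
public:
    virtual void hello(int a = 1);
};

class Derived : public Base {
public:
    virtual void hello(int a = 2);  // ooooops! 
};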
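The surprise is easiest to see in a runnable variant. In this sketch hello returns a value so the effect can be observed, and the helper `callThroughBase` is made up for the example.

```cpp
struct Base {
    virtual ~Base() = default;
    virtual int hello(int a = 1) const { return a; }
};

struct Derived : Base {
    // The body is chosen at run time, but the default argument is not.
    int hello(int a = 2) const override { return a * 10; }
};

// The static type here is Base, so the default argument 1 is used,
// while the dynamic type decides which body runs.
int callThroughBase(const Base& b) {
    return b.hello();
}
```

Calling through a Base reference runs Derived::hello with Base's default of 1, giving 10, while calling hello() directly on a Derived object uses 2 and gives 20.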

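38. Model has-a or is-implemented-in-terms-of through composition

Inheritance is an is-a relationship, while composition is has-a. If you do not need to inherit an interface, use composition: make the object a data member and do the work by calling it.

39. Use private inheritance judiciously

With private inheritance, a derived object is not implicitly converted to the base class, and all inherited members and functions become private.

This means you do not want to inherit the interface, only part of the implementation, which is much like composition.

The reason to choose private inheritance over composition is that virtual functions or protected members are involved.

When there is no better option, private inheritance is a good tool.

40. Use multiple inheritance judiciously

Multiple inheritance is complicated and makes name clashes more likely. With diamond inheritance you may also need virtual inheritance to avoid duplicated base-class members.

The most reasonable use is to inherit the interface publicly and the implementation privately.

41. Understand implicit interfaces and compile-time polymorphism

Both classes and templates support interfaces and polymorphism. For classes the interface is explicit and polymorphism happens at run time through virtual functions.

For templates the interface is implicit and polymorphism is resolved at compile time.

42. Understand the two meanings of typename

It declares a template type parameter.
It identifies nested dependent type names.

43. Know how to access names in templatized base classes

When inheriting from a class template, calling a member function inherited from the base can cause trouble.

Consider the following code: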

template<typename T>
class Base {
public:
   void hello();  
};

template<typename T>
class Derived : public Base<T> {
public:
    void hello2() {
        hello();   // error! couldn't find it!
    }
};

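This happens because the compiler cannot tell whether Base will be fully specialized, and a specialization might not provide the member function at all. It therefore assumes nothing about the class template you inherit from. For example: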

template<>
class Base<int> {
    public:
    void yes();
};

Now there is no hello function. There are three ways to deal with this:

template<typename T>
class Derived : public Base<T> {
    public:
    void hello2() {
        this->hello();     // assume hello can be called
    }
};

template<typename T>
class Derived : public Base<T> {
    public:
    using Base<T>::hello; // tell the compiler to look for the name in Base, bringing it into scope
    void hello2() {
        hello();     
    }
};

template<typename T>
class Derived : public Base<T> {
    public:
    
    void hello2() {
        Base<T>::hello();      // name the function explicitly; this turns off virtual dispatch because the call is qualified
    }
};

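44. Factor parameter-independent code out of templates

Code in a template that does not depend on the template parameters should be factored out. Consider the following class: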

template<typename T, size_t n>
class Base {
    
};

For code like this, every distinct n generates a separate copy of the template code, so we separate n from T:

template<typename T, size_t n>
class BaseV2 : public BaseV1<T> {
    
};

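45. Use member function templates to accept all compatible types

Consider shared_ptr: we want to be able to construct a shared_ptr<Up> from a shared_ptr<Bottom>. But if we write it like this: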

template<typename T>
class shared_ptr {
  shared_ptr(const shared_ptr<T>& other);  
};

then a shared_ptr<Up> can only be constructed from a shared_ptr<Up>.

So we use a generalized constructor:

template<typename T>
class shared_ptr {
    template<typename U>
    shared_ptr(const shared_ptr<U>& other);
};

This gives us many constructors, more than we asked for: even a shared_ptr<double> could be used to construct a shared_ptr<Up>. To restrict this:

template<typename T>
class shared_ptr {
    template<typename U>
    shared_ptr(const shared_ptr<U>& other) : 
    data(other.get()) // add some restriction
    {}
    
    T* get() const;
    T* data;
};

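With this restriction, the construction only compiles when U* is implicitly convertible to T*.

Note that explicit is not used here: implicit conversion between pointers is allowed, so it is allowed for shared_ptr as well.

Also note that a generalized member function template does not affect the normal generation rules: treat it as an ordinary member function rather than as a special constructor, so the compiler still generates the copy constructor.

46. Define non-member functions inside templates when type conversions are desired

Consider the following code: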

template<typename T>
class NumberType {
	NumberType(T val); 
};	

template<typename T>
const NumberType<T> operator*(const NumberType<T>& lhs, const NumberType<T>& rhs) {
    //....
}

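If we call

NumberType<int> a(2);
a * 3;

the call does not compile, because the second argument cannot be implicitly converted: C++ performs template argument deduction first and does not consider implicit conversions during deduction. You therefore need to declare the function as a friend and define it inside the class: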

template<typename T>
class NumberType {
	NumberType(T val);
    friend
    const NumberType<T> operator*(const NumberType<T>& lhs, const NumberType<T>& rhs) {
    //....
	}     
};

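This way the friend function is instantiated when NumberType<int> is declared, and implicit conversions apply when you call it.

47. Use traits classes for information about types

That is, provide certain typedefs inside the class following the conventions of std.

48. Be aware of template metaprogramming

nothing to say

49. Understand the behavior of the new-handler

The new-handler lets you specify a function to be called when memory cannot be allocated.

50. Understand when it makes sense to replace new and delete

You can write custom versions of new and delete for reasons such as logging, detecting bugs or measuring performance.

51. Adhere to convention when writing new and delete

For example, a request for 0 bytes must be treated as a request for 1 byte, and the new-handler must be called when memory cannot be allocated.

52. Write placement delete if you write placement new

Placement new calls a constructor at a specified location. The new operator does two things:

Call operator new to obtain memory.
Call the constructor at the specified location.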

53. 不要轻忽编译器的警告

54. 让自己熟悉标准库

55. 让自己熟悉Boost

PQIVF (Product Quantization)

tang-hi — Wed, 05 Apr 2023 00:00:00 GMT

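Product Quantization (PQ) is a vector quantization method first proposed by Hervé Jégou, Matthijs Douze, and Cordelia Schmid in the 2011 paper "Product Quantization for Nearest Neighbor Search". The paper addresses nearest neighbor search on large-scale datasets. Whereas other ANN algorithms spend extra memory to shrink the search space, PQ compresses vectors by quantizing them, which greatly reduces the memory required.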

What is Product Quantization?

To understand PQ, you first need to understand quantization. The term comes from signal processing; Wikipedia defines it as follows:

Quantization is the process of constraining an input from a continuous or otherwise large set of values to a discrete set (such as the integers).

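Quantization maps one set onto another, discrete set. PQ maps every vector in the dataset to an element of another set, for example the integers.

This lets a single integer represent a whole vector. If the original vectors have 128 dimensions, the memory per vector drops from 128 * 4 bytes (float) to 4 bytes (int), a 128-fold reduction. However, quantization requires the mapping to be surjective: every element of the target set must correspond to at least one element of the original set. If we mapped vectors onto the entire 64-bit integer range, the target set would have 2^64 elements; even at one byte per element that is about 18.4 exabytes, which is clearly unacceptable. If we mapped vectors onto UINT8 instead, the target set would have only 256 elements and the memory problem would seem to be solved. But with 1 million vectors mapped onto 0-255, each integer corresponds to about 3906 vectors on average, and at search time we cannot tell these 3906 vectors apart by their distance to the query, which leads to poor recall. Choosing a target set of the right size, so that memory usage and recall hit a sweet spot, is what PQ sets out to do.

Note that Product Quantization does not reduce space by reducing dimensionality.

The number of vectors stays the same after quantization. Each compressed vector, however, is now a small integer that you can think of as a mere symbol. Turning a high-dimensional vector into a symbol is what reduces the memory footprint.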

How Does Product Quantization Work?

1. Split And Train

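Suppose the vectors have 128 dimensions. We first split each vector evenly into 8 sub-vectors of 16 dimensions each.

With 1000 vectors, each split into 8 parts V0, V1, ..., V7, we get 1000 V0 sub-vectors, 1000 V1 sub-vectors, ..., and 1000 V7 sub-vectors. We run K-means on each group separately; with K = 256 this yields 256 centroids for every Vi.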

2. Encode

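Once we have the centroids, we encode every vector. V0, V1, ..., V7 are encoded separately, and the rule is simple. Take V0 as an example: for each V0 sub-vector we pick the nearest centroid and use its index as the code. If sub-vector A is closest to centroid 42, the code of A is 42. After V0, V1, ..., V7 have been encoded, concatenating the codes gives the code of the full vector.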
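
The encoding step can be sketched as follows. This is a minimal illustration, assuming the codebooks have already been trained (K-means itself is not shown), that each sub-space has at most 256 centroids, and that squared Euclidean distance is used. The names Vec, Codebooks, sub_distance, and pq_encode are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using Vec = std::vector<float>;
// codebooks[m][k] is centroid k of sub-space m; all centroids of m have the same length.
using Codebooks = std::vector<std::vector<Vec>>;

// Squared Euclidean distance between the slice of x starting at offset and centroid c.
float sub_distance(const Vec& x, std::size_t offset, const Vec& c) {
    float d = 0.0f;
    for (std::size_t i = 0; i < c.size(); ++i) {
        float diff = x[offset + i] - c[i];
        d += diff * diff;
    }
    return d;
}

// Encode one vector: for each sub-space, store the index of the nearest centroid.
std::vector<std::uint8_t> pq_encode(const Vec& x, const Codebooks& codebooks) {
    std::size_t M = codebooks.size();
    std::size_t sub_dim = x.size() / M;   // assumes x.size() is a multiple of M
    std::vector<std::uint8_t> code(M);
    for (std::size_t m = 0; m < M; ++m) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t k = 0; k < codebooks[m].size(); ++k) {
            float d = sub_distance(x, m * sub_dim, codebooks[m][k]);
            if (d < best) {
                best = d;
                code[m] = static_cast<std::uint8_t>(k);
            }
        }
    }
    return code;
}
```

The returned code has one byte per sub-space, which is where the memory saving comes from.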

Summary

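That completes the construction part of PQ. Looking at the algorithm again, first consider the size of the target set. Each Vi has only 256 centroids, so no matter how many vectors there are, a sub-vector is mapped to 0-255. But because the vectors are split, the overall target set has $256^M$ elements (M is the number of parts), which enlarges it enormously. Now consider memory: with 256 centroids each code fits in one byte, so V0 takes $N \times 1$ byte and all codes together take $M \times N$ bytes, far less than the original $N \times D \times 4$ bytes (N is the number of vectors, D the dimensionality). PQ thus keeps memory usage low while providing a large target set, which preserves recall.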

3. Search

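Once all vectors in the database have been quantized, we need to find, for any query vector supplied by the user, the N most similar vectors in the database. We start by splitting the query in the same way.

Note that the query itself is not quantized, only split. This keeps the distance computation more precise; see the original paper for details.

After the split, we compute the distance between each part of the query and every centroid of the corresponding sub-space, and store these distances in a table.

We then scan all vectors in the database. For each one, table lookups give the distance between every query part and the corresponding part of the encoded vector; summing them gives the final distance. The K vectors with the smallest distance are returned.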
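
The table-lookup search can be sketched like this. It is a minimal illustration that assumes trained codebooks, one-byte codes, squared Euclidean distance, and a non-empty database; it returns only the single nearest vector rather than the top K. All names (build_table, adc_distance, nearest) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec = std::vector<float>;
using Codebooks = std::vector<std::vector<Vec>>;   // [sub-space][centroid]
using Code = std::vector<std::uint8_t>;            // one centroid index per sub-space
using Table = std::vector<std::vector<float>>;     // [sub-space][centroid] -> distance

// table[m][k] = squared distance between the m-th slice of the query and centroid k.
Table build_table(const Vec& query, const Codebooks& cb) {
    std::size_t M = cb.size();
    std::size_t sub_dim = query.size() / M;
    Table table(M);
    for (std::size_t m = 0; m < M; ++m) {
        for (const Vec& c : cb[m]) {
            float d = 0.0f;
            for (std::size_t i = 0; i < sub_dim; ++i) {
                float diff = query[m * sub_dim + i] - c[i];
                d += diff * diff;
            }
            table[m].push_back(d);
        }
    }
    return table;
}

// Approximate distance to an encoded vector: M lookups and additions.
float adc_distance(const Table& table, const Code& code) {
    float d = 0.0f;
    for (std::size_t m = 0; m < code.size(); ++m) d += table[m][code[m]];
    return d;
}

// Linear scan over the encoded database; db must not be empty.
std::size_t nearest(const Table& table, const std::vector<Code>& db) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < db.size(); ++i) {
        if (adc_distance(table, db[i]) < adc_distance(table, db[best])) best = i;
    }
    return best;
}
```

The table is built once per query, so the cost per database vector is M lookups regardless of the original dimensionality.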

4. IVF

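Although PQ greatly reduces memory, a search still has to compute the distance to every vector in the dataset. We would like to cut the amount of computation while keeping recall. For this we introduce IVF (inverted file index). We first run k-means on the dataset to partition it into k smaller datasets, then compute the residual between each vector and its centroid, and compress the residuals with PQ.

<div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/PQ8.png"/> </div> Note that every centroid has the vectors of its partition attached, but these are not the original vectors: they are the residuals between the original vectors and the centroid. Residuals are used because they have lower variance than the raw data, so the distance error during ANN search is smaller and the search quality is better.

In effect, we split the dataset into several smaller datasets and apply PQ compression to each partition. At search time we select the probe nearest partitions (by distance to their centroids), compute the residual between the query and each selected centroid, and run a PQ search in that partition with this residual. After all selected partitions have been searched, the K nearest vectors are returned.

We pick the probe nearest partitions rather than only the single nearest one because, as the authors note in the paper: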

The query vector and its nearest neighbors are often not quantized to the same partition centroid, but to nearby ones

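So we widen the search range somewhat for the sake of recall; the value of probe expresses the trade-off between recall and search speed.

5. Summary

PQ quantizes vectors to reduce the memory used by vector search, and quantizes sub-vectors separately to enlarge the target set. IVF builds on PQ by further shrinking the search space, which reduces search time.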

HNSW (Hierarchical Navigable Small World)

tang-hi — Sat, 25 Mar 2023 00:00:00 GMT

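HNSW is a graph-based algorithm for vector search, first proposed by Y. Malkov and D. Yashunin in their paper.

This section is organized as follows:

What properties a graph needs in order to find the K nearest vectors efficiently
NSW (Navigable Small World)
HNSW (Hierarchical Navigable Small World)

1. Properties of the graph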

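Let us first get an intuitive feel for representing a vector space as a graph. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/raw_vector.png"/> </div>

Each point in the figure is a vector; two vectors that are close to each other are also drawn close together. To solve vector search with a graph, we want every point to be reachable from any other point, that is, the graph must be connected. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/compare_vec.png"/> </div>

A connected graph alone is still not enough to find the K nearest vectors quickly. Consider the following situation: points A and B are far apart, so travelling from A to B passes through many intermediate points (which means many distance computations). Point C, on the other hand, is connected to a large number of points, so a search that starts from C evaluates many irrelevant points (most of C's neighbors are unlikely to be part of the result). <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/two_conn.png"/> </div>

In summary, to find the K vectors nearest to the query both efficiently and accurately, the graph should have the following properties:

It is connected (no islands).
Distant points are linked by edges (long-range edges).
The number of edges is limited (each edge costs a distance computation).
Nearby points are linked by edges (to preserve recall).

Properties 3 and 4 are a trade-off between recall and the amount of computation.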

2. NSW

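NSW builds a graph that satisfies the requirements above with a simple and effective algorithm. We describe NSW in two parts: construction and search.

1. Construction

Vectors are added to the graph one at a time in random order. Each newly added point is connected to the M points in the current graph that are nearest to it. The limit M keeps the number of edges down, since too many edges would hurt query efficiency.

Let us walk through an example with M set to 3, where the vectors to insert have already been shuffled.

First we add point A. The graph contains no other points, so we just add A and do nothing else. Next we add point B. The graph contains only A, fewer than 3 points, so we connect the two directly.

In the same way we add points C and D and obtain the following graph.

Then we add point E. We find the M points in the current graph nearest to E, namely A, B, and C, and connect E to each of them.

In the same way we add points F, G, and H. The final graph is shown below.

Let us check one by one whether the NSW graph has the properties we asked for:

Connected: every new point is linked to existing points, so the graph is clearly connected.
Distant points are linked by edges: because of the random insertion order, points that were nearest neighbors early on, such as A and D, end up far apart relative to the points added later, and the edge between A and D becomes a long-range edge.
The number of edges is limited: M bounds the number of edges added per point, so this holds as well.
Nearby points are linked by edges: every point is connected to its M nearest points at insertion time, so this requirement is met too.

So all we need to do is insert the vectors in random order and connect each one to its M nearest points at that moment, and we get a graph that supports efficient ANN queries. Next we discuss search.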

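2. Search

Because the graph built by NSW has these properties, a simple greedy algorithm already gives good search results. Given a query point:

Pick a random point in the graph as the entry point.
Compute the distance from the query to the entry point and to each of its neighbors, and take the nearest of these points.

a. If the nearest point is the entry point itself, the search ends and the entry point is returned.

b. Otherwise, make that point the new entry point and repeat step 2.

The figure below illustrates the search. Thanks to the long-range edges, which act as highways, we reach the result quickly.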
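
The greedy search steps above can be sketched as follows. This is a minimal illustration, assuming the graph is stored as adjacency lists and that squared Euclidean distance is used; the Graph struct and the function names are hypothetical, and the entry point is passed in rather than chosen at random.

```cpp
#include <cstddef>
#include <vector>

struct Graph {
    std::vector<std::vector<float>> points;            // points[i] is vector i
    std::vector<std::vector<std::size_t>> neighbors;   // neighbors[i] lists the nodes adjacent to i
};

float sq_dist(const std::vector<float>& a, const std::vector<float>& b) {
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Greedy walk: move to the neighbor closest to the query until no neighbor
// is closer than the current node. The distance strictly decreases on every
// move, so the loop terminates.
std::size_t greedy_search(const Graph& g, const std::vector<float>& query,
                          std::size_t entry) {
    std::size_t current = entry;
    for (;;) {
        std::size_t best = current;
        float best_d = sq_dist(g.points[current], query);
        for (std::size_t n : g.neighbors[current]) {
            float d = sq_dist(g.points[n], query);
            if (d < best_d) {
                best_d = d;
                best = n;
            }
        }
        if (best == current) return current;   // local minimum reached
        current = best;
    }
}
```

The result is a local minimum of the distance, which is why the graph properties listed earlier matter for recall.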

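3. HNSW

NSW already handles ANN queries well, but it has shortcomings:

During search, NSW cannot tell long-range edges from short-range edges, so it cannot follow long-range edges first and short-range edges afterwards.
When the data is strongly clustered, the edges between clusters remain sparse even if vectors are inserted in random order. Search then easily gets stuck in a local optimum, and efficiency suffers.

HNSW was proposed as an improved version of NSW to solve these problems.

1. Construction

Let us first look at an HNSW graph. Compared with NSW, HNSW adds the notion of levels. Level 0 contains all vectors, and the number of vectors decreases as the level increases. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/hnsw.png"/> </div>

HNSW does not require vectors to be inserted in random order. When a new vector is added, an exponentially decaying probability function first determines its maximum level (if the computed maximum level is 3, the vector is present in level 3, level 2, level 1, and level 0).

This means that for the vast majority of vectors the highest level is level 0. The higher levels can be seen as a sketch (a sample) of the lower levels, so edges in the higher levels are most likely long-range and edges in the lower levels short-range. The benefit is that a search can follow long-range edges first and short-range edges afterwards, i.e. coarse search first and fine search second, which keeps the number of search steps as small as possible.

Once the maximum level of the vector is known, it has to be added to the graph. Let V be the new vector, I its maximum level, and J the highest level of the HNSW graph. Insertion walks from the top level J down to level 0; let C be the level currently being processed. The insertion has three phases:

J >= C > I: use the greedy NSW search to find the nearest vector, then use it as the entry point on the next level down.
I >= C > 0: here we not only look for the nearest vector but also store V in the graph of this level. We still search greedily, but maintain a dynamic list of the efConstruction vectors nearest to V, where efConstruction is a tunable parameter. When the search on this level is done, the list serves as the candidate set, from which M vectors are selected and connected to V.
C = 0: the same strategy as in the second phase, except that on level 0 V may be connected to up to 2M vectors.

The figure below shows a simple example in which the new vector's highest level is 1. On level 2 we first find the point nearest to it (marked in yellow). Starting from that point, we find the efConstruction nearest points on level 1 and connect the new vector to M of them. Finally, on level 0, we start from the M vectors connected on the level above, find up to 2M suitable vectors, and connect to them. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/new-insert.png"/> </div>

After finding the efConstruction vectors nearest to V on a level (I >= C >= 0), we must choose the M of them that V connects to. A simple approach is to take the M nearest candidates. With strongly clustered data, however, this leaves the connections between clusters very sparse, so searches get stuck in local optima and queries slow down.

We therefore use a heuristic selection. Let V be the new vector, Candidate the efConstruction vectors found, and Result the vectors already chosen for connection. The heuristic is: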

while len(Candidate) > 0 and len(Result) < M:
    c = pop nearest element from Candidate to V
    lowest = +infinity   # distance from c to the closest vector already in Result
    for r in Result:
        lowest = min(lowest, distance(r, c))
    if distance(c, V) < lowest:
        Result += c

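One figure illustrates this. When deciding whether C1 or C2 becomes the next connection, a candidate is accepted only if it is closer to the inserted vector than to every vector already in Result, rather than simply taking the candidate closest to the inserted vector. According to the paper, this gives better search quality and efficiency on highly clustered data.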

The heuristic enhances the diversity of a vertex’s neighborhood and leads to better search efficiency for the case of highly clustered data. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/her.png"/> </div>

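2. Search

The search has two phases:

J >= C > 0: use the greedy NSW search to find the nearest vector on this level, then continue on the next level down with that point as the entry point.
C = 0: still search greedily, but maintain the efSearch nearest vectors found so far, and return the result from them at the end. <div style="text-align: center"> <img src="https://hayesx-1302722143.cos.ap-singapore.myqcloud.com/img/search.png"/> </div>

3. Summary

By introducing levels and a heuristic for selecting neighbors, HNSW addresses two weaknesses of NSW: the inability to tell long-range from short-range edges during search, and poor search efficiency on highly clustered data.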

Text Relevance

tang-hi — Mon, 20 Feb 2023 00:00:00 GMT

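Text relevance is a core problem in information retrieval and natural language processing. We want to quantify how similar or how related texts are so that text data can be processed and organized effectively. In a search engine, for example, we want to find the documents or web pages most relevant to the user's query. In document classification and clustering, we want to group similar documents so they can be managed and analyzed more easily. In text matching, we want the similarity between two texts in order to assess their relationship.

This post introduces TF-IDF and BM25.

tf-idf

tf-idf (Term Frequency-Inverse Document Frequency) is a statistical method for assessing how important a word is in a document. It is widely used in information retrieval, natural language processing, and related fields.

Its overall formula is: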

$$ \text{tf-idf}(t,d,D) = \text{tf}(t,d) \cdot \text{idf}(t,D) $$

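Here $t$ is a word (term), $d$ is a document, and $D$ is the whole document collection. tf is the frequency of the word in the document (term frequency), and idf is the inverse document frequency of the word in the collection.

They are computed as follows: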

$$ \text{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} $$

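where $f_{t,d}$ is the number of times term $t$ occurs in document $d$.

$$ \text{idf}(t,D) = \log{\frac{N}{|\{d \in D : t \in d\}|}} $$

where $N$ is the total number of documents in the collection and $|\{d \in D : t \in d\}|$ is the number of documents that contain term $t$.

tf-idf combines the frequency of a word within a document with its frequency across the collection to determine how important the word is in that document. The more often a word occurs in a document, the more important it is (higher tf); but if it also occurs in many documents of the collection, its importance drops (lower idf).

Understanding tf-idf through an example

Suppose we have a collection of the following 4 documents: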

Doc 1: the cat in the hat

Doc 2: the rat in the hat

Doc 3: the cat and the rat

Doc 4: the cat sat on the hat

Now we want to compute the tf-idf value of the word "cat" in each document. First we need the tf value of "cat" in every document:

$$\text{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, and $\sum_{t' \in d} f_{t',d}$ is the total number of term occurrences in $d$. Doc 1, Doc 2, and Doc 3 contain 5 words each and Doc 4 contains 6, so the tf values of "cat" are:

tf(cat, Doc 1) = 1/5 = 0.2
tf(cat, Doc 2) = 0/5 = 0
tf(cat, Doc 3) = 1/5 = 0.2
tf(cat, Doc 4) = 1/6 = 0.1667

Next we need the idf value of "cat" over the whole collection:

$$\text{idf}(t,D) = \log{\frac{N}{|\{d \in D : t \in d\}|}}$$

Here $N$ is the total number of documents and $|\{d \in D : t \in d\}|$ is the number of documents that contain $t$. "cat" appears in 3 of the 4 documents, so (using the natural logarithm):

$$\text{idf}(cat, D) = \log{\frac{4}{3}} \approx 0.2877$$

Finally we compute the tf-idf value of "cat" in each document:

$$\text{tf-idf}(t,d,D) = \text{tf}(t,d) \cdot \text{idf}(t,D)$$

This gives:

tf-idf(cat, Doc 1, D) = 0.2 * 0.2877 = 0.0575
tf-idf(cat, Doc 2, D) = 0 * 0.2877 = 0
tf-idf(cat, Doc 3, D) = 0.2 * 0.2877 = 0.0575
tf-idf(cat, Doc 4, D) = 0.1667 * 0.2877 = 0.0479

We now have the tf-idf value of "cat" in every document. It is highest in Doc 1 and Doc 3, where "cat" makes up a larger share of the words, slightly lower in the longer Doc 4, and zero in Doc 2, which does not contain the word. All values are small because "cat" appears in 3 of the 4 documents, so its idf is low.
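
The calculation above can be reproduced with a short sketch. It assumes whitespace tokenization, the natural logarithm, and a term that occurs in at least one document (otherwise idf divides by zero). The function names are hypothetical.

```cpp
#include <cmath>
#include <sstream>
#include <string>
#include <vector>

// Split a document into whitespace-separated terms.
std::vector<std::string> tokenize(const std::string& doc) {
    std::istringstream in(doc);
    std::vector<std::string> terms;
    for (std::string t; in >> t;) terms.push_back(t);
    return terms;
}

// tf(t, d): occurrences of the term divided by the number of terms in the document.
double tf(const std::string& term, const std::string& doc) {
    std::vector<std::string> terms = tokenize(doc);
    if (terms.empty()) return 0.0;
    double count = 0.0;
    for (const std::string& t : terms) {
        if (t == term) count += 1.0;
    }
    return count / terms.size();
}

// idf(t, D): natural log of N over the number of documents containing the term.
// Precondition: the term occurs in at least one document.
double idf(const std::string& term, const std::vector<std::string>& docs) {
    double containing = 0.0;
    for (const std::string& d : docs) {
        if (tf(term, d) > 0.0) containing += 1.0;
    }
    return std::log(docs.size() / containing);
}

double tf_idf(const std::string& term, const std::string& doc,
              const std::vector<std::string>& docs) {
    return tf(term, doc) * idf(term, docs);
}
```

Calling tf_idf("cat", docs[i], docs) for each of the four example documents gives the per-document scores.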

BM25

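Like tf-idf, BM25 is based on term frequency, but unlike tf-idf it takes document length into account and adjusts how term frequency is weighted. BM25 stands for Best Matching 25. It computes a similarity score between a query and a document based on three factors:

The number of times a query term occurs in the document
The length of the document (the number of words it contains)
The document frequency of the query term (the number of documents that contain it)

The BM25 formula is: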

$$ \text{BM25}(q, d) = \sum_{i=1}^{|q|} \text{idf}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})} $$

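where

$q$ is the query
$d$ is the document
$|q|$ is the number of query terms
$q_i$ is the $i$-th query term
$\text{idf}(q_i)$ is the inverse document frequency of query term $q_i$, defined as $\text{idf}(q_i) = \log{\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}}$
- $N$ is the number of documents in the collection
- $n(q_i)$ is the number of documents containing $q_i$
$f(q_i, d)$ is the number of times $q_i$ occurs in document $d$
$|d|$ is the number of words in document $d$
$\text{avgdl}$ is the average document length in the collection
$k_1$ and $b$ are constants, typically $k_1 = 1.2$ and $b = 0.75$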

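In BM25, the weight of a query term is determined by two factors: inverse document frequency (idf) and term frequency (tf). As in tf-idf, idf measures how important a query term is and tf measures how often it occurs in the document. The difference is that BM25 uses the document length to normalize the term frequency and lets the contribution of term frequency saturate.

Concretely, in a short document each occurrence counts for more, which helps surface terms that occur only a few times but matter; in a long document each occurrence counts for less, so that common terms that simply occur many times do not dominate. The constants $k_1$ and $b$ can also be tuned for the data at hand to obtain better results.

A worked BM25 example: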

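Suppose we have a collection of three documents $D_1, D_2, D_3$ with lengths $|D_1|=100, |D_2|=200, |D_3|=300$, and a query $q$ made of two words $q_1$ and $q_2$. $q_1$ occurs 2 times in $D_1$, 5 times in $D_2$, and 10 times in $D_3$; $q_2$ occurs 3 times in $D_1$, once in $D_2$, and not at all in $D_3$.

We want the BM25 score of each document for the query $q$. For simplicity we take $k_1 = 1.2$ and $b = 0.75$.

First we need the inverse document frequency of each query term. $q_1$ appears in all three documents, so $n(q_1) = 3$; $q_2$ appears in $D_1$ and $D_2$, so $n(q_2) = 2$. In such a tiny collection both terms occur in more than half of the documents and the idf formula above would be negative for both, so this example uses the non-negative variant $\log(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5})$ with the natural logarithm:

$$\text{idf}(q_1) = \log\left(1 + \frac{3 - 3 + 0.5}{3 + 0.5}\right) \approx 0.13$$

$$\text{idf}(q_2) = \log\left(1 + \frac{3 - 2 + 0.5}{2 + 0.5}\right) \approx 0.47$$

Next we compute the BM25 score of each document. The average document length is $\text{avgdl} = (100 + 200 + 300) / 3 = 200$, which gives:

$$ \text{BM25}(q, D_1) = \text{idf}(q_1) \cdot \frac{2 \cdot (1.2 + 1)}{2 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot \frac{100}{200})} + \text{idf}(q_2) \cdot \frac{3 \cdot (1.2 + 1)}{3 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot \frac{100}{200})} \approx 1.04 $$

$$ \text{BM25}(q, D_2) = \text{idf}(q_1) \cdot \frac{5 \cdot (1.2 + 1)}{5 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot \frac{200}{200})} + \text{idf}(q_2) \cdot \frac{1 \cdot (1.2 + 1)}{1 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot \frac{200}{200})} \approx 0.71 $$

$$ \text{BM25}(q, D_3) = \text{idf}(q_1) \cdot \frac{10 \cdot (1.2 + 1)}{10 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot \frac{300}{200})} + \text{idf}(q_2) \cdot \frac{0 \cdot (1.2 + 1)}{0 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot \frac{300}{200})} \approx 0.25 $$

The BM25 scores of the three documents for the query $q$ are therefore about 1.04, 0.71, and 0.25.

The example shows that although $q_1$ occurs most often in $D_3$, $D_3$ is long and does not contain $q_2$, so $D_1$ scores highest. This illustrates one strength of BM25: length normalization and term-frequency saturation keep a common term that is merely repeated many times from dominating the score, which plain tf-idf does not prevent.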
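
A minimal sketch for recomputing scores like the ones in this example. It assumes the non-negative idf variant $\log(1 + \frac{N - n + 0.5}{n + 0.5})$ with the natural logarithm (an assumption of this sketch, chosen so that scores stay positive in a three-document collection), and it takes per-term statistics as input instead of tokenizing text. TermStats, bm25_idf, and bm25 are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct TermStats {
    double idf;                 // idf of the query term
    std::vector<double> freq;   // freq[d] = occurrences of the term in document d
};

// Non-negative idf variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
double bm25_idf(double N, double n) {
    return std::log(1.0 + (N - n + 0.5) / (n + 0.5));
}

// BM25 score of document d for a query described by per-term statistics.
double bm25(const std::vector<TermStats>& query, std::size_t d,
            const std::vector<double>& doc_len,
            double k1 = 1.2, double b = 0.75) {
    double avgdl = 0.0;
    for (double len : doc_len) avgdl += len;
    avgdl /= doc_len.size();

    double score = 0.0;
    for (const TermStats& t : query) {
        double f = t.freq[d];
        double norm = k1 * (1.0 - b + b * doc_len[d] / avgdl);   // length normalization
        score += t.idf * f * (k1 + 1.0) / (f + norm);
    }
    return score;
}
```

With the document lengths and term counts of the example as input, the function can be called once per document to compare the scores.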

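A closer look at the BM25 formula

Why the idf formula in BM25 differs from the classic one

The classic idf formula is $\log\frac{N}{df}$, where $N$ is the total number of documents in the collection and $df$ is the number of documents containing the query term $t$. BM25 uses $\log\frac{N-df+0.5}{df+0.5}$ instead. Compared with the classic formula there are two changes:
1. A smoothing constant: in the classic formula, a term that appears in no document gives $df = 0$ and a division by zero. The 0.5 added to numerator and denominator keeps the value finite.
2. A different ratio: instead of $N / df$, the formula compares the number of documents that do not contain the term with the number that do. For rare terms the two formulas give similar values, but once a term appears in more than half of the documents the BM25 idf becomes negative. To solve this, Lucene changed the idf formula to $\log(1+\frac{N-df+0.5}{df+0.5})$, which is never negative.

How k affects the result

$k$ (written $k_1$ in the formula above) controls how strongly term frequency influences the score; it can be seen as a normalization factor for term frequency.

When $k$ is small, term frequency has relatively little influence and the range of scores is narrow; when $k$ is large, term frequency has more influence and the range of scores is wider. When $k$ equals $0$, every matching term counts as if its frequency were $1$, so the score depends only on which query terms the document contains. Because of $k$, even a very large term frequency does not have an outsized effect on the result: once the frequency reaches a certain level, the BM25 value no longer grows linearly but saturates.

How b affects the result

The parameter $b$ controls how strongly document length influences the score; it can be seen as a normalization factor for document length. With $b = 0$ document length is ignored. The larger $b$ is, the stronger the length normalization: terms in documents longer than average receive a smaller weight and terms in documents shorter than average a larger one. The smaller $b$ is, the less document length matters.

How document length affects the result

The fewer words a document contains, the smaller $\frac{|d|}{\text{avgdl}}$ is, and the larger the resulting BM25 score. For the same term frequency, a shorter document is therefore considered more relevant.

Why is idf computed with a logarithm?

The ratio $N / df$ can span many orders of magnitude: in a large collection a term that appears in only a handful of documents has a ratio in the hundreds of thousands or more, while a common term has a ratio close to 1. The logarithm compresses these values into a small range that is easier to compute with. It also keeps very rare terms from completely dominating the score while still ranking them above common terms, so rare terms remain useful for telling relevant documents apart.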

BM25F

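BM25F is a variant of BM25 that adds support for multiple fields. In BM25F every document can contain several fields (for example title, body, and tags), and every field has a weight. BM25F computes the relevance score of a document by combining the scores of its fields. The formula is: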

$score(D,Q) = \sum_{i=1}^{n}weight(q_i)\cdot IDF(q_i)\cdot \frac{f(q_i,D)\cdot (k_i + 1)}{f(q_i,D) + k_i\cdot (1 - b_i + b_i \cdot \frac{|D|}{avgdl_i})}$

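Here $D$ is the document, $Q$ is the query, $n$ is the number of terms in the query, $q_i$ is the $i$-th query term, $weight(q_i)$ is the weight of the $i$-th term, $IDF(q_i)$ is its inverse document frequency, $f(q_i,D)$ is the frequency of the $i$-th term in document $D$, $k_i$ and $b_i$ are the parameters $k$ and $b$ used for the $i$-th term, $|D|$ is the length of document $D$, and $avgdl_i$ is the average length, over all documents, of the field being scored.

In BM25F the weight of a term has two parts: the weight of the field it occurs in and a global weight. The global weight expresses how important the term is in the whole collection; the field weight expresses how important it is in the current field. The weight of a term is computed as: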

$weight(q_i) = weight_{field}(q_i)\cdot weight_{global}(q_i)$

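where $weight_{field}(q_i)$ is the weight of the $i$-th term in the current field and $weight_{global}(q_i)$ is its weight in the whole collection.

The advantage of BM25F is that it handles multi-field queries well and matches the query against the different fields of a document more accurately. The field weights can be tuned to reflect the importance of each field, which improves the accuracy of search results.

Summary

tf-idf: the higher the term frequency and the rarer the term in the collection, the higher the score
BM25: the rarer the term in the collection, the higher the term frequency, and the shorter the document, the higher the score
BM25F: the rarer the term in the collection, the higher the term frequency, the shorter the document, and the higher the field weight, the higher the score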

The Virtual Mechanism

tang-hi — Sat, 18 Feb 2023 00:00:00 GMT

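This post uses GDB to analyze how virtual functions are implemented in C++, in the hope of giving you a more thorough understanding of the mechanism.

The test program: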

#include <iostream>
using namespace std;
struct Simple {
  int one;
};
struct Base {
  virtual void v1() {
    cout << "Base::V1" << endl;
  }

  virtual void v2() {
    cout << "Base::V2" << endl;
  }

  int one;

};


struct Derived : Base {
  void v1() override {
    cout << "Derived::v1" << endl;
  }

};

int main() {
  Base* derived = new Derived();
  Base* derived1 = new Derived();
  Base* base = new Base();
  Base* base1 = new Base();

  Simple* simple = new Simple();

  derived->v1();
  derived->v2();

}

Next we compile the code and analyze it with gdb:

g++ virtual.cc --std=c++11 -g
gdb a.out

First, let us look at the contents of derived, derived1, base, base1, and simple:

variable name	address
derived	0x55555556aeb0
derived1	0x55555556aed0
base	0x55555556aef0
base1	0x55555556af10
simple	0x55555556af30

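From these two figures we can observe the following:

When a class has virtual functions, each object of that class contains a vptr.
The vptr is 8 bytes in size (0x55555556aeb8 - 0x55555556aeb0).
What the vptr points to depends only on the class type, not on the individual object (derived.vptr == derived1.vptr).

Taking derived as an example, let us look at what the vptr points to.

We can see that the vptr points to something, although we do not yet know what. The value stored there, 0x5555555553a6 (stored in little-endian byte order), looks like an address, so we can check what that address points to.

The result is clear: the value is the address of the function Derived::v1, and the function can be called through it. Let us look at the other values as well.

So the conclusion is clear: when a class has virtual functions, the compiler creates a vtable dedicated to that class. The vtable stores the addresses of the virtual function implementations: if the class overrides a function, the entry points to its own implementation, otherwise to the parent's. When an object of the class is created, a pointer to this vtable is stored in the object's vptr.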
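
These observations can also be checked from inside a program without a debugger. The sketch below mirrors the test program but is a variation made for illustration: the virtual functions return strings instead of printing, Base gets a virtual destructor, and the helper names call_v1, call_v2, and has_vptr_overhead are hypothetical. The size check reflects how common ABIs lay out polymorphic objects; the C++ standard does not mandate it.

```cpp
#include <string>

struct Simple {
    int one;
};

struct Base {
    virtual ~Base() {}
    virtual std::string v1() { return "Base::v1"; }
    virtual std::string v2() { return "Base::v2"; }
    int one;
};

struct Derived : Base {
    std::string v1() override { return "Derived::v1"; }
};

// Both calls go through the object's vptr, so the static type Base& does not
// decide which implementation runs.
std::string call_v1(Base& b) { return b.v1(); }
std::string call_v2(Base& b) { return b.v2(); }

// On common ABIs the vptr adds one pointer to every polymorphic object.
constexpr bool has_vptr_overhead = sizeof(Base) >= sizeof(Simple) + sizeof(void*);
```

For a Derived object, call_v1 reaches the override while call_v2 falls back to the Base implementation, matching the two vtable entries inspected in gdb.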

Finally, let us look at how a call is made.

derived->v1();

derived->v2();

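Here rbp is the frame pointer, and -0x38(%rbp) loads the value of derived, namely 0x55555556aeb0, which is also the address of the vptr because the vptr sits at the start of the object. Then mov (%rax),%rax loads the vtable address into %rax. Because a different function is called, the assembly for derived->v2(); adds 8 to %rax to reach the corresponding entry. Next, mov (%rax),%rdx loads the address of the function to call, and finally call *%rdx performs the polymorphic call.

Summary

When a class has virtual functions, the compiler adds a vptr to every object of that class. The vptr is 8 bytes in size on this 64-bit platform, and what it points to depends only on the class type, not on the object. It points to the class's vtable, whose entries hold the addresses of the virtual function implementations (for example Derived::v1), and a function can be called through such an address. At a call site, the compiler selects the vtable entry that corresponds to the function being called and calls through the address stored in that entry.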

More Effective C++

tang-hi — Sat, 11 Feb 2023 00:00:00 GMT

This post is mainly meant to reinforce my memory of books I have read. The content may only be valuable to me.

Item 1: Distinguish between pointers and references

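References compared with pointers:

Advantage: a reference is always valid, i.e. there is no null reference, whereas a pointer has to be checked for null.

Disadvantage: a pointer can be reseated to point to a new object; a reference cannot. A pointer can use nullptr to express "nothing". If the variable needs "does not exist" semantics, use a pointer.

Summary: when you know you need to refer to something and will never refer to anything else, use a reference; otherwise use a pointer.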

Item 2: Prefer C++-style casts

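C-style casts do not say which kind of conversion is intended and are hard to spot in code, so prefer the C++ casts:

static_cast has basically the same power and meaning as the old C-style cast
const_cast casts away constness
dynamic_cast performs downcasts within an inheritance hierarchy; failure is reported as nullptr (for pointers) or as an exception (for references)
reinterpret_cast reinterprets the underlying bit pattern, for example for serialization or for converting between function pointer types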

Item 3: Never treat arrays polymorphically :skull:

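Arrays must not be passed around polymorphically, for example:

class BST {};
class BalancedBST: public BST {};
void printBSTArray(const BST array[]);
BalancedBST bArray[10];
printBSTArray(bArray); // compiles, but indexing array[i] inside is undefined behavior

Why? When array elements are accessed, the offset is computed from the declared element type, but a derived class is usually larger than its base class, so the offset actually used is wrong. This is undefined behavior!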

Item 4: Avoid gratuitous default constructors

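If a class cannot be initialized correctly without outside information, avoid giving it a default constructor. This brings a few problems, though:

An array such as A a[10] cannot be created without a default constructor; it has to be built some other way, for example as an array of pointers instead of an array of objects.
Some template-based containers do not work well with such a class, because they may assume it has a default constructor.
If a virtual base class lacks a default constructor, every class derived from it has to know about and supply its constructor arguments (bad design).

Conclusion: this is a case-by-case question; decide according to the actual situation.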

Item 5: Be wary of user-defined conversion functions

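Be very careful with user-defined conversion functions, because they can lead to unexpected function calls. The compiler tries hard to make your code compile, so it may apply an implicit conversion where you did not expect one. Remedies:

Define an asType() member function to make the conversion explicit
Use explicit to disable implicit conversion through single-argument constructors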

Item 6: Distinguish between prefix and postfix forms of increment and decrement operators

Prefix ++ returns a reference; postfix ++ returns a const object (the const prevents a++++).

Postfix ++ carries the cost of a temporary object.

prefer prefix
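
A minimal sketch of the two forms on a hypothetical Counter class (the class is an assumption for illustration; the book uses its own example type):

```cpp
class Counter {
public:
    explicit Counter(int v = 0) : value_(v) {}

    // Prefix: increment, then return a reference to the updated object.
    Counter& operator++() {
        ++value_;
        return *this;
    }

    // Postfix: remember the old state, reuse the prefix form, return the copy.
    // The const return type rejects c++++ at compile time.
    const Counter operator++(int) {
        Counter old = *this;
        ++*this;
        return old;
    }

    int value() const { return value_; }

private:
    int value_;
};
```

The copy named old is the temporary that makes the postfix form more expensive.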

Item 7: Never overload &&, ||, or ,

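The built-in versions of these operators short-circuit and are guaranteed to evaluate from left to right. An overloaded version is an ordinary function call: both operands are always evaluated, and before C++17 their evaluation order is not guaranteed. The result contradicts what readers expect and leads to surprising behavior.

Item 8: Understand the different meanings of new and delete

new

Allocates memory
Calls the constructor on that memory

operator new (void* operator new(size_t size))

Returns a block of raw, uninitialized memory

placement new ( new (memory pointer) Type(args) )

Calls the constructor at memory pointer

new [] and operator new[] are the corresponding array versions

delete mirrors new, and the two must be used in matching pairs

delete - new

operator delete - operator new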

Item 9: Use destructors to prevent resource leaks

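Because of exceptions, an exception may be thrown before you release a resource, and the resource leaks. Writing catch blocks everywhere makes the code a mess, so put the release into a destructor instead, i.e. RAII.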
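
A minimal sketch of the idea. The Resource type is a hypothetical stand-in for a real resource; it only counts live instances so that the effect of the destructor can be observed.

```cpp
#include <stdexcept>

// Counts live instances so that a leak would be visible.
struct Resource {
    static int live;
    Resource() { ++live; }
    ~Resource() { --live; }
    Resource(const Resource&) = delete;
    Resource& operator=(const Resource&) = delete;
};
int Resource::live = 0;

// Throws after acquiring the resource; stack unwinding still runs ~Resource().
void work(bool fail) {
    Resource r;
    if (fail) throw std::runtime_error("failure after acquiring the resource");
}

// Returns the number of resources still alive after a failing call.
int leaked_after_failure() {
    try {
        work(true);
    } catch (const std::runtime_error&) {
        // nothing to clean up here: the destructor already did it
    }
    return Resource::live;
}
```

No catch block is needed inside work() to release the resource.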

Item 10: Prevent resource leaks in constructors

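If a constructor throws, the object has not been fully constructed, so its destructor is not called and resources acquired earlier in the constructor leak. The fix is to avoid raw pointer members and use smart pointers instead.

Item 11: Prevent exceptions from leaving destructors

An exception that escapes a destructor has two drawbacks: 1. it may terminate the program immediately; 2. the rest of the destructor does not run, which can leak resources. So do everything you can to keep exceptions from leaving destructors.

Item 12: Understand how throwing an exception differs from passing a parameter or calling a virtual function

The thrown object is always copied, regardless of how it is caught.
An object thrown as an exception is subject to fewer type conversions than an object passed to a function.
Catch clauses are tried in order and the first match is used, not the best match.

Item 13: Catch exceptions by reference

Catching by pointer: the pointer may already be dangling, and it is unclear whether the handler should delete it.

Catching by value: costs an extra copy and slices the object, so polymorphism is lost.

Catching by reference: none of these drawbacks.

Item 14: Use exception specifications judiciously

Rarely used since C++11; only noexcept remains in practice.

Item 15: Understand the costs of exception handling

Use a profiler to measure the performance impact.

Item 16: Remember the 80-20 rule

Spend effort where it really matters.

Item 17: Consider using lazy evaluation

A classic idea in computing: compute only when needed.

Item 18: Amortize the cost of expected computations

Spread the computation across calls. For example, if you need the maximum of an array, update the maximum every time an element is added.

Item 19: Understand the origin of temporary objects

Temporary objects can be expensive, so eliminate them where possible. They can arise, for example, where an argument is passed by reference-to-const or by value and needs a conversion, and where a function returns by value.

Item 20: Facilitate the return value optimization

See the RVO post for details.

Item 21: Overload to avoid implicit type conversions

Use overloading to remove implicit conversions and with them the temporaries, for example: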

const UPInt operator+(const UPInt& lhs, // add UPInt
					  const UPInt& rhs); // and UPInt

const UPInt operator+(const UPInt& lhs, // add UPInt
					  int rhs); // and int

const UPInt operator+(int lhs, // add int and
					  const UPInt& rhs); // UPInt

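This way, upi3 = upi1 + 10; no longer creates a temporary for a type conversion.

Item 22: Consider using op= instead of stand-alone op

The compound form (+=) is usually more efficient than +, because it does not need a temporary.

Item 23: Consider alternative libraries

Not much to say: use whichever high-performance library is available.

Item 24: Understand the costs of virtual functions, multiple inheritance, virtual base classes, and RTTI

Not much to say here either; you only really learn this when you run into it.

Item 25: Virtualizing constructors and non-member functions

A virtual constructor is a function that creates objects of different types depending on its input, typically a virtual member function such as clone() or a factory function.

A virtual non-member function: write a virtual member function that does the real work, and a non-virtual non-member function that calls it.

Item 26: Limiting the number of objects of a class

Design a Counted base class that does the counting internally, so users are unaware of it.

Item 27: Requiring or prohibiting heap-based objects

There is a hack for checking whether an object is on the heap (it relies on the program's memory layout and is not portable):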

bool onHeap(const void *address)
{
	char onTheStack; // local stack variable
	return address < &onTheStack;
}

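There is no perfect way to restrict whether objects live on the heap.

Item 28: Smart pointers

Supported by the standard library since C++11.

Item 29: Reference counting

A classic topic; not expanded here.

Item 30: Proxy classes

A proxy class uses proxy objects to stand in for objects that do not actually exist, without the user noticing.

Item 31: Making functions virtual with respect to more than one object

Multiple dispatch. The best solution is to write your own virtual function table.

Item 32: Program in the future tense

Always keep in mind that your code will be extended in all sorts of ways and will face all sorts of surprising requirements.

Item 33: Make non-leaf classes abstract

Extract a dedicated abstract class and let the other classes inherit from it.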

Item 34: Understand how to combine C++ and C in the same program

#ifdef __cplusplus
extern "C" {
#endif
void drawLine(int x1, int y1, int x2, int y2); // extern "C" keeps the compiler from mangling these names
void twiddleBits(unsigned char bits);
void simulate(int iterations);
...
#ifdef __cplusplus
}
#endif

If you want to mix C++ and C in the same program, remember the following simple guidelines:

■ Make sure the C++ and C compilers produce compatible object files.

■ Declare functions to be used by both languages extern "C".

■ If at all possible, write main in C++.

■ Always use delete with memory from new; always use free with memory from malloc.

■ Limit what you pass between the two languages to data structures that compile under C; the C++ version of structs may contain nonvirtual member functions.

Item 35: Familiarize yourself with the language standard

Get familiar with the language standard, and read the specification documents.

Return Value Optimization

tang-hi — Sun, 15 Jan 2023 00:00:00 GMT

I wrote this post after watching Jon Kalb's CppCon 2018 talk on YouTube. It inspired me to look at C++'s RVO mechanism from a different angle.

1. calling conventions

1.1 Returning int, float, ...

int simple() {
    return 1;
}

int main() {
	return 1 + simple();
}

Compiling the code above gives the following assembly:

simple():
        push    rbp
        mov     rbp, rsp
        mov     eax, 1
        pop     rbp
        ret
main:
        push    rbp
        mov     rbp, rsp
        call    simple()
        add     eax, 1
        pop     rbp
        ret

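Since the topic is RVO, we only care about the return value. The instruction mov eax, 1 in simple corresponds to return 1;. In other words, the return value is placed in the rax register (eax is its lower 32 bits), provided that it fits there.

1.2 Returning a struct

What happens when the return value is a struct, which may not fit in rax?

Look at the following code: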

struct BigObject {
    int data[6];
};

BigObject big() {
    return BigObject{1,2,3,5,6,7};
}

int main() {
	BigObject bo = big();
    return 0;
}

Compiling this code gives the following assembly:

big():
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax], 1
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax+4], 2
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax+8], 3
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax+12], 5
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax+16], 6
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax+20], 7
        mov     rax, QWORD PTR [rbp-8]
        pop     rbp
        ret
main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        lea     rax, [rbp-32]
        mov     rdi, rax
        call    big()
        mov     eax, 0
        leave
        ret

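We can see that rax is still involved, but here it acts as a pointer: main passes the address of a stack slot in rdi, big() writes each value through that pointer at the right offset (for example mov DWORD PTR [rax+4], 2), and finally returns the same address in rax.

The whole calling sequence is summarized in two figures.

It does not matter if the details are unclear. Just remember that in the cases shown here the return value comes back through rax; the difference is whether rax holds a scalar such as an int or a pointer to the returned object.

2 Implementing RVO with RAX

RVO simply omits the copy that would otherwise happen when a function returns. How is that done? From the above we know that the return value travels through rax. So before the call, the caller reserves space in its own stack frame and hands its address to the callee (in rdi in the assembly above); the callee constructs the return value there and returns the same address in rax. When the function returns there is no temporary left to copy: the return value already lives in the space the caller reserved and can be used directly.

Again, one figure summarizes the process.

3. Where RVO applies

With a rough understanding of how RVO is implemented, we can examine the scenarios where it applies from a different angle.

3.1 unnamed rvo :white_check_mark: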

Foo URVO() { return Foo(); }
Foo foo = URVO();

In this scenario the returned value is a temporary, so it can be constructed directly in the reserved space with no copy. RVO applies.
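
This can be observed by counting copy-constructor calls. The sketch below uses a stripped-down Foo with a static counter (an assumption for illustration, simpler than the Foo used in the experiments later in this post). Since C++17 the elision in this case is mandatory; with earlier standards it is an optimization that GCC and Clang apply unless -fno-elide-constructors is passed.

```cpp
struct Foo {
    static int copies;   // number of copy-constructor calls so far
    Foo() {}
    Foo(const Foo&) { ++copies; }
};
int Foo::copies = 0;

// Returns an unnamed temporary: it is built directly in the caller's slot.
Foo URVO() { return Foo(); }

// Number of copy-constructor calls needed to initialize a local from URVO().
int copies_for_urvo() {
    Foo::copies = 0;
    Foo foo = URVO();
    (void)foo;   // silence unused-variable warnings
    return Foo::copies;
}
```

With elision in effect, copies_for_urvo() reports that no copy was made.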

3.2 named rvo :white_check_mark:

Foo NRVO() {
  Foo foo;
  return foo;
}
Foo foo = NRVO();

In this scenario the return value is a local variable, but the local variable can be constructed directly in the reserved space. RVO applies.

3.3 named rvo with compile-time condition :white_check_mark:

Foo NRVO_Compile_BRANCH(int x) {
  Foo foo;
  if (x % 2 == 0) {
    return foo;
  } else {
    return foo;
  }
}

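Foo foo = NRVO_Compile_BRANCH(1);

In this scenario the return value is a local variable, and whatever the condition evaluates to, the same local variable is returned (this is known at compile time). The local variable can therefore be constructed directly in the reserved space, and RVO applies.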

3.4 named rvo with run-time condition :x:

Foo NRVO_RUNTIME_BRANCH(int x) {
  Foo foo, foo1;
  if (x % 2 == 0) {
    return foo;
  }
  return foo1;
}

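Foo foo = NRVO_RUNTIME_BRANCH(1);

In this scenario there are two local variables and either may become the return value, which is only known at run time. The return value cannot be constructed directly in the reserved space, because by the time the return statement is reached and we know which variable is returned, both have long been constructed. It has to be produced with the copy constructor, so RVO does not apply.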

3.5 return global variable :white_check_mark:

Foo Global_FOO() { return global_foo; }

Foo foo = Global_FOO();

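Many blog posts say RVO is not used in this scenario. My tests show otherwise: although the returned object is a global variable that was constructed long ago and has its own address, the result can still be copy-constructed directly at the return address. RVO applies.

Here is the experiment I ran: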

struct Foo {
  Foo() : data(0), id(++version) {
    ++object_create;
    cout << "Foo ctor, version :" << id << endl;
  }

  Foo(const Foo &rhs) : data(rhs.data), id(++version) {
    ++object_create;
    cout << "Foo copy ctor, version: " << rhs.id << " -> " << id << endl;
  }


  Foo &operator=(const Foo &rhs) {
    data = rhs.data;
    cout << "Foo copy assign version: " << rhs.id << " -> " << id << endl;
    return *this;
  }


  ~Foo() { cout << "Foo destory version: " << id << endl; }

  /* data */
  int data;
  int id;
};

Foo global_foo;
Foo foo1 = Global_FOO();
---------------------------------------------------------------
g++ -o enable -O0 -std=c++98 & ./enable
Foo copy ctor, version: 1 -> 1
Foo destory version: 1
create 1 objects

g++ -o disable -O0 -std=c++98 -fno-elide-constructors & ./disable
Foo copy ctor, version: 1 -> 1
Foo copy ctor, version: 1 -> 2
Foo destory version: 1
Foo destory version: 2
create 2 objects

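With RVO enabled, the copy constructor is indeed called one time fewer. Strictly speaking, this can be seen as unnamed RVO applied to the returned temporary rather than an optimization of the global itself.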

3.6 return parameter :white_check_mark:

Foo Return_Para(Foo foo) { return foo; }
Foo foo = Return_Para();

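This scenario is very similar to the previous one. The argument still has to be copied once, but on return RVO copies it straight into the newly reserved space, so one copy-constructor call is saved compared with RVO disabled.

This time we use the assembly as evidence.

Assembly with RVO enabled

As we can see, the copy constructor is called only once in the whole process.

Assembly with RVO disabled

As we can see, the copy constructor is called twice, which shows that RVO is indeed at work.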

3.7 return by move :x:

Foo Return_BY_MOVE() {
  Foo foo;
  return std::move(foo);
}
Foo foo = Return_BY_MOVE();

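There is not much to say here: the C++ standard does not permit elision in this case, because the returned expression is no longer the name of a local variable. Using std::move disables RVO.

3.8 A not very useful discovery

When your class has no user-written copy constructor and none of its members has a user-defined copy constructor either, then with RVO enabled the compiler does not even emit a copy constructor.

Test code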

struct Foo {
  Foo() : data(0), id(++version) {
    ++object_create;
    cout << "Foo ctor, version :" << id << endl;
  }

//   Foo(const Foo &rhs) : data(rhs.data), id(++version) {
//     ++object_create;
//     cout << "Foo copy ctor, version: " << rhs.id << " -> " << id << endl;
//   }

//   Foo &operator=(const Foo &rhs) {
//     data = rhs.data;
//     cout << "Foo copy assign version: " << rhs.id << " -> " << id << endl;
//     return *this;
//   }

  ~Foo() { cout << "Foo destory version: " << id << endl; }

  /* data */
  int data;
  int id;
// std::vector<int> vec;
};

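Assembly generated with RVO enabled

As we can see, it consists purely of register and stack operations.

Assembly generated with RVO disabled

The compiler generates a copy constructor for you, and it is called.

If either of the two conditions is not met (the class has no user-written copy constructor, and none of its members has a user-defined copy constructor), the compiler always generates a copy constructor.

To sum up: as long as RVO is enabled, it is used on almost every return, i.e. the object is constructed directly in the newly reserved space and one copy is saved. In some special cases, however, such as returning a parameter or returning a global variable, the copy of that object itself cannot be omitted.

RVO and std::move

When std::move gets involved in RVO, things become a little subtle.

First, a very general conclusion that holds in the vast majority of cases: if a class is movable, then on return the object is constructed in place when possible, and otherwise the move constructor is called.

In one sentence: the value you return is treated as an rvalue, and either RVO or the move constructor is used. There are exceptions, though.

Let us first look at the reference documentation.

Note that the passage in bold means: if the type of the returned expression does not match the declared return type of the function, the return value is treated as an lvalue, i.e. the copy constructor is called. For example: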

struct Foo {
  Foo() : data(0), id(++version) {
    ++object_create;
    cout << "Foo ctor, version :" << id << endl;
  }

  Foo(const Foo &rhs) : data(rhs.data), id(++version), aaaa(rhs.aaaa) {
    ++object_create;
    cout << "Foo copy ctor, version: " << rhs.id << " -> " << id << endl;
  }

  Foo(Foo &&rhs) : data{rhs.data}, id{++version} {
    cout << "Foo move ctor, version: " << rhs.id << " -> " << id << endl;
  }

  Foo &operator=(const Foo &rhs) {
    data = rhs.data;
    cout << "Foo copy assign version: " << rhs.id << " -> " << id << endl;
    return *this;
  }

  Foo &operator=(Foo &&rhs) {
    data = rhs.data;
    cout << "Foo move assign version: " << rhs.id << " -> " << id << endl;
    return *this;
  }

  ~Foo() { cout << "Foo destory version: " << id << endl; }

  /* data */
  int data;
  int id;
  Complex complex;
  std::vector<int> aaaa;
};

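struct FOOS : public Foo {
};

Foo return_derived() {
    FOOS foos;
    return foos; // treated as an lvalue
}

Because FOOS is not exactly Foo, it does not match Foo(Foo &&rhs), so under the rules quoted here (which were relaxed in C++20) the return value is treated as an lvalue, and neither RVO nor the move constructor can be used.

That completes my summary of RVO.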

Map Reduce

tang-hi — Tue, 15 Nov 2022 00:00:00 GMT

MapReduce comes from a paper published by Google in 2004. In the paper's own words:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

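MapReduce is essentially a framework born for processing big data. It has two primitives, Map and Reduce (concepts borrowed from functional programming). Because both are highly abstract, they can be combined to express most large-scale data processing tasks.

Map: transforms one set of data into another. The task takes a single element as input and outputs a key/value pair (a tuple).

// ele is the raw data you need to process
// the output is the key/value pair you derive from the raw data
func map(String ele) {
    return (generateKey(ele), generateValue(ele))
}

Reduce: takes the results of many Map tasks, grouped by key, as input, and outputs the final result after computing over that input.

// key is one of the keys produced by the Map tasks
// values is the collection of values from all (k, v) pairs emitted by the Map tasks that belong to this key
// i.e. values = { value_1 | (key_1, value_1) was emitted and key_1 == key }
func reduce(Key key, List<Value> values) {
    result := process(values)
    return (key, result)
}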

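What problems does MapReduce solve?

It solves big-data processing in a distributed way
Fault Tolerance (it can run on commodity servers and tolerates a certain amount of machine failure)
Programmers only write the data-processing code (the map and reduce functions) and get distributed computation without dealing with distributed-systems issues

The MapReduce implementation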

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines

根据论文中的描述,MapReduce只是一个计算模型，你可以按照你自己的需要，设计并实现最适合你需求的架构.在这里我们介绍谷歌所使用的架构.

角色描述

从图中我们可以看出在整个MapReduce中有三种不同的角色

InputFile: 待处理的输入文件
Master: 调度整个任务的执行，并且检测Worker是否存活
Worker: 听从Master调度,并执行用户指定的Map或者Reduce任务

我们先详细介绍这三种不同的角色，然后描述MapReduce的总体流程

InputFile
因为MapReduce的应用场景是大数据处理，所以输入的文件较大，往往是无法完全放在内存之中，因此我们需要将输入的文件分割为大小相等的文件块(split1, split2, split3....),在谷歌的实现中,文件分割后，每份大小为16-64MB
Master
Master是一个较为特殊的角色，全局仅有一个Master,它有以下的几个职责
- 监听Worker状态,当Worker处于Idle状态时，给他分配任务(map/reduce)
- 通过心跳探活Worker,当Worker宕机时,执行容灾操作，即重新执行.
- 提供给Worker需要的信息,以使其正常运行.
Worker
Worker是实际执行用户编写程序的角色，它听从Master的调度（执行map或者reduce）,并且从Master那里获得执行程序所需要的一切信息并将执行结果反馈给Master.

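Overall flow

The user submits the job to be computed, along with the input files
After the Master receives the job, it is broken down into M Map tasks and R Reduce tasks.
The Master keeps picking machines in the Idle state and has them run Map tasks until every Map task has been executed.
A Worker scheduled to run a Map task reads the corresponding input split, parses it, and feeds it to the user's Map program. The key-value pairs emitted by Map are buffered in memory and eventually written to local disk. The key-value files on disk are partitioned by key into R files, e.g. hash(K) mod R.
When a Map task finishes, the Worker reports completion to the Master and tells it the locations of all the key-value files
Once all Map tasks are complete, the Master picks Idle Workers to run Reduce tasks and tells each one the locations of the key-value files it needs to process
When a Worker is scheduled to run a Reduce task, it first issues RPCs to read the key-value files. After it has read all of them, it sorts the key-value pairs so that pairs with the same key are grouped together. If the key-value files are too large to fit in memory, an external sort is required. Finally, the Worker aggregates the values that share a key into (Key, list of Values), passes that to the Reduce function, and writes the result to a file
When all Reduce tasks have finished, the MapReduce job is done.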
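
The flow above can be simulated in one process to make the partition and sort steps concrete. This is a sketch, not Google's implementation: in-memory lists stand in for the local files and the RPC fetch, R = 3 is an arbitrary choice, and zlib.crc32 is used as the partitioning hash only because it is stable across runs (Python's built-in hash() of a string is randomized per process).

```python
import zlib
from itertools import groupby

R = 3  # number of reduce tasks (illustrative value)

def map_fn(split):
    return [(word, 1) for word in split.split()]

def reduce_fn(key, values):
    return (key, sum(values))

def partition(key):
    # The same key always maps to the same one of the R partitions.
    return zlib.crc32(key.encode("utf-8")) % R

def run_map_task(split):
    """Return R buckets, standing in for the R local files of one map task."""
    buckets = [[] for _ in range(R)]
    for key, value in map_fn(split):
        buckets[partition(key)].append((key, value))
    return buckets

def run_reduce_task(r, all_map_outputs):
    # "Fetch" bucket r from every map task, then sort so equal keys are adjacent.
    pairs = sorted(kv for buckets in all_map_outputs for kv in buckets[r])
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

splits = ["a b a", "b c", "a c c"]
map_outputs = [run_map_task(s) for s in splits]
result = dict(kv for r in range(R) for kv in run_reduce_task(r, map_outputs))
```

Because every occurrence of a key lands in the same partition, each reduce task sees all values for its keys and no key is split across reduce tasks.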

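How does MapReduce solve these problems?

It solves big-data processing in a distributed way
With one Master scheduling many Workers, we get distributed computation. Note also that the per-Worker information the Master has to keep takes little storage, so thousands of Workers can compute at the same time without placing much load on the Master.
Fault Tolerance
- Master Fail
  The Master periodically writes checkpoints of its internal data. If the Master crashes, a new backup machine can restore its state by reading the checkpoint. Under this design, even if the Master loses some task progress after a crash (for example, it no longer knows that map_3 has already completed), re-executing those tasks does not affect the correctness of the final result.
- Worker Fail
  If a Worker crashes, every Map task that Worker had completed is reset and must be re-executed, and any Map and Reduce tasks it had not finished are also reset and must be re-executed. Completed Map tasks must be re-executed because their results are written to local disk, and those results become unreachable when the machine goes down. Completed Reduce tasks do not need to be re-executed because their results are written to the distributed file system. So simple re-execution is enough to guarantee that the distributed computation completes even when machines fail.
Programmers only need to focus on writing the data-processing logic (the map and reduce functions) and get distributed computation without dealing with distributed-systems concerns
By design, the only thing the programmer has to write is the map and reduce functions. Everything else, such as distributed scheduling and failure recovery, is handled inside MapReduce.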
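
The Worker Fail rule described above boils down to a small decision per task. The sketch below assumes a hypothetical task record (a dict with kind, state, and worker keys) and a made-up function name, handle_worker_failure; it only illustrates which tasks go back to idle.

```python
def handle_worker_failure(tasks, dead_worker):
    """Reset the tasks that must be re-executed after dead_worker is lost."""
    rescheduled = []
    for t in tasks:
        if t["worker"] != dead_worker:
            continue
        if t["kind"] == "reduce" and t["state"] == "completed":
            # Output already lives in the distributed file system; keep it.
            continue
        # Completed map output sat on the dead machine's local disk, and
        # in-progress tasks (map or reduce) never finished: back to idle.
        t["state"], t["worker"] = "idle", None
        rescheduled.append(t)
    return rescheduled

tasks = [
    {"kind": "map", "id": 0, "state": "completed", "worker": "w1"},
    {"kind": "map", "id": 1, "state": "in_progress", "worker": "w1"},
    {"kind": "reduce", "id": 0, "state": "completed", "worker": "w1"},
    {"kind": "reduce", "id": 1, "state": "in_progress", "worker": "w1"},
    {"kind": "map", "id": 2, "state": "completed", "worker": "w2"},
]
redo = handle_worker_failure(tasks, "w1")
```

Only the completed Reduce task on w1 survives the failure; the task on w2 is untouched because that machine is still alive.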

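MapReduce optimizations

Task granularity: MapReduce usually divides a job into M Map tasks and R Reduce tasks, with M and R chosen to be much larger than the number of machines. This helps load balancing and also speeds up recovery when a machine goes down. The example given in the paper is M = 200,000 and R = 5,000 with 2,000 machines.
Backup mechanism: in production we often run into the long-tail effect, where a few machines run their tasks extremely slowly and drag down the whole job. MapReduce addresses this with a backup mechanism: the same task is issued to more than one machine, and the task counts as finished as soon as any one of them returns a result.
Combiner function: a combiner is a function that runs right after Map. Take word count as an example: the map function produces a large number of (the, 1) pairs, all of which would be sent over the network to Reduce, wasting network bandwidth. A combiner can aggregate this data after map and before it is passed to reduce, which cuts the network traffic.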
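Skipping bad records: the user's Map and Reduce functions may contain bugs, so a task the Master assigns can crash the Worker; the Master then has another Worker re-execute it, which crashes too, and in the worst case the whole cluster goes down. To avoid this, Google's MapReduce registers a signal handler at startup. When a specific signal such as a segmentation fault is caught, the Worker sends a UDP packet identifying the offending record to the Master. When the Master has seen more than one failure for the same record, it tells the next re-execution of the corresponding map/reduce task to skip that record.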
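
To see what the Combiner optimization from the list above saves, here is a small word-count sketch. The function names map_fn and combine are illustrative, and the number of pairs stands in for network traffic.

```python
from collections import Counter

def map_fn(split):
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Combiner: pre-aggregate one map task's output before it is sent to reduce."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

raw = map_fn("the cat and the dog and the bird")
combined = combine(raw)
```

Here raw holds eight pairs while combined holds five, one per distinct word. This only works when the reduce operation is associative and commutative, as addition is, so that summing partial sums gives the same answer as summing the raw values.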

Don't Panic

fsync is Costly, But Don't Avoid It

What is fsync?

How Slow is fsync?

Why does it impact our system so much?

1. Too many files

2. Misuse of fsync

What can we do to alleviate its impact?

Direct I/O

io_uring

Lessons Learned

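Designing Your Own Code Color Scheme

Prerequisites

The Color Wheel

Dark and Light Colors

Complementary Colors

Triadic Colors

Split-Complementary Colors

Analogous Colors

Starting the Design

Color Configuration File

Choosing the Colors

Functions, Keywords, Variables

Types, Comments

Strings, Numbers

Complete Configuration File

Summary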

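Study Notes on Options

What Is an Option?

1. You think the price will rise; A thinks it will fall

2. You think the price will fall; A thinks it will rise

What Are Options Used For?

Speculation

Hedging

Protective Put

Covered Call

Straddle

Collar

Spread

How Are Options Priced?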

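[Translation] Binary quantization

What is binary quantization?

Details of Using Binarized Vectors

Distance Computation for Binarized Vectors

The Importance of Data Distribution for BQ

BQ for One-Dimensional Vectors

BQ for N-Dimensional Vectors

Performance Gains from BQ

Indexing Time Improvements

Memory Footprint Improvements

Latency Analysis

PQ vs. BQ

Testing BQ with Your Own Data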

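DuckDB -- Floating-Point Compression

Prerequisites

IEEE 754 Double Representation

Compression

1. Converting Floating-Point Numbers to Integers

2. Compressing in Parts

ALP

ALPRD

Summary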

DuckDB -- table's file format

Background Information

Block Types

Field Reader

Segment Tree

File Structure

Columns

table data

row group

Why only one row group pointer?

Const Column

uncompress column

RLE column and bitpacking

Dictionary column

Last

Fun Facts -- The Relationship Between CPU Utilization, Latency, and Throughput

The Relationship Between CPU Utilization and Latency

1. Little's Law
