Structure of FTS5 Index in SQLite

26 February 2023
SQLite

Recently Signal has open-sourced a SQLite extension that provides better support for non-latin languages (Chinese, Japanese, etc) in the Full-Text Search (FTS) virtual table. I was one of the engineers who worked on this extension and in the course of this endeavor I got to learn about the structure of the SQLite's FTS implementation. The existing documentation focuses mostly on API and its use patterns, and even though it covers some of the internal storage format, I found it a bit confusing. Thus this article was born. Not as alternative documentation for FTS5, but as a complement for developers who want to dive in past the officially documented bits.

How is FTS5 used #

Before anything else, though, let's see what the FTS5 looks like to the API consumer. As with many other features of SQLite we start by creating a table:

CREATE VIRTUAL TABLE search
    USING fts5(content);

This virtual table supports insertion, modification, and deletion of textual content:

INSERT INTO search(content) VALUES
    ('hello world'),
    ('halló heimur'),
    ('你好世界');

DELETE FROM search
    WHERE content IS 'halló heimur';

As well as special full-text search queries that can find an entry that has a word (term) starting with a certain prefix:

SELECT content FROM search
    WHERE content MATCH 'wo*';
-- Output: 'hello world'

In addition to that you could match on multiple words (not necessarily adjacent) and sort by the search rank. The details of this are covered by the official documentation so we won't be discussing them much more here.

Signal's FTS5 extension #

In the example above you might have noticed that we inserted a phrase in Simplified Chinese: 你好世界. What happens if we search for the second word of it (世界)?

SELECT content FROM search
    WHERE content MATCH '世界*';
-- Output: nothing

No resulting rows! The reason for that is that the default tokenizer for FTS5 has segmented 你好世界 as a single word (term), instead of either splitting it into two words (example in JavaScript):

const segmenter = new Intl.Segmenter([], {
    granularity: 'word',
});

const segments = [
    ...segmenter.segment('你好世界')
].map(s => s.segment);
console.log(segments);
// Output: [ '你好', '世界' ]

or at the very least into separate CJK symbols:

[ '你', '好', '世', '界' ]

Since FTS5 only supports indexed searches by the start of the term - it cannot search for Chinese/Japanese words in the middle of the sentence.

This is where Signal's FTS5 Extension comes to the rescue. It is a Rust crate that could be built as either static or shared library. When plugged into SQLite this library provides an alternative tokenizer conspicuously named signal_tokenizer. Creating a table with this tokenizer is fairly straightforward and one could easily verify that the CJK search works without issues now:

-- With extension loaded
-- (e.g. after ".load signal-fts5-extension.dylib")

CREATE VIRTUAL TABLE search
    USING fts5(content, tokenize='signal_tokenizer');

INSERT INTO search(content) VALUES
    ('你好世界');

SELECT content FROM search
    WHERE content MATCH '世界*';
-- Output: '你好世界'

Note that default sqlite3 shell doesn't support extensions so you'd have to build your own or use Signal's fork of better-sqlite3 which automatically loads signal_tokenizer.

FTS5 Internal Structure #

With this context in mind, we are ready to take a look at the internal structure of FTS5! Let's define the common terminology:

document - a rowid primary key along with one or more indexed text columns (for simplicity we will assume one)
term - a "word" in an indexed column. Tokenizer is responsible for segmenting the text with multiple sentences into terms
segment - a collection of pages/leaves
page (leaf) - a sorted list of tokens and their positions in the documents.

Naive Idea #

Naively, given a document we could imagine compiling a list of terms and their positions sorted lexicographically by the term:

"hello"  -> rowid1 + position of "hello" in the document
"hooray" -> rowid2 + ...
"howdy"  -> rowid3 + ...

A binary search could be then performed over the document to find the first term (and the following terms) with a given prefix. The cost of lookups would be: O(log N), but the insertions are going to be painful (O(N)) since most new entries would have to be put into the middle of the list and thus we'd have to move all older entries forward to make space for them.

B-Tree #

B-Trees were invented to compensate for that. There is a lot to be said about them, but the only relevant part for our discussion is that B-Trees work by splitting the data into pages, each page individually sorted, and then organizing the pages into a binary tree-like structure. Inserting new data then mostly touches a single page and some nodes in the tree along the way to that page. This amortizes the insertion performance to O(log N) (from O(N) of the "big sorted list" described above).

Because of these performance characteristics, B-Trees are ubiquitous in the "database world". Some version of them (e.g., B+ Tree) is used in practically every popular database.

Outline of FTS5 Approach #

While the above approach would work for full-text search, it doesn't take in account the specifics of it. Namely, users usually insert large documents that consist of many terms, and commonly perform lookup by just a few terms.

I believe this was the motivation for the approach that SQLite took with FTS5. Instead of plain B-Trees FTS5 uses a structure that has even better amortization for insertions at a price of a slight overhead during the lookup.

Roughly the data is split into three tables:

%_data - holds the header of the structure with levels description, and the contents of segments (sorted lists) of doclists (term + document rowid + positions of the term in the document). This table is indexed by rowid (i.e. an integer primary key)
%_idx - holds the triples (segment id, term, page number) to index within each segment. This table is indexed by segment id and term, so that we could find the page number in a segment efficiently
%_content - holds the full contents of the stored documents (note that this table is omitted when FTS5 is configured to use external content).

(Note: "%" is the virtual table name, e.g. search in our examples)

With the overview above, let's build the structure bottom up. When a new document is inserted into the table:

It is tokenized into multiple terms and their positions within the document
A new segment is created for all these terms, and it is essentially what we called "big sorted list" above. A segment could be one page if it is small (less than 1000 bytes, see pgsz config option), or it could be split into multiple sorted pages (leaves)
If the segment is split into multiple pages - we take a prefix of the first term from each sorted page and put them into %_idx table. Thus we indirectly create a B-Tree! %_idx becomes a tree part of B-Tree (because it is indexed by a B+-Tree under the hood), and this tree points to pages in the %_data table. Each page small enough that we could search it for terms and their documents
Finally, if FTS5 isn't running in the "external content" mode - we insert the original (non-tokenized) contents of the document into the %_content table.

With this structure it is easy to see how FTS5 performs a search over a single segment, but what if we insert multiple documents into the index over time?

Merging segments #

Each time we insert a new document - we create a new segment with its terms (just as described above). If the number of segments is small, when looking up a term in the index we'd just iterate through all of them starting with the newest segments! This is the insertion amortization technique in the nutshell. We just create a brand new B-Tree (through a combination of rows in %_idx and %_data) and get a forest of them.

When the number of segments becomes too large, however, B-Tree forest can quickly become impractical. Thus every now and then (every 64 insertions in SQLite) SQLite has to perform "merges" of some of these segments and create larger (but more efficient) B-Trees. There are various ways to configure how these merges work, and how much data is merged. I previously covered the mechanics of the merges so we won't concentrate on them here.

Every time a merge is performed - we have to combine together two segments and move all their data into a new one. This sounds like it could be... slow, right? In fact it very well would be if not for a trick that FTS5 employs! Instead of just merging everything together any time we get too many segments - we assign each segment a level. They all start with level 0 and the merges only affect segments on the same level. When they are finally merged - the resulting segment is one level above the source segments. Since we merge every 64 insertions, with each new level segments become ~64 times larger. This way we reduce the merge frequency for older and larger segments on a higher level, and perform most merges on the lower levels when they become overcrowded.

Inspecting the Structure #

Let's see how this works in practice. There is a "secret" test-only function in FTS5 named fts5_decode that is unfortunately only available when SQLITE_TEST was defined at the build time. sqlite3 shell isn't supposed to be built with this define, but with a small patch we could get it running:

CREATE VIRTUAL TABLE search
    USING fts5(content);

-- Limit max page size for instructiveness
INSERT INTO search(search, rank) VALUES
    ('pgsz', 32);

INSERT INTO search(content) VALUES
    ('hello world');

SELECT rowid, fts5_decode(rowid, block)
    FROM search_data;
-- Output (with manual indenting):
--
-- 1|{averages} 1 2
-- 10|{structure} {lvl=0 nMerge=0 nSeg=1 {id=1 leaves=1..1}}
-- 137438953473|{segid=1 h=0 pgno=1}
--     term=0hello id=1 nPos=1 2
--     term=0world id=1 nPos=1 3

If we ignore the {averages} (it is used for ranking), we see that in the {structure} we have one level (lvl=0) with one segment (nSeg=1 {id=1 leaves=1..1}) that consists of a single page (1..1).

Row 137438953473 (0x2000000001, see the official documentation on the value) has the actual segment page which contains a sorted list of terms and their positions within the document (as promised!). They all start with "0", because FTS5 supports optional prefix indexes which require this encoding.

We can insert more data into the index:

INSERT INTO search(content) VALUES
    ('how was your day');

SELECT rowid, fts5_decode(rowid, block)
    FROM search_data;
-- Output (from now on without {averages}, and
--   with hex segment ids):
--
-- 10|{structure} {lvl=0 nMerge=0 nSeg=2
--   {id=1 leaves=1..1}
--   {id=2 leaves=1..2}}
-- 0x2000000001|{segid=1 h=0 pgno=1}
--   term=0hello id=1 nPos=1 2
--   term=0world id=1 nPos=1 3
-- 0x4000000001|{segid=2 h=0 pgno=1}
--   term=0day id=2 nPos=1 5
--   term=0how id=2 nPos=1 2
--   term=0was id=2 nPos=1 3
-- 0x4000000002|{segid=2 h=0 pgno=2}
--   term=0your id=2 nPos=1 4

It is easy to see that the segment 1 wasn't changed, and we added one more level zero segment ({id=2 leaves=1..2}) with two pages (1..2, 0x4000000001 and 0x4000000002) because the terms didn't all fit into a single page. Each page is again a sorted list, and they are now inserted into the %_idx table:

SELECT * FROM search_idx;

-- Output (with column names):
-- segment id | term | page number
-- 1          |      | 2
-- 2          |      | 2
-- 2          | 0y   | 4

One can see that for the newly inserted segment 2 we have two entries (one per each page), and that they correctly start with "0y" term prefix which lets us quickly find the page if we are searching by "y*".

To finish this, let's simulate merging of the segments. We could insert 62 more entries to trigger it, but a similar result could be achieved by running optimize:

INSERT INTO search(search) VALUES
    ('optimize');

SELECT rowid, fts5_decode(rowid, block)
    FROM search_data;
-- Output:
-- 10|{structure}
--   {lvl=0 nMerge=0 nSeg=0}
--   {lvl=1 nMerge=0 nSeg=1 {id=3 leaves=1..3}}
-- 0x6000000001|{segid=3 h=0 pgno=1}
--   term=0day id=2 nPos=1 5
--   term=0hello id=1 nPos=1 2
--   term=0how id=2 nPos=1 2
-- 0x6000000002|{segid=3 h=0 pgno=2}
--    term=0was id=2 nPos=1 3
--    term=0world id=1 nPos=1 3
--    term=0your id=2 nPos=1
-- 0x6000000003|{segid=3 h=0 pgno=3}
--    4

SELECT * FROM search_idx;
-- Output:
-- segment id | term | page number
-- 3          |      | 2
-- 3          | 0w   | 4

As promised, the segments 1 and 2 got merged into a segment 3 on the newly created level 1. The result has three pages, where the last one has the left over term position that didn't fit into page 2.

If we insert more data into the table - it will be put again on the level 0 so the next automatic merge would only work with the segments of that level.

Closing Note #

I don't know about you, but for me this was quite a wild ride! B-Trees, tokenizers, merges... There's certainly a lot going on under the hood of one of the most popular databases in the world. As it often is in engineering, by reusing and combining simple blocks we can create a very complex structure that might be hard to understand when approaching head on. In this article I attempted to deconstruct FTS5 down to its roots (or at least a few levels lower), but there is still much more that could be said (like compact encoding of terms and positions in the pages).

If you have any requests or ideas - feel free to "toot" at me on Mastodon. Thanks for spending your time reading this!

Previous: HashWick V8 Vulnerability