Summary
I remember when I first started working on server-side applications, I didn’t get what the big deal about databases was. Once I started working on real-life applications, I realised that databases are basically magic. Even so, it’s easy to treat the database as a black box: you make sure your tables are indexed sensibly and your queries aren’t doing anything silly, and you leave it at that.
Postgres stores all its data in a directory sensibly called /var/lib/postgresql/data 1. Let’s spin up an empty Postgres installation with Docker and mount the data directory in a local folder so that we can see what’s going on in there.
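Something like the following should do the trick – the container name, password and image tag here are just my own choices:

    # Run a disposable Postgres container, mounting its data directory
    # into ./pg-data on the host so we can poke around in it.
    docker run --rm -d \
      --name pg-internals \
      -e POSTGRES_PASSWORD=password \
      -v "$PWD/pg-data:/var/lib/postgresql/data" \
      postgres:16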
Transactions are units of work that are executed and committed, and that succeed or fail, as a single unit. If a transaction fails, it is rolled back and all the changes made in the transaction are undone. Each transaction sees its own copy of the database, called its snapshot.
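Here’s a minimal sketch of what that means in practice; the accounts table and its values are invented purely for illustration:

    # Toy example: the accounts table is made up purely for illustration.
    docker exec -i pg-internals psql -U postgres \
      -c "CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
          INSERT INTO accounts VALUES (1, 500), (2, 500);" \
      -c "BEGIN;
          UPDATE accounts SET balance = balance - 100 WHERE id = 1;
          UPDATE accounts SET balance = balance + 100 WHERE id = 2;
          COMMIT;"
    # If anything inside the BEGIN ... COMMIT block had failed, a ROLLBACK
    # would undo both updates; neither would ever be visible on its own.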
base/ contains a subdirectory for each database. Inside each subdirectory are the files with the actual data in them. We’ll dig into this more in a second.

pg_wal/ is where the Write-Ahead Log (WAL) files are stored. The Postgres docs explain how the WAL works in full, but in short, Postgres writes the changes it’s going to make to the WAL before making them. If it crashes, it can just re-read and re-run all the operations in the WAL to get back to the expected database state.

pg_snapshots/ contains exported snapshots, used e.g. by pg_dump, which can dump a database in parallel.

pg_replslot/ contains data for the various replication slots, which are a way of ensuring WAL entries are kept for particular replicas.

pg_twophase/ contains state files for prepared transactions. PREPARE TRANSACTION dissociates a transaction from the current session, which is useful for two-phase commits.
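With the data directory mounted locally, we can peek at this layout from the host (paths assume the Docker setup from earlier):

    ls pg-data/        # base/, pg_wal/, pg_snapshots/, pg_replslot/, pg_twophase/, ...
    ls pg-data/base/   # one numbered subdirectory per database, named after its OID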
When you start up a fresh Postgres server, Postgres will automatically create 3 databases for you: postgres, plus template0 and template1, which are templates used to create future databases. These built-in databases don’t have anything interesting in them and are generally boring, so let’s create ourselves a new database and put some data in.
We’re going to download some country data into our container and load it into a new database. We could use a tool like psql or TablePlus to examine the database, but we’ll just exec into the container and use psql from inside the container. This way, we don’t have to worry about mapping ports or mismatched psql and Postgres server versions.
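As a sketch, assuming a CSV dataset – the URL, the database name blog and the countries table schema below are placeholders, not the exact data used here:

    # Placeholder URL and schema -- substitute the real dataset.
    curl -sL https://example.com/countries.csv -o countries.csv
    docker cp countries.csv pg-internals:/tmp/countries.csv

    docker exec -i pg-internals psql -U postgres -c "CREATE DATABASE blog;"
    docker exec -i pg-internals psql -U postgres -d blog \
      -c "CREATE TABLE countries (id serial PRIMARY KEY, name text, population bigint);" \
      -c "\copy countries (name, population) FROM '/tmp/countries.csv' CSV HEADER"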
[truncated directory listing of base/<oid>: free space map files such as 1247_fsm, 1249_fsm, 2200_fsm, 16389_fsm, 16390_fsm, 16393_fsm and 16395_fsm, alongside pg_filenode.map and pg_internal.init]
Postgres splits the data files for large tables into multiple files called segments, to avoid issues some operating systems have with large files. By default these are 1 GB in size, although this is configurable at build time.
The table’s heap consists of many pages (also known as blocks) stored sequentially in the file. By default, a page is 8 KB in size, so we’d expect all our data files to be multiples of 8 KB.
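You can confirm the page size on a running server (SHOW block_size reports it in bytes):

    docker exec -i pg-internals psql -U postgres -c "SHOW block_size;"
    # Reports 8192 on a default build, i.e. 8 KB pages.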
You might be thinking – why not just have one page per segment? The answer is that each page is written in one atomic operation, and the larger the page, the more likely a write will fail partway through. In our example here, our main table is 2.7 GiB, which requires 3 separate segments: two full 1 GiB files plus a third for the remaining 0.7 GiB.
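This shows up on disk as multiple files sharing the table’s filenode number, with numeric suffixes for the extra segments – the OIDs in this sketch are examples, not the actual values:

    # OIDs below are examples; check pg_class/pg_database for yours.
    ls -la pg-data/base/16384/16389*
    # e.g. 16389 (1 GiB), 16389.1 (1 GiB), 16389.2 (the remainder),
    # plus auxiliary files like 16389_fsm and 16389_vm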
The page header contains, among other things, the offset to the end of the line pointers (a.k.a. “lower”) and the offset to the start of the tuple data (a.k.a. “upper”). The free space in the page runs from “lower” to “upper”, so you can calculate the remaining free space by doing “upper” − “lower”. Postgres ships with an in-built extension called pageinspect for examining page contents, including the header.
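We can read those header fields directly with pageinspect, using the countries table from earlier as the example:

    docker exec -i pg-internals psql -U postgres -d blog \
      -c "CREATE EXTENSION IF NOT EXISTS pageinspect;" \
      -c "SELECT lower, upper, upper - lower AS free_bytes
          FROM page_header(get_raw_page('countries', 0));"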
Postgres disables checksum protection by default for performance reasons. The special data section is not used for table pages, only for other page types like indexes.
Here we can see how the data for Equatorial Guinea is stored on disk – we can see the actual data for the item, which is pretty cool. The ctid (Current Tuple ID)4 tells you where the item is located in the form (page index, item index within page); pages are 0-indexed and items are 1-indexed, so (1, 1) means the first item in the second page.
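You can query ctid like any ordinary column – assuming the countries table from before:

    # ctid is a hidden system column available on every table.
    docker exec -i pg-internals psql -U postgres -d blog \
      -c "SELECT ctid, name FROM countries WHERE name = 'Equatorial Guinea';"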
Postgres uses MVCC (Multiversion Concurrency Control) to handle concurrent access to data. The “multiversion” here means that when a transaction comes in and modifies a row, it doesn’t touch the existing tuples. Instead, it creates a new tuple, containing the modified row, at the end of the last page.
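We can watch this happen through the xmin and xmax system columns, which record the transactions that created and superseded each tuple version (again assuming our countries table):

    docker exec -i pg-internals psql -U postgres -d blog \
      -c "UPDATE countries SET population = population + 1
          WHERE name = 'Equatorial Guinea';" \
      -c "SELECT ctid, xmin, xmax, name FROM countries
          WHERE name = 'Equatorial Guinea';"
    # The ctid changes after the UPDATE: a brand-new tuple was written
    # rather than the old one being modified in place.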
The data is still there! That’s because Postgres doesn’t bother actually deleting the data, it just marks the tuple as deleted. But you might be thinking: if rows are constantly getting deleted and added, you’ll end up with ever-growing segment files full of deleted data. This is where vacuuming comes in.
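A quick sketch of that in action, again against our example table:

    docker exec -i pg-internals psql -U postgres -d blog \
      -c "DELETE FROM countries WHERE name = 'Equatorial Guinea';" \
      -c "VACUUM countries;" \
      -c "SELECT pg_size_pretty(pg_relation_size('countries'));"
    # Plain VACUUM makes the dead tuples' space reusable without shrinking
    # the file; VACUUM FULL rewrites the table and returns space to the OS.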
In certain rare circumstances, knowing how this all works can actually be quite useful for data recovery, and it helps you understand how Postgres queries your data on disk and how MVCC works.