types of scans
2022-10-24 ยท 3 min read
When analyzing Postgres query plans, you'll see a tree of different kinds of "scans". These scans determine how the DB plans to look through the different internal datastructures and indexes in order to fetch the data needed to execute your query.
For example, we can use explain
to see that this query uses a single "Index Scan":
psql$ explain (ANALYZE, VERBOSE, BUFFERS)
select "user"."user_pk" from "user"
where "user"."node_pk" = '028be3dce0eb45b794fc083df3f84f4f746724a86b161592463565e6343ff83ecf';
QUERY PLAN
----------------------------------------------------------------------
Index Scan using user_node_pk_key on public."user" (cost=0.42..8.44 rows=1 width=65) (actual time=0.071..0.073 rows=1 loops=1)
Output: user_pk
Index Cond: (("user".node_pk)::text = '028be3dce0eb45b794fc083df3f84f4f746724a86b161592463565e6343ff83ecf'::text)
Buffers: shared hit=4
Planning:
Buffers: shared hit=112
Planning Time: 1.194 ms
Execution Time: 0.109 ms
(8 rows)
internal data structures #
Before jumping into the different scans, we need to quickly cover the internal data structures that Postgres uses to store table and index data.
heap #
Common storage area for the whole rows in the table. The rows are grouped into pages (default: 8 KiB / page). Each row has its index in the heap, called the TID.
TID #
The TID or tuple identifier is a 6-byte struct that describes the unique storage location for each row / tuple:
struct Tid {
/// The index of the page in the heap containing this tuple.
page_idx: u32,
/// The index of this tuple inside the page.
in_page_idx: u16,
}
index storage #
Each index has its own storage separate from the heap, which stores only the values (columns, materialized values, etc...) contained by the index. Also divided into (default) 8 KiB pages like the heap.
scans #
sequential scan (Seq Scan) #
- Sequentially scan all item pointers in all pages of the corresponding table(s).
- For small tables, sequential scans are often faster than index scans, especially if the whole table fits in a few pages.
- For large tables, sequential scans can be a performance disaster if the query has to load GiB's of data just to find a few items.
index scan #
- Depends on the specific type of index (e.g., btree, hash, GIS, ...).
- Scan and fetch required data from the index, including the TID of each data in the heap.
- Then use the TID to fetch the whole data for the relevant items.
- The row fetches from the heap will often look like random IO, which is often slower than sequential IO (esp. on spinning disk drives).
index-only scan #
- If the index contains all the data needed to fulfill the query, then the query engine never actually needs to follow the TID indirection to load rows from the heap, which is usually faster.
- Maintaining the index can make inserts/updates/deletes slower, since more data needs to be stored in the index.
bitmap scan #
- Bitmap Index Scan: First, perform an index scan; but, instead of fetching each item from the heap one-at-a-time while scanning the index, you rather build a bitmap of heap page idxs from the TID.
- Bitmap Heap Scan: Using the bitmap, you perform a heap scan and select only heap pages matching the bitmap.
- The selected heap pages are then filtered/processed for the query.