Understand EXPLAIN query plans (part 1 of 2)
Have you ever had a slow-running SQL query but weren’t sure where to even start looking for optimizations? If only there were some way to pinpoint exactly which parts of your query were slowing the whole thing down, so that you could focus your efforts there rather than making tweaks all over the place.
That’s exactly what the EXPLAIN statement allows us to do. EXPLAIN gives you the query plan produced by your database’s query planner, along with estimates of which parts of your query are the most costly. However, deciphering what each part of the query plan is telling us can be a bit tricky at first.
In this article, we’ll walk through how we can understand what the query plan is telling us and how we can use that information to improve the performance of our queries.
This article is part 1 of 2 and focuses on understanding the query planner output using the EXPLAIN statement.
We’ll be using Postgres as our database here, so just be aware that there may be some minor differences, especially around Postgres-specific features, if you’re using a different RDBMS. If you’re using a data warehouse, you may actually have a separate set of tools for analyzing query performance, since RDBMSs and data warehouses go about storing and retrieving data in quite different ways.
To start things off, let’s create a basic table that we can use throughout to write queries against and analyze performance.
CREATE TABLE user_event_log (
    id SERIAL PRIMARY KEY,
    event_id INT,          -- assumed type; referenced in the insert below
    event_metadata JSONB,  -- assumed type for the event payload
    user_id INT,           -- assumed type; referenced in the insert below
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
Although we won’t create them, we’ll assume that event_id and user_id are foreign key references to a separate event and user table.
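If you’re curious, here’s a rough sketch of what those foreign keys could look like if the events and users tables did exist. The table and constraint names here are hypothetical, and we won’t actually run this:

-- Hypothetical: assumes events(id) and users(id) tables exist; not part of our setup
ALTER TABLE user_event_log
    ADD CONSTRAINT user_event_log_event_id_fkey FOREIGN KEY (event_id) REFERENCES events (id),
    ADD CONSTRAINT user_event_log_user_id_fkey FOREIGN KEY (user_id) REFERENCES users (id);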
Next, let’s insert some data.
INSERT INTO user_event_log (event_id, event_metadata, user_id) VALUES
    (1, '{"field": "some value"}', 1),
    (2, '{"field": "some value"}', 1);
Although we only have two rows at the moment, there’s actually already a lot of interesting information that we can get from here, so let’s start using our EXPLAIN query.
EXPLAIN SELECT *
FROM user_event_log
WHERE user_id = 1;
And this is our output:
Seq Scan on user_event_log (cost=0.00..21.50 rows=5 width=60)
Filter: (user_id = 1)
Alright, so what does this mean? Let’s take it bit-by-bit, starting at the top with the actual EXPLAIN keyword.
The EXPLAIN keyword gives us the query plan, i.e. the approach the query planner deems optimal and will take to perform this query if it’s executed. In addition, we also get some estimated cost metrics.
In more complex plans, which we’ll see later, we can get multiple nodes as output, but for this query we just get a single node.
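As a quick aside (not something we need for the rest of this walkthrough), a query that does more work, such as an aggregation, will typically produce a plan with more than one node, for example an aggregate node sitting on top of a scan node. The exact plan and cost numbers depend on your data and settings, so treat this purely as an illustration:

-- A grouped query like this would typically produce a multi-node plan,
-- e.g. an aggregate node on top of a Seq Scan node
EXPLAIN SELECT user_id, COUNT(*)
FROM user_event_log
GROUP BY user_id;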
The first thing we see is the Seq Scan node, which means we have a full table scan: our query is going to go through every record in this table. Yikes!
Now this obviously isn’t a problem when we only have two rows, but imagine having a table with 10 million rows, and you now run this query multiple times per second. That means tons of data will be read from disk into memory, and only the relevant bits are kept.
That’s also where the Filter: (user_id = 1) line comes in.
The full picture of what’s happening here is that our query reads every record of the table from disk, keeps any records that fit this condition to be returned with the query, and discards the rest.
We’ll look at how we can fix this in a second. Some of you probably already know that the solution here is indices, but let’s take it step-by-step and look at the rest of the output in the parentheses.
We also see this: cost=0.00..21.50. What does this mean?
This is Postgres’ estimate as to the cost of this part of the query. The cost is in arbitrary units and doesn’t mean much by itself, but it can be pretty useful to compare relative costs between different parts of a query. In our case we don’t have multiple parts yet, but this can be useful later.
0.00 is the estimated start-up cost. The start-up cost is a measure of how long it will take until this part of the query produces its first row. In this case it’s 0, but we’ll see later on that this doesn’t always start at 0.
21.50 is the total cost of this part of the query, from when it starts to when it finishes.
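As a rough sketch of where these numbers come from (this is a simplification of the planner’s real cost model): for a sequential scan with a filter, the total cost is roughly the number of pages read times a per-page constant, plus the number of rows scanned times a couple of per-row CPU constants. You can inspect the constants involved yourself:

-- Roughly: seq scan cost = (pages read * seq_page_cost)
--                        + (rows scanned * cpu_tuple_cost)
--                        + (rows scanned * cpu_operator_cost, for the filter)
-- The relevant settings, shown with their Postgres defaults:
SHOW seq_page_cost;      -- 1.0
SHOW cpu_tuple_cost;     -- 0.01
SHOW cpu_operator_cost;  -- 0.0025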
It’s important to note that both cost components are cumulative: the cost of a higher node includes the costs of the nodes beneath it.
Finally, Postgres estimates that it will return 5 rows, and that each row is 60 bytes in size (what it calls the width). The row estimate is obviously wrong, but that’s fine; we’ll get into row estimates, and when you should be worried about them, a bit later.
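One reason estimates like this can be off is stale table statistics: the planner bases its guesses on statistics that autovacuum normally refreshes in the background, and our freshly created two-row table simply hasn’t been analyzed yet. If you want to refresh the statistics by hand, you can run:

-- Refresh the planner's statistics for this table so its row estimates
-- better reflect the actual data
ANALYZE user_event_log;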
So now that we’ve seen our query plan, let’s optimize things a bit. When you see a full table scan, you usually want to think about creating an index on the column you’re filtering on.
CREATE INDEX event_log_user_id_idx ON user_event_log(user_id);
Our query plan now looks like this:
Seq Scan on user_event_log (cost=0.00..1.02 rows=1 width=60)
Filter: (user_id = 1)
You’re probably wondering now, wait, isn’t that basically the same plan we had previously?
Yes, and we’ve actually reached an interesting edge-case here.
Postgres now has to decide between using the index and using the full table scan, and it decided that the full table scan is actually the more efficient operation in this case. Usually, once a query ends up returning at least around 10% of the rows in your table, a full table scan becomes more efficient than an index scan.
This is because using an index requires several IO operations: first reading the index itself, then fetching the relevant records it points to, whereas a full table scan just reads the records in directly.
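If you want to see this trade-off for yourself, you can discourage sequential scans in your current session and re-run the EXPLAIN. This is purely for experimentation, not something to do in production:

-- For experimentation only: strongly discourage seq scans in this session,
-- then compare the estimated cost of the resulting index scan plan
SET enable_seqscan = off;
EXPLAIN SELECT * FROM user_event_log WHERE user_id = 1;
-- Switch it back on afterwards
SET enable_seqscan = on;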
If we add some more data into our database using a simple Python script