> 2021年12月14日信息消化 ### postSQL: SOME INDEXING BEST PRACTICES > MEMO > 数据量过滤很少的情况,index并不会有太大帮助。 > 对于永远使用的filter,比如软删除flag,可以用[partial indexes](https://www.postgresql.org/docs/current/indexes-partial.html). Origin: [SOME INDEXING BEST PRACTICES](https://www.pgmustard.com/blog/indexing-best-practices-postgresql) #### Don’t index every column The **more write-heavy** a table is, the more carefully you should think about the benefit of adding an index. If you’ve inherited a database with too many indexes, you can start to understand how much each one is used with the view `pg_stat_user_indexes`. Remember to check any read replicas too! #### Index columns that you filter on The best candidates for indexes are columns that you **filter on regularly**. Primary keys are indexed by Postgres automatically, but foreign keys, and other columns, are not. Let’s consider a simple users table with only two columns: ```sql CREATE TABLE users ( id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, email text NOT NULL ); INSERT INTO users (email) SELECT n || '@gmail.com' FROM generate_series(1, 100000) AS n; ANALYZE users; ``` If we want to look up a user by their email address, here is the query plan: ```sql EXPLAIN (ANALYZE, BUFFERS) SELECT id, email FROM users WHERE email = '56789@gmail.com'; ------------------------------------------------------------ Seq Scan on users (cost=0.00..1887.00 rows=1 width=23) (actual time=21.498..34.505 rows=1 loops=1) Filter: (email = '56789@gmail.com'::text) Rows Removed by Filter: 99999 Buffers: shared hit=637 Planning Time: 0.113 ms Execution Time: 34.539 ms ``` Now let’s add an index on the email column and get a new query plan: ```sql CREATE INDEX users_email ON users(email); EXPLAIN (ANALYZE, BUFFERS) SELECT id, email FROM users WHERE email = '56789@gmail.com'; ------------------------------------------------------------ Index Scan using users_email on users (cost=0.42..3.44 rows=1 width=23) (actual time=0.076..0.079 rows=1 loops=1)| Index Cond: (email = '56789@gmail.com'::text) Buffers: shared hit=4 Planning Time: 0.219 ms Execution Time: 0.116 ms ``` We can see from the query plans that using the index is 150x more efficient (4 blocks of data, instead of 637) and 100x faster (0.335ms total instead of 34.652ms). That’s quite a difference, and this is only on 100k rows of data. **Naturally, the lower the proportion of rows filtered out, the less difference an index will make.** Once the ratio gets too high, it might not even be faster than a sequential scan – as we saw in our post [why isn’t Postgres using my index](https://www.pgmustard.com/blog/why-isnt-postgres-using-my-index). #### Only index data that you need to look up If you have a proportion of a table that you rarely look up, and almost always filter out, there is little benefit to having it indexed. A common example given is a table containing soft-deleted data, where queries will normally contain `WHERE deleted_at IS NULL` For these cases, Postgres has [partial indexes](https://www.postgresql.org/docs/current/indexes-partial.html). These are smaller, faster, and don’t need to be updated as often as full indexes. You do need to be careful, though, as they can only be used for queries that Postgres can guarantee matches the WHERE condition. #### Consider other index types You can get quite far with the trusty b-tree index – which is the default for good reasons! The short version is that some index types can be faster, or much smaller, for specific use cases, while others can enable things that b-trees don’t support, like **full-text search**, or spatial queries. #### Use indexes to pre-sort data Sorting a large dataset is an expensive operation, in both time and memory. Sorting the same data over ad over can be avoided by adding an index with the order needed. As of PostgreSQL 13 (Sep 2020), it can even do an [Incremental Sort](https://www.pgmustard.com/docs/explain/incremental-sort), making use of an index that only satisfies the first part(s) of the sort order. For an added performance boost, it is great to [load data in a sorted fashion](https://hakibenita.com/sql-tricks-application-dba#always-load-sorted-data). This will reduce the number of page reads needed for queries using the sort order, making your queries more efficient, and again, faster. #### Use multi-column indexes, but sparingly Postgres lets you add more than one column to an index. You can order by a second (and third, etc) column, or include the additional column(s) in the payload to enable index-only scans. However, [as we noted when we wrote about multi-column indexes](https://www.pgmustard.com/blog/2019/05/20/multi-column-indexes), there are downsides to consider. Postgres does have other options, so our advice is to use multi-column indexes, but to do so sparingly. #### Look after your indexes Indexes can require a bit of looking after (or maintenance) to stay performant over time. Vacuum (and autovacuum) will clear out deleted (and updated) entries, but it won’t re-combine pages that are mostly empty. As such, over time your index can grow in size, and reduce in efficiency: an effect often referred to as bloat. You can monitor for signs of bloat, and you can also see clues in query plans (when you use BUFFERS). It is one of the types of issue pointed out in our product, [pgMustard](https://www.pgmustard.com/). To completely remove bloat from an index, you can use `REINDEX`. As of Postgres 12, we have the very useful `REINDEX CONCURRENTLY`, but you do need to be careful if that ever fails, as it can leave behind invalid indexes. [pg_repack](https://github.com/reorg/pg_repack) and [pg_squeeze](https://github.com/cybertec-postgresql/pg_squeeze) are other options that involve minimal locking. ### Tackle.to https://2021.tackle.to/ #### Work Calendar ![image-20211216000102625](https://raw.githubusercontent.com/Phalacrocorax/memo-image-host/master/PicGo/image-20211216000102625.png) - **397** is the total number of meetings - **Friday** is the busiest day - **10AM** (JST) is the busiest hour - **3** of these meetings were outside of work hours - **1** meetings on average on weekdays - **19%** Average work hours spent in meetings #### Personal Calendar ![image-20211215235450091](https://raw.githubusercontent.com/Phalacrocorax/memo-image-host/master/PicGo/image-20211215235450091.png) - **44** is the total number of meetings - **40** Hours in meeting with tanji-san lol - **Saturday** is the busiest day - **11AM** (JST) is the busiest hour - **33** of these meetings were outside of work hours - **29** were on weekends ### Misc - [Architecting Distributed Systems: The Importance of **Idempotence**](https://betterprogramming.pub/architecting-distributed-systems-the-importance-of-idempotence-138722a6b88e) - 幂等性:Idempotence (or *idempotency* if you like), is a characteristic of an operation, such as an HTTP endpoint or an RPC call, allowing to **execute it multiple times and observe the same result as if the operation was only applied once.** - **PUT** over **POST** - [Marketing Is Scary for a Solo Developer](https://raumet.com/marketing) - **Getting over the fear of exposure**: do something else to get my mind off it. - **Reminding myself that it's necessary**: It's just the way it is. > HN [Comments](https://news.ycombinator.com/item?id=29538355) > > I think marketers and developers need a stronger culture between ourselves. The tools of marketers (data mining, consumer behavior analysis, surveys, market testing, etc) aren't just a veneer to be applied at the completion of a development project. Marketing, when done correctly, informs UX. Marketing can provide developers with insights into the problems they should try to solve. > > From what I can tell, many developers view the development process as a sacred protected space for creation. This is not the optimal means to delivering a high-impact product. Developers should be asking themselves during the development process: "Is this feature going to excite a group of users, and do we have some statistical mechanism to predict that excitement?" > > When you create something that you already know will excite people, the product launch becomes much simpler. - What Makes TikTok Tick | [The Batch #122](https://read.deeplearning.ai/the-batch/issue-122/) - ![tiktok3](https://raw.githubusercontent.com/Phalacrocorax/memo-image-host/master/uPic/tiktok3.gif)