Original article: https://severalnines.com/database-blog/guide-partitioning-data-postgresql
For databases with extremely large tables, partitioning is a wonderful and crafty trick for database designers to improve database performance and make maintenance much easier.
The maximum table size allowed in a PostgreSQL database is 32TB; however, unless it's running on a not-yet-invented computer from the future, performance issues may arise on a table with only a hundredth of that space.
Partitioning splits a table into multiple tables, and is generally done in a way that applications accessing the table don't notice any difference, other than faster access to the data they need.
By splitting the table into multiple tables, the idea is that queries need to scan much smaller tables and indexes to find the data required.
Regardless of how efficient an index strategy is, scanning an index on a 50GB table will always be much faster than scanning an index on a 500GB table. This applies to table scans as well, because sometimes table scans are simply unavoidable.
When introducing a partitioned table to the query planner, there are a few things to know and understand about the query planner itself.
Before any query is actually executed, the query planner will take the query and plan out the most efficient way it will access the data.
By having the data split up across different tables, the planner can decide what tables to access, and what tables to completely ignore, based on what each table contains.
This is done by adding constraints to the split-up tables that define what data is allowed in each table; with a good design, we can have the query planner scan a small subset of data rather than the whole thing.
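As a quick illustration (a minimal sketch, assuming date-constrained child tables like the ones built later in this guide), this pruning is visible in the query plan:

-- With constraint_exclusion set to 'partition' (the default), the planner
-- compares the WHERE clause against each child table's CHECK constraint and
-- includes only the children that could hold matching rows:
EXPLAIN SELECT * FROM data_log WHERE date >= '2018-02-01' AND date < '2018-03-01';
-- The resulting plan would scan only the 2018 child table, skipping the rest.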
Partitioning can drastically improve performance on a table when done right, but if done wrong or when not needed, it can make performance worse, even unusable.
There is no real hard-and-fast rule for how big a table must be before partitioning is an option, but based on database access trends, database users and administrators will start to see performance on a specific table degrade as it gets bigger.
In general, partitioning should only be considered when someone says “I can’t do X because the table is too big.” For some hosts, 200 GB could be the right time to partition, for others, it may be time to partition when it hits 1TB.
If the table is determined to be “too big”, it’s time to look at the access patterns. Either by knowing the applications that access the database, or by monitoring logs and generating query reports with something like pgBadger, we can see how a table is accessed, and depending on how it’s accessed, we can have options for a good partitioning strategy.
To learn more about pgBadger and how to use it, please check out our previous article about pgBadger.
Updated and deleted rows result in dead tuples that ultimately need to be cleaned up.
Vacuuming tables, whether manually or automatically, goes over every row in the table and determines if it is to be reclaimed or left alone.
The larger the table, the longer this process takes, and the more system resources are used. Even if 90% of a table is unchanging data, it must be scanned each time a vacuum is run.
Partitioning can reduce the tables that need vacuuming to smaller ones, cutting the amount of unchanging data that must be scanned, shortening overall vacuum time, and freeing up more system resources for user access rather than system maintenance.
If data is deleted on a schedule, say data older than 4 years gets deleted and archived, this could result in heavy-hitting DELETE statements that take time to run and, as mentioned before, create dead rows that need to be vacuumed.
If a good partitioning strategy is implemented, a multi-hour DELETE statement with vacuuming maintenance afterward could be turned into a one-minute DROP TABLE statement on an old monthly table with zero vacuum maintenance.
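As a rough sketch of that difference (assuming a data_log table partitioned by month into child tables with illustrative names like data_log_2014_01):

-- Without partitioning: a long-running DELETE that leaves dead tuples behind
DELETE FROM data_log WHERE date < NOW() - INTERVAL '4 years';
VACUUM data_log;

-- With monthly partitions: the expired child table is dropped outright,
-- which is nearly instant and leaves nothing for vacuum to clean up
DROP TABLE data_log_2014_01;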
The keys for access patterns are in the WHERE clause and JOIN conditions. Any time a query specifies columns in the WHERE and JOIN clauses, it tells the database “this is the data I want”.
Much like designing indexes that target these clauses, partitioning strategies rely on targeting these columns to separate data and have the query access as few partitions as possible.
Examples:
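(The queries below are illustrative, using the website_views table examined in the next section; the visitors table and its columns are hypothetical.)

-- The WHERE clause always filters on the timestamp column:
SELECT COUNT(*) FROM website_views
WHERE view_date >= '2018-01-01' AND view_date < '2018-02-01';

-- JOIN conditions point at the same columns a partitioning strategy should target:
SELECT v.view_date, u.user_agent
FROM website_views v
JOIN visitors u ON u.visitor_id = v.visitor_id
WHERE v.view_date = '2018-01-15';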
The most common columns to focus on for partitioning are timestamps, since a huge chunk of the data is usually historical information, and it will likely have a rather predictable spread across different time groupings.
Once we identify which columns to partition on, we should look at the spread of the data, with the goal of creating partition sizes that spread the data as evenly as possible across the different child partitions.
severalnines=# SELECT DATE_TRUNC('year', view_date)::DATE, COUNT(*) FROM website_views GROUP BY 1 ORDER BY 1;
 date_trunc |  count
------------+----------
 2013-01-01 | 11625147
 2014-01-01 | 20819125
 2015-01-01 | 20277739
 2016-01-01 | 20584545
 2017-01-01 | 20777354
 2018-01-01 |   491002
(6 rows)
In this example, we truncate the timestamp column to a yearly value, resulting in about 20 million rows per year.
If all of our queries specify a date or date range, and the dates specified usually cover data within a single year, this may be a great starting strategy for partitioning, as it would result in a single table per year with a manageable number of rows per table.
There are a couple of ways to create partitioned tables; however, we will focus mainly on the most feature-rich type available, trigger-based partitioning. This requires manual setup and a bit of coding in the PL/pgSQL procedural language to get working.
It operates by having a parent table that will ultimately become empty (or remain empty if it’s a new table), and child tables that INHERIT the parent table.
When the parent table is queried, the child tables are also searched for data due to the INHERIT applied to the child tables.
However, since child tables only contain subsets of the parent’s data, we add a CONSTRAINT on the table that does a CHECK and verifies that the data matches what’s allowed in the table.
This does two things: First it refuses data that doesn’t belong, and second it tells the query planner that only data matching this CHECK CONSTRAINT is allowed in this table, so if searching for data that doesn’t match the table, don’t even bother searching it.
Lastly, we apply a trigger to the parent table that executes a stored procedure deciding which child table the data should go into.
Creating the parent table is like any other table creation.
severalnines=# CREATE TABLE data_log (
                   data_log_sid SERIAL PRIMARY KEY,
                   date TIMESTAMP WITHOUT TIME ZONE DEFAULT NOW(),
                   event_details VARCHAR
               );
CREATE TABLE
Creating the child tables is similar, but involves some additions. For organization's sake, we'll have our child tables exist in a separate schema. Do this for each child table, changing the details accordingly.
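A minimal sketch of one such child table (the schema name part, the 2018 date range, and the child naming convention are assumptions for illustration; the sequence name is the one PostgreSQL generates for the parent's SERIAL column):

severalnines=# CREATE SCHEMA part;
CREATE SCHEMA
severalnines=# CREATE TABLE part.data_log_2018 (
                   data_log_sid INTEGER NOT NULL DEFAULT nextval('public.data_log_data_log_sid_seq'::regclass),
                   date TIMESTAMP WITHOUT TIME ZONE DEFAULT NOW(),
                   event_details VARCHAR,
                   CONSTRAINT data_log_2018_date CHECK (date >= '2018-01-01' AND date < '2019-01-01')
               ) INHERITS (public.data_log);
CREATE TABLE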
NOTE: The name of the sequence used in nextval() comes from the sequence that the parent table created for its SERIAL column. It is crucial that all child tables use the same sequence.
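This excerpt ends before showing the trigger itself, but a minimal sketch of the routing function and trigger (the function and trigger names, and the single-year routing shown, are illustrative) would look something like:

CREATE OR REPLACE FUNCTION public.data_log_insert()
RETURNS TRIGGER AS $$
BEGIN
    -- Route each new row to the child table whose CHECK constraint it satisfies;
    -- only the 2018 branch is shown here, with one branch per child table in practice.
    IF NEW.date >= '2018-01-01' AND NEW.date < '2019-01-01' THEN
        INSERT INTO part.data_log_2018 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'No child table for date %', NEW.date;
    END IF;
    -- Returning NULL cancels the insert on the parent, keeping it empty.
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_log_trigger
    BEFORE INSERT ON public.data_log
    FOR EACH ROW EXECUTE PROCEDURE public.data_log_insert();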
Reposted from: https://www.cnblogs.com/panpanwelcome/p/14336500.html