PostgreSQL upgrade checklist

Checklist for a major PostgreSQL database upgrade:

  • shut down the standby servers
  • run pg_upgrade on each standby
  • shut down the master
  • run pg_upgrade on the master (see the sketch after this list)
  • if pg_upgrade fails on the postgres database with an encoding mismatch, re-init that database with initdb
  • start the master
  • verify the master is OK
  • check that recovery.conf is present in each slave's datadir
  • start the slaves one by one
  • check that the slaves are live and recovering
  • if a slave is not recovering, check whether its entry in pg_replication_slots is present
  • if it is missing, create it with pg_create_physical_replication_slot()
  • once everything is live, check pg_replication_slots again
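
A minimal sketch of the pg_upgrade call and the replication-slot checks on the master; the binary and data directory paths, the versions, and the slot name standby1_slot are placeholders for illustration, not values from the setup above:

# run pg_upgrade as the postgres user, with both clusters shut down
/usr/lib/postgresql/9.6/bin/pg_upgrade \
    -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin \
    -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main

# after the master is back up: list the physical replication slots and their state
psql -c "SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;"

# re-create a missing slot for a standby (name must match the standby's primary_slot_name)
psql -c "SELECT pg_create_physical_replication_slot('standby1_slot');"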

This procedure brings the whole DB system down for a while, because we shut down the slaves and the master at once. It should be possible to start a slave while the master is not yet upgraded and hope that the upgraded slave deals correctly with the non-upgraded data feed from the master; alternatively, one can switch the master role over to an already upgraded slave. Both options are considered potentially unsafe and should be tested first on a non-production database.
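
If you do test the switchover variant on a non-production copy, promoting a standby to master is typically a single pg_ctl call (the datadir path below is just an example):

# promote the already-upgraded standby to master (run as the postgres user)
pg_ctl promote -D /var/lib/postgresql/9.6/main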

For re-creation of a standby:

  • perform pg_basebackup on the master (rep1 is our replication user)
    master: pg_basebackup -U rep1 -Ft -D - | bzip2 | nc SLAVE-HOST 9999
    slave: nc -l -p 9999 -v > backup.tar.bz2
    
  • unpack backup.tar.bz2, replacing the existing datadir (see the sketch after this list)
  • add recovery.conf to the datadir
  • check the replication slots as during the upgrade
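
A rough sketch of the unpack and recovery.conf steps on the slave; the paths, MASTER-HOST, and the slot name standby1_slot are examples (rep1 is the replication user from above):

# replace the old datadir with the fresh base backup
pg_ctl stop -D /var/lib/postgresql/9.6/main          # if the slave is still running
rm -rf /var/lib/postgresql/9.6/main/*
tar -xjf backup.tar.bz2 -C /var/lib/postgresql/9.6/main
chown -R postgres:postgres /var/lib/postgresql/9.6/main

# recovery.conf in the datadir (pre-PostgreSQL 12 style, as used in this checklist)
cat > /var/lib/postgresql/9.6/main/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=MASTER-HOST port=5432 user=rep1'
primary_slot_name = 'standby1_slot'
EOF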

Fixing pg_dump invalid memory alloc request size

I’ve encountered an unusual problem while dumping one of the bigger PostgreSQL databases. Every attempt to run pg_dump resulted in this:

sam@cerberus:~/backup$ pg_dump -v -c js1 >x
pg_dump: Error message from server: ERROR: invalid memory alloc request size 18446744073709551613
pg_dump: The command was: COPY job.description (create_time, job_id, content, hash, last_check_time) TO stdout;
pg_dump: *** aborted because of error

or this:

sam@cerberus:~/backup$ pg_dump -v -c --inserts js1 >x
pg_dump: Error message from server: ERROR: invalid memory alloc request size 18446744073709551613
pg_dump: The command was: FETCH 100 FROM _pg_dump_cursor
pg_dump: *** aborted because of error

A few weeks ago I experienced an almost fatal failure of two(!) disks in my RAID5 array – a really ‘funny’ situation, and a rather uncomfortable way to find out why not having a monitoring and alerting system is a really bad idea 😉
Fortunately no serious data damage happened; PostgreSQL seemed to recover from this without any problems and all my databases worked fine – until I tried to perform an irregular database dump.

And now comes the important question: how do you find out which table rows are invalid (when pg_filedump says everything is OK)?
The answer is to use a custom function:

CREATE OR REPLACE FUNCTION
find_bad_row(tableName TEXT)
RETURNS tid
AS $find_bad_row$
DECLARE
    result  tid;
    curs    REFCURSOR;
    row1    RECORD;
    row2    RECORD;
    tabName TEXT;
    count   BIGINT := 0;
BEGIN
    -- strip the schema: hstore() below needs the bare table name as the row alias
    SELECT reverse(split_part(reverse($1), '.', 1)) INTO tabName;

    OPEN curs FOR EXECUTE 'SELECT ctid FROM ' || tableName;

    count := 1;
    FETCH curs INTO row1;

    WHILE row1.ctid IS NOT NULL LOOP
        -- remember the CTID of the last row we got past
        result := row1.ctid;

        count := count + 1;
        FETCH curs INTO row1;

        -- expand the freshly fetched row into key/value pairs;
        -- reading a corrupted row raises an exception here
        EXECUTE 'SELECT (each(hstore(' || tabName || '))).* FROM '
            || tableName || ' WHERE ctid = $1' INTO row2
            USING row1.ctid;

        IF count % 100000 = 0 THEN
            RAISE NOTICE 'rows processed: %', count;
        END IF;
    END LOOP;

    CLOSE curs;
    RETURN row1.ctid;
EXCEPTION
    WHEN OTHERS THEN
        -- report the last CTID that could still be read;
        -- the corrupted row is the next one after it
        RAISE NOTICE 'LAST CTID: %', result;
        RAISE NOTICE '%: %', SQLSTATE, SQLERRM;
        RETURN result;
END
$find_bad_row$
LANGUAGE plpgsql;

It goes over all records in the given table and expands them one by one, which raises an exception if an expansion error occurs. The exception notice also contains the CTID of the last correctly processed row, and the next row with a higher CTID is the corrupted one.

Like this:

js1=# select find_bad_row('job.description');
NOTICE: LAST CTID: (78497,6)
NOTICE: XX000: invalid memory alloc request size 18446744073709551613
find_bad_row
--------------
(78497,6)
(1 row)

js1=# select * from job.description where ctid = '(78498,1)';
ERROR: invalid memory alloc request size 18446744073709551613
js1=# delete from job.description where ctid = '(78498,1)';
DELETE 1
js1=# select find_bad_row('job.description');
NOTICE: rows processed: 100000
NOTICE: rows processed: 200000
NOTICE: rows processed: 300000
NOTICE: rows processed: 400000
NOTICE: rows processed: 500000
NOTICE: rows processed: 600000
NOTICE: rows processed: 700000
NOTICE: rows processed: 800000
find_bad_row
--------------

(1 row)

Note: this function requires the hstore extension – it is part of the PostgreSQL distribution, but you may need to create it with:

CREATE EXTENSION hstore;

Records in this table were not that important, because I could restore them from an external source. But if you can’t delete them, you will have to play directly with the data files – as described here – good luck 🙂