Fixing pg_dump invalid memory alloc request size

I’ve encountered an unusual problem while dumping one of my bigger PostgreSQL databases. Every attempt to run pg_dump resulted in this:

sam@cerberus:~/backup$ pg_dump -v -c js1 >x
pg_dump: Error message from server: ERROR: invalid memory alloc request size 18446744073709551613
pg_dump: The command was: COPY job.description (create_time, job_id, content, hash, last_check_time) TO stdout;
pg_dump: *** aborted because of error

or in this:

sam@cerberus:~/backup$ pg_dump -v -c --inserts js1 >x
pg_dump: Error message from server: ERROR: invalid memory alloc request size 18446744073709551613
pg_dump: The command was: FETCH 100 FROM _pg_dump_cursor
pg_dump: *** aborted because of error

A few weeks ago I experienced an almost fatal failure of two(!) disks in my RAID5 array – a really ‘funny’ situation, and a rather uncomfortable way to find out why not having a monitoring and alerting system is a really bad idea 😉
Fortunately no serious data damage happened; PostgreSQL seemed to recover without any problems and all my databases worked fine – until I tried to perform an irregular database dump.

And now comes the important question: how do you find out which table rows are corrupted (when pg_filedump says everything is OK)?
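(If you want to poke at the raw files with pg_filedump yourself, pg_relation_filepath() tells you which file under the data directory to look at; the OIDs in the output below are just an illustration:)

SELECT pg_relation_filepath('job.description');
--  pg_relation_filepath
-- -----------------------
--  base/16384/16397          <- illustrative output, your OIDs will differ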
The answer is to use a custom function:

CREATE OR REPLACE FUNCTION
find_bad_row(tableName TEXT)
RETURNS tid
AS $find_bad_row$
DECLARE
    result tid;
    curs REFCURSOR;
    row1 RECORD;
    row2 RECORD;
    tabName TEXT;
    count BIGINT := 0;
BEGIN
    -- strip the schema part, keep only the bare table name (needed for hstore(row))
    SELECT reverse(split_part(reverse($1), '.', 1)) INTO tabName;

    OPEN curs FOR EXECUTE 'SELECT ctid FROM ' || tableName;

    count := 1;
    FETCH curs INTO row1;

    WHILE row1.ctid IS NOT NULL LOOP
        -- remember the last CTID that expanded successfully
        result := row1.ctid;

        count := count + 1;
        FETCH curs INTO row1;

        -- expand the whole row into key/value pairs; reading a corrupted
        -- value throws the exception we are waiting for
        EXECUTE 'SELECT (each(hstore(' || tabName || '))).* FROM '
            || tableName || ' WHERE ctid = $1' INTO row2
            USING row1.ctid;

        IF count % 100000 = 0 THEN
            RAISE NOTICE 'rows processed: %', count;
        END IF;
    END LOOP;

    CLOSE curs;
    RETURN row1.ctid;
EXCEPTION
    WHEN OTHERS THEN
        RAISE NOTICE 'LAST CTID: %', result;
        RAISE NOTICE '%: %', SQLSTATE, SQLERRM;
        RETURN result;
END
$find_bad_row$
LANGUAGE plpgsql;

It goes over all records in the given table and expands them one by one, which raises an exception as soon as a row cannot be read. The exception handler also prints the CTID of the last correctly processed row; the next row with a higher CTID is the corrupted one.
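For reference, the dynamic query the function builds boils down to something like this for a single row (table name and CTID taken from the example below):

-- hstore(description) converts the whole row into key/value pairs, which
-- forces every column to be read and detoasted; a corrupted value fails
-- here with the same "invalid memory alloc request size" error
SELECT (each(hstore(description))).*
FROM job.description
WHERE ctid = '(78498,1)';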

Running the function against the affected table looks like this:

js1=# select find_bad_row('job.description');
NOTICE: LAST CTID: (78497,6)
NOTICE: XX000: invalid memory alloc request size 18446744073709551613
find_bad_row
--------------
(78497,6)
(1 row)

js1=# select * from job.description where ctid = '(78498,1)';
ERROR: invalid memory alloc request size 18446744073709551613
js1=# delete from job.description where ctid = '(78498,1)';
DELETE 1
js1=# select find_bad_row('job.description');
NOTICE: rows processed: 100000
NOTICE: rows processed: 200000
NOTICE: rows processed: 300000
NOTICE: rows processed: 400000
NOTICE: rows processed: 500000
NOTICE: rows processed: 600000
NOTICE: rows processed: 700000
NOTICE: rows processed: 800000
find_bad_row
--------------

(1 row)

Note: this function requires the hstore extension – it is part of the standard PostgreSQL distribution (contrib), but you may need to enable it first with:

CREATE EXTENSION hstore;

The records in this table were not that important, because I could restore them from an external source. But if you can’t delete them, you will have to play directly with the data files – as described here – good luck 🙂
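If you do end up going down that road, the block number in the CTID at least tells you where to look inside the file reported by pg_relation_filepath() above. A minimal sketch, assuming the default 8 kB block size and 1 GB (131072-block) segment files:

-- the bad row sat at ctid (78498,1), i.e. in block 78498 of the relation;
-- with default build options that block lives at this segment and offset:
SELECT 78498 / 131072 AS segment_no,                  -- 0 = the base file itself
       (78498 % 131072) * 8192 AS byte_offset_in_segment;
--  segment_no | byte_offset_in_segment
-- ------------+------------------------
--           0 |              643055616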