Solving MAX(COUNT()) problem

I’ve been solving problem of doing grouped MAX(COUNT()) in PostgreSQL, and because I’ve not found anything really usable out there (doing correlated sub-queries is definitely not good idea for thousands of records) I’ve had to find my own solution.

Situation plan

My configuration is a bit complicated to explain this, so i’ll try do it on a fictive (not tested) example.  It is a system that collects data records from multiple data providers and from multiple data sources, where data providers can publish all or some of data on multiple data sources. Our mission is for each data provider select best data source, that is the one on which data provider has most data records published. This way we remove duplicate data records.

SQL tables:

  • data_source
CREATE TABLE data_source (
   ds_id SERIAL NOT NULL PRIMARY KEY,
   ds_name VARCHAR(10) NOT NULL UNIQUE
);
  • data_provider
CREATE TABLE data_provider (
   dp_id SERIAL NOT NULL PRIMARY KEY,
   dp_name VARCHAR(10) NOT NULL UNIQUE
);
  • data
CREATE TABLE data (
  id SERIAL NOT NULL PRIMARY KEY,
  ds_id INT NOT NULL REFERENCES data_source (ds_id),
  dp_id INT NOT NULL REFERENCES data_provider (dp_id),
  record TEXT NOT NULL
);

Solution

We will create view data_view_source_count with count of data records per data_source and data_provider, another view data_view_source that will contain best combinations for data_source and data_provider, and final table data_view that will contain only records from one data_source for each data_provider.

CREATE VIEW data_view_source_count AS
     SELECT dp_id, ds_id, count(id) as count
     FROM data
     GROUP BY ds_id, dp_id;
CREATE VIEW data_view_source AS
     SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
     ( SELECT dp_id, ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as glob
     WHERE
     (dp_id, count) IN
        (SELECT dp_id, max(count) as max FROM
             ( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as minmax
                      GROUP BY dp_id)
         ORDER BY dp_id;
CREATE VIEW data_view AS
      SELECT d.* FROM data d
      JOIN data_view_source dvs ON (dvs.dp_id = d.dp_id AND dvs.ds_id = d.ds_id);

And that all, how easy 😉 And it took me only few hours to figure out how to do it. Before creating this solution i thought that anything I do will be too slow and i will have to make data_view_source as permanent table regenerated on hourly basis, but right now this seems to be fast enough for having this as view.
In my situation, for cca. 11 thousand of active (young enough) records, select to data_view takes about 2.5x longer (280ms) as to data (110 ms). Nothing spectacular, but right now I don’t need this to be anyhow faster, so why to bother with more optimizations 😉

Notes

  • SELECT DISTINCT ON () is not part of SQL standard and not every RDBMS must implement it, or should work as expected here. I’m also slightly abusing the way it works in PostgreSQL because i expect it throws out every but first data record for unique data_provider, and thus for rightly sorted table it does what I need. But, again it is not guaranteed that this will work as expected in other RDBMS, nor future PostgreSQL versions, because of database query engine query optimizations and so on… Maybe  late I’ll try to replace this with something ANSI compatible (well, and then you will have to find ANSI compatible RDBMS 😉 ).