## Solving MAX(COUNT()) problem 2 – optimizations

In previous post I’ve tried to present my solution for solving max(count()) problem.  The solution was slightly suboptimal and I’ve needed to speed it up a bit, because I’m using it now in some statistical calculations and every millisecond is important.

This is original solution.

```CREATE VIEW data_view_source AS
SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as glob
WHERE
(dp_id, count) IN
(SELECT dp_id, max(count) as max FROM
( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as minmax
GROUP BY dp_id)
ORDER BY dp_id;```

and after bit of thinking a realizing how dummy I’m

```CREATE VIEW data_view_source AS
WITH tmp1 AS (
SELECT dp_id,  ds_id, count FROM data_view_source_count
)
SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
( SELECT dp_id,  ds_id, count FROM tmp1 ORDER BY dp_id, count DESC) as glob
WHERE
(dp_id, count) IN
(SELECT dp_id, max(count) as max FROM
( SELECT dp_id,  ds_id, count FROM tmp1 ORDER BY dp_id, count DESC) as minmax
GROUP BY dp_id)
ORDER BY dp_id;```

Using WITH clause removes duplicate selects and speeds up data_view_source view by cca. 15-30% (for me).

## SQL timeline and statistical computations

This is simple way how-to generate time table (list of consequencing timestamps, or timeline) in PostgreSQL. Nothing spectacular, but might help you ,when trying to do some time based statistical selects (can feel like OLAP :-)).

Code:

```CREATE OR REPLACE FUNCTION gen_time_list(
tim IN TIMESTAMP,
tim_stop IN TIMESTAMP,
step IN INT)
RETURNS TABLE(ts1 TIMESTAMP, ts2 TIMESTAMP)
AS \$\$
BEGIN
RETURN QUERY SELECT (TIMESTAMP 'epoch' + h * INTERVAL '1 second') AS h,
(TIMESTAMP 'epoch' + (h + step) * INTERVAL '1 second') AS h_prev FROM
(
SELECT generate_series(EXTRACT(EPOCH FROM DATE_TRUNC('hour', tim))::bigint,
(EXTRACT(EPOCH FROM tim_stop)::bigint),
step) as h
) as hour_lists;
END;
\$\$ LANGUAGE plpgsql STRICT IMMUTABLE;
```

This creates list of timestamps ts1, ts2 where ts2 = (ts + step).

Example:

```select * FROM gen_time_list(now()::timestamp, (now() - interval '10 days')::timestamp, -3600);
```

Lists all hour timestamps for last 10 days.

With this, cross join and some group by magic you can get very easily statistical information for data you have, like i.e. how many new orders we had weekly in last 6 months. Quick & easy, everything done within database, thats how i like it 😉
Modifying for your needs should be simple.

My Sample:

I’ve data with multiple labels of different kind, and with this i get how many new labels have been published per week grouped by labels. Maybe there is simplier and faster way out there, show me plz 😉

```SELECT
count(j.id) AS count,
gtl.ts1 AS ts1, gtl.ts2 as ts2,
l1.id AS label_id1,
l2.id AS label_id2
FROM j
JOIN jlabel jl1 ON j.id::bigint = jl1.job_id::bigint
JOIN label l1 ON l1.category = 1 AND l1.id::smallint = jl1.label_id::smallint
JOIN jlabel jl2 ON j.id::bigint = jl2.job_id::bigint
JOIN label l2 ON l2.category = 2 AND l2.id::smallint = jl2.label_id::smallint
CROSS JOIN ( SELECT * FROM gen_time_list(now()::timestamp, (now() - interval '1 year')::timestamp, -86400) ) gtl
WHERE j.pubtime  gtl.ts2
GROUP BY l1.id, l2.id, ts1,ts2
ORDER BY label_id1, label_id2, ts1 DESC, ts2 DESC;
```

## Solving MAX(COUNT()) problem

I’ve been solving problem of doing grouped MAX(COUNT()) in PostgreSQL, and because I’ve not found anything really usable out there (doing correlated sub-queries is definitely not good idea for thousands of records) I’ve had to find my own solution.

Situation plan

My configuration is a bit complicated to explain this, so i’ll try do it on a fictive (not tested) example.  It is a system that collects data records from multiple data providers and from multiple data sources, where data providers can publish all or some of data on multiple data sources. Our mission is for each data provider select best data source, that is the one on which data provider has most data records published. This way we remove duplicate data records.

SQL tables:

• data_source
```CREATE TABLE data_source (
ds_id SERIAL NOT NULL PRIMARY KEY,
ds_name VARCHAR(10) NOT NULL UNIQUE
);```
• data_provider
```CREATE TABLE data_provider (
dp_id SERIAL NOT NULL PRIMARY KEY,
dp_name VARCHAR(10) NOT NULL UNIQUE
);```
• data
```CREATE TABLE data (
id SERIAL NOT NULL PRIMARY KEY,
ds_id INT NOT NULL REFERENCES data_source (ds_id),
dp_id INT NOT NULL REFERENCES data_provider (dp_id),
record TEXT NOT NULL
);```

Solution

We will create view data_view_source_count with count of data records per data_source and data_provider, another view data_view_source that will contain best combinations for data_source and data_provider, and final table data_view that will contain only records from one data_source for each data_provider.

```CREATE VIEW data_view_source_count AS
SELECT dp_id, ds_id, count(id) as count
FROM data
GROUP BY ds_id, dp_id;```
```CREATE VIEW data_view_source AS
SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
( SELECT dp_id, ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as glob
WHERE
(dp_id, count) IN
(SELECT dp_id, max(count) as max FROM
( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as minmax
GROUP BY dp_id)
ORDER BY dp_id;```
```CREATE VIEW data_view AS
SELECT d.* FROM data d
JOIN data_view_source dvs ON (dvs.dp_id = d.dp_id AND dvs.ds_id = d.ds_id);```

And that all, how easy 😉 And it took me only few hours to figure out how to do it. Before creating this solution i thought that anything I do will be too slow and i will have to make data_view_source as permanent table regenerated on hourly basis, but right now this seems to be fast enough for having this as view.
In my situation, for cca. 11 thousand of active (young enough) records, select to data_view takes about 2.5x longer (280ms) as to data (110 ms). Nothing spectacular, but right now I don’t need this to be anyhow faster, so why to bother with more optimizations 😉

Notes

• SELECT DISTINCT ON () is not part of SQL standard and not every RDBMS must implement it, or should work as expected here. I’m also slightly abusing the way it works in PostgreSQL because i expect it throws out every but first data record for unique data_provider, and thus for rightly sorted table it does what I need. But, again it is not guaranteed that this will work as expected in other RDBMS, nor future PostgreSQL versions, because of database query engine query optimizations and so on… MaybeÂ  late I’ll try to replace this with something ANSI compatible (well, and then you will have to find ANSI compatible RDBMS 😉 ).

## LDAP + SASL authentification

I’ve found again, that simple things don’t get so simple in OpenSource world, and you have to do it yourself when you want something. That’s the difference, in the world of commercial software you have to, either pay big money or you never get feature you need (and mostly both ;-))

I’veÂ  slightly unusual and paranoid subversions svnserve setup (which i’ll describe later) and my requirement was to use LDAP for user authentication. Well i’know there is possibility to use apache ldap authentication for this, but running separate web server only for having this, is imho not a good idea. So my next try was SASL authentication that svn now supports.Â  New dilemma raised, because subversion disallows PLAIN authentication mechanism, I needed either plain text passwords in LDAP or write my own SASL plugin. Fortunately i’ve found this page http://southbrain.com/south/2008/06/writing-a-cyrus-sasl-ldap-auxp.html, with existing solution and (almost) working code.

Subversion however didn’t cooperated well with this (ie. it passes realm within username to sasl etc…), so i’ve to made some modifications to let this work for me and my existing setup. So here it is eldap (easy ldap) auxprop module.

I’ve made some cleanups to the code, simplified installation, added some little features and fixed some bugs and I’m releasing this as separate module (however still compatible with original gienger ldap). Later i’m planning to describe how to modify your LDAP (and some ldap user management applications) to let it work with this. Till than you have to read original post about how to do it (or bug me).

I hope it helps you.

## Worpress-mu + WPML + SuperCache integration

Seems to be fixed in SuperCache 0.9.5.

This is very simple patch for WPML plugin v1.2.1 and SuperCache plugin 0.8.9.1.

I’ve had some problems with static caching of multilingual pages on my wordpress-mu based CMS system, running both WPML and SuperCache plugin. They both work almost good out of the box, but I’ve experienced problems for non-post pages (like site index). Webserver (serving static content prepared by Supercache) & Supercache didn’t cooperate well. and after second language switch I always received the page in the same language. This patch fixes this.

Install

Apply patch by issuing path -p0 <supercache+wmpl-0.2.patch in wp-content directory of your WordPress installation.

Background

WPML rewrites \$_SERVER[REQUEST_URI] and this way removes language identificator from all URLs it serves, i.e.

`dob.sk/sk -> dob.sk`

But! SuperCache uses \$_SERVER[REQUEST_URI] variable to compute how should be the cache page named, and because of already mentioned WPML rewrite hack, for the same page in different languages sees SuperCache only one REQUEST_URI (aha!).

The patch does simple thing, it saves original REQUEST_URI on SuperCache plugin load and than it uses this saved value to the path computations. How easy 😉