mysql - Performance implications of allowing alias to be used in HAVING clause -


i made bit of fool out of myself earlier today on this question. question using sql server, , correct answer involved adding having clause. initial mistake made think alias in select statement used in having clause, not allowed in sql server. made error because assumed sql server had same rules mysql, allow alias used in having clause.

this got me curious, , poked around on stack overflow , elsewhere, finding bunch of material explaining why these rules enforced on 2 respective rdbms. did find explanation of performance implications of allowing/disallowing alias in having clause.

to give concrete example, duplicate query occurred in above-mentioned question:

select students.camid, campus.camname, count(students.stuid) studentcount students join campus     on campus.camid = students.camid group students.camid, campus.camname having count(students.stuid) > 3 order studentcount 

what performance implications of using alias in having clause instead of re specifying count? question can answered directly in mysql, , give insight happen in sql if support alias in having clause.

this rare instance might ok tag sql question both mysql , sql server, enjoy moment in sun.

narrowly focused on particular query, , sample data loaded below. address other queries such count(distinct ...) mentioned others.

the alias in having appears either outperform or quite bit outperform alternative (depending on query).

this uses pre-existing table 5 million rows in created via answer of mine takes 3 5 minutes.

resulting structure:

create table `ratings` (   `id` int(11) not null auto_increment,   `thing` int(11) not null,   primary key (`id`) ) engine=innodb auto_increment=5046214 default charset=utf8; 

but using innodb instead. creates expected innodb gap anomaly due range reservation inserts. saying, makes no difference. 4.7 million rows.

modify table near tim's assumed schema.

rename table ratings students; -- not instanteous (a copy) alter table students add column camid int; -- near tim's schema -- don't add `camid` index yet 

the following take while. run again , again in chunks or else connection may timeout. timeout due 5 million rows without limit clause in update statement. note, do have limit clause.

so doing in half million row iterations. sets column random number between 1 , 20

update students set camid=floor(rand()*20+1) camid null limit 500000; -- took while (no surprise) 

keep running above until no camid null.

i ran 10 times (the whole thing takes 7 10 minutes)

select camid,count(*) students group camid order 1 ;  1   235641 2   236060 3   236249 4   235736 5   236333 6   235540 7   235870 8   236815 9   235950 10  235594 11  236504 12  236483 13  235656 14  236264 15  236050 16  236176 17  236097 18  235239 19  235556 20  234779  select count(*) students; -- 4.7 million rows 

create useful index (after inserts of course).

create index `ix_stu_cam` on students(camid); -- takes 45 seconds  analyze table students; -- update stats: http://dev.mysql.com/doc/refman/5.7/en/analyze-table.html -- above fine, takes 1 second 

create campus table.

create table campus (   camid int auto_increment primary key,     camname varchar(100) not null ); insert campus(camname) values ('one'),('2'),('3'),('4'),('5'), ('6'),('7'),('8'),('9'),('ten'), ('etc'),('etc'),('etc'),('etc'),('etc'), ('etc'),('etc'),('etc'),('etc'),('twenty'); -- ok 20 of them 

run 2 queries:

select students.camid, campus.camname, count(students.id) studentcount  students  join campus      on campus.camid = students.camid  group students.camid, campus.camname  having count(students.id) > 3  order studentcount;  -- run many many times, back, 5.50 seconds, 20 rows of output 

and

select students.camid, campus.camname, count(students.id) studentcount  students  join campus      on campus.camid = students.camid  group students.camid, campus.camname  having studentcount > 3  order studentcount;  -- run many many times, back, 5.50 seconds, 20 rows of output 

so times identical. ran each dozen times.

the explain output same both

+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+ | id | select_type | table    | type | possible_keys | key        | key_len | ref                  | rows   |                           | +----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+ |  1 | simple      | campus   |  | primary       | null       | null    | null                 |     20 | using temporary; using filesort | |  1 | simple      | students | ref  | ix_stu_cam    | ix_stu_cam | 5       | bigtest.campus.camid | 123766 | using index                     | +----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+ 

using avg() function, getting 12% increase in performance alias in having (with identical explain output) following 2 queries.

select students.camid, campus.camname, avg(students.id) studentavg  students  join campus      on campus.camid = students.camid  group students.camid, campus.camname  having avg(students.id) > 2200000  order students.camid;  -- avg time 7.5  explain   select students.camid, campus.camname, avg(students.id) studentavg  students  join campus      on campus.camid = students.camid  group students.camid, campus.camname  having studentavg > 2200000 order students.camid; -- avg time 6.5 

and lastly, distinct:

select students.camid, count(distinct students.id) studentdistinct  students  join campus      on campus.camid = students.camid  group students.camid  having count(distinct students.id) > 1000000  order students.camid; -- 10.6   10.84   12.1   11.49   10.1   9.97   10.27   11.53   9.84 9.98 -- 9.9   select students.camid, count(distinct students.id) studentdistinct   students   join campus      on campus.camid = students.camid   group students.camid   having studentdistinct > 1000000   order students.camid; -- 6.81    6.55   6.75   6.31   7.11 6.36   6.55 -- 6.45 

the alias in having consistently runs 35% faster same explain output. seen below. same explain output has been shown twice not result in same performance, general clue.

+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+ | id | select_type | table    | type  | possible_keys | key        | key_len | ref                  | rows   |                                        | +----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+ |  1 | simple      | campus   | index | primary       | primary    | 4       | null                 |     20 | using index; using temporary; using filesort | |  1 | simple      | students | ref   | ix_stu_cam    | ix_stu_cam | 5       | bigtest.campus.camid | 123766 | using index                                  | +----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+ 

the optimizer appears favor alias in having @ moment, distinct.


Comments

Popular posts from this blog

matlab - error with cyclic autocorrelation function -

django - (fields.E300) Field defines a relation with model 'AbstractEmailUser' which is either not installed, or is abstract -

c# - What is a good .Net RefEdit control to use with ExcelDna? -