mysql - Performance implications of allowing alias to be used in HAVING clause -
i made bit of fool out of myself earlier today on this question. question using sql server, , correct answer involved adding having
clause. initial mistake made think alias in select
statement used in having
clause, not allowed in sql server. made error because assumed sql server had same rules mysql, allow alias used in having
clause.
this got me curious, , poked around on stack overflow , elsewhere, finding bunch of material explaining why these rules enforced on 2 respective rdbms. did find explanation of performance implications of allowing/disallowing alias in having
clause.
to give concrete example, duplicate query occurred in above-mentioned question:
select students.camid, campus.camname, count(students.stuid) studentcount students join campus on campus.camid = students.camid group students.camid, campus.camname having count(students.stuid) > 3 order studentcount
what performance implications of using alias in having
clause instead of re specifying count
? question can answered directly in mysql, , give insight happen in sql if support alias in having
clause.
this rare instance might ok tag sql question both mysql , sql server, enjoy moment in sun.
narrowly focused on particular query, , sample data loaded below. address other queries such count(distinct ...)
mentioned others.
the alias in having
appears either outperform or quite bit outperform alternative (depending on query).
this uses pre-existing table 5 million rows in created via answer of mine takes 3 5 minutes.
resulting structure:
create table `ratings` ( `id` int(11) not null auto_increment, `thing` int(11) not null, primary key (`id`) ) engine=innodb auto_increment=5046214 default charset=utf8;
but using innodb instead. creates expected innodb gap anomaly due range reservation inserts. saying, makes no difference. 4.7 million rows.
modify table near tim's assumed schema.
rename table ratings students; -- not instanteous (a copy) alter table students add column camid int; -- near tim's schema -- don't add `camid` index yet
the following take while. run again , again in chunks or else connection may timeout. timeout due 5 million rows without limit clause in update statement. note, do have limit clause.
so doing in half million row iterations. sets column random number between 1 , 20
update students set camid=floor(rand()*20+1) camid null limit 500000; -- took while (no surprise)
keep running above until no camid
null.
i ran 10 times (the whole thing takes 7 10 minutes)
select camid,count(*) students group camid order 1 ; 1 235641 2 236060 3 236249 4 235736 5 236333 6 235540 7 235870 8 236815 9 235950 10 235594 11 236504 12 236483 13 235656 14 236264 15 236050 16 236176 17 236097 18 235239 19 235556 20 234779 select count(*) students; -- 4.7 million rows
create useful index (after inserts of course).
create index `ix_stu_cam` on students(camid); -- takes 45 seconds analyze table students; -- update stats: http://dev.mysql.com/doc/refman/5.7/en/analyze-table.html -- above fine, takes 1 second
create campus table.
create table campus ( camid int auto_increment primary key, camname varchar(100) not null ); insert campus(camname) values ('one'),('2'),('3'),('4'),('5'), ('6'),('7'),('8'),('9'),('ten'), ('etc'),('etc'),('etc'),('etc'),('etc'), ('etc'),('etc'),('etc'),('etc'),('twenty'); -- ok 20 of them
run 2 queries:
select students.camid, campus.camname, count(students.id) studentcount students join campus on campus.camid = students.camid group students.camid, campus.camname having count(students.id) > 3 order studentcount; -- run many many times, back, 5.50 seconds, 20 rows of output
and
select students.camid, campus.camname, count(students.id) studentcount students join campus on campus.camid = students.camid group students.camid, campus.camname having studentcount > 3 order studentcount; -- run many many times, back, 5.50 seconds, 20 rows of output
so times identical. ran each dozen times.
the explain
output same both
+----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | | +----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+ | 1 | simple | campus | | primary | null | null | null | 20 | using temporary; using filesort | | 1 | simple | students | ref | ix_stu_cam | ix_stu_cam | 5 | bigtest.campus.camid | 123766 | using index | +----+-------------+----------+------+---------------+------------+---------+----------------------+--------+---------------------------------+
using avg() function, getting 12% increase in performance alias in having
(with identical explain
output) following 2 queries.
select students.camid, campus.camname, avg(students.id) studentavg students join campus on campus.camid = students.camid group students.camid, campus.camname having avg(students.id) > 2200000 order students.camid; -- avg time 7.5 explain select students.camid, campus.camname, avg(students.id) studentavg students join campus on campus.camid = students.camid group students.camid, campus.camname having studentavg > 2200000 order students.camid; -- avg time 6.5
and lastly, distinct
:
select students.camid, count(distinct students.id) studentdistinct students join campus on campus.camid = students.camid group students.camid having count(distinct students.id) > 1000000 order students.camid; -- 10.6 10.84 12.1 11.49 10.1 9.97 10.27 11.53 9.84 9.98 -- 9.9 select students.camid, count(distinct students.id) studentdistinct students join campus on campus.camid = students.camid group students.camid having studentdistinct > 1000000 order students.camid; -- 6.81 6.55 6.75 6.31 7.11 6.36 6.55 -- 6.45
the alias in having consistently runs 35% faster same explain
output. seen below. same explain output has been shown twice not result in same performance, general clue.
+----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | | +----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+ | 1 | simple | campus | index | primary | primary | 4 | null | 20 | using index; using temporary; using filesort | | 1 | simple | students | ref | ix_stu_cam | ix_stu_cam | 5 | bigtest.campus.camid | 123766 | using index | +----+-------------+----------+-------+---------------+------------+---------+----------------------+--------+----------------------------------------------+
the optimizer appears favor alias in having @ moment, distinct.
Comments
Post a Comment