分析师于日常数据分析及图表制作工作当中,时常会遇到一些离群值和异常值,其数值显著偏离所属样本的其余观测值,可能不具备参考价值,同时可能致使整体数据呈现偏高或偏低之状,难以精确剖析出普遍规律。
因而,在数据分析过程之中,经常会面临一类问题,即剔除极端值。因此在 SQL 面试里,“剔除最大最小值取平均”等各类指标统计,同样是一类常见的题目。
举例:存在一部门薪资表,期望剔除该部门的最高工资与最低工资,以求得平均工资。
临时表生成的 SQL 语句如下:
-- ---------------- start 生成临时测试表,如已有可用表忽略此步骤------------------------------
WITH table_employee_salary AS (
select 802264760 AS employee_id,5000 AS salary_amount
union all
select 802264761 AS employee_id,7000 AS salary_amount
union all
select 802264763 AS employee_id,10000 AS salary_amount
union all
select 802264764 AS employee_id,7000 AS salary_amount
union all
select 802264765 AS employee_id,3000 AS salary_amount
union all
select 802264767 AS employee_id,2500 AS salary_amount
union all
select 802264768 AS employee_id,1000 AS salary_amount
union all
select 970014504 AS employee_id,5000 AS salary_amount
union all
select 970014503 AS employee_id,6000 AS salary_amount
union all
select 96620202 AS employee_id,3000 AS salary_amount
)
-- ----------------------------- 测试表生成 end -----------------------
对于此类问题,最为常见的一种处理方式是进行正序、倒序操作之后,剔除正数第一和倒数第一的行,而后求其均值,SQL如下:
select avg(salary_amount)
from (select employee_id,
salary_amount,
row_number() OVER (order by salary_amount asc) up_rank,
row_number() OVER (order by salary_amount desc) desc_rank
from table_employee_salary
) a
where up_rank>1 and desc_rank>1 -- 剔除正数第一和倒数第一的行
数据执行展示结果如下:
然而,在真实的场景当中,最大最小值或许不止一个。在面试的过程之中,多数面试官亦会以真实数据作为题目,在此种情形之下,应当如何处置呢?
方案 :依旧运用排序函数,深入探究几类排序函数的使用场景。
row_number():无重复排名(相同排名的按序排名)
dense_rank(): 排序相同时会重复,但不会跳过,占用的排名,总数变少。
rank(): 有相同排名会重复,但会跳过占用的排名,总数不变
解法参考如下:
select avg(salary_amount)
from (select employee_id,
salary_amount,
rank() OVER (order by salary_amount asc) up_rank,
rank() OVER (order by salary_amount desc) desc_rank
from table_employee_salary
) a
where up_rank>1 and desc_rank>1
有问题欢迎探讨,其他类似衍生题目补充中...