Hive SQL 实现问题,请大佬看下~

336 天前
 shuaishuaivip
Hive SQL 实现问题:数据展示如下:
第一条数据的 end 等于第二条数据的 start 就认为是连续的,一直到 LS,5,6,2024-01-27 08:44:37 2024-01-27 08:44:47 这条数据就断开了,我想取第一条的 start 和断开那条数据的 end 那就查询到的数据结果是这样的,该怎么写?
name1,start,end,start_time,end_time
LSJA24390MS139778 55.8 56.6 2024-01-27 08:44:23 2024-01-27 08:46:51
LSJA24390MS139778 56.7 57.3 2024-01-27 08:46:53 2024-01-27 08:48:35
LSJA24390MS139778 57.2 57.3 2024-01-27 08:48:37 2024-01-27 08:48:47

name1,start,end,start_time,end_time
LSJA24390MS139778 55.8 55.8 2024-01-27 08:44:23 2024-01-27 08:44:23
LSJA24390MS139778 NULL 55.9 2024-01-27 08:44:25 2024-01-27 08:44:35
LSJA24390MS139778 NULL 55.9 2024-01-27 08:44:37 2024-01-27 08:44:47
LSJA24390MS139778 NULL 56 2024-01-27 08:44:49 2024-01-27 08:44:51
LSJA24390MS139778 NULL 56.1 2024-01-27 08:44:53 2024-01-27 08:45:11
LSJA24390MS139778 NULL 56.1 2024-01-27 08:45:13 2024-01-27 08:45:21
LSJA24390MS139778 NULL 56.1 2024-01-27 08:45:23 2024-01-27 08:45:23
LSJA24390MS139778 NULL 56.3 2024-01-27 08:45:25 2024-01-27 08:45:51
LSJA24390MS139778 NULL 56.5 2024-01-27 08:45:53 2024-01-27 08:46:21
LSJA24390MS139778 NULL 56.5 2024-01-27 08:46:23 2024-01-27 08:46:23
LSJA24390MS139778 NULL 56.6 2024-01-27 08:46:25 2024-01-27 08:46:51
LSJA24390MS139778 56.7 56.7 2024-01-27 08:46:53 2024-01-27 08:46:59
LSJA24390MS139778 NULL 56.8 2024-01-27 08:47:01 2024-01-27 08:47:11
LSJA24390MS139778 NULL 56.8 2024-01-27 08:47:13 2024-01-27 08:47:21
LSJA24390MS139778 NULL 56.8 2024-01-27 08:47:23 2024-01-27 08:47:23
LSJA24390MS139778 NULL 56.9 2024-01-27 08:47:25 2024-01-27 08:47:35
LSJA24390MS139778 NULL 57 2024-01-27 08:47:37 2024-01-27 08:47:47
LSJA24390MS139778 NULL 57 2024-01-27 08:47:49 2024-01-27 08:47:51
LSJA24390MS139778 NULL 57 2024-01-27 08:47:53 2024-01-27 08:47:59
LSJA24390MS139778 NULL 57.1 2024-01-27 08:48:01 2024-01-27 08:48:11
LSJA24390MS139778 NULL 57.2 2024-01-27 08:48:13 2024-01-27 08:48:21
LSJA24390MS139778 NULL 57.2 2024-01-27 08:48:23 2024-01-27 08:48:23
LSJA24390MS139778 NULL 57.3 2024-01-27 08:48:25 2024-01-27 08:48:35
LSJA24390MS139778 57.2 57.3 2024-01-27 08:48:37 2024-01-27 08:48:47
1491 次点击
所在节点    程序员
11 条回复
moyi97
336 天前
rownumber 确定第一条, lag 或者 lead 做个判断 取断掉那条, 应该就可以实现了吧
Jinnrry
336 天前
你这个第一条 end 和第二条 start 也不相等啊,这不差了 2 秒吗?

如果开始结束时间完全一致的时候,先分列,把 starttime 和 endtime 合并成一个字段,然后计数,如果某个时间点计数值是 1 ,那就说明不连续。
shuaishuaivip
336 天前
Hive SQL 实现问题:数据展示如下:

name1,start,end,start_time,end_time
LSJA24390MS139778 55.8 55.8 ,2024-01-27 08:44:23 2024-01-27 08:44:23
LSJA24390MS139778 55.8 55.9, 2024-01-27 08:44:25 2024-01-27 08:44:35
LSJA24390MS139778 55.9 55.9 ,2024-01-27 08:44:37 2024-01-27 08:44:47
LSJA24390MS139778 55.9 56 , 2024-01-27 08:44:49 2024-01-27 08:44:51
LSJA24390MS139778 56 56.1, 2024-01-27 08:44:53 2024-01-27 08:45:11
LSJA24390MS139778 56.1 56.1, 2024-01-27 08:45:13 2024-01-27 08:45:21
LSJA24390MS139778 56.1 56.1, 2024-01-27 08:45:23 2024-01-27 08:45:23
LSJA24390MS139778 56.1 56.3, 2024-01-27 08:45:25 2024-01-27 08:45:51
LSJA24390MS139778 56.3 56.5, 2024-01-27 08:45:53 2024-01-27 08:46:21
LSJA24390MS139778 56.5 56.5, 2024-01-27 08:46:23 2024-01-27 08:46:23
LSJA24390MS139778 56.5 56.6, 2024-01-27 08:46:25 2024-01-27 08:46:51
LSJA24390MS139778 56.7 56.7, 2024-01-27 08:46:53 2024-01-27 08:46:59
LSJA24390MS139778 56.7 56.8, 2024-01-27 08:47:01 2024-01-27 08:47:11
LSJA24390MS139778 56.8 56.8, 2024-01-27 08:47:13 2024-01-27 08:47:21
LSJA24390MS139778 56.8 56.8, 2024-01-27 08:47:23 2024-01-27 08:47:23
LSJA24390MS139778 56.8 56.9, 2024-01-27 08:47:25 2024-01-27 08:47:35
LSJA24390MS139778 56.9 57, 2024-01-27 08:47:37 2024-01-27 08:47:47
LSJA24390MS139778 57 57, 2024-01-27 08:47:49 2024-01-27 08:47:51
LSJA24390MS139778 57 57, 2024-01-27 08:47:53 2024-01-27 08:47:59
LSJA24390MS139778 57 57.1, 2024-01-27 08:48:01 2024-01-27 08:48:11
LSJA24390MS139778 57.1 57.2, 2024-01-27 08:48:13 2024-01-27 08:48:21
LSJA24390MS139778 57.2 57.2, 2024-01-27 08:48:23 2024-01-27 08:48:23
LSJA24390MS139778 57.2 57.3, 2024-01-27 08:48:25 2024-01-27 08:48:35
LSJA24390MS139778 57.2 57.3, 2024-01-27 08:48:37 2024-01-27 08:48:47
实现成这样
name1,start,end,start_time,end_time
LSJA24390MS139778 55.8 56.6 2024-01-27 08:44:23 2024-01-27 08:46:51
LSJA24390MS139778 56.7 57.3 2024-01-27 08:46:53 2024-01-27 08:48:35
LSJA24390MS139778 57.2 57.3 2024-01-27 08:48:37 2024-01-27 08:48:47
shuaishuaivip
336 天前
start,和 end 说的不是时间哦
ericzy
336 天前
-- 创建一个临时表,用 lag 函数获取上一行的 end 值
with temp as (
select name1, start, end, start_time, end_time,
lag(end) over (partition by name1 order by start_time) as prev_end
from your_table
)
-- 从临时表中筛选出断开的数据,即 start 不等于 prev_end 的数据
select name1, start, end, start_time, end_time
from temp
where start != prev_end or prev_end is null

-- 来自 gpt ,仔细查看和使用
shuaishuaivip
336 天前
@ericzy gpt 我都翻烂了~行不通
BeiChuanAlex
336 天前
一张表解决不了就 2 张表分层实现
hpkaiq
336 天前
with t2 as (
select *,
lead(`start`) over (partition by name order by `end_time`) as next_start from your_table
)
,t3 as (
select name,start_time,end_time
from t2 where `end` != next_start or next_start is null
)
,t4 as (select * ,lag(`end_time`) over (partition by name order by `end_time`) as pre_end_time
from t3)

select name,step_end_time,
min(`start`) `start`,max(`end`) `end`,
min(start_time) start_time,max(end_time) end_time
from
(
select t2.*,step_end_time from
(
select name,end_time step_end_time,if(pre_end_time is null,'1970-01-01 00:00:00',pre_end_time) step_start_time from t4
)step
join
t2 on step.name = t2.name and t2.end_time <= step_end_time and t2.start_time > step_start_time
)tmp
group by name,step_end_time
order by name,step_end_time

小试一把,不知道行不行
hero1874
336 天前
st 就是 start ed 是 end
我是用 doris 写的 所以下面 0 在 hive 里应该是 null 把判断是否为 0 改成是否为 null

第一步 用 lag 函数取出上一个的 ed 值
select
name,
st,
ed,
start_time,
end_time,
lag(ed,1,0) over( partition by name order by start_time ) lag_ed1 -- 上一行的 ed 值
from form6
order by start_time

第二步
select
t1.*,
sum(
case
when lag_ed1 = 0 then 1 -- 第一行
when lag_ed1 !=st then 1 -- 断开了
when lag_ed1 = st then 0 -- 连续
end )
over( partition by name order by start_time) num
from
(select
name,
st,
ed,
start_time,
end_time,
lag(ed,1,0) over( partition by name order by start_time ) lag_ed1
from form6
order by start_time) t1
order by start_time

第三步 根据 name 和判断是否连续或者断开分组取数
select
name,
min(st) st,
max(ed) ed,
min(start_time)start_time,
max(end_time) end_time
from
(
select
t1.*,
sum(
case
when lag_ed1 = 0 then 1
when lag_ed1 !=st then 1
when lag_ed1 = st then 0
end )
over( partition by name order by start_time) num
from
(select
name,
st,
ed,
start_time,
end_time,
lag(ed,1,0) over( partition by name order by start_time ) lag_ed1
from form6
order by start_time) t1
order by start_time
)t3
group by name ,num
lusansan
335 天前
select
name1
,MIN(start_number) as start_number
,MAX(end_number) as end_number
,MIN(start_time) as start_time
,MAX(end_time) as end_time
from
(
SELECT
*
,case when start_number is null then row_number() over (partition by name1 order by start_time) -1 else row_number() over (partition by name1 order by start_time) end as rn
from
(
SELECT
name1
,start_number
,end_number
,start_time
,end_time
,lead(start_number,1) over (partition by name1 order by start_time) lead_start_number
FROM
(
select
name1,start_number,end_number,start_time,end_time
from
values
('LSJA24390MS139778' ,55.8,55.8 ,'2024-01-27 08:44:23' ,'2024-01-27 08:44:23')
,('LSJA24390MS139778' ,NULL,55.9 ,'2024-01-27 08:44:25' ,'2024-01-27 08:44:35')
,('LSJA24390MS139778' ,NULL,55.9 ,'2024-01-27 08:44:37' ,'2024-01-27 08:44:47')
,('LSJA24390MS139778' ,NULL,56 ,'2024-01-27 08:44:49' ,'2024-01-27 08:44:51')
,('LSJA24390MS139778' ,NULL,56.1 ,'2024-01-27 08:44:53' ,'2024-01-27 08:45:11')
,('LSJA24390MS139778' ,NULL,56.1 ,'2024-01-27 08:45:13' ,'2024-01-27 08:45:21')
,('LSJA24390MS139778' ,NULL,56.1 ,'2024-01-27 08:45:23' ,'2024-01-27 08:45:23')
,('LSJA24390MS139778' ,NULL,56.3 ,'2024-01-27 08:45:25' ,'2024-01-27 08:45:51')
,('LSJA24390MS139778' ,NULL,56.5 ,'2024-01-27 08:45:53' ,'2024-01-27 08:46:21')
,('LSJA24390MS139778' ,NULL,56.5 ,'2024-01-27 08:46:23' ,'2024-01-27 08:46:23')
,('LSJA24390MS139778' ,NULL,56.6 ,'2024-01-27 08:46:25' ,'2024-01-27 08:46:51')
,('LSJA24390MS139778' ,56.7,56.7 ,'2024-01-27 08:46:53' ,'2024-01-27 08:46:59')
,('LSJA24390MS139778' ,NULL,56.8 ,'2024-01-27 08:47:01' ,'2024-01-27 08:47:11')
,('LSJA24390MS139778' ,NULL,56.8 ,'2024-01-27 08:47:13' ,'2024-01-27 08:47:21')
,('LSJA24390MS139778' ,NULL,56.8 ,'2024-01-27 08:47:23' ,'2024-01-27 08:47:23')
,('LSJA24390MS139778' ,NULL,56.9 ,'2024-01-27 08:47:25' ,'2024-01-27 08:47:35')
,('LSJA24390MS139778' ,NULL,57 ,'2024-01-27 08:47:37' ,'2024-01-27 08:47:47')
,('LSJA24390MS139778' ,NULL,57 ,'2024-01-27 08:47:49' ,'2024-01-27 08:47:51')
,('LSJA24390MS139778' ,NULL,57 ,'2024-01-27 08:47:53' ,'2024-01-27 08:47:59')
,('LSJA24390MS139778' ,NULL,57.1 ,'2024-01-27 08:48:01' ,'2024-01-27 08:48:11')
,('LSJA24390MS139778' ,NULL,57.2 ,'2024-01-27 08:48:13' ,'2024-01-27 08:48:21')
,('LSJA24390MS139778' ,NULL,57.2 ,'2024-01-27 08:48:23' ,'2024-01-27 08:48:23')
,('LSJA24390MS139778' ,NULL,57.3 ,'2024-01-27 08:48:25' ,'2024-01-27 08:48:35')
,('LSJA24390MS139778' ,57.2,57.3 ,'2024-01-27 08:48:37' ,'2024-01-27 08:48:47') as t (name1,start_number,end_number,start_time,end_time)
) p1
) p2
WHERE
start_number is not null or lead_start_number is not null
) t1
group by name1,rn
;



name1 start_number end_number start_time end_time
LSJA24390MS139778 55.8 56.6 2024-01-27 08:44:23 2024-01-27 08:46:51
LSJA24390MS139778 56.7 57.3 2024-01-27 08:46:53 2024-01-27 08:48:35
LSJA24390MS139778 57.2 57.3 2024-01-27 08:48:37 2024-01-27 08:48:47
shuaishuaivip
335 天前
@lusansan 大佬,这个可行。

这是一个专为移动设备优化的页面(即为了让你能够在 Google 搜索结果里秒开这个页面),如果你希望参与 V2EX 社区的讨论,你可以继续到 V2EX 上打开本讨论主题的完整版本。

https://www.v2ex.com/t/1013134

V2EX 是创意工作者们的社区,是一个分享自己正在做的有趣事物、交流想法,可以遇见新朋友甚至新机会的地方。

V2EX is a community of developers, designers and creative people.

© 2021 V2EX