Bulk-inserting into a database: how can values in a unique column be renamed automatically when duplicates occur?

2020-07-27 17:15:35 +08:00

einsdisp

The database is PostgreSQL.

Sample table schema:

create table public.test (
    id serial primary key,
    key text unique not null,
    value int
);

The key column has a unique constraint.

Now I need to insert a large amount of data, for example:

insert into public.test (key, value) values
('k1', 101),
('k2_dup', 102),
('k2_dup', 103);

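The key values in the data contain many duplicates. How can they be renamed automatically at insert time according to a fixed rule? That is, if the value being inserted already exists in that column, append _2 to it; if that is still a duplicate, use _3 instead, and so on.

Ideally the inserts should be handled in bulk; in other words, preferably not one INSERT statement per row.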
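The renaming rule described above can be sketched in plain Python, independent of any database. This is only an illustration: the function name `assign_unique_keys` is hypothetical, and it assumes the set of existing keys is already known in memory, so by itself it does not protect against rows inserted concurrently by someone else.

```python
def assign_unique_keys(existing, incoming):
    """Rename incoming keys so none collides with `existing` or with each other.

    The first free occurrence keeps its name; later ones become key_2, key_3, ...
    """
    taken = set(existing)   # every key that is already used
    next_suffix = {}        # original key -> next numeric suffix to try
    renamed = []
    for key in incoming:
        candidate = key
        n = next_suffix.get(key, 2)
        while candidate in taken:
            candidate = f"{key}_{n}"
            n += 1
        next_suffix[key] = n
        taken.add(candidate)
        renamed.append(candidate)
    return renamed


rows = [("k1", 101), ("k2_dup", 102), ("k2_dup", 103)]
# Assume "k1" is already present in the table.
new_keys = assign_unique_keys({"k1"}, [key for key, _ in rows])
print(new_keys)  # ['k1_2', 'k2_dup', 'k2_dup_2']
```

The renamed rows could then be sent in a single multi-row INSERT, as long as nothing else writes to the table in between.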

labulaka521

2020-07-27 17:39:11 +08:00

Just clean up the data beforehand.

allAboutDbmss

2020-07-27 17:45:59 +08:00

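You are presumably not typing these INSERT statements by hand;
I assume there is a CSV file or a script.
psql has a copy command (\copy) that can bulk-load CSV and similar file types quickly.

You would be better off preprocessing your file or script with command-line tools such as grep, sed, and awk first.
I have never put much conditional logic into an INSERT in psql.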

einsdisp

2020-07-27 17:48:10 +08:00

@labulaka521
@allAboutDbmss

einsdisp

2020-07-27 17:49:20 +08:00

@labulaka521
@allAboutDbmss
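If the data is cleaned up beforehand, new rows may be inserted between the moment you finish cleaning it and the moment you insert it, so your "cleaned" data ends up with duplicates again.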

allAboutDbmss

2020-07-27 17:57:14 +08:00

Then you need INSERT + SELECT.
Here is a very small example; you will also need string-handling functions.

```
psql=# drop table foo;
DROP TABLE
psql=# create table foo (id int);
CREATE TABLE
psql=# select * from foo;
id
----
(0 rows)

psql=# insert into foo (id) values (1);
INSERT 0 1
psql=# select * from foo;
id
----
1
(1 row)

psql=# insert into foo (id) select f.id+1 from foo f where f.id=1;
INSERT 0 1
psql=# select * from foo;
id
----
1
2
(2 rows)

psql=# insert into foo (id) select f.id+1 from foo f where f.id=2;
INSERT 0 1
psql=# select * from foo;
id
----
1
2
3
(3 rows)

```

sss15

2020-07-27 18:08:38 +08:00

SELECT first, then INSERT, but that alone is very inefficient. Leaving a marker here to see what solutions others come up with.

sfqtsh

2020-07-27 18:31:32 +08:00

PostgreSQL 9.5 and later added an upsert feature:
INSERT ... ON CONFLICT DO UPDATE ...

zhazi

2020-07-27 19:12:13 +08:00

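insertWithRename(key, n = 1) {
    // n is the attempt number: try "key" first, then "key_2", "key_3", ...
    try {
        insert(n == 1 ? key : key + "_" + n);
    } catch (DuplicateKeyException e) {
        insertWithRename(key, n + 1);  // duplicate: retry with the next suffix
    }
}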
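A runnable version of this try-and-retry idea might look like the sketch below. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server; the table mirrors the one in the question, and `insert_with_rename` and `max_tries` are hypothetical names. With PostgreSQL you would catch the driver's unique-violation error instead, and each attempt would need its own transaction or savepoint, because a failed statement aborts the surrounding transaction there.

```python
import sqlite3

# Same shape as the table in the question, in SQLite syntax (stand-in for PostgreSQL).
SCHEMA = "CREATE TABLE test (id INTEGER PRIMARY KEY, key TEXT UNIQUE NOT NULL, value INTEGER)"


def insert_with_rename(conn, key, value, max_tries=100):
    """Insert one row; on a constraint violation retry as key_2, key_3, ..."""
    candidate = key
    for n in range(2, max_tries + 2):
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(
                    "INSERT INTO test (key, value) VALUES (?, ?)", (candidate, value)
                )
            return candidate
        except sqlite3.IntegrityError:
            # Assumed to be a duplicate key; other constraint failures land here too.
            candidate = f"{key}_{n}"
    raise RuntimeError(f"no free name for {key!r} after {max_tries} tries")


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
for key, value in [("k1", 101), ("k2_dup", 102), ("k2_dup", 103)]:
    print(insert_with_rename(conn, key, value))
```

Because the database itself rejects the duplicate, this stays correct even when other sessions insert at the same time, at the cost of one statement per row.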

MoYi123

2020-07-27 21:06:29 +08:00

Just for fun. Performance will probably still be a problem. For convenience, the suffix is kept in a separate column.

create table u_insert
(
    id     serial primary key,
    key    text,
    value  int,
    suffix int default 0
);
create unique index on u_insert (key, suffix);

begin;
-- Block concurrent writers so the computed suffixes stay valid until commit.
lock u_insert;
-- ON COMMIT DROP is only allowed on temporary tables.
create temporary table tmp (id serial, key text, value int) on commit drop;
insert into tmp (key, value) values ('a', 1), ('a', 2), ('b', 1);
insert into u_insert (key, value, suffix)
select tmp.key, tmp.value,
       t.suffix + rank() over (partition by tmp.key order by tmp.id desc) as suffix
from tmp,
     -- Highest suffix already stored for each incoming key (0 if the key is new).
     (select t.key as k, greatest(max(u_insert.suffix), t.suffix) as suffix
      from u_insert
               right join (select distinct key, 0 as suffix from tmp) as t on u_insert.key = t.key
      group by t.key, t.suffix) as t
where t.k = tmp.key;
commit;

rrfeng

2020-07-27 21:14:25 +08:00

Insert 1000 rows at a time and handle the errors when they occur. Wouldn't that do it?
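One way to read this suggestion is: send each chunk as a single transaction, and only when a chunk fails redo that chunk row by row with renaming. The sketch below again uses Python's sqlite3 as a stand-in for PostgreSQL; `insert_in_chunks`, `insert_renaming`, and the cap of 1000 attempts are assumptions made for illustration, not part of the original reply.

```python
import sqlite3
from itertools import islice

INSERT_SQL = "INSERT INTO test (key, value) VALUES (?, ?)"


def insert_in_chunks(conn, rows, chunk_size=1000):
    """Insert rows chunk by chunk; fall back to per-row renaming on conflict."""
    it = iter(rows)
    while chunk := list(islice(it, chunk_size)):
        try:
            with conn:  # one transaction per chunk
                conn.executemany(INSERT_SQL, chunk)
        except sqlite3.IntegrityError:
            # The whole chunk was rolled back; redo it one row at a time.
            for key, value in chunk:
                insert_renaming(conn, key, value)


def insert_renaming(conn, key, value, max_tries=1000):
    """Try key, then key_2, key_3, ... until the insert succeeds."""
    candidate = key
    for n in range(2, max_tries + 2):
        try:
            with conn:
                conn.execute(INSERT_SQL, (candidate, value))
            return candidate
        except sqlite3.IntegrityError:
            candidate = f"{key}_{n}"
    raise RuntimeError(f"no free name for {key!r}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, key TEXT UNIQUE NOT NULL, value INTEGER)")
insert_in_chunks(conn, [("k1", 101), ("k2_dup", 102), ("k2_dup", 103)], chunk_size=2)
print(conn.execute("SELECT key, value FROM test ORDER BY id").fetchall())
```

Chunks without duplicates cost one round trip each, so the slow per-row path is only paid where conflicts actually occur.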

Habyss

2020-07-28 09:08:56 +08:00

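In other words, all of this data has to go into the table whether or not it is duplicated, and a duplicate key can simply be changed at will.
A naive question, then: what is the point of this key?

If appending _2 still yields a duplicate, does that mean data you have already batch-processed this way gets processed again? Otherwise, why would a _2 already exist?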

https://www.v2ex.com/t/693512
