I've been playing around with word embeddings lately and stumbled on the word-vector model that Tencent open-sourced last year:
https://mp.weixin.qq.com/s/b9NWR0F7GQLYtgGSL50gQw
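The model covers 8 million Chinese words. Quite a few of the entries are junk, but overall it's still pretty powerful.
I threw together a simple API for it, haha: https://zhuanlan.zhihu.com/p/94124468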
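The post doesn't say how the API ranks words, but the scores in the outputs look like cosine similarities, so here is a minimal sketch of that kind of lookup. Everything in it is an assumption for illustration: the function names `cosine` and `top_similar_words`, the `toy_vectors` table, and its made-up 3-dimensional vectors are not from the real model. It only mirrors the JSON shape shown in the test outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def top_similar_words(word, vectors, topn=20):
    """Rank every other word by cosine similarity to `word`.

    Returns a dict shaped like the API output in this post:
    {"top_similar_words": [[word, score], ...], "word": word}
    """
    query = vectors[word]
    scored = [
        [other, cosine(query, vec)]
        for other, vec in vectors.items()
        if other != word  # skip the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {"top_similar_words": scored[:topn], "word": word}


# Hypothetical toy vectors, invented for this sketch (NOT real embeddings).
toy_vectors = {
    "红烧肉": [1.0, 0.0, 0.0],
    "糖醋排骨": [0.9, 0.1, 0.0],
    "回锅肉": [0.6, 0.8, 0.0],
    "emmm": [0.0, 0.0, 1.0],
}

result = top_similar_words("红烧肉", toy_vectors, topn=2)
print(result)
```

A brute-force scan like this is fine for a toy table. With 8 million words you would want a vectorized or approximate nearest-neighbor search instead.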
Some fun tests:
1. Words similar to 红烧肉 (red-braised pork)
output:
{
"top_similar_words":[
[
"糖醋排骨",
0.8907967209815979
],
[
"红烧排骨",
0.8726683259010315
],
[
"回锅肉",
0.858664333820343
],
[
"红烧鱼",
0.8542774319648743
],
[
"梅菜扣肉",
0.8500987887382507
],
[
"糖醋小排",
0.8475514650344849
],
[
"小炒肉",
0.8435966968536377
],
[
"红烧五花肉",
0.8424086570739746
],
[
"红烧肘子",
0.8400496244430542
],
[
"糖醋里脊",
0.8381932377815247
],
[
"红烧猪蹄",
0.8374584913253784
],
[
"青椒炒肉",
0.8344883918762207
],
[
"粉蒸肉",
0.8337559700012207
],
[
"水煮肉片",
0.8311598300933838
],
[
"青椒肉丝",
0.8294434547424316
],
[
"鱼香茄子",
0.8291393518447876
],
[
"烧茄子",
0.8272593021392822
],
[
"梅干菜扣肉",
0.8267726898193359
],
[
"土豆炖牛肉",
0.8263725638389587
],
[
"红烧茄子",
0.8244959115982056
]
],
"word":"红烧肉"
}
2. Words similar to 因吹斯汀 (internet slang, a phonetic transliteration of "interesting")
output:
{
"top_similar_words":[
[
"一颗赛艇",
0.7618176937103271
],
[
"因吹斯听",
0.7523878812789917
],
[
"城会玩",
0.6856077909469604
],
[
"厉害了 word 哥",
0.6615914702415466
],
[
"emmmmm",
0.6590334177017212
],
[
"扎心了老铁",
0.6527535915374756
],
[
"神吐槽",
0.6382066011428833
],
[
"可以说是非常爆笑了",
0.6365567445755005
],
[
"不明觉厉",
0.6362186670303345
],
[
"段子哥",
0.6293908357620239
],
[
"厉害了我的哥",
0.6265187859535217
],
[
"脑洞大开",
0.6255093216896057
],
[
"hhhhhh",
0.6220428943634033
],
[
"233333",
0.6189173460006714
],
[
"没想到你是这样的",
0.6184067726135254
],
[
"屌炸天",
0.6119771003723145
],
[
"interesting",
0.6102393865585327
],
[
"emmmmmmm",
0.6097372770309448
],
[
"开脑洞",
0.6095746755599976
],
[
"猴赛雷",
0.6095525026321411
]
],
"word":"因吹斯汀"
}
3. Words similar to ojbk (internet slang for "OK")
output:
{
"top_similar_words":[
[
"我觉得 ok",
0.6393940448760986
],
[
"emmmmmmm",
0.6306545734405518
],
[
"hhhh",
0.6229800581932068
],
[
"hhhhh",
0.6225401163101196
],
[
"不存在的",
0.6077110767364502
],
[
"溜了溜了",
0.603063702583313
],
[
"hhhhhhh",
0.6008774638175964
],
[
"emmmm",
0.6002634167671204
],
[
"emmm",
0.5958442687988281
],
[
"emmmmm",
0.592516303062439
],
[
"阿喵",
0.5918310880661011
],
[
"哈哈哈",
0.590988039970398
],
[
"略略略",
0.590296745300293
],
[
"hhhhhh",
0.5870903730392456
],
[
"微笑脸",
0.5860881209373474
],
[
"tan90°",
0.5825910568237305
],
[
"没毛病",
0.5802331566810608
],
[
"233333",
0.5794929265975952
],
[
"我觉得不行",
0.5762011408805847
],
[
"就酱",
0.5751103162765503
]
],
"word":"ojbk"
}