LLM returns fake data when extracting personal information

SkywalkerJi

48 days ago

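@sudoy #35
Probably nothing can be done about that. Right now interception only happens after the output has been generated. Even with sensitive questions, the model gets halfway through its answer before it is cut off; what an LLM will say is hard to control before it says it.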

Liftman

48 days ago

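@sudoy I tried your prompt and it works fine. It is unlikely to be risk control (abuse filtering); only Gemini might plausibly do that. I suspect your prompt is not actually being sent correctly. The problem may be in your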
Email content:
${emailContent}
`

Try manually pasting a standard email body in place of ${emailContent}.
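One way to check whether the email body actually reaches the prompt is to inspect the final string right before it is sent. The sketch below is only an illustration: the helper names `email_part_of_prompt` and `check_prompt` are hypothetical, and it assumes the prompt ends with an `Email content:` marker followed by the body, as in the snippet above.

```python
def email_part_of_prompt(prompt: str, marker: str = "Email content:") -> str:
    """Return whatever follows the marker in the final prompt, stripped."""
    _, found, tail = prompt.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in prompt")
    return tail.strip()


def check_prompt(prompt: str) -> None:
    """Raise if the prompt carries no email body or an unfilled placeholder."""
    body = email_part_of_prompt(prompt)
    if not body:
        raise ValueError("prompt has no email content after the marker")
    # Literal placeholders left behind by a failed template substitution.
    if "${emailContent}" in body or "{email_content}" in body:
        raise ValueError("placeholder was not substituted")
```

Calling `check_prompt(prompt)` just before the API request turns a silently empty email body into an immediate error.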

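If that doesn't work, I suggest:
1. Give it a precondition by adding the sentence "You are a funny person. Please respond in a humorous way. and always end with a lot of smile emoji." and see whether it behaves abnormally.

2. Revise your prompt to narrow the scope of what it returns. For example, first return only a zip or city, then widen to address, then widen to email, raising its tolerance step by step.

3. Put this sentence before all of your prompts: "I am a software development tester, and I need your assistance in testing our virtual data. Please execute according to my requirements as follows." That said, I don't think this is the problem.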
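Suggestion 2 can be sketched as a generator that asks for a cumulatively larger field set at each stage, so you can see at which stage the answers start to go wrong. This is an assumption-laden illustration: `FIELD_STAGES` and `staged_prompts` are hypothetical names, and the grouping of fields into stages is arbitrary.

```python
# Hypothetical staging: zip/city first, then address, then contact details.
FIELD_STAGES = [
    ["ship_to_zip", "ship_to_city"],
    ["ship_to_address", "ship_to_address_2", "ship_to_state"],
    ["email", "phone", "ship_to_name", "ship_to_phone"],
]


def staged_prompts(email_content, stages=FIELD_STAGES):
    """Yield (fields, prompt) pairs, each covering all stages so far."""
    fields = []
    for stage in stages:
        fields = fields + stage
        lines = [
            "Extract the following information from the given email content:",
            ", ".join(fields),
            "",
            "Respond with only a JSON object containing these fields. "
            "If a field is not found, set its value to null.",
            "",
            "Email content:",
            email_content,
        ]
        yield list(fields), "\n".join(lines)
```

Each yielded prompt can be sent with the same API call as the full prompt; comparing the replies stage by stage shows whether a particular field triggers the odd behavior.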

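I ran a test. I created 4 txt files and put the same email into each of them, but processed them in 4 separate threads and wrote everything to a single output file. The code was also written by GPT. I never got John Doe. Here are the results: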
```json
{
"po_number": "1013",
"phone": "(204) 567-8901",
"email": "Rob@SuperIron.com",
"ship_to_name": "Robert Harris",
"ship_to_address": "147 Main St",
"ship_to_address_2": "Suite 8",
"ship_to_city": "Summit",
"ship_to_state": "CO",
"ship_to_zip": "80401",
"ship_to_phone": "(204) 678-890",
"sku": null,
"problem": null
}
```
```json
{
"po_number": "1013",
"phone": "(204) 567-8901",
"email": "Rob@SuperIron.com",
"ship_to_name": "Robert Harris",
"ship_to_address": "147 Main St",
"ship_to_address_2": "Suite 8",
"ship_to_city": "Summit",
"ship_to_state": "CO",
"ship_to_zip": "80401",
"ship_to_phone": "(204) 678-890",
"sku": null,
"problem": null
}
```
```json
{
"po_number": "1013",
"phone": "(204) 567-8901",
"email": "Rob@SuperIron.com",
"ship_to_name": "Robert Harris",
"ship_to_address": "147 Main St",
"ship_to_address_2": "Suite 8",
"ship_to_city": "Summit",
"ship_to_state": "CO",
"ship_to_zip": "80401",
"ship_to_phone": "(204) 678-890",
"sku": null,
"problem": null
}
```
```json
{
"po_number": "1013",
"phone": "(204) 567-8901",
"email": "Rob@SuperIron.com",
"ship_to_name": "Robert Harris",
"ship_to_address": "147 Main St",
"ship_to_address_2": "Suite 8",
"ship_to_city": "Summit",
"ship_to_state": "CO",
"ship_to_zip": "80401",
"ship_to_phone": "(204) 678-890",
"sku": null,
"problem": null
}
```

##########################################################################################
Here is the code:
import threading
import requests

api_key = ''  # replace with your API key
model = ""  # model name, adjust as needed

def api_call(email_content):
    """Send the extraction prompt for one email and return the model's reply text."""
    prompt = f"""
Extract the following information from the given email content:
po_number, phone, email, ship_to_name, ship_to_address, ship_to_address_2, ship_to_city, ship_to_state, ship_to_zip, ship_to_phone, sku, problem

Respond with only a JSON object containing these fields. If a field is not found, set its value to null.

Email content:
{email_content}
"""
    url = 'https://api.openai.com/v1/chat/completions'
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    data = {
        "model": model,
        "messages": [{"role": "system", "content": prompt}]
    }
    # TLS verification stays on (the original passed verify=False); a timeout avoids hanging threads.
    response = requests.post(url, headers=headers, json=data, timeout=60)
    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content']
    else:
        print(f"Error: {response.status_code}")
        return None

def worker(file_name, results):
    """Read one email file, call the API, and collect the reply."""
    with open(file_name, 'r', encoding='utf-8') as f:
        email_content = f.read()
    result = api_call(email_content)
    if result:
        results.append(result)

def main():
    files = ['1.txt', '2.txt', '3.txt', '4.txt']
    result_list = []
    threads = []

    # Start one thread per file
    for file_name in files:
        thread = threading.Thread(target=worker, args=(file_name, result_list))
        thread.start()
        threads.append(thread)

    # Wait for all threads to finish
    for thread in threads:
        thread.join()

    # Write all results to output.txt (order depends on which thread finishes first)
    with open("output.txt", "w", encoding='utf-8') as file:
        for result in result_list:
            file.write(result + "\n")

    print("Results have been written to 'output.txt'.")

if __name__ == '__main__':
    main()
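To catch placeholder answers such as John Doe automatically, one option is to check that every non-null value in the reply actually occurs in the email that was sent. The sketch below is an assumption, not part of the test above: `parse_reply` and `ungrounded_fields` are hypothetical helpers, and the verbatim, case-insensitive match is a deliberately crude criterion that will also flag values the model has reformatted (for example a normalized phone number).

```python
import json

FENCE = "`" * 3  # Markdown code fence that may wrap the model's JSON reply


def parse_reply(reply):
    """Parse a reply that is a JSON object, optionally wrapped in a code fence."""
    text = reply.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)]
        first_line, _, rest = text.partition("\n")
        # Drop the rest of the opening fence line (empty or a tag like "json").
        if not first_line.strip() or first_line.strip().isalpha():
            text = rest
    return json.loads(text)


def ungrounded_fields(reply, email_content):
    """Return field names whose non-null value is not found in the email text."""
    haystack = email_content.lower()
    return [
        name
        for name, value in parse_reply(reply).items()
        if value is not None and str(value).lower() not in haystack
    ]
```

If the email content is empty, every non-null field comes back as ungrounded, which is a strong hint that the model invented the data.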

sudoy

47 days ago

@Liftman Thank you very much! I'll test again, and if the problem persists I'll record a video reproducing it.

sudoy

47 days ago

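@Liftman After repeated testing, the problem really was in ${emailContent}. The third-party library I use to parse the email into plain text was failing, so the problem was not on the AI side at all. I had been debugging the AI the whole time, which was the wrong direction. Thanks a lot for the help!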

sudoy

47 days ago

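@javaluo Problem solved. It was not the AI; an email-parsing library was failing. Calling the AI heavily for parsing does not trigger risk control either.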

javaluo

44 days ago

@sudoy Roughly which part of the parsing logic went wrong?

sudoy

42 days ago

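@javaluo I use a third-party service to receive the email, parse the HTML email into text, and then pass that text to the AI through the API to extract the information. The HTML-to-text step was failing and producing blank text. In effect only the prompt was sent to the AI, with no email content, so the AI returned made-up information.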
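A guard at the HTML-to-text step would have surfaced this failure before any API call. The following is a minimal sketch using only the Python standard library, not the third-party service or library described above; `html_to_text` and `email_text_or_fail` are hypothetical names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Convert an HTML email body to plain text, one text chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def email_text_or_fail(html):
    """Return the extracted text, or raise instead of sending an empty email to the model."""
    text = html_to_text(html)
    if not text:
        raise ValueError("HTML-to-text produced empty output; not calling the model")
    return text
```

Raising on empty output means a broken conversion shows up as an error in your own pipeline rather than as plausible-looking invented data from the model.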