When we look at how langchain uses HuggingFaceHub models, there is a part where the author doesn't know how to stop the generation, https://github.com/hwchase17/langchain/blob/master/langchain/llms/huggingface_pipeline.py#L182:
class HuggingFacePipeline(LLM):
    ...
    def _call(
        ...
        if stop is not None:
            # This is a bit hacky, but I can't figure out a better way to enforce
            # stop tokens when making calls to huggingface_hub.
            text = enforce_stop_tokens(text, stop)
        return text
What should I use to add the stop token to the end of the template?
If we look at https://github.com/hwchase17/langchain/blob/master/langchain/llms/utils.py, it is just a re.split that splits the input string on the list of stop words and takes the first partition:
re.split("|".join(stop), text)[0]
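One caveat worth noting (my own observation, not something the langchain source handles): the stop words are joined straight into a regex pattern, so a stop word containing a regex metacharacter such as "." or "|" will be misinterpreted. A minimal sketch of a regex-safe variant using re.escape:

```python
import re

def enforce_stop_tokens_escaped(text, stop):
    """Like langchain's enforce_stop_tokens, but regex-safe.

    re.escape makes each stop word match literally, so metacharacters
    such as '.' or '|' inside a stop word cannot change the pattern.
    """
    return re.split("|".join(re.escape(s) for s in stop), text)[0]

print(enforce_stop_tokens_escaped("foo.bar", ["."]))  # -> foo
```

Without the escaping, a stop word like "." would match every character and the first partition would be empty.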
Let's try getting some generation output from a Huggingface model, e.g.
from transformers import pipeline
from transformers import GPT2LMHeadModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
output = generator("Hey Pizza! ")
output
[out]:
[{'generated_text': 'Hey Pizza! 」\n\n「Hurry up, leave the place! 」\n\n「Oi! 」\n\nWhile eating pizza and then, Yuigahama came in contact with Ruriko in the middle of the'}]
If we apply re.split:
import re

def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop words occur."""
    return re.split("|".join(stop), text)[0]

stop = ["up", "then"]
text = output[0]['generated_text']
re.split("|".join(stop), text)
[out]:
['Hey Pizza! 」\n\n「Hurry ',
', leave the place! 」\n\n「Oi! 」\n\nWhile eating pizza and ',
', Yuigahama came in contact with Ruriko in the middle of the']
But that is not what I want; I want to split at the point where the generation ends. Which tokens can I use for "enforce_stop_tokens"?
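A hedged sketch of one possible direction (my assumption, not a verified answer): for the gpt2 checkpoint, the tokenizer's end-of-sequence marker is the literal string "<|endoftext|>", exposed as tokenizer.eos_token, so it is a natural candidate stop word, provided it is escaped before being joined into the regex:

```python
import re

# Assumption: for the 'gpt2' checkpoint, tokenizer.eos_token is the
# literal string "<|endoftext|>"; hard-coded here to avoid a model download.
EOS = "<|endoftext|>"

def cut_at_eos(text):
    # EOS contains '|' and '<', so escape it before building the regex,
    # then keep only the text before the first end-of-text marker.
    return re.split(re.escape(EOS), text)[0]

sample = "Hey Pizza! It was delicious.<|endoftext|> Unrelated text"
print(cut_at_eos(sample))  # -> Hey Pizza! It was delicious.
```

Note that this only helps when the model actually emits the marker; short sampled continuations like the one above may simply run until the length limit without ever producing it.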