Fine-tuning an LLM on my Telegram messages: an experience report
November 27, 2023
To most people who know me, I might as well be a text-processing program. Since the input and output are that simple, could a model replace me? To pull that off, a model would need to do more than mimic my writing style; it would also need deep knowledge about me. My Telegram history is the best source for that: I use it every day, and it captures nearly all of my thoughts and actions.
Approach
The simplest approach would be to extract all of my messages, load them into ChatGPT's context, and instruct it to use this information to mimic my style when replying to new messages. However, that approach is constrained by the context window size, so I would have to preprocess the messages and distill the key points. Wanting to avoid that tedious step, I considered retrieval-augmented generation (RAG) to pull in relevant information on demand. But in my experience, retrieving from diverse data such as chat logs usually requires supervised fine-tuning of the retrieval model, which I had no desire to do. Fine-tuning the language model itself was therefore the best option: it captures my writing style and accumulates knowledge from all of my messages at once, without me having to pick out what matters.
OpenAI offers fine-tuning, but since I would be working with private messages, I didn't want to depend on any third-party fine-tuning service. So I had to choose a base model. According to the Hugging Face Open LLM Leaderboard, Mistral 7B is one of the top small models (no more than 13B parameters), outperforming even Llama 2 13B. The next question was whether LoRA would suffice or whether full fine-tuning was needed. Comparison studies [1] [2] suggest that although LoRA falls short of full fine-tuning in some cases, it still works well most of the time. For a specific task like mine (Russian language + chat), though, I found a paper in which researchers instruction-tuned Llama on Chinese, a goal comparable in complexity to mine. They found that LoRA fine-tuning on a base model without prior instruction tuning is less effective than full fine-tuning, whereas on a model that has already been instruction-tuned, LoRA can match full fine-tuning. For me, that meant either full fine-tuning of a base model or LoRA on a model already fine-tuned for Russian chat. Since I couldn't find any model fine-tuned for Russian chat, I decided to try LoRA on Dolphin, a Mistral model fine-tuned for English chat.
So the plan was:
- First, try LoRA on Dolphin, a Mistral model fine-tuned for English chat.
- If the quality falls short, try full fine-tuning on Mistral.
Data preparation
Messaging in an app like Telegram differs from email in one key respect: the natural flow of conversation. Messages rarely alternate one-to-one between you and your contact; you are more likely to fire off a couple of messages in a row and get several replies in return, and the messages are usually short. I wanted to preserve this natural conversational feel in my data, so I processed it with that in mind.
Using Telegram's built-in export feature, I exported all my chats as JSON. After filtering the messages and grouping them into sessions, I had compiled data from my last five years of using Telegram: 15,789 sessions from 466 chats, with an average of 8.51 messages per session. For structuring the data I chose the ChatML format. Here is an example session, translated from Russian:
<|im_start|>John Smith
>>> it’s really annoying, I keep hitting the 135-minute time limit
>>> trying to solve it efficiently, but it just won’t work
<|im_end|>
<|im_start|>Alexander Smirnov
>>> same struggle here
>>> are you still sticking with the original approach?
<|im_end|>
<|im_start|>John Smith
>>> not sure, I think we have the same idea
>>> just like you said
>>> I’m trying the reversed-strings approach, hoping to find something there
>>> but it doesn’t seem to work, because the z-function ruins everything…
<|im_end|>
<|im_start|>Alexander Smirnov
>>> I don’t see how the z-function helps here
<|im_end|>
<|im_start|>John Smith
>>> me neither, it feels like I’m doing it iteratively either way, but yes, some strings have to be reversed to build the z-function
>>> it’s just a random solution from the discussion
>>> from the discussion
<|im_end|>
<|im_start|>Alexander Smirnov
>>> got it
<|im_end|>
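To make the preprocessing concrete, here is a minimal sketch of how a Telegram JSON dump could be turned into such sessions. It assumes the standard `result.json` layout produced by Telegram's export feature; the one-hour session gap and the field handling are illustrative simplifications, not the exact rules behind the dataset above.

```python
import json
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=1)  # illustrative threshold for splitting sessions

def load_chats(path):
    """Yield (chat name, plain-text messages) pairs from a Telegram export."""
    with open(path, encoding="utf-8") as f:
        dump = json.load(f)
    for chat in dump["chats"]["list"]:
        messages = [
            m for m in chat["messages"]
            if m.get("type") == "message" and isinstance(m.get("text"), str) and m["text"]
        ]
        yield chat.get("name", "unknown"), messages

def split_sessions(messages):
    """Group consecutive messages into sessions separated by long pauses."""
    sessions, current, last_time = [], [], None
    for m in messages:
        t = datetime.fromisoformat(m["date"])
        if last_time is not None and t - last_time > SESSION_GAP and current:
            sessions.append(current)
            current = []
        current.append(m)
        last_time = t
    if current:
        sessions.append(current)
    return sessions

def to_chatml(session):
    """Render a session in ChatML, merging consecutive messages per speaker."""
    parts, speaker, buffer = [], None, []
    for m in session:
        if buffer and m["from"] != speaker:
            parts.append(f"<|im_start|>{speaker}\n" + "\n".join(buffer) + "<|im_end|>")
            buffer = []
        speaker = m["from"]
        buffer.append(f">>> {m['text']}")
    if buffer:
        parts.append(f"<|im_start|>{speaker}\n" + "\n".join(buffer) + "<|im_end|>")
    return "\n".join(parts)
```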
My data collator computes the loss only on someone's response. Predicting who speaks next is comparatively trivial, and we don't want the model to focus on learning that; so only the message bodies of the conversation, not the speaker headers, take part in the loss calculation.
Note that the loss is computed not only on my responses but on everyone else's as well. This is deliberate: that way, the model can play not just me, but also the friends I chat with most!
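As a sketch of what this masking boils down to: with HuggingFace-style labels, positions excluded from the loss are set to -100. The helper below is illustrative (locating the response spans inside the tokenized session is assumed to happen elsewhere); it is not the collator's actual code.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by CrossEntropyLoss

def mask_non_response_tokens(input_ids, response_spans):
    """Build labels that cover only response tokens.

    input_ids      -- 1-D tensor of token IDs for a whole session
    response_spans -- (start, end) index pairs marking each reply body

    Everything outside the spans, including the speaker headers, is masked
    out, so the model is never trained to predict who speaks next.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in response_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```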
Evaluation plan
I tested the models in two ways. First, I had the model play me while I chatted with myself from the perspective of various friends. Then I chatted as myself while the model played my friends. My opening was always the same two messages: “hey” and “what’s up?” (in Russian: “прив” and “как дела?”). In each dialogue below, the model plays the second participant named in the heading and generates that side's phrases. All conversations were held in Russian and are shown here in translation. For testing I used oobabooga/text-generation-webui.
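For reference, a single chat turn can also be generated directly with `transformers`, which is roughly what the webui does under the hood. The checkpoint name below is a placeholder for the fine-tuned model, and the sampling settings are generic defaults rather than my exact webui configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "my-telegram-mistral" is a placeholder for the fine-tuned checkpoint.
tok = AutoTokenizer.from_pretrained("my-telegram-mistral")
model = AutoModelForCausalLM.from_pretrained("my-telegram-mistral", device_map="auto")

# The prompt ends with an open speaker header, so the model writes that reply.
prompt = (
    "<|im_start|>Friend 1\n>>> hey\n>>> what's up?<|im_end|>\n"
    "<|im_start|>Alexander Smirnov\n"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```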
First, I wanted to see how Mistral fine-tuned for generic conversation handles this task without any training on my data.
Friend 1 vs Alexander Smirnov
>>> hey
>>> what’s up?
>>> I’m good, what about you?
>>> nothing much
>>> what did you do today?
>>> I worked.
>>> then I did a few tasks in Python and studied data processing libraries.
>>> in the evening I’ll do wrestling training.
>>> have you ever been into sports or do you have any hobbies?
>>> I see
>>> want to watch a movie today?
>>> maybe, if I find the time.
>>> what’s your favorite music? Electronic, rock, or other genres?
Alexander Smirnov vs Friend 1
>>> hey
>>> what’s up?
>>> well, everything is fine so far. what’s new with you?
>>> spent the whole day programming
>>> what did you do today?
>>> finished at work
>>> and did some housework, now I’m resting. are you also resting or still working?
>>> I’m already resting
>>> well, that’s good. can we talk about something interesting or do you have any questions?
Okay, it can string together coherent sentences. The biggest problem is that it is unaware of the conversation's context, so the replies come out bland and generic. The messages lack any distinctive style and feel very basic. Another problem is the model's weak Russian. This was expected: the model is too small to generalize well beyond English. On top of that, the model is overly proactive, ending nearly every message with a question, which is not how people actually talk in messengers.
Let's try to fix all of that!
LoRA
LoRA is a training approach that is cheap in both time and hardware: it trains only about 1% of the total weights. I used a sequence length of 1024 and a batch size of 8. The training consumed 20 GB of VRAM on an RTX 3090, ran for three epochs, and took 5.5 hours. I trained on vast.ai, where the GPU cost $0.362 per hour; the total came to about $2 (not counting the time spent on experiments and bug fixing).
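For illustration, the adapter setup might look roughly like this with the `peft` library. The rank, alpha, and target modules are common choices for Mistral-style models, and the Dolphin checkpoint id is an assumption; treat this as a sketch rather than my exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumption: this is the Dolphin variant used; substitute the real checkpoint.
model = AutoModelForCausalLM.from_pretrained("ehartford/dolphin-2.1-mistral-7b")

lora_config = LoraConfig(
    r=16,                 # illustrative rank; keeps trainable weights near 1%
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check the ~1% figure
```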
Here are the results:
Friend 1 vs Alexander Smirnov
>>> hey
>>> what’s up?
>>> hi) all good
>>> you?)
>>> ok as well
>>> what you were doing today?
>>> had an amazon interview and solved problems
>>> had to solve 3
>>> cool
>>> hard?
>>> don’t know what
>>> hadn’t looked yet
>>> I hadn’t started yet
>>> chatting whole day with the team and meet new ones
>>> how are they?
>>> ok
>>> same coworking as you have))
Friend 2 vs Alexander Smirnov
>>> hey
>>> what’s up?
>>> ok, we show yesterday’s product to searchcom, today we write endpoints for them so that someone else can request our api
>>> cool
>>> what problems do we have now?
>>> no problems
>>> everything is working
>>> I added an option to choose index
>>> so now we can attach several indexes
>>> and separate collections within single index
>>> what we have to do?
>>> test
>>> yes ok
>>> are they always will be sending post-script?
Friend 3 vs Alexander Smirnov
>>> hey
>>> what’s up?
>>> Hi, good)
>>> you?
>>> ok as well
>>> what do you do at the evening?
>>> we have meeting with office collegues at 8 30 msk
>>> after it I can
>>> can what?
>>> haha)
>>> didn’t get it
>>> I don’t know(
Alexander Smirnov vs Friend 1
>>> hey
>>> what’s up?
>>> hi)
>>> seems good
>>> was working, than went to the call with Nadya
>>> now reading normally
>>> you?
>>> good as well
>>> any plans for the evening?
>>> after work wendesday I would love to go running
>>> and then have a lunch
>>> your plans?
>>> I though about watching movie
>>> aa
>>> ooo
>>> to be honest I like when it is raining
>>> when you don’t feel like going anywhere
Alexander Smirnov vs Friend 2
>>> hey
>>> what’s up?
>>> Hi, great!
>>> Sorry will be 20-30 min late
>>> ok
>>> why?
>>> yes, I just got up and forgot to press the «exit sleep» button
>>> my telegram didn’t close on its own
>>> I thought it behaved like this when you logging off the Internet, but no
>>> I still logged off the Internet, but telegram still didn’t respond
>>> and it was strange, because yesterday there was also such a case when I sent you a message and did not receive an answer
>>> today I looked at mine, but not yours, but yours was displayed)
>>> ok
>>> any plans?
>>> yes, we can chat a little
>>> and I can tell you that the method for generating answers worked really well for me
>>> he gives respectable results in training
>>> but not on the test
Alexander Smirnov vs Friend 3
>>> hey
>>> what’s up?
>>> hey, hi
>>> Good, you?)
>>> Haven’t gone to bed yet)0)0)0)
>>> Today I get up and go to a programming course
>>> To tighten up the base, etc.
>>> nice
>>> what did you do yesterday?
>>> Yesterday there was already the same course in which it seemed to me that I did very poorly
>>> Then I wanted to solve hw but I don’t care
>>> Otherwise I’m fucked because I can’t sleep)
>>> Really fucked up
>>> It will probably be better now, since I’m already maximally overtrained
>>> cool man
>>> what are your plans for life in general?
>>> Fuck, get a job interviews first
>>> And if everything is ok, then actually order a computer
>>> Like, I think you should order a computer when it’s more profitable than working at work
>>> But I don’t care
This is a significant improvement. The model picks up the style of the person it responds to, and it also identifies the topics most frequently discussed between a given pair of people: for example, the conversations with Friend 2 revolve mostly around work. However, the grammar is still shaky, and it quickly loses the thread of the conversation. I'm fairly sure LoRA would reach reasonable quality in English, and full fine-tuning might not be necessary there. But since Russian is not the model's native language, let's try full fine-tuning next.
Full fine-tuning
Full fine-tuning is more challenging because it involves multi-GPU training. Popular approaches include ZeRO & DeepSpeed [3] or FSDP [4], with FSDP essentially being based on ZeRO3 [5]. I went with FSDP.
For the training pipeline, I drew on the Stanford Alpaca fine-tuning code and Anton Bacaj's Mistral fine-tuning code.
Using FSDP full shard in half precision with a sequence length of 1024 and a micro batch size of 2 required 63 GB of VRAM on each of the eight 80 GB A100 GPUs. Training for three epochs took only 20 minutes. The VM cost $8.88 per hour, for a total of about $3, not counting the time spent on experiments and debugging.
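A rough sketch of such a run through HuggingFace's `Trainer`, launched with something like `torchrun --nproc_per_node=8 train.py`. The dataset and collator are assumed to come from the data-preparation step; the values mirror the numbers above, but the original script (based on the Alpaca and Mistral code mentioned earlier) may differ in detail.

```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

args = TrainingArguments(
    output_dir="out",
    bf16=True,                      # half precision
    per_device_train_batch_size=2,  # micro batch size 2
    num_train_epochs=3,
    fsdp="full_shard auto_wrap",    # shard params, grads, and optimizer state
    fsdp_config={"transformer_layer_cls_to_wrap": ["MistralDecoderLayer"]},
    logging_steps=10,
)

# train_dataset / data_collator: the ChatML sessions and loss-masking
# collator from the data-preparation section (definitions not repeated here).
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, data_collator=data_collator)
trainer.train()
```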
Conversations:
Friend 1 vs Alexander Smirnov
>>> hey
>>> what’s up?
>>> hey) good
>>> you?
>>> good as well
>>> what you were doing today?
>>> played a board game
>>> I’ll go home now
>>> nice
>>> what kind of game?
>>> catan
>>> strategy
>>> very good
>>> who was there?
>>> Max Yavich and Fedya
>>> who won?
>>> Max
Friend 2 vs Alexander Smirnov
>>> hey
>>> what’s up?
>>> hi, good
>>> let’s meet in half an hour?
>>> yep
>>> what did you want?
>>> damn I haven’t finished the script for recalculating models yet
>>> and I need to finish it to send it to them
>>> this recalculation takes a lot of time, so I just want to turn around and start doing it and not waste time on preparation
>>> got it
>>> what is the plan for today?
>>> meet in half an hour to discuss what we are doing
>>> if decided, then I send an application to them and wait for a response
>>> if we haven’t decided, then we continue to work
>>> who do you mean by them?
>>> investors
>>> https://www.ycombinator.com/howtoapply
>>> there are two tests)
Friend 3 vs Alexander Smirnov
>>> hey
>>> what’s up?
>>> hey we’re sitting and solving problems
>>> what kind of problems?
>>> data analysis +-
>>> and for what?
>>> https://yandex.ru/cup/profile/
>>> it seems it just became more interesting
>>> cool
>>> what are you planning this evening?
>>> idk yet
>>> and on the weekend?
>>> apparently I’ll also finish solving problems and get some work done
Alexander Smirnov vs Friend 1
>>> hey
>>> what’s up?
>>> hey)
>>> good
>>> you?
>>> that’s okay too
>>> what did you do today?
>>> well, I worked, of course
>>> and also
>>> met with Alena
>>> ate at the ramen shop with everyone
>>> and went home
>>> by the way, we booked a place for 14 people
>>> for the evening))
>>> cool
>>> and what’s about university?
>>> yeah
>>> studied a lot
>>> in general, every day I want to study more
>>> what is the hardest thing now?
>>> well I don’t know
>>> it just takes a very long time
>>> there are some optimization methods that need to be rewritten
>>> so I think maybe it’s not worth optimizing
Alexander Smirnov vs Friend 2
>>> hey
>>> what’s up?
>>> Hey, good, what about you?
>>> Are you already in Bangkok?
>>> yeah, already here
>>> what are your plans for today?
>>> Well, I would like to work on these pioneer tasks, if you don’t mind, we can do it together
>>> Or you can skip, I’ll finish the livechat tomorrow
>>> and what about live chat?
>>> Damn right now I launched the front on the prod without this shit that generates this thing with a logo, and it swears, says to connect js from google analytics
>>> I understand that this is a fucking cringe, but I still tried to remember where to get it
>>> yeah got it
>>> what will you do on weekends?
>>> We are meeting with friends here before leaving, tomorrow we need to go to the hotel by 6 am to spend the day in the best traditions
>>> There’s another party in the evening
>>> Actually, let’s have a call, I’ll tell you
Alexander Smirnov vs Friend 3
>>> hey
>>> what’s up?
>>> Hi, haven’t you called yet?)
>>> Hey
>>> ok)
>>> I’m going to play Dota now
>>> Join if you want to)
>>> mb a little later)
>>> what are your plans for weekends?
>>> Lot of gyms/flexes/parks
>>> + maybe come to your place
>>> hmm well mb mb)
>>> what’s with your job btw?
>>> Well, I’m waiting for the info
>>> did they say what deadlines are?
>>> He said that there will be a call next week to discuss it
>>> remind me what this position is
>>> Data Scientist
>>> The team is engaged in recommending products and other shit
>>> Creating model
>>> As part of this position, you will have to develop skills in machine learning, statistical data analysis, programming and much more
>>> It is how I understood it
Although the conversations became livelier and more interesting, the model can still lose the context. Its Russian improved, but errors remain. I believe that for a narrow task with limited data like mine, it would be worth doing unsupervised fine-tuning on a large corpus of Russian text before the task-specific fine-tuning. Adding the names of frequent conversation partners as dedicated tokens might also raise the quality, as sketched below.
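The name-as-token idea would look roughly like this with a HuggingFace tokenizer; the names below are placeholders for real conversation partners.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Placeholder names: each frequent partner becomes a single token, so the
# model no longer has to spell names out subword by subword.
num_added = tokenizer.add_tokens(["Alexander Smirnov", "Friend 1", "Friend 2"])

# Grow the embedding matrix to cover the new rows; the new embeddings are
# randomly initialized and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```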
That said, I wouldn't say it is dramatically better than LoRA. It might be more effective to focus on a single person, computing the loss only on their responses, instead of trying to learn about every conversation partner.
Closing thoughts
Of course, the results shown here are cherry-picked. Not because most of the model's replies were inadequate, but because many of them were simple ones like "I'll call you later", "busy", and "ok", which are very common in any conversation. Even so, the model excels at mimicking the style of the person it imitates, and it picks up on the topics two people discuss most often. Where it clearly falls short is awareness of the conversation's context: answering questions like "hey, what's up?" or "any plans for the weekend?" is hard without knowing what is actually going on in someone's life. A system like Rewind, which captures everything a user does on their computer, might help with that.
Code
You can find the code for this project, along with instructions on how to run it on your own Telegram dump, in my GitHub repo. Training logs are available on WandB.