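Microsoft's small language model Phi-3 is touted as fast enough to run even on a smartphone, so I tried it on two tasks: summarization and translation into Japanese.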

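Environment and assumptions

Model

I use the 4-bit quantized model (Phi-3-mini-4k-instruct-q4.gguf, the file loaded in the code below).

Environment

I use the following Docker image.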

FROM python:3.11
RUN pip install llama-cpp-python

The specs of my local PC are as follows.

OS: WSL Ubuntu 20.04
CPU: i7-10700K

Code

I run everything with the following code.

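from llama_cpp import Llama

# Load the 4-bit quantized GGUF model once; generate_text() below reuses it.
llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",
    n_ctx=4096,  # The max sequence length to use
    seed=0,
)

def generate_text(prompt: str, max_tokens: int = 256) -> str:
    # Wrap the prompt in Phi-3's chat markers and stop at the end-of-turn token.
    output = llm(
        f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
        max_tokens=max_tokens,
        stop=["<|end|>"],
        echo=False,
    )
    # The generated text is in the first (and only) choice.
    return output['choices'][0]['text']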

Trying summarization

As a test, I have it summarize the abstract of the LightGBM paper as bullet points.

# Summarization
prompt = """
Summarize the following sentences with bullet points:

Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
"""
summary = generate_text(prompt)
print(summary)

The result was as follows.

>>> print(summary)
- Gradient Boosting Decision Tree (GBDT) widely used but has efficiency issues with high feature dimensions and large data sizes.
- Traditional implementations like XGBoost use exhaustive search for split points, which is time-consuming.
- Proposed two techniques:
  - Gradient-based One-Side Sampling (GOSS): Excludes instances with small gradients to improve efficiency in estimating information gain.
  - Exclusive Feature Bundling (EFB): Binds mutually exclusive features to reduce feature count, using a greedy algorithm due to NP-hardness.
- LightGBM integrates GOSS and EFB for faster training times with high accuracy retention on various datasets by up to 20x compared to conventional GBDT methods.

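It captures the key points, and I think it is a pretty decent summary.

Inference time

The summary above took 78 seconds. Overall that is 6.9 tokens/s (537 tokens, prompt included, in 77.8 s); the eval time, which is the slowest stage, ran at 3.58 tokens/s.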

llama_print_timings:        load time =   29868.38 ms
llama_print_timings:      sample time =      87.45 ms /   170 runs   (    0.51 ms per token,  1943.99 tokens per second)
llama_print_timings: prompt eval time =   29867.60 ms /   368 tokens (   81.16 ms per token,    12.32 tokens per second)
llama_print_timings:        eval time =   47259.42 ms /   169 runs   (  279.64 ms per token,     3.58 tokens per second)
llama_print_timings:       total time =   77788.34 ms /   537 tokens
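The throughput figures quoted above can be recomputed from the log itself. The following is a minimal sketch; `parse_timings` and `tokens_per_second` are helper names made up for this illustration, and the regular expression only assumes the `llama_print_timings` layout shown in this post.

```python
import re

# Matches lines such as:
#   llama_print_timings:        eval time =   47259.42 ms /   169 runs   (...)
# The "/ N runs|tokens" part is optional because the load-time line lacks it.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log: str) -> dict:
    """Return {stage: (milliseconds, token_count or None)} for each timing line."""
    timings = {}
    for match in TIMING_RE.finditer(log):
        count = match.group("count")
        timings[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return timings


def tokens_per_second(ms: float, tokens: int) -> float:
    """Convert a (milliseconds, token count) pair into tokens per second."""
    return tokens / (ms / 1000.0)


# The log from the local summarization run, copied from this post.
LOG = """\
llama_print_timings:        load time =   29868.38 ms
llama_print_timings:      sample time =      87.45 ms /   170 runs   (    0.51 ms per token,  1943.99 tokens per second)
llama_print_timings: prompt eval time =   29867.60 ms /   368 tokens (   81.16 ms per token,    12.32 tokens per second)
llama_print_timings:        eval time =   47259.42 ms /   169 runs   (  279.64 ms per token,     3.58 tokens per second)
llama_print_timings:       total time =   77788.34 ms /   537 tokens
"""

timings = parse_timings(LOG)
overall_tps = tokens_per_second(*timings["total"])
eval_tps = tokens_per_second(*timings["eval"])
print(f"overall: {overall_tps:.1f} tokens/s, eval: {eval_tps:.2f} tokens/s")
```

Note that the overall figure counts prompt tokens as well as generated ones, so the eval rate is the better indicator of generation speed.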

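The Phi-3 paper reports 12 tokens/s on an iPhone with the A16 chip (6-core CPU + 5-core GPU), so at 3.58 tokens/s for generation my i7-10700K apparently loses to it.

Trying translation into Japanese

Reportedly about 10% of Phi-3's dataset is multilingual, and it does in fact handle Japanese after a fashion. The output is not very natural, though, and dubious Japanese shows up often.

I have it translate the summary from earlier into Japanese.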

# Japanese translation
prompt = f"Translate following sentence to Japanese:\n{summary}"
translated = generate_text(prompt, max_tokens=512)
print(translated)

The result is below.

>>> print(translated)
- グレード Boosting 決定木 (GBDT)が広く利用されますが、高次元特徴量と大規模なデータ
サイズに対して効率性の問題があります。
- XGBoost型の伝統的な実装は切分点を全探索させることで、時間がかかります。
- 提案した二つの技術:
   - グラディエントベースの一側サブセット選択(GOSS):小さい勾配のインスタンスを除外す
ることで、情報養分の予測に向けた効率性を高めます。
   - ユニークな特徴集合バンディング(EFB):必要以上に多くの特徴を持つものを一つに結ぶ 
ことで、難しいNP-Hard性説明のために精密アルゴリズムを使用します。
- ライトGBMはGOSSとEFBを組み合わせることで高々20倍以上、変換時間の速度向上に加えなが
ら、さまざまなデータセットにおいて高い信頼性を保持します。

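For a start, it cannot even render GBDT as 「勾配ブースティング決定木」, the usual Japanese term for gradient boosting decision tree. The Japanese is odd throughout. It looks too rough to use for Japanese.

Generation took 182 seconds (overall: 3.1 tokens/s, eval time: 2.45 tokens/s).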

llama_print_timings:        load time =   29868.38 ms
llama_print_timings:      sample time =     218.57 ms /   394 runs   (    0.55 ms per token,  1802.65 tokens per second)
llama_print_timings: prompt eval time =   19891.78 ms /   180 tokens (  110.51 ms per token,     9.05 tokens per second)
llama_print_timings:        eval time =  160245.35 ms /   393 runs   (  407.75 ms per token,     2.45 tokens per second)
llama_print_timings:       total time =  182150.50 ms /   573 tokens

Generation speed on Lambda

Next I try AWS Lambda with 10 GB of memory, the maximum (Lambda allocates more CPU as the configured memory goes up).
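The post does not show the Lambda side of the code, so the following is only a sketch of how the same call might be wrapped in a handler. The model path, the `MODEL_PATH` environment variable, the event shape, and the `get_llm` / `run_inference` helpers are all assumptions made for illustration, not the setup actually used here.

```python
import os

# Hypothetical location of the GGUF file inside the Lambda container image.
MODEL_PATH = os.environ.get(
    "MODEL_PATH", "/opt/models/Phi-3-mini-4k-instruct-q4.gguf"
)

_llm = None  # cached so that warm invocations skip the model load


def get_llm():
    """Load the model once per execution environment and reuse it afterwards."""
    global _llm
    if _llm is None:
        from llama_cpp import Llama  # imported lazily, only when first needed

        _llm = Llama(model_path=MODEL_PATH, n_ctx=4096, seed=0)
    return _llm


def run_inference(llm, event: dict) -> dict:
    """Run one completion. Assumed event shape: {"prompt": str, "max_tokens": int}."""
    prompt = event["prompt"]
    output = llm(
        f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
        max_tokens=int(event.get("max_tokens", 256)),
        stop=["<|end|>"],
        echo=False,
    )
    return {"text": output["choices"][0]["text"]}


def handler(event, context):
    """Lambda entry point: load (or reuse) the model, then generate."""
    return run_inference(get_llm(), event)
```

Keeping the model in a module-level variable means the load time seen in the logs is paid once per execution environment rather than on every invocation.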

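Summarization first. The generated result is below. The wording differs, but the content is broadly the same (even with the seed fixed, the result still changes from run to run...).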

- Gradient Boosting Decision Tree (GBDT) widely used but lacks efficiency in high feature/data dimensions.
- Implementations like XGBoost and pGBRT improved, yet scalability remains an issue due to extensive data scanning for information gain estimation.
- Proposed two techniques: 
- Gradient-based One-Side Sampling (GOSS): Excludes small gradient instances, leading to accurate info gain with lesser data size.
- Exclusive Feature Bundling (EFB): Reduces feature dimensions by bundling mutually exclusive features, akin to solving NP-hard problems using greedy algorithms for approximations.
- New GBDT implementation: LightGBM, which improves training speed by over 20 times while maintaining accuracy in large datasets.
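On the seed remark above: one thing that might reduce run-to-run variation is to turn sampling off by passing `temperature=0.0`, which llama-cpp-python accepts as a completion parameter. This is a sketch, not something tested in this post, and `generate_text_greedy` is a hypothetical variant of `generate_text` that takes the model as an argument. Outputs can still differ between machines because of floating-point differences.

```python
def generate_text_greedy(llm, prompt: str, max_tokens: int = 256) -> str:
    """Like generate_text(), but asks for greedy decoding (temperature 0)."""
    output = llm(
        f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
        max_tokens=max_tokens,
        stop=["<|end|>"],
        echo=False,
        temperature=0.0,  # assumption: no random sampling, pick the top token
    )
    return output["choices"][0]["text"]
```

With the `llm` object from the code section, this would be called as `generate_text_greedy(llm, prompt)`.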

Processing took 35.5 seconds (15.4 tokens/s overall), much faster than my old i7 (6.9 tokens/s).

llama_print_timings:        load time =   14931.46 ms
llama_print_timings:      sample time =      87.62 ms /   178 runs   (    0.49 ms per token,  2031.43 tokens per second)
llama_print_timings: prompt eval time =   14551.40 ms /   369 tokens (   39.43 ms per token,    25.36 tokens per second)
llama_print_timings:        eval time =   20439.29 ms /   177 runs   (  115.48 ms per token,     8.66 tokens per second)
llama_print_timings:       total time =   35559.17 ms /   546 tokens

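(Old or not, I honestly did not expect the i7 to lose this badly to Lambda. A 10th-generation i7 (released in 2020) is apparently fairly low-spec by now. Perhaps Moore's law is alive and well.)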
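To put a number on the gap, the ratios can be worked out from the two summarization logs. This is a small sketch using only the figures printed in this post; the `Run` class and `speedup` function are names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    prompt_tps: float  # prompt eval rate, tokens per second
    eval_tps: float    # generation (eval) rate, tokens per second
    total_ms: float    # total wall time in milliseconds


# Summarization runs, values copied from the llama_print_timings logs.
local_i7 = Run(prompt_tps=12.32, eval_tps=3.58, total_ms=77788.34)
aws_lambda = Run(prompt_tps=25.36, eval_tps=8.66, total_ms=35559.17)


def speedup(slow: Run, fast: Run) -> dict:
    """Return how many times faster `fast` is than `slow`, per stage."""
    return {
        "prompt eval": fast.prompt_tps / slow.prompt_tps,
        "eval": fast.eval_tps / slow.eval_tps,
        "total time": slow.total_ms / fast.total_ms,
    }


ratios = speedup(local_i7, aws_lambda)
for stage, ratio in ratios.items():
    print(f"{stage}: {ratio:.1f}x")
```

For the summarization task this works out to roughly 2.1x on prompt processing, 2.4x on generation, and 2.2x on total time.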

Next, the Japanese translation. The generated result is below. For some reason it started spitting out an arXiv URL partway through.

- グラディエントブースト回帰樹木（GBDT）が綿密に利用されていますが、高次元/大規模データの効率性は不足しています。
- XGBoostやpGBRTといった実装例は改良されていますが、大量なデータを検索して情報適用度を推定することでスケーラビリティの問題が残っています。
- 提案された二つの技術:
- 一方、Gradient-based One-Side Sampling (GOSS)は小さいグライデントの例を除外し、より正確な情報適用度が得られるにつれてデータサイズを減らします。
- もう一方、Exclusive Feature Bundling (EFB)は対立関ited from: https://arxiv.org/abs/1803.09632
- 新たなGBDT実装「LIGHTGBM」として知られるもので、大量データに対するトレーニング速度を大幅に向上させながら精度を保持します。

This one took 50.6 seconds (11 tokens/s).

llama_print_timings:        load time =   14931.46 ms
llama_print_timings:      sample time =     175.03 ms /   368 runs   (    0.48 ms per token,  2102.46 tokens per second)
llama_print_timings: prompt eval time =    7302.03 ms /   188 tokens (   38.84 ms per token,    25.75 tokens per second)
llama_print_timings:        eval time =   42044.19 ms /   367 runs   (  114.56 ms per token,     8.73 tokens per second)
llama_print_timings:       total time =   50591.80 ms /   555 tokens

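It runs on Lambda, but it takes a fair amount of time, so it does not look suited to tasks that need real-time responses. Where real-time response is not needed, for example a batch job that generates text in advance and stores it in a DB, Lambda alone looks perfectly usable.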
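The batch idea could look something like the sketch below. It uses SQLite purely as a stand-in for "a DB"; the `summaries` table, the `summarize_to_db` function, and the `(doc_id, text)` input shape are assumptions made for illustration. The generator is passed in as a function so that `generate_text` from the code section can be plugged in.

```python
import sqlite3
from typing import Callable, Iterable, Tuple


def summarize_to_db(
    conn: sqlite3.Connection,
    documents: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
) -> int:
    """Summarize each (doc_id, text) pair and store the result.

    Documents that already have a stored summary are skipped, so an
    interrupted batch can be re-run. Returns the number of rows written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summaries ("
        "doc_id TEXT PRIMARY KEY, summary TEXT NOT NULL)"
    )
    written = 0
    for doc_id, text in documents:
        exists = conn.execute(
            "SELECT 1 FROM summaries WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if exists:
            continue  # already generated in an earlier run
        prompt = f"Summarize the following sentences with bullet points:\n\n{text}"
        conn.execute(
            "INSERT INTO summaries (doc_id, summary) VALUES (?, ?)",
            (doc_id, generate(prompt)),
        )
        conn.commit()  # commit per document; generation is the slow part anyway
        written += 1
    return written
```

A call would then look like `summarize_to_db(sqlite3.connect("summaries.db"), docs, generate_text)`, where the file name and `docs` are placeholders.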

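Summary

Phi-3 is pretty good at summarizing English
It is weak at Japanese
It runs on Lambda too

It is not suited to tasks that need Japanese, but when English alone is enough it looks like it has its uses. Alternatively, if a version of Phi-3 trained on Japanese datasets appears, there would be more opportunities to use it in Japan. I look forward to seeing what comes next.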
