第 13 回　テキストの分類

第 13 回では、テキストの分類方法、特に教師あり学習を用いた手法について見ていきます。

3.1 ロジスティック回帰

ロジスティック回帰は、基本的には二値分類器として知られるアルゴリズムですが、One-vs-Rest (OvR) や多項式回帰モデル（ソフトマックス関数）を用いることで、多クラス分類問題に拡張することができます。OvR は、各クラスに対して、そのクラス（One）とその他のクラス（Rest）の分類器を全て作成し、最も高い確率を示したクラスを予測結果として返します。この性質から、One-vs-All と呼ばれることもあります。一方で多項式回帰モデルでは、シグモイド関数の代わりにソフトマックス関数を用いることで、全クラスにわたる確率が 1 になるように調整を行います。

3.1.1 ロジスティック回帰によるテキスト分類

ロジスティック回帰に限らず、教師あり学習によるテキスト分類を行う際には、基本的に以下の手順に従います。

文書データと対応するラベル（カテゴリ）を用意
文書データのベクトル化
モデル（分類器）の学習
モデル（分類器）の評価

ここで文書データのベクトル化には、第 11 回で導入した BoW 特徴量 や TF-IDF 特徴量、第 12 回で導入した Word2Vec や Doc2Vec などが用いられます。

3.1.2 ロジスティック回帰の実装（BoW）

テキスト分類のタスクを実装するうえで、コードをシンプルにするため、文書データを読み込んでラベル付けを行う load_data()関数と、形態素解析を実施して名詞のみを抽出する extract_nouns()関数を作成することを考えます。文書データに構造にもよりますが、本講義で用意した仮想ニュースのデータセットであれば、以下のようになります。

データ読込と名詞抽出の関数

import pandas as pd
import MeCab

# データを読み込みラベル付けを行う関数
def load_data(file_path, label):
    with open(file_path, "r", encoding="utf-8") as file:
        data = file.read()
    texts = data.split("\n\n")
    data = pd.DataFrame({"text": texts, "label": label})
    return data

# MeCab の形態素解析を使用して名詞のみを抽出する関数
def extract_nouns(text):
    mecab = MeCab.Tagger()
    node = mecab.parseToNode(text)
    nouns = []
    while node:
      pos = node.feature.split(",")[0] # 品詞を取得
      if pos == "名詞": # 品詞が「名詞」であった場合
        nouns.append(node.surface) #リストに追加
      node = node.next # 次のノードに移動
    return " ".join(nouns)

ロジスティック回帰は、sklearn.linear_model.LogisticRegression クラスによって実装できます。実際に教育ニュース education.txt と科学ニュース science.txt を読み込ませ、ロジスティック回帰によるテキスト分類器を作成し、その評価を行うコードは以下のようになります。

ロジスティック回帰によるテキスト分類

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# データを読み込む
education_data = load_data("education.txt", "education")
science_data = load_data("science.txt", "science")

# データを結合する
data = pd.concat([education_data, science_data]).reset_index(drop=True)

# データをシャッフルする
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# 各文書データから名詞を抽出
nouns_texts = [extract_nouns(text) for text in data["text"]]

# CountVectorizer のインスタンスを作成
vectorizer = CountVectorizer()

# 文書のベクトル化（BoW）
X = vectorizer.fit_transform(nouns_texts)

# ラベルデータ
y = data["label"]

# データを訓練用とテスト用に分割
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 分類器の作成と訓練
model = LogisticRegression()
model.fit(X_train, y_train)

# テストデータを用いた予測
y_pred = model.predict(X_test)

# 結果の表示
print("Accuracy:", accuracy_score(y_test, y_pred))

実行結果

Accuracy: 0.9008264462809917

LogisticRegression クラスのインスタンスを作成し、fit() メソッドを呼び出して学習を、predict() メソッドで予測を行っています。ただし、このモデルでは ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. のような警告が出るかもしれません。これは、モデルの学習が収束する前に最大反復数に達したことを意味する警告であり、LogisticRegression のインスタンス生成時に max_iter パラメータを十分大きな値に設定することで解決できます。その他、各種パラメータの詳細については、適宜公式ドキュメントを参照するようにしてください。

なお、上記のコードではパフォーマンスの測定に、sklearn.metrics モジュールの accuracy_score() 関数を利用しています。この関数は、正しく分類されたテストデータの割合（正解率）を表すものとなっています。その他にも、パフォーマンスの測定によく使われるものとして、classification_report() 関数や、confusion_matrix() 関数があります。それぞれ、以下のように記述することができます。

パフォーマンスの詳細表示

from sklearn.metrics import classification_report, confusion_matrix

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

実行結果

Classification Report:
               precision    recall  f1-score   support

   education       0.86      0.95      0.90        58
     science       0.95      0.86      0.90        63

    accuracy                           0.90       121
   macro avg       0.90      0.90      0.90       121
weighted avg       0.91      0.90      0.90       121

Confusion matrix:
 [[55  3]
 [ 9 54]]

Classification Report では、それぞれの数字について、適合率（Precision）、再現率（Recall）、F1 スコア（F1 Score）、サポート（Support）の結果が表示されます。それぞれ、以下のような意味を持ちます。

適合率（Precision）：正しいと予測したにもののうち、実際に正解であったものの割合。偽陽性（誤検出）の数を少なくしたい場合に注目する。
再現率（Recall）：正解データのうち、正しいと予測できたものの割合。偽陰性（取りこぼし）の数を少なくしたい場合に注目する。
F1 スコア（F1 Score）：適合率と再現率の調和平均。不均衡データに対するパフォーマンス評価をする際や、適合率と再現率が同等に重要であるときに、注目する。
サポート（Support）：正解データに含まれている個数。データセットの解釈や信頼性の確認に利用する。

Confusion matrix（混合行列）は、予測したクラスと実際のクラスの関係を示す行列です。今回の場合、教育ニュース記事に関しては、58 件のうち 55 件を正しく分類し、3 件を誤って分類しています。科学ニュース記事に関しては、63 件のうち 54 件を正しく分類し、9 件を誤って分類しています。

3.1.3 Word2Vec + ロジスティック回帰

Word2Vec で文書をベクトル化し、ロジスティック回帰で分類を行う例を紹介します。まずは同様に、データの読み込みとラベル付を行う load_data()関数と、形態素解析を実施して名詞を抽出する extract_nouns()関数を定義します。ここで、extract_nouns()関数の返り値が、単語を結合したテキストデータではなく、単語のリストになっている点に注意してください。

データ読込と名詞抽出（リスト版）の関数

import pandas as pd
import MeCab

# データを読み込みラベル付けを行う関数
def load_data(file_path, label):
    with open(file_path, "r", encoding="utf-8") as file:
        data = file.read()
    texts = data.split("\n\n")
    data = pd.DataFrame({"text": texts, "label": label})
    return data

# MeCab の形態素解析を使用して名詞のみを抽出する関数
def extract_nouns(text):
    mecab = MeCab.Tagger()
    node = mecab.parseToNode(text)
    nouns = []
    while node:
      pos = node.feature.split(",")[0] # 品詞を取得
      if pos == "名詞": # 品詞が「名詞」であった場合
        nouns.append(node.surface) #リストに追加
      node = node.next # 次のノードに移動
    return nouns # 単語のリストを返す

Word2Vec で文書のベクトル化を行い、ロジスティック回帰で分類するプログラムは、以下のように記述できます。

Word2Vec + ロジスティック回帰

import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# データを読み込む
education_data = load_data("education.txt", "education")
science_data = load_data("science.txt", "science")

# データを結合する
data = pd.concat([education_data, science_data]).reset_index(drop=True)

# 形態素解析を使用して名詞のみを抽出
data["tokens"] = data["text"].apply(lambda x: extract_nouns(x))

# Word2Vecモデルの定義とトレーニング
w2v_model = Word2Vec(sentences=data["tokens"], vector_size=50, window=5, min_count=3, epochs=20)

# 各文書のベクトルを計算する関数
def document_vector(model, doc):
    # 各単語のベクトルを取得し平均化
    doc = [word for word in doc if word in model.wv.index_to_key]
    if len(doc) == 0:
        return np.zeros(model.vector_size)
    return np.mean(model.wv[doc], axis=0)

# 文書ベクトルの生成
data["vector"] = data["tokens"].apply(lambda x: document_vector(w2v_model, x))

# 無効なベクトル（全てゼロベクトル）を除去
valid_vectors = data["vector"].apply(lambda x: not np.all(x == 0))
data = data[valid_vectors]

# 特徴量とラベルを準備
X = np.vstack(data["vector"])
y = data["label"]

# データをトレーニングセットとテストセットに分割
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ロジスティック回帰
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# テストデータで予測
y_pred = clf.predict(X_test)

# 精度を評価
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

実行結果

Accuracy: 0.9166666666666666
Classification Report:
               precision    recall  f1-score   support

   education       0.91      0.94      0.92        62
     science       0.93      0.90      0.91        58

    accuracy                           0.92       120
   macro avg       0.92      0.92      0.92       120
weighted avg       0.92      0.92      0.92       120

Confusion matrix:
 [[58  4]
 [ 6 52]]

3.1.4 Doc2Vec + ロジスティック回帰

Doc2Vec で文書のベクトル化を行い、ロジスティック回帰で分類するプログラムは、以下のようになります。

Doc2Vev + ロジスティック回帰

import pandas as pd
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# データを読み込む
education_data = load_data("education.txt", "education")
science_data = load_data("science.txt", "science")

# データを結合する
data = pd.concat([education_data, science_data]).reset_index(drop=True)

# 形態素解析を使用して名詞のみを抽出
data["tokens"] = data["text"].apply(lambda x: extract_nouns(x))

# タグ付けされたデータを作成
tagged_data = [TaggedDocument(words=row["tokens"], tags=[str(index)]) for index, row in data.iterrows()]

# Doc2Vecモデルの定義とトレーニング
d2v_model = Doc2Vec(tagged_data, vector_size=50, window=5, min_count=3, epochs=20)

# 文書ベクトルの生成
data["vector"] = data["tokens"].apply(lambda x: d2v_model.infer_vector(x))

# 無効なベクトル（全てゼロベクトル）を除去
valid_vectors = data["vector"].apply(lambda x: not np.all(x == 0))
data = data[valid_vectors]

# 特徴量とラベルを準備
X = np.vstack(data["vector"])
y = data["label"]

# データをトレーニングセットとテストセットに分割
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ロジスティック回帰
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# テストデータで予測
y_pred = clf.predict(X_test)

# 精度を評価
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

実行結果

Accuracy: 0.8925619834710744
Classification Report:
               precision    recall  f1-score   support

   education       0.88      0.92      0.90        64
     science       0.91      0.86      0.88        57

    accuracy                           0.89       121
   macro avg       0.89      0.89      0.89       121
weighted avg       0.89      0.89      0.89       121

Confusion matrix:
 [[59  5]
 [ 8 49]]

3.2 ランダムフォレスト

ランダムフォレスト（Random Forest）は、決定木とアンサンブル学習を組み合わせたアルゴリズムであり、多数の決定木を組み合わせて最終的に投票により予測ラベルを決定します（回帰の場合は平均を算出）。

3.2.1 ランダムフォレストの実装

ランダムフォレストは、sklearn.ensemble モジュールの RandomForestClassifier クラスを用いることで、以下のように実装できます。

ランダムフォレストによるテキスト分類

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# データを読み込む
education_data = load_data("education.txt", "education")
science_data = load_data("science.txt", "science")

# データを結合する
data = pd.concat([education_data, science_data]).reset_index(drop=True)

# データをシャッフルする
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# 各文書データから名詞を抽出
nouns_texts = [extract_nouns(text) for text in data["text"]]

# CountVectorizer のインスタンスを作成
vectorizer = CountVectorizer()

# 文書のベクトル化（BoW）
X = vectorizer.fit_transform(nouns_texts)

# ラベルデータ
y = data["label"]

# データを訓練用とテスト用に分割
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 分類器の作成と訓練
model = RandomForestClassifier()
model.fit(X_train, y_train)

# テストデータを用いた予測
y_pred = model.predict(X_test)

# 精度の表示
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

実行結果

Accuracy: 0.8842975206611571
Classification Report:
               precision    recall  f1-score   support

   education       0.83      0.95      0.89        58
     science       0.95      0.83      0.88        63

    accuracy                           0.88       121
   macro avg       0.89      0.89      0.88       121
weighted avg       0.89      0.88      0.88       121

Confusion matrix:
 [[55  3]
 [11 52]]

上記のコードではランダムフォレストのパラメータをすべてデフォルト値で実行していますが、パフォーマンス向上のために検討するべき重要なパラメータが幾つか存在します。例えば、決定木の数を定義する n_estimators、各決定木で分割に使用する特徴量の最大数を定義する max_features、決定木の最大深さを定義する max_depth などです。最適なパラメータについて、検討してみてください。

3.3 SVM

SVM (Support Vectir Machine) は、2 つのクラスを分離する超平面を見つけるアルゴリズムであり、さまざまな分類・回帰の問題に利用されています。超平面は、最も近い学習データの点群（サポートベクター）までの距離（マージン）が最大になるように計算されます。SVM は高次元データやサンプル数が少ないデータに対しても比較的うまく機能することが知られていますが、パラメータ選択が難しいことや、大規模なデータセットでは学習時間が長くなるデメリットがあります。

3.3.1 SVM の実装

SVM は、sklearn.svm モジュールの SVC クラスを用いることで、以下のように実装できます。

SVM によるテキスト分類

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# データを読み込む
education_data = load_data("education.txt", "education")
science_data = load_data("science.txt", "science")

# データを結合する
data = pd.concat([education_data, science_data]).reset_index(drop=True)

# データをシャッフルする
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# 各文書データから名詞を抽出
nouns_texts = [extract_nouns(text) for text in data["text"]]

# CountVectorizer のインスタンスを作成
vectorizer = CountVectorizer()

# 文書のベクトル化（BoW）
X = vectorizer.fit_transform(nouns_texts)

# ラベルデータ
y = data["label"]

# データを訓練用とテスト用に分割
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 分類器の作成と訓練
model = SVC(gamma=0.01, C=100.0)
model.fit(X_train, y_train)

# テストデータを用いた予測
y_pred = model.predict(X_test)

# 精度の表示
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

実行結果

Accuracy: 0.9256198347107438
Classification Report:
               precision    recall  f1-score   support

   education       0.90      0.95      0.92        58
     science       0.95      0.90      0.93        63

    accuracy                           0.93       121
   macro avg       0.93      0.93      0.93       121
weighted avg       0.93      0.93      0.93       121

Confusion matrix:
 [[55  3]
 [ 6 57]]

SVM のパフォーマンスは、幾つかの重要なパラメータに大きく依存します。以下に、代表的なパラメータを示します。

C（コストパラメータ）：誤分類のペナルティを表すパラメータであり、小さいほど多くの誤分類が許容され、滑らかな決定境界が得られます。逆に大きいほど正確な分類を実現しようとしますが、過学習の可能性が高まります。デフォルトでは 1.0 となります。
kernel：非線形の決定境界を実現するための、poly、rbf、sigmoid などのさまざまなカーネルを指定できます。デフォルトでは RBF カーネルが利用されます。
gamma：ガンマは、非線形カーネル関数におけるハイパーパラメータであり、サポートベクターの影響の範囲を表します。

3.4 時間の計測

パフォーマンスを測定する上で、時間の計測方法を確認しておきましょう。以下のように time モジュールを用いることで、タイムスタンプを取得し、処理に要した時間を計測することができます。

時間の計測

import time

start_time = time.time()  # 開始時刻

## 何らかの処理

elapsed_time = time.time() - start_time  # 経過時間
print(f"Elapsed time: {elapsed_time:.4f} seconds")

課題 13

仮想ニュースデータ（または各自で用意したコーパス）を対象に、カテゴリー分類タスクを実装しなさい。前処理（ストップワードや品詞の抽出）、ベクトル化、分類アルゴリズム、パラメータについて検討し、精度向上を試みること。 ※ 検討した事項は .ipynb ファイル内にテキストで記載

提出ファイル名：学籍番号_13.ipynb
提出締切：2024年7月19日(金) 23:59

第 13 回 テキストの分類

3.1 ロジスティック回帰

3.1.1 ロジスティック回帰によるテキスト分類

3.1.2 ロジスティック回帰の実装（BoW）

3.1.3 Word2Vec + ロジスティック回帰

3.1.4 Doc2Vec + ロジスティック回帰

3.2 ランダムフォレスト

3.2.1 ランダムフォレストの実装

3.3 SVM

3.3.1 SVM の実装

3.4 時間の計測

課題 13

第 13 回　テキストの分類