Encoding text in an ML classifier

2023-12-02

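I am trying to build a machine learning model, but I am struggling to understand where to apply the encoding. The steps and functions below reproduce the process I have been following.

First, I split the dataset into training and test sets: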

# Import the resampling package
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
import string
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.utils import resample
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
# Split into training and test sets

# Testing Count Vectorizer

X = df[['Text']] 
y = df['Label']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1)

Now I apply (under)sampling:

# Separating classes
spam = training_set[training_set.Label == 1]
not_spam = training_set[training_set.Label == 0]

# Undersampling the majority
undersample = resample(not_spam, 
                       replace=True, 
                       n_samples=len(spam), #set the number of samples to equal the number of the minority class
                       random_state=40)
# Returning to new training set
undersample_train = pd.concat([spam, undersample])

I apply the chosen algorithm:

full_result = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])

X, y = BOW(undersample_train)
full_result = full_result.append(training_naive(X_train, X_test, y_train, y_test, 'Count Vectorize'), ignore_index = True)

where BOW is defined as follows:

def BOW(data):
    
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)

    count_vectorizer = CountVectorizer(analyzer=fun)
    count_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    
    X = count_vectorizer.transform(list_corpus)
    
    return X, list_labels

basic_preprocessing is defined as follows:

def basic_preprocessing(df):
    
    df_temp = df.copy(deep = True)
    df_temp = df_temp.rename(index = str, columns = {'Clean_Titles_2': 'Text'})
    df_temp.loc[:, 'Text'] = [text_prepare(x) for x in df_temp['Text'].values]
    
    #le = LabelEncoder()
    #le.fit(df_temp['medical_specialty'])
    #df_temp.loc[:, 'class_label'] = le.transform(df_temp['medical_specialty'])
    
    tokenizer = RegexpTokenizer(r'\w+')
    df_temp["Tokens"] = df_temp["Text"].apply(tokenizer.tokenize)
    
    return df_temp

where text_prepare is:

def text_prepare(text):

    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
    STOPWORDS = set(stopwords.words('english'))
    
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    words = text.split()
    i = 0
    while i < len(words):
        if words[i] in STOPWORDS:
            words.pop(i)
        else:
            i += 1
    text = ' '.join(map(str, words))# delete stopwords from text
    
    return text

and

def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    
    clf = MultinomialNB() # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)

    res = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
    
    y_pred = clf.predict(X_test_naive)
    
    # scikit-learn metrics take (y_true, y_pred) in that order
    f1 = f1_score(y_test_naive, y_pred, average = 'weighted')
    pres = precision_score(y_test_naive, y_pred, average = 'weighted')
    rec = recall_score(y_test_naive, y_pred, average = 'weighted')
    acc = accuracy_score(y_test_naive, y_pred)
    
    res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres, 
                     'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index = True)

    return res

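As you can see, the order is:

define text_prepare for text cleaning;
define basic_preprocessing;
define BOW;
split the dataset into training and test sets;
apply the sampling;
apply the algorithm.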

What I do not understand is how to encode the text correctly so that the algorithm works. My dataset is called df and its columns are:

Label      Text                                 Year
1         bla bla bla                           2000
0         add some words                        2012
1         this is just an example               1998
0         unfortunately the code does not work  2018
0         where should I apply the encoding?    2000
0         What am I missing here?               2005

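The order in which I apply BOW must be wrong, because I get this error: ValueError: could not convert string to float: 'Expect a good results if ... '

I followed the steps (and code) from this link: kaggle.com/ruzarx/oversampling-smote-and-adasyn. However, the sampling part there is wrong, because it should be done only on the training set, so after the split. The principle should be: (1) split into training/test; (2) resample the training set so that the model is trained on balanced data; (3) apply the model to the test set and evaluate it.

I am happy to provide more information, data and/or code, but I think I have included all the most relevant steps.

Many thanks.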

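You need a test BOW function that reuses the count vectorizer fitted during the training phase. In the question, the result of BOW is assigned to X and y, but training_naive is still called with the raw X_train and X_test, so the classifier receives strings instead of count vectors.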
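As a minimal sketch of that idea (the toy sentences and variable names below are assumptions for illustration, not the asker's data), the vectorizer learns its vocabulary from the training texts only, and the same fitted object is then used to transform the test texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts, used only to illustrate the fit/transform split
train_texts = ["bla bla bla", "add some words", "this is just an example"]
test_texts = ["some new words", "bla example"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training texts and encode them
X_train_counts = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary; words unseen during training are ignored
X_test_counts = vectorizer.transform(test_texts)

# Both matrices share the same columns, so a classifier fitted on one
# can predict on the other
print(X_train_counts.shape, X_test_counts.shape)
```

Calling fit again on the test texts would build a different vocabulary, and the columns of the two matrices would no longer line up.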

Consider using a pipeline to reduce the verbosity of the code.

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
import string
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.utils import resample
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def fun(text):
    # Analyzer for CountVectorizer: strip punctuation, then drop English stopwords
    stop_words = set(stopwords.words('english'))
    remove_punc = ''.join(c for c in text if c not in string.punctuation)
    cleaned = [w for w in remove_punc.split() if w.lower() not in stop_words]
    return cleaned
# Testing Count Vectorizer

def BOW(data):

    df_temp = data.copy(deep=True)
    df_temp = basic_preprocessing(df_temp)

    count_vectorizer = CountVectorizer(analyzer=fun)
    count_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = count_vectorizer.transform(list_corpus)

    return X, list_labels, count_vectorizer

def test_BOW(data, count_vectorizer):

    df_temp = data.copy(deep=True)
    df_temp = basic_preprocessing(df_temp)

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = count_vectorizer.transform(list_corpus)

    return X, list_labels

def basic_preprocessing(df):

    df_temp = df.copy(deep=True)
    df_temp = df_temp.rename(index=str, columns={'Clean_Titles_2': 'Text'})
    df_temp.loc[:, 'Text'] = [text_prepare(x) for x in df_temp['Text'].values]


    tokenizer = RegexpTokenizer(r'\w+')
    df_temp["Tokens"] = df_temp["Text"].apply(tokenizer.tokenize)

    return df_temp


def text_prepare(text):

    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
    STOPWORDS = set(stopwords.words('english'))

    text = text.lower()
    # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # delete symbols which are in BAD_SYMBOLS_RE from text
    text = BAD_SYMBOLS_RE.sub('', text)
    # delete stopwords from text
    words = [w for w in text.split() if w not in STOPWORDS]
    text = ' '.join(words)

    return text

s = """Label      Text                                 Year
1         bla bla bla                           2000
0         add some words                        2012
1         this is just an example               1998
0         unfortunately the code does not work  2018
0         where should I apply the encoding?    2000
0         What am I missing here?               2005"""


# A regex separator needs a raw string and the python parsing engine
df = pd.read_csv(StringIO(s), sep=r'\s{2,}', engine='python')


X = df[['Text']]
y = df['Label']


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1)
# Separating classes
spam = training_set[training_set.Label == 1]
not_spam = training_set[training_set.Label == 0]

# Undersampling the majority
undersample = resample(not_spam,
                       replace=True,
                       # set the number of samples to equal the number of the minority class
                       n_samples=len(spam),
                       random_state=40)
# Returning to new training set
undersample_train = pd.concat([spam, undersample])

full_result = pd.DataFrame(columns=['Preprocessing', 'Model', 'Precision',
                                    'Recall', 'F1-score', 'Accuracy'])
train_x, train_y, count_vectorizer  = BOW(undersample_train)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_BOW(testing_set, count_vectorizer)



def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):

    clf = MultinomialNB()  # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)

    y_pred = clf.predict(X_test_naive)

    # scikit-learn metrics take (y_true, y_pred) in that order
    f1 = f1_score(y_test_naive, y_pred, average='weighted')
    pres = precision_score(y_test_naive, y_pred, average='weighted')
    rec = recall_score(y_test_naive, y_pred, average='weighted')
    acc = accuracy_score(y_test_naive, y_pred)

    # Build the one-row result directly; DataFrame.append was removed in pandas 2.0
    res = pd.DataFrame([{'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres,
                         'Recall': rec, 'F1-score': f1, 'Accuracy': acc}])

    return res

full_result = pd.concat(
    [full_result, training_naive(train_x, test_x, train_y, test_y, 'Count Vectorize')],
    ignore_index=True)
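
The pipeline suggestion made earlier in this answer could look roughly like the sketch below. It assumes scikit-learn's Pipeline with the default CountVectorizer settings rather than the custom analyzer used above, and the toy texts and labels are made up for illustration. The pipeline fits the vectorizer on the training texts and reuses it automatically when predicting, so no separate test BOW function is needed:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = spam, 0 = not spam), for illustration only
train_texts = ["win a free prize now", "free money offer",
               "meeting at noon tomorrow", "see you at lunch"]
train_labels = [1, 1, 0, 0]
test_texts = ["free prize offer", "lunch meeting tomorrow"]
test_labels = [1, 0]

# The vectorizer and the classifier are chained into a single estimator
pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# fit() learns the vocabulary and the classifier from the training texts only
pipe.fit(train_texts, train_labels)

# predict() transforms the raw test texts with the fitted vocabulary first
y_pred = pipe.predict(test_texts)
print(accuracy_score(test_labels, y_pred))
```

With the data from the question, the resampling would still be done on the training set before calling fit, and the raw Text column of the undersampled training frame would be passed in place of train_texts.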
