SGDClassifier gives a different accuracy on every run for text classification

2024-04-21

I am using an SVM classifier to classify text as either good text or gibberish. I am using Python's scikit-learn and doing it as follows:

'''
Created on May 5, 2017
'''

import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

# Prepare data

def prepare_data(data):
    """
    data is expected to be a list of tuples of category and text.
    Returns a tuple of a list of labels and a list of texts.
    """
    random.shuffle(data)
    return zip(*data)

# Format training data

training_data = [
    ("good", "rain a lot the packs maybe damage."),
    ("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner.  he needs a call back from customer support "),
    ("gibber", "wh. screen"),
    ("gibber", "How will I know if I"),
    ("good", "I have problems scheduling blocks they are never any available.  Can I do full time?  Can I get scheduled more than one day a month?"),
    ("good", "Suggestion: easier way to sign in due alleviate the tediousness of periodically having to sign back in to the app to check for blocks."),
    ("good", "I am so glad to hear from you. "),
    ("good", "loading on today's itinerary takes ages!!!!!! time consuming when you have 150+ packages to deliver!!!!!"),
    ("good", "due to the new update that makes hours available at 10 pm. if you worked 8 hours that day you can't see next day hours due to 8 hour limit. please fix this"),
    ("good", "omg, PLEASE make it so we don't have to sign in every time we need to go into the app. At least make it good for a week. Thanks."),
    ("good", "Constantly being logged out of app, if we could have a continuous login so we could receive notifications if blocks are available that would be ideal."),
    ("good", "I am having problems  with the App. Every time I exit the App and reopen it asks for my login info."),
    ("good", "15 minute service time due to 33rd floor and  20l lbs of cargo"),
    ("good", "I have been sceduled 1 block in 3 weeks. I check for new block availability multiple times a day and have not seen 1 available in three weeks. is there any way to get more blocks."),
    ("good", "When will delivery jobs be available? Everytime I open this app, it says nothing is available. Have deliveries in Cincinnati started yet?"),
    ("good", "During delivery had to call customer support and after 10 minutes support person couldn't find my pick up location Kirkland /Bellevue and told me to hang up and call different support team.  Support person were unprofessional and rude, which is not acceptable."),
    ("good", "can you please remove the pick up from my phone"),
    ("good", "Dear friends: I'm very very happy it's a big oportunitt"),
    ("good", "THANK YOU so much for the block you assigned me for next week.   If you have an additional 5 blocks please go ahead and assign them to me for next week.  My availability is updated and current.  You guys are awesome!!!"),
    ("good", "after update every time I open app I have too log in! I used to be able to stay logged in unless I logged out, can you return stay logged in option."),
    ("good", "It looks like my app is not installed properly on my android phone, Note 5. I cannot access or do not see the tab to swipe to start delivering and the map or help button that should be visible for me to work today 5/6 at rpm"),
    ("gibber", "AF0000"),
    ("good", "awesome app, awesome hiring process, awesome delivery warehouse , awesome team and help in the field! lets deliver I would like more more more delivers , looking forward to the future ! I just bought a new delivery vehical !"),
    ("good", "I will like to ask why I can't get more delivery's only one in two weeks"),
    ("good", "device too slow software crashing all day"),
    ("good", "it doesn't work sometimes."),
    ("good", "can you please remove the old sprouts pick up from my phone"),
    ("good", "They ability to zoom in on text screens would be very helpful. Am example would be customer notes when viewing in certain lighting conditions can be difficult."),
    ("good", "I missed out on a delivery day when I clicked check in and waited for my turn to get an order only to find out that not only did my check in not register but the gps showed me down the street. I encountered this issue again when one of the warehouse employees placed an order for that location and the app wanted me to drive in a big circle to get back to where I was standing."),
    ("good", "i am a little concerned that i didn't receive any blocks of time for this coming week, even though i had a perfect delivery score from this past pay period. Did the Cincinnati market over hire drivers where there are many people being shut completely out of any delivery blocks for an entire week? i really enjoy this type of work and the app makes it quite convenient."),
    ("good", "I've arrived at the pick up restaurant but the staff did not have the barecode for me to scan, however I pick up the package and deliver but my is still not let me move on"),
    ("good", "might want to check my assigned hours for next week.  5am to 1pm??"),
    ("good", "hi team--just want to give some positive feedback.  I have had nothing but positive feedback from customers. Great support when calling help line. Thank you for this opportunity and if there is ever a situation where you need drivers immediately I will drop what I'm doing and help. You guys are the best."),
    ("good", "Allow days or blocks throughout the day to be modified after General availability is set up for time off like doctors appointments."),
    ("gibber", "AL001234"),
    ("good", "Please, enlight me."),
    ("good", "it only shows my schedule starting in two weeks. when will we be able to start work"),
    ("good", "include more packages for one block, if the packages can be fitted into the car, so driver don't have to come back and pickup every two hours. 25% of the time is wasted coming back for pick up."),
    ("gibber", "BBB h"),
    ("gibber", "AG0003006033SDgCJ12344"),
    ("gibber", "How will I know if I"),
    ("good", "please bring back some sort of hours cap! or possibly stagger the hour drops from 1200 to 1203 so that people with slower internet/slower phone arent at a disadvantage!"),
    ("good", "when the hours released tonight all of the people who didn't have 40 hours could see them.    however the drivers that are capped at 40 were unable to see them due to a flawed system.  please fix the system so that we are not continually treated unfairly like all of the drivers that whined so much and got us in to this mess.  the cap system is unfair to people that want to work and it caused problems with a lack of drivers  to deliver today at the hub.  obviously this is not a good system and benefits no one."),
    ("good", "You have seriously messed up the whole scheduling process. Why can't I get any blocks at 10 even if I wait exactly until 10? Midnight was much better. So now that scheduling is a huge random pain in the ass, why would people want to keep doing this? I haven't been able to schedule work for three days now, it's quite frustrating when I don't get a chance to sign up, even when I'm diligent with timing."),
    ("good", "Seriously, that's all I'm going to get is one lousy day? Tell me again why you need drivers if all we get is one day. I'm not sure this is gonna work out for me. I waited forever to get my background check back and this is what I get? smh"),
    ("good", "doesn't save updated access codes"),
    ("good", "the scheduling of my route is nor done very accurately. it keeps me driving back and forth"),
    ("good", "can't understand how to pick up a block. my availability is wide open. when you guys send the alerts about blocks available I open it real quick and there is nothing there. I do it in a matter of seconds"),
    ("good", "My availability keeps disappearing from my calender.   I set my availability for three weeks in advance. The gray dots are visible  but disappear on Wednesday or Thursday.  This makes it impossible for me to see and choose available blocks for the upcoming week. How can I get it fix.   Mike"),
    ("good", "GPS blank screen"),
    ("gibber", "sea swq"),
    ("gibber", "hiw o"),
    ("gibber", "Dr a"),
    ("gibber", "quick to quick to u uhu wu just us"),
    ("gibber", "Awa what's"),
    ("gibber", "wxdfcs"),
    ("gibber", "7k9opu"),
    ("gibber", "o.m.day day"),
    ("gibber", "GGT part his h"),
    ("gibber", "aawfhg"),
    ("gibber", "seesaw 2s"),
    ("gibber", "wawaa"),
    ("gibber", "of ll"),
    ("gibber", "rewards"),
    ("gibber", "mmqqm5my"),
    ("gibber", ".in w"),
    ("gibber", "play r"),
    ("gibber", "was wwnw www www n"),
    ("gibber", "wqq2fwqq2fz22"),
    ("gibber", "not"),
    ("gibber", "I by yu I"),
    ("gibber", "Hi just wanted to let you know that it's bee"),
    ("gibber", "I erroneously v"),
    ("gibber", "I find it"),
    ("gibber", "bqyyx I a"),
    ("gibber", "are are"),
    ("gibber", "wawi waarnnnkwn"),
    ("gibber", "t Petey ueteu he"),
    ("gibber", "ews ri"),
    ("gibber", "bd xd"),
    ("gibber", "hatpa"),
    ("gibber", "se wests tasgt"),
    ("gibber", "wa vgcx azc Jo of"),
    ("gibber", "2w222"),
    ("gibber", "her u t b"),
    ("gibber", "ddddedc"),
    ("gibber", "just juju in hiking"),
    ("gibber", "wew2ww2wwwew2i2wkkk"),
    ("gibber", "meleeee"),
    ("gibber", "Aaq wqXD"),


]
training_labels, training_texts = prepare_data(training_data)


# Format test data

test_data = [

    ("gibber", "an quality"),
    ("good", "Can't check in.   Time was 4:06.  I didn't drive out here for no reason."),
    ("good", "can you do view all full address including postal code how it's in old app that helps do correctly delivery and not waist customer time"),
    ("good", "i am available again starting at 10am to 10pm. thanks"),
    ("gibber", "Hello, I encountered"),
    ("good", "I want to know how we are notified if there is a block I have been signed in and haven't been given a block yet"),
    ("gibber", "aawaaw"),
    ("gibber", "eeeeeeeeene"),
    ("good", "I am not getting enough shifts"),
    ("gibber", "hey e75k"),
    ("good", "my screen had went black or inverted"),
    ("good", "maps packed up again in sr20ls"),
    ("good", "how to clear my itinerary from old pickup address ?"),
    ("good", "keep signing me out."),
    ("good", "For alcohol delivery,  where does customer sign?"),
    ("gibber", "t Petey ueteu he"),
    ("good", "can't get blocks.  too many drivers ??"),
    ("good", "got a new phone how do i download to new phone")



]
test_labels, test_texts = prepare_data(test_data)


# Create feature vectors

"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels


# Train the classifier


clf = SGDClassifier()
clf.fit(X, y)


# Test performance

X_test = vectorizer.transform(test_texts)
y_test = test_labels

# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)

# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
print(annotated_test_data)

# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))

However, I get a different accuracy every time I run it. Why does this happen?

Update: I moved training_data into a text file and read it in the code above like this:

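from ast import literal_eval

# Each line of file.txt is assumed to hold one tuple literal followed by a comma
training_data = []
with open("file.txt") as f:
    for line in f:
        # Drop the newline and the trailing comma before parsing the tuple
        result = line.rstrip('\n').rstrip(',')
        training_data.append(literal_eval(result))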

training_labels, training_texts = prepare_data(training_data)

I also changed this in the code above:

clf = SGDClassifier(random_state=5000)

So random_state is no longer None. However, I still get a different accuracy every time!


This is because your prepare_data() function shuffles the data randomly. This is the line that does it:

random.shuffle(data)

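The order of the training samples therefore changes from run to run, which affects how the estimator is trained and, in turn, the results.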
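To see the effect in isolation, here is a minimal sketch that fits the same SGDClassifier configuration twice on rows in the same order, and once on the same rows in reverse order. The handful of rows is copied from the question; the helper name fit_coefs is made up for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# A small subset of the question's training rows, used only for illustration.
data = [
    ("good", "device too slow software crashing all day"),
    ("good", "it doesn't work sometimes."),
    ("good", "doesn't save updated access codes"),
    ("good", "GPS blank screen"),
    ("gibber", "wxdfcs"),
    ("gibber", "7k9opu"),
    ("gibber", "aawfhg"),
    ("gibber", "2w222"),
]


def fit_coefs(rows, seed=5000):
    """Hypothetical helper: fit on rows in the given order and return the weights."""
    labels, texts = zip(*rows)
    # CountVectorizer sorts its vocabulary, so the columns line up across orders.
    X = CountVectorizer().fit_transform(texts)
    clf = SGDClassifier(random_state=seed)
    clf.fit(X, labels)
    return clf.coef_.copy()


same_a = fit_coefs(data)
same_b = fit_coefs(data)
reordered = fit_coefs(list(reversed(data)))

print("same order, same seed, identical weights:", np.array_equal(same_a, same_b))
print("reversed order, same seed, identical weights:", np.array_equal(same_a, reordered))
```

With a fixed random_state the first comparison is deterministic. The second is the interesting one: the seed fixes the classifier's internal shuffling, but that shuffling is applied to whatever order the rows arrive in, so a different input order can lead to different weights.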

Try commenting out or removing that line, while keeping random_state set on SGDClassifier. You will then get exactly the same results on every run.
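If shuffling is still wanted, one option is to make it repeatable. The sketch below is a variant of prepare_data() with an optional seed argument; that argument is an assumption added here and is not part of the original code.

```python
import random


def prepare_data(data, seed=None):
    """
    data is expected to be a list of (category, text) tuples.
    Returns a tuple of labels and a tuple of texts.

    seed is an assumption added for this sketch: None keeps the input
    order, an integer shuffles a copy in a reproducible way.
    """
    rows = list(data)  # work on a copy so the caller's list is not reordered
    if seed is not None:
        random.Random(seed).shuffle(rows)
    labels, texts = zip(*rows)
    return labels, texts


sample = [
    ("good", "it doesn't work sometimes."),
    ("gibber", "wxdfcs"),
    ("gibber", "hiw o"),
]
print(prepare_data(sample))
print(prepare_data(sample, seed=42) == prepare_data(sample, seed=42))
```

Using a private random.Random instance rather than the module-level random.shuffle keeps the order independent of any other code that draws from the global generator.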

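Suggestion: try different estimators and see which one performs best. If you are keen on using SGDClassifier, I suggest you look into and understand the n_iter parameter (called max_iter in later scikit-learn releases). Try increasing it to larger values and you will see the differences in accuracy become smaller and smaller, even when the data is shuffled.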
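Comparing estimators could look roughly like the sketch below. It is only an illustration: the choice of candidate estimators, the four-fold cross-validation, and the 16 rows (copied from the question) are all assumptions, and the scores on such a small sample say little about the full data set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Short rows copied from the question; swap in the full training_data.
texts = [
    "device too slow software crashing all day",
    "it doesn't work sometimes.",
    "doesn't save updated access codes",
    "GPS blank screen",
    "can you please remove the pick up from my phone",
    "I am so glad to hear from you. ",
    "Please, enlight me.",
    "keep signing me out.",
    "wxdfcs", "7k9opu", "aawfhg", "wawaa",
    "hatpa", "ddddedc", "meleeee", "2w222",
]
labels = ["good"] * 8 + ["gibber"] * 8

# Candidate estimators; this particular selection is an assumption.
candidates = {
    "SGDClassifier": SGDClassifier(random_state=0),
    "LinearSVC": LinearSVC(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
}

# Fixing random_state on the splitter keeps the folds identical across runs.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

results = {}
for name, estimator in candidates.items():
    # The pipeline refits the vectorizer inside each fold.
    model = make_pipeline(CountVectorizer(), estimator)
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print("%-20s mean accuracy %.3f" % (name, results[name]))
```

Averaging over folds gives a steadier number than a single train/test split, which matters when the goal is to compare models rather than to chase run-to-run noise.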

You can look at this answer for more details:

  • https://datascience.stackexchange.com/a/9794
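The claim about the iteration count can be checked with a sketch like the following, which trains on several differently shuffled copies of the data and reports the mean and spread of the test accuracy for a few epoch counts. It assumes a scikit-learn release where the parameter is named max_iter; the helper accuracy_spread, the chosen values, and the small data subset (copied from the question) are illustrative only.

```python
import random
import statistics

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Small subsets of the question's data; swap in the full lists.
train = [
    ("good", "device too slow software crashing all day"),
    ("good", "it doesn't work sometimes."),
    ("good", "doesn't save updated access codes"),
    ("good", "GPS blank screen"),
    ("good", "can you please remove the pick up from my phone"),
    ("good", "I am so glad to hear from you. "),
    ("gibber", "wxdfcs"),
    ("gibber", "7k9opu"),
    ("gibber", "aawfhg"),
    ("gibber", "wawaa"),
    ("gibber", "hatpa"),
    ("gibber", "2w222"),
]
test = [
    ("good", "I am not getting enough shifts"),
    ("good", "keep signing me out."),
    ("gibber", "aawaaw"),
    ("gibber", "eeeeeeeeene"),
]


def accuracy_spread(max_iter, n_runs=10):
    """Hypothetical helper: (mean, std) of test accuracy over n_runs shuffles."""
    test_labels, test_texts = zip(*test)
    scores = []
    for run in range(n_runs):
        rows = list(train)
        random.Random(run).shuffle(rows)  # a different, reproducible order per run
        labels, texts = zip(*rows)
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(texts)
        # tol=None disables early stopping, so max_iter epochs are always run.
        clf = SGDClassifier(max_iter=max_iter, tol=None, random_state=5000)
        clf.fit(X, labels)
        predictions = clf.predict(vectorizer.transform(test_texts))
        scores.append(accuracy_score(test_labels, predictions))
    return statistics.mean(scores), statistics.pstdev(scores)


for max_iter in (1, 5, 50, 500):
    mean, spread = accuracy_spread(max_iter)
    print("max_iter=%4d  mean accuracy=%.3f  spread=%.3f" % (max_iter, mean, spread))
```

If the answer's suggestion holds for your data, the spread column should shrink as max_iter grows; on a sample this small the numbers are only indicative.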