How Much Training Data Do You Need for Machine Learning?

2023-11-13

Why are you asking this question?

First, let's be clear about why you are asking how large a training dataset you need.

You may be in one of the following situations:

  • You have too much data. Consider building learning curves to estimate the size of a representative sample, or use a big-data framework so that you can use all of the available data.
  • You have too little data. First confirm that you really do have too little data. Then consider collecting more data, or using data augmentation methods to artificially enlarge your sample.
  • You have not started collecting data yet? Start collecting it and evaluate whether it is enough. If you are doing a study, or if data collection is expensive, talk to a domain expert or a statistician.

In my own work, I often use learning curves, apply resampling methods such as k-fold cross-validation and the bootstrap on small datasets, and add confidence intervals to my final results.
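
As a minimal sketch of the bootstrap part of this workflow: given a handful of cross-validation scores from a small dataset (the scores below are made up for illustration), you can resample them with replacement to put a confidence interval around the mean skill.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for the mean score:
    resample the observed scores with replacement many times and take
    the alpha/2 and 1 - alpha/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean([rng.choice(scores) for _ in scores])
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical accuracy scores from 10-fold cross-validation on a small dataset.
cv_scores = [0.78, 0.82, 0.75, 0.80, 0.79, 0.83, 0.77, 0.81, 0.76, 0.80]
low, high = bootstrap_ci(cv_scores)
print(f"mean accuracy {statistics.mean(cv_scores):.3f}, "
      f"95% CI [{low:.3f}, {high:.3f}]")
```

Reporting the interval rather than the bare mean makes it obvious when a small dataset leaves the estimate too uncertain to act on.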

So, to the question at hand: how much training data do you actually need?

1. It Depends; No One Can Tell You

No one can tell you how much data you need for your predictive modeling problem without knowing the specifics of your project. It is an intractable question that you often need to answer through empirical investigation.

The amount of data you need for machine learning depends on many factors, such as:

  • The complexity of the problem, nominally the unknown underlying function that best relates your input variables to the output variable.
  • The complexity of the learning algorithm, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.

2. Reason by Analogy from Others' Experience

Many people worked on machine learning problems before you, and some of them published papers on their work. Perhaps you can look at studies of problems similar to yours as an estimate of how much data may be required.

You can also look at studies of the effect of dataset size on algorithm performance. You can search for papers on Google, Google Scholar, and Arxiv.

3. Use Domain Expertise

You need a sample of data from your problem that is representative of the problem you are trying to solve.

In general, the examples must be independent and identically distributed.

Remember, in machine learning we are learning a function to map input data to output data. The mapping function learned will only be as good as the data you provide it from which to learn.

This means that there needs to be enough data to reasonably capture the relationships that may exist both between input features and between input features and output features.

Use your domain knowledge, or find a domain expert and reason about the domain and the scale of data that may be required to reasonably capture the useful complexity in the problem.

4. Use a Statistical Heuristic

There are statistical heuristic methods available that allow you to calculate a suitable sample size.

Most of the heuristics I have seen have been for classification problems as a function of the number of classes, input features or model parameters. Some heuristics seem rigorous, others seem completely ad hoc.

Here are some examples you may consider:

  • Factor of the number of classes: There must be x independent examples for each class, where x could be tens, hundreds, or thousands (e.g. 5, 50, 500, 5000).
  • Factor of the number of input features: There must be x% more examples than there are input features, where x could be tens (e.g. 10).
  • Factor of the number of model parameters: There must be x independent examples for each parameter in the model, where x could be tens (e.g. 10).

They all look like ad hoc scaling factors to me.
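
These heuristics are easy to turn into a quick back-of-the-envelope calculator. The sketch below is only that: the default factors (500 examples per class, 10% more examples than features, 10 examples per parameter) are illustrative picks from the ranges above, not established rules, and the problem sizes are hypothetical.

```python
def heuristic_sample_sizes(n_classes, n_features, n_params,
                           per_class=500, feature_pct=10, per_param=10):
    """Apply the three ad hoc heuristics above and return each estimate."""
    return {
        # x independent examples per class
        "by_classes": n_classes * per_class,
        # x% more examples than input features (integer arithmetic)
        "by_features": n_features * (100 + feature_pct) // 100,
        # x independent examples per model parameter
        "by_params": n_params * per_param,
    }

# Hypothetical problem: 3 classes, 20 input features, a model with 1,000 parameters.
estimates = heuristic_sample_sizes(n_classes=3, n_features=20, n_params=1000)
print(estimates)                # {'by_classes': 1500, 'by_features': 22, 'by_params': 10000}
print(max(estimates.values()))  # 10000, a cautious choice is to take the largest
```

Note how the three estimates disagree by orders of magnitude, which is exactly why they read as ad hoc scaling factors.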

Have you used any of these heuristics?
How did it go? Let me know in the comments.

In theoretical work on this topic (not my area of expertise!), a classifier (e.g. k-nearest neighbors) is often contrasted against the optimal Bayesian decision rule, and the difficulty is characterized in the context of the curse of dimensionality; that is, there is an exponential increase in the difficulty of the problem as the number of input features is increased.

For example:

Findings suggest avoiding local methods (like k-nearest neighbors) for sparse samples from high dimensional problems (e.g. few samples and many input features).

For a kinder discussion of this topic, see the further reading section below.

5. Nonlinear Algorithms Generally Need More Data

The more powerful machine learning algorithms are often referred to as nonlinear algorithms.

By definition, they are able to learn complex nonlinear relationships between input and output features. You may very well be using these types of algorithms or intend to use them.

These algorithms are often more flexible and even nonparametric (they can figure out how many parameters are required to model your problem in addition to the values of those parameters). They are also high-variance, meaning predictions vary based on the specific data used to train them. This added flexibility and power comes at the cost of requiring more training data, often a lot more data.

In fact, some nonlinear algorithms like deep learning methods can continue to improve in skill as you give them more data.

If a linear algorithm achieves good performance with hundreds of examples per class, you may need thousands of examples per class for a nonlinear algorithm, like random forest, or an artificial neural network.

6. Evaluate Dataset Size vs Model Skill

It is common when developing a new machine learning algorithm to demonstrate and even explain the performance of the algorithm in response to the amount of data or problem complexity.

These studies may or may not be performed and published by the author of the algorithm, and may or may not exist for the algorithms or problem types that you are working with.

I would suggest performing your own study with your available data and a single well-performing algorithm, such as random forest.

Design a study that evaluates model skill versus the size of the training dataset.

Plotting the result as a line plot with training dataset size on the x-axis and model skill on the y-axis will give you an idea of how the size of the data affects the skill of the model on your specific problem.

This graph is called a learning curve.

From this graph, you may be able to project the amount of data that is required to develop a skillful model, or perhaps how little data you actually need before hitting an inflection point of diminishing returns.

I highly recommend this approach in general in order to develop robust models in the context of a well-rounded understanding of the problem.
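
Such a study can be sketched in a few lines. Everything below is a stand-in: a synthetic one-feature, two-class dataset and a tiny nearest-centroid classifier take the place of your data and your chosen algorithm (such as random forest), but the loop structure, training on increasing amounts of data and scoring on a fixed held-out set, is the learning-curve study itself.

```python
import random

def make_data(n, seed):
    """Synthetic two-class problem: one feature, class 0 centered at 0.0,
    class 1 centered at 1.0, with overlapping Gaussian noise."""
    rng = random.Random(seed)
    return [(rng.gauss(float(i % 2), 0.7), i % 2) for i in range(n)]

def nearest_centroid_accuracy(train, test):
    """Fit a nearest-centroid classifier on train and score it on test."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    correct = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return correct / len(test)

# Fixed held-out test set; training sets of increasing size.
test_set = make_data(2000, seed=1)
results = []
for size in (10, 50, 250, 1250):
    train_set = make_data(size, seed=2)
    acc = nearest_centroid_accuracy(train_set, test_set)
    results.append((size, acc))
    print(f"train size {size:5d}  accuracy {acc:.3f}")
```

Plotting size against accuracy from `results` gives the learning curve; in practice you would also repeat each size with several resamples and plot the spread, not just a single run.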

7. Naive Guesstimating

You need lots of data when applying machine learning algorithms.

Often, you need more data than you may reasonably require in classical statistics.

I often answer the question of how much data is required with the flippant response:

Get and use as much data as you can.

If pressed with the question, and with zero knowledge of the specifics of your problem, I would say something naive like:

  • You need thousands of examples.
  • No fewer than hundreds.
  • Ideally, tens or hundreds of thousands for “average” modeling problems.
  • Millions or tens-of-millions for “hard” problems like those tackled by deep learning.

Again, this is just more ad hoc guesstimating, but it’s a starting point if you need it. So get started!

8. Get More Data (No Matter What!?)

Big data is often discussed along with machine learning, but you may not require big data to fit your predictive model.

Some problems require big data, all the data you have. For example, simple statistical machine translation.

If you are performing traditional predictive modeling, then there will likely be a point of diminishing returns in the training set size, and you should study your problems and your chosen model/s to see where that point is.

Keep in mind that machine learning is a process of induction. The model can only capture what it has seen. If your training data does not include edge cases, they will very likely not be supported by the model.

Don't Delay; Get Started Now

Do not let the question of training set size stop you from getting started on your modeling problem. Get as much data as you can, use every resource available, and verify whether your model is effective on your problem.

Learn something, then take action to better understand what you have with further analysis, extend the data you have with augmentation, or gather more data from your domain.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

There is a lot of discussion around this question on Q&A sites like Quora, StackOverflow, and CrossValidated. Below are a few choice examples that may help.

I expect that there are some great statistical studies on this question; here are a few I could find.

Other related articles.

If you know of more, please let me know in the comments below.

Summary

In this post, you discovered a suite of ways to think and reason about the problem of answering the common question:

How much training data do I need for machine learning?

Did any of these methods help?
Let me know in the comments below.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Except, of course, the question of how much data you specifically need.
