Scalable Machine Learning

2023-05-16

Machine learning is part of a broader field known as artificial intelligence. It evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data; such algorithms overcome strictly static program instructions by making data-driven predictions or decisions through a model built from sample inputs.

Machine Learning Categories

We can broadly categorize machine learning into supervised and unsupervised categories based on the approach. There are other categories as well, but we'll keep ourselves to these two (a short sketch contrasting them follows this list):

• Supervised learning works with a set of data that contains both the inputs and the desired output. It is further divided into two broad sub-categories called classification and regression:
  - Classification algorithms are related to categorical output, like whether a property is occupied or not.
  - Regression algorithms are related to a continuous output range, like the value of a property.
• Unsupervised learning, on the other hand, works with a set of data which only has input values.

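To make the distinction concrete, here is a minimal sketch using the MLlib types that appear later in this post (the feature values are illustrative, not taken from the article) -

// Supervised learning data: features paired with a known label (here, species index 0)
LabeledPoint supervisedExample = new LabeledPoint(0.0, Vectors.dense(5.1, 3.5, 1.4, 0.2));
// Unsupervised learning data: the same features, but with no label attached
Vector unsupervisedExample = Vectors.dense(5.1, 3.5, 1.4, 0.2);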

Machine Learning Workflow
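
As we'll see in the rest of this post, the workflow involves preparing and analyzing the data, several iterations of model training and validation, and a final round of testing on held-out data.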

What is Spark MLlib?

Spark MLlib is Apache Spark's machine learning component. One of the major attractions of Spark is its ability to scale computation massively, which is exactly what you need for machine learning algorithms. On top of this, MLlib provides most of the popular machine learning and statistical algorithms. This greatly simplifies the task of working on a large-scale machine learning project.

MLlib Algorithms

The popular algorithms and utilities in Spark MLlib are:

1. Basic Statistics
2. Regression
3. Classification
4. Recommendation System
5. Clustering
6. Dimensionality Reduction
7. Feature Extraction
8. Optimization

“Hello World” of Machine Learning

Consider a multivariate labeled dataset consisting of the length and width of sepals and petals of different species of Iris. This gives our problem objective: can we predict the species of an Iris from the length and width of its sepal and petal?

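Each record in the dataset is a comma-separated line with four numeric features followed by the species label. A row looks like this (values taken from the standard Iris dataset) -

5.1,3.5,1.4,0.2,Iris-setosa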

Configurations

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.X</artifactId>
    <version>2.X</version>
    <scope>provided</scope>
</dependency>

Setting up the Data

Initialize the SparkContext to work with Spark APIs -

SparkConf conf = new SparkConf()
    .setAppName("Main")
    .setMaster("local[X]");
JavaSparkContext sc = new JavaSparkContext(conf);

Then we have to load the data in Spark -

String dataFile = "file_location";
JavaRDD<String> data = sc.textFile(dataFile);

Spark MLlib offers several data types, both local and distributed, to represent the input data and corresponding labels. The simplest of these data types is Vector -

JavaRDD<Vector> inputData = data
    .map(line -> {
        String[] parts = line.split(",");
        double[] v = new double[parts.length - 1];
        for (int i = 0; i < parts.length - 1; i++) {
            v[i] = Double.parseDouble(parts[i]);
        }
        return Vectors.dense(v);
    });

Note that we've included only the input features here, mostly to perform statistical analysis. A training example typically consists of multiple input features and a label, represented by the class LabeledPoint.

Map<String, Integer> map = new HashMap<>();
map.put("Iris-setosa", 0);
map.put("Iris-versicolor", 1);
map.put("Iris-virginica", 2);

JavaRDD<LabeledPoint> labeledData = data
    .map(line -> {
        String[] parts = line.split(",");
        double[] v = new double[parts.length - 1];
        for (int i = 0; i < parts.length - 1; i++) {
            v[i] = Double.parseDouble(parts[i]);
        }
        return new LabeledPoint(map.get(parts[parts.length - 1]), Vectors.dense(v));
    });

Our output label in the dataset is textual, signifying the species of Iris. To feed this into a machine learning model, we have to convert it into numeric values.

Exploratory Data Analysis

EDA refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Our dataset, in this example, is small and well-formed. Hence we don't have to indulge in a lot of data analysis. Spark MLlib, however, is equipped with APIs that offer quite a bit of insight. Let's begin with some simple statistical analysis -

MultivariateStatisticalSummary summary = Statistics.colStats(inputData.rdd());
System.out.println("Summary Mean:");
System.out.println(summary.mean());
System.out.println("Summary Variance:");
System.out.println(summary.variance());
System.out.println("Summary Non-zero:");
System.out.println(summary.numNonzeros());

Here, we're observing the mean and variance of the features we have. This is helpful in determining whether we need to perform normalization of features; it's useful to have all features on a similar scale. We are also taking note of non-zero values, which can adversely impact model performance. Another important metric to analyze is the correlation between features in the input data -

Matrix correlMatrix = Statistics.corr(inputData.rdd(), "pearson");
System.out.println("Correlation Matrix:");
System.out.println(correlMatrix.toString());

A high correlation between any two features suggests they are not adding any incremental value and one of them can be dropped.

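As an illustration (a sketch that is not in the original article; the 0.9 cutoff and the variable names are assumptions), one way to act on the correlation matrix is to locate a highly correlated pair of features and rebuild the vectors without one member of the pair -

// Find one feature that is highly correlated with another (|r| > 0.9, an assumed cutoff)
int dropIndex = -1;
for (int i = 0; i < correlMatrix.numRows() && dropIndex < 0; i++) {
    for (int j = i + 1; j < correlMatrix.numCols(); j++) {
        if (Math.abs(correlMatrix.apply(i, j)) > 0.9) {
            dropIndex = j; // drop the second feature of the correlated pair
            break;
        }
    }
}
// Rebuild the feature vectors without the dropped column (only if a pair was found)
final int toDrop = dropIndex;
JavaRDD<Vector> reducedData = toDrop < 0 ? inputData : inputData.map(v -> {
    double[] all = v.toArray();
    double[] kept = new double[all.length - 1];
    for (int i = 0, k = 0; i < all.length; i++) {
        if (i != toDrop) {
            kept[k++] = all[i];
        }
    }
    return Vectors.dense(kept);
});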

Splitting the Data

If we recall our discussion of the machine learning workflow, it involves several iterations of model training and validation followed by final testing. For this to happen, we have to split our training data into training, validation, and test sets. To keep things simple, we'll skip the validation part. So, let's split our data into training and test sets -

// labeledData is the RDD of LabeledPoint built earlier; randomSplit normalizes the weights
JavaRDD<LabeledPoint>[] splits = labeledData.randomSplit(new double[] { 0.7, 0.2 }, 10L);
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

Model Training

We've reached a stage where we've analyzed and prepared our dataset. All that's left is to feed this into a model and start the magic -

LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
    .setNumClasses(3)
    .run(trainingData.rdd());

Here, we are using a three-class Limited-memory BFGS based classifier.

Model Evaluation

Remember that model training involves multiple iterations, but for simplicity, we’ve just used a single pass here. Now that we’ve trained our model, it’s time to test this on the test dataset -

JavaPairRDD<Object, Object> predictionAndLabels = testData
    .mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
double accuracy = metrics.accuracy();
System.out.println("Model Accuracy on Test Data: " + accuracy);

Now, how do we measure the effectiveness of a model? There are several metrics we can use, but one of the simplest is accuracy. Simply put, accuracy is the ratio of the number of correct predictions to the total number of predictions. However, accuracy is not a very effective metric in some problem domains. Other, more sophisticated metrics are Precision and Recall (F1 score), the ROC curve, and the Confusion Matrix.

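For the multiclass problem here, a minimal sketch of those richer metrics, reusing the MulticlassMetrics instance from the evaluation step, could look like this (ROC curves apply to binary classifiers via BinaryClassificationMetrics, so they are omitted) -

// Per-class precision, recall, and F1, plus the confusion matrix
System.out.println("Confusion Matrix:\n" + metrics.confusionMatrix());
for (double label : metrics.labels()) {
    System.out.println("Class " + label
        + ": precision = " + metrics.precision(label)
        + ", recall = " + metrics.recall(label)
        + ", F1 = " + metrics.fMeasure(label));
}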

Finally, Persist the Model

model.save(sc, "test\\model\\logistic-regression");
LogisticRegressionModel sameModel = LogisticRegressionModel
    .load(sc, "test\\model\\logistic-regression");
Vector newData = Vectors.dense(new double[]{1, 1, 1, 1});
double prediction = sameModel.predict(newData);
System.out.println("Model Prediction on New Data = " + prediction);

We often need to save the trained model to the file system and load it for prediction on production data. So, we're saving the model to the file system and loading it back. After loading, the model can be used straight away to predict output on new data.

Thanks for reading 💜

Translated from: https://medium.com/@vivekshivhare/scalable-machine-learning-on-spark-dacc3512e7ad
