R-Modeling (step 4)

2023-11-16

[I] Regression

OLS Regression

Description | Function
simple linear regression | lm(Y ~ X1, data)
polynomial regression | lm(Y ~ X1 + I(X1^2), data)
multiple linear regression | lm(Y ~ X1 + X2 + ... + Xk, data)
multiple linear regression with interaction terms | lm(Y ~ X1 + X2 + X1:X2, data)
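For example, on the built-in mtcars data (the variable names below come from that dataset, not from the table):

> fit <- lm(mpg ~ wt + hp + wt:hp,data=mtcars)   # multiple regression with an interaction
> summary(fit)                                   # coefficients, R-squared, F statistic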

selecting the optimal regression model

Function | Description
anova(fit1,fit2) | compare nested models
AIC(fit1,fit2) | compare models by the Akaike Information Criterion
stepAIC(fit,direction=) | stepwise selection: "forward"/"backward" (MASS package)
regsubsets() | all-subsets regression (leaps package)
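A brief sketch of both approaches on an mtcars fit (stepAIC() lives in the MASS package, regsubsets() in leaps):

> library(MASS)
> fit <- lm(mpg ~ wt + hp + disp + drat,data=mtcars)
> stepAIC(fit,direction="backward")        # drop predictors while AIC improves
> library(leaps)
> leaps <- regsubsets(mpg ~ wt + hp + disp + drat,data=mtcars,nbest=4)
> plot(leaps,scale="adjr2")                # compare all subsets by adjusted R-squared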

regression diagnostics

Function | Description
plot() | basic diagnostic plots
qqPlot() | quantile-comparison plot (car)
durbinWatsonTest() | Durbin-Watson test for autocorrelated errors (car)
crPlots() | component-plus-residual plots (car)
ncvTest() | score test for non-constant error variance (car)
spreadLevelPlot() | spread-level plot (car)
outlierTest() | Bonferroni outlier test (car)
avPlots() | added-variable plots (car)
influencePlot() | regression influence plot (car)
scatterplot() | enhanced scatter plot (car)
scatterplotMatrix() | enhanced scatter plot matrix (car)
vif() | variance inflation factors (car)
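Most of these come from the car package; a minimal diagnostic pass on an mtcars fit might look like:

> library(car)
> fit <- lm(mpg ~ wt + hp,data=mtcars)
> par(mfrow=c(2,2)); plot(fit)    # the four standard diagnostic plots
> qqPlot(fit)                     # studentized residuals vs. t quantiles
> ncvTest(fit)                    # score test for non-constant error variance
> vif(fit)                        # variance inflation factors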

unusual observations

  • types of unusual observations

1. outliers: outlierTest()
2. high-leverage points: hat.plot() (see the sketch below)
3. influential observations: Cook's distance

  • corrective measures

1. deleting observations
2. transforming variables
3. adding or deleting variables
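hat.plot() above is not a base R function; here is a minimal sketch built on hatvalues(), using the common 2p/n and 3p/n cutoffs:

> hat.plot <- function(fit){
      p <- length(coefficients(fit))          # number of estimated parameters
      n <- length(fitted(fit))                # number of observations
      plot(hatvalues(fit),main="Index Plot of Hat Values")
      abline(h=c(2,3)*p/n,col="red",lty=2)    # reference lines at 2p/n and 3p/n
      identify(1:n,hatvalues(fit),names(hatvalues(fit)))
  }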

Generalized Linear Models

the glm() function

  • glm()
    Distribution Family | Default Link Function
    binomial | (link = "logit")
    gaussian | (link = "identity")
    Gamma | (link = "inverse")
    inverse.gaussian | (link = "1/mu^2")
    poisson | (link = "log")
    quasi | (link = "identity", variance = "constant")
    quasibinomial | (link = "logit")
    quasipoisson | (link = "log")
  • functions used with glm()
    Function | Description
    summary() | show details of the fitted model
    coefficients()/coef() | list the parameters of the fitted model
    confint() | give confidence intervals for the model parameters
    residuals() | list the residuals of the fitted model
    anova() | generate an ANOVA table comparing two fitted models
    plot() | generate diagnostic plots for evaluating the fitted model
    predict() | make predictions on new data with the fitted model
    deviance() | return the deviance of the fitted model
    df.residual() | return the residual degrees of freedom of the fitted model
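A toy illustration of these helpers on simulated data (the variables x and y are made up for this sketch):

> set.seed(1)
> x <- rnorm(100)
> y <- rbinom(100,1,plogis(0.5 + 1.2*x))   # simulate a binary response
> fit <- glm(y ~ x,family=binomial())
> summary(fit)                             # details of the fitted model
> exp(coef(fit))                           # odds ratios
> predict(fit,newdata=data.frame(x=0),type="response")  # predicted probability at x = 0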

logistic regression

> data(Affairs,package="AER")
> summary(Affairs)
> table(Affairs$affairs)
> Affairs$ynaffair[Affairs$affairs > 0] <- 1
> Affairs$ynaffair[Affairs$affairs == 0] <- 0
> Affairs$ynaffair <- factor(Affairs$ynaffair,levels=c(0,1),labels=c("No","Yes"))
> table(Affairs$ynaffair)
> fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
             religiousness + education + occupation + rating,
             family = binomial(),data = Affairs)
> fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                      family = binomial(),data = Affairs)
> anova(fit.reduced,fit.full,test="Chisq")
> coef(fit.reduced)
> exp(coef(fit.reduced))
> testdata<-data.frame(rating=c(1,2,3,4,5),age=mean(Affairs$age),
                                   yearsmarried=mean(Affairs$yearsmarried),
                                   religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob <- predict(fit.reduced,newdata=testdata,type="response")
> testdata <- data.frame(rating = mean(Affairs$rating),
                                    age=seq(17,57,10),
                                    yearsmarried=mean(Affairs$yearsmarried),
                                    religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob<-predict(fit.reduced,newdata=testdata,type="response")
> testdata
> deviance(fit.reduced)/df.residual(fit.reduced)
> fit <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                 family = binomial(),data=Affairs)
> fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                 family = quasibinomial(),data=Affairs)
> pchisq(summary(fit.od)$dispersion * fit$df.residual,
             fit$df.residual,lower=F)

Poisson regression

> data(breslow.dat,package="robust")
> names(breslow.dat)
> summary(breslow.dat[c(6,7,8,10)])
> opar<-par(no.readonly=TRUE)
> par(mfrow=c(1,2))
> attach(breslow.dat)
> hist(sumY,breaks=20,xlab="Seizure Count",main="Distribution of Seizures")
> boxplot(sumY~Trt,xlab="Treatment",main="Group Comparisons")
> par(opar)
> fit<-glm(sumY~Base + Age +Trt,data=breslow.dat,family=poisson())
> summary(fit)
> coef(fit)
> exp(coef(fit))
> deviance(fit)/df.residual(fit)
> library(qcc)
> qcc.overdispersion.test(breslow.dat$sumY,type="poisson")
> fit.od<-glm(sumY~Base + Age + Trt,data=breslow.dat,
       family=quasipoisson())
> summary(fit.od)
> fit <- glm(sumY~Base + Age + Trt,data=breslow.dat,offset=log(time),family=poisson)

[II] Cluster Analysis

1. steps in a cluster analysis

1. choose appropriate variables
2. scale the data
3. screen for outliers
4. calculate distances
5. select a clustering algorithm
6. obtain one or more cluster solutions
7. determine the number of clusters
8. obtain the final cluster solution
9. visualize the results
10. interpret the clusters
11. validate the results

2. calculating distances

> data(nutrient,package="flexclust")
> head(nutrient,4)
> d<-dist(nutrient)
> as.matrix(d)[1:4,1:4]

3. hierarchical clustering analysis

> data(nutrient,package="flexclust")
> row.names(nutrient) <- tolower(row.names(nutrient))
> nutrient.scaled<-scale(nutrient)

> d<-dist(nutrient.scaled)

> fit.average <-hclust(d,method="average")
> plot(fit.average,hang=-1,cex=.8,main="Average Linkage Clustering")
> library(NbClust)
> devAskNewPage(ask=TRUE)
> nc<-NbClust(nutrient.scaled,distance="euclidean",
                       min.nc=2,max.nc=15,method="average")
> table(nc$Best.n[1,])
> barplot(table(nc$Best.n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> clusters<-cutree(fit.average,k=5)
> table(clusters)
> aggregate(nutrient,by=list(cluster=clusters),median)
> aggregate(as.data.frame(nutrient.scaled),by=list(cluster=clusters),median)
> plot(fit.average,hang=-1,cex=.8,
         main="Average Linkage Clustering\n5 Cluster Solution")
> rect.hclust(fit.average,k=5)

4. partitioning cluster analysis

  • K-means clustering
> wssplot<-function(data,nc=15,seed=1234){
      wss<-(nrow(data)-1)*sum(apply(data,2,var))
      for(i in 2:nc){
          set.seed(seed)
          wss[i]<-sum(kmeans(data,centers=i)$withinss)}
      plot(1:nc,wss,type="b",xlab="Number of Clusters",
           ylab="Within groups sum of squares")}
> data(wine,package="rattle")
> head(wine)
> df<-scale(wine[-1])
> wssplot(df)
> library(NbClust)
> set.seed(1234)
> devAskNewPage(ask=TRUE)
> nc<-NbClust(df,min.nc=2,max.nc=15,method="kmeans")
> table(nc$Best.n[1,])
> barplot(table(nc$Best.n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> set.seed(1234)
> fit.km<-kmeans(df,3,nstart=25)
> fit.km$size
> fit.km$centers
> aggregate(wine[-1],by=list(cluster=fit.km$cluster),mean)
  • partitioning around medoids (PAM)
> library(cluster)
> set.seed(1234)
> fit.pam<-pam(wine[-1],k=3,stand=TRUE)
> fit.pam$medoids
> clusplot(fit.pam,main="Bivariate Cluster Plot")
> ct.pam<-table(wine$Type,fit.pam$clustering)
> library(flexclust)
> randIndex(ct.pam)

5. avoiding non-existent clusters

> library(fMultivar)
> set.seed(1234)
> df<-rnorm2d(1000,rho=.5)
> df<-as.data.frame(df)
> plot(df,main="Bivariate Normal Distribution with rho=0.5")
> wssplot(df)
> library(NbClust)
> nc<-NbClust(df,min.nc=2,max.nc=15,method="kmeans")
> dev.new()
> barplot(table(nc$Best.n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> library(ggplot2)
> library(cluster)
> fit<-pam(df,k=2)
> df$clustering<-factor(fit$clustering)
> ggplot(data=df,aes(x=V1,y=V2,color=clustering,shape=clustering)) +
             geom_point() + ggtitle("Clustering of Bivariate Normal Data")
> plot(nc$All.index[,4],type="o",ylab="CCC",xlab="Number of clusters",col="blue")

[III] Classification

data preparation

> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"breast-cancer-wisconsin/breast-cancer-wisconsin.data"
> url<-paste(loc,ds,sep="")
> breast<-read.table(url,sep=",",header=FALSE,na.strings="?")
> names(breast)<-c("ID","clumpThickness","sizeUniformity",
                              "shapeUniformity","marginalAdhesion",
                              "singleEpithelialCellSize","bareNuclei",
                              "blandChromatin","normalNucleoli","mitosis","class")
> df<-breast[-1]
> df$class<-factor(df$class,levels=c(2,4),
                           labels=c("benign","malignant"))
> set.seed(1234)
> train<-sample(nrow(df),0.7*nrow(df))
> df.train<-df[train,]
> df.validate<-df[-train,]
> table(df.train$class)
> table(df.validate$class)

logistic regression

> fit.logit<-glm(class~.,data=df.train,family=binomial())
> summary(fit.logit)
> prob<-predict(fit.logit,df.validate,type="response")
> logit.pred<-factor(prob>.5,levels=c(FALSE,TRUE),
                             labels=c("benign","malignant"))
> logit.perf<-table(df.validate$class,logit.pred,
                            dnn=c("Actual","Predicted"))
> logit.perf

decision tree

  • classic decision tree
> library(rpart)
> set.seed(1234)
> dtree<-rpart(class~.,data=df.train,method="class",
                       parms=list(split="information"))
> dtree$cptable
> plotcp(dtree)
> dtree.pruned<-prune(dtree,cp=.0125)
> library(rpart.plot)
> prp(dtree.pruned,type=2,extra=104,
        fallen.leaves=TRUE,main="Decision Tree")
> dtree.pred<-predict(dtree.pruned,df.validate,type="class")
> dtree.perf<-table(df.validate$class,dtree.pred,
                             dnn=c("Actual","Predicted"))
> dtree.perf
  • conditional inference tree
> library(party)
> fit.ctree<-ctree(class~.,data=df.train)
> plot(fit.ctree,main="Conditional Inference Tree")
> ctree.pred<-predict(fit.ctree,df.validate,type="response")
> ctree.perf<-table(df.validate$class,ctree.pred,
                            dnn=c("Actual","Predicted"))
> ctree.perf

random forest

> library(randomForest)
> set.seed(1234)
> fit.forest<-randomForest(class~.,data=df.train,na.action=na.roughfix,importance=TRUE)
> fit.forest
> importance(fit.forest,type=2)
> forest.pred<-predict(fit.forest,df.validate)
> forest.perf<-table(df.validate$class,forest.pred,
                              dnn=c("Actual","Predicted"))
> forest.perf

support vector machines

  • svm
> library(e1071)
> set.seed(1234)
> fit.svm<-svm(class~.,data=df.train)
> fit.svm
> svm.pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
                           svm.pred,dnn=c("Actual","Predicted"))
> svm.perf
  • svm model with an RBF kernel
> set.seed(1234)
> tuned<-tune.svm(class~.,data=df.train,gamma=10^(-6:1),cost=10^(-10:10))
> tuned
> fit.svm<-svm(class~.,data=df.train,gamma=.01,cost=1)
> svm.pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
                            svm.pred,dnn=c("Actual","Predicted"))
> svm.perf

choosing the best predictive solution

> performance<-function(table,n=2){
        if(!all(dim(table)==c(2,2)))
              stop("Must be a 2 × 2 table")
   tn=table[1,1]
   fp=table[1,2]
   fn=table[2,1]
   tp=table[2,2]
   sensitivity=tp/(tp+fn)
   specificity=tn/(tn+fp)
   ppp=tp/(tp+fp)
   npp=tn/(tn+fn)
   hitrate=(tp+tn)/(tp+tn+fp+fn)
   result<-paste("Sensitivity=",round(sensitivity,n),
           "\nSpecificity=",round(specificity,n),
           "\nPositive Predictive Value=",round(ppp,n),
           "\nNegative Predictive Value=",round(npp,n),
           "\nAccuracy=",round(hitrate,n),"\n",sep="")
   cat(result)
  }
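Applied to the confusion matrices computed above, for example:

> performance(logit.perf)
> performance(dtree.perf)
> performance(forest.perf)
> performance(svm.perf)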

data mining with the rattle package

> install.packages("rattle")
> rattle()
> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"pima-indians-diabetes/pima-indians-diabetes.data"
> url <-paste(loc,ds,sep="")
> diabetes<-read.table(url,sep=",",header=FALSE)
> names(diabetes)<-c("npregant","plasma","bp","triceps",
                                  "insulin","bmi","pedigree","age","class")
> diabetes$class<-factor(diabetes$class,levels=c(0,1),
                                      labels=c("normal","diabetic"))
> library(rattle)
> rattle()
> cv<-matrix(c(145,50,8,27),nrow=2)
> performance(as.table(cv))

[IV] Time Series

Function | Package | Description
ts() | stats | create a time-series object
plot() | graphics | draw a line graph of the time series
start() | stats | return the start time of the time series
end() | stats | return the end time of the time series
frequency() | stats | return the frequency (observations per unit time) of the series
window() | stats | subset a time-series object
ma() | forecast | fit a simple moving-average model
stl() | stats | decompose a time series into seasonal, trend, and irregular components using LOESS smoothing
monthplot() | stats | plot the seasonal components of a time series
seasonplot() | forecast | generate a season plot
HoltWinters() | stats | fit an exponential smoothing model
forecast() | forecast | predict future values of a time series
accuracy() | forecast | report goodness-of-fit measures for a time-series model
ets() | forecast | fit an exponential smoothing model, automatically selecting the optimal model
lag() | stats | return a lagged version of the time series
Acf() | forecast | estimate the autocorrelation function
Pacf() | forecast | estimate the partial autocorrelation function
diff() | base | return lagged differences of the series
ndiffs() | forecast | find the optimal number of differences needed to remove trend from the series
adf.test() | tseries | apply an ADF test to determine whether the series is stationary
arima() | stats | fit an ARIMA model
Box.test() | stats | perform a Ljung-Box test of whether the model residuals are independent
bds.test() | tseries | perform a BDS test of whether the series values are independent and identically distributed
auto.arima() | forecast | automatically select an ARIMA model
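As a quick end-to-end sketch using the built-in AirPassengers series (the exact model auto.arima() picks may vary with the forecast package version):

> library(forecast)                  # Acf(), ets(), auto.arima(), forecast()
> plot(AirPassengers)                # monthly airline passengers, 1949-1960
> fit <- auto.arima(AirPassengers)   # automatically select an ARIMA model
> accuracy(fit)                      # in-sample goodness-of-fit measures
> plot(forecast(fit,h=12))           # forecast the next 12 months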

END!
