
Logistic Regression Parameter Settings and Classification Evaluation (Spark 2.0, Python Scikit)

2016-08-27


Parameter Settings

α:

The weight-update formula used in each iteration of the gradient ascent algorithm contains α:

w := w + α * ∇f(w), which for the log-likelihood works out to w := w + α * X^T * (y - sigmoid(X*w))
To better understand the roles of α and the maximum number of iterations, here is the Python version of the computation.

# Gradient ascent - compute the regression coefficients.
# Initialize each regression coefficient to 1.
# Repeat maxCycles times:
#    compute the gradient over the entire dataset
#    update the coefficient vector by alpha * gradient
# Return the regression coefficients.
from numpy import exp, mat, ones, shape

def sigmoid(inX):
    return 1.0 / (1.0 + exp(-inX))

def gradAscent(dataMatIn, classLabels, alpha=0.001, maxCycles=500):
    dataMatrix = mat(dataMatIn)              # convert to numpy matrix types
    labelMat = mat(classLabels).transpose()
    m, n = shape(dataMatrix)
    weights = ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = labelMat - h                 # difference between the true and predicted class;
        weights = weights + alpha * dataMatrix.transpose() * error  # adjust the coefficients in that direction
    return weights
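A minimal usage sketch (the toy data below is hypothetical, not from the original post; by this function's convention, the first column of each row is the constant 1.0 for the intercept):

dataArr = [[1.0, 0.5, 1.2],
           [1.0, 1.5, 0.3],
           [1.0, 0.2, 2.0]]
labels = [1, 0, 1]
w = gradAscent(dataArr, labels, alpha=0.001, maxCycles=500)
print(w)  # a 3x1 matrix of fitted coefficients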

λ

λ is the regularization parameter (it controls generalization ability); a prerequisite for adding regularization is that the feature values be normalized.

In practice, to strengthen the model's generalization ability and prevent the trained model from overfitting (especially with a large number of sparse features, where model complexity is high and dimensionality reduction is needed), we minimize the training error while adding a regularization term that reduces model complexity. In logistic regression, either L1 or L2 regularization can be used.
The loss function then has the form:

    J(w) = loss(w) + λ * ||w||

where ||w|| is the L1 or L2 norm of the weight vector.
A regularization term is added to the loss function: the L1 or L2 norm of the weights multiplied by a λ that controls the balance between the loss and the regularization term. Intuitively, the point of preventing overfitting is to keep the trained model from depending too heavily on any single feature. When minimizing the loss, a very large weight in one dimension can make the fitted values match the true values very closely; regularization makes the overall cost larger in that case, which avoids over-reliance on that dimension. Again, a prerequisite for regularization is that the feature values be normalized.
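As a rough numerical sketch of that idea (the helper below and its toy data are illustrative, not from the original post), the regularized cost is the log-loss plus λ times a norm of the weights, so a model with larger weights pays a higher overall cost for the same fit:

import numpy as np

def regularized_cost(w, X, y, lam, norm="l2"):
    h = 1.0 / (1.0 + np.exp(-X.dot(w)))               # predicted probabilities
    logloss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    # L1 penalty: sum of absolute weights; L2 penalty: sum of squared weights
    penalty = lam * (np.abs(w).sum() if norm == "l1" else (w ** 2).sum())
    return logloss + penalty

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(regularized_cost(np.array([0.1, 0.2]), X, y, lam=0.1))  # small weights, small penalty
print(regularized_cost(np.array([1.0, 2.0]), X, y, lam=0.1))  # large weights, larger penalty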

threshold

The threshold variable controls the classification cutoff and defaults to 0.5: if the predicted value is below threshold, the prediction is class 0.0; otherwise it is 1.0.
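The rule itself is just a comparison (a one-line sketch of the behavior described above):

def classify(probability, threshold=0.5):
    # below the threshold -> class 0.0, otherwise class 1.0
    return 0.0 if probability < threshold else 1.0

print(classify(0.3))  # 0.0
print(classify(0.7))  # 1.0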

In the Spark Java API, ElasticNetParam corresponds to α and RegParam to λ. Note that this α is the elastic-net mixing parameter, not the gradient-ascent step size above: α = 0 gives a pure L2 penalty, α = 1 a pure L1 penalty.

LogisticRegression lr=new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.3)
                .setElasticNetParam(0.2)                
                .setThreshold(0.5);

Classification Evaluation

References: http://www.cnblogs.com/tovin/p/3816289.html
http://blog.sina.com.cn/s/blog_900690c60101czyo.html
http://blog.chinaunix.net/uid-446337-id-94448.html
http://blog.csdn.net/abcjennifer/article/details/7834256

Confusion matrix:
Consider a binary classification problem, where each instance is classified as positive or negative. Four outcomes are possible. A positive instance predicted as positive is a true positive (TP); a negative instance predicted as positive is a false positive (FP). Likewise, a negative instance predicted as negative is a true negative (TN), and a positive instance predicted as negative is a false negative (FN).
TP: correct hits;
FN: misses, true matches that were not found;
FP: false alarms, reported matches that are wrong;
TN: correctly rejected non-matches.
                    Predicted positive    Predicted negative
Actual positive            TP                     FN
Actual negative            FP                     TN
Precision: precision = TP / (TP + FP)
Of all the samples the model labels positive, how many are truly positive?
Recall: recall = TP / (TP + FN)
Accuracy: accuracy = (TP + TN) / (TP + FP + TN + FN)
Accuracy reflects the classifier's ability to judge the whole sample set: positives judged as positive, negatives as negative.
How to trade off precision against recall?
F1 Score = 2*P*R / (P + R), where P and R are precision and recall.
When both precision and recall need to be high, the F1 Score is a good single measure.

Why are there so many metrics? Because pattern classification and machine learning need them: judging a classifier's ability on a given sample set, or in different application settings, calls for different metrics. Suppose there are 100 samples in total (P + N = 100) and only one is positive (P = 1). Looking at accuracy alone, no model training is needed at all: simply label every test sample as negative and accuracy already reaches 99%, which is very high but says nothing about the model's real ability. Moreover, in statistical signal analysis, mistakes on different classes carry different penalties. For example, a radar receives 100 incoming signals, of which only 3 are real missiles and the other 97 are enemy decoy signals. If the system judges 98 signals (the 97 decoys plus one real missile) to be decoys, then Accuracy = 99%, very high; the remaining two are missile signals and are intercepted, so Recall = 2/3 = 66.67% and Precision = 2/2 = 100%, also very high. But the one missile that slipped through causes a disaster.

ROC curve and AUC
Sometimes we need to trade off precision against recall.
Vary the classifier's threshold and plot the ROC curve, with FPR (false positive rate) on the x-axis and TPR (true positive rate) on the y-axis.
Area Under ROC Curve (AUC): the area under the ROC curve. AUC typically lies between 0.5 and 1.0, and a larger AUC indicates better performance.
Precision and recall influence each other. Ideally both are high, but usually higher precision means lower recall and vice versa; if both are low, something is wrong somewhere.
[Figure: ROC curve]

Spark 2.0 Classification Evaluation

//Obtain the training summary of the fitted regression model
LogisticRegressionTrainingSummary trainingSummary = lrModel.summary();

// Obtain the loss per iteration.
//Loss at each iteration; it generally decreases over the iterations
double[] objectiveHistory = trainingSummary.objectiveHistory();
for (double lossPerIteration : objectiveHistory) {
  System.out.println(lossPerIteration);
}

// Obtain the metrics useful to judge performance on test data.
// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a binary
// classification problem.
//Cast to the binary LR summary to use the confusion-matrix and ROC based metrics. Spark 2.0 does not yet provide this for multiclass.
BinaryLogisticRegressionSummary binarySummary =
  (BinaryLogisticRegressionSummary) trainingSummary;

// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
Dataset<Row> roc = binarySummary.roc();//get the ROC as a DataFrame
roc.show();//show the ROC table; you can use this data to plot the ROC curve yourself
roc.select("FPR").show();
System.out.println(binarySummary.areaUnderROC());//AUC

// Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression with
// this selected threshold.
//Compute the F-Measure for each threshold, then use the maximum F1 to find and reset the model's best threshold.
Dataset<Row> fMeasure = binarySummary.fMeasureByThreshold();
double maxFMeasure = fMeasure.select(functions.max("F-Measure")).head().getDouble(0);//the maximum F1 value
double bestThreshold = fMeasure.where(fMeasure.col("F-Measure").equalTo(maxFMeasure))
  .select("threshold").head().getDouble(0);//the threshold that attains the maximum F1 (the best threshold)
lrModel.setThreshold(bestThreshold);//set the model's threshold to this best classification threshold

Complete Logistic Regression Code
http://spark.apache.org/docs/latest/ml-classification-regression.html

package my.spark.ml.practice.classification;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.classification.LogisticRegressionTrainingSummary;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;


public class MyLogisticRegression {

    public static void main(String[] args) {
        SparkSession spark=SparkSession
                .builder()
                .appName("LR")
                .master("local[4]")
                .config("spark.sql.warehouse.dir","file///:G:/Projects/Java/Spark/spark-warehouse" )
                .getOrCreate();
        String path="G:/Projects/CgyWin64/home/pengjy3/softwate/spark-2.0.0-bin-hadoop2.6/"
                + "data/mllib/sample_libsvm_data.txt";

        //Suppress logging
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
        Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF);       

        //Load training data
        Dataset<Row> trainingDataFrame=spark.read().format("libsvm").load(path);      


        LogisticRegression lr=new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.3)
                .setElasticNetParam(0.2)                
                .setThreshold(0.5);

        //fit the model
        LogisticRegressionModel lrModel=lr.fit(trainingDataFrame);

        //print the coefficients and intercept for logistic regression
        System.out.println("Coefficients: "+lrModel.coefficients()+" Intercept: "+lrModel.intercept());

        //Extract the summary from the returned LogisticRegressionModel
        LogisticRegressionTrainingSummary summary=lrModel.summary();

        //Obtain the loss per iteration.
        double[] objectiveHistory=summary.objectiveHistory();
        for(double lossPerIteration:objectiveHistory){
            System.out.println(lossPerIteration);
        }
        // Obtain the metrics useful to judge performance on test data.
        // We cast the summary to a BinaryLogisticRegressionSummary since the problem is a binary
        // classification problem.
        BinaryLogisticRegressionTrainingSummary binarySummary=
                (BinaryLogisticRegressionTrainingSummary)summary;
        //Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
        Dataset<Row> roc=binarySummary.roc();
        roc.show((int) roc.count());//show all rows; roc.show() displays only 20 by default
        roc.select("FPR").show();
        System.out.println(binarySummary.areaUnderROC());

        // Get the threshold corresponding to the maximum F-Measure and rerun LogisticRegression with
        // this selected threshold.
        Dataset<Row> fMeasure = binarySummary.fMeasureByThreshold();
        double maxFMeasure = fMeasure.select(functions.max("F-Measure")).head().getDouble(0);
        double bestThreshold = fMeasure.where(fMeasure.col("F-Measure").equalTo(maxFMeasure))
          .select("threshold").head().getDouble(0);
        lrModel.setThreshold(bestThreshold);            
    }

}

Scikit

Key Code Walkthrough

# For L1 penalization, sklearn.svm.l1_min_c allows us to calculate the
# lower bound for C in order to get a non-"null" (all feature
# weights at zero) model.
# Compute the lower bound for C, then scale it up across a range.
cs = l1_min_c(X, y, loss='log') * np.logspace(0, 3)

print("l1_min_c=%.4f" % l1_min_c(X, y, loss='log'))
# Output: l1_min_c=0.0143

clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
coefs_ = []
for c in cs:                 # loop over the different values of C
    clf.set_params(C=c)      # reset C
    clf.fit(X, y)            # refit
    coefs_.append(clf.coef_.ravel().copy())  # collect the coefficients
    print("score %.4f" % clf.score(X, y))    # classification score on the training data
coefs_ = np.array(coefs_)    # convert the list of coefficients to an np.array
plt.plot(np.log10(cs), coefs_)  # plot log(C) vs. the coefficients

Key parameters: C and penalty
penalty: choose 'l1' or 'l2'.
C: large values of C give more freedom to the model; conversely, smaller values of C constrain the model more. (C also controls the model's generalization ability, playing a role similar to RegParam (λ) in Spark above, except that C is the inverse of the regularization strength: smaller C means stronger regularization.)
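A quick sketch of this effect, reusing the two-class iris setup from the full example below (smaller C means stronger regularization and smaller coefficients):

import numpy as np
from sklearn import datasets, linear_model

iris = datasets.load_iris()
X, y = iris.data[iris.target != 2], iris.target[iris.target != 2]  # keep two classes

for C in (0.01, 1.0, 100.0):
    clf = linear_model.LogisticRegression(C=C, penalty='l2')
    clf.fit(X, y)
    print("C=%-6s ||w||=%.4f" % (C, np.linalg.norm(clf.coef_)))  # the norm grows with C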

Scikit-learn's LR supports multiclass (one-vs-rest) classification with L1 or L2 regularization.

This implementation can fit a multiclass (one-vs-rest) logistic regression with optional L2 or L1 regularization.
Binary-class L2-penalized logistic regression minimizes the following cost function:

    min_{w,c} (1/2) * w^T w + C * Σ_i log(1 + exp(-y_i * (x_i^T w + c)))
L1-regularized logistic regression solves the following optimization problem:

    min_{w,c} ||w||_1 + C * Σ_i log(1 + exp(-y_i * (x_i^T w + c)))

Choosing a solver for different situations:
Small dataset or L1 penalty ----> "liblinear"
Multinomial loss ----> "lbfgs" or "newton-cg"
Large dataset ----> "sag"
"sag": Stochastic Average Gradient descent, usually faster than the other solvers on large datasets.

It does not handle “multinomial” case, and is limited to L2-penalized
models, yet it is often faster than other solvers for large datasets,
when both the number of samples and the number of features are large.
Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the number of samples (and the number of features) is very large.
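A sketch of how those rules translate into constructor arguments (solver and multi_class values as documented in scikit-learn):

from sklearn.linear_model import LogisticRegression

clf_small = LogisticRegression(penalty='l1', solver='liblinear')           # small data or L1
clf_multi = LogisticRegression(multi_class='multinomial', solver='lbfgs')  # multinomial loss
clf_big = LogisticRegression(penalty='l2', solver='sag')                   # large data; 'sag' is L2-only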

The complete code:

print(__doc__)

# Author: Alexandre Gramfort 
# License: BSD 3 clause

from datetime import datetime
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn import datasets
from sklearn.svm import l1_min_c

iris = datasets.load_iris()
X = iris.data
y = iris.target

X = X[y != 2]
y = y[y != 2]

X -= np.mean(X, 0)

###############################################################################
# Demo path functions

cs = l1_min_c(X, y, loss='log') * np.logspace(0, 3)

print("Computing regularization path ...")
start = datetime.now()
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
coefs_ = []
for c in cs:
    clf.set_params(C=c)
    clf.fit(X, y)
    coefs_.append(clf.coef_.ravel().copy())
print("This took ", datetime.now() - start)

coefs_ = np.array(coefs_)  # 50x4 matrix: 50 values of C by 4 features
plt.plot(np.log10(cs), coefs_)
ymin, ymax = plt.ylim()
plt.xlabel('log(C)')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.show()