Digest | Multiple Regression in Python with Scikit-Learn

> A distillation of the information feed for March 23, 2021

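### A little machine learning every day

#### Deep learning vs. traditional machine learning

Source: [Deep Learning vs. Machine Learning | How Many of These Essential Differences Do You Know?](https://cloud.tencent.com/developer/article/1517984)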

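- **Data dependence**: As the amount of data grows, deep learning keeps improving while traditional machine learning methods tend to plateau. With small datasets, however, traditional algorithms often outperform deep learning.
- **Hardware dependence**: Deep learning algorithms rely heavily on high-end machines, whereas traditional machine learning algorithms can run on low-end machines.
- **Feature engineering**: In traditional machine learning, most features have to be identified by experts and then hand-coded according to the domain and data type. Deep learning algorithms instead try to learn higher-level features from the data.
- **Problem-solving approach**: With traditional machine learning, the usual practice is to break the problem into parts, solve each part separately, and combine the results. Deep learning, by contrast, favors solving the problem end to end.
- **Execution time**: Deep learning algorithms usually take a long time to train because they have so many parameters. Even a fairly advanced architecture such as ResNet takes about two weeks to train completely from scratch. Traditional machine learning takes far less time to train, from a few seconds to a few hours.
- **Interpretability**: Compared with deep learning, algorithms such as decision trees give us explicit rules that say what was chosen and why, so the reasoning behind the algorithm is easy to explain. For this reason, decision trees and linear/logistic regression are mainly used in industry settings that require interpretability.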
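
To make the interpretability point concrete, here is a minimal sketch that fits a shallow decision tree and prints its learned rules with `sklearn.tree.export_text`. The data is synthetic and the feature names `x0` and `x1` are placeholders chosen for illustration; none of this comes from the source article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (an assumption for illustration): the target jumps when x0 crosses 5
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 30.0, 10.0) + rng.normal(0, 0.5, size=200)

# A shallow tree keeps the rule set short enough to read
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested threshold rules
rules = export_text(tree, feature_names=["x0", "x1"])
print(rules)
```

The printed rules show which feature and threshold each split uses, which is the kind of explanation a deep network does not offer directly.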

#### Scikit-Learn

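[Introduction to Scikit-Learn](https://www.jianshu.com/p/cacbc6674984)

The Scikit-learn project was started in 2007 by data scientist David Cournapeau. It depends on other packages such as NumPy and SciPy and is an open-source framework built specifically for machine learning applications in Python. Scikit-learn implements a wide range of mature algorithms, is easy to install and use, and comes with plenty of examples as well as detailed tutorials and documentation.

Scikit-learn does not support deep learning or reinforcement learning, nor graphical models or sequence prediction. It does not support languages other than Python, does not support PyPy, and does not support GPU acceleration.

Scikit-learn's core functionality falls into six areas: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

[scikit-learn (sklearn) official documentation, Chinese edition](https://sklearn.apachecn.org/)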

```bash
# Check versions
# Python (>= 3.5)
# NumPy (>= 1.11.0)
# SciPy (>= 0.17.0)
python --version 
pip show numpy scipy

# Install
pip install -U scikit-learn
```

#### Multiple regression in Python with Scikit-Learn

Source: [Multiple Regression in python using Scikit-Learn:Predicting the Miles Per Gallon (mpg) of cars.](https://medium.com/@powusu381/multiple-regression-in-python-using-scikit-learn-predicting-the-miles-per-gallon-mpg-of-cars-4c8e512234be)

The walkthrough can be run in Google Colab.

##### Packages used

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.metrics import r2_score,mean_squared_error
from sklearn import preprocessing
```

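- **pandas**: a fast, powerful, flexible, and easy-to-use open-source tool for data analysis and manipulation.
- **numpy**: support for large multi-dimensional arrays and matrices, plus a large collection of high-level mathematical functions.
- **seaborn**: a Python data visualization library built on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- **statsmodels**: a complement to SciPy's statistics module. It lets users explore data, estimate statistical models, and run statistical tests.
- **sklearn** (Scikit-Learn)
  - Linear models `sklearn.linear_model`
    - Linear regression `LinearRegression`
    - Ridge regression `Ridge`, Lasso regression `Lasso`
      - Ridge regression adds a regularization term to the ordinary linear regression objective, so that the parameters stay as "simple" as possible while the fit error stays low, which improves generalization.
      - Lasso regression constrains the parameters with the L1 norm, which leaves as few non-zero parameters as possible.
      - The fundamental purpose of both ridge and lasso regression is to prevent overfitting.
  - Trees `sklearn.tree`
    - Decision tree `DecisionTreeRegressor`
  - Ensembles `sklearn.ensemble`
    - Random forest `RandomForestRegressor`
      - Bagging: build several base learners that are independent of one another. Each base learner draws a random sample with replacement from the original data (bootstrap sampling) and is trained on that sample. The results of the individual learners are then combined. For classification, the final prediction is the mode of the learners' predictions (majority vote); for regression, it is the mean of their predictions.
    - Gradient boosting `GradientBoostingRegressor`
      - Boosting: 1) the base learners depend strongly on one another, each one being built on top of the previous one, and the details depend on the base learner; 2) the outputs of all base learners are combined as a weighted linear sum to give the final result; 3) the result is an additive model, so it is optimized with a forward stagewise algorithm.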
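
As a rough illustration of the Ridge and Lasso notes above, the sketch below fits both on synthetic data in which only two of ten features matter. The data, the `alpha` values, and the variable names are assumptions made for this example and do not come from the article.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data (assumption): only the first two features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(0, 0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set coefficients to exactly zero

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Exact zeros -> ridge:", n_zero_ridge, "lasso:", n_zero_lasso)
```

Ridge typically keeps every coefficient small but non-zero, while Lasso tends to drop the irrelevant features entirely, which is why Lasso is sometimes used as a feature selector.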

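The data used for the regression analysis contains technical specifications for a number of cars. The dataset was downloaded from the UCI Machine Learning Repository, a collection of free datasets that you can use in your own machine learning projects. The original author's GitHub repository also has the data files and the Python code, written in a Jupyter notebook. In this project we try to predict a car's miles per gallon (mpg) with multiple regression.

##### Data preparation

Clean the data: for example, remove rows whose target value is null, remove duplicates, and drop columns that are not needed.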
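
The cleaning steps just listed can be sketched on a tiny made-up DataFrame. The column names and values below are hypothetical and only mimic the shape of the car data; the article's own loading code is separate.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one all-NaN column, one duplicate row, one missing target
raw = pd.DataFrame({
    "mpg": [18.0, 18.0, np.nan, 24.0],
    "horsepower": [130, 130, 95, 88],
    "Unnamed: 9": [np.nan] * 4,
})

clean = (
    raw.dropna(axis=1, how="all")   # drop columns that contain only NaN
       .dropna(subset=["mpg"])      # drop rows with no target value
       .drop_duplicates()           # drop repeated rows
       .reset_index(drop=True)      # renumber the remaining rows
)
print(clean)
```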

```python
file1 = 'https://raw.githubusercontent.com/prince381/car_mpg_predict/master/cars1.csv'
file2 = 'https://raw.githubusercontent.com/prince381/car_mpg_predict/master/cars2.csv'

cars1 = pd.read_csv(file1)   # read in the first data file
cars2 = pd.read_csv(file2)   # read in the second data file

# cars1 has extra unnamed columns that contain only NaN values,
# so we drop them.
cars1.drop(cars1.columns[9:],axis=1,inplace=True)

# concatenate the two data
cars = pd.concat([cars1,cars2])
cars.head()  # print the first five rows of the data

cars.info()  # print the info of the data
```

##### Exploring the features

###### Histograms of the feature distributions

```python
# let's visualize the distribution of the features of the cars
cars.hist(figsize=(12,8),bins=20)
plt.show()
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*Xb22GGvwoJqK8WU02IgP0w.png)

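- Acceleration is roughly normally distributed; most cars have an acceleration value of about 15 (the original article gives the unit as m/s²).
- About half of the cars in the data (51.3%) have 4 cylinders.
- The output/dependent variable (mpg) is slightly right-skewed.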
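
Skewness can also be checked numerically rather than by eye. The sketch below uses `pandas.Series.skew()` on two synthetic samples (assumptions for illustration, not the car data): a symmetric one and a right-skewed one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples (assumption): one symmetric, one with a long right tail
symmetric = pd.Series(rng.normal(15, 2.5, size=5000))
right_skewed = pd.Series(rng.gamma(shape=4.0, scale=6.0, size=5000))

# skew() is near 0 for symmetric data and positive for right-skewed data
print("symmetric skew:   ", symmetric.skew())
print("right-skewed skew:", right_skewed.skew())
```

On the real data, calling `.skew()` on the `mpg` column would give the same kind of summary for the histogram above.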

###### Heatmap of the relationships between features

```python
plt.figure(figsize=(10,6))
sns.heatmap(cars.corr(),cmap=plt.cm.Reds,annot=True)
plt.title('Heatmap displaying the relationship between\nthe features of the data',
         fontsize=13)
plt.show()
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*GPBYpi9b2IYLnp-4Q_N5OQ.png)

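- There are relationships between mpg and the other variables, which satisfies the first assumption of linear regression.
- Displacement, horsepower, weight, and cylinders are strongly negatively correlated with mpg, meaning mpg falls as any of them increases. They are also strongly positively correlated with one another. This violates the no-multicollinearity assumption of linear regression. Multicollinearity hurts the performance and accuracy of a regression model, so to avoid it we have to remove some of these variables through feature selection.
- The remaining variables, `acceleration`, `model`, and `origin`, are not highly correlated. We can also use the variance inflation factor (VIF) to check for multicollinearity. If a variable's VIF is greater than 5, it is involved in multicollinearity. We use `variance_inflation_factor()` from `statsmodels` for this, as in the code below.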

```python
X1 = sm.tools.add_constant(cars)
# calculate the VIF and make the results a series.
series1 = pd.Series([variance_inflation_factor(X1.values,i) for i in range(X1.shape[1])],index=X1.columns)
print('Series before feature selection: \n\n{}\n'.format(series1))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*ZamHC6Vn3wNkddZ-FXfCZg.png)

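The output shows that our data has a multicollinearity problem, because some variables have a VIF greater than 5. As the correlation heatmap above shows, displacement, horsepower, weight, and cylinders are strongly positively correlated with one another, and they are the cause. To address this, we remove some of these features (the code below drops `cylinders`, `displacement`, and `weight`) and recompute the VIF of the remaining variables to check whether multicollinearity is still present.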

```python
# Let's drop the columns that highly correlate with each other
newcars = cars.drop(['cylinders','displacement','weight'],axis=1)
# Let's do the variance inflation factor method again after doing a feature selection
#to see if there's still multi-collinearity.
X2 = sm.tools.add_constant(newcars)
series2 = pd.Series([variance_inflation_factor(X2.values,i) for i in range(X2.shape[1])],index=X2.columns)
print('Series after feature selection: \n\n{}'.format(series2))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*lI0jiV4YcueDp3mdMDLJWQ.png)

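Good: the multicollinearity is gone, since the remaining variables all have a VIF below 5. Now that we have learned enough from the data, it is time to split it into training and test sets, fit a model, and start making predictions.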
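
For intuition about what the VIF measures: the VIF of feature j is 1 / (1 - R²), where R² comes from regressing feature j on all the other features. The sketch below computes it by hand on synthetic data; the helper name `vif` and the data are assumptions for illustration. It uses `LinearRegression`, which fits an intercept, in place of the explicit constant column that the `statsmodels` code adds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X, j):
    """VIF of column j: regress it on the other columns and return 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

# Synthetic features (assumption): b is almost a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1, which matches the "greater than 5" rule of thumb used above.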

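###### Training regression models

We use the `train_test_split()` function from `sklearn.model_selection` to split the dataset into two parts, training data and test data. Because the variables are on different scales, we scale them with `scale()` from `sklearn.preprocessing`. Scaling matters mainly for Ridge and Lasso, which penalize the size of the coefficients; the tree-based models used later do not need it. After scaling the features (the predictor variables), we fit a linear regression model to the data and evaluate its accuracy.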

```python
# create a DataFrame of independent variables
X = newcars.drop('mpg',axis=1)
# create a series of the dependent variable
y = newcars.mpg

# scaling the feature variables.
X_scaled = preprocessing.scale(X)

# preprocessing.scale() returns a 2-d array not a DataFrame so we make our scaled variables
# a DataFrame.
X_scaled = pd.DataFrame(X_scaled,columns=X.columns)

# split our data into training and testing data
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y,test_size=.3,random_state=0)

model = LinearRegression()  # initialize the LinearRegression model
model.fit(X_train,y_train)  # we fit the model with the training data

linear_pred = model.predict(X_test)  # make prediction with the fitted model

# score the model on the train set
print('Train score: {}\n'.format(model.score(X_train,y_train)))
# score the model on the test set
print('Test score: {}\n'.format(model.score(X_test,y_test)))
# calculate the overall accuracy of the model
print('Overall model accuracy: {}\n'.format(r2_score(y_test,linear_pred)))
# compute the mean squared error of the model
print('Mean Squared Error: {}'.format(mean_squared_error(y_test,linear_pred)))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*ea2UBk9plsrQ1nsZiIp8UQ.png)

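From the output above, the linear regression model scores 75.5% (R²) on the training data and 72.7% on the test set. The model shows no obvious overfitting or underfitting, but its accuracy is not satisfying, so we go on to fit a Ridge model to see whether it improves accuracy and lowers the mean squared error.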
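
As an aside on how to read these scores: for scikit-learn regressors, `model.score()` returns R², the same quantity `r2_score()` computes, so the "Test score" and "Overall model accuracy" lines print the same number. The sketch below reproduces R² and MSE by hand on synthetic data (an assumption for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic regression data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot        # R^2 = 1 - SS_res / SS_tot
mse_manual = ss_res / len(y)           # MSE = mean of squared residuals

print(r2_manual, model.score(X, y), r2_score(y, pred))
print(mse_manual, mean_squared_error(y, pred))
```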

```python
ridge = Ridge(alpha=.01)
ridge.fit(X_train,y_train)  # fit the model with the training data

ridge_pred = ridge.predict(X_test)  # make predictions

# score the model to check the accuracy
print('Train score: {}\n'.format(ridge.score(X_train,y_train)))
print('Test score: {}\n'.format(ridge.score(X_test,y_test)))
print('Overall model accuracy: {}\n'.format(r2_score(y_test,ridge_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_test,ridge_pred)))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*Puy3xekO8hN0DWHUMy27Vw.png)

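The Ridge model looks no different from the `LinearRegression` model we fit first. Let's tune the hyperparameter to see whether accuracy changes noticeably and whether the MSE (mean squared error) goes down. We run a grid search with cross-validation, using `GridSearchCV` from `sklearn.model_selection` to search for the best parameters.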

```python
# we now try to tune the parameters of the ridge model for a better accuracy
# we use a grid search to find the best parameters for the ridge model
ridge_model = Ridge()

param = {'alpha':[0,0.1,0.01,0.001,1]}  # define the parameters

# initialize the grid search
ridge_search = GridSearchCV(ridge_model,param,cv=5,n_jobs=-1)

ridge_search.fit(X_train,y_train)   # fit the model

# print out the best parameter for ridge and score it on the test and train data
print('Best parameter found:\n{}'.format(ridge_search.best_params_))
print('Train score: {}\n'.format(ridge_search.score(X_train,y_train)))
print('Test score: {}'.format(ridge_search.score(X_test,y_test)))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*FkAXXYlGBLyE42B52QCgOQ.png)
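
To show roughly what a grid search with cross-validation does, the sketch below scores each candidate `alpha` with 5-fold cross-validation by hand and compares the winner with `GridSearchCV`. The synthetic data and the candidate list are assumptions for illustration, not the article's values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.7]) + rng.normal(size=120)

alphas = [0.01, 0.1, 1, 10, 100]

# Manual version: mean 5-fold cross-validated R^2 for each alpha
manual_scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas}
best_manual = max(manual_scores, key=manual_scores.get)

# GridSearchCV runs the same loop, then refits the best model on all the data
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5).fit(X, y)

print("manual best alpha:      ", best_manual)
print("GridSearchCV best alpha:", search.best_params_["alpha"])
```

Because `GridSearchCV` refits the best estimator on the whole training set by default, calling `score()` or `predict()` on the search object uses that refitted model.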

Ridge regression turned up nothing new, so we move on to fitting a Lasso regression model, going straight to a grid search for the best parameters.

```python
# let's try and fit a Lasso model for the regression
# here, we just move on to making the grid search and find the best parameters
lasso = Lasso()

param['max_iter'] = [1000,10000,100000,1000000]

lasso_search = GridSearchCV(lasso,param,cv=5,n_jobs=-1) # initialize the grid search

lasso_search.fit(X_train,y_train)  # fit the model

# print out the best parameters and score it 
print('Best parameter found:\n{}\n'.format(lasso_search.best_params_))
print('Train score: {}\n'.format(lasso_search.score(X_train,y_train)))
print('Test score: {}'.format(lasso_search.score(X_test,y_test)))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*QGupZdOVUeL2mOxBvNirTQ.png)

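The accuracy and mean squared error we get from LinearRegression, Ridge, and Lasso are all unsatisfying, so we move on to tree-based and ensemble methods for the regression; DecisionTree, RandomForest, and GradientBoosting are among the most commonly used. Rather than first fitting and scoring each model with a single parameter setting, we go straight to a search for the best parameters and then score the model. Let's start with DecisionTreeRegressor and tune its parameters.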

```python
dtree = DecisionTreeRegressor() # initialize a DecisionTreeRegressor model

params = {'max_features':[None,'sqrt','log2'],   # None uses all features ('auto' was removed in scikit-learn 1.3)
         'min_samples_split':[2,3,4,5,6,7,8,9],
         'min_samples_leaf':[1,2,3,4,5,6,7,8,9],
         'max_depth':[2,3,4,5,6,7]}                # define the hyperparameters

tree_search = GridSearchCV(dtree,params,cv=5,n_jobs=-1)  # initialize the grid search

tree_search.fit(X_train,y_train)   # fit the model

tree_pred = tree_search.predict(X_test)  # make predictions with the model

# print out the best parameters found and score the model
print('Best parameter found:\n{}\n'.format(tree_search.best_params_))
print('Train score: {}\n'.format(tree_search.score(X_train,y_train)))
print('Test score: {}\n'.format(tree_search.score(X_test,y_test)))
print('Overall model accuracy: {}\n'.format(r2_score(y_test,tree_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_test,tree_pred)))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*t8LII6piLFVVyjbUBpgKKA.png)

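The output above shows that DecisionTreeRegressor reaches a score (R²) of 79% with a mean squared error of 11.4, better than the LinearRegression, Ridge, and Lasso models. It does appear to **overfit** somewhat, though, since it scores 86.6% on the training data and 79% on the test data. Let's consider the RandomForestRegressor model and see whether we can get a higher score, a lower error, and a model that generalizes.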

```python
# we now fit a RandomForestRegressor model and perform a grid search to find the best 
# parameters
forest = RandomForestRegressor()

# we add the n_estimators parameter in our previous parameter dictionary
params['n_estimators'] = [100,200,300,400,500]

forest_search = RandomizedSearchCV(forest,params,cv=5,n_jobs=-1,     # initialize the search
                                  n_iter=50)

forest_search.fit(X_train,y_train)  # fit the model

forest_pred = forest_search.predict(X_test)  # make prediction with the model

# print out the best parameters and score the model
print('Best parameter found:\n{}\n'.format(forest_search.best_params_))
print('Train score: {}\n'.format(forest_search.score(X_train,y_train)))
print('Test score: {}\n'.format(forest_search.score(X_test,y_test)))
print('Overall model accuracy: {}\n'.format(r2_score(y_test,forest_pred)))
print('Mean Squared Error: {}'.format(mean_squared_error(y_test,forest_pred)))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*l9M_yKbjz-QbJh6k2RU2Lg.png)
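
To connect the random forest with the bagging description in the package notes, here is a minimal sketch of bagging done by hand: train many trees on bootstrap samples and average their predictions. It uses synthetic one-dimensional data (an assumption for illustration) and plain bagging only; `RandomForestRegressor` additionally randomizes the features considered at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

def mse(actual, predicted):
    return float(np.mean((actual - predicted) ** 2))

# Baseline: one fully grown tree
single = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
mse_single = mse(y_test, single.predict(X_test))

# Bagging: each tree sees a bootstrap sample (drawn with replacement)
preds = []
for i in range(50):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    t = DecisionTreeRegressor(random_state=i).fit(X_train[idx], y_train[idx])
    preds.append(t.predict(X_test))

# For regression, the ensemble prediction is the mean over trees
bagged_pred = np.mean(preds, axis=0)
mse_bagged = mse(y_test, bagged_pred)

print("single tree MSE:", mse_single)
print("bagged trees MSE:", mse_bagged)
```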

RandomForestRegressor does a good job of reducing the mean squared error, but it also overfits. Let's consider the GradientBoostingRegressor model.

```python
# train a GradientBoostingRegressor model

gradient_model = GradientBoostingRegressor()  # instantiate the model

# append a learning_rate parameter to the parameter dictionary
params['learning_rate'] = [0.05,0.1,0.2,0.3,0.4,0.5]

gradient_search = RandomizedSearchCV(gradient_model,params,cv=5,n_jobs=-1,
                                  n_iter=50)   # initialize the search

gradient_search.fit(X_train,y_train)   # fit the model

gradient_pred = gradient_search.predict(X_test)  # make predictions with the model

# print out the best parameters and score the model
print('Best parameter found:\n{}\n'.format(gradient_search.best_params_))
print('Train score: {}\n'.format(gradient_search.score(X_train,y_train)))
print('Test score: {}\n'.format(gradient_search.score(X_test,y_test)))
print('Overall model accuracy: {}\n'.format(r2_score(y_test,gradient_pred)))
print('Mean Squared Error: {}\n'.format(mean_squared_error(y_test,gradient_pred)))
```

![img](https://gitee.com/morphicx/image-host/raw/master/uPic/1*oQIdpIKja5ZW5LJd8JWhJw.png)

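This model appears to overfit less, and it has a low mean squared error. Taking the square root of the MSE gives √8.88 ≈ 2.98, which tells us that the predictions are typically about 2.98 mpg away from the actual values. Next we make predictions with the model and plot the actual mpg values against the predicted ones to see how close they are.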
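
The square-root step above can be checked directly. The value 8.88 is the test-set MSE quoted in the text; the variable names are just for this sketch.

```python
import math

mse = 8.88              # test-set mean squared error quoted above
rmse = math.sqrt(mse)   # root mean squared error, in the same unit as mpg

print(round(rmse, 2))   # 2.98
```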

```python
# the model was trained on the scaled features, so we predict on the scaled
# feature matrix (X_scaled from the training step) rather than on the raw values

# make a DataFrame of the actual mpg and the predicted mpg 
data = pd.DataFrame({'Actual mpg':newcars.mpg.values,
                    'Predicted mpg':gradient_search.predict(X_scaled)})

# make a scatter plot of the actual and the predicted mpg of a car
plt.figure(figsize=(12,8))
plt.scatter(data.index,data['Actual mpg'].values,label='Actual mpg')
plt.scatter(data.index,data['Predicted mpg'].values,label='Predicted mpg')
plt.title('Comparing the Actual mpg values to the Predicted mpg values\nModel accuracy = 85%',
         fontsize=16)
plt.xlabel('Car index')
plt.ylabel('Miles Per Gallon (mpg)')
plt.legend(loc='upper left')
plt.show()
```

![img](https://miro.medium.com/max/7200/1*HkRs7wkUUWb5g7T481BaLg.png)

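The scatter plot above shows that the model predicts well: the actual and predicted mpg values are very close to each other.

You could go on to train other models such as AdaBoost and XGBoost, which might give better accuracy and lower error than the final GradientBoosting model.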
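
As one possible next step, here is a minimal sketch of fitting scikit-learn's `AdaBoostRegressor`. It runs on synthetic data, and the hyperparameter values are arbitrary assumptions rather than tuned choices; on the car data you would pass the same `X_train`, `y_train` used for the other models. XGBoost lives in a separate package and is not shown.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = 40 - 20 * X[:, 0] - 10 * X[:, 1] ** 2 + rng.normal(0, 1.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Untuned, illustrative hyperparameters
ada = AdaBoostRegressor(n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
ada_pred = ada.predict(X_test)

print("Test R^2:", r2_score(y_test, ada_pred))
print("Test MSE:", mean_squared_error(y_test, ada_pred))
```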

### Also worth reading

#### Social media automation tools for startups

Source: [Social Media Automation Tools for Startups](https://draft.dev/learn/tools/social-media-automation)

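- **[Zapier](https://zapier.com/)** (free tier available): Zapier does more than social media. It is a complete automation solution suited to non-developers. Use it to link accounts, publish automatically from RSS feeds, or set up complex filters.

- **[Buffer](https://buffer.com/)** (free tier available): Buffer is well suited to scheduling posts across multiple social platforms or accounts. It also gives you analytics on which of your posts perform best.

- **[F5Bot](https://f5bot.com/)** (free tier available): a free service that emails you when your chosen keywords are mentioned on Reddit, Hacker News, or Lobsters. Use it to monitor your brand, your project, or just topics you are interested in.

- **[Hootsuite](https://hootsuite.com/)** (free tier available): much like TweetDeck, Hootsuite helps you manage multiple Twitter accounts, but it goes further. You can also schedule posts to other social networks, find new content automatically, and track how your posts perform.

- **[Social Share Preview](https://socialsharepreview.com/)** (free tier available): preview how your website, landing page, or blog post will look on social media before you publish.

- **[Hubspot](https://www.hubspot.com/products/marketing/social-inbox)** (free tier available): a social media management tool that is part of the Hubspot marketing platform.

- **[Likeable Hub](https://likeablehub.com/)**: Likeable Hub is a full-featured social media management platform.

- **[PowerPost](https://www.powerpost.digital/)**: scheduling and planning blog posts, sharing them across all your social channels, and keeping everything organized with a team is a challenge. PowerPost makes all of this simpler.

- **[TweetDeck](https://tweetdeck.twitter.com/)** ($0): Twitter acquired TweetDeck to help users with multiple accounts stay organized.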

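### Takeaways

- Google Colab can be used for prototyping, and AWS SageMaker for quickly developing and deploying machine learning products.
- Ways to handle an AWS EC2 maintenance reboot: (1) reschedule it to a time with fewer users; (2) reboot the instance manually, so that the reboot happens at a time you control.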