Python Data Science: A Practical Guide from Exploratory Analysis to Machine Learning

Introduction

In the era of big data, data science has become a key driving force across industries. As the mainstream language for data science, Python, with its rich ecosystem and ease of use, makes data analysis and machine learning far more approachable. This article walks through the core technology stack, workflow, and practical lessons of Python data science, helping readers build end-to-end capability from data acquisition to model deployment.

The Python Data Science Ecosystem

Core Libraries and Toolchain

The strength of Python for data science lies in its rich, mature ecosystem. Its core components are:

  1. NumPy: the foundation of scientific computing, providing efficient array operations
  2. Pandas: the core tool for data analysis and manipulation
  3. Matplotlib/Seaborn: data visualization libraries
  4. Scikit-learn: implementations of classical machine learning algorithms
  5. TensorFlow/PyTorch: deep learning frameworks
  6. Jupyter Notebook: an interactive development environment

Combined, these tools make Python an ideal language for data science:

# Import the basic toolchain
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, model_selection, metrics

# Set the visualization style
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

Environment Setup and Best Practices

Environment management is essential for any data science project:

# Create an isolated environment with conda
conda create -n datascience python=3.9
conda activate datascience

# Install the core packages
conda install numpy pandas matplotlib seaborn scikit-learn jupyterlab

# Or with pip
pip install numpy pandas matplotlib seaborn scikit-learn jupyterlab

Using virtual environments to manage dependencies is recommended, as it avoids conflicts between projects. A dedicated environment can be configured for each kind of project:

# Example environment.yml
name: ml-project
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - numpy=1.21
  - pandas=1.3
  - scikit-learn=1.0
  - matplotlib=3.4
  - seaborn=0.11
  - jupyterlab=3.2
  - pip
  - pip:
      - kaggle==1.5.12
      - mlflow==1.21.0
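
With the file in place, the environment can be created and activated in one step (assuming conda is available on the PATH):

# Create and activate the environment defined in environment.yml
conda env create -f environment.yml
conda activate ml-project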

Data Acquisition and Preprocessing

Data Sources and Acquisition Methods

Data acquisition is the starting point of the entire data science workflow:

  1. Public datasets: Kaggle, the UCI repository, open government data
  2. API calls: web APIs, commercial data-provider APIs
  3. Web scraping: tools such as requests, beautifulsoup, and selenium (a minimal sketch follows the examples below)
  4. Database queries: SQL and NoSQL databases

Here are examples of loading data from different sources with Pandas:

# Read data from a CSV file
df = pd.read_csv('data.csv')

# Read data from a database
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df = pd.read_sql('SELECT * FROM table_name', engine)

# Fetch data from an API
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())

# Read from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
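
For the web-scraping route mentioned in the list above, a minimal sketch with requests and BeautifulSoup might look like this; the URL and the assumption that the page contains a single HTML table are hypothetical placeholders:

# Parse the first HTML table on a page into a DataFrame (hypothetical URL)
import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get('https://example.com/table-page', timeout=10).text
soup = BeautifulSoup(html, 'html.parser')
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in soup.find('table').find_all('tr')
]
df_scraped = pd.DataFrame(rows[1:], columns=rows[0])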

Data Cleaning and Preprocessing

Once data is acquired, it usually needs cleaning and preprocessing:

# Basic data exploration
print(df.shape)       # dimensions
print(df.info())      # dtypes and missing values
print(df.describe())  # summary statistics for numeric columns

# Handle missing values
df.isna().sum()  # count missing values per column
df.fillna(method='ffill', inplace=True)  # forward fill
df['numeric_col'].fillna(df['numeric_col'].mean(), inplace=True)  # fill with the mean
df.dropna(subset=['important_col'], inplace=True)  # drop rows missing a key column

# Handle outliers with the IQR rule
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
mask = (df['value'] >= Q1 - 1.5 * IQR) & (df['value'] <= Q3 + 1.5 * IQR)
df_filtered = df[mask]

# Feature transformations
df['log_value'] = np.log1p(df['value'])  # log transform
df['category'] = df['category'].astype('category')  # convert dtype

Feature Engineering

Feature engineering is a key step for improving model performance:

# Standardize numeric features
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
df[['feature1', 'feature2']] = scaler.fit_transform(df[['feature1', 'feature2']])

# Encode categorical features
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Label encoding
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['category'], drop_first=True)

# Create an interaction feature
df['feature_interaction'] = df['feature1'] * df['feature2']

# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['feature1', 'feature2']))
df = pd.concat([df, poly_df], axis=1)

Exploratory Data Analysis

Data Visualization

Visualization is a powerful tool for understanding data and discovering patterns:

# Basic charts
plt.figure(figsize=(12, 6))

# Histogram
plt.subplot(2, 3, 1)
sns.histplot(df['feature1'], kde=True)
plt.title('Feature1 Distribution')

# Box plot
plt.subplot(2, 3, 2)
sns.boxplot(y='feature1', x='category', data=df)
plt.title('Feature1 by Category')

# Scatter plot
plt.subplot(2, 3, 3)
sns.scatterplot(x='feature1', y='feature2', hue='target', data=df)
plt.title('Feature1 vs Feature2')

# Correlation heatmap (numeric columns only)
plt.subplot(2, 3, 4)
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')

# Pairwise relationships
plt.figure(figsize=(12, 10))
sns.pairplot(df.select_dtypes(include=[np.number]).sample(1000), hue='target')
plt.suptitle('Pairwise Relationships', y=1.02)

plt.tight_layout()
plt.show()

Statistical Analysis

Statistical analysis provides a deeper understanding of the data's characteristics:

# Basic summary statistics by group
df.groupby('category')['value'].agg(['mean', 'median', 'std', 'min', 'max'])

# Correlation analysis
correlation = df.select_dtypes(include=[np.number]).corr()
# Show highly correlated feature pairs (absolute correlation above 0.7)
high_corr = correlation.where(np.abs(correlation) > 0.7)
print(high_corr)

# Hypothesis testing
from scipy import stats

# Test whether two groups differ significantly
group1 = df[df['category'] == 'A']['value']
group2 = df[df['category'] == 'B']['value']
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t-test: t = {t_stat:.4f}, p = {p_value:.4f}")

# Test whether the data is normally distributed
k2, p = stats.normaltest(df['value'])
print(f"Normality test: statistic = {k2:.4f}, p = {p:.4f}")

# One-way ANOVA across all categories
groups = [df[df['category'] == cat]['value'] for cat in df['category'].unique()]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.4f}, p = {p_value:.4f}")

Building Machine Learning Models

Supervised Learning Models

Scikit-learn provides a rich set of supervised learning algorithms:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score

# Prepare the data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Classification problem (problem_type is assumed to be set to 'classification' or 'regression' beforehand)
if problem_type == 'classification':
    # Logistic regression
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_train, y_train)
    y_pred_lr = lr.predict(X_test)
    print("Logistic regression accuracy:", accuracy_score(y_test, y_pred_lr))
    print(classification_report(y_test, y_pred_lr))

    # Random forest
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    y_pred_rf = rf.predict(X_test)
    print("Random forest accuracy:", accuracy_score(y_test, y_pred_rf))

    # Support vector machine
    svc = SVC(kernel='rbf', probability=True)
    svc.fit(X_train, y_train)
    y_pred_svc = svc.predict(X_test)
    print("SVM accuracy:", accuracy_score(y_test, y_pred_svc))

# Regression problem
else:
    # Linear regression
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    y_pred_lr = lr.predict(X_test)
    print("Linear regression MSE:", mean_squared_error(y_test, y_pred_lr))
    print("Linear regression R²:", r2_score(y_test, y_pred_lr))

    # Gradient boosting regression
    gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred_gbr = gbr.predict(X_test)
    print("Gradient boosting MSE:", mean_squared_error(y_test, y_pred_gbr))
    print("Gradient boosting R²:", r2_score(y_test, y_pred_gbr))

Unsupervised Learning Techniques

Unsupervised learning is used to discover patterns and structure in data:

from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Principal component analysis
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X)
print('PCA explained variance ratio:', pca.explained_variance_ratio_)

plt.figure(figsize=(10, 8))
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=y if 'y' in locals() else None)
plt.title('PCA Visualization')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

# K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(10, 8))
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=cluster_labels)
plt.title('K-means Clustering')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(label='Cluster')
plt.show()

# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# t-SNE dimensionality reduction for visualization
tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(X)

plt.figure(figsize=(10, 8))
plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=y if 'y' in locals() else None)
plt.title('t-SNE Visualization')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()
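
The clustering example above fixes n_clusters=3; in practice the number of clusters is usually chosen from the data. A minimal sketch that compares candidate values of k by silhouette score, assuming X is the same feature matrix used above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Higher silhouette scores indicate better-separated clusters
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")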

Model Evaluation and Tuning

Model evaluation and hyperparameter tuning are essential for building high-performing models:

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, learning_curve
from sklearn.metrics import accuracy_score, precision_recall_curve, roc_curve, auc, confusion_matrix

# Cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation score: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Hyperparameter optimization with grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Use the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# ROC curve and AUC
y_proba = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.4f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

# Learning curve analysis
train_sizes, train_scores, test_scores = learning_curve(
    best_model, X, y, cv=5, n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label='Cross-validation score')
plt.xlabel('Training examples')
plt.ylabel('Score')
plt.title('Learning Curve')
plt.legend(loc='best')
plt.grid(True)
plt.show()

Advanced Topics and Techniques

Feature Selection Methods

Feature selection can improve model performance and reduce dimensionality:

from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Feature selection based on statistical tests (chi2 requires non-negative features)
selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features)

# Recursive feature elimination
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
rfe.fit(X, y)
rfe_features = X.columns[rfe.support_]
print("Features selected by RFE:", rfe_features)

# Feature importance from a random forest
rf = RandomForestClassifier()
rf.fit(X, y)
importance = rf.feature_importances_
indices = np.argsort(importance)[::-1]

plt.figure(figsize=(12, 6))
plt.title('Feature Importance')
plt.bar(range(X.shape[1]), importance[indices], align='center')
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=90)
plt.tight_layout()
plt.show()

# Select features with SelectFromModel using an importance threshold
selector = SelectFromModel(RandomForestClassifier(), threshold='median')
selector.fit(X, y)
model_features = X.columns[selector.get_support()]
print("Features selected from the model:", model_features)

Ensemble Learning Techniques

Ensemble learning improves performance by combining multiple models:

from sklearn.ensemble import VotingClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Voting classifier
clf1 = LogisticRegression(random_state=42)
clf2 = RandomForestClassifier(random_state=42)
clf3 = SVC(probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)],
    voting='soft'
)

voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
print("Voting classifier accuracy:", accuracy_score(y_test, y_pred))

# Bagging
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    max_features=0.8,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
print("Bagging accuracy:", accuracy_score(y_test, y_pred))

# Stacking
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('svc', SVC(probability=True, random_state=42))
]

stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)

stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)
print("Stacking accuracy:", accuracy_score(y_test, y_pred))

Automated Machine Learning (AutoML)

AutoML tools can automate the model selection and optimization process:

# Using auto-sklearn
!pip install auto-sklearn
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # limit the total search time to 1 hour
    per_run_time_limit=300,        # limit each run to 5 minutes
    memory_limit=10240             # memory limit of 10 GB
)

automl.fit(X_train, y_train)
print(automl.sprint_statistics())

y_pred = automl.predict(X_test)
print("AutoML accuracy:", accuracy_score(y_test, y_pred))

# Inspect the selected models
print(automl.show_models())

# Using TPOT
!pip install tpot
from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=5,
    population_size=50,
    cv=5,
    random_state=42,
    verbosity=2
)

tpot.fit(X_train, y_train)
print("TPOT accuracy:", tpot.score(X_test, y_test))

# Export the best TPOT pipeline
tpot.export('tpot_pipeline.py')

Model Deployment and Productionization

Model Serialization

Saving and loading a trained model:

import joblib
import pickle

# Save the model with joblib
joblib.dump(best_model, 'model.joblib')

# Save the model with pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

# Load the model
loaded_model = joblib.load('model.joblib')
# or
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Predict with the loaded model
predictions = loaded_model.predict(X_test)

Creating a REST API

Create a simple prediction API with Flask:

from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
model = joblib.load('model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)

    # Preprocess the input
    features = np.array([data['features']])

    # Predict
    prediction = model.predict(features)

    # Return the prediction as JSON
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(debug=True)
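
Once the service is running (Flask listens on http://127.0.0.1:5000 by default), any HTTP client can call it. A minimal sketch with requests, using a placeholder feature vector:

import requests

# Send a feature vector to the local prediction endpoint (values are placeholders)
payload = {'features': [0.5, 1.2, 3.4, 0.0]}
response = requests.post('http://127.0.0.1:5000/predict', json=payload)
print(response.json())  # e.g. {'prediction': [1]}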

Model Monitoring and Updates

Set up monitoring and retraining mechanisms for deployed models:

import mlflow
import pandas as pd
from sklearn.metrics import accuracy_score
from datetime import datetime
# Additional imports used by the functions below
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Track model performance with MLflow
mlflow.set_experiment("model_monitoring")

def evaluate_model(model, X, y, model_version):
    with mlflow.start_run():
        # Log the model version and evaluation time
        mlflow.log_param("model_version", model_version)
        mlflow.log_param("evaluation_time", datetime.now().isoformat())

        # Predict and evaluate
        y_pred = model.predict(X)
        accuracy = accuracy_score(y, y_pred)

        # Log the performance metric
        mlflow.log_metric("accuracy", accuracy)

        # Trigger an alert if performance degrades
        if accuracy < 0.8:  # alert threshold
            alert_low_performance(model_version, accuracy)

        return accuracy

def alert_low_performance(model_version, accuracy):
    # Send an alert (e.g. email, Slack message)
    print(f"WARNING: Model version {model_version} performance below threshold: {accuracy}")

# Evaluate the model on a schedule
def regular_evaluation(model_path, data_path, model_version):
    model = joblib.load(model_path)
    data = pd.read_csv(data_path)
    X = data.drop('target', axis=1)
    y = data['target']

    accuracy = evaluate_model(model, X, y, model_version)

    # Trigger retraining if performance drops significantly
    if accuracy < 0.75:  # retraining threshold
        retrain_model(data_path)

def retrain_model(data_path):
    # Load fresh data
    data = pd.read_csv(data_path)
    X = data.drop('target', axis=1)
    y = data['target']

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a new model
    new_model = RandomForestClassifier(n_estimators=100, random_state=42)
    new_model.fit(X_train, y_train)

    # Evaluate the new model
    new_accuracy = accuracy_score(y_test, new_model.predict(X_test))

    # Replace the current model if the new one is better
    if new_accuracy > 0.8:
        joblib.dump(new_model, 'model_new.joblib')
        print(f"Model retrained with accuracy: {new_accuracy}")

Case Study: Customer Churn Prediction

Problem Background

Customer churn prediction is a classic classification problem and is critical for businesses trying to retain customers:

# Load the data
df = pd.read_csv('telco_customer_churn.csv')

# Explore the data
print(df.head())
print(df.info())
print(df['Churn'].value_counts())

# Visualize the data
plt.figure(figsize=(15, 10))

# Churn distribution
plt.subplot(2, 3, 1)
sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')

# Tenure by churn status
plt.subplot(2, 3, 2)
sns.boxplot(x='Churn', y='tenure', data=df)
plt.title('Tenure by Churn Status')

# Monthly charges by churn status
plt.subplot(2, 3, 3)
sns.boxplot(x='Churn', y='MonthlyCharges', data=df)
plt.title('Monthly Charges by Churn Status')

# Churn by contract type
plt.subplot(2, 3, 4)
sns.countplot(x='Contract', hue='Churn', data=df)
plt.title('Churn by Contract Type')

# Churn by internet service type
plt.subplot(2, 3, 5)
sns.countplot(x='InternetService', hue='Churn', data=df)
plt.title('Churn by Internet Service')

plt.tight_layout()
plt.show()

Feature Engineering and Modeling

Prepare the data and build the machine learning models:

# Data preprocessing
# Handle missing values
df.fillna(method='ffill', inplace=True)

# Drop columns that are not needed
df.drop(['customerID'], axis=1, inplace=True)

# Convert categorical features to numeric
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
categorical_cols.remove('Churn')  # exclude the target variable

df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Convert the target variable to binary
df_encoded['Churn'] = df_encoded['Churn'].map({'Yes': 1, 'No': 0})

# Prepare features and target
X = df_encoded.drop('Churn', axis=1)
y = df_encoded['Churn']

# Standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Build the models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42)
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    print(f"{name} Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_pred))

# Compare the models with ROC curves
plt.figure(figsize=(10, 8))

for name, model in models.items():
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    auc_score = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Different Models')
plt.legend()
plt.show()

# Model optimization, using XGBoost as an example
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'n_estimators': [100, 200, 300]
}

grid_search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=10,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

best_xgb = grid_search.best_estimator_
y_pred = best_xgb.predict(X_test)
y_pred_proba = best_xgb.predict_proba(X_test)[:, 1]

print("优化后XGBoost准确率:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Feature importance analysis
feature_importance = best_xgb.feature_importances_
sorted_idx = np.argsort(feature_importance)[::-1]
top_features = X.columns[sorted_idx[:10]]

plt.figure(figsize=(12, 6))
plt.barh(range(10), feature_importance[sorted_idx][:10], align='center')
plt.yticks(range(10), top_features)
plt.xlabel('Feature Importance')
plt.title('Top 10 Important Features for Churn Prediction')
plt.tight_layout()
plt.show()

Model Interpretation

Use SHAP values to explain model predictions:

import shap

# Create the explainer
explainer = shap.TreeExplainer(best_xgb)
shap_values = explainer.shap_values(X_test)

# Summary plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_test, feature_names=X.columns)

# Dependence plot
plt.figure(figsize=(12, 8))
shap.dependence_plot("MonthlyCharges", shap_values, X_test, feature_names=X.columns)

# Explain an individual prediction
i = 5  # pick one test sample
plt.figure(figsize=(12, 6))
shap.force_plot(explainer.expected_value, shap_values[i], X_test[i], feature_names=X.columns, matplotlib=True)

Summary and Best Practices

Python data science is a broad and deep field, and mastering its core tools and methods is essential for data analysts and machine learning engineers. This article walked through the complete workflow from data acquisition and preprocessing to model building, evaluation, and deployment. Key best practices include:

  1. Build a structured workflow: a complete pipeline from problem definition to model deployment
  2. Prioritize data quality: data cleaning and feature engineering are the foundation of success
  3. Use exploratory analysis: understand the data through visualization and statistics
  4. Try multiple models: no single algorithm fits every problem
  5. Evaluate rigorously: use cross-validation and multiple metrics
  6. Keep models interpretable: understanding model decisions is critical for business applications
  7. Use version control: track changes to data, code, and models
  8. Automate workflows: use pipelines and automation tools to improve efficiency (a minimal sketch follows this list)
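
To illustrate points 1 and 8, here is a minimal sketch of a scikit-learn Pipeline that chains preprocessing and a model so the whole workflow can be cross-validated, fit, and serialized as one object; the column names are hypothetical and X, y are assumed to be a raw feature DataFrame and target from your own dataset:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists; replace with the columns of your own data
numeric_cols = ['tenure', 'MonthlyCharges']
categorical_cols = ['Contract', 'InternetService']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])

# The pipeline is cross-validated, fit, and saved as a single object
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")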

As the field keeps evolving, staying current with the latest tools and methods is essential for data science practitioners. With practice and project experience, you will be able to build more powerful and reliable data science solutions.

References

  1. Python Data Science Handbook - Jake VanderPlas
  2. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow - Aurélien Géron
  3. Feature Engineering for Machine Learning - Alice Zheng & Amanda Casari
  4. Pandas documentation: https://pandas.pydata.org/docs/
  5. Scikit-learn documentation: https://scikit-learn.org/stable/documentation.html
  6. Kaggle tutorials and competitions: https://www.kaggle.com/