Kaggle DL Courses

Posted on 2022-02-20 In Learning, MOOC

Kaggle Courses

sklearn

Structured data preprocessing: numerical data + categorical data
Note: to avoid group leakage, the training and validation sets are split by artist here.

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GroupShuffleSplit

spotify = pd.read_csv('../input/dl-course-data/spotify.csv')

X = spotify.copy().dropna()
y = X.pop('track_popularity')
artists = X['track_artist']

features_num = ['danceability', 'energy', 'key', 'loudness', 'mode',
                'speechiness', 'acousticness', 'instrumentalness',
                'liveness', 'valence', 'tempo', 'duration_ms']
features_cat = ['playlist_genre']

# ColumnTransformer expects (name, transformer, columns) triples
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), features_num),
    ('cat', OneHotEncoder(), features_cat),
])

# We'll do a "grouped" split to keep all of an artist's songs in one
# split or the other. This is to help prevent signal leakage.
def group_split(X, y, group, train_size=0.75):
    splitter = GroupShuffleSplit(train_size=train_size)
    train, valid = next(splitter.split(X, y, groups=group))
    return (X.iloc[train], X.iloc[valid], y.iloc[train], y.iloc[valid])

X_train, X_valid, y_train, y_valid = group_split(X, y, artists)

X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
y_train = y_train / 100 # popularity is on a scale 0-100, so this rescales to 0-1.
y_valid = y_valid / 100
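
A quick way to convince yourself that a grouped split avoids leakage is to check that no group ends up on both sides. The sketch below is a hand-rolled, pure-Python stand-in for GroupShuffleSplit, not scikit-learn's implementation; `grouped_split` and the toy artist list are hypothetical names used only for illustration.

```python
import random

def grouped_split(groups, train_size=0.75, seed=0):
    """Return (train_idx, valid_idx) so that every group lands on one side only."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    # train_size is the fraction of *groups* (not rows) assigned to training
    n_train = max(1, int(round(train_size * len(unique))))
    train_groups = set(unique[:n_train])
    train_idx = [i for i, g in enumerate(groups) if g in train_groups]
    valid_idx = [i for i, g in enumerate(groups) if g not in train_groups]
    return train_idx, valid_idx

# Toy data: one artist label per song (illustrative only).
toy_artists = ['A', 'A', 'B', 'C', 'C', 'C', 'D', 'B']
train_idx, valid_idx = grouped_split(toy_artists)

train_artists = {toy_artists[i] for i in train_idx}
valid_artists = {toy_artists[i] for i in valid_idx}
print(train_artists & valid_artists)  # empty set: no artist is in both splits
```

With a plain row-wise shuffle, songs by the same artist could land in both splits, and the validation score would look better than it should.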

DL

layers, activations, losses, optimizers, callbacks, metrics
keras.Sequential → model.compile → model.fit → model.predict/model.evaluate

Model building: layers and activations

from tensorflow import keras
from tensorflow.keras import layers

# tf.keras.layers: Dense, Conv2D, RNN; Embedding
# tf.keras.activations: relu, elu, selu, swish
input_shape = (2,)
model = keras.Sequential([
    layers.Dense(units=4, activation='relu', input_shape=input_shape),
    layers.Dense(units=3, activation='relu'),
    layers.Dense(units=1),  # the linear layer (without activation)
])
# An activation function can also be split out as its own layer:
# layers.Dense(units=8), layers.Activation('relu')
# is equivalent to layers.Dense(units=8, activation='relu')

model.weights  # [w1, b1, w2, b2, w3, b3]
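
To see where the six weight tensors come from, the parameter count of the Dense stack above can be worked out by hand. This is a minimal sketch in plain Python; `dense_param_count` is a hypothetical helper, not part of Keras.

```python
def dense_param_count(input_dim, units_per_layer):
    """Total trainable parameters of a stack of Dense layers."""
    total = 0
    fan_in = input_dim
    for units in units_per_layer:
        total += fan_in * units + units  # weight matrix + bias vector
        fan_in = units
    return total

# input_shape=(2,), then Dense(4), Dense(3), Dense(1)
print(dense_param_count(2, [4, 3, 1]))  # 12 + 15 + 4 = 31
```

The same number should appear as the total in model.summary().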

Model training: losses and optimizers

# tf.keras.losses: MeanAbsoluteError
# tf.keras.optimizers: SGD, Adam (adaptive learning rate)
# epoch, batch size, learning rate

# configure the model (why 'compile' instead of config?)
model.compile(
    optimizer="adam", # no parameters need tuning here, so a plain string is enough
    loss="mae",
)

# train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)

# plot the train history
import pandas as pd
history_df = pd.DataFrame(history.history)
history_df['loss'].plot()
# history.history is a dict that stores loss and val_loss

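Early stopping: underfitting and overfitting
The input data contains both signal and noise. Learning the patterns of either one lowers the training loss, but patterns learned from noise do not help with new data (they do not generalize). Failing to capture enough signal shows up as underfitting, while learning too much noise corresponds to overfitting; comparing the training and validation losses helps diagnose which is happening.
The EarlyStopping class lives in Keras's callbacks module (tf.keras.callbacks). A "callback" here is a function invoked during training: it is passed through the callbacks argument of fit and called as training proceeds, to control or analyze the training process. Keras ships with a set of built-in callbacks, and simple custom ones can be defined with the LambdaCallback class.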

early_stopping = keras.callbacks.EarlyStopping(
    min_delta=0.001, # minimum improvement threshold
    patience=20,     # epochs to wait
    restore_best_weights=True,
) # stop once the validation loss has gone 20 consecutive epochs without improving on its best value by at least 0.001

model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping], # put callbacks in a list
    verbose=0,  # turn off training log
)
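
The stopping rule set by min_delta and patience can be sketched in plain Python. This is a simplified illustration of the idea, not Keras's actual implementation; `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=20):
    """Return (stop_epoch, best_epoch); stop_epoch is None if training runs to the end."""
    best = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:      # improvement large enough to count
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1                    # one more epoch without real improvement
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

losses = [0.50, 0.40, 0.3995, 0.3999, 0.41, 0.30]
stop_epoch, best_epoch = early_stopping_epoch(losses, min_delta=0.001, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

Training stops at epoch 3 and never reaches the 0.30 at epoch 5, which is why patience should not be too small. With restore_best_weights=True the model would be rolled back to the weights from best_epoch.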

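Special layers
Besides ordinary layers that contain neurons, such as fully connected (Dense) and convolutional (Conv2D) layers, there are layers that contain no neurons and serve a special purpose, such as Dropout and BatchNormalization. Dropout can loosely be understood as ensemble learning over many small networks, which makes the overall result more robust; batch normalization helps stabilize training and speeds it up.
A Dropout layer goes right before the layer it should apply to. BatchNormalization can be applied to features after activation, before activation (with the activation function split out into its own layer), or even as the first layer, directly on the input data. Note that these special layers take no units argument (they contain no neurons).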

layers.Dropout(rate=0.3)
layers.BatchNormalization()
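
What a Dropout layer does to its input during training can be sketched in a few lines. This is a plain-Python illustration of "inverted" dropout (zero some values, rescale the rest), assuming a list of activations; the `dropout` helper is hypothetical, not the Keras layer itself.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)  # every entry is either 0.0 or 1 / 0.7
```

The rescaling keeps the expected sum of the activations unchanged, so nothing needs adjusting at prediction time, when dropout is switched off.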

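Binary classification
Accuracy changes in discrete jumps, while a loss function needs to vary continuously → cross-entropy (a measure of the difference between distributions). Although accuracy cannot serve as the loss, we may still want it as a more intuitive gauge of model performance; the metrics argument specifies such evaluation criteria.

Unlike the loss, a metric is not used to optimize the model, only to evaluate it;
A metric can be any custom function, passed in through the metrics argument of compile;
Unlike callbacks, metrics are passed to compile rather than to fit;
The loss and every metric are reported at the end of each epoch and stored in the history.history dict; each metric is keyed by its function name, and when a validation set is given there is a matching 'val_xxx' entry;
EarlyStopping (a callback) can monitor a chosen quantity: the default is the validation loss ('val_loss'), and it can be set to the training loss ('loss') or to any other metric on the training or validation set.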

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)

history_df = pd.DataFrame(history.history)
history_df.loc[5:, ['loss', 'val_loss']].plot()
history_df.loc[5:, ['binary_accuracy', 'val_binary_accuracy']].plot()
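
Why cross-entropy works as a loss while accuracy does not can be seen on two predictions. A minimal sketch in plain Python, with made-up probabilities; the two helpers mimic the idea of the Keras metric and loss but are not the library functions.

```python
import math

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

def binary_crossentropy(y_true, y_prob):
    """Mean negative log-likelihood of the true labels."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

y_true = [1, 0]
hesitant = [0.6, 0.4]    # correct, but barely
confident = [0.9, 0.1]   # correct and sure

print(binary_accuracy(y_true, hesitant), binary_accuracy(y_true, confident))  # 1.0 1.0
print(binary_crossentropy(y_true, hesitant) > binary_crossentropy(y_true, confident))  # True
```

Accuracy cannot tell the two models apart, so it gives the optimizer nothing to follow; cross-entropy keeps decreasing as the predictions get more confident.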

Create Your First Submission
Cassava Leaf Disease
Classify images with TPUs in Petals to the Metal
Create art with GANs in I’m Something of a Painter Myself
Classify Tweets in Real or Not? NLP with Disaster Tweets
Detect contradiction and entailment in Contradictory, My Dear Watson

CV

Image preprocessing

import tensorflow as tf
from tensorflow.keras.utils import image_dataset_from_directory

ds_train_ = image_dataset_from_directory(
    '../input/car-or-truck/train',
    labels='inferred',
    label_mode='binary',
    image_size=[128, 128],
    interpolation='nearest',
    batch_size=64,
    shuffle=True,
)

# Data Pipeline
def convert_to_float(image, label):
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    return image, label

AUTOTUNE = tf.data.experimental.AUTOTUNE
ds_train = (
    ds_train_
    .map(convert_to_float)
    .cache()
    .prefetch(buffer_size=AUTOTUNE)
)

Transfer learning

pretrained_base = tf.keras.applications.VGG19(
                include_top=False, input_shape=(128, 128, 3))
pretrained_base.trainable = False

model = keras.Sequential([
    # Base
    pretrained_base,
    # Head
    layers.Flatten(),
    layers.Dense(6, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

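Data augmentation
Data augmentation is usually performed during training (online): before being fed to the network, each image goes through a random transformation (rotation, flipping, warping, color/contrast adjustment, etc.), so the model sees "new" data every time. Note, however, that not every transformation is useful or sensible for a given problem, and the best way to find suitable ones is to experiment.
In Keras, augmentation can be built into the data input pipeline, e.g. with the ImageDataGenerator class, or built into the network itself with preprocessing layers; the advantage of the latter is that it automatically runs on the GPU.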

# Reproducibility
import os
import numpy as np
import tensorflow as tf
def set_seed(seed=31415):
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
set_seed()

augmentation = keras.Sequential([
    layers.RandomFlip(mode='horizontal'),
    layers.RandomContrast(factor=0.5),   # up to 1 ± 0.5
    layers.RandomRotation(factor=0.2),   # up to ± 0.2 * 2pi
    # layers.RandomBrightness(factor=0.2)
    # layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
])

model = keras.Sequential([
    augmentation,
    pretrained_base,
    layers.Flatten(),
    layers.Dense(6, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

Network layers

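Convolutional layer: filters the image for a specific feature (the kernel) layers.Conv2D
layers.Conv2D has two basic parameters, filters and kernel_size: the number of kernels (the counterpart of units in a Dense layer) and their size. The kernel size is usually an odd number such as 3 or 5, so that the kernel has a well-defined center.
Two other key parameters are strides and padding, which set how far the kernel moves at each step and how the borders of the input are handled. Output feature map size = (input size + total padding on both sides - kernel size)//stride + 1. The stride is usually 1. Padding can be none ('valid'), in which case the feature map shrinks layer by layer, or zeros around the border ('same'), which at stride 1 keeps the feature map the same size as the input.
ReLU activation: detects the feature in the filtered image and outputs the feature map
Everything unimportant is equally unimportant.
Pooling layer: condenses the feature map to enhance the feature (max pooling) layers.MaxPool2D, layers.GlobalAvgPool2D
After the ReLU activation all negative responses become 0, and pooling removes this "dead" information. These zeros are unrelated to the feature of interest but do carry positional information, so pooling also introduces local translation invariance. (Usually strides is greater than 1 and no larger than the window itself.)
Besides the common max pooling, global average pooling (GlobalAvgPool) is often used near the output in place of Flatten or Dense. It represents each feature map by its mean ((batch_size, rows, cols, channels) → (batch_size, channels)), which can be read as a single value indicating whether a feature is present. That is usually enough for classification, and it greatly reduces the number of parameters, which helps avoid overfitting.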
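
The size of a convolution's output along one spatial dimension can be checked with a small helper. A minimal sketch in plain Python, assuming the usual Keras conventions for 'valid' (no padding) and 'same' (output size = ceil(input / stride)); `conv_output_size` is a hypothetical name.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding='valid'):
    """Output length of a convolution along one dimension."""
    if padding == 'same':
        return -(-input_size // stride)  # ceil(input_size / stride)
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(128, 3, padding='valid'))            # 126
print(conv_output_size(128, 3, padding='same'))             # 128
print(conv_output_size(128, 3, stride=2, padding='valid'))  # 63
```

With 'valid' padding every 3x3 layer trims two pixels per dimension, which is why deep stacks usually use 'same'.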

import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns

image_path = '../input/computer-vision-resources/car_illus.jpg'
#image_path = '../input/computer-vision-resources/k.jpg'
image = tf.io.read_file(image_path)
image = tf.io.decode_jpeg(image, channels=1)
image = tf.image.resize(image, size=[400, 500])
# If the image resolution is too high, the effect of the convolution is actually less visible!

kernel1 = tf.constant([[-1, -2, -1], [0,  0, 0], [1, 2, 1]])
kernel2 = tf.constant([[-2, -1,  0], [-1, 1, 1], [0, 1, 2]])
kernel3 = tf.constant([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
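# The sum of the kernel values determines the brightness of the result;
# usually it should be neither too large nor too small (close to 0 or 1)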

kernels = [kernel1, kernel2, kernel3]

def show_img(image, title):
    plt.imshow(tf.squeeze(image), cmap='gray')
    plt.axis('off')
    plt.title(title)

def show_kernel(kernel):
    sns.heatmap(kernel, annot=True, cmap='Blues_r', cbar=False)
    plt.title('Kernel')

def image_process(image, kernel):    
    plt.figure(figsize=(15, 3))
    plt.subplot(151)
    show_kernel(kernel)
    
    plt.subplot(152)
    show_img(image, 'Input')
    
    image = tf.expand_dims(image, axis=0)
    image = tf.cast(image, dtype=tf.float32)/255.0
    kernel = tf.reshape(kernel, [*kernel.shape, 1, 1])
    kernel = tf.cast(kernel, dtype=tf.float32)
    
    plt.subplot(153)
    image_filter = tf.nn.conv2d( input=image,
        filters=kernel, strides=1, padding='SAME')
    show_img(tf.squeeze(image_filter), 'Filtered')

    plt.subplot(154)
    image_detect = tf.nn.relu(image_filter)
    show_img(tf.squeeze(image_detect), 'Activated(ReLU)')
    
    plt.subplot(155)
    image_condense = tf.nn.pool( input=image_detect,
            window_shape=(2, 2), pooling_type='MAX',
            strides=(2, 2), padding='SAME' )
    show_img(tf.squeeze(image_condense), 'Condensed(MaxPool)')
    

for kernel in kernels:
    image_process(image, kernel)
    plt.tight_layout()
plt.show()
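
The three steps in image_process (filter, detect, condense) are easier to follow on numbers small enough to check by hand. The sketch below redoes them in plain Python on a made-up 6x4 image using the first kernel above. It uses 'valid' padding rather than 'SAME' to keep the code short, and like tf.nn.conv2d it slides the kernel without flipping it; all helper names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(cols)] for r in range(rows)]

def relu(fmap):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

# A bright horizontal band on a dark background (toy data).
image = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 1, 1, 1],
         [1, 1, 1, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
kernel = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to dark-to-bright edges going down

filtered = conv2d_valid(image, kernel)
detected = relu(filtered)
condensed = max_pool_2x2(detected)
print(filtered)   # [[4, 4], [4, 4], [-4, -4], [-4, -4]]
print(detected)   # [[4, 4], [4, 4], [0, 0], [0, 0]]
print(condensed)  # [[4], [0]]
```

The top edge of the band gives a positive response, the bottom edge a negative one that ReLU discards, and pooling keeps one value per 2x2 block.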

Three (3, 3) kernels have 27 parameters, while one (7, 7) kernel has 49, though they both create the same receptive field. This stacking-layers trick is one of the ways convnets are able to create large receptive fields without increasing the number of parameters too much.
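
The arithmetic behind this comparison can be reproduced in a few lines. A minimal sketch in plain Python, assuming stride-1 convolutions, a single channel and no biases; both helper names are hypothetical.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field (one dimension) of stacked stride-1 convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each layer widens the view by k - 1 pixels
    return field

def kernel_weights(kernel_sizes):
    """Number of weights in single-channel square kernels, biases ignored."""
    return sum(k * k for k in kernel_sizes)

print(stacked_receptive_field([3, 3, 3]), kernel_weights([3, 3, 3]))  # 7 27
print(stacked_receptive_field([7]), kernel_weights([7]))              # 7 49
```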