Deep Neural Network from scratch, scikit-learn, Keras & TensorFlow, part 2

ANN

TLU (threshold logic unit), also called LTU (linear threshold unit): it computes a weighted sum of its inputs and applies a step function to the result. Common activation functions: 1. sigmoid: y = 1/(1 + e^(-x)) 2. ReLU: y = max(0, x) 3. step function: 3.1 heaviside(z) = 1 if z >= 0, else 0

Perceptron: since the decision boundary of each output neuron is linear, Perceptrons are incapable of learning complex patterns. Perceptron convergence theorem: if the training instances are linearly separable, the algorithm converges to a solution.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]            # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

per_clf = Perceptron()
per_clf.fit(X,y)

y_pred = per_clf.predict([[2, 0.5]])
print(y_pred)
[0]
Note that the Perceptron makes predictions based on a hard threshold, while a logistic regression classifier outputs a class probability.
XOR cannot be solved by a Perceptron or any other linear classification model, including logistic regression. However, a simple transformation of the feature space, such as adding the nonlinear feature x1 * x2, allows logistic regression to learn a decision boundary for it.
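A minimal sketch of this trick (my own example, not from the book; the large C just turns regularization mostly off):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                       # XOR labels

X_aug = np.c_[X, X[:, 0] * X[:, 1]]              # add the nonlinear feature x1*x2
clf = LogisticRegression(C=1e6).fit(X_aug, y)
print(clf.predict(X_aug))                        # expected: [0 1 1 0]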

Stacking multiple Perceptrons, which gives an MLP (multilayer perceptron), can also solve this problem.

See the book (page 288) for how an MLP can solve the XOR problem; one possible construction is sketched below.
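This construction is my own illustration (it may differ from the book's figure): a 2-2-1 network of heaviside units in which one hidden neuron computes OR, the other computes AND, and the output neuron combines them as OR AND NOT AND, i.e. XOR.

import numpy as np

def heaviside(z):
    return (z >= 0).astype(int)

def xor_mlp(x1, x2):
    h_or = heaviside(x1 + x2 - 0.5)       # hidden neuron 1: OR
    h_and = heaviside(x1 + x2 - 1.5)      # hidden neuron 2: AND
    return heaviside(h_or - h_and - 0.5)  # output: OR AND (NOT AND) = XOR

x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
print(xor_mlp(x1, x2))                    # [0 1 1 0]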
  File "<ipython-input-25-53e01c846747>", line 1
    note that perceptron make predictions based on a hard threshold, hwile logistic regression classifier outpu a class probability.
         ^
SyntaxError: invalid syntax
MLP:
input layer
hidden layer
output layer

1. the layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers
2. every layer except the output layer includes a bias neuron and is fully connected to the next layer
Batch gradient descent: batch size = size of the training set
Stochastic gradient descent: batch size = 1
Mini-batch gradient descent: 1 < batch size < size of the training set

It is important to initialize all the hidden layers’ connection weights randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and backpropagation will affect them in exactly the same way, so they will remain identical. In other words, despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won’t be smart. If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons.
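For reference, Keras already does this by default: Dense layers use random "glorot_uniform" kernel initialization and zero biases, and both can be set explicitly. A small standalone sketch (the layer size is arbitrary):

from tensorflow import keras

# random kernel initialization breaks the symmetry; zero biases are fine because the kernels differ
layer = keras.layers.Dense(300, activation="relu",
                           kernel_initializer="glorot_uniform",  # Keras default for Dense layers
                           bias_initializer="zeros")             # also the default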

In moving from the Perceptron to the MLP, the step function is replaced by differentiable activation functions such as the sigmoid, so that gradient descent has a useful gradient to work with:

* step function
* sigmoid function
* hyperbolic tangent function:
tanh(z) = sinh(z)/cosh(z) = (e^z - e^(-z)) / (e^z + e^(-z)), output range (-1, 1)   #https://www.mathworks.com/help/matlab/ref/tanh.html
that range tends to make each layer's output more or less centered around 0 at the beginning of training, which often helps speed up convergence
* ReLU(z) = max(0,z)
its slope changes abruptly at 0 and its derivative is 0 for z < 0; in practice it has become the default activation function since it is fast to compute and works well

* softplus activation function: softplus(z) = log(1 + exp(z)), a smooth variant of ReLU; it is close to 0 when z is negative and close to z when z is positive.
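For reference, a quick NumPy sketch of these activation functions (plain definitions, not tied to any model):

import numpy as np

def step(z):     return np.where(z >= 0, 1, 0)   # heaviside step
def sigmoid(z):  return 1 / (1 + np.exp(-z))
def relu(z):     return np.maximum(0, z)
def softplus(z): return np.log(1 + np.exp(z))    # smooth variant of ReLU
# np.tanh(z) == sinh(z)/cosh(z), output range (-1, 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))       # [0. 0. 2.]
print(softplus(z))   # approximately [0.127 0.693 2.127]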

## The key idea behind why we need activation functions:
if you chain several linear transformations, the result is still a linear transformation, so without nonlinearity between the layers, even a deep stack of layers is equivalent to a single layer.
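A tiny numerical check of this claim, with two random weight matrices standing in for two linear layers (no activation in between):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))        # input vector
W1 = rng.normal(size=(3, 4))       # "layer 1" weights
W2 = rng.normal(size=(2, 3))       # "layer 2" weights

two_layers = W2 @ (W1 @ x)         # stack of two linear layers
one_layer = (W2 @ W1) @ x          # equivalent single linear layer
print(np.allclose(two_layers, one_layer))   # True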
Applications:
1. Regression MLPs
1.1 Output neurons
The number of output neurons depends on how many values need to be predicted:
* a single value -> a single output neuron
* multiple values -> one output neuron per output dimension
1.2 The output neurons normally use no activation function, but you can choose one depending on the range of the output values
Loss function / evaluation metric:
1. During training, the loss is typically the mean squared error (MSE).
2. If you have a lot of outliers in the training set, you may prefer the mean absolute error (MAE) instead.

MAE is the average absolute difference between the predicted and actual values across the whole test set.

Note: the related statement on page 293 of the book does not seem correct.

Taking the square root of the average squared errors has some interesting implications for RMSE. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is more useful when large errors are particularly undesirable: for the same MAE, the RMSE increases as the variance of the error magnitudes increases.

3. RMSE (root mean squared error)

4. Huber loss: quadratic for small errors and linear for large ones, a compromise between MSE and MAE (see the sketch after this list)

* Outlier

An outlier is an object that deviates significantly from the rest of the objects.
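Keras provides the Huber loss directly. A minimal illustrative sketch (the layer sizes and the 8-feature input shape are placeholders):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=[8]),
    keras.layers.Dense(1)
])
# Huber loss: quadratic for small errors, linear for large ones -> robust to outliers
model.compile(loss=keras.losses.Huber(delta=1.0), optimizer="sgd")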
Summary (regression MLP):
input neurons: one per input feature
hidden layers: depends on the problem, typically 1 to 5
neurons per hidden layer: depends on the problem, typically 10 to 100
output neurons: 1 per prediction dimension
hidden activation: ReLU
output activation: none, or depends on the output range
loss: MSE, RMSE, MAE
Classification MLPs

Binary classification:
output neurons: 1
output layer activation: sigmoid
loss function: binary cross entropy (log loss)

Multilabel binary classification:
output neurons: 1 per label, each with a sigmoid activation
loss function: binary cross entropy

Multilabel binary classification example, email classification:
1.1 ham or spam
1.2 urgent or non-urgent

Multiclass classification:
output neurons: 1 per class, with a softmax output layer
loss function: (sparse) categorical cross entropy
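A hedged Keras sketch of all three output-layer/loss combinations (hidden layer sizes and the 20-feature input shape are placeholders):

from tensorflow import keras

# binary classification: 1 sigmoid output neuron + binary cross entropy
binary_clf = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=[20]),
    keras.layers.Dense(1, activation="sigmoid")])
binary_clf.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])

# multilabel binary classification: 1 sigmoid output neuron per label + binary cross entropy
multilabel_clf = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=[20]),
    keras.layers.Dense(2, activation="sigmoid")])    # e.g. [is_spam, is_urgent]
multilabel_clf.compile(loss="binary_crossentropy", optimizer="sgd")

# multiclass classification: 1 output neuron per class, softmax + (sparse) categorical cross entropy
multiclass_clf = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=[20]),
    keras.layers.Dense(10, activation="softmax")])
multiclass_clf.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])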

hyperparameters

Hyperparameters are the variables that determine the network structure and the variables that determine how the network is trained.

TensorFlow 1.x builds a static computation graph; PyTorch uses a dynamic (define-by-run) graph that lets you define and modify the graph on the fly. TensorFlow 2 runs eagerly (dynamically) by default and can trace static graphs with tf.function.
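A tiny sketch contrasting the two modes in TensorFlow 2 (used below):

import tensorflow as tf

def cube_eager(x):      # runs eagerly, op by op (dynamic)
    return x ** 3

@tf.function            # traced into a static computation graph on the first call
def cube_graph(x):
    return x ** 3

x = tf.constant(2.0)
print(cube_eager(x).numpy(), cube_graph(x).numpy())   # 8.0 8.0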

import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(keras.__version__)
2.3.1
2.4.0
### load dataset
fashion_mnist = keras.datasets.fashion_mnist
(X_train,y_train),(X_test,y_test)=fashion_mnist.load_data()

print(X_train.shape)
print(X_train.dtype)
(60000, 28, 28)
uint8
X_valid,X_train = X_train[:5000] / 255.0, X_train[5000:] / 255.0
y_valid, y_train = y_train[:5000], y_train[5000:]
X_test = X_test / 255.0
label_name = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "bag", "ankle boot"]
label_name[y_train[0]]
print(y_train[:5])
[4 0 7 9 9]

Creating the model using the Sequential API

model = keras.models.Sequential()      ## Sequential API
model.add(keras.layers.Flatten(input_shape=[28,28]))   ## flattens each 28x28 image into a 1D array of 784 values
model.add(keras.layers.Dense(300, activation="relu"))  ## dense hidden layer with 300 neurons
model.add(keras.layers.Dense(100, activation="relu"))  ## dense hidden layer with 100 neurons
model.add(keras.layers.Dense(10, activation="softmax")) ## output layer: one neuron per class, softmax activation

pass a list of layers when creating the sequential model:

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])  ##sequential API
model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_5 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 300)               235500    
_________________________________________________________________
dense_13 (Dense)             (None, 100)               30100     
_________________________________________________________________
dense_14 (Dense)             (None, 10)                1010      
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
model.layers
hidden0 = model.layers[0]
hidden1 = model.layers[1]
hidden2 = model.layers[2]
hidden3 = model.layers[3]
print(hidden0.name)
print(hidden1.name)
print(hidden2.name)
print(hidden3.name)
model.get_layer('dense_15') is hidden1
flatten_6
dense_15
dense_16
dense_17





True
weights, biases = hidden1.get_weights()
print(weights)
print(weights.shape)
print(biases)
biases.shape
[[ 4.7349229e-02 -9.7253174e-04 -1.1341944e-03 ...  2.5101207e-02
   6.8740964e-02 -5.6659300e-02]
 [ 4.9689531e-02 -1.6197551e-02 -2.3989312e-02 ...  5.8839262e-02
   3.3672228e-02  5.0742328e-02]
 [-4.2911947e-02 -5.1682625e-02 -3.1449642e-02 ...  9.9925473e-03
  -6.3275285e-02  2.2455417e-02]
 ...
 [ 3.5043560e-02 -6.6033557e-02 -5.3551573e-02 ... -7.4006386e-02
   1.8318184e-02 -3.4555219e-02]
 [-8.0467388e-03  6.4311802e-02  3.9261460e-02 ...  2.6988834e-03
  -2.5993869e-02  9.4175339e-06]
 [ 3.8276210e-02  2.2218361e-02  7.2731733e-02 ...  3.2831334e-02
   6.6786796e-02 -4.8152048e-02]]
(784, 300)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]





(300,)

compiling model

model.compile(loss="sparse_categorical_crossentropy",
                optimizer="sgd",
                metrics=["accuracy"])
1. sparse_categorical_crossentropy: for sparse labels, i.e. each instance's target is just a class index (as in this example), with mutually exclusive classes
2. categorical_crossentropy: for one-hot targets, i.e. one target probability per class for each instance
3. binary_crossentropy: for binary or multilabel binary classification (sigmoid outputs)
To convert labels between the sparse and one-hot formats (book p. 303), see the sketch below.
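A small sketch of converting between the two label formats (the example labels are arbitrary):

import numpy as np
from tensorflow import keras

sparse_labels = np.array([4, 0, 7])                                   # class indices
one_hot = keras.utils.to_categorical(sparse_labels, num_classes=10)   # targets for categorical_crossentropy
back_to_sparse = np.argmax(one_hot, axis=1)                           # back to class indices
print(one_hot.shape, back_to_sparse)                                  # (3, 10) [4 0 7]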
history = model.fit(X_train, y_train, epochs=30, validation_data=(X_valid,y_valid))
Epoch 1/30
1719/1719 [==============================] - 8s 5ms/step - loss: 0.7304 - accuracy: 0.7618 - val_loss: 0.5188 - val_accuracy: 0.8288
Epoch 2/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4915 - accuracy: 0.8268 - val_loss: 0.4556 - val_accuracy: 0.8474
Epoch 3/30
1719/1719 [==============================] - 8s 5ms/step - loss: 0.4458 - accuracy: 0.8432 - val_loss: 0.4399 - val_accuracy: 0.8440
Epoch 4/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4179 - accuracy: 0.8527 - val_loss: 0.3924 - val_accuracy: 0.8672
Epoch 5/30
1719/1719 [==============================] - 8s 5ms/step - loss: 0.3963 - accuracy: 0.8612 - val_loss: 0.3854 - val_accuracy: 0.8718
Epoch 6/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3804 - accuracy: 0.8655 - val_loss: 0.3750 - val_accuracy: 0.8710
Epoch 7/30
1719/1719 [==============================] - 8s 5ms/step - loss: 0.3674 - accuracy: 0.8709 - val_loss: 0.3663 - val_accuracy: 0.8756
Epoch 8/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3555 - accuracy: 0.8732 - val_loss: 0.3635 - val_accuracy: 0.8760
Epoch 9/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3451 - accuracy: 0.8773 - val_loss: 0.3727 - val_accuracy: 0.8690
Epoch 10/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3354 - accuracy: 0.8811 - val_loss: 0.3447 - val_accuracy: 0.8802
Epoch 11/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3270 - accuracy: 0.8838 - val_loss: 0.3585 - val_accuracy: 0.8752
Epoch 12/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3200 - accuracy: 0.8860 - val_loss: 0.3314 - val_accuracy: 0.8840
Epoch 13/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3120 - accuracy: 0.8869 - val_loss: 0.3273 - val_accuracy: 0.8816
Epoch 14/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3054 - accuracy: 0.8901 - val_loss: 0.3370 - val_accuracy: 0.8806
Epoch 15/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2973 - accuracy: 0.8937 - val_loss: 0.3225 - val_accuracy: 0.8860
Epoch 16/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2921 - accuracy: 0.8951 - val_loss: 0.3430 - val_accuracy: 0.8786
Epoch 17/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2864 - accuracy: 0.8968 - val_loss: 0.3203 - val_accuracy: 0.8842
Epoch 18/30
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2808 - accuracy: 0.8998 - val_loss: 0.3209 - val_accuracy: 0.8866
Epoch 19/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2757 - accuracy: 0.9012 - val_loss: 0.3271 - val_accuracy: 0.8836
Epoch 20/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2704 - accuracy: 0.9027 - val_loss: 0.3176 - val_accuracy: 0.8890
Epoch 21/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2653 - accuracy: 0.9044 - val_loss: 0.3163 - val_accuracy: 0.8894
Epoch 22/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2615 - accuracy: 0.9051 - val_loss: 0.3046 - val_accuracy: 0.8928
Epoch 23/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2566 - accuracy: 0.9077 - val_loss: 0.3172 - val_accuracy: 0.8860
Epoch 24/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2513 - accuracy: 0.9099 - val_loss: 0.2993 - val_accuracy: 0.8938
Epoch 25/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2475 - accuracy: 0.9114 - val_loss: 0.3074 - val_accuracy: 0.8902
Epoch 26/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2431 - accuracy: 0.9124 - val_loss: 0.3210 - val_accuracy: 0.8872
Epoch 27/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2397 - accuracy: 0.9137 - val_loss: 0.3079 - val_accuracy: 0.8870
Epoch 28/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2363 - accuracy: 0.9145 - val_loss: 0.3007 - val_accuracy: 0.8942
Epoch 29/30
1719/1719 [==============================] - 8s 5ms/step - loss: 0.2320 - accuracy: 0.9163 - val_loss: 0.2948 - val_accuracy: 0.8952
Epoch 30/30
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2288 - accuracy: 0.9180 - val_loss: 0.3020 - val_accuracy: 0.8924
import pandas as pd
import matplotlib.pyplot as plt

pd.DataFrame(history.history).plot(figsize=(8,5))
plt.grid(True)
plt.gca().set_ylim(0,1)  ##set vertical range to [0~1]
plt.show()

(Figure: training/validation loss and accuracy curves over the 30 epochs, plotted from history.history.)

model.evaluate(X_test, y_test)
313/313 [==============================] - 1s 3ms/step - loss: 0.3293 - accuracy: 0.8859





[0.3292936682701111, 0.8859000205993652]

Note that it is common to get slightly lower performance on the test set than on the validation set because the hyperparameters are tuned on the validation set, not the test set

X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.99],
       [0.  , 0.  , 0.98, 0.  , 0.02, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ]],
      dtype=float32)
y_pred = model.predict_classes(X_new)
y_pred
WARNING:tensorflow:From <ipython-input-46-81ace37e545f>:1: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your model does binary classification   (e.g. if it uses a `sigmoid` last-layer activation).





array([9, 2, 1])
np.argmax(model.predict(X_new),axis=-1)
array([9, 2, 1])
y_test[:3]
array([9, 2, 1], dtype=uint8)

Regression MLP using the sequential API

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train,y_train)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

model = keras.models.Sequential([
    keras.layers.Dense(30,activation="relu",input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)
])
model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_pred = model.predict(X_new)
Epoch 1/20
363/363 [==============================] - 1s 4ms/step - loss: 0.8456 - val_loss: 2.6605
Epoch 2/20
363/363 [==============================] - 1s 3ms/step - loss: 0.5434 - val_loss: 0.7184
Epoch 3/20
363/363 [==============================] - 1s 2ms/step - loss: 0.4632 - val_loss: 0.5022
Epoch 4/20
363/363 [==============================] - 1s 2ms/step - loss: 0.4310 - val_loss: 0.4699
Epoch 5/20
363/363 [==============================] - 1s 2ms/step - loss: 0.4211 - val_loss: 0.4140
Epoch 6/20
363/363 [==============================] - 1s 2ms/step - loss: 0.4010 - val_loss: 0.4003
Epoch 7/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3931 - val_loss: 0.4174
Epoch 8/20
363/363 [==============================] - 1s 2ms/step - loss: 0.4904 - val_loss: 0.4105
Epoch 9/20
363/363 [==============================] - 1s 2ms/step - loss: 0.4206 - val_loss: 0.4082
Epoch 10/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3931 - val_loss: 0.3944
Epoch 11/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3807 - val_loss: 0.3821
Epoch 12/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3758 - val_loss: 0.3792
Epoch 13/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3714 - val_loss: 0.3750
Epoch 14/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3683 - val_loss: 0.3732
Epoch 15/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3664 - val_loss: 0.3744
Epoch 16/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3631 - val_loss: 0.3683
Epoch 17/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3601 - val_loss: 0.3659
Epoch 18/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3591 - val_loss: 0.3669
Epoch 19/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3570 - val_loss: 0.3623
Epoch 20/20
363/363 [==============================] - 1s 2ms/step - loss: 0.3535 - val_loss: 0.3764
162/162 [==============================] - 0s 1ms/step - loss: 0.3955

Functional API. Example: a wide & deep neural network

It connects all or part of the inputs directly to the output layer. This makes it possible for the neural network to learn both deep patterns (through the deep path) and simple rules (through the short, wide path).

input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30,activation="relu")(input_)
hidden2 = keras.layers.Dense(30,activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_,hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_],outputs=[output])

Modification: send a subset of the features through the wide path and a different (possibly overlapping) subset through the deep path:

input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30,activation="relu")(input_B)
hidden2 = keras.layers.Dense(30,activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_A,hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_A, input_B],outputs=[output])
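To train this two-input model, the features have to be split accordingly and passed as a tuple. The column split below is my assumption (the first 5 of the 8 housing features go to the wide path, the last 6 to the deep path, so the subsets overlap), not necessarily the book's exact choice:

X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]   # hypothetical wide/deep split
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]

model.compile(loss="mse", optimizer="sgd")
history = model.fit((X_train_A, X_train_B), y_train, epochs=20,
                    validation_data=((X_valid_A, X_valid_B), y_valid))

The same wide & deep architecture can also be built by subclassing keras.Model: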
class WideAndDeepModel(keras.Model):
    def __init__(self, units=30,activation="relu",**kwargs):
        super().__init__(**kwargs) #handles standard args
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.main_output = keras.layers.Dense(1)
        self.aux_output = keras.layers.Dense(1)
    
    def call(self, inputs):
        input_A, input_B = inputs   # unpack the wide and deep inputs
        hidden1 = self.hidden1(input_B)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([input_A, hidden2])
        main_output = self.main_output(concat)
        aux_output = self.aux_output(hidden2)
        return main_output, aux_output
    
model = WideAndDeepModel()

Cons: when you call the summary() method, you only get a list of layers, without any information on how they are connected to each other.

save and load

model.save("my_keras_model.h5")
model = keras.models.load_model("my_keras_model.h5")

using callbacks

the fit method accepts a callbacks argument that lets you specify a list of objects that Keras will call at the start and end of training, at the start and end of each epoch, and even before and after processing each batch. 

checkpoint and early stopping

checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5",save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100,
                validation_data=(X_valid,y_valid),
                callbacks=[checkpoint_cb,early_stopping_cb])
This raises an error, because the subclassed WideAndDeepModel above was never compiled:

RuntimeError: You must compile your model before training/testing. Use `model.compile(optimizer, loss)`.
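A hedged sketch of the fix: compile the subclassed model first, and, since WideAndDeepModel takes two inputs and returns two outputs, pass the inputs and targets accordingly (this reuses the hypothetical X_train_A/X_train_B split from the earlier sketch; the loss weights are placeholders):

model = WideAndDeepModel()
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")

# save only the weights (TensorFlow checkpoint format), since a subclassed model
# cannot be saved as a single .h5 file
checkpoint_cb = keras.callbacks.ModelCheckpoint("wide_and_deep_ckpt",
                                                save_best_only=True, save_weights_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
history = model.fit((X_train_A, X_train_B), (y_train, y_train), epochs=100,
                    validation_data=((X_valid_A, X_valid_B), (y_valid, y_valid)),
                    callbacks=[checkpoint_cb, early_stopping_cb])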

Fine-Tuning Neural Network Hyperparameters

Option:
try many combinations of hyperparameters and see which one works best on the validation set.
1. Use GridSearchCV or RandomizedSearchCV to explore the hyperparameter space (a sketch follows this list)
2. Prefer randomized search over grid search when the search space is large
3. Use a dedicated hyperparameter optimization library or service (see the lists below)
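A hedged sketch of option 1, wrapping a Keras model so that scikit-learn's RandomizedSearchCV can tune it. The parameter ranges are placeholders, the data is the scaled California housing split prepared earlier in this post (8 features), and keras.wrappers.scikit_learn is available in TF 2.3 but removed in newer releases:

import numpy as np
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
from tensorflow import keras

def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1))
    model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=learning_rate))
    return model

keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)
param_distribs = {
    "n_hidden": [0, 1, 2, 3],
    "n_neurons": np.arange(1, 100),
    "learning_rate": reciprocal(3e-4, 3e-2),
}
rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
rnd_search_cv.fit(X_train, y_train, epochs=10,
                  validation_data=(X_valid, y_valid),
                  callbacks=[keras.callbacks.EarlyStopping(patience=5)])
print(rnd_search_cv.best_params_)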

Hyperparameter optimization Python libraries:
1. Hyperopt
2. Hyperas
3. Keras Tuner
4. Scikit-Optimize
5. Spearmint
6. Hyperband
7. Sklearn-Deap

Hyperparameter optimization services:
1. Google Cloud AI Platform
2. Arimo
3. SigOpt
4. CallDesk's Oscar

Overfitting: Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.

K-fold cross validation

In k-fold cross validation, the data is divided into k subsets. The holdout method is then repeated k times, such that each time one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form the training set. A small sketch follows.
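An illustrative sketch of k-fold splitting with scikit-learn (a plain scikit-learn classifier on the iris data keeps it simple; it is not tied to the Keras models above):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)   # k = 5 subsets
# each fold is used once for validation while the other 4 folds form the training set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores.mean())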

Transfer learning:
to generalize to new datasets:
if you have already trained a model to recognize faces in pictures and you now want to train a new neural network to recognize hairstyles, you can kick-start training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the values of the weights and biases of the lower layers of the first network. This way the network will not have to learn the low-level structures; it will only have to learn the higher-level structures. A hedged sketch follows.
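This sketch assumes a hypothetical already-trained model_A; the new output layer and the choice to freeze the reused layers are illustrative:

from tensorflow import keras

# reuse every layer of model_A except its output layer
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))   # new task-specific output layer

# optionally freeze the reused lower layers for the first few epochs of training
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])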

Number of hidden layers:
for most problems, start with just one or two hidden layers. For more complex problems, ramp up the number of hidden layers until you start overfitting the training set.

Number of neurons per hidden layer:
a common practice used to be to size them to form a pyramid, with fewer and fewer neurons at each layer, but this is not always efficient.

1. Try increasing the number of neurons gradually until the network starts overfitting.
2. Or pick more layers and neurons than you actually need, then use early stopping to prevent overfitting.
Other hyperparameters to tune:
* learning rate
* optimizer
* batch size
* activation function
* number of iterations (epochs)

Note that the optimal learning rate depends on the other hyperparameters, especially the batch size, so if you modify any hyperparameter, make sure to retune the learning rate as well. A small sketch of where these knobs live in Keras follows.
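In this sketch, the model, layer sizes, and values are placeholders, and X_img/y_img/X_img_valid/y_img_valid stand for image data like the Fashion-MNIST arrays loaded earlier:

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
optimizer = keras.optimizers.SGD(learning_rate=1e-3)   # the learning rate is set on the optimizer
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(X_img, y_img,
                    epochs=30,          # number of passes over the training set
                    batch_size=32,      # Keras' default batch size
                    validation_data=(X_img_valid, y_img_valid))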