This blog will show a new experimental method for data augmentation geared towards bio-science for deep learning. This is important for several reasons. 1: Collecting data is time-consuming especially in collecting large enough observations for training deep learning models. 2: It can be difficult to collect or sample enough observations due to the lack of access or chances to make collections. 3: Collecting observations can only be done at certain times or during certain periods, or the period of time for sampling has passed so the collection of further/more observations are impossible. 4: There are few species available to collect samples from. These are just 4 simple reasons why data augmentation is needed for biological studies.
Methods for Data Augmentation
The simplest method for data augmentation is to match the generated data both statistically and logically to the observed data. This means that the data that is generated should have a similar look and feel of the real-world data. The two data sets should have similar distributions, mean, modes, etc. to ensure that the data truly simulates the observed sequences. The simulated data should also be logically like the data that is observed. This means that the simulated data should not have outliers model into it as this will confuse any model. The augmented data should flow alongside the observations and almost mirror each observation. But, just copying the real observations is not an appropriate method for data augmentation. The observations should change slightly. For example, common methods for data augmentations in CNN are image rotation, flipping, cropping, changing color, etc. to create “new” unseen images for a CNN to be trained on. This is also true for numerical data, but not as easy as just flipping the numbers from 10 to 01 as they are not the same.
There are very few methods that exist for data augmentation for numerical data. There are even fewer geared specifically towards biodata or biostudies. This blog will show a new method for generating near-infinite observations based simply on the minimum and maximum observations in a data set.
The data set that I am using is a publicly available data set of Body Measurements (BDIMS)(Heinz, Peterson, Johnson, & Kerk, 2003). This data set is the girth and skeletal measurement of 247 men and 260 women.
Now let's get into the coding aspect of it:
CODE
First, let's get all the import statements out of the way.
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import pymc3 as pm
import theano
from statsmodels.formula.api import glm as glm_sm
import statsmodels.api as sm
from pandas.plotting import scatter_matrix
from random import randint
Next, we need to do some quick examination of the data we downloaded.
# Read the data in from the csv file
data = pd.read_csv("bdims.csv")
print(data.columns)
Much nicer. Now we only want to look at one subject as this is biological data. So we will filter out females from males and just look at males. This process will work on both sexes as the steps will be the same, but doing both at the same time will yield poor results as there are biological differences between males and females in general.
# Split between male and female
male_mask = filter_data['sex'] > 0
male = filter_data[male_mask]
female = filter_data[~male_mask]
# After sperating the two exes lets drop the sex collumn as we dont need it
male = male.drop(['sex'], axis=1)
male.describe()
hgt wgt che.gi hip.gi kne.gi thi.gi ank.gi wri.gi wai.gi
count 247.000000247.000000247.000000247.000000247.000000247.000000247.000000247.000000247.000000
mean 177.74534478.144534100.98987997.76315837.19554756.49797623.15910917.19028384.533198
std 7.18362910.5128907.2090186.2280432.2729994.2466671.7290880.9079978.782241min157.20000053.90000079.30000081.50000031.10000046.80000016.40000014.60000067.10000025% 172.90000070.95000095.95000093.25000035.75000053.70000022.00000016.50000077.90000050% 177.80000077.300000101.00000097.40000037.00000056.00000023.00000017.10000083.40000075% 182.65000085.500000106.050000101.55000038.45000059.15000024.30000017.85000090.000000max198.100000116.400000118.700000118.70000045.70000070.00000029.30000019.600000113.200000
Now with the first step of preprocessing, we can get into the process of creating the dataset from only two points! These two points will be the minimum and maximum based on height. Height is chosen because this variable is the dominating variable in biology and bio-mass. Weight is normally heavily dependant on height (pun intended). The dependent variable will be weight. (X = height Y = weight).
So let's find the smallest and largest person in the dataset.
# Find the smallest item based on height # Create a new dataframe of the smallest and larget
min_max_male = pd.DataFrame(male[male.hgt == male.hgt.max()])
min_max_male = min_max_male.append(male[male.hgt == male.hgt.min()])
# Sort by height
sort_min_mix_male = min_max_male.sort_values('hgt')
print(sort_min_mix_male)
hgt wgt che.gi hip.gi kne.gi thi.gi ank.gi wri.gi wai.gi
105157.258.491.691.335.555.020.816.480.6126198.185.596.994.939.254.427.517.982.5
ax1 = min_max_male.plot.scatter(x='hgt',y='wgt',c='DarkBlue')
So the first and simplest method to interpolation is linear regersion. This will give us a few extra points of missing data.
# Now use linear regression to fill in some of the missing pointsimport numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([min_max_male.hgt.min(),min_max_male.hgt.max()]).reshape((-1, 1))
y = np.array([min_max_male.wgt.min(), min_max_male.wgt.max()])
# Define a linear regerssion model
model = LinearRegression()
model.fit(x, y)
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('slope:', model.coef_)
coefficient of determination: 1.0
intercept: -45.75941320293397
slope: [0.66259169]
Ok looks fine so far. The blue dots are the original data (min and max) and the red dots are the newly generated data. This makes sense as weight should increase as height increases. But, not really. There are variations in weight because of other factors. Also, 41 new points don't make a deep learning set.
Lets create a few more points:
# Now lets fine tune the hieght veriable by a float instead of a int# We can resue the linerar regression model to generate more data# Go from 41 observations to 409000 observatsions # All equally possible to occure in the real world
current_hgt = min_max_male.hgt.min()
count = 0
large_hgt = []
while current_hgt <= min_max_male.hgt.max():
# increase the height by 0.1 cm
current_hgt +=0.0001
large_hgt.append(current_hgt)
count +=1print(len(large_hgt))
409000# Now using the newlly generated fine scale height lets get the weight
large_pred = []
for h in large_hgt:
new_x = np.array(h).reshape((-1, 1))
pred = model.predict(new_x)
large_pred.append(pred[0])
print(len(large_pred))
409000# Now lest plot everything again
plt.plot(large_hgt, large_pred, 'go')
plt.plot(gen_height, prediction, 'ro')
plt.plot(old_min_hgt, old_min_wgt, 'bo')
plt.plot(old_max_hgt, old_max_wgt, 'bo')
plt.show()
As you can see perfectly overlaps and each observation makes sense and is logical.
The blue dots are the original, the red is the first step, and the green is fine-tuned steps.
This jumps from 2 observations (min and max) to 41 observations (fully synthetic) to 409000 observations.
But in the real world, biology does not always follow a linear line
Let's introduce some variability into the data generation!
# Define a new line using all the data from the real data set# Define a linear regerssion model
X = np.array(male.hgt).reshape(-1, 1)
Y = np.array(male.wgt).reshape(-1, 1)
model2 = LinearRegression()
model2.fit(X,Y)
r_sq2 = model2.score(X,Y)
print('coefficient of determination:', r_sq2)
print('intercept:', model2.intercept_)
print('slope:', model2.coef_)
coefficient of determination: 0.28594874074704446
intercept: [-60.95336414]
slope: [[0.78256845]]
# Linear regresion using real data
y_pred = model2.predict(X)
# Now plot all the data
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.plot(male.hgt, male.wgt, 'yo')
plt.plot(large_hgt, large_pred, 'go')
plt.plot(gen_height, prediction, 'ro')
plt.plot(old_min_hgt, old_min_wgt, 'bo')
plt.plot(old_max_hgt, old_max_wgt, 'bo')
plt.show()
As you can see the regression line is some what close to the line of data that is generated. It is not perfect and there will be a lot of variability between the two datasets. But seeing that this is only based on two observations, (the min and max) the lines are pretty close. The intercept and slope are close enough to use the ones found from the two points only. So let us continue and make a fully synthetic deep learning dataset from two observations.
# The slope of the line is b, and a is the intercept found from Sklenar linear model# Simple Linear regressoin model Y = a + bX that will be the model for out MCMC
alpha = -45.75941320293397# Intercept
beta = [0.66259169] # Slope
X = np.array(large_hgt)
Y = np.array(large_pred)
print(len(X))
print(len(Y))
409000409000# Weight Histogram
hist = male.hist(column='wgt')
#Normal distribution. mu is the mean, and sigma is the standard deviation.# Seeing that the weight is normally distributed (basically) we can use that knowledge to generate new data via a normally# Distrubuted method#for random.normalvariate(mu, sigma)
std = np.std(X, axis=0)
real_std = np.std(male.wgt, axis=0)
print(std)
print(real_std)
11.80681300528450410.491587167890629
temp_min_max = []
temp_min_max.append(male.wgt.max())
temp_min_max.append(male.wgt.min())
mean = np.mean(temp_min_max)
real_mean = np.mean(male.wgt)
print(mean)
print(real_mean)
85.1578.14453441295547
Looking at the mean and standard deviation they are close enough for this example. Lets make a Million data points for our new dataset! That should be enough for any deep learning dataset.
Well thats no good. Now to be fair, given a infinate number of samples, it is highly likely that at least for each point there would have been someone that mathces the height and weight on this chart, but that is like using a shotgun to fish. It is not as accurate and not really following the regression line of the real data which means that the dataset is not useful and cannot be used in a deep learning model as it wont learn anything.
So how can we fix this?
Let's perform some rejections by using a concept of banding. So if the observation falls outside the bands it won't get plotted. The bands themselves set up an upper and lower limit so that all predictions will have to fall within these limits. To form these limit expert knowledge of the observed phenomenon is needed especially for only two observations, luckily for us, we have more than two observations so we can define out limits based on the full real dataset.
# Use upper and lower limits to reject samplesdefmake_sample(lower, upper, mean, std):
sample = np.random.normal(mean,std)
if lower < sample < upper:
return sample
else:
make_sample(lower, upper, mean, std)
# Define bands for each interval# The more bands the finer the level of rejection# Each item in the array is defined as# [band lower, band upper, lower limit, upper limit]
band1 = [0, 155, 50, 70]
band2 = [156,160, 55, 70]
band3 = [161, 165, 56, 75]
band4 = [166, 170, 57, 80]
band5 = [171, 175, 60, 88]
band6 = [176, 180, 60, 94]
band7 = [181, 185, 60, 100]
band8 = [186, 190, 63, 105]
band9 = [191, 195, 64, 110]
band10 = [196, 299, 65, 110]
# Put all the bands into a single array for easy use
bands = []
bands.append(band1)
bands.append(band2)
bands.append(band3)
bands.append(band4)
bands.append(band5)
bands.append(band6)
bands.append(band7)
bands.append(band8)
bands.append(band9)
bands.append(band10)
new_X = []
new_Y = []
for i inrange(0, 1000000):
index = randint(0, len(X) -1)
for band in bands:
if band[0] <= X[index] <= band[1]:
new_X.append(X[index])
new_Y.append(make_sample(band[2], band[3], mean, std))
plt.plot(new_X, new_Y, 'go',marker='^')
plt.plot(male.hgt, male.wgt, 'yo')
plt.plot(large_hgt, large_pred, 'go')
plt.plot(gen_height, prediction, 'ro')
plt.plot(old_min_hgt, old_min_wgt, 'bo')
plt.plot(old_max_hgt, old_max_wgt, 'bo')
plt.show()
Which gives us this!There are still a million points, but some my be repeated. But, the general flow is far more similar to the real data which is perfect now for training a deep learning model.
Conclusion
From this blog, we saw how to use only two observations, the minimum and maximum, and how to create a fully synthetic dataset that can be used for deep learning.
The main idea when building a fully synthetic dataset is to ensure it is statistically and logically similar to that of the observed/real dataset. This gives the benefit of creating a large training dataset and then using the real data as a testing set. This can give very good results when creating a deep learning model as you won't have to train the model on the very limited (and precious) real data that can be very difficult to capture or collect.
This approach can be improved significantly, especially in the banding section. By adding a larger number of bands, smoothing out the lower and upper limits, and even using more complex algorithms like a random walk can improve the final results. But, this method still needs to be vetted before use in different models and/or real-world applications. The next step would be to model more independent variables, other phenomenons, and improve the generation steps.
References:
Heinz G, Peterson LJ, Johnson RW, Kerk CJ. 2003. Exploring Relationships in Body Dimensions. Journal of Statistics Education 11(2).
Most people are used to the Sequential model from Keras as it is a straightforward method for creating simple models. The functional API is Keras way of creating far more complex models. This can allow for the creation of models with multiple inputs and outputs, different types of inputs, merging inputs, having two loss functions, and more.
Code Comparison:
So, let’s look at the most basic model possible. Using the MNIST dataset that is already included in Keras is an easy model and dataset that is available for everyone and should need no introduction. So I will skip the setup, loading, and training-test splits of the data and go into the model. The below code is a basic setup for a Sequential model to learn how to recognize handwritten numbers. This code sample comes from the Keras team GitHub [1].
Easy right? Now we can build a similar model using the Functional API from Keras. Looking at them compared side by side, they are very similar. But now you don’t need Sequential to be defined.
First, we will need to import a few more modules:
from keras.layers import Input, Dense
from keras.models import Model
These modules are needed for the Functional API.
Then we need the first part defines the input shape much like this from the original Sequential model.
# Sequntial way
model.add(Conv2D(32, kernel_size=(3, 3),activation='relu',input_shape=input_shape))
# Which is the same as# Functional API
inputs = Input(shape=(input_shape))
# Define the Conv2d Layer
x = Conv2D(32, kernel_size=(3, 3),activation='relu')(inputs)
The next lines are the same as they start building out the architecture. So, they have the same setup. The next difference is the output this is where you define the output and the model.
predictions = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=inputs, outputs=predictions)
The last layer (prediction) is pretty much the same as the last Fully connected layer in the basic model.
So you should end up with something that looks like this:
# Define input shape as the input Reuse the original inputshape
inputs = Input(shape=(input_shape))
# Define the Conv2d Layer
x = Conv2D(32, kernel_size=(3, 3),activation='relu')(inputs)
x = Conv2D(64, kernel_size=(3, 3),activation='relu')(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Dropout(0.25)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(num_classes, activation='softmax')(x)
# This creates a model that includes# the Input layer and three Dense layers
functional_model = Model(inputs=inputs, outputs=predictions)
functional_model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
functional_model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
validation_data=(x_test, y_test))
Results:
As you can see from the scoring the two methods produced pretty much the same results. The added advantage though with the Functional API model is that it is more extendable and far more customizable. When performing a more complex task, the use of the Functional API may be mandatory as a single Sequential model cannot handle the complexity of it.
Now, what is the point you may say? The biggest benefit is not the model defined above can then be used as another layer in another model like so:
x = Input(shape=(input_shape))
pred = functional_model(x)
That will produce the classification results of any input that is sent in. This can be used to aid a classification into a video feed, or a more complex model needed multiple types of inputs.
Trining of the models behave the same as well and yield similar results too.
Sequential
Test loss: 0.027761173594164575
Test accuracy: 0.992
Functional
Test loss: 0.028270527327229955
Test accuracy: 0.9921
A Better Example! Image Similarity:
This model will use a ResNet50 pre-trained model to create the vectors used for image comparison. For each image, the features will be calculated and then merged into on input for the Fully Connected layers. But, honestly, any CNN will work you can even define your own CNN and use it to extract features. The final layer will produce a probability that the two images are similar or not based on a threshold. This model will not do very complex comparisons as it is too simple. But for images of scenery, it should get satisfactory results.
The basic model for image similarity can be done like this:
Now can you see the usefulness of Functional API in Keras? This is just the tip of the iceberg on what can be accomplished with this API. There are many more possibilities to be had.
This API is not limited to images but can be used to define any complex model with multiple inputs and outputs. Using for natural language processing or even complex analysis of the stock market where there are numerical and nonnumerical data used in the same model.