# Synthetic Features and Outliers

Sören Dobberschütz · August 8, 2018

In this second part, we create a synthetic feature and remove some outliers from the data set.

This notebook is based on the file Synthetic Features and Outliers, which is part of Google’s Machine Learning Crash Course.

``````# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# Unless required by applicable law or agreed to in writing, software
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
``````

# Synthetic Features and Outliers

Learning Objectives:

• Create a synthetic feature that is the ratio of two other features
• Use this new feature as an input to a linear regression model
• Improve the effectiveness of the model by identifying and clipping (removing) outliers out of the input data

Let’s revisit our model from the previous First Steps with TensorFlow exercise.

First, we’ll import the California housing data into `DataFrame`:

## Setup

``````using Plots
gr(fmt=:png)
using DataFrames
using TensorFlow
import CSV
using Random
using Statistics

sess=Session()

california_housing_dataframe[:median_house_value] /= 1000.0
california_housing_dataframe
``````

17,000 rows × 9 columns

longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_value
Float64⍰Float64⍰Float64⍰Float64⍰Float64⍰Float64⍰Float64⍰Float64⍰Float64
1-114.3134.1915.05612.01283.01015.0472.01.493666.9
2-114.4734.419.07650.01901.01129.0463.01.8280.1
3-114.5633.6917.0720.0174.0333.0117.01.650985.7
4-114.5733.6414.01501.0337.0515.0226.03.191773.4
5-114.5733.5720.01454.0326.0624.0262.01.92565.5
6-114.5833.6329.01387.0236.0671.0239.03.343874.0
7-114.5833.6125.02907.0680.01841.0633.02.676882.4
8-114.5934.8341.0812.0168.0375.0158.01.708348.5
9-114.5933.6134.04789.01175.03134.01056.02.178258.4
10-114.634.8346.01497.0309.0787.0271.02.190848.1
11-114.633.6216.03741.0801.02434.0824.02.679786.5
12-114.633.621.01988.0483.01182.0437.01.62562.0
13-114.6134.8448.01291.0248.0580.0211.02.157148.6
14-114.6134.8331.02478.0464.01346.0479.03.21270.4
15-114.6332.7615.01448.0378.0949.0300.00.858545.0
16-114.6534.8917.02556.0587.01005.0401.01.699169.1
17-114.6533.628.01678.0322.0666.0256.02.965394.9
18-114.6532.7921.044.033.064.027.00.857125.0
19-114.6632.7417.01388.0386.0775.0320.01.204944.0
20-114.6733.9217.097.024.029.015.01.265627.5
21-114.6833.4920.01491.0360.01135.0303.01.639544.4
22-114.7333.4324.0796.0243.0227.0139.00.896459.2
23-114.9434.5520.0350.095.0119.058.01.62550.0
24-114.9833.8215.0644.0129.0137.052.03.209771.3
25-115.2233.5418.01706.0397.03424.0283.01.62553.5
26-115.3232.8234.0591.0139.0327.089.03.6528100.0
27-115.3732.8230.01602.0322.01130.0335.03.573571.1
28-115.3732.8214.01276.0270.0867.0261.01.937580.9
29-115.3732.8132.0741.0191.0623.0169.01.760468.6
30-115.3732.8123.01458.0294.0866.0275.02.359474.3

Next, we’ll set up our input functions, and define the function for model training:

``````function create_batches(features, targets, steps, batch_size=5, num_epochs=0)

if(num_epochs==0)
num_epochs=ceil(batch_size*steps/length(features))
end

features_batches=Union{Float64, Missings.Missing}[]
target_batches=Union{Float64, Missings.Missing}[]

for i=1:num_epochs

select=shuffle(1:length(features))

append!(features_batches, features[select])
append!(target_batches, targets[select])
end

return features_batches, target_batches
end
``````
``````function next_batch(features_batches, targets_batches, batch_size, iter)

select=mod((iter-1)*batch_size+1, length(features_batches)):mod(iter*batch_size, length(features_batches));

ds=features_batches[select];
target=targets_batches[select];

return ds, target
end

``````
``````function my_input_fn(features_batches, targets_batches, iter, batch_size=5, shuffle_flag=1)
"""Trains a linear regression model of one feature.

Args:
features: DataFrame of features
targets: DataFrame of targets
batch_size: Size of batches to be passed to the model
shuffle: True or False. Whether to shuffle the data.
num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
Returns:
Tuple of (features, labels) for next data batch
"""

# Construct a dataset, and configure batching/repeating.
ds, target = next_batch(features_batches, targets_batches, batch_size, iter)

# Shuffle the data, if specified.
if shuffle_flag==1
select=Random.shuffle(1:size(ds, 1));
ds = ds[select,:]
target = target[select, :]
end

# Return the next batch of data.
return convert(Matrix{Float64},ds), convert(Matrix{Float64},target)
end
``````
``````function train_model(learning_rate, steps, batch_size, input_feature=:total_rooms)
"""Trains a linear regression model of one feature.

Args:
learning_rate: A `float`, the learning rate.
steps: A non-zero `int`, the total number of training steps. A training step
consists of a forward and backward pass using a single batch.
batch_size: A non-zero `int`, the batch size.
input_feature: A `symbol` specifying a column from `california_housing_dataframe`
to use as input feature.
"""

periods = 10
steps_per_period = steps / periods

my_feature = input_feature
my_feature_data = convert.(Float32,california_housing_dataframe[my_feature])
my_label = :median_house_value
targets = convert.(Float32,california_housing_dataframe[my_label])

# Create feature columns.
feature_columns = placeholder(Float32)
target_columns = placeholder(Float32)

# Create a linear regressor object.
m=Variable(0.0)
b=Variable(0.0)
y=m.*feature_columns .+ b
loss=reduce_sum((target_columns - y).^2)
run(sess, global_variables_initializer())
features_batches, targets_batches = create_batches(my_feature_data, targets, steps, batch_size)

# Use gradient descent as the optimizer for training the model.

# Set up to plot the state of our model's line each period.
sample = california_housing_dataframe[rand(1:size(california_housing_dataframe,1), 300),:];
p1=scatter(sample[my_feature], sample[my_label], title="Learned Line by Period", ylabel=my_label, xlabel=my_feature,color=:coolwarm)
colors= [ColorGradient(:coolwarm)[i] for i in range(0,stop=1, length=periods+1)]

# Train the model, but do so inside a loop so that we can periodically assess
# loss metrics.
println("Training model...")
println("RMSE (on training data):")
root_mean_squared_errors = []
for period in 1:periods
# Train the model, starting from the prior state.
for i=1:steps_per_period
features, labels = my_input_fn(features_batches, targets_batches, convert(Int,(period-1)*steps_per_period+i), batch_size)
run(sess, my_optimizer, Dict(feature_columns=>features, target_columns=>labels))
end
# Take a break and compute predictions.
predictions = run(sess, y, Dict(feature_columns=> my_feature_data));

# Compute loss.
mean_squared_error = mean((predictions- targets).^2)
root_mean_squared_error = sqrt(mean_squared_error)
# Occasionally print the current loss.
println("  period ", period, ": ", root_mean_squared_error)
# Add the loss metrics from this period to our list.
push!(root_mean_squared_errors, root_mean_squared_error)
# Finally, track the weights and biases over time.

# Apply some math to ensure that the data and line are plotted neatly.
y_extents = [0 maximum(sample[my_label])]
weight = run(sess,m)
bias = run(sess,b)
x_extents = (y_extents .- bias) / weight
x_extents = max.(min.(x_extents, maximum(sample[my_feature])),
minimum(sample[my_feature]))
y_extents = weight .* x_extents .+ bias
p1=plot!(x_extents', y_extents', color=colors[period], linewidth=2)
end

predictions = run(sess, y, Dict(feature_columns=> my_feature_data));
weight = run(sess,m)
bias = run(sess,b)

println("Model training finished.")

# Output a graph of loss metrics over periods.
p2=plot(root_mean_squared_errors, title="Root Mean Squared Error vs. Periods", ylabel="RMSE", xlabel="Periods")

# Output a table with calibration data.
calibration_data = DataFrame()
calibration_data[:predictions] = predictions
calibration_data[:targets] = targets
describe(calibration_data)

println("Final RMSE (on training data): ", root_mean_squared_errors[end])
println("Final Weight (on training data): ", weight)
println("Final Bias (on training data): ", bias)

return p1, p2, calibration_data
end
``````

## Task 1: Try a Synthetic Feature

Both the `total_rooms` and `population` features count totals for a given city block.

But what if one city block were more densely populated than another? We can explore how block density relates to median house value by creating a synthetic feature that’s a ratio of `total_rooms` and `population`.

In the cell below, we create a feature called `rooms_per_person`, and use that as the `input_feature` to `train_model()`.

``````california_housing_dataframe[:rooms_per_person] =(
california_housing_dataframe[:total_rooms] ./ california_housing_dataframe[:population]);
``````
``````p1, p2, calibration_data= train_model(
0.05, # learning rate
1000, # steps
5, # batch size
:rooms_per_person #feature
)
``````
``````Training model...
RMSE (on training data):
period 1: 174.73499015754794
period 2: 135.18378014936647
period 3: 124.81483763650894
period 4: 124.99348861666715
period 5: 128.0855648925441
period 6: 129.86065434272652
period 7: 128.06995380520772
period 8: 129.3628149624267
period 9: 129.9533622545398
period 10: 129.8691255721607
Model training finished.
Final RMSE (on training data): 129.8691255721607
Final Weight (on training data): 74.18756812501896
Final Bias (on training data): 67.78876122300292
``````
``````plot(p1, p2, layout=(1,2), legend=false)
``````

We can visualize the performance of our model by creating a scatter plot of predictions vs. target values. Ideally, these would lie on a perfectly correlated diagonal line.

We use `scatter` to create a scatter plot of predictions vs. targets, using the rooms-per-person model you trained in Task 1.

Do you see any oddities? Trace these back to the source data by looking at the distribution of values in `rooms_per_person`.

``````scatter(calibration_data[:predictions], calibration_data[:targets], legend=false)
``````

The calibration data shows most scatter points aligned to a line. The line is almost vertical, but we’ll come back to that later. Right now let’s focus on the ones that deviate from the line. We notice that they are relatively few in number.

If we plot a histogram of `rooms_per_person`, we find that we have a few outliers in our input data:

``````histogram(california_housing_dataframe[:rooms_per_person], nbins=20, legend=false)
``````

We see if we can further improve the model fit by setting the outlier values of `rooms_per_person` to some reasonable minimum or maximum.

The histogram we created in Task 2 shows that the majority of values are less than `5`. Let’s clip `rooms_per_person` to 5, and plot a histogram to double-check the results.

``````california_housing_dataframe[:rooms_per_person] = min.(
california_housing_dataframe[:rooms_per_person],5)

histogram(california_housing_dataframe[:rooms_per_person], nbins=20, legend=false)
``````

To verify that clipping worked, let’s train again and print the calibration data once more:

``````p1, p2, calibration_data= train_model(
0.05, # learning rate
500, # steps
10, # batch size
:rooms_per_person #feature
)
``````
``````Training model...
RMSE (on training data):
period 1: 204.65393150901195
period 2: 173.7183427312223
period 3: 145.97809305428905
period 4: 125.39350453104828
period 5: 113.5851230428793
period 6: 108.94376856469054
period 7: 107.51132608903492
period 8: 107.37501891367756
period 9: 107.35720127223883
period 10: 107.36959825293329
Model training finished.
Final RMSE (on training data): 107.36959825293329
Final Weight (on training data): 70.5
Final Bias (on training data): 71.72122192382814
``````
``````plot(p1, p2, layout=(1,2), legend=false)
``````

``````scatter(calibration_data[:predictions], calibration_data[:targets], legend=false)
``````