Synthetic Features and Outliers

Sören Dobberschütz · August 8, 2018

In this second part, we create a synthetic feature and remove some outliers from the data set.

The Jupyter notebook can be downloaded here.


This notebook is based on the file Synthetic Features and Outliers, which is part of Google’s Machine Learning Crash Course.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Synthetic Features and Outliers

Learning Objectives:

  • Create a synthetic feature that is the ratio of two other features
  • Use this new feature as an input to a linear regression model
  • Improve the effectiveness of the model by identifying and clipping (removing) outliers from the input data

Let’s revisit our model from the previous First Steps with TensorFlow exercise.

First, we’ll import the California housing data into a DataFrame:

Setup

using Plots
gr(fmt=:png)
using DataFrames
using TensorFlow
import CSV
using Random
using Statistics

sess=Session()

california_housing_dataframe = CSV.read("california_housing_train.csv", delim=",");
california_housing_dataframe[:median_house_value] /= 1000.0
california_housing_dataframe

17,000 rows × 9 columns (only the first five rows are reproduced here):

 Row │ longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │   -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936                66.9
   2 │   -114.47     34.40                19.0       7650.0          1901.0      1129.0       463.0         1.8200                80.1
   3 │   -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509                85.7
   4 │   -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917                73.4
   5 │   -114.57     33.57                20.0       1454.0           326.0       624.0       262.0         1.9250                65.5

Next, we’ll set up our input functions, and define the function for model training:

function create_batches(features, targets, steps, batch_size=5, num_epochs=0)
    """Creates enough shuffled copies of the data to serve `steps` batches of size `batch_size`."""

    if num_epochs == 0
        num_epochs = ceil(batch_size*steps / length(features))
    end

    features_batches = Union{Float64, Missings.Missing}[]
    target_batches = Union{Float64, Missings.Missing}[]

    for i = 1:num_epochs
        # Shuffle the full data set once per epoch and append it to the pool of batches.
        select = shuffle(1:length(features))

        append!(features_batches, features[select])
        append!(target_batches, targets[select])
    end

    return features_batches, target_batches
end

function next_batch(features_batches, targets_batches, batch_size, iter)
    """Returns the batch of features and targets belonging to training step `iter`."""

    # Pick the slice of the pre-shuffled pool that corresponds to this step.
    select = mod((iter-1)*batch_size+1, length(features_batches)):mod(iter*batch_size, length(features_batches));

    ds = features_batches[select];
    target = targets_batches[select];

    return ds, target
end
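
As a quick sanity check of these batching helpers (a minimal sketch with made-up toy data, not part of the original notebook), we can build the shuffled pool once and then pull out individual batches:

# Toy data: 10 feature values and 10 targets.
toy_features = collect(1.0:10.0)
toy_targets  = collect(101.0:110.0)

# Build enough shuffled copies for 4 steps with batches of 5 ...
fb, tb = create_batches(toy_features, toy_targets, 4, 5)

# ... and fetch the feature/target pairs for the second step.
ds, target = next_batch(fb, tb, 5, 2)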

function my_input_fn(features_batches, targets_batches, iter, batch_size=5, shuffle_flag=1)
    """Prepares a batch of features and targets for one training step.

    Args:
      features_batches: Pool of pre-batched features, as created by `create_batches`
      targets_batches: Pool of pre-batched targets, as created by `create_batches`
      iter: Index of the current training step
      batch_size: Size of batches to be passed to the model
      shuffle_flag: 1 or 0. Whether to shuffle the data within the batch.
    Returns:
      Tuple of (features, labels) for the next data batch
    """

    # Select the batch that belongs to this training step.
    ds, target = next_batch(features_batches, targets_batches, batch_size, iter)

    # Shuffle the data, if specified.
    if shuffle_flag==1
        select = Random.shuffle(1:size(ds, 1));
        ds = ds[select,:]
        target = target[select, :]
    end

    # Return the next batch of data.
    return convert(Matrix{Float64},ds), convert(Matrix{Float64},target)
end
function train_model(learning_rate, steps, batch_size, input_feature=:total_rooms)
  """Trains a linear regression model of one feature.

  Args:
    learning_rate: A `float`, the learning rate.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    input_feature: A `symbol` specifying a column from `california_housing_dataframe`
      to use as input feature.
  """

  periods = 10
  steps_per_period = steps / periods

  my_feature = input_feature
  my_feature_data = convert.(Float32,california_housing_dataframe[my_feature])
  my_label = :median_house_value
  targets = convert.(Float32,california_housing_dataframe[my_label])

  # Create feature columns.
  feature_columns = placeholder(Float32)
  target_columns = placeholder(Float32)

  # Create a linear regressor object.
  m=Variable(0.0)
  b=Variable(0.0)
  y=m.*feature_columns .+ b
  loss=reduce_sum((target_columns - y).^2)
  run(sess, global_variables_initializer())
  features_batches, targets_batches = create_batches(my_feature_data, targets, steps, batch_size)

  # Use gradient descent as the optimizer for training the model.
  #my_optimizer=train.minimize(train.GradientDescentOptimizer(learning_rate), loss)
  my_optimizer=(train.GradientDescentOptimizer(learning_rate))
  gvs = train.compute_gradients(my_optimizer, loss)
  capped_gvs = [(clip_by_norm(grad, 5.), var) for (grad, var) in gvs]
  my_optimizer = train.apply_gradients(my_optimizer,capped_gvs)

  # Set up to plot the state of our model's line each period.
  sample = california_housing_dataframe[rand(1:size(california_housing_dataframe,1), 300),:];
  p1=scatter(sample[my_feature], sample[my_label], title="Learned Line by Period", ylabel=my_label, xlabel=my_feature,color=:coolwarm)
  colors= [ColorGradient(:coolwarm)[i] for i in range(0,stop=1, length=periods+1)]

  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  println("Training model...")
  println("RMSE (on training data):")
  root_mean_squared_errors = []
  for period in 1:periods
    # Train the model, starting from the prior state.
    for i in 1:steps_per_period
      features, labels = my_input_fn(features_batches, targets_batches, convert(Int,(period-1)*steps_per_period+i), batch_size)
      run(sess, my_optimizer, Dict(feature_columns=>features, target_columns=>labels))
    end
    # Take a break and compute predictions.
    predictions = run(sess, y, Dict(feature_columns=> my_feature_data));

    # Compute loss.
    mean_squared_error = mean((predictions .- targets).^2)
    root_mean_squared_error = sqrt(mean_squared_error)
    # Occasionally print the current loss.
    println("  period ", period, ": ", root_mean_squared_error)
    # Add the loss metrics from this period to our list.
    push!(root_mean_squared_errors, root_mean_squared_error)

    # Finally, track the weights and biases over time.
    # Apply some math to ensure that the data and line are plotted neatly.
    y_extents = [0 maximum(sample[my_label])]
    weight = run(sess,m)
    bias = run(sess,b)
    x_extents = (y_extents .- bias) / weight
    x_extents = max.(min.(x_extents, maximum(sample[my_feature])),
                     minimum(sample[my_feature]))
    y_extents = weight .* x_extents .+ bias
    p1 = plot!(x_extents', y_extents', color=colors[period], linewidth=2)
  end

  predictions = run(sess, y, Dict(feature_columns=> my_feature_data));
  weight = run(sess,m)
  bias = run(sess,b)

  println("Model training finished.")

  # Output a graph of loss metrics over periods.
  p2=plot(root_mean_squared_errors, title="Root Mean Squared Error vs. Periods", ylabel="RMSE", xlabel="Periods")       

  # Output a table with calibration data.
  calibration_data = DataFrame()
  calibration_data[:predictions] = predictions
  calibration_data[:targets] = targets
  describe(calibration_data)

  println("Final RMSE (on training data): ", root_mean_squared_errors[end])
  println("Final Weight (on training data): ", weight)
  println("Final Bias (on training data): ", bias)

  return p1, p2, calibration_data   
end

Task 1: Try a Synthetic Feature

Both the total_rooms and population features count totals for a given city block.

But what if one city block were more densely populated than another? We can explore how block density relates to median house value by creating a synthetic feature that’s a ratio of total_rooms and population.

In the cell below, we create a feature called rooms_per_person, and use that as the input_feature to train_model().

california_housing_dataframe[:rooms_per_person] =(
    california_housing_dataframe[:total_rooms] ./ california_housing_dataframe[:population]);
p1, p2, calibration_data= train_model(
    0.05, # learning rate
    1000, # steps
    5, # batch size
    :rooms_per_person #feature
)
Training model...
RMSE (on training data):
  period 1: 174.73499015754794
  period 2: 135.18378014936647
  period 3: 124.81483763650894
  period 4: 124.99348861666715
  period 5: 128.0855648925441
  period 6: 129.86065434272652
  period 7: 128.06995380520772
  period 8: 129.3628149624267
  period 9: 129.9533622545398
  period 10: 129.8691255721607
Model training finished.
Final RMSE (on training data): 129.8691255721607
Final Weight (on training data): 74.18756812501896
Final Bias (on training data): 67.78876122300292
plot(p1, p2, layout=(1,2), legend=false)

[Plot: learned line by period (left) and RMSE vs. periods (right)]

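The final RMSE can also be recomputed directly from the returned calibration data (a small cross-check, not part of the original notebook); it should reproduce the value printed above:

# Recompute the final training RMSE from the calibration DataFrame.
sqrt(mean((calibration_data[:predictions] .- calibration_data[:targets]).^2))
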
Task 2: Identify Outliers

We can visualize the performance of our model by creating a scatter plot of predictions vs. target values. Ideally, these would lie on a perfectly correlated diagonal line.

We use scatter to create a scatter plot of predictions vs. targets, using the rooms-per-person model we trained in Task 1.

Do you see any oddities? Trace these back to the source data by looking at the distribution of values in rooms_per_person.

scatter(calibration_data[:predictions], calibration_data[:targets], legend=false)

[Plot: scatter plot of predictions vs. targets]

The calibration data shows most scatter points aligned along a line. The line is almost vertical, but we’ll come back to that later. For now, let’s focus on the points that deviate from the line; we notice that they are relatively few in number.

If we plot a histogram of rooms_per_person, we find that we have a few outliers in our input data:

histogram(california_housing_dataframe[:rooms_per_person], nbins=20, legend=false)

[Plot: histogram of rooms_per_person]

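To put a number on how rare these outliers are (a quick check, not part of the original exercise), we can count the blocks whose ratio exceeds 5 rooms per person and look at the most extreme value:

# Number of city blocks with more than 5 rooms per person ...
count(x -> x > 5, skipmissing(california_housing_dataframe[:rooms_per_person]))
# ... and the largest ratio occurring in the data.
maximum(skipmissing(california_housing_dataframe[:rooms_per_person]))
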
Task 3: Clip Outliers

Let’s see if we can further improve the model fit by setting the outlier values of rooms_per_person to some reasonable minimum or maximum.

The histogram we created in Task 2 shows that the majority of values are less than 5. Let’s clip rooms_per_person to 5, and plot a histogram to double-check the results.

california_housing_dataframe[:rooms_per_person] = min.(
    california_housing_dataframe[:rooms_per_person],5)

histogram(california_housing_dataframe[:rooms_per_person], nbins=20, legend=false)

[Plot: histogram of rooms_per_person after clipping]

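As an aside (just a sketch, not used in the rest of the notebook): min. only caps the upper end of the ratio. If we also wanted to enforce a lower bound, Julia’s clamp. clips at both ends:

# Equivalent clipping with an explicit lower and upper bound.
california_housing_dataframe[:rooms_per_person] = clamp.(
    california_housing_dataframe[:rooms_per_person], 0, 5)
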
To verify that clipping worked, let’s train again and print the calibration data once more:

p1, p2, calibration_data= train_model(
    0.05, # learning rate
    500, # steps
    10, # batch size
    :rooms_per_person #feature
)
Training model...
RMSE (on training data):
  period 1: 204.65393150901195
  period 2: 173.7183427312223
  period 3: 145.97809305428905
  period 4: 125.39350453104828
  period 5: 113.5851230428793
  period 6: 108.94376856469054
  period 7: 107.51132608903492
  period 8: 107.37501891367756
  period 9: 107.35720127223883
  period 10: 107.36959825293329
Model training finished.
Final RMSE (on training data): 107.36959825293329
Final Weight (on training data): 70.5
Final Bias (on training data): 71.72122192382814
plot(p1, p2, layout=(1,2), legend=false)

[Plot: learned line by period (left) and RMSE vs. periods (right), after clipping]

scatter(calibration_data[:predictions], calibration_data[:targets], legend=false)

[Plot: scatter plot of predictions vs. targets after clipping]

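As a final sanity check (again only a sketch, not part of the original notebook), we can quantify how much tighter the relationship has become by computing the correlation between predictions and targets on the clipped data:

# Correlation between model predictions and actual targets after clipping.
cor(calibration_data[:predictions], calibration_data[:targets])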