Monday, June 8, 2015

Introduction to Apache Spark

These are some rough notes for a talk to be given at the June 2015 Chadoop meeting.

What is Apache Spark?

  1. Enterprise parallel processing framework based on resilient distributed datasets (RDD)
  2. for Java, Scala, Python, and SQL
  3. with memory caching (memory reusable, used as progressive enhancement)

How can I get Spark up and running quickly?

  1. Docker on a collection of various firewalled hardware (described here)
  2. Amazon EC2 script (included in the Spark distribution)

How does it perform / scale?

  1. Resilient against (some) worker failures
  2. Can add / remove workers [Amazon spot market compatible]
  3. Can perform well or poorly depending on classical parallel processing concerns (CPU vs. memory vs. disk vs. network bottlenecks)

Quirks

  • Transformations, such as reading a file, mapping data over a function, etc., are lazy: they do not instantiate (compute) their results.
  • Actions do. This is good and lets you be deliberate about exactly how large intermediate results are handled (cached in memory, on disk, or recomputed each time); see the short PySpark sketch after this list.
  • Spark is not currently hierarchical. Networking is not layered as App/Driver <-> Master <-> Workers. It is important to make sure all workstations, masters, and workers can connect to each other.
  • Random port numbers are used for many services, which complicates firewall configuration.
  • Networking is not NAT-friendly and the documentation says little about this issue. This can be a blocker for simply combining your cluster of 3-4 home PCs with extra Amazon EC2 machines, as the home machines may be on the 192.168 LAN and the Amazon EC2 machines on a different Amazon LAN. If the Spark master has a local IP like 192.168.1.100 and a router exposes it to the internet as 1.2.3.4, packets sent to the Spark master at 1.2.3.4 will be read and discarded with logged errors.
  • Ideally, jobs that don't finish time out and are rescheduled, but that wasn't always my experience. Spark still has a ways to go on robustness/reliability and there are unsolved issues. Workers can hang; a malformed CSV file import was one way to trigger this. I suspect, though, that reliability will improve over time.
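
To make the transformation/action distinction concrete, here is a minimal PySpark sketch (illustrative only, not from the talk; it assumes an existing SparkContext sc and a text file of integers such as the /data/gig.csv used later in these notes):

lines = sc.textFile("/data/gig.csv")           # transformation: nothing is read yet
nums = lines.map(lambda x: int(x)).cache()     # more transformations plus a cache request; still nothing computed
print nums.count()                             # action: now the file is read, parsed, and cached
print nums.reduce(lambda a, b: a + b)          # second action: served from the in-memory cache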

Using Pipework and Docker to Run Apache Spark

My experience with a cluster at home: for Apache Spark I learned the hard way not to use docker's hostname flag -h, but instead to keep the default hostnames and use pipework to establish an alternate network interface in each container, placing them all with their own IPs on a shared, firewalled LAN. By default they cannot be reached from outside the home; outside access would require planned holes in the firewall, probably via ssh to a shell account inside the home with access to the cluster.
Pipework is a free utility to route an additional ethernet bridge into a docker container.
Docker is a virtualization technology for Linux based on containers.
To install on Ubuntu use apt-get install docker.io, where the package name is docker.io due to a name conflict with an older project also called docker (a system tray for KDE3/Gnome2 applications).

Docker vs. Traditional Virtual Machine

  • Docker uses Linux Kernel namespace and cgroup capabilities to isolate and restrict containers
  • The host and the docker containers all share the host's Linux kernel, memory, swap and hardware
  • The container sees its own file tree and process list, and can only access devices as permitted by docker

In a Traditional VM, resources are wasted:

  • booting and running multiple kernels
  • duplicate drivers, system utilities and cron jobs, etc.
  • by requiring hard allocations of memory for each running VM that might turn out to be incorrect later, like giving a worker too little or too much memory relative to the master or driver.

The traditional VM is probably safer/more secure. Docker is lighter weight in a trusted environment.

Dockerhub is a public repository of layered images for container filesystems
A Dockerfile is a script for building a container filesystem from the repository, docker directives and shell scripting.
FROM drpaulbrewer/spark-roasted-elephant:latest
MAINTAINER drpaulbrewer@eaftc.com
ADD my-spark-worker.sh /spark/
CMD /spark/my-spark-worker.sh
The docker command, in conjunction with a docker daemon, builds, pushes, pulls, runs, stops, kills, etc… containers.
shell script ./run-docker-spark for master and 1st worker [AMD FX 8150]
#!/bin/bash
sudo -v
MASTER=$(docker run --name="master" --expose=1-65535 --env SPARK_MASTER_IP=192.168.1.10 --env SPARK_LOCAL_IP=192.168.1.10 \
 -v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-master:latest)
sudo pipework eth0 $MASTER 192.168.1.10/24@192.168.1.1
SPARK1=$(docker run --name="spark1" --expose=1-65535 --env mem=10G \
--env master=spark://192.168.1.10:7077 \
--env SPARK_LOCAL_IP=192.168.1.11 \
--env SPARKDIR=/spark/spark-1.3.1 \
-v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
sudo pipework eth0 $SPARK1 192.168.1.11/24@192.168.1.1
shell script for additional worker on the same wired LAN
#!/bin/bash
sudo -v
SPARK=$(docker run --name="spark" --expose=1-65535 \
--env mem=10G \
--env master=spark://192.168.1.10:7077 \
--env SPARK_LOCAL_IP=192.168.1.13 \
-v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
sudo pipework eth0 $SPARK 192.168.1.13/24@192.168.1.1
shell script for additional worker on wireless LAN.
Issue: wireless LAN apparently not compatible with pipework’s bridging designed for eth0
Solution: --net="host" runs on the host's network stack; no pipework needed
#!/bin/bash
sudo -v
SPARK=$(docker run --net="host" --expose=1-65535 \
--env SPARKDIR=/spark/spark-1.3.1 \
--env mem=10G \
--env master=spark://192.168.1.10:7077 \
-v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
To reset, or after a system restart, you need to docker stop (or docker kill) and docker rm each container before running the scripts again.

Parallel Processing Toy example:

Calculate the Leibniz formula for PI

(π/4) = 1 - (1/3) + (1/5) - (1/7) +(1/9) - (1/11) + …

Let’s use denominators out to 4 Billion and 1.

A. Without Apache Spark – Python

total=1.0
for n0 in xrange(1000000000):
    n = 1+n0
    total += (1.0/(4.0*n+1.0)) - (1.0/(4.0*n-1.0))
print 4.0*total

A. Run Time: 409 sec on 1 core of AMD-FX8150


B. Without Apache Spark – Python + Numpy

import numpy
billionInt = 1+numpy.arange(1000000000, dtype=numpy.int32) 
total = 1.0 + ((1.0/(4.0*billionInt+1.0)) - (1.0/(4.0*billionInt-1.0))).sum() 
print 4.0*total

Issue: wants 26 GB of RAM

Solution: rent an Amazon EC2 Xeon E5-2680 2.8 GHz 32 GB instance (c3.4xlarge)

B. Run Time: 25 sec on one core
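
A hedged alternative (not from the talk) that avoids the 26 GB requirement: process the series in numpy chunks so only about 10 million terms are in memory at a time.

import numpy
total = 1.0
chunk = 10000000                      # 10 million terms per chunk keeps peak memory modest
for start in xrange(1, 1000000001, chunk):
    n = numpy.arange(start, min(start + chunk, 1000000001), dtype=numpy.float64)
    total += ((1.0/(4.0*n + 1.0)) - (1.0/(4.0*n - 1.0))).sum()
print 4.0*total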


C. Without Apache Spark – C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char ** argv)
{
 long int n;
 if (argc!=3){
    fprintf(stderr, "usage: LeibnizPi1G <fromidx> <toidx_uninclusive>\n");
    exit(EXIT_FAILURE);
 }
 long int loop_from = strtol(argv[1],NULL,10);
 long int loop_to = strtol(argv[2], NULL, 10);
 long double pisum = 0.0L;
 const long double one =  1.0L;
 const long double four = 4.0L;
 for(n=loop_from;n<loop_to;++n)
   pisum+=(one/(four*n+one))-(one/(four*n-one));
 long double LeibnizPi = 4.0*(1.0+pisum);
 printf("%.20Lf\n", LeibnizPi);
}

C. Run Time (AMD FX8150, 1 Core): 9 sec


D. Without Apache Spark – hand written map/reduce

#!/bin/bash
X=./LeibnizPi1G
$X 1 100000001 >/tmp/x1 &
$X 100000001 200000001 >/tmp/x2 &
$X 200000001 300000001 >/tmp/x3 &
$X 300000001 400000001 >/tmp/x4 &
$X 400000001 500000001 >/tmp/x5 &
$X 500000001 600000001 >/tmp/x6 &
$X 600000001 700000001 >/tmp/x7 &
$X 700000001 800000001 >/tmp/x8 &
$X 800000001 900000001 >/tmp/x9 &
$X 900000001 1000000001 >/tmp/x10 &
wait
cat /tmp/x* | gawk 'BEGIN {sum=0.0} {sum+=($1-4.0)/4.0} END { printf("%0.17g\n",4.0+4.0*sum) }'

D. Run Time: 2-3 sec on 8 core AMD FX 8150
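
For comparison, here is a hedged Python sketch (not from the talk) of the same 10-way split using the multiprocessing module instead of background shell jobs; it is far slower than the C version but shows the same map/reduce idea:

from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds                  # sum Leibniz terms for n in [lo, hi)
    s = 0.0
    for n in xrange(lo, hi):
        s += (1.0/(4.0*n+1.0)) - (1.0/(4.0*n-1.0))
    return s

if __name__ == "__main__":
    step = 100000000
    ranges = [(1 + i*step, 1 + (i+1)*step) for i in range(10)]
    print 4.0*(1.0 + sum(Pool(10).map(partial_sum, ranges)))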


JS. Browser-side Javascript

Leibniz π on jsfiddle

Run Time: 8.88 sec in Chromium Browser on i5-4570


Home Apache Spark Cluster

Box 1 (master, worker1, and workstation) AMD FX-8150 8 cores 16GB

Box 2 (worker2) i5-3470S HP Ultra Slim 4 cores 16 GB

Box 3 (worker3) i5-4570R Gigabyte Brix Cube 4 cores 16GB WLAN

After the master is running and the workers connected, we can go to the master WebUI to monitor
the cluster and check stdout/stderr on any of the workers
Spark Cluster Management Console

Spark – Python

from operator import add
from pyspark import SparkContext

def term(n):
    return (1.0/(4.0*n+1.0))-(1.0/(4.0*n-1.0))

def LeibnizPi(Nterms):
    sc = SparkContext(appName="LeibnizPI")
    piOver4Minus1 = sc.parallelize(xrange(1,Nterms+1),30).map(term).reduce(add)
    return 4*(1+piOver4Minus1)

print LeibnizPi(1000*1000*1000)

To run: spark-submit --master $MASTER /path/to/code.py, or use pyspark interactively

Run Time: 36 sec


Idea: Make the map task more substantial

Spark – Python #2

from operator import add
from pyspark import SparkContext

def Msum(n):
    megaN = 1000000*n
    s = 0.0
    for k in xrange(1000000):
        d0 = 4*(k+1+megaN)+1
        d1 = d0-2
        s += ((1.0/d0)-(1.0/d1))
    return s

def LeibnizPi(Nterms):
    sc = SparkContext(appName="LeibnizPI")
    piOver4Minus1 = sc.parallelize(xrange(0,Nterms+1), 20).map(Msum).reduce(add)
    return 4*(1+piOver4Minus1)

print LeibnizPi(999)

Run Time: 24 sec


Spark – Scala

def term(n: Int) = (1.0/(4.0*n+1.0))-(1.0/(4.0*n-1.0))

def LeibnizSum(Nterms: Int) = sc.parallelize(1 to Nterms).map(term).reduce(_ + _)

def LeibnizPi(Nterms: Int) = 4.0*(1.0+LeibnizSum(Nterms))   

println(LeibnizPi(1000*1000*1000))

To run: spark-shell --master $MASTER, then :load code.sc

Run Time: 19 sec


Spark – Scala #2

def LeibnizMSum(k: Int): Double = {
  var sum: Double = 0.0
  for (n <- (1000000L*k + 1) to (1000000L*(k + 1))) {
    sum += (1.0/(4.0*n + 1.0)) - (1.0/(4.0*n - 1.0))
  }
  sum
}

def LeibnizSum(Nterms: Int) = sc.parallelize(0 to Nterms).
   map(LeibnizMSum).reduce(_ + _)

def LeibnizPi(Nterms: Int) = 4.0*(1.0+LeibnizSum(Nterms))   

println(LeibnizPi(999))

Run Time: 2.15 - 2.25 sec

Run time for 10 Billion terms, run as 64 map tasks: 17 sec


Spark – Java #1

Failed

parallelize() required a List<Integer> to split up among workers, but it first had to be initialized to 1, 2, 3, …, 1 Billion, which was problematic.


Spark – Java #2

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import java.util.ArrayList;

class LeibnizMSum implements Function<Integer,Double> {
    public Double call(Integer k){
        double s=0.0;
        int n = 0;
        int limit = 1000000*(k+1);
        for(n=1000000*k+1; n<=limit; ++n){   // <= includes the last term of each chunk, matching the Scala version
            s = s + ((1.0/(4.0*n+1.0))-(1.0/(4.0*n-1.0)));
        }
        return s;
    }
}

public class LeibnizPIApp {

    public static void main (String[] args){
        int billion = 1000*1000*1000;
        String masterURL="spark://192.168.1.10:7077";
        String appName = "LeibnizPIJava";
        String home = "/spark/spark-1.3.1";
        String jarfile = "/Z/pi/LeibnizJava/target/LeibnizPiJava-1.0.jar";
        System.out.println("Hello this is LeibnizPIApp.  Initializing...");
        ArrayList<Integer> gigalist = new ArrayList<Integer>(1000);
        int i;
        for(i = 0; i<1000; ++i) gigalist.add(i);
        System.out.println("Initialized gigalist");
        JavaSparkContext sc = new JavaSparkContext(masterURL, appName, home, jarfile);
        System.out.println("have spark context");
        JavaRDD<Integer> gigaRDD = sc.parallelize(gigalist, 32);
        System.out.println("have paralellized gigalist, starting map/reduce");
        Double pisum = gigaRDD.map(new LeibnizMSum()).reduce( (a,b) -> a+b );
        Double pi = 4.0 + 4.0*pisum;
        System.out.println(pi);
    }
}

Run Time: 3.14 sec


Spark SQL

  • Prepare a file with entries 1, 2, 3, 4, 5, … 1 billion, one int per line (see the sketch after this list)
  • Load it as a temporary SQL table with a scala script
  • Transformations, such as reading a file, mapping data over a function, etc. do not instantiate. Actions do
  • A cache request stores the object the first time an action instantiates it
  • Must use the --executor-memory option to set the memory available to workers
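
A minimal hedged sketch (not from the talk) of generating that one-integer-per-line file; the path /data/gig.csv matches the load script below:

with open("/data/gig.csv", "w") as f:
    for n in xrange(1, 1000000001):
        f.write("%d\n" % n)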

SQL Table Load script

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Gig(n: Int)
val gigcsv = sc.textFile("/data/gig.csv")
               .map(x => Gig(x.trim.toInt)).toDF().cache()
gigcsv.registerTempTable("gig")

SQL count statement

sqlContext.sql("select count(*) from gig").collect().foreach(println)

First Run: 59 secs to read 1 billion records from SSD

Second Run: Cached in memory – 3.5 secs

SQL Pi Calculation

sqlContext.sql("select 4.0*(1.0+sum( (1.0/(4.0*n+1.0))-(1.0/(4.0*n-1.0)) )) as LeibnizPI FROM gig").collect().foreach(println)

Run Time: 20 secs (from previous cached result)

Wednesday, March 25, 2015


Inside the US Treasury HAMP 2015.01 Data Dump: Fun with Loan Modification Approval Rates

It’s no secret that many people who applied for a loan modification were unable to obtain one. Some people received the run around as banks lost their paperwork over and over. Others received a trial modification but not a permanent one. Some people simply weren’t eligible.

The fields ln_st_nme and ln_mdfc_mode_nme in the data dump tell this story:

outcomes of loan modification requests

Only rows 2 and 6 are good outcomes. This article concentrates on the rate of row 2 “Active Payment - Official Modification” outcomes, about 1 million out of the 6+ million total requests.

A recent public data dump, together with the capabilities of Google BigQuery, allows a deeper look at the approval rates for loan modification requests.

Loan Modification Approvals Percentages by MSA State

top 10 states by loan modification approval rate

Active is the number of official, permanent modifications with active payment

Total is the number of all requests, including rejected requests, cancellations, trials and permanent.

Approval percentage is calculated as 100 x Active / Total

Puerto Rico is the surprising entry on this list. Even though the number of applications and official modifications is small, there's no obvious explanation why it should be easier for someone to get a loan modification approval in Puerto Rico, as opposed to, say, Hawaii, which had about twice as many applications but fewer finished official, permanent modifications.

DC is not exactly a surprise. Putting cynicism aside for a moment, the place is full of professionals who can navigate, or help others navigate, the government red tape involved in the loan modification program.

Bottom States

Bottom States by Loan Modification Approval Rate (Metro Areas)

Among the bottom approval rates we find Texas, with a ~ 12% approval percentage for loan modifications. Texas has about as many active, permanent, official modifications as Arizona, about 33k. However, 272k Texans applied as opposed to 200k Arizonans.

What’s an MSA?

Practically, an MSA is an area around a city or town, identified by a code number such as 31080 for Los Angeles, or 12060 for Atlanta.

Technically, an MSA is a Metropolitan/Micropolitan Statistical Area, a group of counties identified by the federal government (probably the census bureau) as representing a named area.

In the HAMP data dump, a row is tagged with either an MSA code number (~5.8 million records) or a state (~419,000 records) or nothing (8 records). Knowing both the state and MSA would be useful in state border regions such as New York City, Memphis, or Chicago as the MSA can cross state borders.

Best Approval Rates by MSA


Once again, Puerto Rico takes the lead, followed by the LA metro area and various California towns.

A Bit of Fun with Politics

Going back to our state-level data, we can also fetch whether Obama (D) or Romney (R) won the state in the 2012 Election and put that alongside the loan modification approval percentage.

2012 Presidential Election State by State

2012 election map from Wikipedia
source: Wikipedia

top loan modification approval percentage with 2012 election info

Although the US political scene is not so simple or rigid as to put an entire state in the Democratic or Republican bucket based on the 2012 election, for the sake of argument we see 8 (D) states getting the top 10 loan modification approval rates and only 2 (R) states in the top 10.

And if we look at the bottom 10 states by loan modification approval percentage…

bottom loan modification approval percentages with 2012 election info

we see 9 (R) states at the bottom of the rankings and only 1 (D) state.

From this, what can a hypothetical talk-radio host conclude? Probably not much. Certainly none of these things:

  • that the (D) states are being paid off for voting (D), at the expense of the (R) states [**]?
  • that (R) folks have a better chance attracting a tax audit than receiving a loan modification[**]?
  • that the (R) folks are sabotaging the loan modification process in some states [**]?
  • that people who needed a loan modification voted, in 2012, for the party that would give them one[**]?
    [**] except Utah and Arizona of course.

After all, correlation does not prove causation.

There can be unknown factors creating these effects that are also correlated with politics.

Speaking of correlation, if we use 2012 popular vote, % Romney - % Obama, to put each state on a left-right axis, we get this pretty plot:

MSA HAMP Loan Modification Approval Rate vs 2012 Popular Vote

The R^2 for the linear fit is around 0.40, meaning that about 40% of the variance is explained by the linear model (leaving 60% unexplained…)

Once again, remember, correlation does not prove causation.

R provided this summary of the linear model shown in the chart.

Call:
lm(formula = hamp_approval_rate ~ 1 + popular_vote2012, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6433 -0.9806 -0.3023  1.1444  6.2260 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      14.76049    0.26599  55.492  < 2e-16 ***
popular_vote2012 -0.07342    0.01307  -5.618 9.54e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.875 on 48 degrees of freedom
Multiple R-squared:  0.3967,    Adjusted R-squared:  0.3842 
F-statistic: 31.57 on 1 and 48 DF,  p-value: 9.544e-07

The queries, raw data, and R scripts for this are in this github repository for those who would like to use these files as a basis for other explorations, commentary, or improvement.

Totalling Savings to Borrowers

The data dump contains fields for borrower monthly housing expense before and after the modification. These fields do not always contain numbers, or numbers that make sense, so we restrict the data to permanent modifications where these numbers take on somewhat reasonable values (I used up to $10,000/mo, which might seem a bit high for a house payment plus miscellaneous, but there are hundreds of entries that are higher) and eliminate any reports showing zero or negative savings.

Monthly Savings from Loan Modification

The California savings of $209 million/month dwarfs the #2 and #3 states, Florida and New York, which are tied at around $72 million/month saved on housing expenses. Other states in the top 12 come in at $10-$30 million/month saved.

The top average savings per borrower/mod studied goes to HI with $1008/mo (off chart), then NY ($910/mo), CA ($857/mo), DC ($769/mo)

Bottom states, Monthly Savings from Loan Modification

The bottom end of the savings list dips under $100K/month total saved, with average savings in the $300/mo-$500/mo range.

Data Issues

I've heard that there are data issues with some fields in the HAMP data dump. Some of these are mentioned in the data dictionary and users guide. I've also seen reports of higher HAMP approval rates than the measly 10-20% we see here.

Here's a document that says the number is more like 30% but it includes, in the fine print, trials that do not become permanent modifications:
HAMP Application Activity by Servicer Jan 2015

So, take what you read here with a grain of salt. If anyone has further insight about the data issues, please, explain in the comments section.

Saturday, March 21, 2015


Using Google BigQuery As A Massively Parallel Scientific Calculator: From π to Monte Carlo

An idea was discussed a bit at a talk I gave on BigQuery at a Google Developer Group meetup in Chattanooga. Can you use BigQuery to generate massive data instead of simply loading or analyzing existing data? Sure. One can write up a giant select query to fix some errors in a dataset, rearrange the column order, calculate some new columns, and save it in a table.
But something more basic, even subversive, is possible. It could be a Hadoop killer for some calculations.
All we need is a table of integers, say a list from 1 to 1,000,000,000.
Not finding such a dataset we release one as eaftc-free:billion.integers for public exploration.
Although this dataset would seem to be uninteresting, even trivial, it allows the end user to build all sorts of calculations and simulations using mathematical functions and the uniform [0,1) random function rand() built into BigQuery.
You can get access to our table at
https://bigquery.cloud.google.com/table/eaftc-free:billion.integers
Processing costs are your own responsibility, but Google has a limited free tier. Clicking this link will take you to a Google BigQuery signup unless you already have a BigQuery account.
So what? You ask. This simple table was a missing piece to make a massive parallel calculator.

Calculating π in Google BigQuery

Normal Usage

Normally, π in Google BigQuery can be obtained from the built in function PI()
Warning: I do not necessarily advocate repetition of the pi calculations shown below. You should use your free monthly processing in BigQuery to do something new and useful. Really. On the other hand, the public regularly uses similar levels of resources to search for furry kittens.
Queries can cost money… disclaimer: Most of the queries reported here will use 7 - 25 GB of BigQuery processing. After the free tier, as of March 2015 Google BigQuery is priced by Google at US$5/TB. So, some of the queries here could cost you US$0.10 every time you run them. Therefore, you should probably not attach a web page that shows furry kittens to a back end cgi script that runs these queries on demand for the public from your account. Also in this article are instructions for constructing large tables. These large tables also attract storage costs from Google BigQuery until you delete them. Neither the author nor my company, Economic and Financial Technology Consulting LLC, will be responsible for any Google BigQuery processing, storage, or other charges incurred by experimenting with these queries.

Brute forcing a sum

Let’s calculate π using one of those huge sums you learn in math class.

Leibniz formula

From the Leibniz formula for Pi
(π/4) = 1 - (1/3) + (1/5) - (1/7) +(1/9) - (1/11) + …
We can write the calculation for π from the first 2,000,000,001 (two billion and one) terms in this series as this BigQuery query against [eaftc-free:billion.integers]:

SELECT 4*(1+sum( (1.0/(4.0*n+1.0))-(1.0/(4.0*n-1.0)) )) as LeibnizPI FROM [eaftc-free:billion.integers];

Leibniz PI via Google BigQuery
Obtaining this approximation: 3.1415926540897736 in about 2 seconds.
Compare to the pi day page: 3.1415926535897932384
This shows the two billion and one term Leibniz sum to be correct to 9 places and a bit high for digits 10-11.
This sum can also be easily written as a simple loop in Python, or your favorite programming language, and is left as an exercise to the reader. A decent desktop computer, like our AMD FX8150, found a similar value in about 6 minutes running in one CPU core. Due to rounding and the kind of floats you are using, different answers are possible.

Euler formula

Euler had a result that Pi squared over 6 is the limit of the sum of all the squared reciprocals
(1/1)+(1/4)+(1/9)+(1/16)+…
This can be written in Google BigQuery as:

SELECT sqrt(6.0*sum(1.0/(n*n))) as EulerPI FROM [eaftc-free:billion.integers]

Euler PI via Google BigQuery
Obtaining this Euler approximation: 3.141592652634905
Compare to the pi day page: 3.1415926535897932384
This shows the one billion term Euler sum to also be correct to 9 places and a bit low for digits 10-11.
Although I don't think you should waste all of Google's processing power recomputing π over and over, Wikipedia provides a list of approximations of π that might be reconstructed in a similar manner.

The Road to Monte Carlo

Monte Carlo methodology has broad application in computational finance, particularly in the pricing of various kinds of bets (derivatives) on the motion of a price or price index, as well as in applied mathematics, and thus in many applications in science and engineering, where it estimates integrals and dynamic processes that cannot be derived on paper analytically.
The basic idea of Monte Carlo methods is:
  1. Express the desired calculation as an average or other reduction of functions of a bunch of random numbers that are obtained from calling a random number generator
  2. generate that bunch of random numbers
  3. reduce the giant table of random numbers to the answer using other functions, averages, etc. (a toy sketch of this recipe follows below)
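
A toy illustration of this recipe in plain Python (a hedged sketch, not from the post): estimate π as 4 times the fraction of uniform random points in the unit square that land inside the quarter circle of radius 1.

import random
N = 1000000
inside = 0
for _ in xrange(N):
    u, v = random.random(), random.random()
    if u*u + v*v <= 1.0:
        inside += 1
print 4.0*inside/N   # roughly 3.14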

Bumps on the Road to Monte Carlo

Strange results repeating BigQuery's RAND()

You’d think multiple calls to RAND() in a query would generate independent calls and independent random numbers. Not so, as of March 2015.

Should this query yield 1/4 or 1/3 of 1,000,000,000 ?


select sum(rand()*rand()) from [eaftc-free:billion.integers]

What this does is sum up rand()*rand() 1 billion times, once for each of the integers in the table. Each call to rand() yields a random floating point number between 0.0 and 1.0.
Notice that no columns of the table are needed, so this query attracts 0 bytes of Google BigQuery processing charges even though it is work for BigQuery to calculate. The number of rows in the billion.integers table, 1 billion, controls the number of times the innermost expression will be executed before being summed.

Why 1/4 of 1 billion?

If the RAND() calls produce different, independent, random numbers, then we can use the formula that applies to independent random numbers E[x*y]=E[x]*E[y], where E[x] is the expected value of x, a fancy technical way to say the average of x. Since the average value of rand() is 0.5, we should have avg(rand()*rand()) = avg(rand()) * avg(rand()) = 0.5 * 0.5 = 0.25. If the avg is 0.25, then summing 1,000,000,000 of these products should yield around 2.5E8, or 250,000,000.

Why 1/3 of 1 billion?

Caching is a technique to speed up computer processing by recalling a result from memory instead of computing it over again. If multiple calls to RAND() in a row of output were somehow cached into one call to RAND(), then avg(rand()*rand()) is not multiplying two different independent random numbers and taking the average, but actually multiplying a single random number by itself and taking the average, as in avg(square(rand())). In that case, the result is the definite integral of x^2 over the interval [0,1], which you might remember from freshman calculus is 1/3.
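
A hedged plain-Python illustration (not BigQuery) of the two cases just described: averaging a product of two independent uniforms gives about 1/4, while averaging a single uniform multiplied by itself gives about 1/3.

import random
N = 1000000
indep = 0.0
same = 0.0
for _ in xrange(N):
    indep += random.random()*random.random()   # two independent draws
    r = random.random()
    same += r*r                                 # one draw multiplied by itself
print indep/N, same/N   # roughly 0.25 and 0.333...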

And the Google BigQuery Answer is …. 1/3 … ugh :-(

BigQuery for product of two rand calls

Subselects won’t generate independent tables of RAND()

rands strangely equal
The next section explores how to work around this issue to generate a table with two independent random values.

Paving the Road to Monte Carlo

We want to generate a table with independent realizations of two random variables on each row.
Let’s call these variables u and v
This can be done in a straightforward series of steps.

  1. Make a u table where u is generated by rand()
  2. Make a v table where v is generated by rand()
  3. Join the u and v tables
  4. Test for independence
  5. Clean up by deleting the tables in steps 1 and 2

We publicly released the resulting u,v table as [eaftc-free:billion.randUV]

Making the randU table

In a query returning a billion entries into a new table, we had to select Allow Large Results. As a matter of courtesy to other users we also ran these table construction queries as batch mode queries.
making the u table

Making the randV table

making the v table

Joining to make the randUV table

making the randUV table

Testing independence

testing the randUV table

Generating Normals: The Box-Muller Transform

Having a table with two independent random uniform variables allows us to make standardized normal random variables from the normal, or bell-shape, distribution. Although we won’t get that far in this article, the normal distribution is the basic building block for building random walks of stock market prices and Monte Carlo estimation of options prices.
The Box-Muller Transform says we can take our two independent rand() uniform random numbers u and v and make two zero mean, unit variance standard normal distribution random numbers z0 and z1 using these formulas:

z0 = sqrt(-2.0 ln(u)) cos(2πv)

z1 = sqrt(-2.0 ln(u)) sin(2πv)
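
As a quick sanity check of these formulas outside BigQuery, here is a hedged numpy sketch (not part of the post's workflow) that applies the same transform to locally generated uniforms:

import numpy as np
u = np.random.rand(1000000)   # stand-ins for the u,v columns of the randUV table
v = np.random.rand(1000000)
z0 = np.sqrt(-2.0*np.log(u)) * np.cos(2.0*np.pi*v)
z1 = np.sqrt(-2.0*np.log(u)) * np.sin(2.0*np.pi*v)
print z0.mean(), z0.var(), z1.mean(), z1.var()   # expect roughly 0, 1, 0, 1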

In Google BigQuery, to avoid extra storage charges we store a view computing z0, z1 from u, v based on table randUV instead of computing a new table of z0, z1.
So you can follow along, we’ve made this view publicly available as [eaftc-free:billion.BoxMuller]
Box Muller view

Testing averages, variances from Box-Muller

We were happy to see averages near zero and variances near 1, and all slightly different from each other.
testing Normals for avg and var

Examining the normal tail: That infamous sixth sigma

By six sigma here we are referring to the Normal distribution, not the Six Sigma managerial process and training organizations, although the goals of the latter are marketed through the mathematical properties of the former.
In the normal distribution z0 >= 6.0 occurs with a probability of about 1 in a billion.
We can use Google BigQuery to see how many of these occurred in our billion normal values.
count of six sigma normal values
and we can find the row of the Box Muller view with this value
row with six sigma value

Five sigma

Exceeding five sigma should give us about 573 events per billion for a two sided (+ or -) deviation, or about 286 if we just look at one half… let’s look:
five sigma events
We find 302 in the positive, or right tail.
Is this too many? No.
From The Law of Rare Events the standard deviation in the number of rare events is the square root of the expected number or, for 286 expected events, +/- 17 events.
302 is within one sigma of the expected value and thus not a statistically significant difference.
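
For reference, the tail probabilities quoted above can be checked with a short hedged Python snippet (assuming scipy is available; not part of the BigQuery workflow):

from math import sqrt
from scipy.stats import norm
print 1e9 * norm.sf(6.0)        # expected +6 sigma exceedances per billion: about 1
print 1e9 * norm.sf(5.0)        # expected +5 sigma exceedances per billion: about 287
print sqrt(1e9 * norm.sf(5.0))  # about 17, one standard deviation of that rare-event count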

Conclusion: Normal variables look good

Opportunity for Future work

Now that we have a way to produce standard random normal variables, it should be possible to build random walks as cumulative sums. Then, from a random walk, one can do Monte Carlo stock options pricing.
According to Stack Overflow, Google BigQuery supports cumulative sums
Further exploration of random walks is left to the reader or a future article.
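For intuition only, here is a hedged numpy sketch (outside BigQuery) of what "random walks as cumulative sums" means; the BigQuery version is indeed left for future work:

import numpy as np
z = np.random.randn(252)          # a year of daily standard normal shocks
walk = 100.0 + np.cumsum(z)       # additive random walk starting at 100
print walk[-1], walk.min(), walk.max()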

Conclusions

With some care and the lessons from this article, Google BigQuery can, in its current form, function as an interesting, massively parallel scientific calculator and simulation platform.
It might be even better if rand() were a little more well-behaved. Although this article did not consider whether the random number generator passes statistical quality tests or standards for random generation, even if the rand function failed such tests it would be a fairly simple matter to upload a fairly large table of properly generated random numbers...
We don’t know whether Google wants to be in the simulation generation space, or only the data analysis space. This will probably be more clear as Google BigQuery develops, or if Google develops additional tools or products that can leverage their parallel processing capabilities and help users avoid the steep learning curves that otherwise exist for setting up and using most modern parallel technologies.

Tuesday, March 17, 2015


How to Publicly Share a Google BigQuery Dataset: Sharing ~7GB of Loan Modification Public Data

This is the 2nd in a series of articles for Sunshine Week at blog.sunshineonacloudy.net concerning Google BigQuery and ways to share Big Data, or in this case Big Gov't Data, so that others may have a look.

UPDATES:  Cunningham's Law worked well with this post.  Thanks to Felipe Hoffa for some inside info as to how free usage works.

Data Access TL;DR: If you want to see the shared loan modification data, try
https://bigquery.cloud.google.com/table/eaftc-free:unofficialhamp.mods201501

Goals:

  1. share my “unofficial” Google BigQuery dataset of loan modifications from the US Treasury’s public data dumps. This raw data is free, mostly a copy of US Treasury public domain data, and comes with NO WARRANTY.
  2. keep my bill to a minimum and require you to pay your own Google processing bill for extensive processing. We are willing to pay for storage only (about $0.25/month). The processing bill is pay-as-you-go,  with the first TB apparently free from Google and should be less than one US cent per query thereafter for most simple queries against the dataset. But it could be $500/day if you start analyzing other, huge, datasets such as those from Wikipedia.
  3. I’d like to credit my company, Economic and Financial Technology Consulting LLC with some good karma. We’re always looking for new business and interesting problems to solve. It looks like Google BigQuery is a great solution for keeping multi-GB operations in the human patience zone for interactive queries.

Names

Shakespeare’s “A rose by any other name would smell as sweet” clearly doesn’t apply to Google BigQuery’s fully qualified dataset names.
The format of a fully qualified BigQuery dataset name is:
[project:dataset.table] 
You want people to be able to find the data, know what it is and what it isn’t, and not be confused or hate the experience.
Since this is free data, with storage sponsored by my company abbreviated eaftc, consisting of an unofficial copy of the US Treasury Making Home Affordable HAMP Data Files, we are going with these fully qualified names:
[eaftc-free:unofficialhamp.mods201501]
[eaftc-free:unofficialhamp.npvs201501]
and, unfortunately, we can not rename what we already have to these names. So, we have to create a new project and upload the data all over again. After some thought, I think this is a good thing. I’d like to keep my own copy of the tables and any ad-hoc analyses I do separate from others’ usage of the data.
The date 201501 indicates cumulative data through Jan 2015.

Raw Data and Documentation Archive

A free archive of the raw data is available in our unofficial-hamp-201501 publicly shared Google Drive folder.
The US Treasury has changed the data formats at least once in the past, and they add new data from time to time…. so this archive is necessary for maintaining the provenance of the data. Anyone who wants to use the original raw data outside of BigQuery for, say, a college database course or to test it against what is in the eaftc-free:unofficialhamp BigQuery tables is welcome to do so. The archive contains:
  • Raw .zip files as downloaded from Treasury web site
  • python script unfubarcsv.py used to rewrite the csv files to remove extra commas and quotes, in preparation for Google BigQuery
  • Data dictionaries and usage guide as downloaded from Treasury web site
  • Google BigQuery schemas, in JSON format, for the mods and npvs tables

Creating and Sharing The Free Dataset

Note: Internet users seeking to access the free dataset can safely skip this section.
This section explains how to share. It does not explain how to access the shared data.
Steps we took to create and share the free dataset. The past tense (went, created, made) is used to emphasize that these are not steps for others to use the data. We had to do this, exactly once, to set up the data and share it with everyone else.
  1. from a web browser, logged in to Google, e.g. Gmail
  2. went to the Google API Console
  3. Signed up and turn on Billing if not already set up.
  4. created a new project. I set my project-id to eaftc-free Note that this was a permanent setting and was required to be unique across everyone’s Google API Projects globally on Earth. And I can not easily change the id later. Changing project-id would require deleting and recreating the project and all its stored files and datasets.
  5. enabled Google BigQuery API and Google Cloud Storage API for the project. There is a list of APIs on the web console and you can click to turn them on and off.
  6. ssh to a computer that has the prepared data and Google Cloud SDK already installed.
  7. from the bash shell, ran gcloud auth login, open the huge super-secret URL in a browser and authenticate
  8. set the project: gcloud config set project eaftc-free
  9. made a cloud storage bucket: gsutil mb gs://eaftc-free The bucket name must be globally unique among gs bucket names on Earth.
  10. Copied the mods table .csv data file to Google Cloud Storage: gsutil cp ./HMP_Public_User_Mod_Data_All_20150116-unfubar.csv gs://eaftc-free/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv
  11. Copied the npvs table .csv data file to Google Cloud Storage: gsutil cp ./HMP_Public_User_NPV_Data_All_20150116-unfubar.csv gs://eaftc-free/HMP_Public_User_NPV_Data_All_20150116-unfubar.csv
  12. Went to the BigQuery API web console and clicked the little down arrow to add a dataset unofficialhamp to the eaftc-free project. In the same step, was prompted to add a description to the dataset for future viewers.
  13. Back to the ssh session to load the cloud .csv files into BigQuery.
  14. Took 5 min to load ~ 5GB into the mods table: bq load --skip_leading_rows=1 unofficialhamp.mods201501 gs://eaftc-free/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv ./mod.bq
  15. Took 3 min to load ~2 GB into the npvs table: bq load --skip_leading_rows=1 unofficialhamp.npvs201501 gs://eaftc-free/HMP_Public_User_NPV_Data_All_20150116-unfubar.csv ./NPV.bq
  16. Back at the BigQuery API Web Console, clicked the second tiny down arrow next to the data set name unofficialhamp to bring up the Share Dataset window. Under Add people selected All Authenticated Users. This means everyone on Earth with a Google account and BigQuery set up. Unchecked Notify people via email (probably wouldn’t do anything…) set everyone to Can View and clicked Add. Then clicked Save changes at bottom of dialog. Received a popup that the dataset has been shared.
  17. Done. Kinda. Except: No one knew it was there. The existence of the dataset still needed to be publicized in, say, a blog (like the one you are reading right now) or on social media.
  18. Cleaned up by deleting the Cloud Storage bucket and associated data.
Note: publicdata:samples is Google promoted public data. Currently (March 2015) there doesn’t seem to be an obvious way for users to publicize public datasets in a way that makes them easy to discover within the BigQuery platform. Hopefully that will change in the future.

How Internet Users can Access the Shared “Unofficial” Dataset on HAMP Loan Modifications


Try this first!  Thanks Felipe!

According to Felipe Hoffa, in a comment below and also some private messages I could not attend to immediately, you should try this first to see if you can get access without a credit card:

1) people can directly access the table, instead of doing the step-by-step:
https://bigquery.cloud.google.com/table/eaftc-free:unofficialhamp.mods201501
 …

2) No credit card needed! People can query up to a terabyte of data for free every month. They need to sign up for google cloud, but no credit card.  

If that doesn't work for any reason, here is what I had posted previously: 

  1. Login to Google
  2. Go to https://console.developers.google.com and have your credit card ready. You must sign up for Google BigQuery and leave a credit card to pay your own Google BigQuery data processing bill. At $5/TB, most queries of the 7 GB (= 0.007 TB) Unofficial HAMP dataset will cost less than $0.05, and many less than $0.01, each time you run a query. Processing larger datasets (such as free datasets from wikipedia and github) can run up larger fees faster. And there's always the ugly possibility of identity theft. I think the service shuts down at $500/day unless larger limits are established. Scary stuff, but you should know the potential damage if you lose control of your account. Keep an eye on things, use good passwords, and perhaps turn on two-factor authentication on your Google account.
  3. In the Google Developer’s API console add a project (call it anything you like) and sign up for the Google BigQuery API
  4. Go to your project in the Google BigQuery console and add the shared dataset.
The Google BigQuery docs describe this as follows:
Switch to a shared dataset
It is possible for another user to share a dataset with you but not
add you as a user to the corresponding Google Developers Console
project. If that happens, the dataset does not automatically appear in
the projects list, and you need to manually add it using the Display
Projects option.
To manually add a dataset:
Select the display menu icon next to the project name.
Hover over the Switch to Project option.
Click the Display Project... option.
On the popup that appears, enter the project ID of the dataset that was shared with you.
Your shared dataset now appears at the bottom of the dataset list.
In Switch to Project–>Display Project–>Add Project/Project ID enter:
eaftc-free
to add the publicly shared eaftc-free dataset from Economic and Financial Technology Consulting LLC containing the unofficialhamp dataset.
You will see eaftc-free:unofficialhamp added to the projects list in the left sidebar.
If you click the expansion arrow it will show the table names in the shared dataset:
 mods201501
 npvs201501
You can now click to select the mods201501 table and run a query by selecting the “query table” from the right hand side.


Running an example query


Let’s run a query to get counts of the loan statuses on file in the mods table.
Look at the top of the screen. The New Query box should show as a default
SELECT FROM [eaftc-free:unofficialhamp.mods201501] LIMIT 1000    
Change that to:
SELECT ln_st_nme, ln_mdfc_mode_nme, count(*) as N FROM [eaftc-free:unofficialhamp.mods201501] GROUP BY ln_st_nme, ln_mdfc_mode_nme ORDER BY N DESC LIMIT 20;
and check that the little check mark has turned from red to green. Green indicates a valid query. Click the green check mark for a preview of the processing charges for running the query. This query costs 230 MB of processing. 1 TB is $5, so that's (230/1,000,000) x $5 ~= $0.001.
Click the RUN QUERY button. The Query Results should match the following:

What is this Query telling us?

Notice this query result is telling us that most of the records (3.8 million) in the mods201501 table are HAMP requests that were not approved. There are other database fields that can provide additional explanation, and perhaps a basis for writing another article. Only about 1 million records are for “Official Modifications” with “Active Payment” status.

Saving the Query as a Table or View

You can do anything you like with the free dataset, but you have to save any views or results into your own project. You will be paying Google for BigQuery storage charges for any new tables you create (around $0.02/GB-Month as of March 2015).

Conclusion

This article explained how to share a dataset with the world.
It is important to know what you want to call the public dataset before you do all the work of loading it into BigQuery.
To share a dataset, you have to be willing to pay the storage charges for the tables you upload. You do not have to pay for the data processing charges. These can be paid by those using the tables.

Next steps

After reading this article, you should be able to:
  • access the eaftc-free:unofficialhamp dataset to follow along with other explorations reported on this blog
  • share your own “Big Data” project on Google BigQuery with the world some day

A final word of warning

If you turn on billing and leave a credit card and are concerned about possible billing problems or abuse, I believe you can turn off BigQuery billing any time you like – provided you are only using others’ shared data or perhaps only trying out BigQuery for an afternoon and have no permanent intention of using it.

If you have a storage bill for data you are using or sharing with the world, then you're stuck paying that as long as Google is hosting the datasets you are sharing.
Also, keep in mind that Google BigQuery is still subject to change, I do not represent Google Inc. in any way, and am certainly not a free support desk for answering questions about their products.
I wrote this article in the hope that it is useful and will save other users some time and aggravation, or at least let people know if they are on the right track and what to expect.
Some day, the technical information presented here may be obsolete.
If you have questions about the procedures or encounter errors, do a Google search or Ask on Stack Overflow. These methods are the fastest and correct way to find help.

Saturday, March 14, 2015


Comparing MySQL and Google BigQuery with a bit of unofficial analysis of The US Treasury’s Loan Modification Data

This is my first blog post for sunshineonacloudy.net

The goal is to compare mySQL and Google BigQuery within the context of a real-world financial application: specifically, taking a first look at some of the US Treasury’s data from loan modifications. Enjoy!

Overview

Google BigQuery is a Google-hosted SQL-like database that scales your data out to a cluster of machines in the Google infrastructure.

Google’s Big Query Demo https://cloud.google.com/bigquery/

Good things about Google BigQuery
* in this article, runs 10x faster than MySQL on a table made from 5GB of csv data
* according to Tigani and Naidu, Google BigQuery Analytics will scale out to match the data size, and can run full column scans as fast as ~ 50GB/sec
* SQL-like interface has SELECT, JOIN, WHERE, GROUP BY, and various functions
* quickly loads csv and json files; also takes streaming input
* (json) fields may contain deeper structure such as an array or more fields
* fast network
* web console. https://bigquery.cloud.google.com
* Google account access control
* Two levels of sharing (paid and unpaid)
* BQ pricing as of Mar 2015: $0.02/GB*Month storage + $5/TB processing

Bad
* lacks indexing feature => slow, expen$ive pinpoint queries
* csv importer not fully RFC 4180 compliant
* need credit card to analyze any private data

Different things about Google BigQuery
* needs its own json schema
* JOIN on large tables requires JOIN EACH or it fails
* bq command line queries need a -n flag to return >100 rows
* lacks DATE type. wants full date & time or unix timestamp
* lacks UPDATE, ALTER TABLE ADD COLUMN
* nothing smaller than 64 bit floats/integers

Unknowns (for me)
* do importers/schema converters exist for popular databases?


Concrete Example: Loan modification data from the US Treasury

Background

  • In the mid-2000s, a price bubble in the real estate market coincided with easy low interest rate financing from adjustable rate mortgages (ARMs). Borrowers defaulted in the late 2000s as these ARM loans were structured to rise from a low rate, like 1-3%, to a higher rate, like 6-10%, after an initial period of 1-7 years, and the borrowers were unable to pay the higher rates. Refinancing to “get cash out” also became impossible as real estate values declined below the amounts owed.
  • The Obama administration approved a number of assistance programs including HAMP, a loan modification program.
  • Various algorithms were defined for HAMP
  • Lenders began processing applications (loan mod hell)
  • Anonymized Data was made public on ~150 aspects of each application

Why?

Mostly Prior Personal/Professional Interest:

  • Before the financial crisis, I participated in a startup that had, as its goal, a mitigation of the crisis through new loan products. Obviously this was not adopted but some progress was made and a patent awarded.
  • Subsequently I ran a free “Unofficial HAMP Loan Modification Calculator” at http://armdisarm.com that provided a javascript app for calculating the details of modifications from procedures disclosed in the MHA Handbook, together with a nodeJS/sqLite-based lookup of some example loan modifications from an earlier Treasury .csv file.

While some may find the subject matter of loan mods a tad dry, the real estate market is the largest valued market of US assets. Programs like HAMP represent a government intervention into the market to potentially alter the loans of millions of distressed borrowers. In this age of government spying on the citizen, it is interesting to see what data the government is willing to disclose on such a historic, large-scale economic intervention.

Downloading the Data

We want the HAMP “Mods” table
* 6 million+ rows, 100+ columns, 5GB uncompressed
* CSV format
* Metro location, borrower gender, age, race, income, before/after monthly expenses, interest rates, property taxes, HOA fees, current payment status, rejection reasons
* Data quality issues (missing values, out-of-range values, units issues)

For other analysis we also want the other tables, but to narrow the scope we will primarily concern ourselves with the “Mods” table.

To download and unzip:

wget http://www.treasury.gov/initiatives/financial-stability/reports/Reports/HMP_Public_User_Mod_Data_All.zip

unzip HMP_Public_User_Mod_Data_All.zip

Prepping the Raw Data

Usually it isn’t a good idea to feed raw data directly to your database and expect everything to work out perfectly the first time.

Garbage In. Garbage out.

Usually a cycle of loading and testing is required. Here was no different:

For mySQL, NULL entries in a CSV need to be ,NULL,, not ,"",

sed -e 's/,""/,NULL/g' HMP_Public_User_Mod_Data_All.csv > sqlMod.csv

For Google BigQuery, CSV data should not contain additional commas or quotes.

python unfubarcsv.py HMP_Public_User_Mod_Data_All.csv bqMod.csv

My Python script unfubarcsv.py does these things:

  • ensure each CSV field is quoted, and contains no interior quotes
  • within a field, remove commas in numbers: 10,000 => 10000
  • within a field, replace remaining commas with spaces: Atlanta, GA => Atlanta GA

    import csv
    import sys
    import re
    infile = open(sys.argv[1], 'r')
    outfile = open(sys.argv[2], 'w')
    csvin = csv.reader(infile)
    csvout = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for rowin in csvin:
        rowout = []
        for origfield in rowin:
            # strip a surrounding quote if present
            field = origfield[1:-1] if origfield[0:1] == '"' else origfield
            # replace interior quotes with spaces
            field = field.replace('"', ' ')
            field = field.replace("'", " ")
            # remove commas in numbers: 10,000 => 10000
            field = re.sub(r'(\d+),(\d+)', r'\1\2', field)
            # replace remaining commas with spaces: Atlanta, GA => Atlanta GA
            field = field.replace(',', ' ')
            rowout.append(field)
        csvout.writerow(rowout)

Note: Python table scan and edit takes about 10 sec per million rows in one core of AMD FX8150

Schemas

At a minimum, a database schema is a definition of all the columns in each table.

mySQL is a strongly-typed system, and distinguishes between a one byte TINYINT integer and an 8 byte BIGINT, FLOATs and DECIMALs, and DATEs, DATETIMEs and TIMESTAMPs.

Google BigQuery has fewer basic types: integer, float, string, and timestamp. There is also a record type that is basically an implementation of JSON and thus potentially competes with noSQL JSON databases like mongoDB.

A mySQL schema is independent of the order of the input data as the input data order is typically specified later in LOAD DATA INFILE or INSERT statements.

Here’s a mySQL schema that will work for the mods table.

CREATE TABLE mods ( 
brwr_amdfc_mthy_hsng_exp_amt DECIMAL(10,2) NULL,
brwr_bir_dt VARCHAR(16) NULL,
brwr_bmdfc_mthy_hsng_exp_amt DECIMAL(10,2) NULL,
brwr_ctbn_amt DECIMAL(10,2) NULL,
brwr_ethcy_typ_cd TINYINT NULL,
brwr_ethcy_typ_nme VARCHAR(80) NULL,
brwr_excn_dt DATE NULL,
brwr_hmda_data_src_cd TINYINT NULL,
brwr_hmda_data_src_nme VARCHAR(80) NULL,
brwr_mthy_dpmt_xcldg_piti_amt DECIMAL(10,2) NULL,
brwr_mthy_grs_incm_amt DECIMAL(10,2) NULL,
brwr_race VARCHAR(255) NULL,
brwr_sex_typ_cd TINYINT NULL,
brwr_sex_typ_nme VARCHAR(80) NULL,
brwr_t2_mthy_grs_rntl_incm_amt INT NULL,
brwr_t2_prr_tot_hexp_amt DECIMAL(10,2) NULL,
cbrwr_ethcy_typ_cd TINYINT NULL,
cbrwr_ethcy_typ_nme VARCHAR(80) NULL,
cbrwr_race VARCHAR(255) NULL,
cbrwr_sex_typ_cd TINYINT NULL,
cbrwr_sex_typ_nme VARCHAR(80) NULL,
fncl_ast_id BIGINT NOT NULL PRIMARY KEY,
ivsr_grp VARCHAR(80) NULL,
last_rptd_mthy_acvy_dt DATE NULL,
ln_acvy_actn_cd TINYINT NULL,
ln_acvy_actn_dt DATE NULL,
ln_acvy_actn_nme VARCHAR(80) NULL,
ln_aft_mdfc_amrt_term SMALLINT NULL,
ln_aft_mdfc_bck_rto_pct FLOAT NULL,
ln_aft_mdfc_escr_pmt_amt DECIMAL(10,2) NULL,
ln_aft_mdfc_frnt_rto_pct FLOAT NULL,
ln_aft_mdfc_fst_pmt_due_dt DATE NULL,
ln_aft_mdfc_int_rt FLOAT NULL,
ln_aft_mdfc_lpi_dt DATE NULL,
ln_aft_mdfc_max_int_rt FLOAT NULL,
ln_aft_mdfc_mtry_dt INT NULL,
ln_aft_mdfc_npv_mdl_rslt_amt DECIMAL(10,2) NULL,
ln_aft_mdfc_pi_pmt_amt DECIMAL(10,2) NULL,
ln_aft_mdfc_prdc_lbl_typ_cd TINYINT NULL,
ln_aft_mdfc_prdc_lbl_typ_nme VARCHAR(80) NULL,
ln_aft_mdfc_rmng_term SMALLINT NULL,
ln_aft_mdfc_upb_amt INT NULL,
ln_amdfc_pral_npv_mdl_rslt_amt DECIMAL(10,2) NULL,
ln_atrn_fees_not_in_escr_amt DECIMAL(10,2) NULL,
ln_bef_mdfc_amrt_term SMALLINT NULL,
ln_bef_mdfc_assc_mthy_pmt_amt DECIMAL(10,2) NULL,
ln_bef_mdfc_bck_rto_pct FLOAT NULL,
ln_bef_mdfc_escr_pmt_amt DECIMAL(10,2) NULL,
ln_bef_mdfc_frnt_rto_pct FLOAT NULL,
ln_bef_mdfc_int_rt FLOAT NULL,
ln_bef_mdfc_lpi_dt DATE NULL,
ln_bef_mdfc_mtry_dt INT NULL,
ln_bef_mdfc_npv_mdl_rslt_amt DECIMAL(10,2) NULL,
ln_bef_mdfc_pi_pmt_amt INT NULL,
ln_bef_mdfc_prdc_lbl_typ_cd TINYINT NULL,
ln_bef_mdfc_prdc_lbl_typ_nme VARCHAR(80) NULL,
ln_bef_mdfc_rmng_term SMALLINT NULL,
ln_bef_mdfc_upb_amt INT NULL,
ln_bmdfc_pral_npv_mdl_rslt_amt DECIMAL(10,2) NULL,
ln_dlq_int_cplzd_amt DECIMAL(10,2) NULL,
ln_dlqy_typ_cd TINYINT NULL,
ln_dlqy_typ_nme VARCHAR(80) NULL,
ln_dsbt_frgv_amt DECIMAL(10,2) NULL,
ln_escr_cplzd_adv_amt DECIMAL(10,2) NULL,
ln_fcl_refl_dt VARCHAR(16) NULL,
ln_frbrn_plan_typ_cd TINYINT NULL,
ln_frbrn_plan_typ_nme VARCHAR(80) NULL,
ln_fst_lien_ind VARCHAR(80) NULL,
ln_fst_trl_pmt_due_dt DATE NULL,
ln_fst_trl_pmt_pstd_dt DATE NULL,
ln_fst_trl_pmt_rcvd_amt DECIMAL(10,2) NULL,
ln_hshp_rsn_cd TINYINT NULL,
ln_hshp_rsn_nme VARCHAR(80) NULL,
ln_int_owed_or_not_rptd_amt DECIMAL(10,2) NULL,
ln_int_remd_amt INT NULL,
ln_lpi_dt DATE NULL,
ln_mdfc_eff_dt DATE NULL,
ln_mdfc_eval_dt DATE NULL,
ln_mdfc_int_rt_lock_dt DATE NULL,
ln_mdfc_mode_cd TINYINT NOT NULL,
ln_mdfc_mode_nme VARCHAR(80) NOT NULL,
ln_mtg_typ_cd TINYINT NULL,
ln_mtg_typ_nme VARCHAR(80) NULL,
ln_npv_calc_dt DATE NULL,
ln_orig_note_dt INT NULL,
ln_othr_cplzd_adv_amt DECIMAL(10,2) NULL,
ln_othr_ctbn_amt DECIMAL(10,2) NULL,
ln_pgm_typ_desc VARCHAR(80) NULL,
ln_pgm_typ_nme VARCHAR(80) NULL,
ln_pi_at_31_pct_dti_rto_amt DECIMAL(10,2) NULL,
ln_pi_at_38_pct_dti_rto_amt DECIMAL(10,2) NULL,
ln_ppmt_owed_or_not_rptd_amt DECIMAL(10,2) NULL,
ln_pral_wflrstrn_typ_cd TINYINT NULL,
ln_pral_wflrstrn_typ_nme VARCHAR(80) NULL,
ln_prin_frbrn_amt DECIMAL(10,2) NULL,
ln_prin_rdcn_altv_amt DECIMAL(10,2) NULL,
ln_prin_rdcn_altv_cd TINYINT NULL,
ln_prin_rdcn_altv_nme VARCHAR(80) NULL,
ln_prin_remd_amt INT NULL,
ln_prjd_fcl_sle_dt VARCHAR(16) NULL,
ln_prtl_clm_amt DECIMAL(10,2) NULL,
ln_rpur_typ_cd TINYINT NULL,
ln_rpur_typ_nme VARCHAR(100) NULL,
ln_sec_lien_fncl_ast_id BIGINT NULL,
ln_smss_stat_cd TINYINT NULL,
ln_smss_stat_nme VARCHAR(80) NULL,
ln_splmy_astnce_cd TINYINT NULL,
ln_splmy_astnce_nme VARCHAR(80) NULL,
ln_st_cd TINYINT NOT NULL,
ln_st_nme VARCHAR(80) NOT NULL,
ln_t1_not_ofrd_rsn_cd TINYINT NULL,
ln_t1_not_ofrd_rsn_nme VARCHAR(80) NULL,
ln_tot_cplzd_amt DECIMAL(10,2) NULL,
ln_trl_mdfc_dnal_rsn_cd TINYINT NULL,
ln_trl_mdfc_dnal_rsn_nme VARCHAR(80) NULL,
ln_trl_plan_typ_cd TINYINT NULL,
ln_trl_plan_typ_nme VARCHAR(80) NULL,
ln_trl_pmt_no VARCHAR(80) NULL,
ln_trl_pmt_pstd_dt VARCHAR(80) NULL,
ln_trl_pmt_rcvd_amt VARCHAR(80) NULL,
ln_trl_prd_cnt INT NULL,
ln_umpt_frbrn_plan_durn_cnt INT NULL,
ln_umpt_mthy_frbrn_amt DECIMAL(10,2) NULL,
ln_umpt_plan_frbrn_intn_dt DATE NULL,
ln_upb_amt DECIMAL(10,2) NULL,
ln_upb_frgv_amt DECIMAL(10,2) NULL,
mha_ln_id BIGINT NOT NULL,
perm_mdfc_cncln_rsn_cd TINYINT NULL,
perm_mdfc_cncln_rsn_nme VARCHAR(80) NULL,
prop_cndn_cd TINYINT NULL,
prop_cndn_nme VARCHAR(80) NULL,
prop_geoc_cnsus_msa_cd MEDIUMINT NULL,
prop_ocpy_stat_cd TINYINT NULL,
prop_ocpy_stat_nme VARCHAR(80) NULL,
prop_rgn_cd TINYINT NULL,
prop_rgn_nme VARCHAR(80) NULL,
prop_stdz_st_cd VARCHAR(2) NULL,
prop_typ_cd TINYINT NULL,
prop_typ_nme VARCHAR(80) NULL,
prop_unt_cnt INT NULL,
prop_usg_typ_cd TINYINT NULL,
prop_usg_typ_nme VARCHAR(80) NULL,
prop_valu_as_is_val_amt DECIMAL(10,2) NULL,
prop_valu_dt DATE NULL,
prop_valu_mthd_cd TINYINT NULL,
prop_valu_mthd_nme VARCHAR(80) NULL,
step_schedule VARCHAR(1000) NULL,
trial_mdfc_fout_rsn_cd TINYINT NULL,
trial_mdfc_fout_rsn_nme VARCHAR(80) NULL
) ENGINE=Aria ;

The BigQuery schema is in JSON, and the field list must be ordered to match the order of the fields in each line of the input csv. The mode indicates whether a field can be null ("nullable") or is required ("required", i.e. not null). There is an optional "description" field, unused here but obviously needed, that provides a human-readable label for the data field.

[
    {
        "mode": "required", 
        "name": "fncl_ast_id", 
        "type": "integer"
    }, 
    {
        "mode": "required", 
        "name": "mha_ln_id", 
        "type": "integer"
    }, 
    {
        "mode": "required", 
        "name": "ln_st_cd", 
        "type": "integer"
    }, 
    {
        "mode": "required", 
        "name": "ln_st_nme", 
        "type": "string"
    }, 
    {
        "mode": "required", 
        "name": "ln_mdfc_mode_cd", 
        "type": "integer"
    }, 
    {
        "mode": "required", 
        "name": "ln_mdfc_mode_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_smss_stat_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_smss_stat_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ivsr_grp", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pgm_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pgm_typ_desc", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fst_trl_pmt_due_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fst_trl_pmt_pstd_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fst_trl_pmt_rcvd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_acvy_actn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_acvy_actn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_acvy_actn_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_amrt_term", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_amrt_term", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_assc_mthy_pmt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_atrn_fees_not_in_escr_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_bck_rto_pct", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_bck_rto_pct", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_ctbn_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_bir_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_ethcy_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_ethcy_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_excn_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_race", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_sex_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_sex_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_ethcy_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_ethcy_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_race", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_sex_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_sex_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_orig_note_dt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_dlq_int_cplzd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_dsbt_frgv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_escr_pmt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_escr_pmt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_escr_cplzd_adv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fst_lien_ind", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_fst_pmt_due_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fcl_refl_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_frnt_rto_pct", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_frnt_rto_pct", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_hshp_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_hshp_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_int_owed_or_not_rptd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_int_remd_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_int_rt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_int_rt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mdfc_int_rt_lock_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_lpi_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_lpi_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_prd_cnt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mtg_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mtg_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_mtry_dt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_mtry_dt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_max_int_rt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mdfc_eff_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_mthy_dpmt_xcldg_piti_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_amdfc_mthy_hsng_exp_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_bmdfc_mthy_hsng_exp_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_othr_cplzd_adv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_othr_ctbn_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_pi_pmt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pi_at_31_pct_dti_rto_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pi_at_38_pct_dti_rto_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_pi_pmt_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_remd_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_ppmt_owed_or_not_rptd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_prdc_lbl_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_prdc_lbl_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_prdc_lbl_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_prdc_lbl_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prjd_fcl_sle_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_cndn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_cndn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_geoc_cnsus_msa_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_unt_cnt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_ocpy_stat_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_ocpy_stat_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_rgn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_rgn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_stdz_st_cd", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_usg_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_usg_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_valu_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_valu_mthd_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_valu_mthd_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_rmng_term", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "step_schedule", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_rmng_term", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "trial_mdfc_fout_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "trial_mdfc_fout_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_mdfc_dnal_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_mdfc_dnal_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_pmt_no", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_pmt_pstd_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_pmt_rcvd_amt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_upb_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_upb_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_tot_cplzd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_dlqy_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_dlqy_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_frbrn_plan_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_frbrn_plan_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_lpi_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "last_rptd_mthy_acvy_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_upb_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_mthy_grs_incm_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_npv_calc_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_npv_mdl_rslt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_npv_mdl_rslt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_frbrn_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_rdcn_altv_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_rdcn_altv_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_amdfc_pral_npv_mdl_rslt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bmdfc_pral_npv_mdl_rslt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_rdcn_altv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_upb_frgv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_valu_as_is_val_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_hmda_data_src_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_hmda_data_src_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pral_wflrstrn_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pral_wflrstrn_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_splmy_astnce_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_splmy_astnce_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_plan_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_plan_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_umpt_mthy_frbrn_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_umpt_frbrn_plan_durn_cnt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_umpt_plan_frbrn_intn_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_sec_lien_fncl_ast_id", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prtl_clm_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "perm_mdfc_cncln_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "perm_mdfc_cncln_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_t1_not_ofrd_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_t1_not_ofrd_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mdfc_eval_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_t2_mthy_grs_rntl_incm_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_t2_prr_tot_hexp_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_rpur_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_rpur_typ_nme", 
        "type": "string"
    }
]

Loading

Once mySQL is installed, the CREATE TABLE command of the previous section is run to create the empty table from the schema.

To load data, mySQL has a special statement:

LOAD DATA INFILE '/tmp/ModData.csv'
INTO TABLE mods
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY "\n"
IGNORE 1 LINES
(fncl_ast_id,mha_ln_id,ln_st_cd,ln_st_nme,ln_mdfc_mode_cd,...)

Notice that the last line of the statement lists the fields in the order they appear in the .csv. Here we used head -1 filename.csv >> ModsLoad.sql in the Linux shell to copy the header line from the .csv file into the .sql file, and then added the parentheses in a text editor.

BigQuery, since it is Google-hosted, requires a somewhat longer series of steps to load data:

  1. Set up a project in the Google Developers Console
  2. Enable the Google BigQuery and Google Cloud Storage APIs
  3. Set up billing
  4. Create a dataset in the project
  5. Install gsutil on the data source computer (pip install gsutil)
  6. Copy the bqMod.csv file from the data source computer to Google Cloud Storage
  7. Install bq on some computer; it is already installed on Google Compute Engine machines (bq is now part of the Google Cloud SDK)
  8. gcloud auth login
  9. Use bq to create a table within the dataset from the Cloud Storage .csv file and schema.json
  10. Query the table using bq or the web interface at https://bigquery.cloud.google.com

Making a Google Cloud Storage bucket
gsutil mb gs://hamp201501

Copying the data to the storage bucket
gsutil cp ./HMP_Public_User_Mod_Data_All_20150116-unfubar.csv gs://hamp201501/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv

Creating and loading the table in Google Big Query
bq load --skip_leading_rows=1 hamp201501.mods gs://hamp201501/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv ./mod.bq

Notice that the bq load here takes a Google Cloud Storage URL for the data but a local file for the schema. It would seem natural to also load the schema from Cloud Storage, but that was not successful.

In both MySQL and BigQuery, once the data is loaded, the database no longer depends on the input files. The original input files can be moved, removed, or kept as a backup. Note that Google Cloud Storage charges $0.026/GB (Mar 2015) for files left there, on top of Google BigQuery's own storage charge of $0.02/GB (Mar 2015).

Querying

Interfaces

  • Web console: MySQL relies on 3rd-party tools such as phpMyAdmin; BigQuery provides one at https://bigquery.cloud.google.com
  • Command line client: mysql for MySQL; bq (part of the Google Cloud SDK) for BigQuery
  • API: MySQL is reached over its own protocol on TCP or Unix sockets via various external libraries; BigQuery is reached via the BigQuery client libraries or its REST API over HTTP(S)

Performance

Each operation below shows the query, then the MySQL and BigQuery timings (and the data BigQuery scanned):

  • Find a record by unique id
    select * from hamp201501.mods where fncl_ast_id=44448899004
    MySQL: a few ms (PRIMARY KEY). BigQuery: 8 sec (!), scans 4.4GB (lacks indexing).

  • Count all rows
    select count(*) from hamp201501.mods
    MySQL: a few ms (Aria table). BigQuery: 1.7 sec, scans 0B.

  • Group, count, and sort loan hardships
    select ln_hshp_rsn_cd, ln_hshp_rsn_nme, count(*) as N from hamp201501.mods group by ln_hshp_rsn_cd, ln_hshp_rsn_nme order by N desc
    MySQL: 18 sec (no indexes). BigQuery: 1.5 sec, scans 71MB.

  • Group by metro area, average the home values in the eligible value range, and sort by average value from highest to lowest
    select prop_geoc_cnsus_msa_cd, avg(prop_valu_as_is_val_amt) as avg_prop_value from hamp201501.mods where prop_valu_as_is_val_amt between 1 and 800000 group by prop_geoc_cnsus_msa_cd order by avg_prop_value desc
    MySQL: 17.5 sec. BigQuery: 1.7 sec, scans 65MB.

  • Join the mods table to the npvs table, group by metro area, and count the strange HAMP applications where the npv table reports a property value at least twice the value reported in the mods table
    MySQL (~46 sec):
    select prop_geoc_cnsus_msa_cd, count(*) as N from mods, npvs where mods.fncl_ast_id=npvs.fncl_ast_id and mods.prop_valu_as_is_val_amt > 0 and mods.prop_valu_as_is_val_amt < 0.5*npvs.prop_valu_as_is_val_amt group by prop_geoc_cnsus_msa_cd order by N desc limit 5
    BigQuery (~3.5 sec, scans 147MB, uses the non-standard JOIN EACH):
    select prop_geoc_cnsus_msa_cd,count(*) as N from hamp201501.mods as t1 JOIN EACH hamp201501.npvs as t2 ON t1.fncl_ast_id=t2.fncl_ast_id where t1.prop_valu_as_is_val_amt > 0 and t1.prop_valu_as_is_val_amt < 0.5*t2.prop_valu_as_is_val_amt group by prop_geoc_cnsus_msa_cd order by N desc limit 5

Note that the join of large tables in BigQuery currently requires a size hint in the form of JOIN EACH. Otherwise, BigQuery will abort queries JOINing large tables.

mysql vs bq Command Line Tools

running interactively

Either mysql or bq can be run in an interactive shell mode:

mysql -u root -p

bq shell

running commands from a file

If hardshipList.sql contains the sql for the Loan Hardship Reasons query above, we can write command-line queries as follows:

mysql -u root -p hamp201501 <hardshipList.sql

bq query "$(cat hardshipList.sql)"

The "$(cat ...)" is not part of Google BigQuery. It is part of the Linux Bash shell. In Bash, "$(cat somefile)" simply inserts the contents of the file (the SQL commands) into the command line.

You can also enter the SQL directly in bq like this:

bq query "select stuff, morestuff, evenmorestuff from sometable where this=that and foo=bar"

output options

limits on number of rows returned

As far as I know, MySQL only limits the number of rows returned in response to an explicit LIMIT clause in an SQL query.

bq query has a default row limit that can surprisingly truncate results. I suppose Google’s engineers felt result limiting was necessary for big data so that beginners do not bring down the system with a limitless select * from ...

The row limit for an individual query can be redefined with the -n option

bq query -n 1000 "select ..." will return 1000 rows

formatting

MySQL's default output is a pretty-printed table when interactive, and tab-separated values otherwise.

The command line option -H produces an HTML table; -X produces XML.

BQ’s default is a pretty printed table like this:

paul@home:~/treasury-data-hamp-Feb-2015$ bq query "$(cat hardshipList.sql)"
Waiting on bqjob_r2dcfef6105951f57_0000014c1026d2f8_1 ... (0s) Current status: DONE    
+----------------+-----------------------------------+---------+
| ln_hshp_rsn_cd |          ln_hshp_rsn_nme          |    N    |
+----------------+-----------------------------------+---------+
|           NULL |                                   | 3854918 |
|              6 | Curtailment of income             | 1305375 |
|             14 | Other                             |  390404 |
|              7 | Excessive obligation              |  237070 |
|             15 | Unemployment                      |  147586 |
|              2 | Illness of principal borrower     |   86661 |
|              5 | Marital difficulties              |   46282 |
|             24 | Unable to contact borrower        |   27194 |
|              1 | Death of borrower                 |   26637 |
|              3 | Illness of borrower family member |   24100 |
|             16 | Business failure                  |   21529 |
|             20 | Payment adjustment                |   17097 |
|              4 | Death of borrower family member   |   15865 |
|             21 | Payment dispute                   |    7189 |
|             19 | Servicing problems                |    6739 |
|             23 | Fraud                             |    4235 |
|             17 | Casualty Loss                     |    2796 |
|             12 | Inability to rent property        |    2223 |
|             18 | Energy environment costs          |    1914 |
|             10 | Property problem                  |    1798 |
|              9 | Distant employment transfer       |     876 |
|             13 | Military service                  |     855 |
|             25 | Incarceration                     |     751 |
|             11 | Inability to sell property        |     748 |
|             22 | Transfer of ownership pending     |     592 |
|              8 | Abandonment of property           |     538 |
+----------------+-----------------------------------+---------+

There is a --format option, bq --format X query ..., where X can be one of:

pretty: formatted table output
sparse: simpler table output
prettyjson: easy-to-read JSON format
json: maximally compact JSON
csv: csv format with header

(source: output of bq --help)

Unlike the mysql client, there is no HTML table or XML output.

You probably won't miss it.

bq can provide JSON output, which is ultimately more useful than HTML or XML, especially at the API level.

The csv format, oddly enough, is unquoted, like this:

paul@home:~/treasury-data-hamp-Feb-2015$ bq --format csv query "$(cat hardshipList.sql)"
Waiting on bqjob_r3f61cd54c48083df_0000014c102ebd6b_1 ... (0s) Current status: DONE    
ln_hshp_rsn_cd,ln_hshp_rsn_nme,N
,,3854918
6,Curtailment of income,1305375
14,Other,390404
7,Excessive obligation,237070
15,Unemployment,147586
2,Illness of principal borrower,86661
5,Marital difficulties,46282
24,Unable to contact borrower,27194
1,Death of borrower,26637
3,Illness of borrower family member,24100
16,Business failure,21529
20,Payment adjustment,17097
4,Death of borrower family member,15865
21,Payment dispute,7189
19,Servicing problems,6739
23,Fraud,4235
17,Casualty Loss,2796
12,Inability to rent property,2223
18,Energy environment costs,1914
10,Property problem,1798
9,Distant employment transfer,876
13,Military service,855
25,Incarceration,751
11,Inability to sell property,748
22,Transfer of ownership pending,592
8,Abandonment of property,538

Query Results for the Curious

Highest Average Home Price Metro Areas from HAMP applications

Row prop_geoc_cnsus_msa_cd  avg_prop_value
1   46520 (Urban Honolulu, HI; code in use since 2013)   437832.0
2   27980 (Kahului-Wailuku-Lahaina, HI)                   415797.0
3   26180 (Honolulu, HI)                                  413659.0
4   41940 (San Jose-Sunnyvale-Santa Clara, CA)            395081.0
5   42100 (Santa Cruz-Watsonville, CA)                    379636.0

HAMP Applications where much higher property values are reported in the npv table vs. the mod table, by metro area

MariaDB [hamp201501]> select prop_geoc_cnsus_msa_cd, count(*) as N from mods, npvs where mods.fncl_ast_id=npvs.fncl_ast_id and mods.prop_valu_as_is_val_amt>0 and mods.prop_valu_as_is_val_amt<0.5*npvs.prop_valu_as_is_val_amt group by prop_geoc_cnsus_msa_cd order by N desc limit 5;
+------------------------+-----+
| prop_geoc_cnsus_msa_cd | N   |
+------------------------+-----+
|                  47900 | 463 |   (Washington DC Alex/Arl VA)
|                  33100 | 410 |   (Miami FL)
|                  16980 | 398 |   (Chicago IL and burbs)
|                  35620 | 395 |   (NYC and burbs incl NNJ)
|                  12060 | 391 |   (Atlanta and burbs)
+------------------------+-----+
5 rows in set (46.27 sec)

BigQuery

select prop_geoc_cnsus_msa_cd,count(*) as N
from hamp201501.mods as t1
JOIN EACH hamp201501.npvs as t2 ON t1.fncl_ast_id=t2.fncl_ast_id
where t1.prop_valu_as_is_val_amt>0 and 
t1.prop_valu_as_is_val_amt < 0.5*t2.prop_valu_as_is_val_amt
group by prop_geoc_cnsus_msa_cd
order by N desc limit 5;

Query complete (3.5s elapsed, 147 MB processed)

Row prop_geoc_cnsus_msa_cd  N    
1   47900   463  
2   33100   410  
3   16980   398  
4   35620   395  
5   12060   391  

Correcting errors through UPDATE

Errors in Data. It happens.

Interest rate example:

In an interest rate column, each of the following inputs probably means 2%/year interest:

  • 2.00 (the rate written in percent)

  • 0.02 (the rate written as a fraction; think POWER(1+r,n))

  • 200 (2.00 with the decimal point dropped, because that pesky . key breaks the flow, and haters hate FLOAT)

We can cross-check other database fields, such as the loan payment for a given loan amount and duration, to ensure that the interest rate is 2% and not 200% or 0.02%, but this is time consuming. While some people prefer a view of the raw data, leaving it as-is becomes a problem when an app takes data from the database and assumes the interest rate is in a particular unit.
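As a rough sketch of that kind of cross-check in MySQL (assuming the remaining term is in months and using the standard amortization formula; the column names come from the schema above, and the 25% tolerance is arbitrary), one could compare the reported P&I payment with the payment implied by reading the rate as percent per year:

-- Hypothetical consistency check, not part of the original workflow:
-- flag rows where the payment implied by the reported rate (read as % per year)
-- is far from the reported before-modification P&I payment.
SELECT *
FROM (
  SELECT fncl_ast_id,
         ln_bef_mdfc_int_rt,
         ln_bef_mdfc_pi_pmt_amt,
         ln_bef_mdfc_upb_amt * (ln_bef_mdfc_int_rt / 1200) /
           (1 - POW(1 + ln_bef_mdfc_int_rt / 1200, -ln_bef_mdfc_rmng_term))
           AS implied_pi_pmt
  FROM hamp201501.mods
  WHERE ln_bef_mdfc_int_rt > 0
    AND ln_bef_mdfc_upb_amt > 0
    AND ln_bef_mdfc_rmng_term > 0
) AS chk
WHERE ABS(implied_pi_pmt - ln_bef_mdfc_pi_pmt_amt) > 0.25 * ln_bef_mdfc_pi_pmt_amt
LIMIT 20;

Rows where the rate was entered as 0.02 or 200 instead of 2.00 would show an implied payment wildly different from the reported one.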

Do we have this problem in Treasury’s HAMP data?

–> Yes. Let’s have a look.

Here we’ll see that the “Loan Before Modification Interest Rate” field appears to be rounded to whole numbers: values between 0.0 and 1.0 are not present, but values above 100 are.

paul@home:~$ bq query "select ln_bef_mdfc_int_rt, count(*) as N from hamp201501.mods group by ln_bef_mdfc_int_rt order by N desc limit 10"
Waiting on bqjob_r5922df3129384f2b_0000014c19c67588_1 ... (0s) Current status: DONE   
+--------------------+---------+
| ln_bef_mdfc_int_rt |    N    |
+--------------------+---------+
|               NULL | 3854918 |
|                7.0 |  624122 |
|                6.0 |  616902 |
|                0.0 |  290885 |
|                8.0 |  249219 |
|                5.0 |  186273 |
|                9.0 |  124665 |
|                4.0 |   86959 |
|               10.0 |   64327 |
|                3.0 |   58358 |
+--------------------+---------+
paul@home:~$ bq query "select ln_bef_mdfc_int_rt, count(*) as N from hamp201501.mods where ln_bef_mdfc_int_rt<1 group by ln_bef_mdfc_int_rt order by N desc limit 10"
Waiting on bqjob_r611b008b5b4d4c1c_0000014c19c7125c_1 ... (0s) Current status: DONE
+--------------------+--------+
| ln_bef_mdfc_int_rt |   N    |
+--------------------+--------+
|                0.0 | 290885 |
+--------------------+--------+
paul@home:~$ bq query "select ln_bef_mdfc_int_rt, count(*) as N from hamp201501.mods where ln_bef_mdfc_int_rt>100 group by ln_bef_mdfc_int_rt order by N desc limit 10"
Waiting on bqjob_r7df9e803f5e3b12a_0000014c19c7373b_1 ... (0s) Current status: DONE
+--------------------+----+
| ln_bef_mdfc_int_rt | N  |
+--------------------+----+
|              543.0 | 99 |
|              688.0 | 53 |
|              650.0 | 43 |
|              700.0 | 35 |
|              638.0 | 34 |
|              675.0 | 33 |
|              663.0 | 28 |
|              725.0 | 28 |
|              625.0 | 27 |
|              750.0 | 26 |
+--------------------+----+

If we look at the loan rate after modification, we find that values are not rounded and values between 0.0 and 0.20 are also present in the top 10 results.

bq query "select ln_aft_mdfc_int_rt, count(*) as N from hamp201501.mods group by ln_aft_mdfc_int_rt order by N desc limit 10"Waiting on bqjob_r7eb2a38506e0038c_0000014c19ca7d33_1 ... (0s) Current status: RWaiting on bqjob_r7eb2a38506e0038c_0000014c19ca7d33_1 ... (0s) Current status: DONE   
+--------------------+---------+
| ln_aft_mdfc_int_rt |    N    |
+--------------------+---------+
|               NULL | 3854918 |
|                2.0 | 1233142 |
|              4.125 |   47809 |
|                4.0 |   46445 |
|               4.25 |   42632 |
|                5.0 |   42094 |
|               0.02 |   41414 |
|               4.75 |   35550 |
|              4.875 |   33509 |
|                4.5 |   32931 |
+--------------------+---------+

The big values are in the after modification interest rate too, though there are fewer of these.

paul@home:~$ bq query "select ln_aft_mdfc_int_rt, count(*) as N from hamp201501.mods where ln_aft_mdfc_int_rt>100 group by ln_aft_mdfc_int_rt order by N desc limit 10"
Waiting on bqjob_r61251cf0d55e135d_0000014c19cc6c18_1 ... (0s) Current status: DONE
+--------------------+-----+
| ln_aft_mdfc_int_rt |  N  |
+--------------------+-----+
|              211.1 | 100 |
|              200.0 |  11 |
|              237.5 |   2 |
|              537.5 |   1 |
|              225.0 |   1 |
|              375.0 |   1 |
|              475.0 |   1 |
|              412.5 |   1 |
|              387.5 |   1 |
|              362.5 |   1 |
+--------------------+-----+

How to Fix it?

Like most simple choices, this is not without controversy:

  • If you fix it, you are destroying data.
  • Does the fact that the rate is in the “wrong units” have value?
  • The names of the lenders are not disclosed in the HAMP data files, but perhaps the big rates come from banks A and B and the tiny rates from banks C and D.
  • The HAMP program was controversial and many lenders initially failed to comply. Errors can prevent proper operation of some programs yet still look correct to a human.
  • One should always leave a way to get back to the raw data, even if that involves going back to the CSV files; one such path is sketched below.
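One cheap way to keep that path in MySQL, sketched here with a hypothetical backup column name, is to copy the raw values into an extra column before running any UPDATE:

-- Hypothetical backup column; not part of the original schema or workflow.
ALTER TABLE hamp201501.mods ADD COLUMN ln_bef_mdfc_int_rt_raw FLOAT NULL;
UPDATE hamp201501.mods SET ln_bef_mdfc_int_rt_raw = ln_bef_mdfc_int_rt;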

In MySQL you could update a column with some nested IF statements like this:

UPDATE hamp201501.mods
SET ln_bef_mdfc_int_rt = IF(ln_bef_mdfc_int_rt BETWEEN 0.01 AND 0.20, 100*ln_bef_mdfc_int_rt,
                         IF(ln_bef_mdfc_int_rt BETWEEN 100 AND 2000, ln_bef_mdfc_int_rt/100.0,
                            ln_bef_mdfc_int_rt));

We need a similar UPDATE for ln_aft_mdfc_int_rt.
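A sketch of that analogous statement, applying the same unit heuristics to the after-modification rate (not necessarily the exact statement used):

UPDATE hamp201501.mods
SET ln_aft_mdfc_int_rt = IF(ln_aft_mdfc_int_rt BETWEEN 0.01 AND 0.20, 100*ln_aft_mdfc_int_rt,
                         IF(ln_aft_mdfc_int_rt BETWEEN 100 AND 2000, ln_aft_mdfc_int_rt/100.0,
                            ln_aft_mdfc_int_rt));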

This takes ~21 sec per query on an AMD FX 8150 running MySQL.

To test the update, we can count the offending entries that remain:

MariaDB [(none)]> select count(*) from hamp201501.mods where ln_bef_mdfc_int_rt between 0.01 and 0.20 or ln_bef_mdfc_int_rt between 100.0 and 2000.0 or ln_aft_mdfc_int_rt between 0.01 and 0.20 or ln_aft_mdfc_int_rt between 100.0 and 2000.0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (17.90 sec)

BigQuery has no UPDATE or ALTER TABLE

This is initially a problem when it comes to fixing up buggy data.

The utility of UPDATE is obvious. However, there is also utility in not having UPDATE: you know the data hasn’t been altered.

ALTER TABLE MODIFY COLUMN comes in handy when there is a need to change types. For instance, it could be necessary to change from an integer type to a floating or decimal type.

In the Treasury Loan Mod data, my automated schema script initially identified ln_bef_mdfc_int_rt as an integer type because there were no decimal entries in the data. Later I hand-edited the schema before importing the data into MySQL to make all _rt variables FLOAT (not DECIMAL, since a rate is not a monetary amount and will go into floating point math routines anyway). Obviously the original data was rounded, since interest rates like “3.25%” are quite common. To turn entries like 325 into 3.25, it would have been necessary to change the column type first.
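In MySQL/MariaDB such a type change is a one-line statement; a sketch of what it would look like had the column been loaded as an integer (not a step actually taken here, since the schema was edited before loading):

ALTER TABLE hamp201501.mods MODIFY COLUMN ln_bef_mdfc_int_rt FLOAT NULL;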

As far as I know, Google BigQuery doesn’t support changing the schema as a stand alone operation.

Options:

A: Create a new table from a SELECT over all the data of the old table. The query will probably take only a few seconds to execute, but it will process the entire table’s worth of data. Constructing the query itself is annoying and would probably require a short program to convert the schema into a SELECT expression list, together with manual text editing to get the desired result.

B: Use the nested IF above to create a new table of corrected values together with a row identifier, and use JOIN EACH. This might be fine if the number of corrections is small.

C: Use the nested IF above whenever corrected values are needed, as sketched below. This might be more palatable if BigQuery had stored functions.
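For example, a sketch of option C in the legacy BigQuery SQL of the time (the grouping column is purely illustrative), applying the unit correction on the fly inside the query rather than to the stored table:

SELECT ln_hshp_rsn_nme,
       AVG(IF(ln_bef_mdfc_int_rt BETWEEN 0.01 AND 0.20, 100*ln_bef_mdfc_int_rt,
           IF(ln_bef_mdfc_int_rt BETWEEN 100 AND 2000, ln_bef_mdfc_int_rt/100.0,
              ln_bef_mdfc_int_rt))) AS avg_bef_rate
FROM hamp201501.mods
GROUP BY ln_hshp_rsn_nme
ORDER BY avg_bef_rate DESC
LIMIT 10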

The conclusion of this section is that BigQuery isn’t designed for the data cleaning workflows common to MySQL. A simple cleanup workflow can probably be re-engineered, but that is unnecessary if existing tools such as MySQL or Python csv-based scripts can clean the data as a batch job and the cleaned data is then loaded into BigQuery.

Conclusion

Google BigQuery is much faster than do-it-yourself MySQL and worthwhile when you have only a little big data, like 5GB. OK, so it isn’t terabytes. But 5GB will not fit in Excel. It won’t fit in a lot of tools. You’ll normally need a database or scripts to chop it into smaller pieces, and BigQuery fits the needs for a read-only database fairly well so long as you don’t use it for pinpoint queries.

While 5GB of csv data will fit in MySQL, with minimal trouble the same queries can run 10x faster on Google BigQuery, and at $5/TB processed, individual queries are likely to cost under a cent. An AMD FX8150 with 16GB RAM (as used for this article) or an equivalent VM-based solution will be much more expensive.

Next steps

Hopefully the foregoing has satisfied your curiosity on the basics of using Google BigQuery and how it compares to do-it-yourself mySQL.

Read a book

Google BigQuery Analytics by
Jordan Tigani and Siddartha Naidu

Build a dashboard?

Google Dashboard for BigQuery article
https://cloud.google.com/bigquery/articles/dashboard