Saturday, March 14, 2015

BigQueryNotes.md

Comparing MySQL and Google BigQuery with a bit of unofficial analysis of The US Treasury’s Loan Modification Data

This is my first blog post for sunshineonacloudy.net

The goal is to compare mySQL and Google BigQuery within the context of a real-world financial application: specifically, taking a first look at some of the US Treasury’s data from loan modifications. Enjoy!

Overview

Google BigQuery is a Google-hosted SQL-like database that scales your data out to a cluster of machines in the Google infrastructure.

Google’s Big Query Demo https://cloud.google.com/bigquery/

Good things about Google BigQuery
* in this article, runs 10x faster than MySQL on a table made from 5GB of csv data
* according to Tigani and Naidu, Google BigQuery Analytics will scale out to match the data size, and can run full column scans as fast as ~ 50GB/sec
* SQL-like interface has SELECT, JOIN, WHERE, GROUP BY, and various functions
* quickly loads csv and json files; also takes streaming input
* (json) fields may contain deeper structure such as an array or more fields
* fast network
* web console. https://bigquery.cloud.google.com
* Google account access control
* Two levels of sharing (paid and unpaid)
* BQ pricing as of Mar 2015: $0.02/GB*Month storage + $5/TB processing

Bad
* lacks indexing feature => slow, expen$ive pinpoint queries
* csv importer not fully RFC 4180 compliant
* need credit card to analyze any private data

Different things about Google BigQuery
* needs its own json schema
* JOIN on large tables requires JOIN EACH or it fails
* bq command line queries need a -n flag to return >100 rows
* lacks DATE type. wants full date & time or unix timestamp
* lacks UPDATE, ALTER TABLE ADD COLUMN
* nothing smaller than 64 bit floats/integers

Unknowns (for me)
* do importers/schema converters exist for popular databases?


Concrete Example: Loan modification data from the US Treasury

Background

  • In the mid-2000s, a price bubble in the real estate market coincided with easy low interest rate financing from adjustable rate mortgages (ARMs). Borrowers defaulted in the late 2000s as these ARM loans were structured to rise from a low rate, like 1-3%, to a higher rate, like 6-10%, after an initial period of 1-7 years, and the borrowers were unable to pay the higher rates. Refinancing to “get cash out” also became impossible as real estate values declined below the amounts owed.
  • The Obama administration approved a number of assistance programs including HAMP, a loan modification program.
  • Various algorithms were defined for HAMP
  • Lenders began processing applications (loan mod hell)
  • Anonymized Data was made public on ~150 aspects of each application

Why?

Mostly Prior Personal/Professional Interest:

  • Before the financial crisis, I participated in a startup that had, as its goal, a mitigation of the crisis through new loan products. Obviously this was not adopted but some progress was made and a patent awarded.
  • Subsequently I ran a free “Unofficial HAMP Loan Modification Calculator” at http://armdisarm.com that provided a javascript app for calculating the details of modifications from procedures disclosed in the MHA Handbook, together with a nodeJS/sqLite-based lookup of some example loan modifications from an earlier Treasury .csv file.

While some may find the subject matter of loan mods a tad dry, the real estate market is the largest valued market of US assets. Programs like HAMP represent a government intervention into the market to potentially alter the loans of millions of distressed borrowers. In this age of government spying on the citizen, it is interesting to see what data the government is willing to disclose on such a historic, large-scale economic intervention.

Downloading the Data

We want the HAMP “Mods” table
* 6 million+ rows, 100+ columns, 5GB uncompressed
* CSV format
* Metro location, borrower gender, age, race, income, before/after monthly expenses, interest rates, property taxes, HOA fees, current payment status, rejection reasons
* Data quality issues (missing values, out-of-range values, units issues)

For other analysis we also want the other tables, but to narrow the scope we will primarily concern ourselves with the “Mods” table.

To download and unzip:

wget http://www.treasury.gov/initiatives/financial-stability/reports/Reports/HMP_Public_User_Mod_Data_All.zip

unzip HMP_Public_User_Mod_Data_All.zip

Prepping the Raw Data

Usually it isn’t a good idea to feed raw data directly to your database and expect everything to work out perfectly the first time.

Garbage In. Garbage out.

Usually a cycle of loading and testing is required. Here was no different:

For mySQL, NULL entries in a CSV need to be ,NULL,, not ,"",

sed -e 's/,""/,NULL/g' HMP_Public_User_Mod_Data_All.csv > sqlMod.csv

For Google BigQuery, CSV data should not contain additional commas or quotes.

python unfubarcsv.py HMP_Public_User_Mod_Data_All.csv bqMod.csv

My Python script unfubarcsv.py does these things:

  • ensure each CSV field is quoted, and contains no interior quotes
  • within a field, remove commas in numbers: 10,000 => 10000
  • within a field, replace remaining commas with spaces: Atlanta, GA => Atlanta GA

    import csv
    import sys
    import re
    infile = open(sys.argv[1],’r’)
    outfile = open(sys.argv[2],’w’)
    csvin = csv.reader(infile)
    csvout = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for rowin in csvin:
    rowout = []
    for origfield in rowin:
    field = origfield[1:-1] if origfield[0:1]==’”’ else origfield
    field = field.replace(r’”’,r’ ‘)
    field = field.replace(r”’”,r” “)
    field = re.sub(r’(\d+),(\d+)’, r’\1\2’, field)
    field = field.replace(r’,’,r’ ‘)
    rowout.append(field)
    csvout.writerow(rowout)

Note: Python table scan and edit takes about 10 sec per million rows in one core of AMD FX8150

Schemas

At a minimum, a database schema is a definition of all the columns in each table.

mySQL is a strongly-typed system, and distinguishes between a one byte TINYINT integer and an 8 byte BIGINT, FLOATs and DECIMALs, and DATEs, DATETIMEs and TIMESTAMPs.

Google BigQuery has fewer basic types: integer, float, string, and timestamp. There is also a record type that is basically an implementation of JSON and thus potentially competes with noSQL JSON databases like mongoDB.

A mySQL schema is independent of the order of the input data as the input data order is typically specified later in LOAD DATA INFILE or INSERT statements.

Here’s a mySQL schema that will work for the mods table.

CREATE TABLE mods ( 
brwr_amdfc_mthy_hsng_exp_amt DECIMAL(10,2) NULL,
brwr_bir_dt VARCHAR(16) NULL,
brwr_bmdfc_mthy_hsng_exp_amt DECIMAL(10,2) NULL,
brwr_ctbn_amt DECIMAL(10,2) NULL,
brwr_ethcy_typ_cd TINYINT NULL,
brwr_ethcy_typ_nme VARCHAR(80) NULL,
brwr_excn_dt DATE NULL,
brwr_hmda_data_src_cd TINYINT NULL,
brwr_hmda_data_src_nme VARCHAR(80) NULL,
brwr_mthy_dpmt_xcldg_piti_amt DECIMAL(10,2) NULL,
brwr_mthy_grs_incm_amt DECIMAL(10,2) NULL,
brwr_race VARCHAR(255) NULL,
brwr_sex_typ_cd TINYINT NULL,
brwr_sex_typ_nme VARCHAR(80) NULL,
brwr_t2_mthy_grs_rntl_incm_amt INT NULL,
brwr_t2_prr_tot_hexp_amt DECIMAL(10,2) NULL,
cbrwr_ethcy_typ_cd TINYINT NULL,
cbrwr_ethcy_typ_nme VARCHAR(80) NULL,
cbrwr_race VARCHAR(255) NULL,
cbrwr_sex_typ_cd TINYINT NULL,
cbrwr_sex_typ_nme VARCHAR(80) NULL,
fncl_ast_id BIGINT NOT NULL PRIMARY KEY,
ivsr_grp VARCHAR(80) NULL,
last_rptd_mthy_acvy_dt DATE NULL,
ln_acvy_actn_cd TINYINT NULL,
ln_acvy_actn_dt DATE NULL,
ln_acvy_actn_nme VARCHAR(80) NULL,
ln_aft_mdfc_amrt_term SMALLINT NULL,
ln_aft_mdfc_bck_rto_pct FLOAT NULL,
ln_aft_mdfc_escr_pmt_amt DECIMAL(10,2) NULL,
ln_aft_mdfc_frnt_rto_pct FLOAT NULL,
ln_aft_mdfc_fst_pmt_due_dt DATE NULL,
ln_aft_mdfc_int_rt FLOAT NULL,
ln_aft_mdfc_lpi_dt DATE NULL,
ln_aft_mdfc_max_int_rt FLOAT NULL,
ln_aft_mdfc_mtry_dt INT NULL,
ln_aft_mdfc_npv_mdl_rslt_amt DECIMAL(10,2) NULL,
ln_aft_mdfc_pi_pmt_amt DECIMAL(10,2) NULL,
ln_aft_mdfc_prdc_lbl_typ_cd TINYINT NULL,
ln_aft_mdfc_prdc_lbl_typ_nme VARCHAR(80) NULL,
ln_aft_mdfc_rmng_term SMALLINT NULL,
ln_aft_mdfc_upb_amt INT NULL,
ln_amdfc_pral_npv_mdl_rslt_amt DECIMAL(10,2) NULL,
ln_atrn_fees_not_in_escr_amt DECIMAL(10,2) NULL,
ln_bef_mdfc_amrt_term SMALLINT NULL,
ln_bef_mdfc_assc_mthy_pmt_amt DECIMAL(10,2) NULL,
ln_bef_mdfc_bck_rto_pct FLOAT NULL,
ln_bef_mdfc_escr_pmt_amt DECIMAL(10,2) NULL,
ln_bef_mdfc_frnt_rto_pct FLOAT NULL,
ln_bef_mdfc_int_rt FLOAT NULL,
ln_bef_mdfc_lpi_dt DATE NULL,
ln_bef_mdfc_mtry_dt INT NULL,
ln_bef_mdfc_npv_mdl_rslt_amt DECIMAL(10,2) NULL,
ln_bef_mdfc_pi_pmt_amt INT NULL,
ln_bef_mdfc_prdc_lbl_typ_cd TINYINT NULL,
ln_bef_mdfc_prdc_lbl_typ_nme VARCHAR(80) NULL,
ln_bef_mdfc_rmng_term SMALLINT NULL,
ln_bef_mdfc_upb_amt INT NULL,
ln_bmdfc_pral_npv_mdl_rslt_amt DECIMAL(10,2) NULL,
ln_dlq_int_cplzd_amt DECIMAL(10,2) NULL,
ln_dlqy_typ_cd TINYINT NULL,
ln_dlqy_typ_nme VARCHAR(80) NULL,
ln_dsbt_frgv_amt DECIMAL(10,2) NULL,
ln_escr_cplzd_adv_amt DECIMAL(10,2) NULL,
ln_fcl_refl_dt VARCHAR(16) NULL,
ln_frbrn_plan_typ_cd TINYINT NULL,
ln_frbrn_plan_typ_nme VARCHAR(80) NULL,
ln_fst_lien_ind VARCHAR(80) NULL,
ln_fst_trl_pmt_due_dt DATE NULL,
ln_fst_trl_pmt_pstd_dt DATE NULL,
ln_fst_trl_pmt_rcvd_amt DECIMAL(10,2) NULL,
ln_hshp_rsn_cd TINYINT NULL,
ln_hshp_rsn_nme VARCHAR(80) NULL,
ln_int_owed_or_not_rptd_amt DECIMAL(10,2) NULL,
ln_int_remd_amt INT NULL,
ln_lpi_dt DATE NULL,
ln_mdfc_eff_dt DATE NULL,
ln_mdfc_eval_dt DATE NULL,
ln_mdfc_int_rt_lock_dt DATE NULL,
ln_mdfc_mode_cd TINYINT NOT NULL,
ln_mdfc_mode_nme VARCHAR(80) NOT NULL,
ln_mtg_typ_cd TINYINT NULL,
ln_mtg_typ_nme VARCHAR(80) NULL,
ln_npv_calc_dt DATE NULL,
ln_orig_note_dt INT NULL,
ln_othr_cplzd_adv_amt DECIMAL(10,2) NULL,
ln_othr_ctbn_amt DECIMAL(10,2) NULL,
ln_pgm_typ_desc VARCHAR(80) NULL,
ln_pgm_typ_nme VARCHAR(80) NULL,
ln_pi_at_31_pct_dti_rto_amt DECIMAL(10,2) NULL,
ln_pi_at_38_pct_dti_rto_amt DECIMAL(10,2) NULL,
ln_ppmt_owed_or_not_rptd_amt DECIMAL(10,2) NULL,
ln_pral_wflrstrn_typ_cd TINYINT NULL,
ln_pral_wflrstrn_typ_nme VARCHAR(80) NULL,
ln_prin_frbrn_amt DECIMAL(10,2) NULL,
ln_prin_rdcn_altv_amt DECIMAL(10,2) NULL,
ln_prin_rdcn_altv_cd TINYINT NULL,
ln_prin_rdcn_altv_nme VARCHAR(80) NULL,
ln_prin_remd_amt INT NULL,
ln_prjd_fcl_sle_dt VARCHAR(16) NULL,
ln_prtl_clm_amt DECIMAL(10,2) NULL,
ln_rpur_typ_cd TINYINT NULL,
ln_rpur_typ_nme VARCHAR(100) NULL,
ln_sec_lien_fncl_ast_id BIGINT NULL,
ln_smss_stat_cd TINYINT NULL,
ln_smss_stat_nme VARCHAR(80) NULL,
ln_splmy_astnce_cd TINYINT NULL,
ln_splmy_astnce_nme VARCHAR(80) NULL,
ln_st_cd TINYINT NOT NULL,
ln_st_nme VARCHAR(80) NOT NULL,
ln_t1_not_ofrd_rsn_cd TINYINT NULL,
ln_t1_not_ofrd_rsn_nme VARCHAR(80) NULL,
ln_tot_cplzd_amt DECIMAL(10,2) NULL,
ln_trl_mdfc_dnal_rsn_cd TINYINT NULL,
ln_trl_mdfc_dnal_rsn_nme VARCHAR(80) NULL,
ln_trl_plan_typ_cd TINYINT NULL,
ln_trl_plan_typ_nme VARCHAR(80) NULL,
ln_trl_pmt_no VARCHAR(80) NULL,
ln_trl_pmt_pstd_dt VARCHAR(80) NULL,
ln_trl_pmt_rcvd_amt VARCHAR(80) NULL,
ln_trl_prd_cnt INT NULL,
ln_umpt_frbrn_plan_durn_cnt INT NULL,
ln_umpt_mthy_frbrn_amt DECIMAL(10,2) NULL,
ln_umpt_plan_frbrn_intn_dt DATE NULL,
ln_upb_amt DECIMAL(10,2) NULL,
ln_upb_frgv_amt DECIMAL(10,2) NULL,
mha_ln_id BIGINT NOT NULL,
perm_mdfc_cncln_rsn_cd TINYINT NULL,
perm_mdfc_cncln_rsn_nme VARCHAR(80) NULL,
prop_cndn_cd TINYINT NULL,
prop_cndn_nme VARCHAR(80) NULL,
prop_geoc_cnsus_msa_cd MEDIUMINT NULL,
prop_ocpy_stat_cd TINYINT NULL,
prop_ocpy_stat_nme VARCHAR(80) NULL,
prop_rgn_cd TINYINT NULL,
prop_rgn_nme VARCHAR(80) NULL,
prop_stdz_st_cd VARCHAR(2) NULL,
prop_typ_cd TINYINT NULL,
prop_typ_nme VARCHAR(80) NULL,
prop_unt_cnt INT NULL,
prop_usg_typ_cd TINYINT NULL,
prop_usg_typ_nme VARCHAR(80) NULL,
prop_valu_as_is_val_amt DECIMAL(10,2) NULL,
prop_valu_dt DATE NULL,
prop_valu_mthd_cd TINYINT NULL,
prop_valu_mthd_nme VARCHAR(80) NULL,
step_schedule VARCHAR(1000) NULL,
trial_mdfc_fout_rsn_cd TINYINT NULL,
trial_mdfc_fout_rsn_nme VARCHAR(80) NULL
) ENGINE=Aria ;

The BigQuery schema is in JSON, and the field list must be ordered to match the order of fields in lines of the input csv. The mode indicates whether a field can be null or is required (not null). There is an optional “description” field, unused here but obviously needed, that allows providing a human-readable label for the data field.

[
    {
        "mode": "required", 
        "name": "fncl_ast_id", 
        "type": "integer"
    }, 
    {
        "mode": "required", 
        "name": "mha_ln_id", 
        "type": "integer"
    }, 
    {
        "mode": "required", 
        "name": "ln_st_cd", 
        "type": "integer"
    }, 
    {
        "mode": "required", 
        "name": "ln_st_nme", 
        "type": "string"
    }, 
    {
        "mode": "required", 
        "name": "ln_mdfc_mode_cd", 
        "type": "integer"
    }, 
    {
        "mode": "required", 
        "name": "ln_mdfc_mode_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_smss_stat_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_smss_stat_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ivsr_grp", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pgm_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pgm_typ_desc", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fst_trl_pmt_due_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fst_trl_pmt_pstd_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fst_trl_pmt_rcvd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_acvy_actn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_acvy_actn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_acvy_actn_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_amrt_term", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_amrt_term", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_assc_mthy_pmt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_atrn_fees_not_in_escr_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_bck_rto_pct", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_bck_rto_pct", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_ctbn_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_bir_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_ethcy_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_ethcy_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_excn_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_race", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_sex_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_sex_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_ethcy_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_ethcy_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_race", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_sex_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "cbrwr_sex_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_orig_note_dt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_dlq_int_cplzd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_dsbt_frgv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_escr_pmt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_escr_pmt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_escr_cplzd_adv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fst_lien_ind", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_fst_pmt_due_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_fcl_refl_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_frnt_rto_pct", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_frnt_rto_pct", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_hshp_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_hshp_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_int_owed_or_not_rptd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_int_remd_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_int_rt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_int_rt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mdfc_int_rt_lock_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_lpi_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_lpi_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_prd_cnt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mtg_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mtg_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_mtry_dt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_mtry_dt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_max_int_rt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mdfc_eff_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_mthy_dpmt_xcldg_piti_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_amdfc_mthy_hsng_exp_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_bmdfc_mthy_hsng_exp_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_othr_cplzd_adv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_othr_ctbn_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_pi_pmt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pi_at_31_pct_dti_rto_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pi_at_38_pct_dti_rto_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_pi_pmt_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_remd_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_ppmt_owed_or_not_rptd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_prdc_lbl_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_prdc_lbl_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_prdc_lbl_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_prdc_lbl_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prjd_fcl_sle_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_cndn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_cndn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_geoc_cnsus_msa_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_unt_cnt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_ocpy_stat_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_ocpy_stat_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_rgn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_rgn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_stdz_st_cd", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_usg_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_usg_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_valu_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_valu_mthd_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_valu_mthd_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_rmng_term", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "step_schedule", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_rmng_term", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "trial_mdfc_fout_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "trial_mdfc_fout_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_mdfc_dnal_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_mdfc_dnal_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_pmt_no", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_pmt_pstd_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_pmt_rcvd_amt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_upb_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_upb_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_tot_cplzd_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_dlqy_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_dlqy_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_frbrn_plan_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_frbrn_plan_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_lpi_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "last_rptd_mthy_acvy_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_upb_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_mthy_grs_incm_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_npv_calc_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_aft_mdfc_npv_mdl_rslt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bef_mdfc_npv_mdl_rslt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_frbrn_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_rdcn_altv_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_rdcn_altv_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_amdfc_pral_npv_mdl_rslt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_bmdfc_pral_npv_mdl_rslt_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prin_rdcn_altv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_upb_frgv_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "prop_valu_as_is_val_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_hmda_data_src_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_hmda_data_src_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pral_wflrstrn_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_pral_wflrstrn_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_splmy_astnce_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_splmy_astnce_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_plan_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_trl_plan_typ_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_umpt_mthy_frbrn_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_umpt_frbrn_plan_durn_cnt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_umpt_plan_frbrn_intn_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_sec_lien_fncl_ast_id", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_prtl_clm_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "perm_mdfc_cncln_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "perm_mdfc_cncln_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_t1_not_ofrd_rsn_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_t1_not_ofrd_rsn_nme", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_mdfc_eval_dt", 
        "type": "string"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_t2_mthy_grs_rntl_incm_amt", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "brwr_t2_prr_tot_hexp_amt", 
        "type": "float"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_rpur_typ_cd", 
        "type": "integer"
    }, 
    {
        "mode": "nullable", 
        "name": "ln_rpur_typ_nme", 
        "type": "string"
    }
]

Loading

Once mySQL is installed, the CREATE TABLE command of the previous section is run to create the empty table from the schema.

To load data, mySQL has a special statement:

LOAD DATA INFILE '/tmp/ModData.csv'
INTO TABLE mods
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY "\n"
IGNORE 1 LINES
(fncl_ast_id,mha_ln_id,ln_st_cd,ln_st_nme,ln_mdfc_mode_cd,...)

Notice the last line of the statement requires the order of fields. Here we used head -1 filename.csv >> ModsLoad.sql in the Linux shell to copy the header line from the .csv file to the .sql file and then added the () in a text editor.

Big Query, since it is Google-hosted, requires a somewhat longer series of steps to load data

  1. Setup project in Google Developers’ API page
  2. Enable Google Big Query and Google Cloud Storage APIs
  3. Set up billing
  4. Create Data set in Project
  5. Install gsutil on data source computer (pip install gsutil)
  6. Copy bqMod.csv file from data source computer to Google Cloud Storage
  7. install bq on some computer - already installed in Google Compute Cloud (now part of GCloud SDK)
  8. gcloud auth login
  9. use bq to create Table within Dataset from cloud storage .csv file and schema.json
  10. query table using bq or web interface at http://bigquery.googleapi.com

Making a Google Cloud Storage bucket
gsutil mb gs://hamp201501

Copying the data to the storage bucket
gsutil cp ./HMP_Public_User_Mod_Data_All_20150116-unfubar.csv gs://hamp201501/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv

Creating and loading the table in Google Big Query
bq load --skip_leading_rows=1 hamp201501.mods gs://hamp201501/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv ./mod.bq

Notice that the bq load here is taking a google storage URL for the data but a local file for the schema. It seems to me natural to also load the schema from google storage but that was not successful.

In both mySQL and BigQuery, once the data is loaded, the data that exists in the database is independent of the input files. The original input files can be moved, removed, etc., if need be, or kept as a backup. Google Cloud Storage does cost $0.026/GB (Mar 2015) for files left there in addition to the storage costs in Google BigQuery, another $0.02/GB (Mar 2015).

Querying

Interfaces

MySQLBigQuery
web console 3rd party tools
phpmyadmin
https://bigquery.cloud.google.com
command line client mysql bq
part of Google Cloud SDK
API various external libs
mysql protocol
tcp and unix sockets
BigQuery Client Libs
REST
HTTP(S)

Performance

OperationMySQLBigQuery
Find a Record by Unique Id
select * from hamp201501.mods where fncl_ast_id=44448899004
a few ms
(PRIMARY KEY)
8 sec !!!
scan: 4.4GB
(lacks indexing)
Count all
select count(*) from hamp201501.mods
a few ms
(Aria table)
1.7 sec
scan: 0B
Group - count - sort loan hardships
select ln_hshp_rsn_cd, ln_hshp_rsn_nme, count(*) as N from hamp201501.mods group by ln_hshp_rsn_cd, ln_hshp_rsn_nme order by N desc
18 sec
(no indexes)
1.5 sec
scan: 71MB
Group by metro area - average home values in eligible value range - sort by avg value highest to lowest
select prop_geoc_cnsus_msa_cd, average(prop_valu_as_is_val_amt) as avg_prop_value from hamp201501.mods where prop_valu_as_is_val_amt between 1 and 800000 group by prop_geoc_cnsus_msa_cd order by avg_prop_value desc
17.5 sec 1.7 sec
scan: 65MB
Table join mods table to npvs table - group by metro area - count strange HAMP applications where the npv table reports a property value that is at least twice as much as the property value reported in the mods table ~ 46 sec
select prop_geoc_cnsus_msa_cd, count(*) as N from mods, npvs where mods.fncl_ast_id=npvs.fncl_ast_id and mods.prop_valu_as_is_val_amt > 0 and mods.prop_valu_as_is_val_amt < 0.5*npvs.prop_valu_as_is_val_amt group by prop_geoc_cnsus_msa_cd order by N desc limit 5
~ 3.5 sec, 147MB
uses non standard JOIN EACH
select prop_geoc_cnsus_msa_cd,count(*) as N from hamp201501.mods as t1 JOIN EACH hamp201501.npvs as t2 ON t1.fncl_ast_id=t2.fncl_ast_id where t1.prop_valu_as_is_val_amt > 0 and t1.prop_valu_as_is_val_amt < 0.5*t2.prop_valu_as_is_val_amt group by prop_geoc_cnsus_msa_cd order by N desc limit 5

Note that the join of large tables in BigQuery currently requires a size hint in the form of JOIN EACH. Otherwise, BigQuery will abort queries JOINing large tables.

mysql vs bq Command Line Tools

running interactively

Either mysql or bq can be run in an interactive shell mode:

mysql -u root -p

bq shell

running commands from a file

If hardshipList.sql contains the sql for the Loan Hardship Reasons query above, we can write command-line queries as follows:

mysql -u root -p hamp201501 <hardshipList.sql

bq query "$(cat hardshipList.sql)"

The "$(cat ...)" is not part of Google BigQuery. It is part of the Linux Bash shell. In Bash, "$(cat somefile)" simply inserts the contents of the file (the SQL commands) into the command line.

You can also enter the SQL directly in bq like this:

bq query "select stuff, morestuff, evenmorestuff from sometable where this=that and foo=bar"

output options

limits on number of rows returned

As far as I know, MySQL only limits the number of rows returned in response to an explicit LIMIT clause in an SQL query.

bq query has a default row limit that can surprisingly truncate results. I suppose Google’s engineers felt result limiting was necessary for big data so that beginners do not bring down the system with a limitless select * from ...

The row limit for an individual query can be redefined with the -n option

bq query -n 1000 "select ..." will return 1000 rows

formatting

Mysql’s default is a pretty printed table when interactive, otherwise tab-separated-values

command line option -H provides HTML table; -X provides XML

BQ’s default is a pretty printed table like this:

paul@home:~/treasury-data-hamp-Feb-2015$ bq query "$(cat hardshipList.sql)"
Waiting on bqjob_r2dcfef6105951f57_0000014c1026d2f8_1 ... (0s) Current status: DONE    
+----------------+-----------------------------------+---------+
| ln_hshp_rsn_cd |          ln_hshp_rsn_nme          |    N    |
+----------------+-----------------------------------+---------+
|           NULL |                                   | 3854918 |
|              6 | Curtailment of income             | 1305375 |
|             14 | Other                             |  390404 |
|              7 | Excessive obligation              |  237070 |
|             15 | Unemployment                      |  147586 |
|              2 | Illness of principal borrower     |   86661 |
|              5 | Marital difficulties              |   46282 |
|             24 | Unable to contact borrower        |   27194 |
|              1 | Death of borrower                 |   26637 |
|              3 | Illness of borrower family member |   24100 |
|             16 | Business failure                  |   21529 |
|             20 | Payment adjustment                |   17097 |
|              4 | Death of borrower family member   |   15865 |
|             21 | Payment dispute                   |    7189 |
|             19 | Servicing problems                |    6739 |
|             23 | Fraud                             |    4235 |
|             17 | Casualty Loss                     |    2796 |
|             12 | Inability to rent property        |    2223 |
|             18 | Energy environment costs          |    1914 |
|             10 | Property problem                  |    1798 |
|              9 | Distant employment transfer       |     876 |
|             13 | Military service                  |     855 |
|             25 | Incarceration                     |     751 |
|             11 | Inability to sell property        |     748 |
|             22 | Transfer of ownership pending     |     592 |
|              8 | Abandonment of property           |     538 |
+----------------+-----------------------------------+---------+

There is a --format option bq --format X query ...
where X can be

pretty: formatted table output
sparse: simpler table output
prettyjson: easy-to-read JSON format
json: maximally compact JSON
csv: csv format with header

(source: output of bq --help)

Unlike the mysql client there is no HTML table generation, or XML.

You probably won’t miss it.

bq can provide JSON output, which is ultimately more useful than HTML/XML, especially at the API call level

The csv format, oddly enough, is unquoted, like this:

paul@home:~/treasury-data-hamp-Feb-2015$ bq --format csv query "$(cat hardshipList.sql)"
Waiting on bqjob_r3f61cd54c48083df_0000014c102ebd6b_1 ... (0s) Current status: DONE    
ln_hshp_rsn_cd,ln_hshp_rsn_nme,N
,,3854918
6,Curtailment of income,1305375
14,Other,390404
7,Excessive obligation,237070
15,Unemployment,147586
2,Illness of principal borrower,86661
5,Marital difficulties,46282
24,Unable to contact borrower,27194
1,Death of borrower,26637
3,Illness of borrower family member,24100
16,Business failure,21529
20,Payment adjustment,17097
4,Death of borrower family member,15865
21,Payment dispute,7189
19,Servicing problems,6739
23,Fraud,4235
17,Casualty Loss,2796
12,Inability to rent property,2223
18,Energy environment costs,1914
10,Property problem,1798
9,Distant employment transfer,876
13,Military service,855
25,Incarceration,751
11,Inability to sell property,748
22,Transfer of ownership pending,592
8,Abandonment of property,538

Query Results for the Curious

Highest Average Home Price Metro Areas from HAMP applications

Row prop_geoc_cnsus_msa_cd  avg_prop_value   
1   46520 (Urban Honolulu, HI since 2013)   437832.0     
2   27980   (Kailuku, Wailuku Lahaina HI) 415797.0  
3   26180   (Honolulu, HI)  413659.0     
4   41940   (Sunnyvale San Jose Santa Clara CA) 395081.0     
5   42100   (Santa Cruz Watsonville CA) 379636.0     

HAMP Applications where much higher property values are reported in the npv table vs. the mod table, by metro area

MariaDB [hamp201501]> select prop_geoc_cnsus_msa_cd, count(*) as N from mods, npvs where mods.fncl_ast_id=npvs.fncl_ast_id and mods.prop_valu_as_is_val_amt>0 and mods.prop_valu_as_is_val_amt<0.5*npvs.prop_valu_as_is_val_amt group by prop_geoc_cnsus_msa_cd order by N desc limit 5;
+------------------------+-----+
| prop_geoc_cnsus_msa_cd | N   |
+------------------------+-----+
|                  47900 | 463 |   (Washington DC Alex/Arl VA)
|                  33100 | 410 |   (Miami FL)
|                  16980 | 398 |   (Chicago IL and burbs)
|                  35620 | 395 |   (NYC and burbs incl NNJ)
|                  12060 | 391 |   (Atlanta and burbs)
+------------------------+-----+
5 rows in set (46.27 sec)

BigQuery

select prop_geoc_cnsus_msa_cd,count(*) as N
from hamp201501.mods as t1
JOIN EACH hamp201501.npvs as t2 ON t1.fncl_ast_id=t2.fncl_ast_id
where t1.prop_valu_as_is_val_amt>0 and 
t1.prop_valu_as_is_val_amt < 0.5*t2.prop_valu_as_is_val_amt
group by prop_geoc_cnsus_msa_cd
order by N desc limit 5;

Query complete (3.5s elapsed, 147 MB processed)

Query Results Download as CSV Save as Table
Row prop_geoc_cnsus_msa_cd  N    
1   47900   463  
2   33100   410  
3   16980   398  
4   35620   395  
5   12060   391  

Correcting errors through UPDATE

Errors in Data. It happens.

Interest rate example:

In an interest rate column, the following input probably means 2%/year interest.

2.00

0.02
think POWER(1+r,n)

200
because that pesky . key breaks the flow, and haters hate FLOAT

We can further test other database fields, such as the loan payment for a given loan amount and duration, to ensure that the interest rate is 2% and not 200% or 0.02%, but this is time consuming. While some people prefer a view of the raw data, the problem with leaving it as-is becomes a problem when an app takes data from the database and assumes that the interest rate is in a particular unit.

Do we have this problem in Treasury’s HAMP data?

–>Yes Let’s have a look.

Here we’ll see the “Loan Before Modification Interest Rate” field appears to be rounded to whole numbers. Values between 0.0-1.0 are not present, but values above 100 are present.

paul@home:~$ bq query "select ln_bef_mdfc_int_rt, count(*) as N from hamp201501.mods group by ln_bef_mdfc_int_rt order by N desc limit 10"
Waiting on bqjob_r5922df3129384f2b_0000014c19c67588_1 ... (0s) Current status: DONE   
+--------------------+---------+
| ln_bef_mdfc_int_rt |    N    |
+--------------------+---------+
|               NULL | 3854918 |
|                7.0 |  624122 |
|                6.0 |  616902 |
|                0.0 |  290885 |
|                8.0 |  249219 |
|                5.0 |  186273 |
|                9.0 |  124665 |
|                4.0 |   86959 |
|               10.0 |   64327 |
|                3.0 |   58358 |
+--------------------+---------+
paul@home:~$ bq query "select ln_bef_mdfc_int_rt, count(*) as N from hamp201501.mods where ln_bef_mdfc_int_rt<1 group by ln_bef_mdfc_int_rt order by N desc limit 10"
Waiting on bqjob_r611b008b5b4d4c1c_0000014c19c7125c_1 ... (0s) Current status: RWaiting on bqjob_r611b008b5b4d4c1c_0000014c19c7125c_1 ... (0s) Current status: DONE   
+--------------------+--------+
| ln_bef_mdfc_int_rt |   N    |
+--------------------+--------+
|                0.0 | 290885 |
+--------------------+--------+
paul@home:~$ bq query "select ln_bef_mdfc_int_rt, count(*) as N from hamp201501.mods where ln_bef_mdfc_int_rt>100 group by ln_bef_mdfc_int_rt order by N desc limit 10"
Waiting on bqjob_r7df9e803f5e3b12a_0000014c19c7373b_1 ... (0s) Current status: RWaiting on bqjob_r7df9e803f5e3b12a_0000014c19c7373b_1 ... (0s) Current status: DONE   
+--------------------+----+
| ln_bef_mdfc_int_rt | N  |
+--------------------+----+
|              543.0 | 99 |
|              688.0 | 53 |
|              650.0 | 43 |
|              700.0 | 35 |
|              638.0 | 34 |
|              675.0 | 33 |
|              663.0 | 28 |
|              725.0 | 28 |
|              625.0 | 27 |
|              750.0 | 26 |
+--------------------+----+

If we look at the loan rate after modification, we find that values are not rounded and values between 0.0 and 0.20 are also present in the top 10 results.

bq query "select ln_aft_mdfc_int_rt, count(*) as N from hamp201501.mods group by ln_aft_mdfc_int_rt order by N desc limit 10"Waiting on bqjob_r7eb2a38506e0038c_0000014c19ca7d33_1 ... (0s) Current status: RWaiting on bqjob_r7eb2a38506e0038c_0000014c19ca7d33_1 ... (0s) Current status: DONE   
+--------------------+---------+
| ln_aft_mdfc_int_rt |    N    |
+--------------------+---------+
|               NULL | 3854918 |
|                2.0 | 1233142 |
|              4.125 |   47809 |
|                4.0 |   46445 |
|               4.25 |   42632 |
|                5.0 |   42094 |
|               0.02 |   41414 |
|               4.75 |   35550 |
|              4.875 |   33509 |
|                4.5 |   32931 |
+--------------------+---------+

The big values are in the after modification interest rate too, though there are fewer of these.

paul@home:~$ bq query "select ln_aft_mdfc_int_rt, count(*) as N from hamp201501.mods where ln_aft_mdfc_int_rt>100 group by ln_aft_mdfc_int_rt order by N desc limit 10"
Waiting on bqjob_r61251cf0d55e135d_0000014c19cc6c18_1 ... (0s) Current status: RWaiting on bqjob_r61251cf0d55e135d_0000014c19cc6c18_1 ... (0s) Current status: DONE   
+--------------------+-----+
| ln_aft_mdfc_int_rt |  N  |
+--------------------+-----+
|              211.1 | 100 |
|              200.0 |  11 |
|              237.5 |   2 |
|              537.5 |   1 |
|              225.0 |   1 |
|              375.0 |   1 |
|              475.0 |   1 |
|              412.5 |   1 |
|              387.5 |   1 |
|              362.5 |   1 |
+--------------------+-----+

How to Fix it?

Like most simple choices, this is not without controversy:

  • If you fix it, you are destroying data.
  • Does the fact that the rate is in the “wrong units” have value?
  • The names of the lenders are not disclosed in the HAMP data files. Now maybe the big rates might be from bank A and B and the tiny rates from banks C and D.
  • The HAMP program was controversial and many lenders initially failed to comply. Errors can prevent proper operation of some programs yet still look correct to a human.
  • One should always leave a way to get back to the raw data, even if that involves going back to the CSV files.

In MySQL you could update a column with some nested IF statements like this:

UPDATE hamp201501.mods SET ln_bef_aft_mdfc_int_rt=IF(ln_bef_mdfc_int_rt BETWEEN 0.01 and 0.20,100*ln_bef_mdfc_int_rt, IF(ln_bef_mdfc_int_rt BETWEEN 100 AND 2000, ln_bef_mdfc_int_rt/100.0, ln_bef_mdfc_int_rt));

and we need a similar UPDATE to ln_aft_mdfc_int_rt

This takes ~21 sec per query on an AMF FX 8150 running MySQL.

To test the replace, we can count the offending entries:

MariaDB [(none)]> select count(*) from hamp201501.mods where ln_bef_mdfc_int_rt between 0.01 and 0.20 or ln_bef_mdfc_int_rt between 100.0 and 2000.0 or ln_aft_mdfc_int_rt between 0.01 and 0.20 or ln_aft_mdfc_int_rt between 100.0 and 2000.0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (17.90 sec)

BigQuery has no UPDATE OR ALTER TABLE

and this is initially a problem when it comes to fixing up buggy data.

The utility of UPDATE is obvious. However, there is also utility in not having UPDATE in knowing the data hasn’t been altered.

ALTER TABLE MODIFY COLUMN comes in handy when there is a need to change types. For instance, it could be necessary to change from an integer type to a floating or decimal type.

In the Treasury Loan Mod data, my automated schema script initially identified ln_bef_mdfc_int_rt as an integer type because there were no decimal entries in the data. Later I hand edited the schema before importing the data into mySQL to make all _rt variables FLOAT (and not DECIMAL as it is not a monetary amount and the rate will go into floating point math routines anyway). Obviously, the original data was rounded, as interest rates like “3.25%” were quite common. In order to fix the entries like 325 into 3.25, it would have been necessary to change the type.

As far as I know, Google BigQuery doesn’t support changing the schema as a stand alone operation.

Options:

A: Create a new table with all the data of the old table. The query will probably only take a few seconds to execute, but will process the entire table’s worth of data. Constructing the query itself is annoying and would probably require a short program to convert the schema to a string SELECT query format, together with manual text editing to get the desired result.

B: Use the nested IF above to create a new table of the corrected values together with a row identifier, and use JOIN EACH This might be fine if the number of corrections is small.

C: Use the nested IF above whenever corrected values are needed. This might be more palatable if BigQuery had stored functions.

The conclusion of this section is that BigQuery isn’t designed for existing data cleaning workflows common to MySQL. Simple cleanup workflow can probably be re-engineered – but is unnecessary if existing systems like MySQL or python csv-based scripts can clean the data as a batch job, and then load the cleaned data into BigQuery.

Conclusion

Google BigQuery is much faster than do-it-yourself MySQL and worthwhile when you have only a little big data, like 5GB. OK, so it isn’t terabytes. But 5GB will not fit in Excel. It won’t fit in a lot of tools. You’ll normally need a database or scripts to chop it into smaller pieces, and BigQuery fits the needs for a read-only database fairly well so long as you don’t use it for pinpoint queries.

While 5GB of csv data will fit in MySQL, with minimal trouble it can run 10x faster on Google BigQuery and, at $5/TB, individual queries are likely under a cent. An AMD FX8150 with 16GB ram – as used for this article – or an equivalent VM-based solution – will be much more expensive.

Next steps

Hopefully the foregoing has satisfied your curiosity on the basics of using Google BigQuery and how it compares to do-it-yourself mySQL.

Read a book

Google BigQuery Analytics by
Jordan Tigani and Siddartha Naidu

Build a dashboard?

Google Dashboard for Bigquery article
https://cloud.google.com/bigquery/articles/dashboard

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.