Saturday, March 21, 2015


Using Google BigQuery As A Massively Parallel Scientific Calculator: From π to Monte Carlo

This idea came up at a talk I gave on BigQuery at a Google Developer Group meetup in Chattanooga: can you use BigQuery to generate massive data, instead of simply loading or analyzing existing data? Sure. One can write a giant SELECT query to fix some errors in a dataset, rearrange the column order, calculate some new columns, and save the result in a table.
But something more basic, even subversive, is possible. It could be a Hadoop killer for some calculations.
All we need is a table of integers, say a list from 1 to 1,000,000,000.
Not finding such a dataset, we are releasing one as [eaftc-free:billion.integers] for public exploration.
Although this dataset would seem to be uninteresting, even trivial, it allows the end user to build all sorts of calculations and simulations using mathematical functions and the uniform [0,1) random function rand() built into BigQuery.
You can get access to our table at
https://bigquery.cloud.google.com/table/eaftc-free:billion.integers
Processing costs are your own responsibility, but Google has a limited free tier. Clicking this link will take you to a Google BigQuery signup unless you already have a BigQuery account.
So what, you ask? This simple table is the missing piece that turns BigQuery into a massively parallel calculator.

Calculating π in Google BigQuery

Normal Usage

Normally, π in Google BigQuery can be obtained from the built-in function PI().
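For reference, a minimal query that returns the built-in constant looks something like this (it should process 0 bytes):

SELECT PI() as builtinPI;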
Warning: I do not necessarily advocate repetition of the pi calculations shown below. You should use your free monthly processing in BigQuery to do something new and useful. Really. On the other hand, the public regularly uses similar levels of resources to search for furry kittens.
Queries can cost money… disclaimer: Most of the queries reported here will use 7 - 25 GB of BigQuery processing. After the free tier, as of March 2015 Google BigQuery processing is priced by Google at US$5/TB. So, some of the queries here could cost you US$0.10 every time you run them. Therefore, you should probably not attach a web page that shows furry kittens to a back end cgi script that runs these queries on demand for the public from your account. Also in this article are instructions for constructing large tables. These large tables also attract storage costs from Google BigQuery until you delete them. Neither the author nor my company, Economic and Financial Technology Consulting LLC, will be responsible for any Google BigQuery processing, storage, or other charges incurred by experimenting with these queries.

Brute forcing a sum

Let’s calculate π using one of those huge sums you learn in math class.

Leibniz formula

From the Leibniz formula for Pi
(π/4) = 1 - (1/3) + (1/5) - (1/7) +(1/9) - (1/11) + …
We can write the calculation for π from the first 2,000,000,001 (two billion and one) terms in this series as this BigQuery query against [eaftc-free:billion.integers]:

SELECT 4*(1+sum( (1.0/(4.0*n+1.0))-(1.0/(4.0*n-1.0)) )) as LeibnizPI FROM [eaftc-free:billion.integers];

[Image: Leibniz PI via Google BigQuery]
Obtaining this approximation: 3.1415926540897736 in about 2 seconds.
Compare to the value from the pi day page: 3.1415926535897932384
The two-billion-and-one-term Leibniz sum is thus correct to 9 decimal places and a bit high in digits 10-11.
This sum can also be easily written as a simple loop in Python, or your favorite programming language, and is left as an exercise for the reader. A decent desktop computer, like our AMD FX8150, found a similar value in about 6 minutes running on one CPU core. Depending on rounding and the kind of floats you use, slightly different answers are possible.

Euler formula

Euler showed that π squared over 6 is the limit of the sum of the squared reciprocals:
(1/1)+(1/4)+(1/9)+(1/16)+…
This can be written in Google BigQuery as:

SELECT sqrt(6.0*sum(1.0/(n*n))) as EulerPI FROM [eaftc-free:billion.integers]

[Image: Euler PI via Google BigQuery]
Obtaining this Euler approximation: 3.141592652634905
Compare to the value from the pi day page: 3.1415926535897932384
The one-billion-term Euler sum is also correct to 9 decimal places and a bit low in digits 10-11.
Although I don’t think you should waste all of Google’s processing power recomputing π over and over, Wikipedia provides a list of approximations of π that might be reconstructed in a similar manner.

The Road to Monte Carlo

Monte Carlo methods have broad application in computational finance, particularly in pricing various kinds of bets (derivatives) about the motion of a price or price index, and in applied mathematics generally, which gives them many applications in science and engineering for estimating integrals and dynamic processes that cannot be derived analytically on paper.
The basic idea of Monte Carlo methods is:
  1. Express the desired calculation as an average or other reduction of functions of a bunch of random numbers that are obtained from calling a random number generator
  2. Generate that bunch of random numbers
  3. Reduce the giant table of random numbers to the answer using other functions, averages, etc.

Bumps on the Road to Monte Carlo

Strange results repeating BigQuery’s RAND()

You’d think multiple calls to RAND() in a query would generate independent calls and independent random numbers. Not so, as of March 2015.

Should this query yield 1/4 or 1/3 of 1,000,000,000 ?


select sum(rand()*rand()) from [eaftc-free:billion.integers]

This sums rand()*rand() one billion times, once for each integer in the table. Each call to rand() yields a random floating-point number in [0.0, 1.0).
Notice that no columns of the table are needed, so this query incurs 0 bytes of Google BigQuery processing charges even though it is real work for BigQuery to calculate. The number of rows in the billion.integers table, 1 billion, controls the number of times the innermost expression is evaluated before being summed.

Why 1/4 of 1 billion?

If the RAND() calls produce different, independent, random numbers, then we can use the formula that applies to independent random numbers E[x*y]=E[x]*E[y], where E[x] is the expected value of x, a fancy technical way to say the average of x. Since the average value of rand() is 0.5, we should have avg(rand()*rand()) = avg(rand()) * avg(rand()) = 0.5 * 0.5 = 0.25. If the avg is 0.25, then summing 1,000,000,000 of these products should yield around 2.5E8, or 250,000,000.

Why 1/3 of 1 billion?

Caching is a technique to speed up computer processing by recalling a result from memory instead of computing it over again. If multiple calls to RAND() in a row of output were somehow cached into one call to RAND(), then avg(rand()*rand()) is not multiplying two different independent random numbers and taking the average, but actually multiplying a single random number by itself and taking the average, as in avg(square(rand())). In that case, the expected result is the definite integral of x^2 over the interval [0,1], which you might remember from freshman calculus: the integral of x^2 from 0 to 1 is [x^3/3] evaluated from 0 to 1, which equals 1/3.

And the Google BigQuery Answer is …. 1/3 … ugh :-(

[Image: BigQuery result for the product of two rand() calls]

Subselects won’t generate independent tables of RAND()

[Image: rands strangely equal]
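The exact query behind that screenshot is not reproduced here, but a sketch of the kind of subselect experiment that exposes the problem might look like the following (the aliases a and b are just illustrative names); if the two RAND() calls were truly independent, u and v should differ on nearly every row:

SELECT a.n as n, a.u as u, b.v as v
FROM (SELECT n, RAND() as u FROM [eaftc-free:billion.integers]) a
JOIN EACH (SELECT n, RAND() as v FROM [eaftc-free:billion.integers]) b
ON a.n = b.n
LIMIT 10;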
The next section explores how to work around this issue to generate a table with two independent random values.

Paving the Road to Monte Carlo

We want to generate a table with independent realizations of two random variables on each row.
Let’s call these variables u and v
This can be done in a straightforward series of steps.

  1. Make a u table where u is generated by rand()
  2. Make a v table where v is generated by rand()
  3. Join the u and v tables
  4. Test for independence
  5. Clean up by deleting the tables in steps 1 and 2

We publicly released the resulting u,v table as [eaftc-free:billion.randUV]

Making the randU table

In a query returning a billion entries into a new table, we had to select Allow Large Results. As a matter of courtesy to other users we also ran these table construction queries as batch mode queries.
[Image: making the u table]
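The screenshot is not reproduced here; the gist of the query, written to a destination table (assumed here to be named [eaftc-free:billion.randU]) with Allow Large Results checked, was something like:

SELECT n, RAND() as u FROM [eaftc-free:billion.integers];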

Making the randV table

[Image: making the v table]
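Similarly for v, again writing to an assumed destination table [eaftc-free:billion.randV] with Allow Large Results checked:

SELECT n, RAND() as v FROM [eaftc-free:billion.integers];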

Joining to make the randUV table

[Image: making the randUV table]
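The join query was along these lines, writing the result to [eaftc-free:billion.randUV] with Allow Large Results checked. JOIN EACH was the legacy-SQL hint for joining two large tables, and the intermediate table names are the assumptions from the previous two steps:

SELECT a.n as n, a.u as u, b.v as v
FROM [eaftc-free:billion.randU] a
JOIN EACH [eaftc-free:billion.randV] b
ON a.n = b.n;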

Testing independence

[Image: testing the randUV table]
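A simple check in the spirit of the screenshot: if u and v are independent, avg(u*v) should be close to avg(u)*avg(v) ≈ 0.25 rather than the 1/3 seen earlier. A sketch:

SELECT AVG(u) as mean_u, AVG(v) as mean_v, AVG(u*v) as mean_uv
FROM [eaftc-free:billion.randUV];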

Generating Normals: The Box-Muller Transform

Having a table with two independent random uniform variables allows us to make standardized normal random variables from the normal, or bell-shape, distribution. Although we won’t get that far in this article, the normal distribution is the basic building block for building random walks of stock market prices and Monte Carlo estimation of options prices.
The Box-Muller Transform says we can take our two independent rand() uniform random numbers u and v and make two zero mean, unit variance standard normal distribution random numbers z0 and z1 using these formulas:

z0 = sqrt(-2.0 ln(u)) cos(2πv)

z1 = sqrt(-2.0 ln(u)) sin(2πv)

In Google BigQuery, to avoid extra storage charges we store a view computing z0, z1 from u, v based on table randUV instead of computing a new table of z0, z1.
So you can follow along, we’ve made this view publicly available as [eaftc-free:billion.BoxMuller]
[Image: the Box-Muller view definition]
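The view definition is short. A sketch of its SQL, assuming LN() is the natural logarithm in BigQuery legacy SQL (note that rand() can return exactly 0, for which ln(u) is undefined, so a careful version might also filter u > 0):

SELECT n,
  SQRT(-2.0 * LN(u)) * COS(2.0 * PI() * v) as z0,
  SQRT(-2.0 * LN(u)) * SIN(2.0 * PI() * v) as z1
FROM [eaftc-free:billion.randUV];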

Testing averages, variances from Box-Muller

We were happy to see averages near zero and variances near 1, and all slightly different from each other.
[Image: testing normals for avg and var]
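A sketch of that check, assuming the VARIANCE aggregate is available in legacy SQL:

SELECT AVG(z0) as mean_z0, VARIANCE(z0) as var_z0,
       AVG(z1) as mean_z1, VARIANCE(z1) as var_z1
FROM [eaftc-free:billion.BoxMuller];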

Examining the normal tail: That infamous sixth sigma

By six sigma here we are referring to the Normal distribution, not the Six Sigma managerial process and training organizations, although the goals of the latter are marketed through the mathematical properties of the former.
In the normal distribution z0 >= 6.0 occurs with a probability of about 1 in a billion.
We can use Google BigQuery to see how many of these occurred in our billion normal values.
[Image: count of six-sigma normal values]
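A sketch of the counting query:

SELECT COUNT(*) as sixSigmaCount
FROM [eaftc-free:billion.BoxMuller]
WHERE z0 >= 6.0;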
and we can find the row of the Box Muller view with this value
[Image: row with the six-sigma value]

Five sigma

Exceeding five sigma should give us about 573 events per billion for a two sided (+ or -) deviation, or about 286 if we just look at one half… let’s look:
[Image: five-sigma events]
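A sketch of a query that counts both tails, using IF() to build indicator values:

SELECT SUM(IF(z0 >= 5.0, 1, 0)) as rightTail,
       SUM(IF(z0 <= -5.0, 1, 0)) as leftTail
FROM [eaftc-free:billion.BoxMuller];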
We find 302 in the positive, or right tail.
Is this too many? No.
From the law of rare events, the standard deviation of the count of rare events is approximately the square root of the expected count; for 286 expected events, that is about +/- 17 events.
302 is within one sigma of the expected value and thus not a statistically significant difference.

Conclusion: Normal variables look good

Opportunity for Future work

Now that we have a way to produce standard random normal variables, it should be possible to build random walks as cumulative sums. Then, from a random walk, one can do Monte Carlo stock options pricing.
According to Stack Overflow, Google BigQuery supports cumulative sums via analytic (window) functions.
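A hedged sketch of what such a cumulative-sum random walk might look like; the OVER clause is assumed to behave as documented for legacy SQL analytic functions, and ordering a billion rows in a single partition may exceed query resources, so this restricts itself to the first thousand values:

SELECT n, SUM(z0) OVER (ORDER BY n) as walk
FROM [eaftc-free:billion.BoxMuller]
WHERE n <= 1000;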
Further exploration of random walks is left to the reader or a future article.

Conclusions

With some care and the lessons from this article, Google BigQuery can, in its current form, function as an interesting, massively parallel scientific calculator and simulation platform.
It might be even better if rand() were a little better behaved. Although this article did not consider whether the random number generator passes statistical quality tests or standards for random generation, even if the rand function failed such tests it would be a fairly simple matter to upload a large table of properly generated random numbers...
We don’t know whether Google wants to be in the simulation generation space, or only the data analysis space. This will probably be more clear as Google BigQuery develops, or if Google develops additional tools or products that can leverage their parallel processing capabilities and help users avoid the steep learning curves that otherwise exist for setting up and using most modern parallel technologies.

Tuesday, March 17, 2015


How to Publicly Share a Google BigQuery Dataset: Sharing ~7GB of Loan Modification Public Data

This is the 2nd in a series of articles for Sunshine Week at blog.sunshineonacloudy.net concerning Google BigQuery and ways to share Big Data, or in this case Big Gov’t Data, so that others may have a look.

UPDATES:  Cunningham's Law worked well with this post.  Thanks to Felipe Hoffa for some inside info as to how free usage works.

Data Access TL;DR: If you want to see the shared loan modification data, try
https://bigquery.cloud.google.com/table/eaftc-free:unofficialhamp.mods201501

Goals:

  1. share my “unofficial” Google BigQuery dataset of loan modifications from the US Treasury’s public data dumps. This raw data is free, mostly a copy of US Treasury public domain data, and comes with NO WARRANTY.
  2. keep my bill to a minimum and require you to pay your own Google processing bill for extensive processing. We are willing to pay for storage only (about $0.25/month). The processing bill is pay-as-you-go; the first TB per month is apparently free from Google, and most simple queries against the dataset should cost less than one US cent thereafter. But it could be $500/day if you start analyzing other, huge, datasets such as those from Wikipedia.
  3. I’d like to credit my company, Economic and Financial Technology Consulting LLC with some good karma. We’re always looking for new business and interesting problems to solve. It looks like Google BigQuery is a great solution for keeping multi-GB operations in the human patience zone for interactive queries.

Names

Shakespeare’s “A rose by any other name would smell as sweet” clearly doesn’t apply to Google BigQuery’s fully qualified dataset names.
The format of a fully qualified BigQuery dataset name is:
[project:dataset.table] 
You want people to be able to find the data, know what it is and what it isn’t, and not be confused or hate the experience.
Since this is free data, with storage sponsored by my company (abbreviated eaftc), consisting of an unofficial copy of the US Treasury Making Home Affordable (HAMP) Data Files, we are going with these fully qualified names:
[eaftc-free:unofficialhamp.mods201501]
[eaftc-free:unofficialhamp.npvs201501]
and, unfortunately, we cannot rename what we already have to these names. So we have to create a new project and upload the data all over again. After some thought, I think this is a good thing: I’d like to keep my own copy of the tables, and any ad hoc analyses I do, separate from others’ usage of the data.
The date 201501 indicates cumulative data through Jan 2015.

Raw Data and Documentation Archive

A free archive of the raw data is available in our unofficial-hamp-201501 publicly shared Google Drive folder.
The US Treasury has changed the data formats at least once in the past, and they add new data from time to time, so this archive is necessary for maintaining the provenance of the data. Anyone who wants to use the original raw data outside of BigQuery, say for a college database course, or to test it against what is in the eaftc-free:unofficialhamp BigQuery tables, is welcome to do so. The archive contains:
  • Raw .zip files as downloaded from Treasury web site
  • python script unfubarcsv.py used to rewrite the csv files to remove extra commas and quotes, in preparation for Google BigQuery
  • Data dictionaries and usage guide as downloaded from Treasury web site
  • Google BigQuery schemas, in JSON format, for the mods and npvs tables

Creating and Sharing The Free Dataset

Note: Internet users seeking to access the free dataset can safely skip this section.
This section explains how to share. It does not explain how to access the shared data.
Steps we took to create and share the free dataset. The past tense (went, created, made) is used to emphasize that these are not steps for others to use the data. We had to do this, exactly once, to set up the data and share it with everyone else.
  1. from a web browser, logged in to Google, e.g. Gmail
  2. went to the Google API Console
  3. Signed up and turned on Billing (if not already set up).
  4. created a new project. I set my project-id to eaftc-free. Note that this was a permanent setting and was required to be unique across everyone’s Google API projects globally. I cannot easily change the id later; changing the project-id would require deleting and recreating the project and all its stored files and datasets.
  5. enabled Google BigQuery API and Google Cloud Storage API for the project. There is a list of APIs on the web console and you can click to turn them on and off.
  6. ssh to a computer that has the prepared data and Google Cloud SDK already installed.
  7. from the bash shell, ran gcloud auth login, opened the huge super-secret URL in a browser, and authenticated
  8. set the project: gcloud config set project eaftc-free
  9. made a cloud storage bucket: gsutil mb gs://eaftc-free The bucket name must be globally unique among gs bucket names on Earth.
  10. Copied the mods table .csv data file to Google Cloud Storage: gsutil cp ./HMP_Public_User_Mod_Data_All_20150116-unfubar.csv gs://eaftc-free/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv
  11. Copied the npvs table .csv data file to Google Cloud Storage: gsutil cp ./HMP_Public_User_NPV_Data_All_20150116-unfubar.csv gs://eaftc-free/HMP_Public_User_NPV_Data_All_20150116-unfubar.csv
  12. Went to the BigQuery API web console and clicked the little down arrow to add a dataset unofficialhamp to the eaftc-free project. In the same step, was prompted to add a description to the dataset for future viewers.
  13. Back to the ssh session to load the cloud .csv files into BigQuery.
  14. Took 5 min to load ~ 5GB into the mods table: bq load --skip_leading_rows=1 unofficialhamp.mods201501 gs://eaftc-free/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv ./mod.bq
  15. Took 3 min to load ~2 GB into the npvs table: bq load --skip_leading_rows=1 unofficialhamp.npvs201501 gs://eaftc-free/HMP_Public_User_NPV_Data_All_20150116-unfubar.csv ./NPV.bq
  16. Back at the BigQuery API Web Console, clicked the second tiny down arrow next to the dataset name unofficialhamp to bring up the Share Dataset window. Under Add people, selected All Authenticated Users. This means everyone on Earth with a Google account and BigQuery set up. Unchecked Notify people via email (probably wouldn’t do anything…), set everyone to Can View, and clicked Add. Then clicked Save changes at the bottom of the dialog. Received a popup that the dataset had been shared.
  17. Done. Kinda. Except: No one knew it was there. The existence of the dataset still needed to be publicized in, say, a blog (like the one you are reading right now) or on social media.
  18. Cleaned up by deleting the Cloud Storage bucket and associated data.
Note: publicdata:samples is Google promoted public data. Currently (March 2015) there doesn’t seem to be an obvious way for users to publicize public datasets in a way that makes them easy to discover within the BigQuery platform. Hopefully that will change in the future.

How Internet Users can Access the Shared “Unofficial” Dataset on HAMP Loan Modifications


Try this first!  Thanks Felipe!

According to Felipe Hoffa, in a comment below and also some private messages I could not attend to immediately, you should try this first to see if you can get access without a credit card:

1) people can directly access the table, instead of doing the step-by-step:
https://bigquery.cloud.google.com/table/eaftc-free:unofficialhamp.mods201501
 …

2) No credit card needed! People can query up to a terabyte of data for free every month. They need to sign up for google cloud, but no credit card.  

If that doesn't work for any reason, here is what I had posted previously: 

  1. Login to Google
  2. Go to https://console.developers.google.com and have your credit card ready. You must sign up for Google BigQuery and leave a credit card to pay your own Google BigQuery data processing bill. At $5/TB, most queries of the 7 GB (0.007 TB) Unofficial HAMP dataset will cost less than $0.05, and many less than $0.01, each time you run a query. Processing larger datasets (such as free datasets from Wikipedia and GitHub) can run up larger fees faster. And there’s always the ugly possibility of identity theft. I think the service shuts down at $500/day unless larger limits are established. Scary stuff, but you should know the potential damage if you lose control of your account. Keep an eye on things, use good passwords, and perhaps turn on two-factor authentication on your Google account.
  3. In the Google Developer’s API console add a project (call it anything you like) and sign up for the Google BigQuery API
  4. Go to your project in the Google BigQuery console and add the shared dataset.
The Google BigQuery docs describe this as follows:
Switch to a shared dataset
It is possible for another user to share a dataset with you but not
add you as a user to the corresponding Google Developers Console
project. If that happens, the dataset does not automatically appear in
the projects list, and you need to manually add it using the Display
Projects option.
To manually add a dataset:
Select the display menu icon next to the project name.
Hover over the Switch to Project option.
Click the Display Project... option.
On the popup that appears, enter the project ID of the dataset that was shared with you.
Your shared dataset now appears at the bottom of the dataset list.
In Switch to Project–>Display Project–>Add Project/Project ID enter:
eaftc-free
to add the publicly shared eaftc-free dataset from Economic and Financial Technology Consulting LLC containing the unofficialhamp dataset.
You will see eaftc-free:unofficialhamp added to the projects list in the left sidebar.
If you click the expansion arrow it will show the table names in the shared dataset:
 mods201501
 npvs201501
You can now click to select the mods201501 table and run a query by selecting the “query table” from the right hand side.


Running an example query


Let’s run a query to get counts of the loan statuses on file in the mods table.
Look at the top of the screen. The New Query box should show as a default
SELECT FROM [eaftc-free:unofficialhamp.mods201501] LIMIT 1000    
Change that to:
SELECT ln_st_nme, ln_mdfc_mode_nme, count(*) as N FROM [eaftc-free:unofficialhamp.mods201501] GROUP BY ln_st_nme, ln_mdfc_mode_nme ORDER BY N DESC LIMIT 20;
and check that the little check mark has turned from red to green. Green indicates a valid query. Click the green check mark for a preview of the processing charges for running the query. This query processes about 230 MB. At $5/TB, that is 230/1,000,000 × $5 ≈ $0.001.
Click the RUN QUERY button. The Query Results should match the following:
[Image: Query Results for the status counts query]

What is this Query telling us?

Notice this query result is telling us that most of the records (3.8 million) in the mods201501 table are HAMP requests that were not approved. There are other database fields that can provide additional explanation, and perhaps a basis for writing another article. Only about 1 million records are for “Official Modifications” with “Active Payment” status.

Saving the Query as a Table or View

You can do anything you like with the free dataset, but you have to save any views or results into your own project. You will be paying Google for BigQuery storage charges for any new tables you create (around $0.02/GB-Month as of March 2015).

Conclusion

This article explained how to share a dataset with the world.
It is important to know what you want to call the public dataset before you do all the work of loading it into BigQuery.
To share a dataset, you have to be willing to pay the storage charges for the tables you upload. You do not have to pay for the data processing charges. These can be paid by those using the tables.

Next steps

After reading this article, you should be able to:
  • access the eaftc-free:unofficialhamp dataset to follow along with other explorations reported on this blog
  • share your own “Big Data” project on Google BigQuery with the world some day

A final word of warning

If you turn on billing and leave a credit card and are concerned about possible billing problems or abuse, I believe you can turn off BigQuery billing any time you like – provided you are only using others’ shared data or perhaps only trying out BigQuery for an afternoon and have no permanent intention of using it.

If you have a storage bill for data you are sharing with the world, then you’re stuck paying it for as long as Google hosts the datasets you are sharing.
Also, keep in mind that Google BigQuery is still subject to change; I do not represent Google Inc. in any way, and I am certainly not a free support desk for answering questions about their products.
I wrote this article in the hope that it is useful and will save other users some time and aggravation, or at least let people know if they are on the right track and what to expect.
Some day, the technical information presented here may be obsolete.
If you have questions about the procedures or encounter errors, do a Google search or ask on Stack Overflow; these are the fastest ways to find help.