Tuesday, March 17, 2015


How to Publicly Share a Google BigQuery Dataset: Sharing ~7GB of Loan Modification Public Data

This is the 2nd in a series of articles for Sunshine Week at blog.sunshineonacloudy.net concerning Google BigQuery and ways to share Big Data, or in this case Big Gov't Data, so that others may have a look.

UPDATES:  Cunningham's Law worked well with this post.  Thanks to Felipe Hoffa for some inside info as to how free usage works.

Data Access TL;DR: If you want to see the shared loan modification data, try
https://bigquery.cloud.google.com/table/eaftc-free:unofficialhamp.mods201501

Goals:

  1. share my “unofficial” Google BigQuery dataset of loan modifications from the US Treasury’s public data dumps. This raw data is free, mostly a copy of US Treasury public domain data, and comes with NO WARRANTY.
  2. keep my bill to a minimum by requiring you to pay your own Google processing bill for any extensive processing. We are willing to pay for storage only (about $0.25/month). The processing bill is pay-as-you-go, with the first TB per month apparently free from Google; after that, most simple queries against this dataset should cost less than one US cent each. But it could run to $500/day if you start analyzing other, huge datasets such as those from Wikipedia.
  3. I’d like to credit my company, Economic and Financial Technology Consulting LLC, with some good karma. We’re always looking for new business and interesting problems to solve. It looks like Google BigQuery is a great solution for keeping multi-GB operations in the human patience zone for interactive queries.

Names

Shakespeare’s “A rose by any other name would smell as sweet” clearly doesn’t apply to Google BigQuery’s fully qualified dataset names.
The format of a fully qualified BigQuery dataset name is:
[project:dataset.table] 
You want people to be able to find the data, know what it is and what it isn’t, and not be confused or hate the experience.
Since this is free data, with storage sponsored by my company (abbreviated eaftc), consisting of an unofficial copy of the US Treasury Making Home Affordable HAMP Data Files, we are going with these fully qualified names:
[eaftc-free:unofficialhamp.mods201501]
[eaftc-free:unofficialhamp.npvs201501]
and, unfortunately, we cannot rename what we already have to these names. So, we have to create a new project and upload the data all over again. After some thought, I think this is a good thing. I’d like to keep my own copy of the tables and any ad-hoc analyses I do separate from others’ usage of the data.
The date 201501 indicates cumulative data through Jan 2015.

Raw Data and Documentation Archive

A free archive of the raw data is available in our unofficial-hamp-201501 publicly shared Google Drive folder.
The US Treasury has changed the data formats at least once in the past, and they add new data from time to time, so this archive is necessary for maintaining the provenance of the data. Anyone who wants to use the original raw data outside of BigQuery for, say, a college database course, or to check it against what is in the eaftc-free:unofficialhamp BigQuery tables, is welcome to do so. The archive contains:
  • Raw .zip files as downloaded from Treasury web site
  • The Python script unfubarcsv.py, used to rewrite the CSV files to remove extra commas and quotes in preparation for Google BigQuery
  • Data dictionaries and usage guide as downloaded from Treasury web site
  • Google BigQuery schemas, in JSON format, for the mods and npvs tables
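For readers who have never seen one, a BigQuery JSON schema file is just a list of column definitions. The fragment below is illustrative only and is not the real mod.bq or NPV.bq from the archive; the two field names happen to be columns used later in this article, and the STRING type is an assumption.

# hypothetical fragment -- the real schema files are in the shared archive
cat > example_schema.json <<'EOF'
[
  {"name": "ln_st_nme",        "type": "STRING"},
  {"name": "ln_mdfc_mode_nme", "type": "STRING"}
]
EOF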

Creating and Sharing The Free Dataset

Note: Internet users who just want to access the free dataset can safely skip this section.
This section explains how to share data; it does not explain how to access the shared data.
Here are the steps we took to create and share the free dataset. The past tense (went, created, made) is used to emphasize that these are not steps for others to follow in order to use the data; we had to do this exactly once to set up the data and share it with everyone else.
  1. from a web browser, logged in to Google, e.g. Gmail
  2. went to the Google API Console
  3. Signed up and turned on Billing (if not already set up).
  4. created a new project. I set my project-id to eaftc-free. Note that this is a permanent setting and must be unique across everyone’s Google API projects globally. The id cannot easily be changed later; changing a project-id would require deleting and recreating the project and all of its stored files and datasets.
  5. enabled Google BigQuery API and Google Cloud Storage API for the project. There is a list of APIs on the web console and you can click to turn them on and off.
  6. ssh’ed to a computer that has the prepared data and the Google Cloud SDK already installed. (The shell commands from the next several steps are consolidated in a sketch after this list.)
  7. from the bash shell, ran gcloud auth login, opened the huge super-secret URL in a browser, and authenticated
  8. set the project: gcloud config set project eaftc-free
  9. made a cloud storage bucket: gsutil mb gs://eaftc-free The bucket name must be globally unique among gs bucket names on Earth.
  10. Copied the mods table .csv data file to Google Cloud Storage: gsutil cp ./HMP_Public_User_Mod_Data_All_20150116-unfubar.csv gs://eaftc-free/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv
  11. Copied the npvs table .csv data file to Google Cloud Storage: gsutil cp ./HMP_Public_User_NPV_Data_All_20150116-unfubar.csv gs://eaftc-free/HMP_Public_User_NPV_Data_All_20150116-unfubar.csv
  12. Went to the BigQuery API web console and clicked the little down arrow to add a dataset unofficialhamp to the eaftc-free project. In the same step, was prompted to add a description to the dataset for future viewers.
  13. Back to the ssh session to load the cloud .csv files into BigQuery.
  14. Took 5 min to load ~ 5GB into the mods table: bq load --skip_leading_rows=1 unofficialhamp.mods201501 gs://eaftc-free/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv ./mod.bq
  15. Took 3 min to load ~2 GB into the npvs table: bq load --skip_leading_rows=1 unofficialhamp.npvs201501 gs://eaftc-free/HMP_Public_User_NPV_Data_All_20150116-unfubar.csv ./NPV.bq
  16. Back at the BigQuery API Web Console, clicked the second tiny down arrow next to the dataset name unofficialhamp to bring up the Share Dataset window. Under Add people, selected All Authenticated Users; this means everyone on Earth with a Google account and BigQuery set up. Unchecked Notify people via email (it probably wouldn’t do anything), set everyone to Can View, and clicked Add. Then clicked Save changes at the bottom of the dialog and received a popup confirming that the dataset had been shared.
  17. Done. Kinda. Except: No one knew it was there. The existence of the dataset still needed to be publicized in, say, a blog (like the one you are reading right now) or on social media.
  18. Cleaned up by deleting the Cloud Storage bucket and associated data.
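For convenience, here is a consolidated sketch of the shell commands from the steps above, assuming the cleaned .csv files and the JSON schema files (mod.bq, NPV.bq) from the archive are in the current directory. It is shown only to summarize what we ran; readers do not need any of this to use the shared data.

# authenticate and select the project
gcloud auth login
gcloud config set project eaftc-free

# stage the cleaned csv files in a Cloud Storage bucket
gsutil mb gs://eaftc-free
gsutil cp ./HMP_Public_User_Mod_Data_All_20150116-unfubar.csv gs://eaftc-free/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv
gsutil cp ./HMP_Public_User_NPV_Data_All_20150116-unfubar.csv gs://eaftc-free/HMP_Public_User_NPV_Data_All_20150116-unfubar.csv

# load each csv into a BigQuery table, using the JSON schema files
bq load --skip_leading_rows=1 unofficialhamp.mods201501 gs://eaftc-free/HMP_Public_User_Mod_Data_All_20150116-unfubar.csv ./mod.bq
bq load --skip_leading_rows=1 unofficialhamp.npvs201501 gs://eaftc-free/HMP_Public_User_NPV_Data_All_20150116-unfubar.csv ./NPV.bq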
Note: publicdata:samples is Google-promoted public data. Currently (March 2015) there doesn’t seem to be an obvious way for users to publicize public datasets in a way that makes them easy to discover within the BigQuery platform. Hopefully that will change in the future.

How Internet Users can Access the Shared “Unofficial” Dataset on HAMP Loan Modifications


Try this first!  Thanks Felipe!

According to Felipe Hoffa, in a comment below and in some private messages I could not attend to immediately, you should try this first to see whether you can get access without a credit card:

1) people can directly access the table, instead of doing the step-by-step:
https://bigquery.cloud.google.com/table/eaftc-free:unofficialhamp.mods201501
 …

2) No credit card needed! People can query up to a terabyte of data for free every month. They need to sign up for google cloud, but no credit card.  

If that doesn't work for any reason, here is what I had posted previously: 

  1. Login to Google
  2. Go to https://console.developers.google.com and have your credit card ready. You must sign up for Google BigQuery and leave a credit card to pay your own Google BigQuery data processing bill. At $5/TB, most queries of the 7 GB (0.007 TB) Unofficial HAMP dataset will cost less than $0.05, and many less than $0.01, each time you run a query. Processing larger datasets (such as the free datasets from Wikipedia and GitHub) can run up larger fees faster. And there’s always the ugly possibility of identity theft. I think the service shuts down at $500/day unless larger limits are established. Scary stuff, but you should know the potential damage if you lose control of your account. Keep an eye on things, use good passwords, and perhaps turn on two-factor authentication on your Google account.
  3. In the Google Developer’s API console add a project (call it anything you like) and sign up for the Google BigQuery API
  4. Go to your project in the Google BigQuery console and add the shared dataset.
The Google BigQuery docs describe this as follows:
Switch to a shared dataset
It is possible for another user to share a dataset with you but not add you as a user to the corresponding Google Developers Console project. If that happens, the dataset does not automatically appear in the projects list, and you need to manually add it using the Display Projects option.
To manually add a dataset:
  1. Select the display menu icon next to the project name.
  2. Hover over the Switch to Project option.
  3. Click the Display Project... option.
  4. On the popup that appears, enter the project ID of the dataset that was shared with you.
Your shared dataset now appears at the bottom of the dataset list.
In Switch to Project -> Display Project -> Add Project, enter the Project ID:
eaftc-free
to add the publicly shared eaftc-free project from Economic and Financial Technology Consulting LLC, which contains the unofficialhamp dataset.
You will see eaftc-free:unofficialhamp added to the projects list in the left sidebar.
If you click the expansion arrow it will show the table names in the shared dataset:
 mods201501
 npvs201501
You can now click to select the mods201501 table and run a query by selecting “Query Table” on the right-hand side.
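If you prefer the command line to the web console, the bq tool from the Google Cloud SDK can also inspect the shared tables once you are logged in with gcloud auth login. A minimal sketch, assuming the Can View sharing on the dataset is enough for your account:

# list the tables in the shared dataset
bq ls eaftc-free:unofficialhamp

# show the schema and row count of the mods table
bq show eaftc-free:unofficialhamp.mods201501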


Running an example query


Let’s run a query to get counts of the loan statuses on file in the mods table.
Look at the top of the screen. The New Query box should show this default query:
SELECT FROM [eaftc-free:unofficialhamp.mods201501] LIMIT 1000    
Change that to:
SELECT ln_st_nme, ln_mdfc_mode_nme, count(*) as N FROM [eaftc-free:unofficialhamp.mods201501] GROUP BY ln_st_nme, ln_mdfc_mode_nme ORDER BY N DESC LIMIT 20;
and check that the little check mark has turned from red to green. Green indicates a valid query. Click the green check mark for a preview of the processing charges for running the query. This query processes about 230 MB. At $5 per TB, that is 230 MB / 1,000,000 MB-per-TB x $5 ≈ $0.001.
Click the RUN QUERY button. The Query Results appear below the query editor.
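If you prefer the command line, roughly the same query can be run with the bq tool. A sketch, assuming the Cloud SDK is installed and authenticated as described earlier (at the time of writing, bq used the same legacy SQL dialect as the web console):

bq query "SELECT ln_st_nme, ln_mdfc_mode_nme, COUNT(*) AS N
          FROM [eaftc-free:unofficialhamp.mods201501]
          GROUP BY ln_st_nme, ln_mdfc_mode_nme
          ORDER BY N DESC LIMIT 20"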

What is this Query telling us?

Notice that this query result tells us that most of the records (3.8 million) in the mods201501 table are HAMP requests that were not approved. There are other database fields that can provide additional explanation, and perhaps a basis for writing another article. Only about 1 million records are for “Official Modifications” with “Active Payment” status.

Saving the Query as a Table or View

You can do anything you like with the free dataset, but you have to save any views or results into your own project. You will be paying Google for BigQuery storage charges for any new tables you create (around $0.02/GB-Month as of March 2015).
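For example, from the command line you could materialize the earlier status-count query into your own dataset, or wrap it in a view. This is a sketch: mydataset, mods_status_counts, and mods_status_counts_view are placeholder names in your own project, and you should double-check the current bq flags against Google's documentation:

# save query results into a table you own (you pay its storage)
bq query --destination_table=mydataset.mods_status_counts \
  "SELECT ln_st_nme, ln_mdfc_mode_nme, COUNT(*) AS N
   FROM [eaftc-free:unofficialhamp.mods201501]
   GROUP BY ln_st_nme, ln_mdfc_mode_nme"

# or define a view; the view itself stores no data, but queries against it
# are billed like any other query
bq mk --view="SELECT ln_st_nme, ln_mdfc_mode_nme, COUNT(*) AS N
   FROM [eaftc-free:unofficialhamp.mods201501]
   GROUP BY ln_st_nme, ln_mdfc_mode_nme" mydataset.mods_status_counts_view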

Conclusion

This article explained how to share a dataset with the world.
It is important to know what you want to call the public dataset before you do all the work of loading it into BigQuery.
To share a dataset, you have to be willing to pay the storage charges for the tables you upload. You do not have to pay the data processing charges; those can be paid by the people querying the tables.

Next steps

After reading this article, you should be able to:
  • access the eaftc-free:unofficialhamp dataset to follow along with other explorations reported on this blog
  • share your own “Big Data” project on Google BigQuery with the world some day

A final word of warning

If you turn on billing and leave a credit card, and you are concerned about possible billing problems or abuse, I believe you can turn off BigQuery billing at any time, provided you are only using others’ shared data or are only trying out BigQuery for an afternoon and have no permanent intention of using it.

If you have a storage bill for data you are using or sharing with the world, then you’re stuck paying that as long as Google is hosting the datasets you are sharing.
Also, keep in mind that Google BigQuery is still subject to change; I do not represent Google Inc. in any way; and I am certainly not a free support desk for answering questions about their products.
I wrote this article in the hope that it is useful and will save other users some time and aggravation, or at least let people know if they are on the right track and what to expect.
Some day, the technical information presented here may be obsolete.
If you have questions about the procedures or encounter errors, do a Google search or ask on Stack Overflow. These are the fastest and most reliable ways to find help.
