REA Group has come a long way from its beginnings in a garage in Melbourne’s eastern suburbs. The business is now a multinational operation that includes Australia’s leading residential, commercial and share property websites, a mortgage broking franchise and property data services provider, as well as businesses and investments throughout Asia and North America. REA Group provides a platform — accessible via the web or mobile app — that connects property buyers and renters to agents or property vendors such as developers or individual owners.
REA Group began its Google Cloud journey several years ago, primarily focusing on leveraging the data analytics tools and services of the Google Cloud Platform (GCP) stack. At the beginning of that journey, the Sydney region did not yet exist. Like many other GCP customers at the time, REA Group chose the EU multi-region for analysing its data in BigQuery. Fast forward to today, and because of newly established contractual obligations and data sovereignty requirements, REA Group wanted to repatriate its BigQuery EU datasets to the relatively new Sydney region.
REA Group’s data in BigQuery plays a pivotal role, powering large-scale, data-driven analytical workloads and critical reporting functions across the business. It comprises sales, operational, marketing/audience and analytical data. Although REA Group’s own technologists are highly skilled and talented, the company enlisted Servian to help plan and implement this project.
Working closely with REA Group’s engineering teams, we came up with a robust, cost-effective and scalable solution built primarily on Google Cloud Storage (GCS) and its transfer service. We extracted the data from BigQuery into GCS, transferred it across regions, and then reloaded it into BigQuery on the Sydney side. GCS is the glue that holds everything together on GCP. We chose not to use Cloud Composer as the orchestration tool for a few reasons, but mainly to keep the solution simple and costs down. Instead, we orchestrated the entire process using Cloud Build.
We decided against the native BigQuery cross-regional dataset copy service, partly because of its limitations and partly because the engineering team wanted more control over how the data was handled during the process. Not only did the data itself need to be migrated, but so did all the dataset and table IAM policies, views, UDFs and partitioning configurations, which the existing service does not carry across.
Another determining factor was that the “hot data” (data frequently accessed by the business) needed to be moved within an aggressive 48-hour window. This, coupled with the need to validate that the data had been migrated successfully and without corruption, made the project all the more challenging from an engineering perspective. Finally, a well-planned and robust rollback strategy was an important piece of the puzzle. Although we didn’t need to use it, you have to plan for things going wrong: it is just software after all.
Using bespoke Python scripts and leveraging Python Pools for parallelisation, the process to migrate the data from EU to AU was as follows:
- Extract metadata about datasets and objects. This included views and tables: their schemas, metadata, partitioning configuration and permissions.
- Extract the tables to GCS buckets in the EU region, as compressed newline-delimited JSON files.
- Transfer the files between the two regions’ GCS buckets using the GCS transfer service.
- Replicate all the objects extracted in step #1 under datasets with the same name as the original plus a suffix in the new region, in our case Sydney. For example, a dataset called “foo” in the EU would be replicated as “foo_migrated” in AU. This gave us a quick rollback path and let us verify the replicated tables at the same time. Naturally, this included empty tables exactly mirroring the originals.
- Load the compressed JSON files from the AU GCS bucket(s) into the replicated empty tables using the BigQuery load table API.
- Back up the original datasets, then copy the migrated datasets in the destination region under their final names. This reduced the friction of changing upstream and downstream pipelines, since all the datasets kept the same name. Not being able to rename a dataset in BigQuery made this step a little clunkier than we would have liked.
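The extract step above was parallelised with Python worker pools. As a rough illustration of the pattern (not REA Group’s actual scripts), the sketch below fans a list of tables out over a thread pool; the BigQuery extract call is stubbed out so the example runs anywhere, and the bucket name is hypothetical:

```python
from multiprocessing.pool import ThreadPool

EU_BUCKET = "rea-migration-eu"  # hypothetical staging bucket name

def extract_table(table_id: str) -> str:
    """Extract one table to compressed newline-delimited JSON in the EU bucket.

    In a real script this would wrap bigquery.Client.extract_table() with
    destination_format=NEWLINE_DELIMITED_JSON and GZIP compression, then
    wait for the job; stubbed here so the sketch runs without credentials.
    """
    return f"gs://{EU_BUCKET}/{table_id}/*.json.gz"

def parallel_extract(table_ids, workers=8):
    # A thread pool suits this I/O-bound work: each worker fires an
    # extract job and blocks polling for its completion.
    with ThreadPool(workers) as pool:
        return pool.map(extract_table, table_ids)

uris = parallel_extract(["sales.orders", "audience.events"])
```

The same fan-out shape was reused for the transfer and load steps, with only the per-table function changing.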
Although the high-level architecture and process outlined above might make the workflow look trivial, moving that much data threw up considerations that weren’t in play for the smaller repatriation projects we’d done in the past. When you’re shovelling half a petabyte of data from one continent to another, things get a lot more interesting and challenging, even when you’re working in the cloud!
For example, BigQuery has limits and quotas for extracting and loading with GCS that we needed to consider and engineer around. The two main quotas were:
- A 10 TB extract limit per day. This is a soft limit, so an increase can be requested.
- A 15 TB load limit per table per load job. This is a hard limit and cannot be lifted.
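These two quotas drove how tables had to be scheduled and sharded. A back-of-envelope planner, assuming the default 10 TB/day extract quota (i.e. before any increase is granted), might look like:

```python
import math

EXTRACT_QUOTA_TB_PER_DAY = 10  # soft limit; an increase can be requested
LOAD_LIMIT_TB_PER_JOB = 15     # hard limit per table per load job

def plan_table(size_tb: float):
    """Days of default extract quota and number of load jobs for one table."""
    extract_days = math.ceil(size_tb / EXTRACT_QUOTA_TB_PER_DAY)
    load_jobs = math.ceil(size_tb / LOAD_LIMIT_TB_PER_JOB)
    return extract_days, load_jobs

# A 100 TB table needs 10 days of default extract quota and 7 load jobs.
print(plan_table(100))  # → (10, 7)
```

Numbers like these made it obvious that the 100 TB+ tables could not be moved naively inside a 48-hour window.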
Additionally, we had a few tables over 100 TB, and they were “hot data” tables being updated in real time by streaming jobs. Migrating these tables was by far the most technically challenging hurdle we needed to overcome. The first thing we wanted to do was migrate any long-tail data: partitions that hadn’t been modified or updated in the last 5 days. This allowed us to migrate those partitions of the large tables ahead of time, before the 48-hour migration window. To identify the partitions we could migrate early, we had to resort to some good old Legacy SQL in BigQuery:
```sql
SELECT
  *,
  DATEDIFF(CURRENT_TIMESTAMP(), last_modified_timestamp) AS days_since_last_modified
FROM [<Partitioned Table>$__PARTITIONS_SUMMARY__]
```
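Once the partition summary has been pulled back, filtering it down to the “migrate early” set is straightforward. A minimal sketch (the function and input shapes are illustrative, not from the actual migration scripts):

```python
from datetime import datetime, timedelta, timezone

def split_partitions(partitions, days=5, now=None):
    """Split (partition_id, last_modified) pairs into early-migration vs hot.

    Partitions untouched for `days` or more can be moved ahead of the
    cut-over window; recently modified ones must wait for migration day.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    early = [p for p, ts in partitions if ts < cutoff]
    hot = [p for p, ts in partitions if ts >= cutoff]
    return early, hot

parts = [
    ("20200101", datetime(2020, 1, 1, tzinfo=timezone.utc)),
    ("20200609", datetime(2020, 6, 9, tzinfo=timezone.utc)),
]
early, hot = split_partitions(parts, now=datetime(2020, 6, 10, tzinfo=timezone.utc))
```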
Based on that information, we could break the 100 TB+ tables into smaller chunks that were easier to migrate. On the other side, when reloading them, we of course needed to reassemble them back into one table with the right partitions. This also involved some heavy-duty engineering effort. The overall process for migrating the big 100 TB+ tables looked like this:
- Generate multiple table shards from the large table using the partition information, essentially by running queries and specifying destination tables. Flex-slots are your friend here!
- Gradually migrate the tables across regions following the process mentioned above and reusing the same scripts we wrote to extract, transfer and load the data.
- On the final migration day, the upstream and downstream ETL data pipelines were paused while we generated the last table shard and transferred the data across regions.
- After all the table shards were migrated to AU, we recombined them back into the big tables they came from, using copy jobs with WRITE_DISPOSITION set to WRITE_APPEND.
- Verify the now fully migrated table against the original one in the source region.
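The verification step can be as simple as comparing cheap per-partition row counts gathered from table metadata on each side before cut-over. A minimal sketch, with hypothetical input shapes:

```python
def verify_counts(source_counts: dict, dest_counts: dict) -> dict:
    """Compare per-partition row counts between source (EU) and migrated (AU).

    Returns a dict of mismatching partitions mapped to
    (source_rows, dest_rows); an empty dict means the table verified cleanly.
    A partition missing on the destination shows up as (source_rows, None).
    """
    return {
        p: (n, dest_counts.get(p))
        for p, n in source_counts.items()
        if dest_counts.get(p) != n
    }

# Example: one partition short a row on the AU side fails verification.
issues = verify_counts({"20200101": 1000, "20200102": 500},
                       {"20200101": 1000, "20200102": 499})
```

Row counts alone won’t catch every corruption case, so checks like this are best treated as a fast first gate before any deeper column-level comparison.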
Using this approach, we were able to migrate 500 TB of data gradually, day by day, without any data loss and with minimal disruption to the business. Pro tip: request your quota increase for data extraction with enough lead time, because it can take a while to be approved and implemented.
The team also needed to dynamically provision dedicated BigQuery throughput via flex-slots to meet the scalability demands of the migration, while at the same time keeping GCP costs as low as possible for REA Group. We also tried using the Avro file format for the extraction and reloading, but we ran into a lot of problems with data type conversions when reloading the data in Sydney, and it just became too hard. In the end, we gave up on Avro and went back to compressed JSON.
Flex-slots were also needed for the extract and load jobs due to the amount of data we were working with. Using the default slot pool for extract and load jobs (which is separate from the slot pool for queries) would not cut the mustard. Pro tip: when using flex-slots for extract and load jobs, you need to assign the reservation to the PIPELINE job type when setting it up via the API.
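For reference, setting this up via the bq CLI looks roughly like the following. The project, location, slot count and reservation name are placeholders; the key detail is the PIPELINE job type on the reservation assignment, which directs the slots at extract and load jobs rather than queries:

```shell
# Placeholders: my-project, australia-southeast1, 500 slots.
# 1. Buy a flex-slot capacity commitment.
bq mk --capacity_commitment --location=australia-southeast1 \
    --plan=FLEX --slots=500

# 2. Create a reservation from that capacity.
bq mk --reservation --location=australia-southeast1 \
    --slots=500 migration_pipeline

# 3. Assign it to PIPELINE (extract/load) jobs, not the default QUERY type.
bq mk --reservation_assignment --location=australia-southeast1 \
    --reservation_id=my-project:australia-southeast1.migration_pipeline \
    --job_type=PIPELINE \
    --assignee_type=PROJECT --assignee_id=my-project
```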
Keeping costs down for REA Group was an important consideration and part of the project’s success criteria. Handling data at such scale incurs higher costs, and the team needed to be very mindful of this. Examples include the often-overlooked network egress charges for transferring data from the EU to the AU region via GCS, and the fact that BigQuery’s long-term storage pricing would be reset once the data was reloaded into its new home in Sydney. However, with the help of Google SMEs and by leveraging our GCP expertise, the team managed to keep costs down and within the budget allocated for the project.
Operating at that scale does not come without its challenges. Moving almost 500TB of data without causing disruption to the business needed to be carefully planned and executed. Servian had done the same work for other customers in the past, but with just a fraction of the data. However, by all accounts, the repatriation of REA Group’s BigQuery data assets was a great success. And on that note, I’ll leave you with a nice quote from our happy customer/stakeholder:
“We turned to Servian to help us with this important piece of work. Their team had significant GCP experience, and in particular, their in-depth knowledge and experience with BigQuery gave us the level of confidence we needed for such a critical project to be delivered successfully. By using Servian and leveraging their experience in having already done this type of work before with other customers, it saved us a huge amount of time. Servian was able to assemble a team with considerable expertise and no doubt saved REA Group substantial development time. They always carried themselves professionally and engaged early with issue identification and remedies. Servian was instrumental in delivering the solution for repatriating our BigQuery data and supporting us to manage costs. I look forward to working together again in the future.” — Leigh Tolliday, General Manager, Data Services, REA Group.