I’ve been very grateful for Simo Ahava’s analytics advice over the years - never more so than today, when he helped me set up headless Chrome on Google Cloud to crawl a site and record which cookies it sets. Thanks Simo! The instructions are split over two blog posts - scraping a domain and writing results to BigQuery, and then auditing cookies using BigQuery. It’s awesome.
It’s fairly simple to re-run once the initial cloud setup is done. For me, the code to kick off a crawl is:
gcloud compute instances create web-crawler --metadata-from-file startup-script=./gce-install.sh --scopes=bigquery,cloud-platform --machine-type=n1-standard-16 --zone=europe-west2-a
Once it’s set up, you can edit the config, re-upload it, then go to the Compute Engine instances list, select the instance and click Start (the instance shuts itself down at the end of each crawl).
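That restart can also be done from the command line instead of the console. A sketch (the instance name, zone and startup-script path match the create command above; re-uploading via add-metadata assumes the config changes live in gce-install.sh):

```shell
# Push the edited startup script back onto the existing instance
gcloud compute instances add-metadata web-crawler \
  --metadata-from-file startup-script=./gce-install.sh \
  --zone=europe-west2-a

# Start the stopped instance; the startup script kicks off the crawl
gcloud compute instances start web-crawler --zone=europe-west2-a
```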
For my own reference, the two BigQuery queries are:
The list of pages on the site, and their cookies
SELECT final_url, cookies FROM `project.dataset.table`
The list of all cookies
SELECT c.name, c.domain, c.httpOnly, c.secure, c.session, c.sameSite
FROM `project.dataset.table`, UNNEST(cookies) AS c
GROUP BY 1, 2, 3, 4, 5, 6
ORDER BY 1 ASC
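A variation I find handy when auditing is narrowing that list to third-party cookies. A sketch, assuming the same table and using 'example.com' as a placeholder for the crawled site’s domain:

```sql
-- List cookies set on any domain other than the crawled site
SELECT DISTINCT c.name, c.domain
FROM `project.dataset.table`, UNNEST(cookies) AS c
WHERE NOT ENDS_WITH(c.domain, 'example.com')
ORDER BY c.domain, c.name
```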
Useful Cloud console links: