I’ve been very grateful for Simo Ahava’s analytics advice over the years - never more so than today, when his posts helped me set up headless Chrome on Google Cloud to crawl a site and store which cookies it sets. Thanks Simo! The instructions are split over two blog posts - scraping a domain and writing the results to BigQuery, and then auditing cookies using BigQuery. It’s awesome.

It’s fairly simple to re-run once the initial cloud setup is done. For me, the code to kick off a crawl is:

gcloud compute instances create web-crawler \
  --metadata-from-file startup-script=./gce-install.sh \
  --scopes=bigquery,cloud-platform \
  --machine-type=n1-standard-16 \
  --zone=europe-west2-a

Once it’s set up, you can make changes to the config, re-upload it, then go to the Compute Engine instances list, select the instance and start it (the instance shuts itself down once the crawl finishes).
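
You can also do the re-run from the command line rather than the console. A rough sketch, assuming the config edits live in gce-install.sh itself (if the script pulls its config from somewhere else, such as a Cloud Storage bucket, skip the add-metadata step):

# Push the updated startup script onto the existing instance
gcloud compute instances add-metadata web-crawler \
  --metadata-from-file startup-script=./gce-install.sh \
  --zone=europe-west2-a

# Boot the instance; the startup script runs the crawl, then shuts it down again
gcloud compute instances start web-crawler --zone=europe-west2-a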

For my own reference, the two BigQuery queries are:

The list of pages on the site, and their cookies

SELECT
  final_url,
  cookies
FROM
  `project.dataset.table`
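
Since cookies is a repeated record, I sometimes flatten it per page instead - for example, to find every page that sets a particular cookie. A sketch against the same table (the _ga name is just an example):

SELECT
  final_url,
  c.name,
  c.domain
FROM
  `project.dataset.table`,
  UNNEST(cookies) AS c
WHERE
  c.name = '_ga' -- example cookie name; swap in whatever you're hunting for
ORDER BY
  final_url ASC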

and:

The list of all cookies

SELECT
  c.name,
  c.domain,
  c.httpOnly,
  c.secure,
  c.session,
  c.sameSite
FROM
  `project.dataset.table`,
  UNNEST(cookies) AS c
GROUP BY
  1, 2, 3, 4, 5, 6
ORDER BY
  1 ASC
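
One more variation I reach for - how widespread each cookie is across the crawl, assuming the same schema as above:

SELECT
  c.name,
  c.domain,
  COUNT(DISTINCT final_url) AS pages_set_on -- number of crawled pages setting this cookie
FROM
  `project.dataset.table`,
  UNNEST(cookies) AS c
GROUP BY
  1, 2
ORDER BY
  pages_set_on DESC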

Useful Cloud console links: