Are We in a Data Science Gilded Age?

In this Normal Deviance column, Hugh undertakes a geospatial analytics project on a deadline and reflects on its implications.

Several years ago my company undertook some work for a retailer to find good potential locations for new stores. It took a month or so for a smallish team (two to three analysts) to turn around the work – perhaps a few hundred hours in all.

Analytics tools and workflows have continued to evolve since then, so a question occurred to me recently. How quickly could a similar piece of work be done today?

I picked pharmacies (no particular reason, but they do have those charmingly complex rules around whether you’re allowed to open or move a store) and set myself the task of a proof-of-concept analysis of where in Sydney would be best to put new pharmacies.

The analysis involved:

    1. Finding geographic locations of all pharmacies in Sydney, using the Google Places API.
    2. Identifying all potential store locations based on mesh blocks labelled “commercial” from the ABS (Census 2021) – small-scale geographic areas.
    3. Finding population by mesh block, again from the ABS.
    4. Finding the central location (centroid) of each mesh block used in stages #2 and #3.
    5. Calculating distance between all actual and potential pharmacy sites and the surrounding population.
    6. Calculating the “sales” earned at each potential location. A simple formula was used that says all people spend $1,000 at pharmacies per year, and that the amount spent at each pharmacy decays exponentially with distance (so spend at a pharmacy 2km away is 70% of the spend at one 1km away, which in turn is 70% of the spend at one 0km away). The location’s sales represent the share that comes from those surrounding mesh blocks. A sketch of how this and step #7 might look in code appears after this list.
    7. Sorting the list by decreasing sales to find promising locations. I required locations to be at least 1km apart to ensure some variety.

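To make steps #5 to #7 concrete, here is a minimal sketch of the distance-decay sales calculation and the greedy “at least 1km apart” selection. It is one plausible reading of the method rather than the code in the repository: the function names, the precomputed kilometre distance matrices and the assumption that each mesh block’s spend is shared between the candidate and the existing pharmacies are all mine; only the $1,000 spend and the 0.7-per-km decay come from the description above.

```python
import numpy as np
import pandas as pd

SPEND_PER_PERSON = 1_000   # assumed annual pharmacy spend per resident (step #6)
DECAY_PER_KM = 0.7         # spend weight multiplier per km of distance (step #6)
MIN_SEPARATION_KM = 1.0    # chosen sites must be at least this far apart (step #7)


def candidate_sales(dist_cand_block, dist_pharm_block, block_pop):
    """Estimate annual sales for each candidate site.

    dist_cand_block  : (n_candidates, n_blocks) distances in km
    dist_pharm_block : (n_pharmacies, n_blocks) distances in km
    block_pop        : (n_blocks,) population of each mesh block
    """
    block_spend = np.asarray(block_pop) * SPEND_PER_PERSON   # spend originating in each block
    w_pharm = DECAY_PER_KM ** np.asarray(dist_pharm_block)   # pull of each existing pharmacy on each block
    w_cand = DECAY_PER_KM ** np.asarray(dist_cand_block)     # pull of each candidate site on each block
    # Assumption: a block's spend is shared between the candidate and the existing
    # pharmacies in proportion to the decay weights; the candidate's sales are its share.
    share = w_cand / (w_pharm.sum(axis=0) + w_cand)          # broadcasts to (n_candidates, n_blocks)
    return share @ block_spend                               # (n_candidates,) dollars per year


def pick_top_sites(candidates: pd.DataFrame, dist_cand_cand, n_sites=15):
    """Greedily keep the best-selling candidates that are pairwise >= 1km apart.

    candidates     : DataFrame with a 'sales' column, indexed 0..n_candidates-1
    dist_cand_cand : (n_candidates, n_candidates) distances in km
    """
    chosen = []
    for i in candidates["sales"].sort_values(ascending=False).index:
        if all(dist_cand_cand[i, j] >= MIN_SEPARATION_KM for j in chosen):
            chosen.append(i)
        if len(chosen) == n_sites:
            break
    return candidates.loc[chosen]
```

One design note: this splits each block’s spend between a candidate and the existing pharmacies only, so candidates do not compete with each other until the 1km separation rule is applied; a fuller model would re-allocate spend as each new site is “opened”.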
The code developed and the data inputs are available here for those interested. It uses a combination of LLM-generated and self-written code.

And here’s the result – a map with the 15 locations deemed valuable by the model. If you were planning to open a pharmacy in the next year, you’re welcome (and caveat emptor!).

The results are moderately interesting (to me, at least).

Under our model setup, the population size of the Inner West (relative to the number of pharmacies) leads to a strong clustering of potential sites. It also implies quite large average revenues for existing pharmacies in the area.

The result is relatively sensitive to the choice of the 70% decay factor, so a good understanding of how far people travel for pharmacies is important.

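To give a feel for that sensitivity, here is a purely illustrative comparison of the decay weights placed on a pharmacy at various distances, for a few candidate decay factors:

```python
# Relative pull of a pharmacy at 0, 1, 2, 3 and 5 km, for three decay factors.
# Purely illustrative arithmetic: the gap between factors widens quickly with distance.
for decay in (0.6, 0.7, 0.8):
    weights = {d: round(decay ** d, 3) for d in (0, 1, 2, 3, 5)}
    print(f"decay={decay}: {weights}")

# decay=0.6: {0: 1.0, 1: 0.6, 2: 0.36, 3: 0.216, 5: 0.078}
# decay=0.7: {0: 1.0, 1: 0.7, 2: 0.49, 3: 0.343, 5: 0.168}
# decay=0.8: {0: 1.0, 1: 0.8, 2: 0.64, 3: 0.512, 5: 0.328}
```

At 3km a pharmacy retains roughly half its weight under a 0.8 factor but barely a fifth under 0.6, so the implied catchments – and hence the ranking of sites – shift materially.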
So, how long did this take to put together? In the end, it was just four hours of work – one or two orders of magnitude less than that motivating project.

The comparison is, of course, not particularly fair. Going from a proof of concept to a proper recommendation would need to incorporate many things, such as:

  • relative spending on pharmacies at different locations (e.g., wealth);
  • time/journey distance, rather than just absolute distance;
  • different mobility patterns (e.g., people in outer city fringes may be more likely to drive further);
  • differences between competitors (e.g., large chain chemists may absorb a greater share of wallet than smaller chemists); and
  • tidying up gaps in the analysis (e.g., validation of the pharmacy locations found and whether viable retail sites were omitted).

Doing these properly would add significantly to the time required – my exercise deliberately focuses on the low-hanging fruit.

Yet I do feel the example is instructive, since it reveals much about how the efficiency of data science work has been boosted over time:

  • LLMs, like ChatGPT, produce workable code quickly. This is particularly true for the latest generation of models and for small-ish pieces of code that achieve specific tasks.
  • APIs and other cloud services can provide data relatively quickly and cheaply. My Places API calls to get pharmacy locations cost $3 in total, which is significantly cheaper than bespoke data purchases.
  • Good external datasets exist for many problems – I am grateful for ABS datasets on an almost daily basis in my work. Whether a particular dataset already exists can be the difference between a four-hour job and a four-week one.
  • Existing packages for specific tasks do a lot of heavy lifting. In this case, existing geospatial and mapping packages make the later stages of the proof of concept a breeze. No one wants to reinvent lat-long distance calculations (see the short example after this list).

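On that last point, a short example of the heavy lifting: with geopandas and geopy, mesh block centroids (step #4) and kilometre distances (step #5) take only a few lines. The file name and coordinates below are illustrative assumptions, not the inputs used in the linked code.

```python
import geopandas as gpd
from geopy.distance import geodesic

# Illustrative: load ABS mesh block boundary polygons (file name assumed).
mesh = gpd.read_file("mesh_blocks_2021.shp")

# Compute centroids in a projected CRS (EPSG:3577, GDA94 / Australian Albers),
# then convert back to lat/lon for distance work.
centroids = mesh.geometry.to_crs(epsg=3577).centroid.to_crs(epsg=4326)
mesh["lat"], mesh["lon"] = centroids.y, centroids.x

# Kilometre distance between two (lat, lon) points – no hand-rolled haversine needed.
d_km = geodesic((-33.8688, 151.2093), (-33.8150, 151.0011)).km  # two illustrative Sydney points
print(round(d_km, 1))
```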
It does feel like something of a magical golden age, when analysis like this can be turned around quickly and cheaply. But perhaps that is premature – it assumes things will only continue to get better and easier.

And as the analysis portions become easier, more room opens up for value-add on strategic thinking, value to business, model governance, and other higher-level topics – so plenty of work for actuaries. For now.

CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.