Our Story

Like all great ideas, the landrecords parcel dataset came from a place of irritation with the status quo.

I’m sorry, you want how much??

In 2019, I set out to build a natural hazard map that incorporated NFIP flood risk data, terrain characteristics, and historical weather patterns to produce a risk profile for a property. I was confident that the mapping application, statistical analyses, and machine learning were within my capabilities. So the first step in this project was to gather the data required to power the application.

Nationwide historical meteorological and storm event data — available as a public dataset in BigQuery.

Nationwide digital elevation models for terrain analysis — a lot of data, but readily available.

Nationwide NFIP flood zones and historical claims data — no problem.

Nationwide parcel boundaries and property attributes — I’m sorry, you want how much?

I was stunned, and my project was stuck. Each of the vendors I talked to wanted more for their parcel datasets than most GIS analysts earn in a year. (One company, known as CoreLogic back then, wouldn’t even talk to me.) What I thought would be the most straightforward of all the datasets turned out to be the most expensive and difficult to acquire.

2020: I started with the A’s

I didn’t have that kind of money, so I started collecting some data myself. My hope was to gather enough data to power my application for a local area. I had no ambition or desire to create a nationwide dataset.

Manually visiting county websites and looking for parcel data is not a fun task. I started a spreadsheet to record how I got the data and where it came from. A lot of counties don’t post their data publicly, so I sent quite a few emails.

And after I had collected enough data for a dozen or so counties, I had a new problem: none of this data was standardized in any way. Different file formats, table schemas, column types, naming conventions — everything was different from everything else. I tried to use jellyfish, a fuzzy string-matching library, to automate some of the column mapping, but that failed, and eventually I settled on hard-coding manual transformations to turn it all into a unified format.
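For the curious, the column-mapping attempt looked roughly like the sketch below. It is a minimal illustration only: the canonical field names and the example county columns are made up for this page, not the actual schema.

# A minimal sketch of the column-mapping idea, using the jellyfish
# string-similarity library. Field names and county columns are illustrative.
import jellyfish

CANONICAL_FIELDS = ["parcel_id", "owner_name", "situs_address", "assessed_value"]

def best_match(source_column: str, threshold: float = 0.85) -> str | None:
    """Return the closest canonical field for a source column, or None."""
    scores = {
        field: jellyfish.jaro_winkler_similarity(source_column.lower(), field)
        for field in CANONICAL_FIELDS
    }
    field, score = max(scores.items(), key=lambda kv: kv[1])
    return field if score >= threshold else None

# One county's attribute names (made up). Heavily abbreviated names like
# "ASSD_VAL" are exactly where pure string similarity tends to fall short.
county_columns = ["PARCELID", "OWNER_NM", "SITUS_ADDR", "ASSD_VAL", "SHAPE_AREA"]
print({col: best_match(col) for col in county_columns})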

This approach doesn’t scale, of course. Manually visiting websites, clicking buttons, downloading files, ETL-ing those files into a database, doing a bunch of harmonization, etc. wasn’t the project I signed up for. I realized I would have to settle for supporting just a few counties in my application.

2023: The AI revolution

I had long suspected that if AI ever took off, computer programmers were going to be the most at risk. I reasoned that long before an AI could fully understand human language, it would probably first figure out computer languages. And my day job at a big Cloud provider gave me some insight into what these new large language models (LLMs) were capable of.

Despite the “hallucinations”, these models appeared quite good at programming. In particular, they excelled at understanding the relationships among unstructured and semi-structured data.

I remembered that a few years earlier I had tried and failed to do something similar with parcel data to unblock my property hazard project. I wondered if these new LLMs could help.

I worked in the Cloud business, so naturally I thought to use one of the leading LLM APIs, like Gemini or ChatGPT.

But I quickly ran into another cost issue: I needed the LLM to understand the datasets and schemas of thousands of data sources. Not just once, but every day or week or month, in order to keep the data current and the mappings up to date. And many of these data sources get thrown away because of low data quality, because they are outdated, or because they are simply the wrong kind of dataset entirely.
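To give a sense of the workload, each source needed something like the call sketched below (assuming an OpenAI-style chat API; the prompt, model name, and canonical fields are hypothetical, not the production pipeline). Multiply one such call by thousands of sources, repeated on every refresh, and the token costs add up fast.

# A hedged sketch of a per-source schema-mapping call. The prompt, model
# choice, and canonical fields are illustrative, not the actual pipeline.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CANONICAL_FIELDS = ["parcel_id", "owner_name", "situs_address", "assessed_value"]

def propose_mapping(source_columns: list[str], sample_rows: list[dict]) -> dict:
    """Ask the model to map one source's columns onto the canonical schema."""
    prompt = (
        f"Map these source columns to the canonical fields {CANONICAL_FIELDS}. "
        "Return a JSON object mapping each source column to a canonical field or null.\n"
        f"Columns: {source_columns}\n"
        f"Sample rows: {json.dumps(sample_rows[:5])}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)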

I quickly ran up an AI inference bill of $800+ just from a one-time import of only a few hundred counties. Extrapolating out, I would need to budget many thousands of dollars each month to leverage this new technology. I tried a few different Cloud services, and always ran into the same issue: by the time I paid my ever-increasing Cloud bill, I would have been better off selling my house and giving the money to Regrid!

Another option was to own some of the hardware. I already ran my own database servers for a similar reason: inter-component bandwidth is a scarce resource, and the high cost of Cloud VM IOPS makes it expensive to run ultra-high-performance databases in the Cloud (and the new Xeon 6 CPUs were selling at a big discount). The scarcity of high-powered GPUs placed a similar premium on LLM inference.

I got a bit lucky on this one: my existing database server happened to have four unused double-width PCIe slots.

I got in touch with an NVIDIA distributor, and they were glad to sell me a stack of high-powered NVIDIA Hopper GPUs (which I’ve since upgraded to Blackwells) that could run some of the largest, most sophisticated models available.

This decision — unlimited local processing and inference at a fixed cost — was the key to unlocking the rest of this story.

A stack of NVIDIA Blackwell GPUs sitting on top of my database server

Not everything can run in a local datacenter, of course. Since starting, much of the compute capacity I’ve added has been Cloud-based. Google Cloud hosts key components of the storage and API serving infrastructure, and Fastly is used as a CDN and Application Firewall.

But the core dataset acquisition, harmonization, and publishing still takes place on custom-built, state-of-the-art self-hosted hardware.

2024+: Launch Time

About a year ago, when I reached 95% coverage of the United States, I knew I had something special.

With 100% coverage in sight, I started thinking about how I would launch it as a product. I needed the usual suspects: a website, a delivery mechanism, infrastructure to replicate and back up everything, and a whole lot more automation than the system had at the time.

The core system today, built on Ray, Docker, Python, and PostgreSQL, is almost entirely automated. These days I spend much of my time working on increasing coverage of additional property attributes.
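To give a feel for the shape of that automation, here is a rough sketch of a Ray-style fan-out. It is illustrative only: the stage names, URLs, and file paths are hypothetical, and the real pipeline has many more steps.

# A hedged sketch of a Ray-based ingestion fan-out; stage names, URLs, and
# paths are hypothetical, not the actual landrecords pipeline.
import ray

ray.init()  # start or connect to a Ray runtime

@ray.remote
def fetch_county(source_url: str) -> str:
    """Download one county's raw parcel data and return a local path."""
    ...  # download shapefile/CSV/GeoJSON, verify it, record provenance
    return f"/data/raw/{source_url.rsplit('/', 1)[-1]}"

@ray.remote
def harmonize(raw_path: str) -> str:
    """Map the source schema onto the canonical schema."""
    ...  # column mapping, type coercion, geometry validation
    return raw_path.replace("raw", "harmonized")

@ray.remote
def load_postgres(harmonized_path: str) -> None:
    """Bulk-load the harmonized records into PostgreSQL."""
    ...  # COPY into a staging table, then merge

sources = ["https://example.gov/countyA.zip", "https://example.gov/countyB.zip"]
futures = [load_postgres.remote(harmonize.remote(fetch_county.remote(s))) for s in sources]
ray.get(futures)  # run the stages in parallel across the cluster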

I also wondered how the established private equity-owned incumbents Regrid and CoreLogic would react. I think their prevailing cost structures will not permit them to operate the way I am, and I do not see a path for them to get to where I am today. I think the only way for them to do what I’ve done is to start over. In that sense, I have a substantial head-start.

So, how will these companies respond to a new competitor offering drastically lower prices?

My guess? Not well. Private equity is not known for its sportsmanship, let’s say. But I’ve made it this far, and so I’m pressing forward.

So, who are we?

Landrecords is the name of the product; Systems of Record LLC, a Virginia limited liability company, is the name of the company. It is founder-led and independently owned, with a mission to make property data accessible to everyone. My ambition is to do exactly one thing, and do it better and cheaper than anyone else. And to give back to the important community projects that have helped along the way.

I don’t pretend to be a big company, and I am proud to offer the lowest-cost nationwide parcel dataset of the United States, with nearly 155 million parcel records.

Thank you to all of our customers and early testers whose feedback and analysis have made it possible to deliver the product we have today.

–founder and head parcel data wrangler