How inappropriate data access patterns massively
slow down programs and how the same problems arise with RESTful APIs.
Most SQL databases make roughly the same implementation choices. MySQL,
Oracle, DB2 and PostgreSQL all use heaps and b-trees to represent data tables.
While database implementations are similar, application access patterns can
A home truth of software development is that one pattern, Active Record, is
Active Record described
“Object relational mappers” (ORMs) exist to bridge the gap between the
programmers’ friend (the object), and the database’s primitive (the relation).
The reasons for these differing models are as much cultural as functional:
programmers like objects because they encapsulate the state of a single thing
in a running program. Databases like relations because they better suit
whole-dataset constraints and efficient access patterns for the entire
The simplest and most intuitive manner of translation is for programmers to
just map a single class of objects onto a single table. Here’s an indicative
# Create our book objectbook=Book(isbn=”978-0099466031″,title=”The Name of the Rose”,author=”Umberto Eco”)# Record it to our databasebook.save()# …later, get a book by ISBNbook1=Book.get(“978-0099466031”)# And print the titleprint(book1.title)
The Active Record pattern of data access is marked by:
- A whole-object basis
- Access by key (mostly primary key)
This pattern of access is simple and intuitive but causes huge problems in
even small projects. Retrieving whole rows is hugely wasteful when only part of
the row is required to resolve a user request. The issue becomes pronounced
when: retrieving sub-parts of the data (projection), consulting multiple tables
(joins) or digesting the dataset (aggregation).
To get all ISBNs of all deceased Italian authors in the Active Record
isbns=# for each authorforauthorinAuthor.get(nationality=”Italian”,state=”deceased”):# Get their booksforbookinBook.get(author=author.name):# And append the ISBN to our listisbns.append(book.isbn)
- The entire author object for each dead Italian had to be retrieved
- The entire book object for each book had to be retrieved
- Nested for loops prelude much parallelism
The above code snippet read intermediate data about the author into
memory as well as completely irrelevant non-ISBN data about
the book. For n authors with m books each:
- n*m queries were performed
- with all the overhead that implies1
- n*m serialisations by the database
- probably in a fairly quick runtime like C
- n*m deserialisations done by the program
- probably in quite a slow runtime like Ruby
For a dataset with just a thousand or so authors the above nested for loop
is significantly slower than the equivalent SQL:
The above SQL:
- resolves the same question in a single query
- serialises/deserialises only the ISBNs
- given sufficient indexes on author_name,
nationality and state it will be able to avoid a
direct read of the author table
- most people would have these on a schema by default
- given a composite index (author_name and
isbn) on book it would also be able to avoid a direct read of
the author table
- not the kind of thing you put on a schema by default but easily
added if it helps a common query
- not the kind of thing you put on a schema by default but easily
Given that each of these four improvements would probably increases
performance by a factor of ten over the above Active Record style it wouldn’t
surprise me if the query was ten thousand times faster than the Active
Record style access2.
The available mitigations don’t work well in practice
Most practical Active Record style ORMs will have some method of more
efficient access – but these methods will be idiosyncratic and they will breach
the central abstractions of whole-object basis and access by
key. Likely they will require increased understanding of the specific ORM
being used and they will probably be a lot less appealing to most developers
than the primary affordance of the library: the
Active Record pattern.
One common response to this is to contend that Active Record is suitable as
a first implementation with the plan of retreating to a more efficient
mechanism later. Unfortunately Active Record access patterns break down almost
immediately – typically as soon as there is a requirement to: “count all
entries where…”, “get all distinct…” or “return true if there are any
entries such that…” will entail such a slow implementation that it will cause
The second problem is that once the Active Record pattern is adopted
refactoring is not usually easy. Problem queries are only rarely presented as
apparently as nested for loops right next to each other. Often single
conceptual queries span abstraction boundaries, are in different parts of a
tall call-stack or are just separated by program context3. Once you design your abstractions in (wilful)
ignorance of what is an effective access pattern it’s very hard to row
The viable alternative: first class queries
The only workable alternative is to make queries4
first class objects.
This doesn’t always have to be done via SQL literals. It can be done by
composing objects: many non-Active Record ORMs offer these “querybuilders”.
Here’s an example of creating an EXISTS query (normally very
quick but hard to express in the Active Record pattern) with a
defwrote_in_english(session:DBSession,author_name:str)->bool:”””Returns whether the author wrote in English”””exists:bool=(# our subquery for a matching row – never read into program memorysession.query(Author).join(Book).filter(Book.language==”english”).filter(Author.author_name==author_name)).exists()# just get a boolreturnexists# and return it
The difference here is that you aren’t working on either a whole-object
basis or on a key-based access basis. Any element of data can be retrieved,
without having also to retrieve intermediate or irrelevant data, and queries
can be based on any criteria.
Also necessary: first class transactions
For read-only operations, first class queries are enough to allow for
effective data access patterns. For write operations another first class object
is required: the transaction or “database session”. Here is some sample problem
book=Book.get(isbn=”978-0099466031″)book.copies_in_stock-=1book.save()# WRONG! irreparable data loss here
The problem in the above code is that any other program which incremented
the sales number for that book between the get and the
save had its addition implicitly deleted when our
save was applied.
These “lost update” bugs are extremely common as it is not intuitive that if
you plan to update a row based on what you’ve read from it you need to lock it
prior to the read. In the above case the read is partly disguised by
the decrement operator, -=.
The majority of Active Record style data access patterns in the wild are
non-transactional, even when working on single columns in individual rows. The
“lost update” problem is actually much wider than just SQL ORMs and really
applies to any store which doesn’t provide for locking data that has been read
prior to an update. Programs naively reading and writing to key-value stores
without a transaction/lock facility are often riddled with lost update
What’s required is something like this:
withsession:# our transaction objectbook=book_repository.get(session,isbn=”978-0099466031″)# row lock startsbook.copies_in_stock-=1book_repository.save(session,book)# OK – row lock still heldsession.commit()# explicit commits preferred for clarity
Note that the session is controlled explicitly by the code that is
organising the update. This is the essential complexity of doing a data update
and cannot be “abstracted” away (that is, “removed” to make things
The analogous problem with RESTful APIs
are also, unfortunately, based on the same principles as the Active Record
pattern. Instead of a whole-object basis, REST operates on a whole-“resource”
basis. Access is again done by key: URL.
RESTful API design consequently suffers the same problems as Active Record
ORMs. The problem is most severe when fundamentalist RESTafarians get involved,
as they advocate using hypertext-style links between API endpoints – HATEOAS (“Hypermedia As The
Engine Of Application State”). Under HATEOAS, applications “browse” the
API, moving from link to link to accomplish their goal.
HATEOAS is of course totally inimical to effective access to data. The
largest class of API clients is separated from the server by high latency
network links: mobile phones. Latency over a good 3G link is commonly 300ms. The latency from
continental Europe to California is an additional 140ms. This means that
accessing multiple API endpoints to resolve a single user request can take
seconds – an appreciable annoyance to any user.
Bandwidth to transfer irrelevant or intermediate data is at an even higher
premium over the internet as instead of the gigabit networking common in
datacentres the effective bandwidth from phones or home computers is often less
than a megabyte per second.
When using REST inside the datacentre you don’t have high latency links or
bandwidth limitations. However the requirements for data manipulation are
typically highest inside the datacentre and few datasets are so small that it’s
effective to repeatedly read large fractions of them into program memory. One
telltale sign of the problem: services that spend a large percentage of their
CPU time parsing and emitting JSON.
However, REST APIs do have one option for mitigation that ORMs don’t: they
can vary their endpoints to suit their clients’ needs. Making endpoints
coarser, to allow resolving a whole user request in a single round-trip, is a
key tactic. Another tactic is offering alternative “views”5 of
whole sets of data via different endpoints.
There has been a reaction against REST. One new alternative is GraphQL which is a query language and server that is
able to combine REST endpoints into a single result – making the query a first
class object. The second is a growing movement for RPC-style APIs – for example
Slack’s API is based on RPC instead of
Take data access patterns very seriously! Poor patterns of access can make
software unusable quickly. With bad abstractions it’s easy to turn something
that should be logarithmic time into polynomial time. This change in complexity
class will stop your program working as the dataset grows.
From ORMs: demand first class queries and transactions. Avoid Active Record
style access patterns whether in ORMs or elsewhere.
Use REST judiciously: only where the desired API surface really is just
simple state exchange with either no prospect of lost update or where they
don’t matter. Always consider whether RPC is a better option.
ACM Queue’s interview
with Arthur Whitney – financial data access patterns are unusual (8 hours
of fast events and then 16 hours of mandatory downtime) but Whitney describes a
small subculture that has taken data access patterns very seriously.
of Computer Science – a classic blogpost laying out many of the problems
with ORMs as they stood 15 years ago. Really unfortunate that the author chose
such a laden metaphor – the best discussion is in the last third and all the
Vietnam guff can be skipped without losing meaning.
SQLAlchemy, a Python library that
includes both ORM style objects (best used for inserts only) and also a rich
“core” query API. Transactions are first class and queries are too. There are a
lot of ways to control the exact method of data access. I think SQLAlchemy is
Fowler’s original Active Record definition from his book Patterns of
Enterprise Architecture. I like this conceptualisation.
Fowler’s Unit of Work definition from the same book. Often this is
presented as the opposite of Active Record but as Fowler describes this pattern
it is for keeping track of edits and ordering inserts.