Webservice API design tips – correct pagination and exposing deleted rows

After working with dozens of REST, SOAP and ‘ad-hoc’ web services / APIs I’ve noticed a similar set of design problems by companies big and small. One gotcha I almost always see left out of an API is an easy way to determine which records were deleted or moved on the backend. Another gotcha is implementing pagination and sorting in a helpful way. This includes ‘feed’ style listing APIs where the data under the API is changing constantly. I’ll explain the solutions below after a brief introduction.

Overview of a ‘books’ listing endpoint:

Let’s say we have a web app, a native iOS app and a 3rd party system that need to look-up books in a database. A RESTful API is perfect for this!

Let’s make the API a decent one by allow keyword filtering, pagination, and sorting.

# listing of book records, default sort, page 1 implied, default page size of 10
GET /books
{
record_count: 24178,
page: 1,
results: [
	{title: "Calculus 1st Edition", publisher: "Mathpubs", id: "15878"},
	{title: "Geometry 4th Edition", publisher: "Heath", id: "65787"}
	....
]
}
# listing of book records that contain 'python' as a search match
GET /books?q=python
{
record_count: 147,
page: 1,
results: [
	{title: "Python", publisher: "O'Reilly", id: "74415"},
	{title: "Fluent Python", publisher: "O'Reilly", id: "99865"}
	....
]
}
# listing of book records, sorted by title
GET /books?sort=title
{
record_count: 24178,
page: 1,
results: [
	{title: "Aardvark's Adventures", publisher: "Kids books", id: "124789"},
	{title: "Aardvark's Explained", publisher: "Zoolabs", id: "988741"}
	....
]
}
# get the 10 most recently updated books related to python
# note the minus (-) sign in front of updated_at, that is a Django convention but in your API do it however you want, perhaps better to specify it as "NewestFirst", just keep it consistent
GET /books?q=python&sort=-updated_at&page_size=10
# next get the 11 - 20 most recently updated books related to python
GET /books?q=python&sort=-updated_at&page_size=10&page=2

My notes on sorting a webservice listing endpoint:

  • By default, sort the results by something natural like title or date created if the sort parameter isn’t supplied.
  • Allow the client to specify a sort order. Validate the sort order they provided against a list options the server allows. Return a warning if it is invalid with a 400 error (bad request).
  • An essential sort order option is the time a record was last updated, newest first (typically updated_at desc). With that sort option a client can crawl through the pages until it hits a date already processed and stop there. So many APIs I’ve worked with overlook sorting by update_at desc. Without the updated_at desc sort option a client is forced to crawl the entire listing to find anything new or updated. This is very inefficient for large databases with a relatively small number of regular changes or additions.

My notes on paginating a webservice listing endpoint:

If your data set has more than say, 10 rows, adding pagination is a good idea. For very large data sets it is essential because too much data in a request can crash the server or the client.

  • Implementing pagination is a matter of the proper LIMIT / OFFSET queries on the backend, though that varies by ORM and data store.
  • One annoying thing that may dissuade you is, the server should return the total count of records that match in addition to returning the slice of rows that match the current page and page size. This is so the appropriate page links {1,2,3,4…} can be generated. Getting the overall count of matches can be a performance hit because it involves an extra query. If you want solid pagination, you just have to bite the bullet in terms of the count query.
  • The client should be able to tell the backend the page size it wants, but it should be validated (say between 1 and 100 most of the time).
  • Really good REST frameworks like Django-Rest-Framework offer ‘next-page’ and ‘previous-page’ URLs inside the JSON response – very handy for paging!

My notes on paginating a ‘feed’ style listing:

Some data sets are a lot more wild than books and change constantly. Let’s take the example of a twitter style feed, where bots, celebrities, teenagers, and software developers waiting on unit tests are tweeting their heads off in real time.

In this case, the database needs to organize records by a natural sort. Twitter has the concept of an ‘id’ that is sortable. Yours might be the updated_at flag or some naturally sorting hash that goes on each record (maybe the primary key). When the client loads the feed, the first call asks for a page of data with a given number of rows (say 50). The client notes the maximum ID and the minimum ID it got (typically on the first and last rows respectively). For the next API call, the minimum ID gets passed back to the server. The server then returns the next 50 rows after the minimum ID value the client saw. The server could also return the number of ‘new rows’ on a periodic basis with an ID higher than the maximum ID the client initially got. It has to be done this way because while the user was reading their tweets and scrolling down, it is possible many new tweets were created. That would cause everything to slide down and screw up traditional pagination.

Twitter has a more in depth tutorial here:
https://dev.twitter.com/rest/public/timelines

What about deleted or moved records??

Getting at deleted records in an API is a practical problem I’ve had to solve several times. Think of case where a background process scrapes an API and keeps tabs on what changes. For example, social media posts or content records in a CMS.

Let’s say an hour ago, the listing API was scanned and all data was retrieved and our copy is in perfect sync with the other side. Now imagine the book with ID 789 gets deleted on the server. How do we know that 789 got deleted?

Invariably, I have ask the people who made the API and they write back and say something like, “it can’t do that, you have to page through the entire set of data or call for that individual book by ID”. What they are saying is, on a regular basis do a full scan of the listing, compare that to what you have, and anything you have that the server doesn’t was deleted on the server.

This situation is particularly painful with very large data sets. It can make nightly syncs unfeasible because there is just too much data to verify (rate limits are quickly exceeded or the sheer amount of processing time is too high). Let’s say you are forced down that road anyway. You have to be very careful when triggering deletes on your side since a glitch in the API could cause accidentally deletes on your side. In this scenario when the API goes down or responds with an empty result set the scraping program might think “great I’ll delete everything on this side just like you asked since it looks like nothing exists anymore!”. To prevent that kind of disaster, in the past I’ve limited the maximum number of deletes per run and alerted when it found an excessive number of deletes.

Fundamentally a RESTful API isn’t a great way to mirror data that changes all the time. The reality is, often it is all you have to work with, especially given mobile apps and cross platform connectivity, security requirements, etc.

Here is what I do regarding server side deletion of records in a listing API:

First of all, as a general principle, I almost never design a database to allow immediate physical deletion of records. That is like driving without a seat belt. Instead, I add a deleted column with type tinyint/bool/bit default 0 to every single table. The front end and all APIs are programmed to filter out deleted rows. This way, if something is accidentally deleted, it can easily be restored. If a row has been deleted for more than a given period of time, say 12 months, a cleanup script will pick it up and physically trash it and associated child rows out of the database. Remember – disk space is cheap but data loss is costly.

Another way to do this is to keep a DeletedBooks table. Whenever a Book is deleted, make an entry in that table via a trigger or hook or whatever your framework fires off after a record is deleted. I don’t like that as much as the deleted bit column solution because with hooks / triggers things get complicated and data loss can happen unless they are truly ‘transactional’. However, a DeletedBooks table may be easier to put in place in a legacy system that constantly stymies your efforts to make a good API.

Now that our data layer has knowledge of deleted records, we can add a new endpoint for deletes that only returns books that were deleted. This API should be paginated, allow filtering, etc. Note that it includes a date_deleted field in the results, which may be useful to the client. In most cases date_deleted may be substituted for updated_at.

# listing of deleted book records!
GET /books_deleted
{
record_count: 50,
page: 1,
results: [
	{title: "Algebra II", id: "29898" date_deleted: "2016-08-20 T18:25:43.511Z" },
	{title: "Trig for Kids", id: "59788" date_deleted: "2016-08-17 T07:54:44.789Z" },
	....
]
}

You could also add a deleted parameter to the original listing API to filter for deleted records:

GET /books?deleted=1

A similar implementation can be created for records that disappear for whatever reason – moved to a different account, re-classified, merged, or tossed around like rag dolls. The basic idea is to expose data so clients can decipher what the heck happened instead of having to page through the entire listing API to piece it together.

All the other ‘best practices’ for REST APIs:

If you’ve read this far you are probably committed to building a good API. Thank you. It is a thankless job like many in ‘backend’ software, but let me again say Thank You. Unfortunately, people usually don’t notice when things go smooth, but a bad API is very easy to notice. Perhaps a few developers have suffered permanent IQ degradation from being forced to write code against poorly designed, undocumented, and jenky APIs. Together, we can ensure this is a thing of the past.

All the docs I’ve read say a good API should emit JSON and XML. Your framework should handle that for you, so I won’t say anything more about that.

Eg:

GET /books.json -> spits out JSON
GET /books.xml -> spits out XML

Successful requests should also return the http status code of 200.

Here are some other status codes you’ll want to use in your API.

  • 400 – bad request (inputs invalid, something screwed up on their end)
  • 401 – unauthorized (user is not authenticated or can’t access this particular thing)
  • 404 – not found (just like a page not found error on the web)
  • 405 – method not allowed (eg, client tired to POST to an endpoint that only allows GET requests)
  • 500 – internal server error (something screwed up on your end, I hope you logged the details?)

For a complete list of HTTP status codes see:
http://www.restapitutorial.com/httpstatuscodes.html

Other good tips I’ve seen include: Versioning your API, use verbs correctly (GET, POST, DELETE, PUT, …), use SSL, document it, etc.

For more best practices involved RESTful APIs see:
http://www.vinaysahni.com/best-practices-for-a-pragmatic-restful-api
http://blog.mwaysolutions.com/2014/06/05/10-best-practices-for-better-restful-api/

This entry was posted in Application Development, Code and tagged , , , . Bookmark the permalink.

Comments are closed.