Recent Activity

Smarter Data Caches

ASP.NET provides a way of storing data so when the application needs it again, it can be loaded from the current process' memory rather than making another round trip to the database server. This concept is founded on the idea of temporal locality, that is, if it's requested once, there is a high likelihood it will be needed again in the near future. A time limit is set on the cached item, and when that limit expires, the data is no longer valid.

Variable, file and database cache dependencies extend this capability to allow the user to link the cache to the data source. When the data is modified, and therefore no longer valid, the cache can be automatically cleared. Aside from the fact that it has an affinity for SQL Server 2005, it's a decent system. However, in my opinion, the key issue is not *when* to cache, but *what* to cache. The theory of data caching is actually based on two concepts, temporal locality AND spacial locality (when one thing is requested, data near it will also soon be requested). ASP.NET puts the burden of spacial locality solely on the shoulders of the developer.

A data access scenario

Here is a typical scenario: a user arrives at a web site, say it's an e-commerce system. The user searches for "widget", and gets 500 results for widgets. The user then searches for "widget x" and gets 100 results, finding the item they want in the first page.

Next, the user clicks on widget-x and then views the product's details, including thumbnails, specifications, and related products. Deciding to look at related products, they click on widget-y, widget-z and widget-v; all of witch show widget-x in their "related items". The user then returns to the details page for widget-x and adds the item to their shopping cart, and then proceeds to check-out. Checkout is three stages, finally showing a receipt of the user's order.

Without a data cache, this system could load the product record for widget-x a total of eleven times, depending on how the checkout pages are implemented. Also, product details and related products are loaded twice for the product details page for widget-x and possibly again for the other widgets.

Performance & cache

Even a cursory thought about performance should bring to mind the need for a data cache to avoid unnecessary trips to the database server. However, cache has a lot of knobs and it's not immediate how and when the data should be cached. The following is a list of variations on how a cache system could be used to minimize the impact of this common user behavior:

1) Cache only the highest level

Product details by the product id, load everything else as needed. This still does not make use of spacial locality.

2) Cache everything grouped

Product and related data are stored under the product id as the key. This makes use of spacial locality, but is an inefficient use of memory, since similar related data could be cached multiple times for different products. Also, it logically segments the cache by the page functionality, since the product details page will likely have a distinct data set from a search results page.

3) Cache everything individually

Products are stored under the product id, details are stored under their detail id's, related products are stored as individual records under their own id's. This approach is highly granular, fixing this issue of segmentation in the last method, but is much more of a burden to the developer to manage. For a team working on a single project, this type of system would require a class to manage access to the cache to ensure cache coherency.

4) Cache the page output

The page is rendered once, and the output is stored. Future requests simply return the pre-rendered HTML. This is like 2) above, causing each page to have their own cache, never sharing data. One could argue that this method is perhaps safer, since no individual objects are being cached, and therefor no mixing of old and new data can occur.

5) Cache the page output in blocks

Each part of the page must be composed of user controls, and then the user controls are rendered once. Future requests for the controls simply return the pre-renedered HTML. This is similar to 3) above, where the controls used across multiple pages can all benefit from a more granular cache.

And the list goes on. Furthermore, within each one of these methods, there are more detailed options of how to invalidate the cache, and exactly how to key the various bits of cached data. It's a lot to manage for the developer, and it gets much worse if there is no system-wide plan for a team of developers.

Relational locality

Unfortunately, I'm not proposing a solution, however, I do have some thoughts on a possible approach. I think the cache concept of spacial locality could be extended to relational locality. That is, when a record is accessed, its related records will likely be accessed also. Due to the highly relational nature of relational databases, this type of system should be relatively easy to implement in an automated fashion. The types of cache controls needed would be something like relation-levels to traverse when requesting data.

A radical approach would be to create a pass-through layer where all data requests get sent to a data cache. This cache decides what to pull from the database based on the data requested. In this type of model, the data cache would essentially replace the data access layer. When selecting related data to fetch, two approaches could be taken:

1) A design-time process

A process could be run which compiles all requests down to stored procedures for the current build. Essentially the cache system would analyze all requests and build a working set of queries based on the requests from the code.

2) A run-time process

The cache/data layer could actively monitor real-time requests and change the stored procedures based on usage statistics. So if a major advertising campaign caused a flood of traffic to one page, and spilled over to related pages, the system could actively cache all data for the page, and related pages on the first visit to the root page (the one that was advertised).

I realize that this type of system requires a major development investment, and may not be practical at this time. However, it seems a smarter and simpler cache system is needed if ASP.NET aims to enable the average developer to create massive web systems out of the box.

Posted in VB.NET, ASP.NET and Data/DB on August 17th