A technique for building high-performance databases with EF Core

Last Updated: January 1, 2020 | Created: December 22, 2019

As the writer of the book “Entity Framework Core in Action” I get asked to build, or fix, applications using Entity Framework Core (EF Core) to be “fast”. Typically, “fast” means the database queries (reads) should return quickly, which in turn improves the scalability of the database.

Over the years I have worked on a lot of databases, and I have found several ways to improve database accesses. In this article I describe a new technique I found that uses an event-driven technique to update cached values in the actual SQL database. For me this approach is robust, fairly easy to add to an existing database, and can make reads quite fast. You might want to bookmark this article in case your boss comes up late in a project and says “the application isn’t fast enough!” – this might be just the approach you need.

The other articles in this series are:

TL;DR; – summary

  • This article describes a way to improve the performance of a database query when using EF Core. For the example application the performance gain is significate.
  • The technique adds new cache values to the existing database and updates them via an event-driven approach provided by my EfCore.GenericEventRunner library.
  • It is one of the simplest ways to add/update cache values, especially if you want to apply it to an existing SQL database.
  • The article is very long because I describe both the “why” and “how” of this approach works.
  • There is an example application in my GitHub repo, EfCoreSqlAndCosmos, that you can run yourself.

Setting the scene – picking the best way to performance tune

There are some things you can do to improve a database query using EF Core – mainly its about writing LINQ queries that translate into efficient SQL database queries. But with big or complex databases this can only take you so far. At that point you need to think about altering the database to make some parts of the database query easier to do.

The known way to speed things up is to use some form of cache, i.e. part of the query that takes a long time is pre-computed and stored so that the query can use this cached value instead. In my example I’m going to pre-compute the average review votes for a book (think Amazon’s star rating) – see this section for the performance improvements this provides. But the cached value(s) could be anything– for one of my clients it was pre-calculating the total pricing of a large and complex project.

Using cached values can make significant improvements to performance (see example later), but there are some big downsides. Caching data is notoriously difficult to get right. The typical problem is that the cache doesn’t get updated when the original data changes. This means the cache value returns old (“stale”) data when it shouldn’t. See this recent tweet from Nick Craven who works at StackOverflow – his comment is “Only cache if find you need to” and goes on to say “It absolutely has downsides in memory, confusion, complexity, races, etc. It’s not free.”

Therefore, I am always looking for caching designs that are simple and robust, i.e. you can be sure that the cache gets updated when the data changes. One approach I really like is using a two-database CQRS database design which is good, but not simple (I have already written about this: see this article on performance gain, and another article on a new design using Cosmos Db). I needed a simpler solution for a client that could be added to their existing databases, which is where this new event-driven approach comes from. The rest of the article cover adding cached values to an example application and what performance gains that gave me.

Example implementation – the book app

The example application I am using is my “book app”, which is I super-simple book selling application. I used this in my “Entity Framework Core in Action” book so I have lots of performance data for this app. In this application I cache two sets of data:

  1. A string holding the list of author’s names, e.g. “Erich Gamma, John Vlissides, Richard Helm, Ralph Johnson” (those names are from the famous “Design Patterns: Elements of Reusable Object-Oriented Software” book). This speeds up the display as it saves looking for each author and getting their name.
  2. The Reviews information, which is the number of reviews and the average star rating for all those reviews. This speeds up both the sorting/filter by the books average star rating and also speeds up the display of the book because

Here is a diagram to showing you the display of one book, with the parts we are going to cache to improve the performance of the book app.

I found there are two ways you can update a calculated cache values. They are:

  1. Complete Revaluation: You can recompute the cached value from the database, e.g. ReviewsCount = Book.Reviews.Count();
  2. Delta Update: You can update the cached value by adding the change (delta) to the existing cached value, e.g. ReviewsCount = ReviewsCount + 1;

The Complete Revaluation approach is the most obvious and works for everything, but as you will see has some issues when accessing the database. The Delta Update approach is quicker and is good for mathematical data, but if you miss something in your implementation then you can get the wrong answer.

NOTE: With the Delta Update approach I recommend building a service that will calculate the values using the Complete Revaluation approach and checks/updates any cached values if there is a problem. You can run when the system is lightly loaded to a) ensure the cache values up to date and b) spot if there are any problems in your Delta Update code.

I will describe both of these approaches, starting with the List of author’s names, which uses the Complete Revaluation approach

Authors string: set by Complete Revaluation approach

I used the Complete Revaluation approach to create the common-delimited string of author’s names.  Here is a diagram giving you an overview of how it works.

Creating an event when an individual Author’ Name is described in this section of my previous article, so I’m going to focus first on the event handler.

NOTE: You should read the article “EfCore.GenericEventRunner: an event-driven library that works with EF Core” to understand my code as I am using the EfCore.GenericEventRunner library in my example code.

Here is the event handler code that will recompute the string of authors for each book. It’s a bit more complicated than you would think, because the changed name hasn’t yet been written to the database yet.

public class AuthorNameUpdatedHandler :
    IBeforeSaveEventHandler<AuthorNameUpdatedEvent>
{
    private readonly SqlEventsDbContext _context;

    public AuthorNameUpdatedHandler(SqlEventsDbContext context)
    {
        _context = context;
    }

    public IStatusGeneric Handle(EntityEvents callingEntity, 
        AuthorNameUpdatedEvent domainEvent)
    {
        foreach (var bookWithEvents in _context.BookAuthors
            .Where(x => x.AuthorId == domainEvent.ChangedAuthor.AuthorId)
            .Select(x => x.Book))
        {
            var allAuthorsInOrder = _context.Set<BookWithEvents>()
                .Where(x => x.BookId == bookWithEvents.BookId)
                .Select(x => x.AuthorsLink.OrderBy(y => y.Order).Select(y => y.Author).ToList())
                .Single();

            var newAuthorsOrdered = string.Join(", ", allAuthorsInOrder.Select(x =>
                x.AuthorId == domainEvent.ChangedAuthor.AuthorId
                    ? domainEvent.ChangedAuthor.Name 
                    : x.Name));

            bookWithEvents.AuthorsOrdered = newAuthorsOrdered;
        }

        return null;
    }
}

The lines to point out in the code are:

  • Lines 14 to 16: The author may have worked on multiple books, so we need update each book’s AuthorsOrdered string. Note that the domainEvent contains an instance of the Author class where the Name has been changed.
  • Lines 18 to 21: Because the author’s new name that created this event isn’t in the database, we can’t just read the current Author’s name from the database. I have therefore read in all the Author classes, in the correct order first…
  • Then in lines 23 to 26 I go though them and when it comes to the Author that has been changed, we substitute the new Name string instead of the existing database Name string.

This last point shows that we need to be careful about accessing the database, because the events are run just before SaveChanges and therefore some data hasn’t been saved.

Review cached values: set by Delta Update approach.

For the ReviewCount and ReviewsAverageVotes I used the Delta Update approach, which works well with mathematical changes. Here is a diagram showing how the “add a new review” events works:

As you will see it is much quicker to calculate and doesn’t need to access the database, which also makes the code simpler. Here is the “add new review” event handler.

public class ReviewAddedHandler :
    IBeforeSaveEventHandler<BookReviewAddedEvent>
{
    public IStatusGeneric Handle(EntityEvents callingEntity, 
        BookReviewAddedEvent domainEvent)
    {
        var totalStars = Math.Round(
            domainEvent.Book.ReviewsAverageVotes * 
            domainEvent.Book.ReviewsCount)
            + domainEvent.NumStars;
        var numReviews = domainEvent.Book.ReviewsCount + 1;
        domainEvent.UpdateReviewCachedValues(numReviews, 
            totalStars / numReviews);

        return null;
    }
}

The lines to point out in the code are:

  • Lines 7 to 9: I get the total stars applied to this book by multiply the average star rating by the number of reviews (simple maths). I then add the delta change (line 10), which is the star rating from the new Review.
  • Line 11: I add 1 to the ReviewsCount because this is an “Add new Review” event.
  • Lines 12 to 13: The Book class provided an Action that can be called by the events to set the ReviewsCount and ReviewsAverageVotes. This is a nice way to control what the event handler can do within the Book class.

Building code to check/update the cache values

As I said earlier it is a good idea to back up Delta Update approach with code that will recalculate the cached values using the Complete Revaluation approach. If you are adding either approach to an existing database, then you will need a tool to set up the cache values for existing data anyway. And with a little a bit more work you can use the same tool to catch any updates that are going wrong and get logging/feedback so that you can try a track down the software problem.

I always build a tool like this if I add cache values to a database. You can see my HardResetCacheService class that uses the Complete Revaluation approach to check and reset as necessary any cache values. It’s not super-fast, but you can run it when you know the system is normally lightly loaded, or manually if you think there is something wrong. Hopefully you won’t use it a lot, but if you do need it you will be very happy it’s there!

Making the cache design robust

There are two parts to making the design robust, which means the cache values are always correct. As I said at the start, making sure the cache values don’t return old (“stale”) or incorrect data is a big challenge. In this design there are two things to cover:

  1. Making sure a change in the underlying data is reflected into the cache values
  2. Handling multiple updates happening in parallel.

Underlying data change always updates the cache values

The event-driven system I am using ensures that a change in any data that effects a cache value can be captured and sent to the appropriate event handler to update the cache property. Assuming you haven’t got a bug in your system then this will work.

The other part of the event-driven design is the original data and the cache values are store in the same transaction. This means if anything goes wrong then neither changes are saved (see more on the design of event-driven approach to see how this works).

Handling multiple updates happening in parallel

We now need to talk about multiple updates happening in parallel, which brings in all sort of interesting things. Say two people added a new review to the book at exactly the same time. If we don’t do anything to handle this correctly the cache update from one of those reviews could be lost. This is known as a concurrency issue.

This is the part that took the most time to think. I spent days thinking around all the different concurrency issues that could cause a problem and then even more days coming up with the best way to handle those concurrency issues

I considered doing the cache update inside a transaction, but the isolation level needed for totally accurate cache updating required ‘locking’ a lot of data. Even using direct SQL commands to calculate and update the cache isn’t safe (see this fascinating SO question/answer entitled “Is a single SQL Server statement atomic and consistent?”).

I found the best way to handle concurrency issues is to use EF Core concurrency tools to throw a DbUpdateConcurrencyException and then working out the correct cache value. This is most likely the most complex part of the design and I start with the try/catch of exceptions in my EfCore.GenericEventRunner library. Here is a diagram to show you what happens if two reviews are added at the same time.

Now let’s look at the code I need to handle these types of concurrency issue.

Adding handling exceptions from SaveChanges/SaveChangesAsync

First I needed to add a way to capture exceptions when it calls SaveChanges or SaveChangesAsync. I already have a pattern for doing this in my other libraries (EfCore.GenericServices and EfCore.GeneriBizRunner). This allows you to add exception handler to catch database exceptions.

Up until now this feature has been used for turning database errors into error messages that are a) user-friendly and b) don’t disclose anything about your system (see this section from my article “EF Core – validating data and catching SQL errors”). But now I needed a way to handle the DbUpdateConcurrencyException where my code fixes the problem that caused the concurrency exception and it then calls SaveChanges/SaveChangesAsync again.

To do that I have added the same exception handler capability into my EfCore.GenericEventRunner library, but enhanced it for handling concurrency issues. Previously it returned null (exception not handled, so rethrow) or a “bad status” (contains user-friendly error messages to show the user). I added a third, return a “good status” (i.e. no errors) which  means try the call to SaveChanges/SaveChangesAsync again.

This “good status” is what I use when I fix the problems with the cache values.  Here is the code in my EfCore.GenericEventRunner library that surrounds its calling of the base SaveChanges.

do
{
    try
    {
        status.SetResult(callBaseSaveChanges.Invoke());
        break; //This breaks out of the do/while
    }
    catch (Exception e)
    {
        var exceptionStatus = _config.SaveChangesExceptionHandler?
            .Invoke(e, context);
        if (exceptionStatus == null)
            throw; //no handler, or handler couldn’t handle this exception        

        status.CombineStatuses(exceptionStatus);
    }
} while (status.IsValid);

The lines to point out in the code are:

  • Line 1: The call of the SaveChanges is in a do/while loop. This is needed, because if the SaveChangesExceptionHandler fixes a concurrency problem, then it needs to call SaveChanges again to store the corrected data. But because it is possible that another concurrency issue happens on this second SaveChanges, then it will call the SaveChangesExceptionHandler again.
  • Line 6: If the call to the base SaveChanges is successful, then it exits the do/while as all is OK.
  • Lines 12 to 13: This is case 1, no handler or handler can’t manage this exception, so the exception is rethrown.
  • Line 17: The while will loop back and call SaveChanges again. If there is an exception the process is run again.

Using the exception handler to fix cache concurrency issues

Now there is a way to capture an exception in my EfCore.GenericEventRunner library coming from SaveChanges/ SaveChangesAsync we can use this to capture concurrency issues around the cache values.

The first thing I need to do is tell EF Core to throw a DbUpdateConcurrencyException if it detects a concurrency issue (see previous diagram). To do this I marked the three properties with the ConcurrencyCheck attribute, as shown below.

public class BookWithEvents : EntityEvents
{
    //… other code left out
    [ConcurrencyCheck]
    public string AuthorsOrdered { get; set; }

    [ConcurrencyCheck]
    public int ReviewsCount { get; set; }

    [ConcurrencyCheck]
    public double ReviewsAverageVotes { get; set; }
    //... other code left out
}

Then I created a method called HandleCacheValuesConcurrency, which I registered with the GenericEventRunner on startup (see this documentation on how to do that, or the code in the example application). I’m also not going to my SaveChangesExceptionHandler method due to space, but you can find it here. What I do want to show you are two parts that handle the fixing of the AuthorOrdered string and the two Review cache values.

1. Complete Revaluation example: fixing AuthorOrdered concurrency issue

Here is the method I call from inside my HandleCacheValuesConcurrency method to handle any AuthorsOrdered concurrency issue. Its job is to work out if there was a concurrency issue with the AuthorsOrdered string, and if there is to recalculate the AuthorOrdered. Here is the code

public void CheckFixAuthorOrdered(BookWithEvents bookThatCausedConcurrency, 
    BookWithEvents bookBeingWrittenOut)
{
    var previousAuthorsOrdered = (string)_entry.Property(
        nameof(BookWithEvents.AuthorsOrdered)).OriginalValue;

    if (previousAuthorsOrdered != bookThatCausedConcurrency.AuthorsOrdered)
    {
        var allAuthorsIdsInOrder = _context.Set<BookWithEvents>()
            .Where(x => x.BookId == bookBeingWrittenOut.BookId)
            .Select(x => x.AuthorsLink.OrderBy(y => y.Order)
            .Select(y => y.AuthorId)).ToList()
            .Single();

        var namesInOrder = allAuthorsIdsInOrder.Select(x => 
            _context.Find<AuthorWithEvents>(x).Name);
        var newAuthorsOrdered = namesInOrder.FormAuthorOrderedString();

        _entry.Property(nameof(BookWithEvents.AuthorsOrdered))
             .CurrentValue = newAuthorsOrdered;

        _entry.Property(nameof(BookWithEvents.AuthorsOrdered))
             .OriginalValue = bookThatCausedConcurrency.AuthorsOrdered;
    }
}

I’m not going to explain all the lines in that code (see the actual source code, which has comments), but I do want to point out I get all the Author’s Names using the EF Core Find command (see line 16, highlighted).

I use EF Core’s Find method because works in special way: Find a) first looks for the entity you are asking for in the tracked entities in the current DbContext instance, if that fails to find anything then b) it looks in the database. I need this Find feature because I know at least one Author’s name has been updated (which kicked off the update of the AuthorsOrdered string) in this DbContext instance, but hasn’t yet be written to the database – that will only happen when the SaveChanges method is successful.

If you are using the Complete Revaluation approach then you also will need to consider whether the database has everything you need – it most likely doesn’t and you will need to look in the tarcked entities in the current DbContext instance to find the data you need to fix the concurrency issue.

2. Delta Update example: fixing the Review cache values

Here is the method I call from inside my SaveChangesExceptionHandler to handle any Review cache value concurrency issue, i.e. the ReviewsCount and/or the ReviewsAverageVotes. Its job is to work out if there was a concurrency issue these two cache values, and if there is to recalculates them. Here is the code

public void CheckFixReviewCacheValues(
    BookWithEvents bookThatCausedConcurrency, 
    BookWithEvents bookBeingWrittenOut)
{
    var previousCount = (int)_entry.Property(nameof(BookWithEvents.ReviewsCount))
        .OriginalValue;
    var previousAverageVotes = (double)_entry.Property(nameof(BookWithEvents.ReviewsAverageVotes))
        .OriginalValue;

    if (previousCount != bookThatCausedConcurrency.ReviewsCount ||
        previousAverageVotes != bookThatCausedConcurrency.ReviewsAverageVotes)
    {
        var previousTotalStars = Math.Round(previousAverageVotes * previousCount);
        var countChange = bookBeingWrittenOut.ReviewsCount - previousCount;
        var starsChange = Math.Round(bookBeingWrittenOut.ReviewsAverageVotes *
             bookBeingWrittenOut.ReviewsCount) - previousTotalStars;

        var newCount = bookThatCausedConcurrency.ReviewsCount + countChange;
        var totalStars = Math.Round(bookThatCausedConcurrency.ReviewsAverageVotes *
             bookThatCausedConcurrency.ReviewsCount) + starsChange;

        _entry.Property(nameof(BookWithEvents.ReviewsCount))
             .CurrentValue = newCount;
        _entry.Property(nameof(BookWithEvents.ReviewsAverageVotes))
             .CurrentValue = totalStars / newCount;

        _entry.Property(nameof(BookWithEvents.ReviewsCount))
             .OriginalValue = bookThatCausedConcurrency.ReviewsCount;
        _entry.Property(nameof(BookWithEvents.ReviewsAverageVotes))
             .OriginalValue = bookThatCausedConcurrency.ReviewsAverageVotes;
    }
}

Like the last concurrency method, I’m not going to explain all the lines in that code (see the source code here, which has comments), but I will talk about the differences from the AuthorsOrdered example.

The code doesn’t need to access the database as it can reverse the cache values from a) the book that caused the concurrency exception and b) the current book that was trying to update the database. From these two sources the method can a) extract the two updates and b) combine the two updates into one, which is equivalent to what would have happened if the two updates didn’t ‘collide’.

This approach follows the Delta Update approach, which allows it to fix the problem without needing recalculate the two values again by loading all the Reviews. To my mind this is quicker, which makes it less prone to getting another concurrency issue during the time you are fixing these cache values.

Weighting up the improvements against effort and added complexity

I always look at a new approach and measure its success based on the gains, in this case the performance improvement, against the effort needed to achieve that performance improvement. I also look at the complexity that this new approach adds to an application, as more complexity adds to the long-term support of the application.

Performance improvements

In terms of improving the performance this is a great result. One of the key queries that I expect users of my book app to use is sort and/or filter by votes. I run this with the first 100 books being returned. I measure the performance using the Chrome browser’s developer (F12) Network page in milliseconds running on my local PC, taking the average of about 10 consecutive accesses. For comparison the viewing of the Home page, which only has text and some Razor functions, takes about 8 ms.

The chart below shows “Sort by Votes” is about 15 times quicker and “sort/filter” version is about 8 times faster. That is a very good result.

The other thing to note is the small improvement of test 1, sort by publication date. This is due to the cached AuthorsOrdered string which removes the many-to-many join of authors names for each book.

The author’s string was a big problem in EF Core 2, with a different of 3 to 1 (230ms for EF Core 2, 80ms for the cached value version). That’s because EF Core 3 is quicker than earlier versions of EF Core as it combines the main book query with the many-to-many join of authors names. This shows than the AuthorsOrdered cached value maybe isn’t worth keeping – the extra complexity doesn’t give a good gain in performance.

Development effort

It certainly took me a lot of time, about 40 to 50 hours, to build this example application, but that includes the setup of the new approach and all its associated parts. There was also a lot of thinking/research time to find the best way though. Next time it would be quicker.

In actual fact the first usage of this approach was for one of my clients, and I keep a detailed timesheet for all of my client work. That says it took 11 hours to add Total Price cached value using a Delta Update approach to an existing database. I think (and I think the client does too) 11 hours is good price for the feature/performance gain it provided.

Added Complexity

I built a cached values version in the chapter on performance tuning in my book “Entity Framework Core in Action” (see this article I wrote that summarises that chapter). But in that case, I added the setting of the cached values into the existing business logic which made things much more complicated. This new version is much less ‘intrusive’, i.e. the cache update code is separated from the existing business logic which makes it much easier to refactor the code.

With this event-driven approach you only have to add minimal code in your classes (i.e. call an AddEvent method whenever certain events happen). Then all the complex code to update the cached values is in specific event handlers and concurrency methods. This separation makes this approach much nicer than my original version because the cache code isn’t mixed in with all of your other business logic.

Conclusion

I am very pleased with this new event-driven approach to improving database queries performance. I have done a lot of work on database performance, both for my book and for multiple clients, and this approach is the easiest so far. This new approach is fairly straightforward to apply to an existing database, and it keeps the cache update code separate from the rest of the business logic.

It took me ages to research and write this article – maybe 20 hours on top of the 40/50 hours for writing the code, which is very long for me. But I learnt a lot while looking for the best way to handle simultaneous updates of the same cache values – things like SQL transaction isolation levels, whether a single SQL command is atomic (it isn’t) and what to do inside a concurrency exception in EF Core. I feel quite confident to use this approach in a client’s application, in fact I am already using this approach with my current client to good effect.

The approach I covered isn’t super-simple, but I hope I have described it well enough that you can understand it to use this yourself. Please do look at the example code in which I added to my open-source EfCoreSqlAndCosmos project, and my open-source EfCore.GenericEventRunner, which is a key part of my design, but can also be useful in other situations too (see this article for more on that).

I gave you two ways to compute the cached values: Complete Revaluation and Delta Update. Which of these you use will depend on the type of data/query that gets the cached value. I quite like the Delta Update as its so fast (which means there is minimal performance loss on write-side of your application), but some data just doesn’t fit that way of working, especially strings.

All the best with your developer journey.