Getting better data for unit testing your EF Core applications

Last Updated: June 5, 2019 | Created: June 5, 2019

I was building an application using Entity Framework Core (EF Core) for a client and I needed to set up test data to mimic their production system. Their production system was quite complex with some data that was hard to generate, so writing code to set up the database was out of the question. I tried using JSON files to save data and restore data to a database and after a few tries I saw a pattern and decided to build a library to handle this problem.

My solution was to serialize the specific data from an existing database and store it as a JSON file. Then in each unit test that needed that data I deserialized the JSON file into classes and use EF Core to save it to the unit test’s database. And because I was copying from a production database which could have personal data in it I added a feature that can anonymize personal data so that no privacy laws, such as GDPR, were breached.

Rather than writing something specific for my client I decided to generalise this feature and add it to my open-source library EfCore.TestSupport. That way I, and you, can use this library to help you with create better test data for unit/performance tests, and my client gets a more comprehensive library at no extra cost.

TL; DR; – summary

  • The feature “Seed from Production” allows you to capture a “snapshot” of an existing (production) database into a JSON file which you can use to recreate the same data in a new database for testing your application.
  • This feature relies on an existing database containing the data you want to use. This means it only works for updates or rewrites of an existing application.
  • The “Seed from Production” is useful in the following cases.
    • For tests that need a lot of complex data that is hard to hand-code.
    • It provides much more representative data for system tests, performance tests and client demos.
  • The “Seed from Production” feature also includes an anonymization stage so that no personal data is stored/used in your application/test code.
  • The “Seed from Production” feature relies on EF Core’s excellent handing of saving classes with relationships to the database.
  • The “Seed from Production” feature is part of my open-source library TestSupport (Version 2.0.0 and above).

Setting the scene – why do I need better unit test data?

I spend a lot of time working on database access code (I wrote the book “Entity Framework Core in Action”) and I often have to work with large sets of data (one client’s example database was 1Tbyte in size!). If you use a database in a test, then the database must be in a known state before you start you test. In my experience what data the unit test’s needs break down into three cases:

  1. A small number of tests you can start with an empty database.
  2. Most tests only need few tables/rows of data in the database.
  3. A small set of tests need more complex set of data in the database.

Item 1, empty database, is easy to arrange: either delete/create a new database (slow) or build a wipe method to “clean” an existing database (faster). For item 2, few tables/rows, I usually write code to set up the database with the data I need – these are often extension methods with names like SeedDatabaseFourBooks or CreateDummyOrder. But when it comes to item 3, which needs complex data, it can be hard work to write and a pain to keep up to date when the database schema or data changes.

I have tried several approaches in the past (real database and run test in a transaction, database snapshots, seed from Excel file, or just write tests to handle unknown data setup). But EF Core’s excellent approach to saving classes with relationships to a database allows me to produce a better system.

Personally, I have found having data that looks real makes a difference when I am testing at every level. Yes, I can test with the book entitled “Book1” with author “Author1”, but having a book named “Domain-Driven Design” with author “Eric Evans” is easier to spot errors in. Therefore, I work to produce “real looking” data when I am testing or demonstrating an application to a client.

One obvious limitation of the “Seed from Production” approach is you do need an existing database that contains data you can copy from! Therefore, this works well when you are updating or extending an existing application. However, I have also found this useful when building a new application as the development will (should!) soon produce pre-production data that you can use.

NOTE: Some people would say you shouldn’t be accessing a database as part of a unit test, as that is an integration test. I understand their view and in some business logic I do replace the database access layer with an interface (see this section in my article about business logic). However, I am focused on building things quickly and I find using a real database makes it easier/quicker to write tests (especially if you can use an in-memory database) which means my unit test also checks that my database relationships/constraints work too.

How my “seed from production” system works

When I generalised the “seed from production” system I listed what I needed.

  • A way to read data from another database and store in a file. That way the “snapshot” data becomes part of your unit test code and is cover by source control.
  • The data may come from a production database that contains personal data. I need a way to anonymise that data before its saved to a file.
  • A way to take the stored “snapshot” and write it back out to a new database for use in tests.
  • I also needed the option to alter the “snapshot” data before it was written to the database for cases where a particular unit test needed a property(s) set to a specific value.
  • Finally, I need a system that made updating the “snapshot” data easy, as the database schema/data is likely to change often.

My “Seed from Production” feature handles all these requirements by splitting the process into two parts: an extract part, which is done once, and the seed part, which runs before each test to setup the database. The steps are:

  1. Extract part – only run if database changes.
    1. You write the code to read the data you need from the production database data.
    2. The DataResetter then:
      1. Resets the primary keys and foreign keys to default values so that EF Core will create new versions in the test database.
      2. Optionally you can anonymise specific properties that need privacy, e.g. you may want to anonymise all the names, emails, addresses etc.
    3. Converts the classes to a JSON string.
    4. Saves this JSON string to a file, typically in you unit test project.
  2. Seed database part – run at start of every unit test
    1. You need to provide an empty database for the unit test.
    2. The DataResetter reads the JSON file back into classes mapped to the database.
    3. Optionally you can “tweak” any specific data in the classes that your unit test need
    4. Then you add the data to the database and call SaveChanges.

That might sound quite complicated but most of that is done for you by the library methods. The diagram below shows the parts of the two stages to make it clearer – the parts shown as orange are the parts you need to write while the parts in blue are provided by the library.

Show me the code!

This will all make more sense when you see the code, so in the next subsections I show you the various usage of the “Seed from Production” library. They are:

  1. Extract without no anonymisation.
  2. Extract showing an example of anonymisation.
  3. Seed a unit test database, with optional update of the data.
  4. Seed database when using DDD-styled entity classes.

In all the examples I use my book app database which I used the book I wrote for Manning, “Entity Framework Core in Action”. The book app “sells” books and therefore the database contains books, with authors, reviews and possible price promotions – you can see this in the DataLayer/BookApp folder of my EfCore.TestSupport GitHub project.

NOTE: You can see a live version of the book app at http://efcoreinaction.com/

1. Extract without no anonymisation

I start with extracting data from a database stage, which only needs to be run when the database schema or data has changed. To make it simple to run I make it a unit test, but I define that test in such a way that it only runs in debug mode (that stops it being run when you run all the tests).

[RunnableInDebugOnly(DisplayName = "Needs database XYZ")]
public void ExampleExtract()
{
    var sqlSetup = new SqlServerProductionSetup<BookContext>
        ("ExampleDatabaseConnectionName");
    using (var context = new BookContext(sqlSetup.Options))
    {
        //1a. Read in the data to want to seed the database with
        var entities = context.Books
            .Include(x => x.Reviews)
            .Include(x => x.Promotion)
            .Include(x => x.AuthorsLink)
                .ThenInclude(x => x.Author)
            .ToList();

        //1b. Reset primary and foreign keys
        var resetter = new DataResetter(context);
        resetter.ResetKeysEntityAndRelationships(entities);

        //1c. Convert to JSON string
        var jsonString = entities.DefaultSerializeToJson();
        //1d. Save to JSON local file in TestData directory
        sqlSetup.DatabaseName.WriteJsonToJsonFile(jsonString);
    }
}

The things to note are:

  • Line 1: I use the RunnableInDebugOnly attribute (available in my EfCore.TestSupport library) to stop the unit test being run in a normal run of the unit tests. This method only needs to be run if the database scheme or data changes.
  • Line 4: the SqlServerProductionSetup class takes the name of a connection in the appsetting.json file and sets up the options for the given DbContext so that you can open it.
  • Line 9 to 14: Here I read in all the books with all their relationships that I want to save.
  • Lines 17 and 18: In this case the Resetter resets the primary keys and foreign keys to their default value. You need to do this to ensure EF Core works out the relationships via the navigational properties and creates new rows for all the data.
  • Line 21: This uses a default setting for Newtonsoft.Json’s SerializeObject method. This works in most cases, but you can write your own if you need different settings.
  • Line 23: Finally, it writes the data in to a file in the TestData folder of your unit tests. You can supply any unique string which is used as part of the file name – typically I use the name of the database it came from, which the SqlServerProductionSetup class provides.

2. Extract showing an example of using anonymisation

As I said before you might need to anonymise names, emails, addresses etc. that were in your production database. The DataResetter has a simple, but powerful system that allows you to define a series of properties/classes that need anonymising.

You define a class and a property in that class to anonymise and the DataResetter will traverse the whole sequence of relationships and will reset every instance of the class+property. As you will see you can define lots of classes/properties to be anonymised.

The default anonymisation method uses GUIDs as strings, so the name “Eric Evans” would be replaced with something like “2c7211292f2341068305309ff6783764”. That’s fine but it’s not that friendly if you want to do a demo or testing in general. That is why I provide a way to replace the default anonymisation method, which I show in the example (but you don’t have to if GUID strings are OK for you).

The code below is an example of what you can do by using an external library to provide random names and places. In this implementation I use the DotNetRandomNameGenerator NuGet package and create a few different formats you can call for, such as FirstName, LastName, FullName etc.

public class MyAnonymiser
{
    readonly PersonNameGenerator _pGenerator;

    public MyAnonymiser(int? seed = null)
    {
        var random = seed == null ? new Random() : new Random((int)seed);
        _pGenerator = new PersonNameGenerator(random);
    }

    public string AnonymiseThis(AnonymiserData data, object objectInstance)
    {
        switch (data.ReplacementType)
        {
            case "FullName": return _pGenerator.GenerateRandomFirstAndLastName();
            case "FirstName": return _pGenerator.GenerateRandomFirstName();
            case "LastName": return _pGenerator.GenerateRandomLastName();
            case "Email": //… etc. Add more versions as needed
            default: return _pGenerator.GenerateRandomFirstAndLastName();
        }
    }
}

The things to note are:

  • Lines 5 to 9: I designed my anonymiser to take an optional number to control the output. If no number is given, then the sequence of names has a random start (i.e. it produces different names each time it is run). If a number is given, then you get the same random sequence of names every time. Useful if you want to check properties in your unit test.

NOTE: See the Seed from Production Anonymization documentation link on the AnonymiserFunc and its features. There are several pieces of information I have not described here.

The code below shows an extract method, which is very similar to the first version, but with some extra code to a) link in the MyAnonymiser, and b) defines the class+property that needs to be anonymised.

[RunnableInDebugOnly(DisplayName = "Needs database XYZ")]
public void ExampleExtractWithAnonymiseLinkToLibrary()
{
    var sqlSetup = new SqlServerProductionSetup<BookContext>
        ("ExampleDatabaseConnectionName");
    using (var context = new BookContext(sqlSetup.Options))
    {
        //1a. Read in the data to want to seed the database with
        var entities = context.Books
            .Include(x => x.Reviews)
            .Include(x => x.Promotion)
            .Include(x => x.AuthorsLink)
                .ThenInclude(x => x.Author)
            .ToList();

        //1b-ii. Set up resetter config to use own method
        var myAnonymiser = new MyAnonymiser(42);
        var config = new DataResetterConfig
        {
            AnonymiserFunc = myAnonymiser.AnonymiseThis
        };
        //1b-ii. Add all class/properties that you want to anonymise
        config.AddToAnonymiseList<Author>(x => x.Name, "FullName");
        config.AddToAnonymiseList<Review>(x => x.VoterName, "FirstName");
        //1b. Reset primary and foreign keys and anonymise given class+property
        var resetter = new DataResetter(context, config);
        resetter.ResetKeysEntityAndRelationships(entities);

        //1c. Convert to JSON string
        var jsonString = entities.DefaultSerializeToJson();
        //1d. Save to JSON local file in TestData directory
        "ExampleDatabaseAnonymised".WriteJsonToJsonFile(jsonString);
    }
} 

The things to note are:

  • Line 17: I create MyAnonymiser, and in this case I provide a see number. This means the same sequence of random names will be created whenever the extract is run. This can be useful if you access the anonymised properties in your unit test.
  • Lines 18 to 21: I override the default AnonymiserFunc by creating a DataResetterConfig class and setting the AnonymiserFunc properly to my replacement AnonymiserFunc from my MyAnonymiser class.
  • Lines 23 and 24: I add two class+property items that should be anonymised via the AddToAnonymiseList<T> method in the DataResetterConfig instance. As you can see you can provide a string that defines what type of replacement you want. In this case the Author’s Name needs a full name, e.g. “Jon Smith” and the Review’s VoterName just needs a first name, e.g. Jane.
  • Line 26: The creation of the DataResetter now has a second parameter with the DataResetterConfig instance with the new AnonymiserFunc.

All the rest of the method is the same.

NOTE: The ResetKeysEntityAndRelationships method follows all the navigational links so every instance of the given class+property that is linked to the root class will be reset. It also uses Reflection, so it can anonymise properties which have private setters.

3. Seed a unit test database from the JSON file

Now I show you a typical unit test where I seed the database from the data stored in the JSON file. In this case using a Sqlite in-memory, which is very fast to setup and run (see my article “Using in-memory databases for unit testing EF Core applications” for when and how you can use this type of database for unit testing).

[Fact]
public void ExampleSeedDatabase()
{
    //SETUP
    var options = SqliteInMemory.CreateOptions<BookContext>();
    using (var context = new BookContext(options))
    {
        //2a. make sure you have an empty database
        context.Database.EnsureCreated();
        //2b. read the entities back from the JSON file
        var entities = "ExampleDatabase".ReadSeedDataFromJsonFile<List<Book>>();
        //2c. Optionally “tweak” any specific data in the classes that your unit test needs
        entities.First().Title = "new title";
        //2d. Add the data to the database and save
        context.AddRange(entities);
        context.SaveChanges();

        //ATTEMPT
        //... run your tests here

        //VERIFY 
        //... verify your tests worked here
    }
}

The things to note are:

  • Line 9. In this case I am using an in-memory database, so it will be empty. If you are using a real database, then normally clear the database before you start so that the unit tests as a “known” starting point. (Note that you don’t have to clear the database for the seed stage to work – it will just keep adding a new copy of the snapshot every time, but your unit test database will grow over time.)
  • Line 11: my ReadSeedDataFromJsonFile extension method reads the json file with the reference “ExampleDatabase” (which was the name of the database that was imported – see extract code) and uses Newtonsoft.Json’s DeserializeObject method to turn the JSON back into entity classes with relationships.
  • Line 13: Optionally you might need to tweak the data specifically for the test you are going to run. That’s easy as you have access to the classes at this stage.
  • Line 15. You use Add, or AddRange if it’s a collection, to Add the new classes to the database.
  • Line 16. The last step is to call SaveChanges to get all the entities and relationships created in the database. EF Core will follow all the navigational links, like Reviews, to work out what is linked to what and set up the primary keys/foreign keys as required.
  • Lines 19 onward. This is where your tests and asserts go in your unit test.

4. Seed database when using DDD-styled entity classes

If you have read any of my other articles you will know I am a great fan of DDD-styled entity classes (see my article “Creating Domain-Driven Design entity classes with Entity Framework Core” for more about this). So, of course, I wanted the Seed from Production feature to work with DDD-styled classes, which it does now, but you do need to be careful so here are some notes about things.

Problem occur if Newtonsoft.Json can’t find a way to set a property at serialization time. This fooled me for quite a while (see the issue I raised on Newtonsoft.Json GitHub). The solution I came up with was adding a setting to the serialization (and deserialization) that tells Newtonsoft.Json that it can set via private setters. This works for me (including private fields mapped to IEnumerable), but in case you have a more complex state there are other ways to set this up. The most useful is to create a private constructor with parameters that match the properties by type and name, and then place a [JsonConstructor] attribute on that constructor (there are other ways too – look at Newtonsoft.Json docs).

NOTE: The symptoms of Newtonsoft.Json failing to serialize because it can’t access a property aren’t that obvious. In one case Newtonsoft.Json threw an unexplained “Self referencing loop detected” exception. And when I changed the JsonSerializerSettings to ignore self-referencing loops it incorrectly serialized the data by adding a duplicate (!). You can see the gory details in the issue I raised on Newtonsoft.Json.

Aside – how does this seeding work?

NOTE: This is an optional section – I thought you might like to learn something about EF Core and how it handles classes with relationships (known as “navigational links” in EF Core).

You may be wondering how this seeding feature works – basically it relies on some magic that EF Core performs.

  1. Firstly, EF Core must work out what State a class is in, e.g. is it an Added (new) entry or an existing (tracked) entry, which tells it whether it needs to create a new entry or just refer to an existing entry in the database. This means EF Core sets the state of the class instance coming from the json as Added and will write them out tot the database.
  2. The second part is how EF Core works out the relationships between each class instance. Because the ResetKeysEntityAndRelationships method reset the foreign keys (and the primary keys) EF Core relies on the navigational properties to work out the relationships between each class instance.

These two things mean that the database is updated with the correct data and foreign keys, even if the relationships are complex. This is feature makes EF Core so nice to work with not just in this feature but for any adding or updating of linked classes.

Here is a simple example taking from one of my talks with the code and two diagrams showing you the before and after. In this case I create a new book, with a new many-to-many BookAuthor to an existing Author.

var existingAuthor = context.Authors.First();
var newBook = new Book { Title = “New Book”};
newBook.AuthorLinks = new List<BookAuthor>
{ 
     new BookAuthor
     { 
          Book = newBook, 
          Author = existingAuthor, 
          Order = 1
     };
}

After that the classes look like this: note red classed are new, while blue have been read from the database (i.e. are tracked by EF Core).

Then we save this to the database with the following code.

context.Add(newBook);
context.SaveChanges();

After that the classes would look like this, with all the primary and foreign keys filled in and all navigational links set up.

EF Core has done the following to make this work:

  1. During the Add stage is sets up all the navigation links and copies the primary keys of existing instances into the correct foreign keys (in this case the Author’s existing primary key into the AuthorId foreign key in the BookAuthor class).
  2. In the SaveChanges part it does the following within a transaction
    1. It inserts a new row for this Book, which sets its primary key
    2. It then copies the Book’s primary key into the BookId foreign key in the BookAuthor class.
    3. It then inserts a new row for the BookAuthor class.

This makes handling linked classes so simple in EF Core.

Conclusion

As I said earlier I had tried over the years different ways to set up complex databases, both with ADO.NET and EF6.x. EF Core has a number of features (good access to the database schema and better handling of adding linked classes to a database) which now make this much easier to implement a generic solution to this problem.

For my client the “Seed from Production” feature works really well. Their database contains data that is hard to create manually, and a pain to keep up to date as the application grows. By copying a database set up by the existing application we captured the data to use in our unit tests and some performance tests too. Also, the test data becomes part of the test code and therefore covered by the standard source control system. Another bonus is it makes it simple to run any tests in a DevOps pipeline as the test databases as it can be created and seeded automatically, which saves use from having to have specific database available in the DevOps path.

You won’t need this feature much, as most unit tests should use very basic data, but for those complex systems where setting up the database is complicated then this library can save you (and me) lots of time. And with anonymisation stage happening before the json file is created you don’t have to worry about having personal data in your unit tests.

Happy coding!