Data Structures II


  1. Introduction

    In Part I of this series of articles, I wrote about the importance of data structures in games (and all programming), and discussed Big O notation and generics. With that out of the way, in this part I can get to the meat of the article: the actual data structures.

  2. The Array

    So, with that, let’s look at the simplest data structure: the array. Arrays are built into almost every single programming language, and C# is no exception. You can think of an array as a group of multiple instances of some data type. Arrays are fixed size, and can hold only one kind of data. For example, an array of five integers can hold only integers, and it can hold only five of them. In C#, you create arrays by using the [] syntax:

    int[] myArray = new int[5];

    The myArray array is declared and is set to a five-integer array. You can then access and modify the elements of this array by using an index. The index is zero based, so to access the first element of myArray, use the index zero. For example,

    int theFirst = myArray[0];
    int theLast = myArray[4];
    myArray[0] myArray[1] myArray[2] myArray[3] myArray[4]

    Arrays are attractive mainly because of their simplicity. Whereas most other data structures incur additional processing overhead, arrays do not. They are also very memory efficient data structures. The five-integer array takes up the same amount of space in memory as five separate integers would, with a few extra bytes for CLR bookkeeping. Since the elements are accessed by index, accesses and modifications are fast, taking constant time, O(1). This is true whether the accesses are in order or random order: in other words, it is just as fast to access veryBigArray[0] as it is veryBigArray[1000]. This will not be the case for some other data structures.

    The major drawback to arrays is that they are fixed size. This causes problems if you want to add new objects to your array. For example, if you have an array of ten integers, and you want to add an eleventh, you cannot simply resize the array. You must instead create a new array of the proper size, and then copy all of the elements of the old array to the new one. This copy takes linear time, O(n), so it can become very expensive, depending on the size of the old array. Other data structures can resize much more efficiently. Inserting items into the middle of an array is similarly slow: again, you must create a new array, and then copy the elements in the old array to the new one. Removing elements from the array presents the same problem—you must create a new array, and then copy the remaining elements, one by one.

    Given these advantages and drawbacks, you can see that arrays are best used when storing data of a fixed size. A good example might be a tic-tac-toe grid, or a chess board: it is unlikely that a chess board would need to double in size halfway through the game.

  3. Lists

    The .NET Framework provides System.Collections.Generic.List, a data structure that has characteristics that are similar to that of the array and adds functionality for resizing.

    System.Collections.Generic.List is a generic type, so to create one we use that crazy syntax described earlier in the article. For example, the following code can be used to create an empty list of Color types:

    List<Color> colors = new List<Color>();

    To add more colors to the list, simply use the Add function.

    colors.Add(Color.AliceBlue);
    colors.Add(Color.AntiqueWhite);
    colors.Add(Color.BlanchedAlmond);

    Notice that we didn’t need to know how many colors were going to be in the list up front. This is the biggest difference between List and Array, and is a useful distinction. Elements in the list are still accessed using the array brackets, just like arrays:

    Color first = colors[0];
    Color last = colors[2];

    Internally, List is implemented by using a private array. The array is more than large enough to hold all of the contents of the list. If more and more elements are added to the list, and the internal array can no longer hold all of them, a new, bigger array is created, and the elements are copied to that one. Using this knowledge of the internal implementation of List, we can tell some things about its performance characteristics.

    From the previous section, we know that arrays can quickly access and modify their elements, taking constant O(1) time. List does nothing to modify this behavior, so in general, List will be equally fast. (There will be a small cost associated with the additional bounds checking that List requires, but this should be negligible.) The big performance benefit that List has over Array is the dynamic sizing. Remember that since arrays cannot be resized, every time you need to add a new element, you must copy the entire contents of the collection to a new array. This copy will take linear time. However, since List's internal array typically has some free space, when calling Add you can simply put the new element on the end, in the first free space. A copy is necessary only when the internal array is no longer large enough to hold all of the elements. This makes the general case for Add much faster, taking constant time. Removing from the end of the List is also a constant time operation, for similar reasons: the List can simply "forget" about the removed element, and leave the array alone.

    One thing you can do to improve the performance of Add is to pre-size the internal array. A common pattern with a List is to initialize it with some number of default objects. For example, the first thing you might do after creating a new list of enemies is to immediately enter a for loop and fill that list with 100 bad guys. There's obviously room for optimization here: If you know at creation time how big your list will need to be, why not use that information? If you know how many enemies you have, you can automatically size the list's internal array to the appropriate size, avoiding all the unnecessary array creations and copies. To that end, List has a constructor that takes an integer argument, n, which will be the initial size of the internal array.

    So, List does a lot to improve the speed of adding and removing items at the end of the list. It is important to note that this benefit is true only at the end of the list. Inserting and removing at the middle is not much improved. Consider the best case scenario when inserting into the middle of a list. There is enough space to hold the new element, which you want to be at the index i. To insert the new item, you must make room for it by taking all of the elements at indices greater than and equal to i and shifting them to the right. If i is near the end of the list, there are not many items that must be copied, but as i approaches 0, performance deteriorates. In general, inserting into a List, like Array, takes linear time, O(n). Again, removing from a List has similar performance characteristics to inserting, because in both cases you must shift the items, which takes linear time. So in general, lists might be slightly better suited to insert operations than arrays are, but they are still not the ideal data structure for situations that will require a lot of inserting or removing.

  4. Collections

    The Collection class is another data structure provided by the .NET Framework. Internally, it's implemented as a List, so its performance characteristics will be similar. From a performance and functionality point of view, it does not differ significantly from a List.

    Why does Collection exist, then? If you take a close look at the functions exposed by List, you'll see that there are no virtual functions. There is a performance penalty associated with virtual functions, and since List is designed to be as fast as possible, the designers decided that none of List's functions should be virtual. This has the effect of making it impossible to derive from List in a useful way. The Collection class exists to fill this gap: it is designed to be easy to derive from. This makes Collection a good data structure to expose in APIs or libraries.

    Say, for example, you've written a 2D sprite engine that you want to share with your friends. One of your functions, called GetCollisions, returns a List of sprites that are colliding with one another. If, for some reason, you want to change the way that List works, you're stuck. Since a List is not usefully derivable, you would have to change GetCollisions to return something other than a List. When you distribute the next version of your library, your friends' code wouldn't compile anymore. Since the Collection class is easy to derive from, this would have been a better choice. If GetCollisions had returned a Collection instead, you could have simply written a new class, MyCollection, that inherits from Collection. You could then change GetCollisions internally to use MyCollection, and your friends could continue to deal with it as if it were a Collection, oblivious to the tweaks you had made under the hood.

    In some cases, this flexibility could be very useful. In your internal code, however, List's speed will probably make it a better choice.

  5. Conclusion

    The key things to take away from this part of the article is that arrays are lightweight and simple, and that lists are more flexible, but still fast. Collections aren’t optimized for speed, but are still useful as base classes to your own complex data structures. In Part III of this article, we continue the discussion of different data structures, and cover linked lists, stacks, and queues.