Deduplication Failures with SQLAlchemy collection_class

July 6, 2016
python sqlalchemy development

As discussed in Failsafing Multitenancy in SQLAlchemy, we use a custom collection_class as a failsafe against cross-tenant data leakage in our environment. This was working very well until we discovered a bug that sometimes caused objects to be dropped incorrectly. I.e., if we had a Customer with two addresses, one for merchant_1 and one for merchant_2, a request against merchant_1 would always correctly remove the address for merchant_2, but would sometimes also remove the address for merchant_1. You can see how this is a terrible thing.

After we determined that this was indeed happening, we attempted to write a test that could reproduce it. Unfortunately we were never able to get a test that reproduced it 100% of the time, but the effort did help us narrow the scope of the problem. From this we learned:

  • Objects were only inappropriately dropped when other objects were correctly dropped. I.e., if there was no data to be rightfully excluded, everything worked as expected.
  • Objects were only inappropriately dropped when the query involved eager-loading 1-n relationships.
  • Even when we were able to get a test which sometimes reproduced the issue, it was not deterministic, and would fail/pass seemingly at random.

Before we dig into this too much, we have to understand how SQLAlchemy (and other ORMs) eager load child attributes. Let’s say we have the following query, which fetches Customer objects and eager loads their addresses:

customers = session.query(
  Customer
).options(
  sqlalchemy.orm.joinedload('addresses')
).all()

This will generate (basically) the following SQL:

SELECT
  customer.id,
  customer.name,
  address.id,
  address.email,
  address.customer_id,
  address.merchant_id
FROM customer
  JOIN address ON customer.id = address.customer_id

which returns a set of rows like:

1, "John", 20, "john@test.com", 1, 10
1, "John", 21, "john2@test.com", 1, 11
1, "John", 22, "john3@test.com", 1, 10

Which should be deduplicated to a model (as json):

{
  "id": 1,
  "name": "John",
  "addresses": [
    {
      "id": 20,
      "email": "john@test.com",
      "customer_id": 1,
      "merchant_id": 10
    },
    {
      "id": 21,
      "email": "john2@test.com",
      "customer_id": 1,
      "merchant_id": 11
    },
    {
      "id": 22,
      "email": "john3@test.com",
      "customer_id": 1,
      "merchant_id": 10
    }
  ]
}

When the result is loaded, SQLAlchemy needs to build objects from the returned tuples and dedupe the models as they are added. In this case it builds the customer, attempts to add it to the list of customers, then builds each address and attempts to add it to that customer’s list of addresses. I say ‘attempts to add’ because the machinery populating the InstrumentedList that SQLAlchemy uses by default performs duplicate detection, to avoid adding the same model repeatedly from the joined rows; e.g. the Customer model in the example above is present in all three rows but must appear only once.
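As a rough sketch, the folding the loader performs looks something like this (hugely simplified, and keyed on primary keys here for readability; as discussed below, SQLAlchemy actually keys its deduplication on object identity):

```python
def fold_rows(rows):
  """Fold joined customer/address rows into nested dicts,
  deduplicating the repeated customer and any repeated addresses."""
  customers = {}
  for cust_id, name, addr_id, email, customer_id, merchant_id in rows:
    # First time we see this customer id, create the model;
    # subsequent rows reuse it (this is the deduplication step)
    customer = customers.setdefault(
      cust_id, {"id": cust_id, "name": name, "addresses": []}
    )
    # Same idea for the child collection
    if all(a["id"] != addr_id for a in customer["addresses"]):
      customer["addresses"].append({
        "id": addr_id,
        "email": email,
        "customer_id": customer_id,
        "merchant_id": merchant_id,
      })
  return list(customers.values())


rows = [
  (1, "John", 20, "john@test.com", 1, 10),
  (1, "John", 21, "john2@test.com", 1, 11),
  (1, "John", 22, "john3@test.com", 1, 10),
]
customers = fold_rows(rows)
```

Running this over the three rows above yields one customer holding all three addresses.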

Tracing through the SQLAlchemy internals (PyCharm’s debugger was great for this) I learned that SQLAlchemy does this deduping using a ‘unique’ identifier for each object, which it delegates to a binding that effectively just returns the object’s memory address. This works really well until a custom collection class drops objects: once a dropped object is garbage collected, a later allocation may land at the same address. The second model, which is not a duplicate, then carries the same identifier as something that was already seen as a duplicate, and so it gets dropped too.
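The address reuse is easy to demonstrate in CPython: allocate and immediately discard objects of the same type, and the allocator will frequently hand back a just-freed block, so `id()` values repeat even though every instance is a distinct, non-duplicate object:

```python
class Model(object):
  pass

# Each Model() is discarded as soon as id() returns, so CPython's
# allocator is free to reuse the same memory block for the next one.
# On CPython the set of ids is typically far smaller than 100.
ids = [id(Model()) for _ in range(100)]
```

This is exactly the nondeterminism we saw in our tests: whether a ‘new’ object collides with a collected one depends on allocation order and when collection happens.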

The Solution

At this point we had two options:

  1. Modify SQLAlchemy to dedupe in a way compatible with our collection_class
  2. Prevent the ‘dropped’ objects from being garbage collected.

Option 1 would have been overly difficult, whereas option 2 can be done with a simple reference, so we opted for option 2. We added a ‘garbage’ list onto the FilteredList, which simply holds references to the dropped objects so that they are never garbage collected. Some basic back-of-the-envelope math gave us confidence that the extra memory would not be an issue; we would hit CPU limits building the models long before memory became a problem.

Our new FilteredList now looks like this:

## filteredlist.py

from sqlalchemy.orm.collections import InstrumentedList

class FilteredList(InstrumentedList):
  """
  List-like collection allowing us to prevent cross account
  data leakage from centralized resources. Works by
  preventing non-matching objects from being added to
  backref containers.

  e.g:
  attr_name = u'merchant_id'
  attr_val = 1
  Will prevent anything with a `merchant_id` attribute
  whose value does not equal `1` from being added to a
  FilteredList instance

  Use by setting the `collection_class` attribute on
  SQLAlchemy backrefs.

  """

  # Attributes
  attr_name = None
  attr_val = None
  garbage = []

  @classmethod
  def enable_filter(cls, filter_name, filter_val):
    """
    'Activate' the filtering using the provided attribute
    name and value.
    """
    cls.attr_name = filter_name
    cls.attr_val = filter_val
    cls.garbage = []

  @classmethod
  def reset_filter(cls):
    """
    'Deactivate' the filtering.
    """
    cls.attr_name, cls.attr_val, cls.garbage = None, None, []

  def append(self, item):
    """
    Append iff:
      - the filters are not set, or
      - the object matches the filters
    """

    if FilteredList.attr_name and FilteredList.attr_val:

      if hasattr(item, FilteredList.attr_name):
        object_val = getattr(item, FilteredList.attr_name)
        if (object_val is not None and
            object_val != FilteredList.attr_val):
          # Prevent actually appending, effectively dropping
          # the non-matching object

          self.garbage.append(item)
          return

    super(FilteredList, self).append(item)

While this is clearly not the cleanest approach, it was able to solve our problem and let us get back to other things.
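To show the intended behaviour end to end, here is a small standalone exercise of the same filtering logic. It substitutes a plain `list` for InstrumentedList so it runs without SQLAlchemy, and the attribute names and values are purely illustrative:

```python
class FilteredList(list):
  """Stand-in for the real class above, inheriting from `list`
  instead of `InstrumentedList` so it runs without SQLAlchemy."""

  attr_name = None
  attr_val = None
  garbage = []

  @classmethod
  def enable_filter(cls, filter_name, filter_val):
    cls.attr_name = filter_name
    cls.attr_val = filter_val
    cls.garbage = []

  def append(self, item):
    if FilteredList.attr_name and FilteredList.attr_val:
      if hasattr(item, FilteredList.attr_name):
        object_val = getattr(item, FilteredList.attr_name)
        if (object_val is not None and
            object_val != FilteredList.attr_val):
          # Hold a reference so the dropped object is never
          # garbage collected and its id() is never reused
          self.garbage.append(item)
          return
    super(FilteredList, self).append(item)


class Address(object):
  def __init__(self, merchant_id):
    self.merchant_id = merchant_id


FilteredList.enable_filter('merchant_id', 10)
addresses = FilteredList()
addresses.append(Address(merchant_id=10))  # matches the filter: kept
addresses.append(Address(merchant_id=11))  # mismatch: dropped, held in garbage
```

The non-matching address never makes it into the collection, but the reference in `garbage` keeps its memory address out of circulation for the lifetime of the request.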