A Billion Dollar Go Mistake

ยท

11 min read

Featured on Hashnode

This post is about a real problem I faced in my project. I merged two blog post into one so this is a bit longer post, please read till the end. Thank you!

I hope after reading this post you will be able to avoid the same mistake we did in our project ๐Ÿ˜…. This small mistake didnโ€™t cost us a billion dollars. But, It did cost us a few thousand, and the Prometheus alerts saved us.

But, it could cost you a Billion if you are not careful ๐Ÿ’ธ. I learn my lessons through mistakes, this is something that I will never forget in future.

The scenario is a simple database transaction. We all have used database transactions at least once in our programming life. If you donโ€™t know how transactions work you could read the docs here for Postgres.

Our service handles the management of some Subscriptions in the database. All the changes to the rows happen within a transaction. The service is in Golang and the flow is like this::

  • Start the transaction

  • Fetch the record by ID

  • Confirm if the operation for the change is possible

  • If Yes, then update the record

  • Commit the transaction

  • For any error, revert the transaction

To fetch a record and acquire a lock on the record so that other flows would wait to update, we use SELECT ... FOR UPDATE query to fetch the record from Postgres and lock it.

Note: We use sqlx lib for database access.

The repository code is like this:

func (r *fetcher) GetSubscription(tx *sqlx.Tx, id uuid.UUID) (*model.Subscription, error) {
    var subscription model.Subscription
    err := tx.Get(&subscription, `
        SELECT * FROM subscriptions
        WHERE id = $1
        FOR UPDATE
    `, id)
    if err != nil {
        return nil, err
    }

    return &subscription, nil
}

The service code is like this:

func (s *service) CancelSubscription(ctx context.Context, id uuid.UUID) (*model.Subscription, error) {
    tx, err := s.db.BeginTxx(ctx, nil)
    if err != nil {
        return nil, err
    }

    defer func() {
        if err != nil {
            tx.Rollback()
        }
    }()

    subscription, err := s.fetcher.GetSubscription(tx, id)
    if err != nil {
        return nil, err
    }

    if subscription.CancelledAt != nil {
        return subscription, nil
    }

    subscription.CancelledAt = time.Now()

    err = s.updater.UpdateSubscription(tx, subscription)
    if err != nil {
        return nil, err
    }

    err = tx.Commit()
    if err != nil {
        return nil, err
    }

    return subscription, nil
}

Problem

All client requests were timing out, and the DB connection spiked over 1.2k connections ๐Ÿ˜…

Not a single request was able to complete the operation. It was a stop-the-world event ๐Ÿคฃ

Why?

The issue happens when we try to cancel a subscription that is already cancelled. If the subscription is already cancelled we return the subscription without doing any changes, but we are not releasing the lock on the record.

The reason is defer function calls tx.Rollback() only when there is an error. This would cause the lock to be active until the transaction commits or roll back. But since we are not doing any of the two things, the lock is held until the transaction times out.

If the service is having high traffic, the transactions could not timeout. This will cause the code to create new connections to the database and exhaust the connection pool.

Fix

  1. Release the lock in every if condition.
// Error handling is omitted for brevity
if subscription.CancelledAt != nil {
    _ = tx.Rollback() // release the lock

    return subscription, nil
}

This is the simplest way to fix the issue. But this would need you to roll back in every if condition. And if you forget to roll back in any of the if conditions, you will have the same issue.

2. Rollback the transaction in the defer function for every case.

// Error handling is omitted for brevity
defer func() {
   _ = tx.Rollback()
}()

This would release the lock in every case. For committed transactions and as per the tracing, we see some milliseconds the code spends to roll back the transactions. In the case of committed transactions, the rollback does nothing. But this is a better way to fix the issue.

3. Commit the transaction in the defer function for every case.

defer func() {
  if err != nil {
    _ = tx.Rollback()
  }

  commitErr := tx.Commit()
  // handle commitErr
}

If there are no changes the transaction will commit without any change. If there is any error only then the rollback would happen and will release the lock.

But, this affects the readability of the code. Your commit is happening in the defer function, and that's an extra pointer you have to keep in mind while reading the code.

Service Layer โ€” Return Error ๐Ÿ‘พ

The problem with the service is that for validations, it is not returning any error. The other way to fix this issue is to return an error from the if condition. This would make the Rollback happen in defer function.

This is a very clean way. A service should have only one responsibility. Either it completes the operation or it returns an error. It should return an error in case of validation failure. This is a good practice and something we missed in our project. We fixed the issue later by returning an error from the if condition.

This change also helps the handler to decide what HTTP status code to return. Which is really helpful as we can return 400 Bad Request for validation errors.

This is how the code would look after refactoring

var ErrSubscriptionAlreadyCancelled = errors.New("subscription already cancelled")

func (s *service) CancelSubscription(ctx context.Context, id uuid.UUID) (*model.Subscription, error) {
    tx, err := s.db.BeginTxx(ctx, nil)
    if err != nil {
        return nil, err
    }

    defer func() {
        if err != nil {
            tx.Rollback()
        }
    }()

    subscription, err := s.fetcher.GetSubscription(tx, id)
    if err != nil {
        return nil, err
    }
    if subscription.CancelledAt != nil {
        return subscription, ErrSubscriptionAlreadyCancelled
    }

    subscription.CancelledAt = time.Now()
    err = s.updater.UpdateSubscription(tx, subscription)
    if err != nil {
        return nil, err
    }
    err = tx.Commit()
    if err != nil {
        return nil, err
    }
    return subscription, nil
}

My Take

My personal preference is the second option. But you could choose any of the three options based on your preference. I am choosing the second option because I am sure that whatever happens in the flow at the end, the transaction revert will happen. Yes, it would cost me a few milliseconds in case of a committed transaction, but compared to the loss to the business, itโ€™s worth it.

Keep in mind that the service layer has only one responsibility. Either it completes the operation or it returns an error.

You may be thinking that a better approach is to have an abstraction that takes care of transactions. I agree, and would also have a separate post on that. But, I also think that we can avoid the complexity of abstraction if the code is stable and not changes too often.

The golang official post for transactions also supports the reasoning. Check the code snippet here. The office post also uses the second option.

// code snippet from the golang official post
func CreateOrder(ctx context.Context, albumID, quantity, custID int) (orderID int64, err error) {
// Create a helper function for preparing failure results.
  fail := func (err error) (int64, error) {
    return fmt.Errorf("CreateOrder: %v", err)
  }
  // Get a Tx for making transaction requests.
  tx, err := db.BeginTx(ctx, nil)
  if err != nil {
    return fail(err)
  }
  // Defer a rollback in case anything fails.
  defer tx.Rollback()

  ... // other code is omitted for brevity

  // Commit the transaction.
  if err = tx.Commit(); err != nil {
    return fail(err)
  }

  // Return the order ID.
  return orderID, nil
}

There are two more ways to fix the problem. This will also provide one extra layer of safety. In my opinion, having multiple safety gates is always better.

If one fails there is another one to safeguard with any issues. The business-critical services with an extra layer will protect against any monetary loss.

Excited to know more?

I do this kind of dance when I find multiple ways to solve any problem :D

Using a context

We could use a cancel context to release the lock at the end of the transaction. Context is very helpful in many use cases.

Letโ€™s see how to do things with context

func (s *service) CancelSubscription(ctx context.Context, id uuid.UUID) (*model.Subscription, err error) {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()

    tx, err := s.db.BeginTxx(ctx, nil)
    if err != nil {
        return nil, err
    }

    defer func() {
        if err != nil {
            tx.Rollback()
        }
    }()

    subscription, err := s.fetcher.GetSubscription(ctx, tx, id)
    if err != nil {
        return nil, err
    }

    if subscription.CancelledAt != nil {
        return subscription, nil
    }

    subscription.CancelledAt = time.Now()

    err = s.updater.UpdateSubscription(ctx, tx, subscription)
    if err != nil {
        return nil, err
    }

    err = tx.Commit()
    if err != nil {
        return nil, err
    }

    return subscription, nil
}

In the above code, we create a new cancel context from the parent context. If the DB transaction is not released either using Rollback or Commit. A defer cancel() happens at the end when the call returns.

The cancel call will notify the transaction that the operation is complete. The go runtime will close the transaction.

The defer cancel() call is important. If you miss that then the problem will still persist

At the go source code level transaction begin call starts a background go routine. The go routine monitors context Done signal. Let's see how it happens using context as shown here

func (db *DB) beginDC(ctx context.Context, dc *driverConn, release func(error), opts *TxOptions) (tx *Tx, err error) {
     var txi driver.Tx
     keepConnOnRollback := false
     withLock(dc, func() {
      _, hasSessionResetter := dc.ci.(driver.SessionResetter)
      _, hasConnectionValidator := dc.ci.(driver.Validator)
      keepConnOnRollback = hasSessionResetter && hasConnectionValidator
      txi, err = ctxDriverBegin(ctx, opts, dc.ci)
     })
     if err != nil {
      release(err)
      return nil, err
     }

     ctx, cancel := context.WithCancel(ctx)
     tx = &Tx{
      db:                 db,
      dc:                 dc,
      releaseConn:        release,
      txi:                txi,
      cancel:             cancel,
      keepConnOnRollback: keepConnOnRollback,
      ctx:                ctx,
     }
     go tx.awaitDone() // GO WAITS FOR THE CONTEXT TO DONE
     return tx, nil
}

And inside the tx.awaitDone

func (tx *Tx) awaitDone() {
 // Wait for either the transaction to be committed or rolled
 // back, or for the associated context to be closed.
 <-tx.ctx.Done()

 // Discard and close the connection used to ensure the
 // transaction is closed and the resources are released.  This
 // rollback does nothing if the transaction has already been
 // committed or rolled back.
 // Do not discard the connection if the connection knows
 // how to reset the session.
 discardConnection := !tx.keepConnOnRollback
 tx.rollback(discardConnection)  // AT THE END THE TX WILL BE ROLLBACKED
}

So transactions with a cancel context make more sense. Do not use the context coming from the HTTP request. HTTP context works like a client-side timeout. The service layer has no control over cancelling the HTTP context.

A timeout context could also work but that will not be that reliable. Finding the right timeout value is a difficult task in itself.

Server-side timeout

You set an optimal timeout for the request to complete and if the request is not completed within that time, you cancel the operation and return an error.

server := &http.Server{
    Addr:         ":8080",
    Handler:      router,
    WriteTimeout: 2 * time.Second,
}

In the above code, we are setting a server-side timeout of 2 seconds. If the entire request processing is not completed within 2 seconds, we cancel the operation and return an error to the client. With this approach, you donโ€™t need to create a timeout context for each request. But, you would still need to fine-tune the timeout value based on the traces and load testing.

Conclusion

I hope you found this post useful. For customer-facing services, having a tiny bug could cause a big monetary impact. Securing code with enough safety nets is something we should do. And, I hope through this post I made a lot of people aware of what worse could happen if the transactions are leaking and how easy it is to fix them. ๐Ÿ˜Š

To conclude my take I recommend using rollback of the transaction in defer always. Finding a reasonable timeout for context and server-side timeout is difficult. So, it doesn't matter if there is an error or not defer with rollback.

Cancel context is the best approach for most application types. And would serve most of the cases. The context in general has a lot of use cases and I would recommend you read more about it.

Also, for an extra layer of safety, I would recommend keeping a Server side timeout for HTTP services.

You would be also thinking about the client-side timeout. Those are not reliable enough. For example, a client could keep the timeout for a long duration and with a high load, the server could keep the connection open for a long time. This could again lead to the same problem we are trying to solve.

References:

I hope my team does a similar kind of dance one day ๐Ÿ˜…

Edits:

  • Fixed the context code with using cancel context instead of timeout context

  • Added the source code of golang database layer on how the cancellation is managed

Did you find this article valuable?

Support Neenad Ingole by becoming a sponsor. Any amount is appreciated!