“Change is not always growth, but growth is often rooted in change.”
— Drizzt Do’Urden
On February 15, 2021, the Tellor system experienced a major error while deploying the upgrade to v2.6.1.
To let Tellor upgrade without forking the entire system, we use a proxy contract that holds the functionality, and we can switch this contract to a new one through a token-weighted vote. The upgrade to v2.6.1 pointed the system at an invalid proxy address that had no functionality. The error was not caught in the proposal process, and the proposal was voted through. Once the upgrade was deployed, it essentially froze our system, the oracle, and our native token, TRB.
We acted quickly and have since relaunched our token and migrated holders, but the experience has changed us permanently. It was a serious error on our part and now we begin the work of rebuilding trust in our system.
It’s not often that a team in this space gets to reflect on a lesson learned, and when one does, you’re likely to see thinly veiled apologies, or teachings undercut by an unchanged attitude towards security. We’re not going to let this become one of those stories. Processes are changing and our team is refocusing on making sure this doesn’t happen again. To explain how it happened, though, you first need to understand the Tellor system.
Tellor uses an upgradeable contract; this is a way to change what code runs on a given address. The way you do it is rather simple: you have one address (our main Tellor address) that holds the storage (think of it like a database), and another address that holds the functions (the way to interact with the database). So, if you ask the main contract address for the balance of your address, it takes the instruction (balanceOf) and then goes to the function address to see what to do. In this case, that means telling the caller the balance of a given address.
The upgradeability comes from our ability to change what that function contract address is. So in this example, we could change it to a different address, where balanceOf returns just half your balance, your balance plus 1,000,000, or something more nefarious like 0.
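To make the split concrete, here is a toy Python model of the idea. It is only a sketch, not the Tellor contracts: the class names (Storage, LogicV1, LogicHalf, EmptyLogic) and the call helper are made up for illustration, and a real proxy does this lookup on-chain rather than with Python objects.

```python
class Storage:
    """Stands in for the main address: it holds the state and a pointer to the logic."""

    def __init__(self, logic):
        self.balances = {}
        self.logic = logic  # the "function contract" currently in use

    def call(self, name, *args):
        # Look the function up at the logic address and run it against our own state.
        fn = getattr(self.logic, name, None)
        if fn is None:
            raise RuntimeError(f"no function named {name!r} at the logic address")
        return fn(self, *args)


class LogicV1:
    @staticmethod
    def balanceOf(store, owner):
        return store.balances.get(owner, 0)


class LogicHalf:
    """An 'upgrade' that reports only half of every balance."""

    @staticmethod
    def balanceOf(store, owner):
        return store.balances.get(owner, 0) // 2


class EmptyLogic:
    """A logic address with no functions at all."""


main = Storage(LogicV1())
main.balances["alice"] = 100
before = main.call("balanceOf", "alice")

main.logic = LogicHalf()  # the upgrade: same storage, different functions
after = main.call("balanceOf", "alice")
```

Alice's stored balance never changes, yet the answer she gets back goes from 100 to 50 the moment the logic pointer moves. Point the same storage at EmptyLogic and every call fails, even though the balances are all still sitting there.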
The danger here should be palpable. A bad actor could add a function that lets them mint tokens or steal any ETH you send to the contract. It’s basically full control of any funds or data in the system and whoever owns this functionality (usually called an “admin key”) owns the system.
For this reason, we didn’t want control. When we deployed in August 2019, we initially used an admin key to tweak some variables in the system and get it to a stable deployment; this is something we recommend, and that others recommended to us, since you never get things right on the first try. But we knew the goal was to be a DAO and have the community run it, so just about a year later, after four changes to the code (two minor bug fixes, lowering inflation, and upgrading to better features), we threw away the admin key (https://medium.com/tellor/goodbye-admin-key-hello-decentralized-governance-6e4f7ca969a5). It was a great moment and we’re proud of it.
Now, decentralized governance is no joke. Our setup is a common one: token-weighted votes are used to propose and choose new addresses to upgrade to. It requires a week-long vote, and we need at least 10% of the total supply to vote. This is actually the hard part, since the team only owns about half of that. We used the full DAO system twice to upgrade successfully to v2.5 (helping relieve gas costs on the miners) and v2.6 (a fix in the system that helped make mining more competitive).
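As a rough sketch of what such a tally boils down to, consider the Python below. The week-long window and the 10% quorum come from the description above; the Proposal type, the tally function, and the simple-majority rule are assumptions for illustration, not the actual Tellor voting code.

```python
from dataclasses import dataclass

QUORUM_BPS = 1_000                      # 10% of total supply, in basis points
VOTE_PERIOD_SECONDS = 7 * 24 * 60 * 60  # one week


@dataclass
class Proposal:
    new_address: str     # the address the system would upgrade to
    opened_at: int       # unix timestamp when voting opened
    votes_for: int = 0   # token-weighted totals
    votes_against: int = 0


def tally(p, total_supply, now):
    """Return the proposal status (simple majority is an assumption here)."""
    if now < p.opened_at + VOTE_PERIOD_SECONDS:
        return "open"
    turnout = p.votes_for + p.votes_against
    # Integer math only: turnout / total_supply >= 10%
    if turnout * 10_000 < total_supply * QUORUM_BPS:
        return "failed: quorum not met"
    return "passed" if p.votes_for > p.votes_against else "rejected"


proposal = Proposal("0xabc", opened_at=0, votes_for=60, votes_against=50)
status = tally(proposal, total_supply=1_000, now=VOTE_PERIOD_SECONDS)
```

Notice what a tally like this never looks at: whatever is actually deployed at new_address. The vote only counts tokens.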
Shortly after the v2.6 vote, however, we noticed that we had made a mistake. There was a rounding issue in the contracts, resulting in mining rewards basically being a gas race for a short period of time (the remainder of the minute rounded down). Mining pre v2.6 was basically a gas race anyway, so we didn’t really fix the problem that we wanted to with the upgrade. It was a simple fix (literally taking out one line that rounded down to the minute), and this is where we got sloppy.
We made the fix and proposed it to the community. We had literally just done the whole process for v2.6 successfully, so we proceeded to quickly do the same thing. One founder deployed and verified the contracts on Etherscan using a Hardhat script, and then the rest of the team checked the address and that the verified code was what we expected it to be. Things looked good and the fix was there.
We proposed it to the community and rallied our amazing troops to vote yet again, and they did so without complaint. (As an aside, we have the best holders in the game, and hitting multiple 10%+ quorums within a few weeks with ease is a strong statement to that effect.)
The vote passed, and then all we had to do was wait one day and run the updateTellor function (as a protection in the Tellor system someone could dispute the results of the vote and kick off another voting round). Unfortunately, there were no disputes and the team hopped on our daily dev call to run the function and go tell the miners the good news. I ran the function and we waited. We waited and waited and yet no new activity was taking place on the contract address. Thinking it could just be an off day (gas was pretty high), I tried to transfer a token to Brenda (our CEO).
It failed right away.
Our stomachs dropped.
We tried again and again, digging through the internal transactions and finding nothing. Very quickly we knew what happened (the proxy address was wrong).
We knew the potential dangers of an upgradeable contract and short of a malicious upgrade, it was our worst fear.
Looking deeper, we found that the address we voted to upgrade to showed as verified on Etherscan, but the code deployed there wasn’t the code we had verified. It was just the storage contract. We found out the hard way that, for verification to pass, the deployed bytecode can be less than, but not more than, the verified code. There were no functions.
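One cheap manual check that would have flagged this is to look at the runtime bytecode at the proposed address and confirm it mentions the functions you expect. The Python sketch below is a heuristic, not a tool we ran: the helper name missing_selectors and the sample bytecode strings are made up, and in practice you would get the bytes from a node (for example via the eth_getCode JSON-RPC call) rather than hard-coding them. The two selectors are the standard ERC-20 ones.

```python
# Function selectors: the first 4 bytes of the keccak256 hash of the signature.
EXPECTED_SELECTORS = {
    "balanceOf(address)": "70a08231",
    "transfer(address,uint256)": "a9059cbb",
}


def missing_selectors(runtime_bytecode_hex, expected=EXPECTED_SELECTORS):
    """List the expected functions whose selector never appears in the bytecode."""
    code_hex = runtime_bytecode_hex.lower()
    if code_hex.startswith("0x"):
        code_hex = code_hex[2:]
    code = bytes.fromhex(code_hex)
    # A selector showing up is not proof the function works, but a selector
    # that is absent is a strong sign the function is not there at all.
    return [sig for sig, sel in expected.items() if bytes.fromhex(sel) not in code]


# Made-up samples: one with both selectors pushed (PUSH4 = 0x63), one without.
with_functions = "0x" + "6370a0823114" + "63a9059cbb14"
storage_only = "0x6080604052"

ok = missing_selectors(with_functions)
broken = missing_selectors(storage_only)
```

Here ok comes back empty while broken lists both balanceOf and transfer. A check like this takes a minute and does not depend on what a block explorer says about the source.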
We quickly knew we’d need to redeploy the system, get people their tokens back, and get the system up and running. We contacted users first, then posted on our Telegram while we told exchanges that the system was essentially frozen (the state was still there, just no functions to run on it). Then we got to work on building.
A week later, we redeployed the fixed system and are now back and running full steam, but it was a painful and costly experience. We sincerely regret how we handled the upgrade and hopefully we can both articulate and prove through implementation the lessons we’ve taken from the experience and the changes going forward.
Lessons / Reflections
One of my first jobs out of college was working on the employment numbers at the US Bureau of Labor Statistics. Every month, my team would do the seasonal revisions and benchmarking for non-farm payrolls, and I distinctly remember hating the days before we’d push out numbers.
Not only would we freeze all data coming in and run all calculations using code that hadn’t been changed in a decade, we’d also print out pages and pages of Excel spreadsheets and reports to fact check them. You’d have two people sit with red pens and say numbers back and forth to make sure the results matched what we were about to publish. 25-year-old me hated these manual checks.
But I get it now.
It’s hard to have a set of procedures when almost everything in your tech stack was developed in the past two years, but we at Tellor have been doing this long enough that we should be better than this.
An expanded testing suite may have caught the v2.6 rounding error, but the v2.6.1 deployment error wouldn’t have been caught by local testing. There’s so much that needs to get done from a checking standpoint that we have to put checklists and manual checks in place. Automation is great, and the tooling sets and security tools are getting better each day, but there’s no substitute for processes and manual checks of your results. In the last few weeks we have already implemented additional processes to increase the security of the fork and migration pieces, and we will continue to expand and maintain these processes through development.
There’s always a balance between pushing out well tested code and pushing it out in a timely fashion to see if anyone’s going to even use the thing. But the truth is, you have to take time for even the quick fixes. Making sure you give due diligence to the no-brainer upgrades or code changes is where we messed up and where we need to focus.
Going forward, we’ll be pushing out less frequent updates to the code and adding in the processes that will force us to do certain things for certain lengths of time.
Do realistic testing
Speaking of testing for longer timeframes, the quality of our tests needs to improve. Deploying on a testnet is obviously the low bar for a project, but proving that your deployment scripts work on Rinkeby isn’t really that helpful. Trying to mimic real world scenarios is tough without monetary incentives, but there are ways you can attempt it. For the Tellor system, mocking competitive FPGA-based PoW mining is nearly impossible on a testnet, but we failed in a lot of ways to play out the game theory of the competition. It led to a lot of gas races and to no one submitting the 5th slot (which cost more gas), and it was ultimately what caused us to need our v2.5 and v2.6 upgrades.
Another problem with testnets is that gas is free. Tellor was originally designed in 2018, when gas was basically free on mainnet too, but the biggest challenges and needs for upgrading the system over the last year haven’t been bug-fix related; they have come from accepting the new normal with regard to gas prices. Mocking out a system with $200 ETH and 1 gwei gas prices to put onto a network that has $10,000 ETH and 500 gwei prices is a recipe for disaster. The customers change, the use cases evolve, and ultimately, until you realize what world you’re in and can build a more specific product, flexibility is key.
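To put numbers on that gap, here is a small worked example in Python. The two price scenarios are the ones above; the 200,000 gas per transaction is a hypothetical figure picked for illustration, not a measured Tellor cost, and tx_cost_usd is just a helper name.

```python
from decimal import Decimal

GWEI_PER_ETH = Decimal(10) ** 9


def tx_cost_usd(gas_used, gas_price_gwei, eth_price_usd):
    """Dollar cost of one transaction: gas * price per gas (in ETH) * ETH price."""
    eth_spent = Decimal(gas_used) * Decimal(gas_price_gwei) / GWEI_PER_ETH
    return eth_spent * Decimal(eth_price_usd)


GAS_USED = 200_000  # hypothetical gas for one submission

designed_for = tx_cost_usd(GAS_USED, gas_price_gwei=1, eth_price_usd=200)
deployed_into = tx_cost_usd(GAS_USED, gas_price_gwei=500, eth_price_usd=10_000)
multiple = deployed_into / designed_for
```

Under those assumptions the same transaction goes from $0.04 to $1,000, a 25,000x difference. The multiple does not depend on the gas figure at all: it is just 500x on gas price times 50x on ETH price.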
Don’t trust Etherscan (or your tooling sets)
Etherscan, deployment tools, automated testing suites... these are great, and the space has come so far since my days of deploying contracts through the Mist browser (RIP), but they’re new and things happen.
In our case, we relied too much on Etherscan’s verified code matching the bytecode. We also trusted our deployment script, which unfortunately got overcomplicated and required too much manual cutting and pasting to work.
We have a constant problem in crypto about when to move up in terms of tooling. Our first contracts were written with Truffle and web3; now we’re on Hardhat and ethers.js. When do you upgrade your Solidity compiler version? What automated testing tools are worth the time to figure out? How does the new local EVM affect my tests? There are so many new tools that we constantly feel outdated.
But this is fine.
There’s a reason government agencies use code written in 2005 and print things out on paper... if it ain’t broke, don’t fix it.
There’s a balance to strike when updating our tooling sets, but we upgraded so many things so fast, that our team wasn’t familiar enough with the nuances of the system and it cost us. Going forward, we’re buttoning up and fully testing our automated suites and then we plan on leaving it until it won’t work, not simply when a new tool comes along.
Upgradeability is dangerous
I know there’s a whole debate around this. The arguments are:
a) upgradeability is bad, immutability is key; just fork and change addresses
b) upgradeability is good, forks are more of a hassle but often even less transparent, and DAOs are a thing for a reason
Without getting too much into the debate, Tellor has upgradeability. We know even now that the space is changing and we want to make our system better. If a project that uses Tellor wants to be immutable and non-upgradeable, it has to be able to rely on a system with a constant address. The trust model is of course now placed in the governance of Tellor, but this is the community aspect we want to embrace, not run from.
That said, we’ve all been told of the dangers of upgrading and as a team we got to experience it first hand. Even though our community utilized a decentralized vote for a week to push the bug live, just the fact that there was an upgrade procedure makes the whole thing more precarious.
Going forward, we still have an upgradeable contract, but we’re focusing on building a more engaged community that understands the risks and takes part in every step of the process.
The building block nature of defi is scary
The freezing of Tellor didn’t just break Tellor; it broke quite a lot of other contracts, mainly other defi contracts like Uniswap and Balancer. Since the original Tellor token can’t be transferred, all of the TRB and ETH in those pools is locked for good. This means that the guaranteed APY of your pool is now zero. Luckily for those holders, the Tellor team is compensating them, but I don’t think many LPs are aware of how dependent they are on the projects whose tokens they provide liquidity for. A more malicious attack could lead to much more disastrous consequences.
When we talk to projects looking to build on Tellor, we try and get them to use best practices. Assume Tellor can fail. Assume ETH can stall and miners can attack your system. Every piece you add in multiplies the risk. We’re building a core piece of infrastructure for the space and we need to be sure that people use us (and other pieces) more carefully than ever.
Despite being decentralized, trust is necessary
The few large projects of BTC and ETH aside, there are few projects, if any, that have the numbers to actually pull off true decentralization. When one team controls either the majority of the tokens, the lion’s share of the dev activity, or the bulk of liquidity in the protocol, you don’t have an unstoppable system.
Tellor is trying hard to get there, but this last bug made us realize how much power we ourselves have over the system. Our ability to upgrade the contracts with minimal checks, call the exchanges to migrate tokens, or even add back in an admin key without a complaint... this all points to the community’s trust in the team (which is great), but also to our failure to really build up our community’s ownership and knowledge of the codebase. I’m giving Tellor a hard time now, but few projects are much better. The vast majority of projects would cease to exist if the team or a few investors dumped their tokens and walked away, and just as many projects are susceptible to regulatory actions forcing teams to do this without the self-interest component.
What it comes down to is that we, as a space, need to slow down and take more ownership of the protocols we like. We also need to recognize that no matter how perfectly a system distributes tokens or utilizes proper on-chain / off-chain governance, ultimately getting people to care about your project and put in the time to make it a community is the hardest thing in crypto.
The space is pro at this now
On a bright note, throughout this whole security process, we definitely realized we’re not alone. It’s happened before and although there are no excuses, it’s nice that there’s actually a phenomenal crypto community out there who understand and offered us help in the thick of it.
Shoutouts to the Balancer, Liquity, xDef, dex.blue, and many other teams that were not only understanding when we told them about the freeze, but also helped us ensure a good transition and gave us feedback. It’s never fun to tell someone their tokens are frozen, but when the vast majority of our holders just reached out and asked how they could help, it means more than anything.
We’ve weathered some choppy waters over this past month. It taught us valuable lessons about security, the space, and creating something that will last. We’re refocusing on security and making things better here at Tellor. We sincerely apologize for the money lost or inconveniences caused and we’re going to work hard, for as long as you’ll let us, building an oracle that works for everyone.
Written by Nicholas Fett, CTO Tellor