- when: 2020
  what: edits healthcheck as a modular stack monorepo before it was cool that does visibility+remediation
  how: []
  why:
    - pride-in-craft
- when: 2020
  what: BoQ hosting 2020..2023
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2020
  what: GOBS from p2p to stateless (via redis, killing leveldb)
  how: []
  why:
    - scalability
- when: 2020
  what: nexpose-remediations and blackduck sme
  how: []
  why:
    - scalability
- when: 2020
  what: scale test rems vs re
  how: []
  why:
    - failfast
- when: 2020
  what: mongo sme designed and released rems-mongo
  how: []
  why:
    - scalability
- when: 2020
  what: re, terminator to nomad
  how: []
  why:
    - scalability
- when: 2020
  what: design GOBS off of Couchbase
  how: []
  why:
    - scalability
- when: 2020
  what: REMS migration implementation
  how: []
  why:
    - pride-in-craft
- when: 2020
  what: DRS for REMS
  how: []
  why:
    - scalability
- when: 2020
  what: isolation for REMS
  how: []
  why:
    - scalability
- when: 2020
  what: FSDef,FSIndex to GOBS from Couchbase
  how: []
  why:
    - scalability
- when: 2020
  what: GOBS on Mongo with Xongo w/ Ryan intern
  how: []
  why:
    - pride-in-craft
- when: 2020
  what: Mongo on TLS SME/pilot
  how: []
  why:
    - scalability
- when: 2020
  what: Mongosback V2 for cron > rundeck
  how: []
  why:
    - pride-in-craft
- when: 2020
  what: REMS+DSCat Mongo SME
  how: []
  why:
    - pride-in-craft
- when: 2021
  what: systems review planning
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2021
  what: mentored new hire until he left the team for his starter project (S3)
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2021
  what: mentored sr engineer on bash, rundeck, in-house metrics and alerting, ssh...
  how: []
  why:
    - pride-in-craft
- when: 2021
  what: on-call training with chaos testing, hands-on log perusing
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2021
  what: s2se; scripted Galera with safety for multi-team
  how: []
  why:
    - scalability
    - pride-in-craft
- when: 2021
  what: dr backup check to monitor s3 compliance w/ 19 teams onboarded and eventually handed to dbteam
  how: []
  why:
    - scalability
- when: 2021
  what: Mongosback V2.1 autorelease after bake time, indexes
  how: []
  why:
    - pride-in-craft
- when: 2021
  what: REMS on SSDs analysis, budget proposal, approval, deploy, mock traffic
  how: []
  why:
    - scalability
- when: 2021
  what: Took REMS migration implementation back from handoff and reduced ETA from inf to 3w at max speed w/ visibility and parallelism
  how: []
  why:
    - pride-in-craft
- when: 2021
  what: found Go likes lots of small > few big RAM nodes
  how: []
  why:
    - pride-in-craft
- when: 2021
  what: |
    REMS quality of life
    * idempotency test
    * brand/user/issuer/byte limiting
    * squishing
    * lazy JSON parsing
    * resumable jobs
    * heartbeating jobs
  how: []
  why:
    - scalability
- when: 2021
  what: DSCat mongo query analysis and optimization
  how: []
  why:
    - pride-in-craft
- when: 2021
  what: cross-team Mongo incident remediation, support, guidance, SME
  how: []
  why:
    - pride-in-craft
- when: 2021
  what: couchsback to v2 as rclone
  how: []
  why:
    - scalability
- when: 2021
  what: pushed against proposed optimizations (rems cleaning of old edit fields on stale edits) and proved gains (.3% data saved on wire to TS) wouldn't pay off but complied when commanded
  how: []
  why:
    - scalability
- when: 2021
  what: Mongo multi-phase, multi-timezone interactive training with offline reading and video + online chaos testing + forum for anonymous feedback
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2021
  what: LegacyPublicAPI; they wanted to hand to us, so I executed what it'd take to shift ownership to us and documented the gotchas, and it was so bad that they reverted my complete code and revisited so this handoff wouldn't repeat with other teams
  how: []
  why:
    - pride-in-craft
- when: 2021
  what: healthcheck platform design approved but implementation priority rejected
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2022
  what: "champion of quality: suspected and saw symptoms of data incorrectness in REMS snapshots, insisted and provided more and more evidence despite willful holiday ignorance, eventually recognized as p1"
  how: []
  why:
    - pride-in-craft
- when: 2022
  what: became team lead
  how: []
  why:
    - scalability
- when: 2022
  what: "cost-benefit of geni on ddb: 10x the cost but reduces hardlyAnyOperationalBurdenQuantified"
  how: []
  why:
    - pride-in-craft
- when: 2022
  what: geni iops -> i insist and tune docker when team wants to ignore call to action
  how: []
  why:
    - pride-in-craft
- when: 2022
  what: response-files OOMs image resizing sidecar proposed + open source used
  how: []
  why:
    - scalability
    - pride-in-craft
- when: 2022
  what: generic aws migration scripts w/ mentee leveraged by tens of teams for s3, ddb, lambda, sns, sqs
  how: []
  why:
    - pride-in-craft
    - pride-for-others
    - scalability
- when: 2022
  what: cicd for team; onboarding + converting + creating continuous testing framework
  how: []
  why:
    - scalability
- when: 2022
  what: sahithig + itony mentorships; spead asks what's wrong with onboarding? what onboarding!
  how: []
  why:
    - pride-in-craft
    - pride-for-others
    - scalability
- when: 2022
  what: monorepo and parallelizing and caching packages = Jenkins from 10m to 2m
  how: []
  why:
    - pride-in-craft
    - scalability
- when: 2022
  what: autopatching for vuln remediation via scheduled builds for team w/ stable, quiet cicd
  how: []
  why:
    - scalability
- when: 2022
  what: |
    The REMS Data Loss Incident
    * mongo bug around leaked oplog lock = no disk persistence = total loss
    * upstream replayed jobs or shared their mongodb oplog so i could rebuild
    * forward-facing communication; instead of sorry, this is our root cause and future prevention
  how: []
  why:
    - pride-in-craft
- when: 2022
  what: miss; jfe needs faster virus scanning so I give 'em 10%. They want 10x because they retry all N files of their batch of M every time. Losers.
  how: []
  why:
    - scalability
- when: 2022
  what: every aws migration solo or nearly
  how: []
  why:
    - scalability
- when: 2022
  what: ajw initial release from 25% e2e test to 75%
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2022
  what: "became team lead :sparkles: and promoted to l5"
  how: []
  why:
    - role-model-dad
- when: 2022
  what: coda doc for planning splits owners from contributors w/ weights
  how: []
  why:
    - scalability
- when: 2022
  what: miss; davidc exported to orcs despite wishes to stay
  how: []
  why:
    - pride-in-craft
    - role-model-dad
- when: 2022
  what: swimlanes of rems; byte write/read rate limits, terminator specific pool
  how: []
  why:
    - scalability
    - pride-in-craft
- when: 2022
  what: |
    tested REMS no-ops when carter ignored me asking him to
    * "please write 1 test before i get back from vacation"
    * 0 forever
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2022
  what: generic nomad cost analysis grafana
  how: []
  why:
    - scalability
- when: 2023
  what: learning the performance feedback game; my perception is no one else's reality; make a rubric and define specific examples against it
  how: []
  why:
    - pride-in-craft
    - pride-for-others
    - role-model-dad
- when: 2023
  what: miss; horizons doc review wasn’t generalized/brief enough
  how: []
  why:
    - pride-in-craft
    - pride-for-others
    - customer-obsession
- when: 2023
  what: 2nd highest contributor to runbook blitz
  how: []
  why:
    - pride-in-craft
    - pride-for-others
- when: 2023
  what: when overloaded with ops, told team and offloaded + handed off threads
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: fairness for rems; if attempt to use N threads per box, defer to low prio queue
  how: []
  why:
    - pride-in-craft
    - scalability
- when: 2023
  what: |
    interactive cicd tutorial with dpie so they could execute side-by-side
    * not my fault they didnt
  how: []
  why:
    - pride-for-others
    - scalability
- when: 2023
  what: chaos test gameday to train new teammate oncall
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: |
    Couchbase-aggedon
    * i told 'em how to patch that shit motherfuckers are usual
    * i go to office because that team insists
    * i stop 'em from terminating early many times
    * a hash means we dont need to check, right?
    * ive got a script it's good enough i wrote it
    * ive got v2 of my script it's good enough i wrote it
    * this is a lotta pain, we should give up
    * taught 8 teammates how to sed/grep/script/bash
    * delegating threads; spiking accesslogs, spiking redis dumps, spiking couchbase backup/restore
    * discovered bugs that meant some threads were not viable
    * reduced problem to safest state for round 1, next safest for round 2, ...
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: BoQ final
  how: []
  why:
    - scalability
- when: 2023
  what: generic datastore customers could opt into us doing stateramp for them in GENI if they set jwts
  how: []
  why:
    - pride-in-craft
    - scalability
- when: 2023
  what: "REMS /partitions, /entrypoints for TS to parallel data load via index scan+hash live vs. keithc INSISTED on not live :eye_roll:"
  how: []
  why:
    - pride-in-craft
    - scalability
- when: 2023
  what: proposed AtlasQMP as bugfixed singleton, parallelized nomad, or lambda cost and speed and devcost and deliverytime
  how: []
  why:
    - pride-in-craft
    - scalability
- when: 2023
  what: response-files split from library-files so we can move to our own database without side effects
  how: []
  why:
    - scalability
    - pride-in-craft
- when: 2023
  what: |
    challenge; q2/q3 planning without knowing what medical leave teammate would do
    * 1. offboard what mattered that he was doing
    * 2. ask him repeatedly to offboard early and ask for updates how it's going
    * 3. guess things he really wants and assume he won't be here for foreseeable future even if he does return
    * coordinate with mathis on expectations upon his return
  how: []
  why:
    - pride-for-others
- when: 2023
  what: |
    REMS vs Translations
    * Translations gets 500 rows from AE without translations and translates those
    * prone to eventual consistency, blocks, brand deprioritizing, random bugs
    * REMS got a backlog so we told them first
    * and we bumped over and over for them to look
    * and it escalated to a snafu
    * root cause was squishing taking 90% of our cpu on this backlog of repeat work so sync squishing expedited to full release
    * REMS had a bug emitting to TS that missed edits, so Translations kept re-translating what REMS perceived to be no-ops
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: insist on oncall covers during stressful weeks, high effort windows, and okr release time
  how: []
  why:
    - role-model-dad
    - pride-in-craft
- when: 2023
  what: still SME on mongo backups and use-case-specific performance optimization
  how: []
  why:
    - pride-in-craft
    - scalability
- when: 2023
  what: more E2E tests on previously E2E test free repos because old mentee sahithig didn't feel comfortable joining the team fearing she'd break stuff
  how: []
  why:
    - pride-in-craft
    - scalability
    - pride-for-others
- when: 2023
  what: navigated a teammate getting exported to a team he didn't want to join AND later getting exported from that team and almost someone else getting exported too
  how: []
  why:
    - pride-in-craft
    - role-model-dad
- when: 2023
  what: |
    CSchmalzle
    * burnt out in q1 from too many projects in-flight
    * bi-weekly "are you closing stuff?"
    * daily "yaknow that 2 day thing? is it done? when will it be done? what do we need to do to ship it?" for 2 months
    * insisted he pick things to handoff and we got 2 from him
    * released a lotta stuff untested and broken and i doubled back to fix it
    * entire team adds quality as key result
    * terrible mr of copy-pasting + 2k lines of code
    * "learn2git"
    * multi-mr
    * refactors separate
    * wants to release his mass changes that include customer-facing system behavior changes because "we'll be more correct"
    * and lots of support to remediate kanban
    * and i say NO u fok
  how: []
  why:
    - pride-for-others
    - scalability
- when: 2023
  what: i get team approval on a design to stop not-deleting customer data, they say we should fix at system level, so I spike and prove system level, just for other team to nope.avi out (REMS MoM delete)
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: |
    XMD Contact Consolidation Consumer
    * "read our kafka topic like this and call your apis with it"
    * ezpz BUT i dont wanna own your business logic by proxy
    * "but our manager said you would, and something about throttling"
    * handle our 429s and you'll be k
    * "but our manager..."
    * ...3 weeks later...
    * listen here m8, u guys own your own 1 write per second shit, ya hear?
    * "we didnt even want that, y'all just took 2 years to get back to us >:("
    * o
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: test everything; atlas qmp canary from ignored to release blocking via librdkafka configs, sleep deletions
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: test everything; atlas data loader first e2e test
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: test everything; block dev if merging release isn't noop
  how: []
  why:
    - scalability
- when: 2023
  what: test everything; legacy responses first e2e test
  how: []
  why:
    - scalability
- when: 2023
  what: test everything; except don't; response-files keepalives cross-dc would expire and break response-files allocs permanently, so tests couldn't pass to release fix
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: test everything; our tests found FSCS outages and then FSCS team found they had no visibility
  how: []
  why:
    - pride-in-craft
- when: 2023
  what: test everything; janus cruddy e2e tests
  how: []
  why:
    - pride-for-others
- when: 2023
  what: high availability; 2 instances of singleton with distributed lock as cheap and good enough path forward
  how: []
  why:
    - pride-in-craft
    - scalability
- when: 2023
  what: designed rems mom deleting parallel with atlas, proposed drs team fixes it and impacted team volume, got deferred indefinitely and solved the same problem yet again but for rems mom
  how: []
  why:
    - pride-in-craft
    - scalability
- when: 2022
  what: feedback; told sean implying we should spend QED time on ops work is against the spirit of QED time but he is an authority figure and makes it uncomfortable not to
- when: 2022
  what: feedback; when i needed to ask michaelp for a remote exception, i had to share i was hesitant because he made possibly leaving engineers sound ostracized and ejected immediately