- when: 2020
  what: edits healthcheck as a modular stack monorepo before it was cool that does visibility+remediation
  how: []
  why: [pride-in-craft]
- when: 2020
  what: BoQ hosting 2020..2023
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2020
  what: GOBS from p2p to stateless (via redis, killing leveldb)
  how: []
  why: [scalability]
- when: 2020
  what: nexpose-remediations and blackduck SME
  how: []
  why: [scalability]
- when: 2020
  what: scale test rems vs re
  how: []
  why: [failfast]
- when: 2020
  what: Mongo SME; designed and released rems-mongo
  how: []
  why: [scalability]
- when: 2020
  what: re, terminator to nomad
  how: []
  why: [scalability]
- when: 2020
  what: design GOBS off of Couchbase
  how: []
  why: [scalability]
- when: 2020
  what: REMS migration implementation
  how: []
  why: [pride-in-craft]
- when: 2020
  what: DRS for REMS
  how: []
  why: [scalability]
- when: 2020
  what: isolation for REMS
  how: []
  why: [scalability]
- when: 2020
  what: FSDef, FSIndex to GOBS from Couchbase
  how: []
  why: [scalability]
- when: 2020
  what: GOBS on Mongo with Xongo w/ Ryan intern
  how: []
  why: [pride-in-craft]
- when: 2020
  what: Mongo on TLS SME/pilot
  how: []
  why: [scalability]
- when: 2020
  what: Mongosback V2 for cron > rundeck
  how: []
  why: [pride-in-craft]
- when: 2020
  what: REMS+DSCat Mongo SME
  how: []
  why: [pride-in-craft]
- when: 2021
  what: systems review planning
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2021
  what: mentored new hire until he left the team for his starter project (S3)
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2021
  what: mentored sr engineer on bash, rundeck, in-house metrics and alerting, ssh...
  how: []
  why: [pride-in-craft]
- when: 2021
  what: on-call training with chaos testing, hands-on log perusing
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2021
  what: s2se; scripted Galera with safety for multi-team
  how: []
  why: [scalability, pride-in-craft]
- when: 2021
  what: DR backup check to monitor S3 compliance w/ 19 teams onboarded, eventually handed to dbteam
  how: []
  why: [scalability]
- when: 2021
  what: Mongosback V2.1 autorelease after bake time, indexes
  how: []
  why: [pride-in-craft]
- when: 2021
  what: REMS on SSDs analysis, budget proposal, approval, deploy, mock traffic
  how: []
  why: [scalability]
- when: 2021
  what: took REMS migration implementation back from handoff and reduced ETA from inf to 3w at max speed w/ visibility and parallelism
  how: []
  why: [pride-in-craft]
- when: 2021
  what: found Go likes lots of small RAM nodes over a few big ones
  how: []
  why: [pride-in-craft]
- when: 2021
  what: |
    REMS quality of life
    * idempotency test
    * brand/user/issuer/byte limiting
    * squishing
    * lazy JSON parsing
    * resumable jobs
    * heartbeating jobs
  how: []
  why: [scalability]
- when: 2021
  what: DSCat Mongo query analysis and optimization
  how: []
  why: [pride-in-craft]
- when: 2021
  what: cross-team Mongo incident remediation, support, guidance, SME
  how: []
  why: [pride-in-craft]
- when: 2021
  what: couchsback to v2 as rclone
  how: []
  why: [scalability]
- when: 2021
  what: pushed against proposed optimizations (REMS cleaning of old edit fields on stale edits) and proved the gains (0.3% data saved on the wire to TS) wouldn't pay off, but complied when commanded
  how: []
  why: [scalability]
- when: 2021
  what: Mongo multi-phase, multi-timezone interactive training with offline reading and video + online chaos testing + forum for anonymous feedback
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2021
  what: LegacyPublicAPI; they wanted to hand it to us, so I executed what it'd take to shift ownership to us and documented the gotchas, and it was so bad that they reverted my completed code and revisited so this handoff wouldn't repeat with other teams
  how: []
  why: [pride-in-craft]
- when: 2021
  what: healthcheck platform design approved but implementation priority rejected
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2022
  what: "champion of quality: suspected and saw symptoms of data incorrectness in REMS snapshots, insisted and provided more and more evidence despite willful holiday ignorance, eventually recognized as P1"
  how: []
  why: [pride-in-craft]
- when: 2022
  what: became team lead
  how: []
  why: [scalability]
- when: 2022
  what: "cost-benefit of GENI on DDB: 10x the cost but reduces hardlyAnyOperationalBurdenQuantified"
  how: []
  why: [pride-in-craft]
- when: 2022
  what: GENI IOPS -> I insist and tune Docker when the team wants to ignore the call to action
  how: []
  why: [pride-in-craft]
- when: 2022
  what: response-files OOMs; image-resizing sidecar proposed + open source used
  how: []
  why: [scalability, pride-in-craft]
- when: 2022
  what: generic AWS migration scripts w/ mentee, leveraged by tens of teams for S3, DDB, Lambda, SNS, SQS
  how: []
  why: [pride-in-craft, pride-for-others, scalability]
- when: 2022
  what: cicd for team; onboarding + converting + creating continuous testing framework
  how: []
  why: [scalability]
- when: 2022
  what: sahithig + itony mentorships; spead asks what's wrong with onboarding? what onboarding!
  how: []
  why: [pride-in-craft, pride-for-others, scalability]
- when: 2022
  what: monorepo and parallelizing and caching packages = Jenkins from 10m to 2m
  how: []
  why: [pride-in-craft, scalability]
- when: 2022
  what: autopatching for vuln remediation via scheduled builds for the team w/ stable, quiet cicd
  how: []
  why: [scalability]
- when: 2022
  what: |
    The REMS Data Loss Incident
    * Mongo bug around a leaked oplog lock = no disk persistence = total loss
    * upstream replayed jobs or shared their MongoDB oplog so I could rebuild
    * forward-facing communication; instead of "sorry", "this is our root cause and future prevention"
  how: []
  why: [pride-in-craft]
- when: 2022
  what: miss; jfe needs faster virus scanning so I give 'em 10%. They want 10x because they retry all N files of their batch of M every time. Losers.
  how: []
  why: [scalability]
- when: 2022
  what: every AWS migration solo or nearly
  how: []
  why: [scalability]
- when: 2022
  what: ajw initial release from 25% to 75% e2e test coverage
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2022
  what: "became team lead :sparkles: and promoted to L5"
  how: []
  why: [role-model-dad]
- when: 2022
  what: Coda doc for planning splits owners from contributors w/ weights
  how: []
  why: [scalability]
- when: 2022
  what: miss; davidc exported to orcs despite wishes to stay
  how: []
  why: [pride-in-craft, role-model-dad]
- when: 2022
  what: swimlanes for REMS; byte write/read rate limits, terminator-specific pool
  how: []
  why: [scalability, pride-in-craft]
- when: 2022
  what: |
    tested REMS no-ops when carter ignored me asking him to
    * "please write 1 test before I get back from vacation"
    * 0 forever
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2022
  what: generic Nomad cost analysis in Grafana
  how: []
  why: [scalability]
- when: 2023
  what: learning the performance feedback game; my perception is no one else's reality; make a rubric and define specific examples against it
  how: []
  why: [pride-in-craft, pride-for-others, role-model-dad]
- when: 2023
  what: miss; horizons doc review wasn't generalized/brief enough
  how: []
  why: [pride-in-craft, pride-for-others, customer-obsession]
- when: 2023
  what: 2nd highest contributor to runbook blitz
  how: []
  why: [pride-in-craft, pride-for-others]
- when: 2023
  what: when overloaded with ops, told the team and offloaded + handed off threads
  how: []
  why: [pride-in-craft]
- when: 2023
  what: fairness for REMS; if attempting to use N threads per box, defer to the low-prio queue
  how: []
  why: [pride-in-craft, scalability]
- when: 2023
  what: |
    interactive cicd tutorial with dpie so they could execute side-by-side
    * not my fault they didn't
  how: []
  why: [pride-for-others, scalability]
- when: 2023
  what: chaos test gameday to train new teammate oncall
  how: []
  why: [pride-in-craft]
- when: 2023
  what: |
    Couchbase-ageddon
    * I told 'em how to patch that shit; motherfuckers, as usual
    * I go to the office because that team insists
    * I stop 'em from terminating early, many times
    * "a hash means we don't need to check, right?"
    * "I've got a script, it's good enough, I wrote it"
    * "I've got v2 of my script, it's good enough, I wrote it"
    * "this is a lotta pain, we should give up"
    * taught 8 teammates how to sed/grep/script/bash
    * delegating threads; spiking accesslogs, spiking redis dumps, spiking couchbase backup/restore
    * discovered bugs that meant some threads were not viable
    * reduced the problem to the safest state for round 1, next safest for round 2, ...
  how: []
  why: [pride-in-craft]
- when: 2023
  what: BoQ final
  how: []
  why: [scalability]
- when: 2023
  what: generic datastore customers could opt into us doing StateRAMP for them in GENI if they set JWTs
  how: []
  why: [pride-in-craft, scalability]
- when: 2023
  what: "REMS /partitions, /entrypoints for TS to parallelize data load via index scan+hash, live; keithc INSISTED on not-live :eye_roll:"
  how: []
  why: [pride-in-craft, scalability]
- when: 2023
  what: proposed AtlasQMP as a bugfixed singleton, parallelized Nomad, or Lambda; compared cost, speed, devcost, and delivery time
  how: []
  why: [pride-in-craft, scalability]
- when: 2023
  what: response-files split from library-files so we can move to our own database without side effects
  how: []
  why: [scalability, pride-in-craft]
- when: 2023
  what: |
    challenge; q2/q3 planning without knowing what medical-leave teammate would do
    * 1. offboard what mattered that he was doing
    * 2. ask him repeatedly to offboard early and ask for updates on how it's going
    * 3. guess things he really wants and assume he won't be here for the foreseeable future even if he does return
    * coordinate with mathis on expectations upon his return
  how: []
  why: [pride-for-others]
- when: 2023
  what: |
    REMS vs Translations
    * Translations gets 500 rows from AE without translations and translates those
    * prone to eventual consistency, blocks, brand deprioritizing, random bugs
    * REMS got a backlog, so we told them first
    * and we bumped over and over for them to look
    * and it escalated to a snafu
    * root cause was squishing taking 90% of our CPU on this backlog of repeat work, so sync squishing was expedited to full release
    * REMS emitted a bug to TS that missed edits, so Translations kept re-translating what REMS perceived to be no-ops
  how: []
  why: [pride-in-craft]
- when: 2023
  what: insist on oncall covers during stressful weeks, high-effort windows, and OKR release time
  how: []
  why: [role-model-dad, pride-in-craft]
- when: 2023
  what: still SME on Mongo backups and use-case-specific performance optimization
  how: []
  why: [pride-in-craft, scalability]
- when: 2023
  what: more E2E tests on previously E2E-test-free repos because old mentee sahithig didn't feel comfortable joining the team, fearing she'd break stuff
  how: []
  why: [pride-in-craft, scalability, pride-for-others]
- when: 2023
  what: navigated a teammate getting exported to a team he didn't want to join AND later getting exported from that team, and almost someone else getting exported too
  how: []
  why: [pride-in-craft, role-model-dad]
- when: 2023
  what: |
    CSchmalzle
    * burnt out in q1 from too many projects in-flight
    * bi-weekly "are you closing stuff?"
    * daily "yaknow that 2-day thing? is it done? when will it be done? what do we need to do to ship it?" for 2 months
    * insisted he pick things to hand off and we got 2 from him
    * released a lotta stuff untested and broken, and I doubled back to fix it
    * entire team adds quality as a key result
    * terrible MR of copy-pasting + 2k lines of code
    * "learn2git"
    * multi-MR
    * refactors separate
    * wants to release his mass changes that include customer-facing system behavior changes because "we'll be more correct"
    * and lots of support to remediate kanban
    * and I say NO u fok
  how: []
  why: [pride-for-others, scalability]
- when: 2023
  what: I get team approval on a design to stop not-deleting customer data, they say we should fix it at the system level, so I spike and prove system level, just for the other team to nope.avi out (REMS MoM delete)
  how: []
  why: [pride-in-craft]
- when: 2023
  what: |
    XMD Contact Consolidation Consumer
    * "read our kafka topic like this and call your apis with it"
    * ezpz, BUT I don't wanna own your business logic by proxy
    * "but our manager said you would, and something about throttling"
    * handle our 429s and you'll be k
    * "but our manager..."
    * ...3 weeks later...
    * listen here m8, u guys own your own 1-write-per-second shit, ya hear?
    * "we didn't even want that, y'all just took 2 years to get back to us >:("
    * o
  how: []
  why: [pride-in-craft]
- when: 2023
  what: test everything; Atlas QMP canary from ignored to release-blocking via librdkafka configs, sleep deletions
  how: []
  why: [pride-in-craft]
- when: 2023
  what: test everything; Atlas data loader first e2e test
  how: []
  why: [pride-in-craft]
- when: 2023
  what: test everything; block dev if merging release isn't a noop
  how: []
  why: [scalability]
- when: 2023
  what: test everything; legacy responses first e2e test
  how: []
  why: [scalability]
- when: 2023
  what: test everything; except don't; response-files keepalives cross-DC would expire and break response-files allocs permanently, so tests couldn't pass to release the fix
  how: []
  why: [pride-in-craft]
- when: 2023
  what: test everything; our tests found FSCS outages, and then the FSCS team found they had no visibility
  how: []
  why: [pride-in-craft]
- when: 2023
  what: test everything; janus cruddy e2e tests
  how: []
  why: [pride-for-others]
- when: 2023
  what: high availability; 2 instances of a singleton with a distributed lock as the cheap and good-enough path forward
  how: []
  why: [pride-in-craft, scalability]
- when: 2023
  what: designed REMS MoM deleting in parallel with Atlas, proposed the DRS team fixes it and impacted team volume, got deferred indefinitely, and solved the same problem yet again but for REMS MoM
  how: []
  why: [pride-in-craft, scalability]
- when: 2022
  what: feedback; told sean that implying we should spend QED time on ops work is against the spirit of QED time, but he is an authority figure and makes it uncomfortable not to
  how: []
  why: []
- when: 2022
  what: feedback; when I needed to ask michaelp for a remote exception, I had to share I was hesitant because he made possibly-leaving engineers sound ostracized and ejected immediately
  how: []
  why: []