# TODO https://www.linkedin.com/posts/brianjenney_ive-spoken-with-about-500-developers-in-activity-7119717343127117824-I016/?utm_source=share&utm_medium=member_desktop
# > did [x] using [y] which led to [z].
- when: 2019
  rendered: Decreased backend service's annual outages by 91% and reduced hardware costs by 40% by selecting, training owners on, and migrating without downtime to a different database.
  quantity:
    - 718 avg outage min per year down to 64 avg outage min per year
    - 356 outage minutes 2018 for marauders-map
    - 352 outage minutes 2018 for couchbase
    - 203 outage minutes 2018 for gobs
    - 47 outage minutes 2019 for marauders-map
    - 149 outage minutes 2019 for couchbase
    - 282 outage minutes 2019 for gobs
    - 47 outage minutes 2019 for geni
    - 184 outage minutes 2020 for gobs
    - 5 outage minutes 2020 for geni
    - 48 outage minutes 2021 for couchbase
    - 48 outage minutes 2021 for gobs
    - 31 outage minutes 2022 for gobs
    - 131 outage minutes 2023 for couchbase
    - 131 outage minutes 2023 for gobs
    - 12 outage minutes 2023 for geni
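  # Illustrative back-of-envelope check (not in the original notes) that the 91% figure
  # in the rendered bullet follows from the averages above, assuming 718 and 64 are the
  # before/after annual outage-minute averages:
  #   >>> (718 - 64) / 718 * 100  # Python
  #   91.08...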
- when: 2020
  rendered: Automated infrastructure patching without customer impact for 30 microservices and 25 database clusters by creating a modular and testable Bash script framework.
- when: 2020
  what:
    - edits healthcheck as a modular stack monorepo (before it was cool) that does visibility+remediation
  why:
    - pride-in-craft
  grade: c
  quantity:
    - routinely randomly sampled N ETL
    - scanned ~N rows per second vs 3 source systems
    - fixed ~N per day
  rendered: "Established a healthcheck system to scan X rows per second, resolve drift from upstream systems, and expose findings."
- when: 2020
  what:
    - BoQ hosting 2020..2023
  why:
    - pride-in-craft
    - pride-for-others
  grade: c
  quantity:
    - N text books
    - audience of N
    - 1 time a week
  rendered: "Hosted engineering book club, including book selection and weekly group reflection."
- when: 2020
  what:
    - GOBS from p2p to stateless (via redis, killing leveldb)
    - decoupled cache miss rate from releases
    - decreased thread scalability
    - decoupled disk size from service scale
    - same team, known technology, less sentinel operational burden
    - decreased error rate from N to N during releases
  why:
    - scalability
  grade: b
  quantity:
    - reduced touchpoints per call from N to N
  rendered: "Simplified microservice by extracting peer-to-peer layer, decoupling single nodes' performance, availability, and scale from global performance."
- when: 2020
  what:
    - nexpose-remediations and blackduck sme
  why:
    - scalability
  grade: c
  quantity:
    - decoupled knowledge of problem space from solution
    - implement optional interface with your tribal knowledge and done
    - made things like log4j feasible
    - scripted tribal knowledge
  rendered: null
- when: 2020
  what:
    - scale test rems vs re
  why:
    - failfast
  grade: d
  quantity: [""]
  rendered: null
- when: 2020
  what:
    - mongo sme designed and released rems-mongo
  why:
    - scalability
  grade: d
  quantity: [""]
  rendered: null
- when: 2020
  what:
    - re, terminator to nomad
  why:
    - scalability
  grade: d
  quantity: [""]
  rendered: null
- when: 2020
  what:
    - design GOBS off of Couchbase
  why:
    - scalability
  grade: a
  quantity:
    - single greatest operational pain, instability cause, financial burden
    - horizontal scale on known, stable technology to reduce total team space
    - simplified complex system to core requirements
    - GOBS on Mongo with Xongo w/ Ryan intern mentor
    - for all gobs,geni,maraudersmap,couchbase...
    - ...35 outages/722 minutes 18 // geni on cb
    - ...21 outages/290 minutes 19 // geni on cb
    - ...13 outages/615 minutes 20
    - ...8 outages/88 minutes 21
    - ...13 outages/31 minutes 22
    - ...14 outages/143 minutes 23
  rendered: "Designed a replacement for a system that yielded 28 outages annually, based entirely on technologies well understood by the team."
- when: 2020
  what:
    - REMS migration implementation
  why:
    - pride-in-craft
  grade: c
  quantity:
    - no incidents
  rendered: null
- when: 2020
  what:
    - DRS for REMS
  why:
    - scalability
    - isolation for REMS
  grade: c
  quantity: [""]
  rendered: null
- when: 2020
  what:
    - FSDef,FSIndex to GOBS from Couchbase
  why:
    - scalability
  grade: b
  quantity: [""]
  rendered: null
- when: 2020
  what:
    - Mongo on TLS SME/pilot
    - Redis on TLS+password, no-downtime migration via middleware
  why:
    - scalability
  grade: c
  quantity:
    - 0 incidents
    - 0 downtime
  rendered: null
- when: 2020
  what:
    - Mongosback V2 for cron > rundeck
    - Mongosback V2.1 autorelease after bake time, indexes
  why:
    - pride-in-craft
  grade: a
  quantity:
    - onboarding from copy-paste to 1 line change
    - served N teams over N years
  rendered: Created custom Python tooling to create, increment, restore, and check MongoDB database backups for standalone, replicated, and sharded deployments without customer impact; it has been the in-house standard for 21 teams for 4 years.
- when: 2020
  what:
    - REMS+DSCat Mongo SME
  why:
    - pride-in-craft
  grade: c
  quantity: [""]
  rendered: null
- when: 2021
  rendered: Developed and owned highly available and reliable data storage and operational tooling.
- when: 2021
  rendered: Mentored 2 intern, 2 new-grad, and 4 mid-level engineers on operational tools, best practices for maintainable software, and career development.
- when: 2021
  rendered: Genericized AWS asset management tooling ahead of company-wide mass migration initiative.
- when: 2021
  rendered: Championed disaster recovery by supporting training runs with documentation, tools, and live support across teams and enforced continuous compliance for 17 database clusters with monitoring and alerting.
- when: 2021
  rendered: Lent expertise owning MongoDB across teams by advising on configuration and data models and genericizing disaster recovery tooling for 21 teams.
- when: 2021
  what:
    - systems review planning
  why:
    - pride-in-craft
    - pride-for-others
  grade: c
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - mentored new hire until he left the team for his starter project (S3)
  why:
    - pride-in-craft
    - pride-for-others
  grade: c
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - mentored sr engineer on Bash, rundeck, in-house metrics and alerting, ssh...
  why:
    - pride-in-craft
  grade: d
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - on-call training with chaos testing, hands-on, log perusing
    - Mongo multi-phase, multi-timezone interactive training with offline reading and video + online chaos testing + forum for anonymous feedback
  why:
    - pride-in-craft
    - pride-for-others
  grade: c
  quantity:
    - N chaos tests // 2021-06-16, 2022-08-08, 2023-02-01, 2019-01-23, 2019-08-09, 2019-09-04
    - N systems // re, geni, gobs, terminator, is
    - N participants
  rendered: "Administered on-call training, including 6 chaos-test gamedays across 5 systems."
- when: 2021
  what:
    - s2se; scripted Galera with safety for multi-team // https://qualtrics.slack.com/archives/C016VAW2L04/p1613540701066600
  why:
    - scalability
    - pride-in-craft
  grade: b
  quantity:
    - 3 teams // https://qualtrics.slack.com/archives/C016VAW2L04/p1613540701066600 // vocalizedb=3*8, dis=3*8, me=5*3*8
    - spreadsheet with N steps // https://docs.google.com/spreadsheets/d/1JsxGdEWlGOFivJZMOBMmlkaoqRz3gTJB7NtjVGI3q-I/edit#gid=644182888 // https://qualtrics.slack.com/archives/C016VAW2L04/p1612567987267700 // https://gitlab-app.eng.qops.net/data-store/orchestration/runbooks/-/blob/8d30ca087c6f1a5518515b98e7948b48aac6d08a/Maintenance/Galera_to_TLS/upgrade.sh
    - "Scripted no-downtime database reconfiguration, which reduced 44 manual steps per node to 3 and was leveraged by 3 teams to update 168 instances."
    - our N clusters
  rendered: "Scripted no-downtime database reconfiguration, which was leveraged by 3 teams to update 168 instances."
- when: 2021
  what:
    - dr backup check to monitor s3 compliance w/ 19 teams onboarded and eventually handed to dbteam
  why:
    - scalability
  grade: b
  quantity:
    - 16 teams // s3-backup-check $ git grep '^[a-z].*:' $(git rev-list --all) | grep :config.*yaml: | sed 's/.*:config.//' | sed 's#/.*##' | sort -u | tr '\n' ' '; echo .examples action-planning analytics-engine datastore dbteam devplat-mongo dp-orchestration dp-orchestration.yaml:geni-mongo-$DC: dp-orchestration.yaml:rdb-$DC-i: dp-orchestration.yaml:rdb-$DC-s: dtool exh geni.yaml:backups: geni.yaml:geni: geni.yaml:rdb: jfe job-platform orch pxapp statwing ta tickets workflow-triggers
    - parallelized to P=3 and trivially configurable
    - caught N issues backups failing causes for us
  rendered: Founded the in-house standard system to continuously verify 16 teams' compliance with disaster recovery requirements.
- when: 2021
  what:
    - REMS on SSDs analysis, budget proposal, approval, deploy, mock traffic
  why:
    - scalability
  grade: b
  quantity:
    - N% more cost but N% fewer shards
    - https://docs.google.com/document/d/1Yh-HrA4xuaZD4CMFJwqHjir_B4qLIPbXEUmw9m5azy8/edit#heading=h.uw3h16ap7r5f
  rendered: "Forecasted the financial and complexity cost of launching on cheap hardware, which yielded $N savings over 2 years. // compute $ and include dev cost of setting up new shards, say .5 dev week so 1k=100k/52/2 per shard"
- when: 2021
  what:
    - Took REMS migration implementation back from handoff and reduced ETA from inf to 3w at max speed w/ visibility and parallelism
  why:
    - pride-in-craft
  grade: b
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - found Go likes lots of small > few big RAM nodes
  why:
    - pride-in-craft
  grade: d
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - REMS quality of life
    - idempotency test
    - brand/user/issuer/byte limiting
    - squishing
    - lazy JSON parsing
    - resumable jobs
    - heartbeating jobs
  why:
    - scalability
  grade: d
  quantity:
    - reduced RAM from >N to N with lazy json flame graphs
  rendered: null
- when: 2021
  what:
    - DSCat mongo query analysis and optimization
  why:
    - pride-in-craft
  grade: c
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - cross-team Mongo incident remediation, support, guidance, SME
  why:
    - pride-in-craft
  grade: b
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - couchsback to v2 as rclone
  why:
    - scalability
  grade: c
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - pushed against proposed optimizations (rems cleaning of old edit fields on stale edits) and proved gains (.3% data saved on wire to TS) wouldn't pay off but complied when commanded
  why:
    - scalability
  grade: b
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - LegacyPublicAPI; they wanted to hand it to us, so I executed what it'd take to shift ownership to us and documented the gotchas, and it was so bad that they reverted my complete code and revisited so this handoff wouldn't repeat with other teams
  why:
    - pride-in-craft
  grade: d
  quantity: [""]
  rendered: null
- when: 2021
  what:
    - healthcheck platform design approved but implementation priority rejected
  why:
    - pride-in-craft
    - pride-for-others
  grade: c
  quantity: [""]
  rendered: null
- when: 2022
  rendered: Mentored entry- and mid-level engineers on stability, clean code, and distributed systems.
- when: 2022
  rendered: Hosted engineering book and white paper clubs for continuous improvement and cross-team experience sharing for 2 years.
- when: 2022
  rendered: Recovered 98% of data lost in a critical incident by coordinating cross-team efforts and dissecting native database operation logs.
- when: 2022
  what:
    - champion of quality; suspected and saw symptoms of data incorrectness in REMS snapshots, insisted and provided more and more evidence despite willful holiday ignorance, eventually recognized as p1
  why:
    - pride-in-craft
  grade: c
  quantity:
    - discovered bug affecting N rows
  rendered: null
- when: 2022
  what:
    - became team lead
  why:
    - scalability
  grade: c
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - cost-benefit of geni on ddb; 10x the cost but reduces hardlyAnyOperationalBurdenQuantified
  why:
    - pride-in-craft
  grade: b
  quantity:
    - contrary to popular opinion, found N% more cost for N risk to move mongo to DDB as-is
  rendered: Challenged deprecation of MongoDB for DynamoDB, ultimately saving $N annually and sparing N teams operational burden.
- when: 2022
  what:
    - geni iops -> i insist and tune docker when team wants to ignore call to action
  why:
    - pride-in-craft
  grade: c
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - response-files OOMs image resizing sidecar proposed + open source used
  why:
    - scalability
    - pride-in-craft
  grade: c
  quantity:
    - N ooms/errors per week
    - N on call alerts for nothing
    - 0 migration pain
  rendered: null
- when: 2022
  what:
    - generic aws migration scripts w/ mentee leveraged by tens of teams for s3, ddb, lambda, sns, sqs
  what2: |
    https://huggingface.co/chat/conversation/6533fbd355db1e7b0bf62a3b
    Model: mistralai/Mistral-7B-Instruct-v0.1
    https://www.linkedin.com/posts/thedanielbotero_use-these-chatgpt-prompts-if-you-want-to-activity-7119669945298284546-q1DA/?utm_source=share&utm_medium=member_desktop
    created custom AWS asset management tooling that streamlined our infrastructure management process and saved time and resources
  why:
    - pride-in-craft
    - pride-for-others
    - scalability
  grade: a
  quantity:
    - 9 teams
    - |
      Production[674592268301]/api_access $ ls replication-user-* | grep -o 'replication.user.[a-z]*-' | uniq
      replication-user-datasets-
      replication-user-datastore-
      replication-user-des-
      replication-user-distributions-
      replication-user-dp-
      replication-user-dpie-
      replication-user-eax-
      replication-user-graphic-
      replication-user-xmd-
    - with 1 mentee
    - each team saved N man hours
    - 7 aws technologies
  rendered: "Spearheaded AWS asset replication tooling, sparing 9 teams from duplicating work relocating up to 7 AWS technologies each."
- when: 2022
  what:
    - cicd for team; onboarding + converting + creating continuous testing framework
  why:
    - scalability
  grade: b
  quantity:
    - N repos from wild west to E2E // atlas-data loader, -qmp, -qmp canary, -data loader canary, qdp counts, janus
    - blocked N releases // from spinnaker ui probably
    - ajw initial release from 25% e2e test to 75%
    - more E2E tests on previously E2E-test-free repos because old mentee sahithig didn't feel comfortable joining the team, fearing she'd break stuff
    - test everything; our tests found FSCS outages and then FSCS team found they had no visibility
    - test everything; atlas data loader first e2e test
    - test everything; block dev if merging release isn't noop
    - test everything; legacy responses first e2e test
    - test everything; except don't; response-files keepalives cross-dc would expire and break response-files allocs permanently, so tests couldn't pass to release fix
    - test everything; janus cruddy e2e tests
    - 10 data-store/* repos, 1 legacyresponses
    - |
      fffffffff finding carter throughput gonna hurt if even possible
      well, i guess i just need to find b1 failures because qamel fail and check commit diff for the one after
      only since cruddy invention
      spinnaker prunes so
    - |
      'from:@Spinnaker "Deploy to Beta" "Failed" in:#datastore-releases'
      11 unique dates in -30d..now
      10 unique dates in -60d..-30d
      10 unique dates in -90d..-60d
  rendered: "Created automated release test suites for 11 services, which catch 10 would-be customer-facing bugs per month on average."
- when: 2022
  what:
    - sahithig + itony mentorships; spead asks what's wrong with onboarding? what onboarding!
  why:
    - pride-in-craft
    - pride-for-others
    - scalability
  grade: b
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - monorepo and parallelizing and caching packages = Jenkins from 10m to 2m
  why:
    - pride-in-craft
    - scalability
  grade: d
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - autopatching for vuln remediation via scheduled builds for team w/ stable, quiet cicd
  why:
    - scalability
  grade: d
  quantity:
    - our team does N images per month vs DPORCS
    - N of last N weeks fedramp high compliant
  rendered: null
- when: 2022
  what:
    - The REMS Data Loss Incident
    - mongo bug around leaked oplog lock = no disk persistence = total loss
    - upstream replayed jobs or shared their mongodb oplog so i could rebuild
    - forward-facing communication; instead of sorry, this is our root cause and future prevention
  why:
    - pride-in-craft
  grade: b
  quantity:
    - N% jobs restored
    - N of N parties fully recovered
  rendered: null
- when: 2022
  what:
    - miss; jfe needs faster virus scanning so I give em 10%. They want 10x because they retry all N files of their batch of M every time. Losers.
  why:
    - scalability
  grade: d
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - every aws migration solo or nearly
  why:
    - scalability
  grade: c
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - became team lead and promoted to l5
  why:
    - role-model-dad
  grade: b
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - coda doc for planning splits owners from contributors w/ weights
  why:
    - scalability
  grade: b
  quantity:
    - hit N% of team commitments up from N
  rendered: null
  rendered2: "Revised bottom-up quarterly planning to consider variable development time costs, averaging in N% more team commitments hit. // this is pretty fuckin vague..."
- when: 2022
  what:
    - miss; davidc exported to orcs despite wishes to stay
  why:
    - pride-in-craft
    - role-model-dad
  grade: d
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - swimlanes of rems; byte write/read rate limits, terminator specific pool
  why:
    - scalability
    - pride-in-craft
  grade: d
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - tested REMS no-ops when carter ignored me asking him to
    - please write 1 test before i get back from vacation
    - 0 forever
  why:
    - pride-in-craft
    - pride-for-others
  grade: d
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - generic nomad cost analysis grafana
  why:
    - scalability
  grade: b
  quantity:
    - generic cost analysis found $N over-provisioned hardware transformation+rems
  rendered: "Built a cost analysis Grafana dashboard that revealed $Nk in over-allocated elastic compute hardware annually."
- when: 2023
  what:
    - learning the performance feedback game; my perception is no one else's reality; make a rubric and define specific examples against it
  why:
    - pride-in-craft
    - pride-for-others
    - role-model-dad
  grade: b
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - miss; horizons doc review wasn’t generalized/brief enough
  why:
    - pride-in-craft
    - pride-for-others
    - customer-obsession
  grade: d
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - 2nd highest contributor to runbook blitz
  why:
    - pride-in-craft
    - pride-for-others
  grade: d
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - when overloaded with ops, told team and offloaded + handed off threads
  why:
    - pride-in-craft
  grade: c
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - fairness for rems; if attempt to use N threads per box, defer to low prio queue
  why:
    - pride-in-craft
    - scalability
  grade: c
  quantity:
    - sandboxed about N minor incidents per month
  rendered: null
- when: 2023
  what:
    - interactive cicd tutorial with dpie so they could execute side-by-side
    - not my fault they didn't
  why:
    - pride-for-others
    - scalability
  grade: b
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - chaos test gameday to train new teammate oncall
  why:
    - pride-in-craft
  grade: c
  quantity:
    - N chaos tests
  rendered: null
- when: 2023
  what:
    - Couchbase-aggedon
    - i told em how to patch that shit, motherfuckers as usual
    - i go to office because that team insists
    - i stop em from terminating early many times
    - '* a hash means we dont need to check, right?'
    - '* ive got a script its good enough i wrote it'
    - '* ive got v2 of my script its good enough i wrote it'
    - '* this is a lotta pain, we should give up'
    - taught 8 teammates how to sed/grep/script/Bash
    - delegating threads; spiking accesslogs, spiking redis dumps, spiking couchbase backup/restore
    - discovered bugs that meant some threads were not viable
    - reduced problem to safest state for round 1, next safest for round 2, ...
  why:
    - pride-in-craft
  grade: a
  quantity:
    - N% data restored
    - 2 fault remediations blocked
    - N% data jeopardized
    - N% data identified free and delegated to N engineers
  rendered: "Enforced <a high quality bar> in a data loss incident, delegating the identification of the N% of data lost and 3 distinct restoration efforts, and developing automated validation to block 2 incorrect remediation attempts."
- when: 2023
  what:
    - BoQ final
  why:
    - scalability
  grade: d
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - generic datastore customers could opt into us doing stateramp for them in GENI if they set jwts
  why:
    - pride-in-craft
    - scalability
  grade: b
  quantity:
    - N teams had a zero or near zero onboarding lift
  rendered: null
- when: 2023
  what:
    - REMS /partitions, /entrypoints for TS to parallel data load via index scan+hash live vs. keithc INSISTED on not live :eye_roll:
  why:
    - pride-in-craft
    - scalability
  grade: b
  quantity:
    - N% dev work for N% gains
    - system more flexible to add/remove/update configuration vs baked index
    - |
      performance of native
        for each key in index:
          yield key from index
        so N index keys yielded from mongo*
      performance of live
        for each key in index_a:
          if partition(key) in range:
            yield key from index
        don't store a second index, so halve index use
      inverse linear scaling with partition count network bytes
        1 partition == native
        2 partitions == 2*native
        3 partitions == 3*native
      BUT internally, mongo was doing 2*, 3*, so now we are doing as much complexity but client side so adding network cost
      so what is cost of doing N on local vs remote?
      well, what is cost of 1 network hop vs cpu filtering? should be that ratio
      .4ms vs COST_OF_(HASHING_15_CHARS+GC+MOD)
      but all of this still would be done, just async and 99% unused
      .4ms vs .1ms
      4:1
      https://qualtrics.slack.com/archives/DGS4G1J87/p1678121083319669
  rendered: Devised a MongoDB live indexing strategy that supported both current and future use cases and avoided computing and filling a new, 99%-unused native database index.
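  # Illustrative sketch (not from the original notes) of the "native" vs. "live" scans
  # described in the quantity block above; index_a, the partition count, and the
  # hash-based partitioning are assumptions, not the actual REMS implementation.
  #   def native_scan(index):
  #       # native: a dedicated index yields every key
  #       for key in index:
  #           yield key
  #
  #   def live_scan(index_a, partitions, wanted):
  #       # live: reuse the existing index and filter client-side, keeping only keys
  #       # whose hash lands in the requested partition
  #       for key in index_a:
  #           if hash(key) % partitions == wanted:
  #               yield key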
- when: 2023
  what:
    - proposed AtlasQMP as bugfixed singleton, parallelized nomad, or lambda cost and speed and devcost and deliverytime
    - high availability; 2 instances of singleton with distributed lock as cheap and good enough path forward
  why:
    - pride-in-craft
    - scalability
  grade: b
  quantity:
    - spike into 3 alternatives
    - review by team
    - final option cost $N% of pre-spike expectations, N% dev work, N% risk
    - https://docs.google.com/document/d/10lpn6c8hHRs2dGAm37sP0wKR5vbzJ4zbtY9zBDopTMU/edit
    - was 40k items per second, now 175k
    - would require qmp proxy to scale about as much as we would drop nomad BUT we gain lambda cost assuming never replay
    - so $600 per mo worldwide to $1200 per mo worldwide
    - 7k per year isn't very much with cheating
    - 2 dev weeks to revamp and clean up
    - let's say 4 dev weeks to get qmp team to scale up proxy, the incidents that would've caused, and assume kinesis to rest would've been trivial
  rendered: Optimized a Go application to increase consumption rate from Kafka by 340%, costing half the engineering effort of the proposed rewrite.
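  # Illustrative back-of-envelope check (not in the original notes) that the rendered
  # bullet follows from the quantities above, assuming 40k -> 175k items/sec and
  # 2 vs. 4 dev weeks:
  #   >>> (175_000 / 40_000 - 1) * 100  # Python: throughput increase, ~340%
  #   337.5
  #   >>> (1200 - 600) * 12             # extra cost per year, dollars (~7k)
  #   7200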
- when: 2023
  what:
    - response-files split from library-files so we can move to our own database without side effects
  why:
    - scalability
    - pride-in-craft
  grade: c
  quantity:
    - N% of our requests were for a different team's operations
    - N% of our features were for a different team's features
    - first wait for them, then fork and pass
  rendered: null
- when: 2023
  what:
    - challenge; q2/q3 planning without knowing what medical leave teammate would do
    - 1. offboard what mattered that he was doing
    - 2. ask him repeatedly to offboard early and ask for updates on how it's going
    - 3. guess things he really wants and assume he won't be here for the foreseeable future even if he does return
    - coordinate with mathis on expectations upon his return
  why:
    - pride-for-others
  grade: d
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - REMS vs Translations
    - Translations gets 500 rows from AE without translations and translates those
    - '* prone to eventual consistency, blocks, brand deprioritizing, random bugs'
    - REMS got a backlog so we told them first
    - '* and we bumped over and over for them to look'
    - ' * and it escalated to a snafu'
    - '* root cause was squishing taking 90% of our cpu on this backlog of repeat work so sync squishing expedited to full release'
    - REMS emitted a bug to TS that missed edits, so Translations kept re-translating what REMS perceived to be no-ops
  why:
    - pride-in-craft
  grade: c
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - insist on oncall covers during stressful weeks, high effort windows, and okr release time
  why:
    - role-model-dad
    - pride-in-craft
  grade: b
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - still SME on mongo backups and use-case-specific performance optimization
  why:
    - pride-in-craft
    - scalability
  grade: c
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - more E2E tests on previously E2E-test-free repos because old mentee sahithig didn't feel comfortable joining the team, fearing she'd break stuff
  why:
    - pride-in-craft
    - scalability
    - pride-for-others
  grade: c
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - navigated a teammate getting exported to a team he didn't want to join AND later getting exported from that team and almost someone else getting exported too
  why:
    - pride-in-craft
    - role-model-dad
  grade: d
  quantity: [""]
  rendered: null
- when: 2023
  what:
    - CSchmalzle
    - '* burnt out in q1 from too many projects in-flight'
    - ' * bi-weekly are you closing stuff?'
    - ' * daily yaknow that 2 day thing? is it done? when will it be done? what do we need to do to ship it? for 2 months'
    - ' * insisted he pick things to hand off and we got 2 from him'
    - '* released a lotta stuff untested and broken and i doubled back to fix it'
    - ' * entire team adds quality as key result'
    - '* terrible mr of copy-pasting + 2k lines of code'
    - ' * learn2git'
    - ' * multi-mr'
    - ' * refactors separate'
    - '* wants to release his mass changes that include customer-facing system behavior changes because we''ll be more correct'
    - ' * and lots of support to remediate kanban'
    - ' * and i say NO u fok'
  why:
    - pride-for-others
    - scalability
  grade: c
  quantity:
    - weekly checkins for burnout causes
  rendered: null
- when: 2023
  what:
    - i get team approval on a design to stop not-deleting customer data, they say we should fix at system level, so I spike and prove system level, just for other team to nope.avi out (REMS MoM delete)
    - designed rems mom deleting parallel with atlas, proposed drs team fixes it and impacted team volume, got deferred indefinitely and solved the same problem yet again but for rems mom
  why:
    - pride-in-craft
  grade: b
  quantity:
    - demonstrated N system race conditions and touch points
  rendered: "Mapped N unhandled inter-system race conditions to their single point of failure."
- when: 2023
  what:
    - XMD Contact Consolidation Consumer
    - read our kafka topic like this and call your apis with it
    - ezpz BUT i don't wanna own your business logic by proxy
    - but our manager said you would, and something about throttling
    - handle our 429s and you'll be k
    - but our manager...
    - ...3 weeks later...
    - listen here m8, u guys own your own 1 write per second shit, ya hear?
    - we didn't even want that, y'all just took 2 years to get back to us >:(
    - o
  why:
    - pride-in-craft
  grade: c
  quantity:
    - sample size 1 week revealed new team was scared of N ops per second but N was 1
  rendered: null
- when: 2022
  what:
    - feedback; told sean that implying we should spend QED time on ops work is against the spirit of QED time, but he is an authority figure and makes it uncomfortable not to
  grade: b
  quantity: [""]
  rendered: null
- when: 2022
  what:
    - feedback; when i needed to ask michaelp for a remote exception, i had to share i was hesitant because he made possibly leaving engineers sound ostracized and ejected immediately
  grade: c
  quantity: [""]
  rendered: null